Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 5, 2012

Olympic medal winners: every one since 1896 as open data

Filed under: Data,Data Mining,Data Source — Patrick Durusau @ 5:21 am

Olympic medal winners: every one since 1896 as open data

The Guardian Datablog has posted Olympic medal winner data for download.

Admitting to some preference, I was pleased to see that OpenDocument Format was one of the download choices. 😉

It may just be my ignorance of Olympic events, but doesn’t it seem odd that the gender of competitors is listed alongside the gender of the event?

A brief history of Olympic Sports (from Wikipedia). Military patrol was a demonstration sport in 1928, 1936 and 1948. Is that likely to make a return in 2016? Or would terrorist spotting be more appropriate?

July 4, 2012

Living with Imperfect Data

Filed under: Data,Data Governance,Data Quality,Topic Maps — Patrick Durusau @ 5:00 pm

Living with Imperfect Data by Jim Ericson.

From the post:

In a keynote at our MDM & Data Governance conference in Toronto a few days ago, an executive from a large analytical software company said something interesting that stuck with me. I am paraphrasing from memory, but it was very much to the effect of, “Sometimes it’s better to have everyone agreeing on numbers that aren’t entirely accurate than having everyone off doing their own numbers.”

Let that sink in for a moment.

After I did, the very idea of this comment struck me at a few levels. It might have the same effect on you.

In one sense, admitting there is an acceptable level of shared inaccuracy is anathema to the way we like to describe data governance. It was especially so at a MDM-centric conference where people are pretty single-minded about what constitutes “truth.”

As a decision support philosophy, it wouldn’t fly at a health care conference.

I rather like that: “Sometimes it’s better to have everyone agreeing on numbers that aren’t entirely accurate than having everyone off doing their own numbers.”

I suspect that is because it is the opposite of how I really like to see data. I don’t want rough results in, say, a citation network; I want all the relevant citations. Even if it isn’t possible to review all the relevant citations, the result still needs to be complete.

But completeness is the enemy of results, or at least of published results. Sure, eventually, assuming a small enough data set, it is possible to map it in its entirety. But that means that whatever good would have come from it being available sooner has been lost.

I don’t want to lose the sense of rough agreement posed here, because that is important as well. There are many cases where, despite protests from the Fed and economists to the contrary, the numbers are almost fictional anyway. Pick some; they will be different soon enough. What counts is that we have agreed on numbers for planning purposes. We can always pick new ones.

The same is true for topic maps, perhaps even more so. They are a view into an infoverse, fixed at a moment in time by authoring decisions.

Don’t like the view? Create another one.

June 30, 2012

Inside the Open Data white paper: what does it all mean?

Filed under: Data,Open Data — Patrick Durusau @ 6:49 pm

Inside the Open Data white paper: what does it all mean?

The Guardian reviews a recent white paper on open data in the UK:

Does anyone disagree with more open data? It’s a huge part of the coalition government’s transparency strategy, championed by Francis Maude in the Cabinet Office and key to the government’s self-image.

And – following on from a less-than-enthusiastic NAO report on its achievements in April – today’s Open Data White Paper is the government’s chance to seize the initiative.

Launching the paper, Maude said:

Today we’re at a pivotal moment – where we consider the rules and ways of working in a data‑rich world and how we can use this resource effectively, creatively and responsibly. This White Paper sets out clearly how the UK will continue to unlock and seize the benefits of data sharing in the future in a responsible way

And this one comes with a spreadsheet too – a list of each department’s commitments.

So, what does it actually include? White Papers are traditionally full of official, yet positive-sounding waffle, but what about specific announcements? We’ve extracted the key commitments below.

Just in case you are interested in open data from the UK or open data more generally.

It is amusing that the Guardian touts privacy concerns while at the same time bemoaning that access to the Postcode Address File (PAF®), “a database that lists all known UK Postcodes and addresses,” remains in doubt.

I would rather have a little less privacy and a little less junk mail, if you please.

June 27, 2012

The Scourge of Data Silos

Filed under: Data,Data Silos — Patrick Durusau @ 1:28 pm

The Scourge of Data Silos by Rick Sherman

From the post:

“Those who cannot remember the past are condemned to repeat it.” [1]

Over the years there have been many technology waves related to the design, development and deployment of Business Intelligence (BI). As BI technologies evolved, they have been able to significantly expand their functionality by leveraging the incredible capacity growth of CPUs, storage, disk I/O, memory and network bandwidth. New technologies have emerged as enterprises’ data needs keep expanding in variety, volume and velocity.

Technology waves are occurring more frequently than ever. Current technology waves include Big Data, data virtualization, columnar databases, BI appliances, in-memory analytics, predictive analytics, and self-service BI.

Common Promises

Each wave brings with it the promise of faster, easier to use and cheaper BI solutions. Each wave promises to be the breakthrough that makes the “old ways” archaic, and introduces a new dawn of pervasive BI responsive to business needs. No more spreadsheets or reports needed!

IT and product vendors are ever hopeful that the latest technology wave will be the magic elixir for BI, however, people seem to miss that it is not technology that is the gating factor to pervasive BI. What has held back BI has been the reluctance to address the core issues of establishing enterprise data management, information architecture and data governance. Those core issues are hard and the perpetual hope is that one of these technology waves will be the Holy Grail of BI and allow enterprises to skip the hard work of transforming and managing information. We have discussed these issues many times (and will again), but what I want to discuss is the inevitable result in the blind faith in the latest technology wave.

Rick does a good job at pointing out “the inevitable result in the blind faith in the latest technology wave.”

His cool image of silos at the top is a hint about his conclusion:

[image: data silos]

I have railed about data silos, along with everyone else, for years. But the line of data silos seems to be endless. As indeed I have come to believe it is.

Endless that is. We can’t build data structures or collections of data without building data silos. Sometimes with enough advantages to justify a new silo, sometimes not.

Rather than “kick against the bricks” of data silos, our time would be better spent making our data silos as transparent as need be.

Not completely and in some cases not at all. Simply not worth the effort. In those cases, we can always fall back on ETL, or simply ignore the silo altogether.

I posted recently about open data passing the one millionth data set. Data that is trapped in data silos of one sort or another.

We can complain about the data that is trapped inside or we can create mechanisms to free it and data that will inevitably be contained in future data silos.

Even topic map syntaxes and/or models are data silos. But that’s the point isn’t it? We are silo builders and that’s ok.

What we need to add to our skill set is making windows in silos and sharing those windows with others.
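
To make that concrete, here is a minimal sketch (mine, not Rick’s) of what a “window” between two silos can look like: a small identifier mapping that lets records about the same subject be read together without physically merging the silos. All names, keys and fields below are hypothetical.

```python
# A hypothetical "window" between two data silos: a mapping of
# subject identifiers lets related records be read side by side
# without physically merging either silo.

crm_silo = {
    "C-1001": {"name": "Acme Corp.", "region": "EMEA"},
}
billing_silo = {
    "ACME-42": {"balance": 1250.00, "currency": "USD"},
}

# The "window": which identifiers denote the same subject.
same_subject = {"C-1001": "ACME-42"}

def view_subject(crm_id):
    """Return a combined view of one subject across both silos."""
    combined = dict(crm_silo.get(crm_id, {}))
    billing_id = same_subject.get(crm_id)
    if billing_id:
        combined.update(billing_silo.get(billing_id, {}))
    return combined

print(view_subject("C-1001"))
# {'name': 'Acme Corp.', 'region': 'EMEA', 'balance': 1250.0, 'currency': 'USD'}
```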

Kiss the Weatherman [Weaponizing Data]

Filed under: BigData,Data,Dataset,Weather Data — Patrick Durusau @ 8:05 am

Kiss the Weatherman by James Locus.

From the post:

Weather Hurts

Catastrophic weather events like the historic 2011 floods in Pakistan or prolonged droughts in the horn of Africa make living conditions unspeakably harsh for tens of millions of families living in these affected areas. In the US, the winter storms of 2009-2010 and 2010-2011 brought record-setting snowfall, forcing mighty metropolises into an icy standstill. Extreme weather can profoundly impact the landscape of the planet.

The effects of extreme weather can send terrible ripples throughout an entire community. Unexpected cold snaps or overly hot summers can devastate crop yields and forcing producers to raise prices. When food prices rise, it becomes more difficult for some people to earn enough money to provide for their families, creating even larger problems for societies as a whole.

The central problem is the inability of current measuring technologies to more accurately predict large-scale weather patterns. Weathermen are good at predicting weather but poor at predicting climate. Weather occurs over a shorter period of time and can be reliability predicted within a 3-day timeframe. Climate stretches many months, years, or even centuries. Matching historical climate data with current weather data to make future weather and climate is a major challenge for scientists.

James has a good survey of both data sources and researchers working on using “big data” (read historical weather data) for both weather (short term) and climate (longer term) prediction.

Weather data by itself is just weather data.

What other data would you combine with it and on what basis to weaponize the data?

No one can control the weather but you can control your plans for particular weather events.

June 25, 2012

Data citation initiatives and issues

Filed under: Data,Data Citation,Data Management — Patrick Durusau @ 3:57 pm

Data citation initiatives and issues by Matthew S. Mayernik (Bulletin of the American Society for Information Science and Technology Volume 38, Issue 5, pages 23–28, June/July 2012)

Abstract:

The importance of formally citing scientific research data has been recognized for decades but is only recently gaining momentum. Several federal government agencies urge data citation by researchers, DataCite and its digital object identifier registration services promote the practice of citing data, international citation guidelines are in development and a panel at the 2012 ASIS&T Research Data Access and Preservation Summit focused on data citation. Despite strong reasons to support data citation, the lack of individual user incentives and a pervasive cultural inertia in research communities slow progress toward broad acceptance. But the growing demand for data transparency and linked data along with pressure from a variety of stakeholders combine to fuel effective data citation. Efforts promoting data citation must come from recognized institutions, appreciate the special characteristics of data sets and initially emphasize simplicity and manageability.

This is an important and eye-opening article on the state of data citations and issues related to it.

I found it surprising, in part because citation of data in radio and optical astronomy has long been commonplace, and in part because, for decades now, the astronomical community has placed a high value on public archiving of research data as it is acquired, in both raw and processed formats.

As pointed out in this paper, without public archiving, there can be no effective form of data citation. Sad to say, the majority of data never makes it to public archives.

Given the reliance on private and public sources of funding for research, public archiving and access should be guaranteed as a condition of funding. Researchers would be free to continue to not make their data publicly accessible, should they choose to fund their own work.

If that sounds harsh, consider the well deserved amazement at the antics over access to the Dead Sea Scrolls.

If the only way for your opinion/analysis to prevail is to deny others access to the underlying data, that is all the commentary the community needs on your work.

June 12, 2012

Open Content (Index Data)

Filed under: Data,Data Source — Patrick Durusau @ 3:20 pm

Open Content

From the webpage:

The searchable indexes below expose public domain ebooks, open access digital repositories, Wikipedia articles, and miscellaneous human-cataloged Internet resources. Through standard search protocols, you can make these resources part of your own information portals, federated search systems, catalogs etc. Connection instructions for SRU and Z39.50 are provided. If you have comments, questions, or suggestions for resources you would like us to add, please contact us, or consider joining the mailing list. This service is powered by Index Data’s Zebra and Metaproxy.

Looking around after reading the post on the interview with Sebastian Hammer on Federated Search I found this listing of resources.

Database name          #records    Description
gutenberg              22194       Project Gutenberg. High-quality clean-text ebooks, some audio-books.
oaister                9988376     OAIster. A Union catalog of digital resources, chiefly open archives of journals, etc.
oca-all                135673      All of the ebooks made available by the Internet Archive as part of the Open Content Alliance (OCA). Includes high-quality, searchable PDFs, online book-readers, audio books, and much more. Excludes the Gutenberg sub-collection, which is available as a separate database.
oca-americana          49056       The American Libraries collection of the Open Content Alliance.
oca-iacl               669         The Internet Archive Children’s Library. Books for children from around the world.
oca-opensource         2616        Collection of community-contributed books at the Internet Archive.
oca-toronto            37241       The Canadian Libraries collection of the Open Content Alliance.
oca-universallibrary   30888       The Universal Library, a digitization project founded at Carnegie-Mellon University. Content hosted at the Internet Archive.
wikipedia              1951239     Titles and abstracts from Wikipedia, the open encyclopedia.
wikipedia-da           66174       The Danish Wikipedia. Many thanks to Fujitsu Denmark for their support for the indexing of the national Wikipedias.
wikipedia-sv           243248      The Swedish Wikipedia.
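
The page’s mention of SRU access suggests an easy experiment. Here is a minimal sketch of an SRU searchRetrieve request using Python’s requests library; the base URL is a placeholder you would replace with the connection details given on the Open Content page, and the CQL query is only an example.

```python
import requests
import xml.etree.ElementTree as ET

# Placeholder endpoint -- substitute the SRU base URL given in the
# connection instructions on the Open Content page.
BASE_URL = "http://example.indexdata.com/gutenberg"

params = {
    "version": "1.1",
    "operation": "searchRetrieve",
    "query": "dickens",        # CQL query, example only
    "maximumRecords": 5,
}

response = requests.get(BASE_URL, params=params, timeout=30)
response.raise_for_status()

# SRU responses are XML; report how many records matched.
root = ET.fromstring(response.content)
ns = {"srw": "http://www.loc.gov/zing/srw/"}
hits = root.findtext("srw:numberOfRecords", namespaces=ns)
print(f"Records matching the query: {hits}")
```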

Latency is an issue, but I wonder what my reaction would be if a search quickly offered 3 or 4 substantive resources and invited me to read/manipulate them while it sought additional information/data.

Most of the articles you see cited in this blog aren’t the sort of thing you can skim and some take more than one pass to jell.

I suppose I could be offered 50 highly relevant articles in milliseconds, but I am not capable of assimilating them that quickly.

So how many resources have been wasted to give me a capacity I can’t effectively use?

June 8, 2012

Data Science Summit 2012

Filed under: BigData,Data,Data Science — Patrick Durusau @ 8:58 pm

Data Science Summit 2012

From Greenplum, videos from the most recent data summit:

June 7, 2012

Cascading 2.0

Filed under: Cascading,Data,Data Integration,Data Management,Data Streams — Patrick Durusau @ 2:16 pm

Cascading 2.0

From the post:

We are happy to announce that Cascading 2.0 is now publicly available for download.

http://www.cascading.org/downloads/

This release includes a number of new features. Specifically:

  • Apache 2.0 Licensing
  • Support for Hadoop 1.0.2
  • Local and Hadoop planner modes, where local runs in memory without Hadoop dependencies
  • HashJoin pipe for “map side joins”
  • Merge pipe for “map side merges”
  • Simple Checkpointing for capturing intermediate data as a file
  • Improved Tap and Scheme APIs

We have also created a new top-level project on GitHub for all community sponsored Cascading projects:

https://github.com/Cascading

From the documentation:

What is Cascading?

Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows on a single computing node or distributed computing cluster. On a single node, Cascading’s “local mode” can be used to efficiently test code and process local files before being deployed on a cluster. On a distributed computing cluster using Apache Hadoop platform, Cascading adds an abstraction layer over the Hadoop API, greatly simplifying Hadoop application development, job creation, and job scheduling.
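
The “map side join” behind the HashJoin pipe listed above is a general technique worth seeing on its own: the smaller input is held in memory as a hash table and the larger input is streamed against it, avoiding a shuffle. A framework-free Python sketch of the idea (not Cascading code, and the sample rows are invented):

```python
def hash_join(small, large, key_small, key_large):
    """Map-side style hash join: build a hash table from the smaller
    relation, then stream the larger relation against it."""
    table = {}
    for row in small:                      # build phase
        table.setdefault(row[key_small], []).append(row)
    for row in large:                      # probe phase (streamed)
        for match in table.get(row[key_large], []):
            yield {**match, **row}

countries = [{"code": "US", "name": "United States"},
             {"code": "DK", "name": "Denmark"}]
events = [{"code": "US", "medals": 104},
          {"code": "DK", "medals": 9}]

for joined in hash_join(countries, events, "code", "code"):
    print(joined)
```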

Cascading homepage.

Don’t miss the extensions to Cascading: Cascading Extensions. Any summary would be unfair. Take a look for yourself. Coverage of any of these you would like to point out?

I first spotted Cascading 2.0 at Alex Popescu’s myNoSQL.

June 5, 2012

Are You a Bystander to Bad Data?

Filed under: Data,Data Quality — Patrick Durusau @ 7:58 pm

Are You a Bystander to Bad Data? by Jim Harris.

From the post:

In his recent Harvard Business Review blog post “Break the Bad Data Habit,” Tom Redman cautioned against correcting data quality issues without providing feedback to where the data originated.

“At a minimum,” Redman explained, “others using the erred data may not spot the error. There is no telling where it might turn up or who might be victimized.” And correcting bad data without providing feedback to its source also denies the organization an opportunity to get to the bottom of the problem.

“And failure to provide feedback,” Redman continued, “is but the proximate cause. The deeper root issue is misplaced accountability — or failure to recognize that accountability for data is needed at all. People and departments must continue to seek out and correct errors. They must also provide feedback and communicate requirements to their data sources.”

In his blog post, “The Secret to an Effective Data Quality Feedback Loop,” Dylan Jones responded to Redman’s blog post with some excellent insights regarding data quality feedback loops and how they can help improve your data quality initiatives.

[I removed two incorrect links in the quoted portion of Jim’s article. They were pointers to the rapper “Redman” and not Tom Redman. I posted a comment on Jim’s blog about the error.]

Take the time to think about providing feedback on bad data.

Would bad data get corrected more often if correction was easier?

What if a data stream could be intercepted and corrected? Would that make correction easier?
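
One way to picture that, strictly as a sketch: a generator that applies correction rules in-line and reports each fix back toward the source, so the feedback loop Redman and Jones describe is not lost. The rules and records below are hypothetical.

```python
def corrected_stream(records, rules, report):
    """Yield records with correction rules applied, reporting each
    fix back toward the data source via the report callback."""
    for record in records:
        for field, (is_bad, fix) in rules.items():
            value = record.get(field)
            if value is not None and is_bad(value):
                fixed = fix(value)
                report(field, value, fixed)
                record = {**record, field: fixed}
        yield record

# Hypothetical rule: country codes should be upper case.
rules = {"country": (str.islower, str.upper)}
feedback_log = []

stream = [{"id": 1, "country": "us"}, {"id": 2, "country": "DK"}]
cleaned = list(corrected_stream(
    stream, rules,
    report=lambda f, old, new: feedback_log.append((f, old, new)),
))

print(cleaned)        # corrected records
print(feedback_log)   # what should flow back to the source
```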

May 28, 2012

5 Hidden Skills for Big Data Scientists

Filed under: Data,Data Science — Patrick Durusau @ 6:44 pm

5 Hidden Skills for Big Data Scientists by Matthew Hurst.

Matthew outlines five (5) skills for data scientists:

  1. Be Clear: Is Your Problem Really A Big Data Problem?
  2. Communicating About Your Data
  3. Invest in Interactive Analytics, not Reporting
  4. Understand the Role and Quality of Human Evaluations of Data
  5. Spend Time on the Plumbing

Turn off your email/cellphone and spend a few minutes jotting down your ideas on these points.

Then compare your ideas/comments to Matthew’s.

Not a question of better/worse but of forming a habit of thinking about data.

May 18, 2012

From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications

Filed under: Data,Knowledge,Machine Learning,Stream Analytics — Patrick Durusau @ 3:06 pm

From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications

From the post:

Here is the first series of videos from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012). Congratulations to the local Organizing committee: Joshua Bloom, Damian Eads, Berian James, Peter Nugent, John Rice, Joseph Richards and Dan Starr for making the meeting happen and putting it all on videos for others to learn from (in near real time!). The titles of the talks are linked to the presentation slides. The full program which ends tomorrow is here. All the videos are here: Part 1, Part 2, Part 3, Part 4, Part 5.

Posted by Igor Carron at Nuit Blanche.

Finding enough hours to watch all of these is going to be a problem!

Which ones do you like best?

May 14, 2012

CDG – Community Data Generator

Filed under: Ctools,Data — Patrick Durusau @ 5:50 pm

CDG – Community Data Generator

From the post:

CDG is a datawarehouse generator and the newest member of the Ctools family. Given the definition of dimensions that we want, CDG will randomize data within certain parameters and output 3 different things:

  • Database and table ddl for the fact table
  • A file with inserts for the fact table
  • Mondrian schema file to be used within pentaho

While most of the documentation mentions the usage within the scope of Pentaho there’s absolutely nothing that prevents the resulting database to be used in different contexts.

I had mentioned ctools before but not in any detail. This was the additional resource that made me pick them back up.

It isn’t hard to see how this data generator will be useful.

For subject-centric software, generating files with known “same subject” characteristics would be more useful.

Thoughts, suggestions or pointers to work on generation of such files?
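
As a starting point for that discussion, here is a minimal sketch of generating test records with known “same subject” characteristics: each subject is emitted several times under varied identifiers and spellings, with the ground truth kept alongside so a merging engine can be scored against it. Everything here is made up for illustration.

```python
import random

SUBJECTS = [
    {"canonical": "International Business Machines",
     "variants": ["IBM", "I.B.M.", "Intl. Business Machines"]},
    {"canonical": "World Health Organization",
     "variants": ["WHO", "W.H.O."]},
]

def generate(n_records, seed=42):
    """Emit (record, ground_truth_id) pairs: the same subject appears
    under several names, and ground truth records which rows co-refer."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_records):
        subject_id, subject = rng.choice(list(enumerate(SUBJECTS)))
        name = rng.choice([subject["canonical"]] + subject["variants"])
        rows.append(({"record_id": i, "name": name}, subject_id))
    return rows

for record, truth in generate(6):
    print(truth, record)
# A merge engine should group rows by `truth` without ever seeing it.
```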

May 11, 2012

Data journalism handbook: Tips for Working with Numbers in the News

Filed under: Data,News — Patrick Durusau @ 6:40 pm

In Data journalism handbook: Tips for Working with Numbers in the News, Michael Blastland offers some short tips that will ease you toward becoming a data journalist.

You might want to print out Michael’s tips and keep them close at hand.

After a while you may want to add your own tips about particular data sources.

Or better yet, share them with others!

Oh, btw, the Data Journalism Handbook.

May 8, 2012

Intent vs. Inference

Filed under: Data,Data Analysis,Inference,Intent — Patrick Durusau @ 3:03 pm

Intent vs. Inference by David Loshin.

David writes:

I think that the biggest issue with integrating external data into the organization (especially for business intelligence purposes) is related to the question of data repurposing. It is one thing to consider data sharing for cross-organization business processes (such as brokering transactions between two different trading partners) because those data exchanges are governed by well-defined standards. It is another when your organization is tapping into a data stream created for one purpose to use the data for another purpose, because there are no negotiated standards.

In the best of cases, you are working with some published metadata. In my previous post I referred to the public data at www.data.gov, and those data sets are sometimes accompanied by their data layouts or metadata. In the worst case, you are integrating a data stream with no provided metadata. In both cases, you, as the data consumer, must make some subjective judgments about how that data can be used.

A caution about “intent” or as I knew it, the intentional fallacy in literary criticism. It is popular in some legal circles in the United States as well.

One problem is that there is no common basis for determining authorial intent.

Another problem is that “intent” is often used to privilege one view over others as representing the “intent” of the author. The “original” view is beyond questioning or criticism because it is the “intent” of the original author.

It should come as no surprise that for law (Scalia and the constitution) and the Bible (you pick’em), “original intent” means “agrees with the speaker.”

It isn’t entirely clear where David is going with this thread but I would simply drop the question of intent and ask two questions:

  1. What is the purpose of this data?
  2. Is the data suited to that purpose?

Where #1 may include what inferences we want to make, etc.

Cuts to the chase as it were.

May 4, 2012

Titles from Springer collection cover wide range of disciplines on Apple’s iBookstore

Filed under: Books,Data,Springer — Patrick Durusau @ 3:44 pm

Titles from Springer collection cover wide range of disciplines on Apple’s iBookstore

From the post:

Springer Science+Business Media now offers one of the largest scientific, technical and medical (STM) book collections on the iBookstore with more than 20,000 individual Springer titles. Cornerstone works in disciplines like mathematics, medicine and engineering are now available, along with selections in other fields such as business and economics. Titles include the Springer Handbook of Nanotechnology, Pattern Recognition and Machine Learning, Bergey’s Manual of Systematic Bacteriology and the highly regarded book series Graduate Texts in Mathematics.

Springer is currently undertaking an exhaustive effort to digitize all of its books dating back to the mid-nineteenth century. By making most of its entire collection – both new and archived titles – available through its SpringerLink platform, Springer offers STM researchers far more opportunities than ever to obtain and apply content.

Gee, do you think the nomenclature has changed between the mid-nineteenth century and now? Just a bit? To say nothing of differences across languages.

Prime topic map territory, both for traditional build-and-sell versions and for topic trails through literature.

Will have to check to see how far back the current Springer API goes.

Bridging the Data Science Gap (DataKind)

Filed under: Data,Data Analysis,Data Science,Data Without Borders,DataKind — Patrick Durusau @ 3:43 pm

Bridging the Data Science Gap

From the post:

Data Without Borders connects data scientists with social organizations to maximize their impact.

Data scientists want to contribute to the public good. Social organizations often boast large caches of data but neither the resources nor the skills to glean insights from them. In the worst case scenario, the information becomes data exhaust, lost to neglect, lack of space, or outdated formats. Jake Porway, Data Without Borders [DataKind] founder and The New York Times data scientist, explored how to bridge this gap during the second Big Data for the Public Good seminar, hosted by Code for America and sponsored by Greenplum, a division of EMC.

Code for America founder Jennifer Pahlka opened the seminar with an appeal to the data practitioners in the room to volunteer for social organizations and civic coding projects. She pointed to hackathons such the ones organized during the nationwide event Code Across America as being examples of the emergence of a new kind of “third place”, referencing sociologist Ray Oldenburg’s theory that the health of a civic society depends upon shared public spaces that are neither home nor work. Hackathons, civic action networks like the recently announced Code for America Brigade, and social organizations are all tangible third spaces where data scientists can connect with community while contributing to the public good.

These principles are core to the Data Without Borders [DataKind] mission. “Anytime there’s a process, there’s data,” Porway emphasized to the audience. Yet much of what is generated is lost, particularly in the third world, where a great amount of information goes unrecorded. In some cases, the social organizations that often operate on shoestring budgets may not even appreciate the value of what they’re losing. Meanwhile, many data scientists working in the private sector want to contribute their skills for the social good in their off-time. “On the one hand, we have a group of people who are really good at looking at data, really good at analyzing things, but don’t have a lot of social outputs for it,” Porway said. “On the other hand, we have social organizations that are surrounded by data and are trying to do really good things for the world but don’t have anybody to look at it.”

The surplus of free work to be done is endless, but I thought you might find this interesting.

Data Without Borders – name change -> DataKind, Facebook page, @datakind on Twitter.

Good opportunity to show off your topic mapping skills!

May 2, 2012

Google BigQuery and the Github Data Challenge

Filed under: Contest,Data,Google BigQuery — Patrick Durusau @ 10:54 am

Google BigQuery and the Github Data Challenge

Deadline May 21, 2012

From the post:

Github has made data on its code repositories, developer updates, forks etc. from the public GitHub timeline available for analysis, and is offering prizes for the most interesting visualization of the data. Sounds like a great challenge for R programmers! The R language is currently the 26th most popular on GitHub (up from #29 in December), and it would be interesting to visualize the usage of R compared to other languages, for example. The deadline for submissions to the contest is May 21.

Interestingly, GitHub has made this data available on the Google BigQuery service, which is available to the public today. BigQuery was free to use while it was in beta test, but Google is now charging for storage of the data: $0.12 per gigabyte per month, up to $240/month (the service is limited to 2TB of storage – although there a Premier offering that supports larger data sizes … at a price to be negotiated). While members of the public can run SQL-like queries on the GitHub data for free, Google is charging subscribers to the service 3.5 cents per Gb processed in the query: this is measured by the source data accessed (although columns of data not referenced aren't counted); the size of the result set doesn't matter.

Watch your costs, but any thoughts on how you would visualize the data?
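
For anyone wanting to experiment, here is a rough sketch of the query-and-plot loop involved, using the google-cloud-bigquery client; the dataset/table name and column names are assumptions standing in for the public GitHub timeline table, so check the contest documentation for the real schema.

```python
from google.cloud import bigquery
import matplotlib.pyplot as plt

client = bigquery.Client()  # assumes credentials are already configured

# Table and column names are placeholders for the public GitHub
# timeline dataset; consult the contest docs for the real schema.
sql = """
    SELECT repository_language AS language, COUNT(*) AS events
    FROM `githubarchive.github.timeline`
    GROUP BY language
    ORDER BY events DESC
    LIMIT 15
"""

df = client.query(sql).to_dataframe()

df.plot(kind="barh", x="language", y="events", legend=False)
plt.xlabel("Public events")
plt.title("GitHub events by repository language (sample query)")
plt.tight_layout()
plt.show()
```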

May 1, 2012

Researchers Turn Data into Dynamic Demographics

Filed under: Data,Demographics,Foursquare — Patrick Durusau @ 4:46 pm

Researchers Turn Data into Dynamic Demographics

From the post:

Aside from showing off their travel, culinary and nightlife habits, users of the geolocated “check-in” service Foursquare could shed light on the character of a particular city and its neighborhoods.

Researchers at Carnegie Mellon University’s School of Computer Science say that instead of relying on stagnant, unyielding census and neighborhood zoning data to take the temperature of a given community, Foursquare check-in data can provide the much-needed layer of dynamic city life.

The researchers have developed an algorithm that takes the check-ins generated when foursquare members visit participating businesses or venues, and clusters them based on a combination of the location of the venues and the groups of people who most often visit them. This information is then mapped to reveal a city’s Livehoods, a term coined by the SCS researchers.

All of the Livehoods analysis is based on foursquare check-ins that users have shared publicly via social networks such as Twitter. This dataset of 18 million check-ins includes user ID, time, latitude and longitude, and the name and category of the venue for each check-in.

“Our goal is to understand how cities work through the lens of social media,” said Justin Cranshaw, a Ph.D. student in SCS’s Institute for Software Research.

The researchers analyzed data from foursquare, but the same computational techniques could be applied to several other databases of location information. The researchers are exploring applications to city planning, transportation and real estate development. Livehoods also could be useful for businesses developing marketing campaigns or for public health officials tracking the spread of disease.

A good example of remapping data. The data was collected and “mapped” for one purpose but subsequently was re-mapped and re-purposed.

Mapping the semantics of data empowers its re-use/re-purposing, which creates further opportunities for re-use and re-purposing.

See also: http://livehoods.org/
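
A toy sketch of the general approach, not the SCS team’s actual algorithm: represent each venue by its coordinates plus the users who check in there, blend the two similarities, and cluster on the result. This version uses scikit-learn’s spectral clustering on a hand-built affinity matrix; all coordinates, visitor sets and weights are invented.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Invented data: (lat, lon) per venue and the users seen at each one.
coords = np.array([[40.44, -79.99], [40.45, -79.98],
                   [40.46, -80.01], [40.40, -79.92]])
visitors = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"x", "y"}]

def user_overlap(i, j):
    """Jaccard similarity of the visitor sets of two venues."""
    u, v = visitors[i], visitors[j]
    return len(u & v) / len(u | v)

n = len(coords)
geo = np.exp(-np.linalg.norm(coords[:, None] - coords[None, :], axis=-1) / 0.05)
social = np.array([[user_overlap(i, j) for j in range(n)] for i in range(n)])

# Combine geographic closeness and shared visitors into one affinity.
affinity = 0.5 * geo + 0.5 * social

labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(affinity)
print(labels)   # venues grouped into "livehood"-like clusters
```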

April 28, 2012

City Dashboard: Aggregating All Spatial Data for Cities in the UK

Filed under: Aggregation,Data,News — Patrick Durusau @ 6:09 pm

City Dashboard: Aggregating All Spatial Data for Cities in the UK

You need to try this out for yourself before reading the rest of this post.

Go ahead, I’ll wait…, …, …, ok.

To some extent this “aggregation” may reflect on the sort of questions we ask users about topic maps.

It’s possible to aggregate data about any number of things. But even if you could, would you want to?

Take the “aggregation” for Birmingham, UK, this evening. One of the components informed me a choir director was arrested for rape. That concerns the choir director a good bit, but why would it interest me?

Isn’t that the problem of aggregation? The definition of “useful” aggregation varies from person to person, even task to task.

Try London while you are at the site. There is a Slightly Unhappier/Significantly Unhappier “Mood” indicator. It has what turns out to be a “count down” timer for the next reset of the indicator.

I thought the changing count reflected people becoming more and more unhappy.

Looked like London was going to “flatline” while I was watching. 😉

Fortunately turned out to not be the case.

There are dangers to personalization but aggregation without relevance just pumps up the noise.

Not sure that helps either.

Suggestions?

April 27, 2012

Data and visualization blogs worth following

Filed under: Data,Graphics,Visualization — Patrick Durusau @ 6:11 pm

Data and visualization blogs worth following

Nathan Yau has posted a list of 38 blogs he follows on:

  • Design and Aesthetics
  • Statistical and Analytical Visualization
  • Journalism
  • General Visualization
  • Maps
  • Data and Statistics

Thought you would enjoy a weekend of updating your blog readers!

April 26, 2012

The Shades of Time Project

Filed under: Data,Dataset,Diversity — Patrick Durusau @ 6:31 pm

The Shades of TIME project by Drew Conway.

Drew writes:

A couple of days ago someone posted a link to a data set of all TIME Magazine covers, from March, 1923 to March, 2012. Of course, I downloaded it and began thumbing through the images. As is often the case when presented with a new data set I was left wondering, “What can I ask of the data?”

After thinking it over, and with the help of Trey Causey, I came up with, “Have the faces of those on the cover become more diverse over time?” To address this question I chose to answer something more specific: Have the color values of skin tones in faces on the covers changed over time?

I developed a data visualization tool, I’m calling the Shades of TIME, to explore the answer to that question.
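
A rough sketch of the kind of measurement involved (not Drew’s actual pipeline): detect faces with OpenCV’s bundled Haar cascade and record the mean color of each face region, which could then be plotted against the cover date. The filename is hypothetical.

```python
import cv2

# Assumes cover images named like "time_1923_03.jpg" are on disk.
def mean_face_color(image_path):
    """Return the average BGR color of detected face regions, or None."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    samples = [image[y:y + h, x:x + w].reshape(-1, 3).mean(axis=0)
               for (x, y, w, h) in faces]
    return sum(samples) / len(samples)

print(mean_face_color("time_1923_03.jpg"))
```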

An interesting data set and an illustration of why topic map applications are more useful if they have dynamic merging (user selected).

Presented with the same evidence, the covers of TIME magazine, I most likely would have:

  • Mapped people on the covers to historical events
  • Mapped people on the covers to additional historical resources
  • Mapped covers into library collections
  • etc.

I would not have set out to explore the diversity in skin color on the covers. In part because I remember when it changed. That is part of my world knowledge. I don’t have to go looking for evidence of it.

My purpose isn’t to say authors, even topic map authors, should avoid having a point of view. That isn’t possible in any event. What I am suggesting is that, to the extent possible, users be enabled to impose their views on a topic map as well.

April 24, 2012

Data Virtualization

Filed under: BigData,Data,Data Analysis,Data Virtualization — Patrick Durusau @ 7:17 pm

David Loshin has a series of excellent posts on data virtualization:

Fundamental Challenges in Data Reusability and Repurposing (Part 1 of 3)

Simplistic Approaches to Data Federation Solve (Only) Part of the Puzzle – We Need Data Virtualization (Part 2 of 3)

Key Characteristics of a Data Virtualization Solution (Part 3 of 3)

In part 3, David concludes:

In other words, to truly provision high quality and consistent data with minimized latency from a heterogeneous set of sources, a data virtualization framework must provide at least these capabilities:

  • Access methods for a broad set of data sources, both persistent and streaming
  • Early involvement of the business user to create virtual views without help from IT
  • Software caching to enable rapid access in real time
  • Consistent views into the underlying sources
  • Query optimizations to retain high performance
  • Visibility into the enterprise metadata and data architectures
  • Views into shared reference data
  • Accessibility of shared business rules associated with data quality
  • Integrated data profiling for data validation
  • Integrated application of advanced data transformation rules that ensure consistency and accuracy

What differentiates a comprehensive data virtualization framework from simplistic layering of access and caching services via data federation is that the comprehensive data virtualization solution goes beyond just data federation. It is not only about heterogeneity and latency, but must incorporate the methodologies that are standardized within the business processes to ensure semantic consistency for the business. If you truly want to exploit the data virtualization layer for performance and quality, you need to have aspects of the meaning and differentiation between use of the data engineered directly into the implementation. And most importantly, also make sure the business user signs-off on the data that is being virtualized for consumption. (emphasis added)

David makes explicit a number of issues, such as integration architectures needing to peer into enterprise metadata and data structures, making it plain that not only data but also the ways we contain/store data have semantics.

I would add: Consistency and accuracy should be checked on a regular basis with specified parameters for acceptable correctness.
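
As a sketch of what “specified parameters for acceptable correctness” might look like in practice (the fields, keys and 1% tolerance below are hypothetical):

```python
def check_consistency(virtual_rows, source_rows, key, field, tolerance=0.01):
    """Compare a virtualized view against its source of record and
    report keys whose values drift beyond the accepted tolerance."""
    source = {row[key]: row[field] for row in source_rows}
    drifted = []
    for row in virtual_rows:
        expected = source.get(row[key])
        if expected is None:
            drifted.append((row[key], "missing in source"))
        elif abs(row[field] - expected) > tolerance * max(abs(expected), 1):
            drifted.append((row[key], row[field], expected))
    return drifted

# Hypothetical example: balances served by the virtualization layer
# versus the system of record, with 1% acceptable drift.
virtual = [{"id": 1, "balance": 100.0}, {"id": 2, "balance": 210.0}]
source = [{"id": 1, "balance": 100.5}, {"id": 2, "balance": 200.0}]
print(check_consistency(virtual, source, "id", "balance"))
# [(2, 210.0, 200.0)]
```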

The heterogeneous data sources that David speaks of are ever changing, both in form and semantics. If you need proof of that, consider the history of ETL at your company. If either form or semantics were stable, that would be a once or twice in a career event. I think we all know that is not the case.

Topic maps can disclose the data and rules for the virtualization decisions that David enumerates. Which has the potential to make those decisions themselves auditable and reusable.

Reuse being an advantage in a constantly changing and heterogeneous semantic environment. Semantics seen once, are very likely to be seen again. (Patterns anyone?)

April 22, 2012

Open Government Data

Filed under: Data,Government Data,Open Data — Patrick Durusau @ 7:06 pm

Open Government Data by Joshua Tauberer.

From the website:

This book is the culmination of several years of thinking about the principles behind the open government data movement in the United States. In the pages within, I frame the movement as the application of Big Data to civics. Topics include principles, uses for transparency and civic engagement, a brief legal history, data quality, civic hacking, and paradoxes in transparency.

Joshua’s book can be ordered in hard copy or ebook, or viewed online for free.

You may find this title useful in discussions of open government data.

April 20, 2012

What Makes Good Data Visualization?

Filed under: Data,Graphics,Visualization — Patrick Durusau @ 6:24 pm

What Makes Good Data Visualization?

Panel discussion at the New York Public Library.

Panelists:

Kaiser Fung, Blogger, junkcharts.typepad.com/numbersruleyourworld
Andrew Gelman, Director, Applied Statistics Center, Columbia University
Mark Hansen, Artist; Professor of Statistics, UCLA
Tahir Hemphill, Creative Director; Founder, Hip Hop Word Count Project
Manuel Lima, Founder, VisualComplexity.com; Senior UX Design Lead, Microsoft Bing

Infovis and Statistical Graphics: Different Goals, Different Looks by Andrew Gelman and Antony Unwin is said in the announcement to be relevant. (It’s forty-four pages so don’t try to read it while watching the video. It is, however, worth your time.)

Unfortunately, the sound quality is very uneven. Ranges from very good to almost inaudible.

@16:04 the show is finally about to begin.

The screen is nearly impossible to see. I have requested that the slides be posted.

The parts of the discussion that are audible are very good, which makes it even more disappointing that so much of it is missing.

Of particular interest (at least to me) were the early comments and illustrations (not visible in the video) of how current graphic efforts are re-creating prior efforts to illustrate data.

April 19, 2012

Knoema Launches the World’s First Knowledge Platform Leveraging Data

Filed under: Data,Data Analysis,Data as Service (DaaS),Data Mining,Knoema,Statistics — Patrick Durusau @ 7:13 pm

Knoema Launches the World’s First Knowledge Platform Leveraging Data

From the post:

DEMO Spring 2012 conference — Today at DEMO Spring 2012, Knoema launched publicly the world’s first knowledge platform that leverages data and offers tools to its users to harness the knowledge hidden within the data. Search and exploration of public data, its visualization and analysis have never been easier. With more than 500 datasets on various topics, gallery of interactive, ready to use dashboards and its user friendly analysis and visualization tools, Knoema does for data what YouTube did to videos.

Millions of users interested in data, like analysts, students, researchers and journalists, struggle to satisfy their data needs. At the same time there are many organizations, companies and government agencies around the world collecting and publishing data on various topics. But still getting access to relevant data for analysis or research can take hours with final outcomes in many formats and standards that can take even longer to get it to a shape where it can be used. This is one of the issues that the search engines like Google or Bing face even after indexing the entire Internet due to the nature of statistical data and diversity and complexity of sources.

One-stop shop for data. Knoema, with its state of the art search engine, makes it a matter of minutes if not seconds to find statistical data on almost any topic in easy to ingest formats. Knoema’s search instantly provides highly relevant results with chart previews and actual numbers. Search results can be further explored with Dataset Browser tool. In Dataset Browser tool, users can get full access to the entire public data collection, explore it, visualize data on tables/charts and download it as Excel/CSV files.

Numbers made easier to understand and use. Knoema enables end-to-end experience for data users, allowing creation of highly visual, interactive dashboards with a combination of text, tables, charts and maps. Dashboards built by users can be shared to other people or on social media, exported to Excel or PowerPoint and embedded to blogs or any other web site. All public dashboards made by users are available in dashboard gallery on home page. People can collaborate on data related issues participating in discussions, exchanging data and content.

Excellent!!!

When “other” data becomes available, users will want to integrate it with their data.

But “other” data will have different or incompatible semantics.

So much for attempts to wrestle semantics to the ground (W3C) or build semantic prisons (unnamed vendors).

What semantics are useful to you today? (patrick@durusau.net)

April 17, 2012

Download 10,000 Days of Free Weather Data for Almost Any Location Worldwide

Filed under: Data,Dataset,PowerPivot — Patrick Durusau @ 7:12 pm

Download 10,000 Days of Free Weather Data for Almost Any Location Worldwide

A very cool demonstration of PowerPivot with weather data.

I don’t have PowerPivot (or Office 2010) but will be correcting that in the near future.

Pointers to importing diverse data into PowerPivot?

April 16, 2012

Data Documentation Initiative (DDI)

Filed under: Data,Data Documentation Initiative (DDI),Vocabularies — Patrick Durusau @ 7:12 pm

Data Documentation Initiative (DDI)

From the website:

The Data Documentation Initiative (DDI) is an effort to create an international standard for describing data from the social, behavioral, and economic sciences. Expressed in XML, the DDI metadata specification now supports the entire research data life cycle. DDI metadata accompanies and enables data conceptualization, collection, processing, distribution, discovery, analysis, repurposing, and archiving.

Two current development lines:

DDI-Lifecycle

Encompassing all of the DDI-Codebook specification and extending it, DDI-Lifecycle is designed to document and manage data across the entire life cycle, from conceptualization to data publication and analysis and beyond. Based on XML Schemas, DDI-Lifecycle is modular and extensible.

Users new to DDI are encouraged to use this DDI-Lifecycle development line as it incorporates added functionality. Use DDI-Lifecycle if you are interested in:

  • Metadata reuse across the data life cycle
  • Metadata-driven survey design
  • Question banks
  • Complex data, e.g., longitudinal data
  • Detailed geographic information
  • Multiple languages
  • Compliance with other metadata standards like ISO 11179
  • Process management and automation

The current version of the DDI-L Specification is Version 3.1.  DDI 3.1 was published in October 2009, superseding DDI 3.0 (published in April 2008). 

DDI-Codebook

DDI-Codebook is a more light-weight version of the standard, intended primarily to document simple survey data. Originally DTD-based, DDI-C is now available as an XML Schema.

The current version of DDI-C is 2.5.
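
As a small illustration of working with DDI-Codebook files, here is a sketch that lists variable names and labels with Python’s standard XML library; the namespace URI and element names follow my reading of DDI-C 2.5 and should be checked against the instance documents you actually have. The file path is a placeholder.

```python
import xml.etree.ElementTree as ET

# Namespace and element names assumed from DDI-Codebook 2.5;
# verify against your own instance documents.
NS = {"ddi": "ddi:codebook:2_5"}

def list_variables(path):
    """Yield (variable name, label) pairs from a DDI-C codebook."""
    root = ET.parse(path).getroot()
    for var in root.findall(".//ddi:dataDscr/ddi:var", NS):
        label = var.findtext("ddi:labl", default="", namespaces=NS)
        yield var.get("name"), label.strip()

for name, label in list_variables("study_codebook.xml"):
    print(f"{name}: {label}")
```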

Be aware that micro-data in DDI was mentioned in The RDF Data Cube Vocabulary draft as a possible target for “extension” of that proposal.

Suggestions of other domain specific data vocabularies?

Unlike the W3C, I don’t see the need for an embrace-and-extend strategy.

There are enough vocabularies, from ancient to present-day, to keep us all busy for the foreseeable future, without trying to restart every current vocabulary effort.

April 14, 2012

7 Big Winners in the U.S. Big Data Drive

Filed under: BigData,Data,Funding — Patrick Durusau @ 6:25 pm

7 Big Winners in the U.S. Big Data Drive by Nicole Hemsoth.

As we pointed out in Big Data is a Big Deal, the U.S. government is ponying up $200 million in new data projects.

Nicole covers seven projects that are of particular interest:

  1. DARPA’s XDATA – See XDATA for details – Closes May 30, 2012.
  2. SDAV Institute (DOE) – SDAV: Scalable Data Management, Analysis and Visualization (has a toolkit and other resources I need to cover separately)
  3. Biological and Environmental Research Program (BER) has created the Atmospheric Radiation Measurement (ARM) Climate Research Facility. Lots of data.
  4. John Wesley Powell Center for Analysis and Synthesis (USGS). Data + tools.
  5. PURVAC Purdue University – Homeland Security
  6. Biosense 2.0 – CDC project
  7. Machine Reading (DARPA) – usual goals:

    developing learning systems that process natural text and insert the resulting semantic representation into a knowledge bases rather than relying on expensive and time-consuming current processes for knowledge representation that require expert and associated knowledge engineers to hand-craft information.

I suppose one lesson to be learned is how quickly the bulk of $200 million can be sucked up by current projects.

The second lesson is to become an ongoing (large ongoing) research project so that you too can suck up new funding.

The third lesson is to use the ostensible goals of these projects as actual goals for your projects. The difference between trying to reach a goal and actually reaching it can matter.

April 12, 2012

30 Places to Find Open Data on the Web

Filed under: Data,Dataset — Patrick Durusau @ 7:04 pm

30 Places to Find Open Data on the Web by Romy Misra.

From the post:

Finding an interesting data set and a story it tells can be the most difficult part of producing an infographic or data visualization.

Data visualization is the end artifact, but it involves multiple steps – finding reliable data, getting the data in the right format, cleaning it up (an often underestimated step in the amount of time it takes!) and then finding the story you will eventually visualize.

Following is a list of useful resources for finding data. Your needs will vary from one project to another, but this list is a great place to start — and bookmark.

A very good collection of data sources.

From the comments as of April 10, 2012, you may also want to consider:

http://data.gov.uk/

http://thedatahub.org/

http://www.freebase.com/

(The photography link in the comments is spam, don’t bother.)

Other data sources that you would suggest?
