Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 13, 2011

Unstructured data ‘out of control’: survey

Filed under: Data,Data Mining — Patrick Durusau @ 7:28 pm

Unstructured data ‘out of control’: survey

Joe McKendrick writes:

Many organizations are becoming overwhelmed with the volumes of unstructured information — audio, video, graphics, social media messages — that falls outside the purview of their “traditional” databases. Organizations that do get their arms around this data will gain significant competitive edge.

As part of my work with Unisphere Research, a division of Information Today, Inc., I helped conduct a new survey that finds unstructured data is growing at a faster clip than relational data — driving the “Big Data” explosion. Thirty-five percent of respondents say unstructured information has already surpassed or will surpass the volume of traditional relational data in the next 36 months. Sixty-two percent say this is inevitable within the next decade. The survey gathered input from 446 data managers and professionals who are readers of Database Trends and Applications magazine, and was underwritten by MarkLogic.

A majority of survey respondents acknowledge that unstructured information is growing out of control and is driving the big data explosion – 91% say unstructured information already lives in their organizations, but many aren’t sure what to do about it.

I mention this survey because unstructured data has few contenders for the attribution, discovery, and extraction of semantics, so topic maps may find less competition there from traditional solutions.

July 1, 2011

ScraperWiki

Filed under: Data,Data Mining,Data Source,Text Extraction — Patrick Durusau @ 2:49 pm

ScraperWiki

From the About page:

What is ScraperWiki?

There’s lots of useful data on the internet – crime statistics, government spending, missing kittens…

But getting at it isn’t always easy. There’s a table here, a report there, a few web pages, PDFs, spreadsheets… And it can be scattered over thousands of different places on the web, making it hard to see the whole picture and the story behind it. It’s like trying to build something from Lego when someone has hidden the bricks all over town and you have to find them before you can start building!

To get at data, programmers write bits of code called ‘screen scrapers’, which extract the useful bits so they can be reused in other apps, or rummaged through by journalists and researchers. But these bits of code tend to break, get thrown away or forgotten once they have been used, and so the data is lost again. Which is bad.

ScraperWiki is an online tool to make that process simpler and more collaborative. Anyone can write a screen scraper using the online editor. In the free version, the code and data are shared with the world. Because it’s a wiki, other programmers can contribute to and improve the code.

Something to keep an eye on and, whenever possible, to contribute to.

People make data difficult to access for a reason. Let’s disappoint them.
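
For readers who have never written one, here is a minimal sketch of the kind of "screen scraper" the About page describes. It uses only Python's standard library and runs on a made-up HTML fragment; the table contents and the names TableScraper and scrape_table are my own illustration, not anything taken from ScraperWiki.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags, etc.) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by sloppy markup


def scrape_table(html):
    """Return the rows of all tables in `html` as lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented page fragment, standing in for a real web page.
SAMPLE_PAGE = """
<html><body>
<table>
  <tr><th>Ward</th><th>Missing kittens</th></tr>
  <tr><td>North</td><td>3</td></tr>
  <tr><td>South</td><td>5</td></tr>
</table>
</body></html>
"""

if __name__ == "__main__":
    for row in scrape_table(SAMPLE_PAGE):
        print(row)
```

A real scraper would also fetch the page and cope with layout changes, which is where such code tends to break and why sharing it seems worthwhile.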

June 23, 2011

Personal Analytics

Filed under: Analytics,Conferences,Data,Data Analysis — Patrick Durusau @ 1:49 pm

Personal Analytics

An O’Reilly Online Strata Conference.

Free

July 12, 2011

16:00 – 18:30 UTC

From the website:

It’s only in the past decade that we’ve become aware of how much of our lives is recorded. From phone companies to merchants, social networks to employers, everyone’s building a record of us―except us. That’s changing. Once, recording every aspect of your life might have seemed obsessive. Now, armed with the latest smartphones and comfortable with visualizations and analytics, life-logging is no longer fringe behavior. In this Strata OLC, we’ll look at the rapidly growing field of personal analytics. We’ll discuss tool stacks for recording lives, and hear surprising stories about what happens when introspection meets technology.

O’Reilly Strata Online is a fast-paced, web-based conference series tackling the impact of a data-driven, always-on world. It combines thorough tutorials, provocative panel discussions, real-world case studies, and deep-dives into technology stacks.

This could be fun, not to mention a possible model for mini-conferences on topic maps.

June 12, 2011

If You Have Too Much Data, then “Good Enough” Is Good Enough

Filed under: BigData,Data,Data Models — Patrick Durusau @ 4:10 pm

If You Have Too Much Data, then “Good Enough” Is Good Enough by Pat Helland.

This is a must-read article, in which the author concludes:

The database industry has benefited immensely from the seminal work on data theory started in the 1970s. This work changed the world and continues to be very relevant, but it is apparent now that it captures only part of the problem.

We need a new theory and taxonomy of data that must include:

  • Identity and versions. Unlocked data comes with identity and optional versions.
  • Derivation. Which versions of which objects contributed to this knowledge? How is their schema interpreted? Changes to the source would drive a recalculation just as in Excel. If a legal reason means the source data may not be used, you should forget about using the knowledge derived from it.
  • Lossyness of the derivation. Can we invent a bounding that describes the inaccuracies introduced by derived data? Is this a multidimensional inaccuracy? Can we differentiate loss from the inaccuracies caused by sheer size?
  • Attribution by pattern. Just like a Mulligan stew, patterns can be derived from attributes that are derived from patterns (and so on). How can we bound taint from knowledge that we are not legally or ethically supposed to have?
  • Classic locked database data. Let’s not forget that any new theory and taxonomy of data should include the classic database as a piece of the larger puzzle.

The example of data relativity, a local “now” in data systems, which may not be consistent with the state at some other location, was particularly good.
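
To make the "identity and versions" and "derivation" items a little more concrete, here is a rough sketch of my own, not Helland's. Every derived value carries the set of source versions that contributed to it, so a later question such as "may we still use this?" can be answered by inspecting that set. All names here (SourceVersion, Derived, combine, is_usable) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List


@dataclass(frozen=True)
class SourceVersion:
    """Identity plus version of one piece of unlocked source data."""
    identity: str
    version: int


@dataclass(frozen=True)
class Derived:
    """A value together with every source version it was derived from."""
    value: float
    sources: FrozenSet[SourceVersion]


def from_source(identity: str, version: int, value: float) -> Derived:
    """Wrap a raw value so that it remembers where it came from."""
    return Derived(value, frozenset({SourceVersion(identity, version)}))


def combine(items: Iterable[Derived], fn: Callable[[List[float]], float]) -> Derived:
    """Apply fn to the values and take the union of all contributing sources."""
    items = list(items)
    sources = frozenset().union(*(d.sources for d in items))
    return Derived(fn([d.value for d in items]), sources)


def is_usable(derived: Derived, forbidden_identities) -> bool:
    """False if any contributing source is one we may not use."""
    return all(s.identity not in forbidden_identities for s in derived.sources)


if __name__ == "__main__":
    orders = from_source("orders", 3, 10.0)
    returns = from_source("returns", 1, 2.0)
    total = combine([orders, returns], sum)
    print(total.value, sorted(s.identity for s in total.sources))
    print(is_usable(total, {"returns"}))
```

This covers only the simplest case. It says nothing about lossiness or about attribution by pattern, which are the harder items on the list.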

May 30, 2011

Social Data on the Web (SDoW2011)

Filed under: Conferences,Data,Semantic Web — Patrick Durusau @ 6:55 pm

Social Data on the Web (SDoW2011)

Important Dates:

Submission deadline: Aug 15, 2011 (23:59 Hawaii time, GMT-10)
Notification of acceptance: Sep 05, 2011
Camera-ready paper submission: Sep 15, 2011
Camera-ready proceedings: Oct 07, 2011
Workshop: Oct 23/24, 2011

From the website:

Aim and Scope

The 4th international workshop Social Data on the Web (SDoW2011) co-located with the 10th International Semantic Web Conference (ISWC2011) aims to bring together researchers, developers and practitioners involved in semantically-enhancing social media websites, as well as academics researching more formal aspects of these interactions between the Semantic Web and Social Web.

It is now widely agreed in the community that the Semantic Web and the Social Web can benefit from each other. On the one hand, the speed at which data is being created on the Social Web is growing at an exponential rate. Recent statistics showed that about 100 million Tweets are created per day and that Facebook now has 500 million users. Yet, some issues still have to be tackled, such as how to efficiently make sense of all this data, how to ensure trust and privacy on the Social Web, how to interlink data from different systems, whether it is on the Web or in the enterprise, or more recently, how to link Social Network and sensor networks to enable Semantic Citizen Sensing.

Prior Proceedings:

SDoW2008

SDoW2009

SDoW2010

May 29, 2011

Pew Research raw survey data now available

Filed under: Data,Data Source — Patrick Durusau @ 7:05 pm

Pew Research raw survey data now available

Actually, the data sets pointed to by FlowingData are part of Pew Internet (the Pew Internet & American Life Project).

For all Pew raw data sets, see: Pew Research Center The Databank

Data is available in the following formats:

  1. Raw survey data file in both SPSS and comma-delimited (.csv) formats. To protect the privacy of respondents, telephone numbers, county of residence and zip code have been removed from all public data files.
  2. Cross tabulation file of questions with basic demographics in Word format. Standard demographic categories include sex, race, age, household income, educational attainment, parental status and geographic location (i.e. urban/rural/suburban).
  3. Survey instrument/questionnaire in Word format. The survey questionnaire provides question and response labels for the raw data file. It also includes all interviewer prompts and programming filters for outside researchers who would like to see how our questions are constructed or use our questions in their own surveys.
  4. Topline data file in Word format that includes trend data to previous surveys in which we have asked each question, where applicable.

As far as I know, the use of topic maps with survey and other data to create “profiles” of particular communities remains unexplored. It may not be possible to predict the actions of any individual, but probabilistic predictions about members of a group may be close enough. Interesting: predicting the actions of any individual may be NP-hard, but it is also irrelevant for most purposes.
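
As a rough illustration of what a group-level "profile" could look like, the sketch below reads comma-delimited survey rows and computes, for each group, the share of respondents giving each answer. The column names and the handful of rows are invented for the example and do not come from any Pew file. It also ignores survey weights, which a real analysis would presumably need to apply.

```python
import csv
import io
from collections import Counter, defaultdict


def group_rates(csv_text, group_col, answer_col):
    """Return {group: {answer: share of that group's respondents}}."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row[group_col]][row[answer_col]] += 1
    return {
        group: {answer: n / sum(answers.values()) for answer, n in answers.items()}
        for group, answers in counts.items()
    }


# Invented rows; the column names are hypothetical.
SAMPLE = """community,uses_social_media
urban,yes
urban,yes
urban,no
rural,no
rural,yes
rural,no
rural,no
"""

if __name__ == "__main__":
    for group, answers in group_rates(SAMPLE, "community", "uses_social_media").items():
        print(group, answers)
```

The shares say nothing certain about any one respondent, only how likely an answer is for a member of the group, which is the "close enough" sort of prediction I have in mind.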

May 17, 2011

Data Serialization

Filed under: Data,Data Streams — Patrick Durusau @ 2:47 pm

Three Reasons Why Apache Avro Data Serialization is a Good Choice for OpenRTB

From the post:

I recently evaluated several serialization frameworks including Thrift, Protocol Buffers, and Avro for a solution to address our needs as a demand side platform, but also for a protocol framework to use for the OpenRTB marketplace as well. The working draft of OpenRTB 2.0 uses simple JSON encoding, which has many advantages including simplicity and ubiquity of support. Many OpenRTB contributors requested we support at least one binary standard as well, to improve bandwidth usage and CPU processing time for real-time bidding at scale.

If you are in need of a data serialization framework this is a good place to start reading.
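
To see why contributors might ask for a binary option alongside JSON, here is a small sketch that encodes the same record both ways and compares sizes. It is not Avro. I use Python's standard struct module as a stand-in for a schema-based binary encoding, on the assumption that both sides already know the field layout. The record and its field names are hypothetical and not taken from OpenRTB.

```python
import json
import struct

# Hypothetical bid record; these field names are not taken from OpenRTB.
BID = {"auction_id": 123456789, "price_micros": 2500000, "width": 300, "height": 250}

# Layout agreed in advance by both sides: two unsigned 64-bit integers
# followed by two unsigned 16-bit integers, big-endian, no padding.
LAYOUT = struct.Struct(">QQHH")
FIELDS = ("auction_id", "price_micros", "width", "height")


def encode_json(record):
    """Self-describing text encoding: field names travel with every record."""
    return json.dumps(record, separators=(",", ":")).encode("utf-8")


def encode_binary(record):
    """Schema-based encoding: only the values are written, in a fixed order."""
    return LAYOUT.pack(*(record[name] for name in FIELDS))


def decode_binary(payload):
    """Rebuild the record from the bytes using the shared layout."""
    return dict(zip(FIELDS, LAYOUT.unpack(payload)))


if __name__ == "__main__":
    as_json = encode_json(BID)
    as_binary = encode_binary(BID)
    print(len(as_json), "bytes as JSON")
    print(len(as_binary), "bytes as packed binary")
    print(decode_binary(as_binary) == BID)
```

The packed form is 20 bytes because the field names never go over the wire. The price is that a reader without the layout cannot interpret the bytes, and managing that shared schema is the job frameworks such as Thrift, Protocol Buffers and Avro take on.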

May 15, 2011

April 26, 2011

Data Beats Math

Filed under: Data,Mathematics,Subject Identity — Patrick Durusau @ 2:17 pm

Data Beats Math

A more recent post by Jeff Jonas.

Topic maps can capture observations, judgments, and conclusions from human analysts.

Do those beat math as well?
