Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 26, 2013

The Wikidata revolution is here:…

Filed under: Data,Wikidata,Wikipedia — Patrick Durusau @ 5:52 pm

The Wikidata revolution is here: enabling structured data on Wikipedia by Tilman Bayer.

From the post:

A year after its announcement as the first new Wikimedia project since 2006, Wikidata has now begun to serve the over 280 language versions of Wikipedia as a common source of structured data that can be used in more than 25 million articles of the free encyclopedia.

By providing Wikipedia editors with a central venue for their efforts to collect and vet such data, Wikidata leads to a higher level of consistency and quality in Wikipedia articles across the many language editions of the encyclopedia. Beyond Wikipedia, Wikidata’s universal, machine-readable knowledge database will be freely reusable by anyone, enabling numerous external applications.

“Wikidata is a powerful tool for keeping information in Wikipedia current across all language versions,” said Wikimedia Foundation Executive Director Sue Gardner. “Before Wikidata, Wikipedians needed to manually update hundreds of Wikipedia language versions every time a famous person died or a country’s leader changed. With Wikidata, such new information, entered once, can automatically appear across all Wikipedia language versions. That makes life easier for editors and makes it easier for Wikipedia to stay current.”

This is a great source of curated data!
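If you want to poke at that data programmatically, Wikidata items are reachable through the MediaWiki API's wbgetentities module. A minimal Python sketch of my own (Q42 is Douglas Adams; error handling omitted):

```python
# Fetch one Wikidata item and print its labels across language editions.
import json
import urllib.request

url = ("https://www.wikidata.org/w/api.php"
       "?action=wbgetentities&ids=Q42&props=labels&format=json")
with urllib.request.urlopen(url) as response:
    entity = json.load(response)["entities"]["Q42"]

# One label per language that has one -- the same structured record
# that infoboxes in the various Wikipedias can now draw on.
for lang, label in sorted(entity["labels"].items()):
    print(lang, label["value"])
```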

Once Under Wraps, Supreme Court Audio Trove Now Online

Filed under: Data,History,Law,Law - Sources — Patrick Durusau @ 3:09 pm

Once Under Wraps, Supreme Court Audio Trove Now Online

From the post:

On Wednesday, the U.S. Supreme Court heard oral arguments in the final cases of the term, which began last October and is expected to end in late June after high-profile rulings on gay marriage, affirmative action and the Voting Rights Act.

Audio from Wednesday’s arguments will be available at week’s end at the court’s website, but that’s a relatively new development at an institution that has historically been somewhat shuttered from public view.

The court has been releasing audio during the same week as arguments only since 2010. Before that, audio from one term generally wasn’t available until the beginning of the next term. But the court has been recording its arguments for nearly 60 years, at first only for the use of the justices and their law clerks, and eventually also for researchers at the National Archives, who could hear — but couldn’t duplicate — the tapes. As a result, until the 1990s, few in the public had ever heard recordings of the justices at work.

But as of just a few weeks ago, all of the archived historical audio — which dates back to 1955 — has been digitized, and almost all of those cases can now be heard and explored at an online archive called the Oyez Project.

A truly incredible resource for U.S. history in general and legal history in particular.

The transcripts and tapes are synchronized so your task, if you are interested, is to map these resources to other historical accounts and resources. 😉

The only disappointment is that the recordings begin with the October term of 1955. One of the most well-known cases of the 20th century, Brown v. Board of Education, was argued in 1952 and re-argued in 1953. Hearing Thurgood Marshall argue that case would be a real treat.

I first saw this at: NPR: oyez.org finishes Supreme Court oral arguments project.

April 25, 2013

A different take on data skepticism

Filed under: Algorithms,Data,Data Models,Data Quality — Patrick Durusau @ 1:26 pm

A different take on data skepticism by Beau Cronin.

From the post:

Recently, the Mathbabe (aka Cathy O’Neil) vented some frustration about the pitfalls in applying even simple machine learning (ML) methods like k-nearest neighbors. As data science is democratized, she worries that naive practitioners will shoot themselves in the foot because these tools can offer very misleading results. Maybe data science is best left to the pros? Mike Loukides picked up this thread, calling for healthy skepticism in our approach to data and implicitly cautioning against a “cargo cult” approach in which data collection and analysis methods are blindly copied from previous efforts without sufficient attempts to understand their potential biases and shortcomings.

…Well, I would argue that all ML methods are not created equal with regard to their safety. In fact, it is exactly some of the simplest (and most widely used) methods that are the most dangerous.

Why? Because these methods have lots of hidden assumptions. Well, maybe the assumptions aren’t so much hidden as nodded-at-but-rarely-questioned. A good analogy might be jumping to the sentencing phase of a criminal trial without first assessing guilt: asking “What is the punishment that best fits this crime?” before asking “Did the defendant actually commit a crime? And if so, which one?” As another example of a simple-yet-dangerous method, k-means clustering assumes a value for k, the number of clusters, even though there may not be a “good” way to divide the data into this many buckets. Maybe seven buckets provides a much more natural explanation than four. Or maybe the data, as observed, is truly undifferentiated and any effort to split it up will result in arbitrary and misleading distinctions. Shouldn’t our methods ask these more fundamental questions as well?
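Beau's k-means warning is easy to make concrete. A quick scikit-learn sketch (my illustration, not his): fit several values of k to data with real structure and to an undifferentiated blob, and compare silhouette scores.

```python
# k-means never refuses: it returns k clusters whether or not k (or any
# clustering) is natural for the data. Silhouette scores are one rough check.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Data with real structure: three well-separated clusters.
real = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
                  for c in ((0, 0), (5, 5), (0, 5))])
# Data with none: one undifferentiated blob.
blob = rng.normal(size=(300, 2))

for name, X in (("3 clusters", real), ("blob", blob)):
    for k in (2, 3, 4, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(name, k, round(silhouette_score(X, labels), 2))
# The structured data scores clearly best at k=3; the blob scores
# middling at every k, because any split of it is arbitrary.
```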

Beau makes several good points about questioning data methods.

I would extend those “…more fundamental questions…” to data as well.

Data, at least as far as I know, doesn’t drop from the sky. It is collected, generated, sometimes both, by design.

That design had some reason for collecting that data, in some particular way and in a given format.

Like methods, data stands mute about those designs: what choices were made, by whom, and for what reasons?

Giving voice to what can be known about methods and data falls to human users.

April 23, 2013

Data Socializing

Filed under: Data,Social Media — Patrick Durusau @ 6:48 pm

If you need more opportunities for data socializing, KDnuggets has compiled: Top 30 LinkedIn Groups for Analytics, Big Data, Data Mining, and Data Science.

Here’s an interesting test:

Write down your LinkedIn groups and compare your list to this one.
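In Python terms the test is just set arithmetic (both lists below are stand-ins, not the actual Top 30):

```python
# Which of the Top 30 do you already belong to, and which are you missing?
mine = {"Advanced Business Analytics", "Machine Learning Connection"}
top30 = {"Advanced Business Analytics", "Big Data and Analytics",
         "Machine Learning Connection", "Data Mining Technology"}

print("already joined:", sorted(mine & top30))
print("worth a look:  ", sorted(top30 - mine))
```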

Enjoy!

Resources and Readings for Big Data Week DC Events

Filed under: BigData,Data,Data Mining,Natural Language Processing — Patrick Durusau @ 6:33 pm

Resources and Readings for Big Data Week DC Events

This is Big Data Week in DC, and Data Community DC has put together a list of books, articles, and posts to keep you busy all week.

Very cool!

April 20, 2013

Data Storytelling: The Ultimate Collection of Resources

Filed under: Communication,Data,Data Storytelling — Patrick Durusau @ 9:47 am

Data Storytelling: The Ultimate Collection of Resources by Zach Gemignani.

From the post:

The hot new concept in data visualization is “data storytelling”; some are calling it the next evolution of visualization (I’m one of them). However, we’re early in the discussion and there are more questions than answers:

  • Is data storytelling more than a catchy phrase?
  • Where does data storytelling fit into the broader landscape of data exploration, visualization, and presentation?
  • How can the traditional tools of storytelling improve how we communicate with data?
  • Is it more about story-telling or story-finding?

Many of the bright minds in the data visualization field have started to tackle these questions — and it is something that we’ve been exploring at Juice in our work. Below you’ll find a collection of some of the best blog posts, presentations, research papers, and other resources that take on this topic.

I count ten (10) blog posts, four (4) presentations, five (5) papers and eight (8) tools, examples and other resources.

Get yourself a fresh cup of coffee. You are going to be here a while.

PS: I don’t know that “data storytelling” is new, or that the last century or so suffered a real drought in it.

Medieval cathedrals were exercises in storytelling but a modern/literate audience fails to appreciate them as designed.

Data Computation Fundamentals [Promoting Data Literacy]

Filed under: Data,Data Science,R — Patrick Durusau @ 8:08 am

Data Computation Fundamentals by Daniel Kaplan and Libby Shoop.

From the first lesson:

Teaching the Grammar of Data

Twenty years ago, science students could get by with a working knowledge of a spreadsheet program. Those days are long gone, says Danny Kaplan, DeWitt Wallace Professor of Mathematics and Computer Science. “Excel isn’t going to cut it,” he says. “In today’s world, students can’t escape big data. Though it won’t be easy to teach it, it will only get harder as they move into their professional training.”

To that end, Kaplan and computer science professor Libby Shoop have developed a one-credit class called Data Computation Fundamentals, which is being offered beginning this semester. Though Kaplan doesn’t pretend the course can address all the complexities of specific software packages, he does hope it will provide a framework that students can apply when they come across databases or data-reliant programs in biology, chemistry, and physics. “We believe we can give students that grammar of data that they need to use these modern capabilities,” he says.

Not quite “have developed.” Should say, “are developing, in conjunction with a group of about 20 students.”

Data literacy impacts the acceptance and use of data and tools for using data.

Teaching people to read and write is not a threat to commercial authors.

By the same token, teaching people to use data is not a threat to competent data analysts.

Help the authors and yourself by reviewing the course and offering comments for its improvement.

I first saw this at: A Course in Data and Computing Fundamentals.

April 19, 2013

Preliminary evaluation of the CellFinder literature…

Filed under: Curation,Data,Entity Resolution,Named Entity Mining — Patrick Durusau @ 2:18 pm

Preliminary evaluation of the CellFinder literature curation pipeline for gene expression in kidney cells and anatomical parts by Mariana Neves, Alexander Damaschun, Nancy Mah, Fritz Lekschas, Stefanie Seltmann, Harald Stachelscheid, Jean-Fred Fontaine, Andreas Kurtz, and Ulf Leser. (Database (2013) 2013 : bat020 doi: 10.1093/database/bat020)

Abstract:

Biomedical literature curation is the process of automatically and/or manually deriving knowledge from scientific publications and recording it into specialized databases for structured delivery to users. It is a slow, error-prone, complex, costly and, yet, highly important task. Previous experiences have proven that text mining can assist in its many phases, especially, in triage of relevant documents and extraction of named entities and biological events. Here, we present the curation pipeline of the CellFinder database, a repository of cell research, which includes data derived from literature curation and microarrays to identify cell types, cell lines, organs and so forth, and especially patterns in gene expression. The curation pipeline is based on freely available tools in all text mining steps, as well as the manual validation of extracted data. Preliminary results are presented for a data set of 2376 full texts from which >4500 gene expression events in cell or anatomical part have been extracted. Validation of half of this data resulted in a precision of ∼50% of the extracted data, which indicates that we are on the right track with our pipeline for the proposed task. However, evaluation of the methods shows that there is still room for improvement in the named-entity recognition and that a larger and more robust corpus is needed to achieve a better performance for event extraction.

Database URL: http://www.cellfinder.org/.

Another extremely useful data curation project.
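If the named-entity step is unfamiliar, here is a toy version using NLTK, a freely available tool of the kind such pipelines build on (the abstract doesn't say whether CellFinder uses NLTK specifically):

```python
# Tokenize, POS-tag, then chunk named entities with NLTK's default
# (newswire-trained) chunker. Requires one-time downloads:
#   nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
#   nltk.download("maxent_ne_chunker"); nltk.download("words")
import nltk

sentence = "HNF4A is expressed in proximal tubule cells of the human kidney."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

# Print whatever chunks were flagged as entities.
for subtree in tree.subtrees(lambda t: t.label() != "S"):
    print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))

# A newswire-trained chunker mislabels or misses gene and anatomy mentions,
# which is exactly why biomedical pipelines need specialized NER models
# and corpora -- the "room for improvement" the authors report.
```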

Do you get the impression that curation projects will continue to be outrun by data production?

And that will be the case, even with machine assistance?

Is there an alternative to falling further and further behind?

Such as abandoning some content (CNN?) to go forever uncurated? Or doing the same for government documents/reports?

I am sure we all have different suggestions for what data to dump alongside the road to make room for the “important” stuff.

Suggestions on solutions other than simply dumping data?

April 18, 2013

Four decades of US terror attacks listed and detailed

Filed under: Data — Patrick Durusau @ 4:49 am

Four decades of US terror attacks listed and detailed by Simon Rogers.

I was disappointed to read:

The horrors of the Boston Marathon explosions have focussed attention on terror attacks in the United States. But how common are they?

The Global Terrorism Database has recorded terror attacks across the world – with data from 1970 covering up to the end of 2011. It’s a huge dataset: over 104,000 attacks, including around 2,600 in the US – and its collection is funded by an agency of the US government: the Science and Technology Directorate of the US Department of Homeland Security through a Center of Excellence program based at the University of Maryland.

There’s a lot of methodology detailed on the site and several definitions of what is terrorism. At its root, the GTD says that terrorism is:

The threatened or actual use of illegal force and violence by a non-state actor to attain a political, economic, religious, or social goal through fear, coercion, or intimidation

I thought from the headlines that there would be a listing of four decades of US terror attacks against other peoples, countries and other groups.

A ponderous list, one that the US has labored long and hard over for the past several decades.

Contrasting “terror” attacks in the US with US terrorist attacks against others would make a better data set.

Starting just after WWII.

Data: Continuous vs. Categorical

Filed under: Data,Data Types — Patrick Durusau @ 4:29 am

Data: Continuous vs. Categorical by Robert Kosara.

From the post:

Data comes in a number of different types, which determine what kinds of mapping can be used for them. The most basic distinction is that between continuous (or quantitative) and categorical data, which has a profound impact on the types of visualizations that can be used.

The main distinction is quite simple, but it has a lot of important consequences. Quantitative data is data where the values can change continuously, and you cannot count the number of different values. Examples include weight, price, profits, counts, etc. Basically, anything you can measure or count is quantitative.

Categorical data, in contrast, is for those aspects of your data where you make a distinction between different groups, and where you typically can list a small number of categories. This includes product type, gender, age group, etc.

Both quantitative and categorical data have some finer distinctions, but I will ignore those for this posting. What is more important, is: why do those make a difference for visualization?

I like the use of visualization to reinforce the difference between continuous and categorical data.

Makes me wonder about using visualization to explore the use of different data types for detecting subject sameness.

It may seem trivial to use the TMDM’s sameness of subject identifiers (simple string matching) to say two or more topics represent the same subject.

But what if subject identifiers match but other properties, say gender (modeled as an occurrence), do not?

Illustrating both a mistake in the use of a subject identifier and a weakness in relying on a subject identifier (a data type URI) for subject identity.

That data type relies on string matching alone for identification purposes, which may or may not agree with your subject sameness requirements.
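A back-of-the-envelope sketch of that check in Python (nothing like full TMDM machinery; all names and property values invented):

```python
# Match topics on shared subject identifiers (string equality), but
# surface conflicts in other properties as a warning that the match,
# or the identifier itself, may be wrong.
def same_subject(a, b):
    if not (a["subject_identifiers"] & b["subject_identifiers"]):
        return False, []
    shared = a["occurrences"].keys() & b["occurrences"].keys()
    conflicts = [k for k in shared if a["occurrences"][k] != b["occurrences"][k]]
    return True, conflicts

a = {"subject_identifiers": {"http://example.org/person/jordan"},
     "occurrences": {"gender": "female", "born": "1970"}}
b = {"subject_identifiers": {"http://example.org/person/jordan"},
     "occurrences": {"gender": "male"}}

matched, conflicts = same_subject(a, b)
print(matched, conflicts)  # True ['gender'] -- identifiers agree, the data doesn't
```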

April 17, 2013

Practical tools for exploring data and models

Filed under: Data,Data Mining,Data Models,Exploratory Data Analysis — Patrick Durusau @ 2:37 pm

Practical tools for exploring data and models by Hadley Alexander Wickham. (PDF)

From the introduction:

This thesis describes three families of tools for exploring data and models. It is organised in roughly the same way that you perform a data analysis. First, you get the data in a form that you can work with; Section 1.1 introduces the reshape framework for restructuring data, described fully in Chapter 2. Second, you plot the data to get a feel for what is going on; Section 1.2 introduces the layered grammar of graphics, described in Chapter 3. Third, you iterate between graphics and models to build a succinct quantitative summary of the data; Section 1.3 introduces strategies for visualising models, discussed in Chapter 4. Finally, you look back at what you have done, and contemplate what tools you need to do better in the future; Chapter 5 summarises the impact of my work and my plans for the future.

The tools developed in this thesis are firmly based in the philosophy of exploratory data analysis (Tukey, 1977). With every view of the data, we strive to be both curious and sceptical. We keep an open mind towards alternative explanations, never believing we have found the best model. Due to space limitations, the following papers only give a glimpse at this philosophy of data analysis, but it underlies all of the tools and strategies that are developed. A fuller data analysis, using many of the tools developed in this thesis, is available in Hobbs et al. (To appear).

Has a focus on R tools, including ggplot2 and Wilkinson’s The Grammar of Graphics.
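The reshape idea travels beyond R, too. As a sketch of my own (not anything from the thesis), pandas’ melt does the same wide-to-long restructuring:

```python
# Melt a wide table into "molten" long form: one measured value per row,
# the shape that layered plotting and aggregation tools prefer.
import pandas as pd

wide = pd.DataFrame({"subject": ["s1", "s2"],
                     "pre":  [110, 125],
                     "post": [102, 121]})
molten = wide.melt(id_vars="subject", var_name="phase", value_name="score")
print(molten)
#   subject phase  score
# 0      s1   pre    110
# 1      s2   pre    125
# 2      s1  post    102
# 3      s2  post    121
```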

The “…never believing we have found the best model” approach works for me!

You?

I first saw this at Data Scholars.

April 16, 2013

The Costs and Profits of Poor Data Quality

Filed under: Data,Data Quality — Patrick Durusau @ 7:10 pm

The Costs and Profits of Poor Data Quality by Jim Harris.

From the post:

Continuing the theme of my two previous posts, which discussed when it’s okay to call data quality as good as it needs to get and when perfect data quality is necessary, in this post I want to briefly discuss the costs — and profits — of poor data quality.

Loraine Lawson interviewed Ted Friedman of Gartner Research about How to Measure the Cost of Data Quality Problems, such as the costs associated with reduced productivity, redundancies, business processes breaking down because of data quality issues, regulatory compliance risks, and lost business opportunities. David Loshin blogged about the challenge of estimating the cost of poor data quality, noting that many estimates, upon close examination, seem to rely exclusively on anecdotal evidence.

As usual, Jim does a very good job of illustrating costs and profits from poor data quality.

I have a slightly different question:

What could you know about data to spot that it is of poor quality?

It is one thing to find out after a spaceship crashes that poor data quality was responsible, but it would be better to spot the error beforehand. As in before the launch.

Probably data specific, but are there any general types of information that would help you spot poor quality data?

Before you are 1,000 meters off the lunar surface. 😉

April 13, 2013

Wikileaks: Kissinger Cables

Filed under: Data,Wikileaks — Patrick Durusau @ 2:13 pm

Wikileaks: Kissinger Cables

The code behind the Public Library of US Diplomacy.

Another rich source of information for anyone creating a mapping of relationships and events in the early 1970s.

My only puzzlement about Wikileaks is its apparent focus on US diplomatic cables.

Where are the diplomatic cables of the former government in Egypt? Or the USSR? Or of any of the many existing regimes around the globe?

Surely those aren’t more difficult to obtain than those of the US?

Perhaps that would make an interesting topic map.

Those who could be exposed by Wikileaks but aren’t.

I first saw this at: Wikileaks ProjectK Code (Github) on Nat Torkington’s Four short links: 12 April 2013.

April 6, 2013

Ultimate library challenge: taming the internet

Filed under: Data,Indexing,Preservation,Search Data,WWW — Patrick Durusau @ 3:40 pm

Ultimate library challenge: taming the internet by Jill Lawless.

From the post:

Capturing the unruly, ever-changing internet is like trying to pin down a raging river. But the British Library is going to try.

For centuries, the library has kept a copy of every book, pamphlet, magazine and newspaper published in Britain. Starting on Saturday, it will also be bound to record every British website, e-book, online newsletter and blog in a bid to preserve the nation’s “digital memory”.

As if that’s not a big enough task, the library also has to make this digital archive available to future researchers – come time, tide or technological change.

The library says the work is urgent. Ever since people began switching from paper and ink to computers and mobile phones, material that would fascinate future historians has been disappearing into a digital black hole. The library says firsthand accounts of everything from the 2005 London transit bombings to Britain’s 2010 election campaign have already vanished.

“Stuff out there on the web is ephemeral,” said Lucie Burgess, the library’s head of content strategy. “The average life of a web page is only 75 days, because websites change, the contents get taken down.

“If we don’t capture this material, a critical piece of the jigsaw puzzle of our understanding of the 21st century will be lost.”

For more details, see Jill’s post, Click to save the nation’s digital memory (British Library press release), or 100 websites: Capturing the digital universe (a sample of results from archiving only 100 sites).

A welcome venture, particularly since the content gathered by the project will be made available to the public.

An unanswerable question, but I do wonder how we would view Greek drama if all of it had been preserved.

Hundreds if not thousands of plays were written and performed every year.

The Complete Greek Drama lists only forty-seven (47) that have survived to this day.

If whole scale preservation is the first step, how do we preserve paths to what’s worth reading in a data labyrinth as a second step?

I first saw this in a tweet by Jason Ronallo.

April 3, 2013

Intrade Archive: Data for Posterity

Filed under: Data,Finance Services,Prediction — Patrick Durusau @ 4:07 am

Intrade Archive: Data for Posterity by Panos Ipeirotis.

From the post:

A few years back, I have done some work on prediction markets. For this line of research, we have been collecting data from Intrade, to perform our experimental analysis. Some of the data is available through the Intrade Archive, a web app that I wrote in order to familiarize myself with the Google App Engine.

In the last few weeks, through, after the effective shutdown of Intrade, I started receiving requests on getting access to the data stored in the Intrade Archive. So, after popular demand, I gathered all the data from the Intrade Archive, and also all the past data that I had about all the Intrade contracts going back to 2003, and I put them all on GitHub for everyone to access and download.

If you don’t know about Intrade, see: How Intrade Works.

Not sure why you would need the data but it is unusual enough to merit notice.

March 28, 2013

…The Analytical Sandbox [Topic Map Sandbox?]

Filed under: Analytics,Data,Data Analysis — Patrick Durusau @ 6:38 pm

Analytics Best Practices: The Analytical Sandbox by Rick Sherman.

From the post:

So this situation sounds familiar, and you are wondering if you need an analytical sandbox…

The goal of an analytical sandbox is to enable business people to conduct discovery and situational analytics. This platform is targeted for business analysts and “power users” who are the go-to people that the entire business group uses when they need reporting help and answers. This target group is the analytical elite of the enterprise.

The analytical elite have been building their own makeshift sandboxes, referred to as data shadow systems or spreadmarts. The intent of the analytical sandbox is to provide the dedicated storage, tools and processing resources to eliminate the need for the data shadow systems.

Rick outlines what he thinks is needed for an analytical sandbox.

What would you include in a topic map sandbox?

March 26, 2013

If you want to talk about the weather…

Filed under: Data,Weather Data — Patrick Durusau @ 4:24 pm

Forecast for Developers

From the webpage:

The same API that powers Forecast.io and Dark Sky for iOS can provide accurate short-term and long-term weather predictions to your business, application, or crazy idea.

We’re developers too, and we like playing with new APIs, so we want you to be able to try ours hassle-free: all you need is an email address.

First thousand API calls a day are free.

Every 10,000 API calls after that are $1.
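A minimal sketch of a call in Python. The endpoint shape (api.forecast.io/forecast/KEY/LAT,LNG) and the response fields shown are my recollection of the developer docs, so verify against the documentation before relying on them:

```python
import json
import urllib.request

API_KEY = "your-key-here"        # from the email-only signup
lat, lng = 38.9072, -77.0369     # Washington, DC

url = "https://api.forecast.io/forecast/{}/{},{}".format(API_KEY, lat, lng)
with urllib.request.urlopen(url) as response:
    forecast = json.load(response)

# "currently" holds the short-term observation/prediction block.
print(forecast["currently"]["summary"], forecast["currently"]["temperature"])
```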

It could be useful/amusing to merge personal weather observations based on profile characteristics.

Like a recommendation system except for how you are going to experience the weather.

March 22, 2013

The Shape of Data

Filed under: Data,Mathematics,Topology — Patrick Durusau @ 1:17 pm

The Shape of Data by Jesse Johnson.

From the “about” page:

Whether your goal is to write data intensive software, use existing software to analyze large, high dimensional data sets, or to better understand and interact with the experts who do these things, you will need a strong understanding of the structure of data and how one can try to understand it. On this blog, I plan to explore and explain the basic ideas that underlie modern data analysis from a very intuitive and minimally technical perspective: by thinking of data sets as geometric objects.

When I began learning about machine learning and data mining, I found that the intuition I had formed while studying geometry was extremely valuable in understanding the basic concepts and algorithms. My main obstacle has been to figure out what types of problems others are interested in solving, and what types of solutions would make the most difference. I hope that by sharing what I know (and what I continue to learn) from my own perspective, others will help me to figure out what are the major questions that drive this field.

A new blog that addresses the topology of data, in an accessible manner.

Making It Happen:…

Filed under: Data,Data Preservation,Preservation — Patrick Durusau @ 9:45 am

Making It Happen: Sustainable Data Preservation and Use by Anita de Waard.

Great set of overview slides on why research data should be preserved.

Not to mention making the case that any proffered solution needs to address semantic diversity: in systems for capturing research data, between researchers, and so on.

If you don’t know Anita de Waard’s work, search for “Anita de Waard” on Slideshare.

As of today, that search returns one hundred and forty (140) presentations.

All of which you will find useful on a variety of data related topics.

American Geophysical Union (AGU)

Filed under: Data,Geophysical,Science — Patrick Durusau @ 9:25 am

American Geophysical Union (AGU)

The mission of the AGU:

The purpose of the American Geophysical Union is to promote discovery in Earth and space science for the benefit of humanity.

While I was hunting down information on DataONE, I ran across the AGU site.

As in all disciplines, data analysis, collection, collation, sharing, etc., are ongoing concerns at the AGU.

My interest is more in the data techniques than in the subject matter.

Seeking to avoid re-inventing the wheel, and to learn new insights that have yet to reach more familiar areas.

March 21, 2013

Data.ac.uk

Filed under: Data,Open Data,RDF — Patrick Durusau @ 2:38 pm

Data.ac.uk

From the website:

This is a landmark site for academia providing a single point of contact for linked open data development. It not only provides access to the know-how and tools to discuss and create linked data and data aggregation sites, but also enables access to, and the creation of, large aggregated data sets providing powerful and flexible collections of information.
Here at Data.ac.uk we’re working to inform national standards and assist in the development of national data aggregation subdomains.

I can’t imagine a greater contrast with my poor web authoring skills than this website.

But having said that, I think you will be as disappointed as I was when you start looking for data on this “landmark site.”

There is some, but not nearly enough to match the promise of such a cleverly designed website.

Perhaps they are hoping that someday RDF data (they also offer comma and tab delimited versions) will catch up to the site design.

I first saw this in a tweet by Frank van Harmelen.

March 14, 2013

#Tweets4Science

Filed under: Data,Government,Tweets — Patrick Durusau @ 9:35 am

#Tweets4Science

From the manifesto:

User generated content has experienced an explosive growth both in the diversity of available services and the volume of topics covered by the users. Content published in micro-blogging sites such as Twitter is a rich, heterogeneous, and, above all, huge sample of the daily musings of our fellow citizens across the world.

Once qualified as inane chatter, more and more researchers are turning to Twitter data to better understand our social behavior and, no doubt, that global chatter will provide a first person account of our times to future historians.

Thus, initiatives such as the one lead by the Library of the US Congress to collect the entire Twitter Archive are laudable. However, as of today, no researcher has been granted access to that archive, there is no estimation on when such access would be possible and, on top of that, access would only be granted on site.

As researchers we understand the legal compromises one must reach with private sector, and we understand that it is fair that Twitter and resellers offer access to Twitter data, including historical data, for a fee (a rather large one, by the way). However, without the data provided by each of Twitter users such a business would be impossible and, hence, we believe that such data belongs to the users individually and as a group.

Includes links on how to download and donate your tweets.

The researchers appeal to altruism: aggregating your tweets with others may advance human knowledge.

I have a much more pragmatic reason:

While I trust the Library of Congress, I don’t trust their paymasters.

Not to sound paranoid, but the delay in anyone accessing the Twitter data at the Library of Congress seems odd. The astronomy community has been providing access to much larger data sets since long before the first tweet.

So why is it taking so long?

While we are waiting on multiple versions of that story, download your tweets and donate them to this project.

March 13, 2013

squirro

Filed under: Data,Filters,Findability,Searching — Patrick Durusau @ 3:14 pm

squirro

I am not sure how “hard” the numbers are, but the CRM application claims:

  • Up to 15% increase in revenues
  • 66% less time wasted on finding and re-finding information
  • 15% increase in win rates

I take this as evidence there is a market for less noisy data streams.

If filtered search can produce this kind of ROI, imagine what curated search can do.

Yes?

March 4, 2013

Data, Data, Data: Thousands of Public Data Sources

Filed under: Data,Dataset — Patrick Durusau @ 5:05 pm

Data, Data, Data: Thousands of Public Data Sources

From the post:

We love data, big and small and we are always on the lookout for interesting datasets. Over the last two years, the BigML team has compiled a long list of sources of data that anyone can use. It’s a great list for browsing, importing into our platform, creating new models and just exploring what can be done with different sets of data.

A rather remarkable list of data sets. You are sure to find something of interest!

March 2, 2013

Hellerstein: Humans are the Bottleneck [Not really]

Filed under: Data,Subject Identity,Topic Maps — Patrick Durusau @ 5:06 pm

Hellerstein: Humans are the Bottleneck by Isaac Lopez.

From the post:

Humans are the bottleneck right now in the data space, commented database systems luminary, Joe Hellerstein during an interview this week at Strata 2013.

“As Moore’s law drives the cost of computing down, and as data becomes more prevalent as a result, what we see is that the remaining bottleneck in computing costs is the human factor,” says Hellerstein, one of the fathers of adaptive query processing and a half dozen other database technologies.

Hellerstein says that recent research studies conducted at Stanford and Berkeley have found that 50-80 percent of a data analyst’s time is being used for the data grunt work (with the rest left for custom coding, analysis, and other duties).

“Data prep, data wrangling, data munging are words you hear over and over,” says Hellerstein. “Even with very highly skilled professionals in the data analysis space, this is where they’re spending their time, and it really is a big bottleneck.”

Just because humans gather at a common location, in “data prep, data wrangling, data munging,” doesn’t mean they “are the bottleneck.”

The question to ask is: Why are people spending so much time at location X in data processing?

Answer: poor data quality, or rather the inability of machines to effectively process data from different origins. That’s the bottleneck.

A problem that management of subject identities for data and its containers is uniquely poised to solve.

Kepler Data Tutorial: What can you do?

Filed under: Astroinformatics,Data,Data Analysis — Patrick Durusau @ 4:55 pm

Kepler Data Tutorial: What can you do?

The Kepler mission was designed to hunt for planets orbiting foreign stars. When a planet passes between the Kepler satellite and its home star, the brightness of the light from the star dips.

That isn’t the only reason for changes in brightness, but officially Kepler has to ignore those other reasons. Unofficially, Kepler has encouraged professional and amateur astronomers to search its data for other causes of light curve variation.

As I mentioned last year, Kepler Telescope Data Release: The Power of Sharing Data, a group of amateurs discovered the first system with four (4) suns and at least one (1) planet.

The Kepler Data Tutorial introduces you to analysis of this data set.
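The transit signal itself is simple enough to sketch on synthetic data: a light curve is mostly flat plus noise, and a transiting planet carves out periodic shallow dips. Flag points well below a rolling median (toy numbers throughout, not real Kepler photometry):

```python
import numpy as np

rng = np.random.default_rng(1)
time = np.arange(2000)
flux = 1.0 + rng.normal(scale=0.001, size=time.size)  # flat star + noise
flux[(time % 400) < 8] -= 0.01                        # a 1% dip every 400 samples

# Rolling median as the local baseline, then a 5-sigma threshold below it.
half = 50
baseline = np.array([np.median(flux[max(0, i - half):i + half + 1])
                     for i in range(flux.size)])
in_transit = flux < baseline - 5 * 0.001

print("transits caught:", np.unique(time[in_transit] // 400))
```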

February 27, 2013

School of Data

Filed under: Data,Education,Marketing,Topic Maps — Patrick Durusau @ 2:55 pm

School of Data

From their “about:”

School of Data is an online community of people who are passionate about using data to improve our understanding of the world, in particular journalists, researchers and analysts.

Our mission

Our aim is to spread data literacy through the world by offering online and offline learning opportunities. With School of Data you’ll learn how to:

  • scout out the best data sources
  • speed up and hone your data handling and analysis
  • visualise and present data creatively

Readers of this blog are very unlikely to find something they don’t know at this site.

However, readers of this blog know a great deal that doesn’t appear on this site.

Such as information on topic maps? Yes?

Something to think about.

I can’t really imagine data literacy without some awareness of subject identity issues.

Once you get to subject identity issues, semantic diversity, topic maps are just an idle thought away!

I first saw this at Nat Torkington’s Four Short Links: 26 Feb 2013.

February 26, 2013

PyData Videos

Filed under: Data,Python — Patrick Durusau @ 1:53 pm

PyData Videos

All great, but here are five (5) to illustrate the range of what awaits:

Connecting Data Science to business value, Josh Hemann.

GPU and Python, Andreas Klöckner, Ph.D.

Network X and Gephi, Gilad Lotan.

NLTK and Text Processing, Andrew Montalenti.

Wikipedia Indexing And Analysis, Didier Deshommes.

Forty-seven (47) videos in all, so my list is missing forty-two (42) other great ones!

Which ones are your favorites?

February 22, 2013

…Obtaining Original Data from Published Graphs and Plots

Filed under: Data,Data Mining,Graphs — Patrick Durusau @ 2:13 pm

A Simple Method for Obtaining Original Data from Published Graphs and Plots

From the post:

Was thinking of how to extract data points for infant age and weight distribution from a printed graph and I landed at this old paper http://www.ajronline.org/content/174/5/1241.full . it pointed me to NIH Image which reminds me of an old software i used to use for lab practicals as an undergrad .. and upon reaching the NIH Image site, Indeed! imageJ is an ‘update’ of sorts to the NIH Image software ..

The “old paper?” “A Simple Method for Obtaining Original Data from Published Graphs and Plots,” by Chris L. Sistrom and Patricia J. Mergo, American Journal of Roentgenology, May 2000 vol. 174 no. 5 1241-1244.

Update to the URI in the article: http://rsb.info.nih.gov/nih-image/ is correct. (The original URI is missing a hyphen, “-”.)

The mailing list archives don’t show much traffic for the last several years.

When you need to harvest data from published graphs/plots, what do you use?
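For quick jobs, a no-frills Python version of the paper’s idea works: calibrate pixel coordinates against two known axis positions, then threshold for dark marks. The file name, calibration numbers, and darkness threshold below are all placeholders:

```python
import numpy as np
from PIL import Image

def pix_to_data(p, p0, p1, v0, v1):
    """Linear map from pixel coordinate to data coordinate."""
    return v0 + (p - p0) * (v1 - v0) / (p1 - p0)

img = np.asarray(Image.open("published_plot.png").convert("L"))
rows, cols = np.nonzero(img < 60)            # dark pixels = plotted marks

# Calibration: pixel positions of two known values on each axis.
x = pix_to_data(cols, 50, 430, 0.0, 10.0)    # pixel col 50 -> x=0, 430 -> x=10
y = pix_to_data(rows, 380, 40, 0.0, 100.0)   # rows count downward, so 380 -> y=0

for xi, yi in list(zip(x, y))[:10]:          # first few recovered points
    print(round(xi, 2), round(yi, 2))
```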

February 17, 2013

Download all your tweets [Are You An Outlier/Drone Target?]

Filed under: Data,Tweets — Patrick Durusau @ 8:17 pm

Download all your tweets by Ajay Ohri.

From the post:

Now that the Government of the United States of America has the legal power to request your information without a warrant (The Chinese love this!)

Anyways- you can also download your own twitter data. Liberate your data.

Have you looked at your own data? Go there at https://twitter.com/settings/account and review the changes.

Modern governments invent evidence out of whole cloth, enough to topple other governments, so whether my communications are secure or not may be a moot point.

It may make a difference whether your communications stand out enough that they focus on inventing evidence about you.

In that case, having all your tweets, particularly with the tweets of others, could be a useful thing.

With enough data, a profile could be constructed so that your tweets come within ± some percentage of the normal tweets for your demographic.

I don’t ever tweet about American Idol (#idol) so I am already an outlier. 😉

Mapping the demographics to content and hash tags, along with dates, events, etc. would make for a nice graph/topic map type application.

Perhaps a deviation warning system if your tweets started to curve away from the pack.
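A crude sketch of such a warning in Python: compare your hashtag counts against a demographic baseline with cosine similarity (all numbers invented):

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    norms = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norms if norms else 0.0

baseline = Counter({"#idol": 40, "#nfl": 25, "#news": 20, "#data": 5})
mine = Counter({"#data": 30, "#topicmaps": 15, "#news": 5})

score = cosine(mine, baseline)
print("similarity to demographic:", round(score, 2))
if score < 0.5:                  # threshold is arbitrary
    print("warning: you are curving away from the pack")
```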

Hiding from data mining isn’t an option.

The question is how to hide in plain sight?

