Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 27, 2013

Trouble at the lab [Data Skepticism]

Filed under: Data,Data Quality,Skepticism — Patrick Durusau @ 4:39 pm

Trouble at the lab, Oct. 19, 2013, The Economist.

From the web page:

“I SEE a train wreck looming,” warned Daniel Kahneman, an eminent psychologist, in an open letter last year. The premonition concerned research on a phenomenon known as “priming”. Priming studies suggest that decisions can be influenced by apparently irrelevant actions or events that took place just before the cusp of choice. They have been a boom area in psychology over the past decade, and some of their insights have already made it out of the lab and into the toolkits of policy wonks keen on “nudging” the populace.

Dr Kahneman and a growing number of his colleagues fear that a lot of this priming research is poorly founded. Over the past few years various researchers have made systematic attempts to replicate some of the more widely cited priming experiments. Many of these replications have failed. In April, for instance, a paper in PLoS ONE, a journal, reported that nine separate experiments had not managed to reproduce the results of a famous study from 1998 purporting to show that thinking about a professor before taking an intelligence test leads to a higher score than imagining a football hooligan.

The idea that the same experiments always get the same results, no matter who performs them, is one of the cornerstones of science’s claim to objective truth. If a systematic campaign of replication does not lead to the same results, then either the original research is flawed (as the replicators claim) or the replications are (as many of the original researchers on priming contend). Either way, something is awry.

The numbers will make you a militant data skeptic:

  • Original results could be duplicated for only 6 out of 53 landmark studies of cancer.
  • A drug company could reproduce only a quarter of 67 “seminal studies.”
  • An NIH official estimates that at least three-quarters of published biomedical findings would be hard to reproduce.
  • Three-quarters of published papers in machine learning are bunk due to overfitting.

Those and more examples await you in this article from The Economist.

As the sub-heading for the article reads:

Scientists like to think of science as self-correcting. To an alarming degree, it is not

You may not mind misrepresenting facts to others, but do you want other people misrepresenting facts to you?

Do you have a professional data critic/skeptic on call?

October 18, 2013

…A new open Scientific Data journal

Filed under: Data,Dataset,Science — Patrick Durusau @ 12:40 pm

Publishing one’s research data : A new open Scientific Data journal

From the post:

A new journal called ‘Scientific Data’, to be launched by Nature in May 2014, has made a call for submissions. What makes this publication unique is that it is an open-access, online-only publication for descriptions of scientifically valuable datasets, which aims to foster data sharing and reuse, and ultimately to accelerate the pace of scientific discovery.

Sample publications, 1 and 2.

From the journal homepage:

Launching in May 2014 and open now for submissions, Scientific Data is a new open-access, online-only publication for descriptions of scientifically valuable datasets, initially focusing on the life, biomedical and environmental science communities

Scientific Data exists to help you publish, discover and reuse research data and is built around six key principles:

  • Credit: Credit, through a citable publication, for depositing and sharing your data
  • Reuse: Complete, curated and standardized descriptions enable the reuse of your data
  • Quality: Rigorous community-based peer review
  • Discovery: Find datasets relevant to your research
  • Open: Promotes and endorses open science principles for the use, reuse and distribution of your data, and is available to all through a Creative Commons license
  • Service: In-house curation, rapid peer-review and publication of your data descriptions

Possibly an important source of scientific data in the not so distant future.

October 1, 2013

Data Skepticism in Action

Filed under: Data — Patrick Durusau @ 8:01 pm

I want to call your attention to a headline I saw today:

Research: Big data pays off, by Teena Hammond. Summary: Tech Pro Research’s latest survey shows that 82 percent of those who have implemented big data have seen discernible benefits.

Some people will read only the summary.

That’s a bad idea, and here’s why:

First, the survey reached only 144 respondents worldwide.

Hmmm, current world population is approximately 7,182,895,100 (it will be higher by the time you check the link).

Not all of them are IT people, but does 144 sound like a high percentage of IT people to you?

Let’s see (all data from 2010):

Database Administrators: 110,800

IT Managers: 310,000

Programmers: 363,100

Systems Analysts: 544,000

Software Developers: 913,000

That’s what? More than 2 million IT people just in the United States?

And the survey reached 144 worldwide?

But if you read the pie chart carefully, only 8% of the 144 have implemented Big Data.

I am assuming you have to implement Big Data to claim to see any benefits from Big Data.

Hmmm, 8% of 144 is 11.52, so let’s round that up to 12.

Twelve people reached by the survey have implemented Big Data.

Of those twelve, 82% “report seeing at least some payoff in terms of goals achieved.”

So, 82% of 12 = 9.84 or round to 10.
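A quick back-of-the-envelope check of that arithmetic (just the figures quoted above, nothing fancier):

```python
# Back-of-the-envelope check of the survey numbers quoted above.
respondents = 144          # total survey respondents worldwide
implemented_share = 0.08   # share who say they have implemented Big Data
benefit_share = 0.82       # share of implementers reporting "some payoff"

implementers = round(respondents * implemented_share)   # 11.52 -> 12
beneficiaries = round(implementers * benefit_share)     # 9.84  -> 10

print(implementers, beneficiaries)                       # 12 10
```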

If the headline had read: Tech Pro Research’s latest survey shows that 10 people worldwide who have implemented big data have seen discernible benefits, would your reaction have been the same?

Yes? No difference? Don’t care?

If you are a Tech Pro Research member, you can get a free copy of the report that uses ten people to draw conclusions about your world.

A Tech Pro Research membership is $299/year.

If you are paying $299/year for ten-person survey results, follow my Donations link and support this blog instead.

Suggestions on other posts or reports that need a data skeptical review?

September 23, 2013

DBpedia 3.9 released…

Filed under: Data,DBpedia,RDF,Wikipedia — Patrick Durusau @ 7:08 pm

DBpedia 3.9 released, including wider infobox coverage, additional type statements, and new YAGO and Wikidata links by Christopher Sahnwaldt.

From the post:

we are happy to announce the release of DBpedia 3.9.

The most important improvements of the new release compared to DBpedia 3.8 are:

1. the new release is based on updated Wikipedia dumps dating from March / April 2013 (the 3.8 release was based on dumps from June 2012), leading to an overall increase in the number of concepts in the English edition from 3.7 to 4.0 million things.

2. the DBpedia ontology is enlarged and the number of infobox to ontology mappings has risen, leading to richer and cleaner concept descriptions.

3. we extended the DBpedia type system to also cover Wikipedia articles that do not contain an infobox.

4. we provide links pointing from DBpedia concepts to Wikidata concepts and updated the links pointing at YAGO concepts and classes, making it easier to integrate knowledge from these sources.

The English version of the DBpedia knowledge base currently describes 4.0 million things, out of which 3.22 million are classified in a consistent Ontology, including 832,000 persons, 639,000 places (including 427,000 populated places), 372,000 creative works (including 116,000 music albums, 78,000 films and 18,500 video games), 209,000 organizations (including 49,000 companies and 45,000 educational institutions), 226,000 species and 5,600 diseases.

We provide localized versions of DBpedia in 119 languages. All these versions together describe 24.9 million things, out of which 16.8 million overlap (are interlinked) with the concepts from the English DBpedia. The full DBpedia data set features labels and abstracts for 12.6 million unique things in 119 different languages; 24.6 million links to images and 27.6 million links to external web pages; 45.0 million external links into other RDF datasets, 67.0 million links to Wikipedia categories, and 41.2 million YAGO categories.

Altogether the DBpedia 3.9 release consists of 2.46 billion pieces of information (RDF triples) out of which 470 million were extracted from the English edition of Wikipedia, 1.98 billion were extracted from other language editions, and about 45 million are links to external data sets.

Detailed statistics about the DBpedia data sets in 24 popular languages are provided at Dataset Statistics.

The main changes between DBpedia 3.8 and 3.9 are described below. For additional, more detailed information please refer to the Change Log.

Almost like an early holiday present, isn’t it? 😉

I continue to puzzle over the notion of “extraction.”

Not that I have an alternative, but extracting data only kicks the data can one step down the road.

When someone wants to use my extracted data, they are going to extract data from my extraction. And so on.

That seems incredibly wasteful and error-prone.

Enough money is spent doing the ETL shuffle every year that research on ETL avoidance should be a viable proposition.
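If you want to poke at the new release yourself, here is a minimal sketch using the public DBpedia SPARQL endpoint and the SPARQLWrapper Python package (my own example, not part of the announcement; the endpoint and the counts you get back will vary over time):

```python
# Count resources typed as dbo:Person in DBpedia.
# Assumes the public endpoint at http://dbpedia.org/sparql is reachable
# and that SPARQLWrapper is installed (pip install SPARQLWrapper).
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT (COUNT(?p) AS ?persons) WHERE { ?p a dbo:Person . }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
print(results["results"]["bindings"][0]["persons"]["value"])
```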

September 22, 2013

India…1,745 datasets for agriculture

Filed under: Agriculture,Data,Open Data — Patrick Durusau @ 2:09 pm

Open Data Portal India launched: Already 1,745 datasets for agriculture

From the post:

The Government of India has launched its open Data Portal India (data.gov.in), a portal for the public to access and use datasets and applications provided by ministries and departments of the Government of India.

Aim: “To increase transparency in the functioning of Government and also open avenues for many more innovative uses of Government Data to give different perspective.” (“About portal,” data.gov.in)

The story goes on to report there are more than 4,000 data sets from over 51 offices. An adviser to the prime minister of India is hopeful there will be more than 10,000 data sets in six months.

Not quite as much fun as the IMDB, but on the other hand, the data is more likely to be of interest to business types.

September 19, 2013

Over $1 Trillion in Tax Breaks…

Filed under: Data,Government,News — Patrick Durusau @ 6:51 pm

Over $1 Trillion in Tax Breaks Are Detailed in New Report by Jessica Schieder.

From the post:

Tax breaks cost the federal government approximately $1.13 trillion in fiscal year 2013, according to a new report by the National Priorities Project (NPP). That is just slightly less than all federal discretionary spending in FY 2013 combined.

So, the headline got your attention? It certainly got mine.

But unlike many alarmist headlines (can you say CNN?), this story came with data to back up its statements.

How much data you ask?

Well, tax break data from 1974 to the present, described as:

NPP has created the first time series tax break dataset by obtaining archived budget requests, converting them to electronic format, and standardizing the categories and names over time. We’ve also added several calculations and normalizations to make these data more useful to researchers.

What you will find in this dataset:

  • Tax break names, standardized over time
  • Tax break categories, standardized over time
  • Estimated annual tax break costs (both real dollars and adjusted for inflation)
  • Annual tax break costs as a percent change from the previous year
  • Annual tax break costs as a percentage of Gross Domestic Product (GDP)
  • Annual tax break costs as a percentage of their corresponding category

The full notes and sources, including our methodology and a data dictionary, are here.

The original report: The Big Money in Tax Breaks, Report: Exposing the Big Money in Tax Breaks by Mattea Kramer. Data support by Becky Sweger and Asher Dvir-Djerassi.

Sponsored by the National Priorities Project.

Do your main sources of news distribute relevant data? To enable you to reach your own conclusions?

If not, you should start asking: why not?

September 1, 2013

Sane Data Updates Are Harder Than You Think

Filed under: Data,Data Collection,Data Quality,News — Patrick Durusau @ 6:35 pm

Sane Data Updates Are Harder Than You Think by Adrian Holovaty.

From the post:

This is the first in a series of three case studies about data-parsing problems from a journalist’s perspective. This will be meaty, this will be hairy, this will be firmly in the weeds.

We’re in the middle of an open-data renaissance. It’s easier than ever for somebody with basic tech skills to find a bunch of government data, explore it, combine it with other sources, and republish it. See, for instance, the City of Chicago Data Portal, which has hundreds of data sets available for immediate download.

But the simplicity can be deceptive. Sure, the mechanics of getting data are easy, but once you start working with it, you’ll likely face a variety of rather subtle problems revolving around data correctness, completeness, and freshness.

Here I’ll examine some of the most deceptively simple problems you may face, based on my eight years’ experience dealing with government data in a journalistic setting —most recently as founder of EveryBlock, and before that as creator of chicagocrime.org and web developer at washingtonpost.com. EveryBlock, which was shut down by its parent company NBC News in February 2013, was a site that gathered and sorted dozens of civic data sets geographically. It gave you a “news feed for your block”—a frequently updated feed of news and discussions relevant to your home address. In building this huge public-data-parsing machine, we dealt with many different data situations and problems, from a wide variety of sources.

My goal here is to raise your awareness of various problems that may not be immediately obvious and give you reasonable solutions. My first theme in this series is getting new or changed records.

A great introduction to deep problems that are lurking just below the surface of any available data set.

Not only do data sets change but reactions to and criticisms of data sets change.

What would you offer as an example of “stable” data?

I tried to think of one for this post and came up empty.

You could claim the text of the King James Bible is “stable” data.

But only from a very narrow point of view.

The printed text is stable, but the opinions, criticisms, and commentaries on the King James Bible have been anything but.

Imagine that you have a stock price ticker application and all it reports are the current prices for some stock X.

Is that sufficient, or would it be more useful if it also reported the change in price over the last four hours as a percentage?
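A minimal sketch of the second option, assuming a list of (timestamp, price) ticks as input (hypothetical data, just to make the point concrete):

```python
from datetime import datetime, timedelta

# Hypothetical ticks for stock X: (timestamp, price).
ticks = [
    (datetime(2013, 9, 1, 10, 0), 100.0),
    (datetime(2013, 9, 1, 12, 0), 103.5),
    (datetime(2013, 9, 1, 14, 0),  99.0),
]

def price_with_change(ticks, window=timedelta(hours=4)):
    """Return the current price and its percentage change over the trailing window."""
    now, current = ticks[-1]
    # Earliest tick that still falls inside the window.
    base = next(price for ts, price in ticks if now - ts <= window)
    return current, 100.0 * (current - base) / base

price, change = price_with_change(ticks)
print(f"{price:.2f} ({change:+.1f}% over the last 4 hours)")   # 99.00 (-1.0% ...)
```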

Perhaps we need a modern data Heraclitus to proclaim:

“No one ever reads the same data twice”

August 28, 2013

CORE

Filed under: Data,Open Access — Patrick Durusau @ 2:28 pm

CORE

From the about:

CORE (COnnecting REpositories) aims to facilitate free access to scholarly publications distributed across many systems. As of today, CORE gives you access to millions of scholarly articles aggregated from many Open Access repositories.

We believe in free access to information. The mission of CORE is to:

  • Support the right of citizens and general public to access the results of research towards which they contributed by paying taxes.
  • Facilitate access to Open Access content for all by targeting general public, software developers, researchers, etc., by improving search and navigation using state-of-the-art technologies in the field of natural language processing and the Semantic Web.
  • Provide support to both content consumers and content providers by working with digital libraries and institutional repositories.
  • Contribute to a cultural change by promoting Open Access.

BTW, CORE also allows you to harvest their data.

As of today, August 28, 2013, 13,639,485 articles.

Excellent resource for scholarly publications!

Not to mention a useful yardstick for other publication indexing projects.

What does your indexing project offer that CORE does not?

That is, rather than duplicating indexing we already possess, where is the value-add of your indexing?

August 24, 2013

Citing data (without tearing your hair out)

Filed under: Citation Practices,Data — Patrick Durusau @ 7:00 pm

Citing data (without tearing your hair out) by Bonnie Swoger

From the post:

The changing nature of how and where scientists share raw data has sparked a growing need for guidelines on how to cite these increasingly available datasets.

Scientists are producing more data than ever before due to the (relative) ease of collecting and storing this data. Often, scientists are collecting more than they can analyze. Instead of allowing this un-analyzed data to die when the hard drive crashes, they are releasing the data in its raw form as a dataset. As a result, datasets are increasingly available as separate, stand-alone packages. In the past, any data available for other scientists to use would have been associated with some other kind of publication – printed as table in a journal article, included as an image in a book, etc. – and cited as such.

Now that we can find datasets “living on their own,” scientists need to be able to cite these sources.

Unfortunately, the traditional citation manuals do a poor job of helping a scientist figure out what elements to include in the reference list, either ignoring data or over-complicating things.

If you are building a topic map that relies upon data sets you didn’t create, get ready to cite data sets.

Citations, assuming they are correct, can give your users confidence in the data you present.

Bonnie does a good job providing basic rules that you should follow when citing data.

You can always do more than she suggests but you should never do any less.

August 14, 2013

Intrinsic vs. Extrinsic Structure

Filed under: Data,Data Structures — Patrick Durusau @ 2:50 pm

Intrinsic vs. Extrinsic Structure by Jesse Johnson.

From the post:

At this point, I think it will be useful to introduce an idea from geometry that is very helpful in pure mathematics, and that I find helpful for understanding the geometry of data sets. This idea is the difference between the intrinsic structure of an object (such as a data set) and its extrinsic structure. Have you ever gone into a building, walked down a number of different halls and through different rooms, and when you finally got to where you’re going and looked out the window, you realized that you had no idea which direction you were facing, or which side of the building you were actually on? The intrinsic structure of a building has to do with how the rooms, halls and staircases connect up to each other. The extrinsic structure is how these rooms, halls and staircases sit with respect to the outside world. So, while you’re inside the building you may be very aware of the intrinsic structure, but completely lose track of the extrinsic structure.

You can see a similar distinction with subway maps, such as the famous London tube map. This map records how the different tube stops connect to each other, but it distorts how the stops sit within the city. In other words, the coordinates on the tube map do not represent the physical/GPS coordinates of the different stops. But while you’re riding a subway, the physical coordinates of the different stops are much less important than the inter-connectivity of the stations. In other words, the intrinsic structure of the subway is more important (while you’re riding it) than the extrinsic structure. On the other hand, if you were walking through a city, you would be more interested in the extrinsic structure of the city since, for example, that would tell you the distance in miles (or kilometers) between you and your destination.

Data sets also have both intrinsic and extrinsic structure, though there isn’t a sharp line between where the intrinsic structure ends and the extrinsic structure begins. These are more intuitive terms than precise definitions. In the figure below, which shows three two-dimensional data sets, the set on the left has an intrinsic structure very similar to that of the middle data set: Both have two blobs of data points connected by a narrow neck of data points. However, in the data set on the left the narrow neck forms a roughly straight line. In the center, the tube curves around, so that the entire set roughly follows a circle.
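As a rough illustration of that last paragraph (a sketch only; it does not reproduce Jesse’s actual figure), the two data sets below share the same intrinsic structure, two blobs joined by a narrow neck, while the neck is straight in one and curved in the other:

```python
import numpy as np

rng = np.random.default_rng(0)

def blobs_with_neck(curved=False, n=200):
    """Two point clouds joined by a narrow neck, either straight or curved."""
    blob1 = rng.normal([0, 0], 0.3, size=(n, 2))
    blob2 = rng.normal([0, 4] if curved else [4, 0], 0.3, size=(n, 2))
    t = rng.uniform(0, 1, size=n)
    if curved:
        # Neck follows a half circle from (0, 0) around to (0, 4).
        neck = np.column_stack([2 * np.sin(np.pi * t), 2 - 2 * np.cos(np.pi * t)])
    else:
        # Neck follows a straight line from (0, 0) to (4, 0).
        neck = np.column_stack([4 * t, np.zeros(n)])
    neck += rng.normal(0, 0.05, size=neck.shape)
    return np.vstack([blob1, neck, blob2])

straight = blobs_with_neck(curved=False)   # intrinsically similar to...
curved = blobs_with_neck(curved=True)      # ...this one, extrinsically different
```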

I am looking forward to this series of posts from Jesse.

Bearing in mind that the structure of a data set is impacted by collection goals, methods, and other factors.

Matters that are not (usually) represented in the data set per se.

August 12, 2013

Greek New Testament (with syntax trees)

Filed under: Bible,Data — Patrick Durusau @ 4:07 pm

Greek New Testament (with syntax trees)

If you are tired of the same old practice data sets, I may have a treat for you!

The Asia Bible Society has produced syntax trees for the New Testament, using the SBL Greek New Testament text.

To give you an idea of the granularity of the data, the first sentence in Matthew is spread over forty-nine (49) lines of markup.

Not big data in the usual sense but important data.

August 9, 2013

60+ R resources to improve your data skills

Filed under: Data,R — Patrick Durusau @ 1:12 pm

60+ R resources to improve your data skills by Sharon Machlis.

Great collection of R resources! Some you will know, others are likely to be new to you.

Definitely worth the time to take a quick scan of Sharon’s listing.

I first saw this in Vincent Granville’s 60+ R resources.

August 3, 2013

Unpivoting Data with Excel, Open Refine and Python

Filed under: Data,Excel,Google Refine,Python — Patrick Durusau @ 4:09 pm

Unpivoting Data with Excel, Open Refine and Python by Tariq Khokhar.

From the post:

“How can I unpivot or transpose my tabular data so that there’s only one record per row?”

I see this question a lot and I thought it was worth a quick Friday blog post.

Data often aren’t quite in the format that you want. We usually provide CSV / XLS access to our data in “pivoted” or “normalized” form so they look like this:

Manipulating data is at least as crucial a skill for authoring a topic map as being able to model data.

Here are some quick tips for your toolkit.
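For a concrete feel of the unpivot step itself, here is a minimal sketch using pandas (my own toy table, not the data in Tariq’s post):

```python
import pandas as pd

# Made-up "pivoted" table: one row per country, one column per year.
wide = pd.DataFrame({
    "country": ["Aruba", "Andorra"],
    "2010": [24.5, 10.1],
    "2011": [25.1, 10.4],
})

# Unpivot so there is exactly one (country, year, value) record per row.
long = pd.melt(wide, id_vars="country", var_name="year", value_name="value")
print(long)
#    country  year  value
# 0    Aruba  2010   24.5
# 1  Andorra  2010   10.1
# 2    Aruba  2011   25.1
# 3  Andorra  2011   10.4
```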

July 30, 2013

Large File/Data Tools

Filed under: BigData,Data,Data Mining — Patrick Durusau @ 3:12 pm

Essential tools for manipulating big data files by Daniel Rubio.

From the post:

You can leverage several tools that are commonly used to manipulate big data files, which include: Regular expressions, sed, awk, WYSIWYG editors (e.g. Emacs, vi and others), scripting languages (e.g. Bash, Perl, Python and others), parsers (e.g. Expat, DOM, SAX and others), compression utilities (e.g. zip, tar, bzip2 and others) and miscellaneous Unix/Linux utilities (e.g. split, wc, sort, grep)

And,

10 Awesome Examples for Viewing Huge Log Files in Unix by Ramesh Natarajan.

Viewing huge log files for troubleshooting is a mundane routine task for sysadmins and programmers.

In this article, let us review how to effectively view and manipulate huge log files using 10 awesome examples.

The two articles cover the same topic but with very little overlap (only grep, as far as I can determine).
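If Python is already in your toolkit, the same spirit looks something like this (hypothetical file name and pattern), streaming the file rather than loading it into memory:

```python
import re

PATTERN = re.compile(r"ERROR|Timeout")   # hypothetical pattern of interest
LOG_FILE = "huge.log"                    # hypothetical multi-gigabyte log

matches = 0
with open(LOG_FILE, errors="replace") as log:
    for line_number, line in enumerate(log, start=1):   # streams line by line
        if PATTERN.search(line):
            matches += 1
            if matches <= 10:                            # show only the first few hits
                print(f"{line_number}: {line.rstrip()}")

print(f"{matches} matching lines")
```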

Are there other compilations of “tools” that would be handy for large data files?

July 23, 2013

Sites and Services for Accessing Data

Filed under: Data,Open Data — Patrick Durusau @ 2:21 pm

Sites and Services for Accessing Data by Andy Kirk.

From the site:

This collection presents the key sites that provide data, whether through curated collections, offering access under the Open Data movement or through Software/Data-as-a-Service platforms. Please note, I may not have personally used all the services, sites or tools presented but have seen sufficient evidence of their value from other sources. Also, to avoid re-inventing the wheel, descriptive text may have been reproduced from the native websites for many resources.

You will see there is clearly a certain bias towards US and UK based sites and services. This is simply because they are the most visible, most talked about, most shared and/or useful resources on my radar. I will keep updating this site to include as many other finds and suggestions as possible, extending (ideally) around the world.

I count ninety-nine (99) resources.

A well-organized listing, but like many other listings, you have to explore each resource to discover its contents.

A mapping of resources across collections would be far more useful.

July 20, 2013

11 Billion Clues in 800 Million Documents:…

Filed under: Data,Freebase,Precision,Recall — Patrick Durusau @ 2:16 pm

11 Billion Clues in 800 Million Documents: A Web Research Corpus Annotated with Freebase Concepts by Dave Orr, Amar Subramanya, Evgeniy Gabrilovich, and Michael Ringgaard.

From the post:

When you type in a search query — perhaps Plato — are you interested in the string of letters you typed? Or the concept or entity represented by that string? But knowing that the string represents something real and meaningful only gets you so far in computational linguistics or information retrieval — you have to know what the string actually refers to. The Knowledge Graph and Freebase are databases of things, not strings, and references to them let you operate in the realm of concepts and entities rather than strings and n-grams.

We’ve previously released data to help with disambiguation and recently awarded $1.2M in research grants to work on related problems. Today we’re taking another step: releasing data consisting of nearly 800 million documents automatically annotated with over 11 billion references to Freebase entities.

These Freebase Annotations of the ClueWeb Corpora (FACC) consist of ClueWeb09 FACC and ClueWeb12 FACC. 11 billion phrases that refer to concepts and entities in Freebase were automatically labeled with their unique identifiers (Freebase MID’s). …

(…)

Based on review of a sample of documents, we believe the precision is about 80-85%, and recall, which is inherently difficult to measure in situations like this, is in the range of 70-85%….

(…)

Evaluate precision and recall by asking:

Your GPS gives you relevant directions, on average, eight (8) times out of ten (10), and it finds relevant locations, on average, seven (7) times out of ten (10). (Wikipedia on Precision and Recall)

Is that a good GPS?

A useful data set but still a continuation of the approach of guessing what authors meant when they authored documents.

What if, by some yet unknown technique, precision goes to nine (9) out of ten (10) and recall goes to nine (9) out of ten (10) as well?

The GPS question becomes:

Your GPS gives you relevant directions, on average, nine (9) times out of ten (10), and it finds relevant locations, on average, nine (9) times out of ten (10).

Is that a good GPS?

Not that any automated technique has shown that level of performance.
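For reference, the two measures behind the GPS analogy are computed like this (a generic sketch, not tied to the FACC evaluation):

```python
def precision(true_positives, false_positives):
    """Of the answers given, how many were relevant."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives, false_negatives):
    """Of the relevant answers that exist, how many were found."""
    return true_positives / (true_positives + false_negatives)

# Example: 80 correct annotations, 20 wrong ones, 30 relevant items missed.
print(precision(80, 20))   # 0.8   -> directions right 8 times out of 10
print(recall(80, 30))      # ~0.73 -> locations found roughly 7 times out of 10
```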

Rather than focusing on data post-authoring, why not enable authors to declare their semantics?

Author declared semantics would reduce the cost and uncertainty of post-authoring semantic solutions.

I first saw this in a tweet by Nicolas Torzec.

July 16, 2013

Abbot MorphAdorner collaboration

Filed under: Data,Language — Patrick Durusau @ 9:12 am

Abbot MorphAdorner collaboration

From the webpage:

The Center for Digital Research in the Humanities at the University of Nebraska and Northwestern University’s Academic and Research Technologies are pleased to announce the first fruits of a collaboration between the Abbot and EEBO-MorphAdorner projects: the release of some 2,000 18th century texts from the TCP-ECCO collections in a TEI-P5 format and with linguistic annotation. More texts will follow shortly, subject to the access restrictions that will govern the use of TCP texts for the remainder of this decade.

The Text Creation Partnership (TCP) collection currently consists of about 50,000 fully transcribed SGML texts from the first three centuries of English print culture. The collection will grow to approximately 75,000 volumes and will contain at least one copy of every book published before 1700 as well as substantial samples of 18th century texts published in the British Isles or North America. The ECCO-TCP texts are already in the public domain. The other texts will follow them between 2014 and 2015. The Evans texts will be released in June 2014, followed by a release of some 25,000 EEBO texts in 2015.

It is a major goal of the Abbot and EEBO MorphAdorner collaboration to turn the TCP texts into the foundation for a “Book of English,” defined as

  • a large, growing, collaboratively curated, and public domain corpus
  • of written English since its earliest modern form
  • with full bibliographical detail
  • and light but consistent structural and linguistic annotation

Texts in the annotated TCP corpus will exist in more than one format so as to facilitate different uses to which they are likely to be put. In a first step, Abbot transforms the SGML source text into a TEI P5 XML format. Abbot, a software program designed by Brian Pytlik Zillig and Stephen Ramsay, can read arbitrary XML files and convert them into other XML formats or a shared format. Abbot generates its own set of conversion routines at runtime by reading an XML schema file and programmatically effecting the desired transformations. It is an excellent tool for creating an environment in which texts originating in separate projects can acquire a higher degree of interoperability. A prototype of Abbot was used in the MONK project to harmonize texts from several collections, including the TCP, Chadwyck-Healey’s Nineteenth-Century Fiction, the Wright Archive of American novels 1851-1875, and Documenting the American South.

This first transformation maintains all the typographical data recorded in the original SGML transcription, including long ‘s’, printer’s abbreviations, superscripts etc. In a second step MorphAdorner tokenizes this file. MorphAdorner was developed by Philip R. Burns. It is a multi-purpose suite of NLP tools with special features for the tokenization, analysis, and annotation of historical corpora. The tokenization uses algorithms and heuristics specific to the practices of Early Modern print culture, wraps every word token in a <w> element with a unique ID, and explicitly marks sentence boundaries.

In the next step (conceptually different but merged in practice with the previous), some typographical features are removed from the tokenized text, but all such changes are recorded in a change log and may therefore be reversed. The changes aim at making it easier to manipulate the corpus with software tools that presuppose modern printing practices. They involve such things as replacing long ‘s’ with plain ‘s’, or resolving unambiguous printer’s abbreviations and superscripts.

Talk about researching across language as it changes!

This is way cool!

Lots of opportunities for topic map-based applications.
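To make the tokenization step concrete, here is a toy sketch of the idea of wrapping every word token in a <w> element with a unique ID (my own illustration, not MorphAdorner’s actual output format):

```python
import xml.etree.ElementTree as ET

def tokenize_to_tei(text, sentence_id="s1"):
    """Wrap each word token in a <w> element with a unique ID (toy version)."""
    sentence = ET.Element("s", attrib={"xml:id": sentence_id})
    for i, token in enumerate(text.split(), start=1):
        w = ET.SubElement(sentence, "w", attrib={"xml:id": f"{sentence_id}-w{i}"})
        w.text = token
    return ET.tostring(sentence, encoding="unicode")

print(tokenize_to_tei("Loue the Lord thy God"))
# <s xml:id="s1"><w xml:id="s1-w1">Loue</w><w xml:id="s1-w2">the</w>...</s>
```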

For more information:

Abbot Text Interoperability Tool

Download texts here

July 8, 2013

Online College Courses (Academic Earth)

Filed under: Data,Education — Patrick Durusau @ 4:04 pm

Online College Courses (Academic Earth)

No new material, but a useful aggregation of online course materials at forty-nine (49) institutions (as of today).

Not that hard to imagine topic map-based value-add services that link up materials and discussions with course materials.

Most courses are offered on a regular cycle and knowing what has helped before may be useful to you.

July 7, 2013

import.io

Filed under: Data,ETL,WWW — Patrick Durusau @ 4:18 pm

import.io

The steps listed by import.io on its “How it works” page:

Find: Find an online source for your data, whether it’s a single web page or a search engine within a site. Import•io doesn’t discriminate; it works with any web source.

Extract: When you have identified the data you want, you can begin to extract it. The first stage is to highlight the data that you want. You can do this by giving us a few examples and our algorithms will identify the rest. The next stage is to organise your data. This is as simple as creating columns to sort parts of the data into, much like you would do in a spreadsheet. Once you have done that we will extract the data into rows and columns.

If you want to use the data once, or infrequently, you can stop here. However, if you would like a live connection to the data or want to be able to access it programmatically, the next step will create a real-time connection to the data.

Connect: This stage will allow you to create a real-time connection to the data. First you have to record how you obtained the data you extracted. Second, give us a couple of test cases so we can ensure that, if the website changes, your connection to the data will remain live.

Mix: One of the most powerful features of the platform is the ability to mix data from many sources to form a single data set. This allows you to create incredibly rich data sets by combining hundreds of underlying data points from many different websites and access them via the application or API as a single source. Mixing is as easy as clicking the sources you want to mix together and saving that mix as a new real-time data set.

Use: Simply copy your data into your favourite spreadsheet software or use our APIs to access it in an application.

Developer preview but interesting for a couple of reasons.

First, simply as an import service. I haven’t tried it (yet) so your mileage may vary. Reports welcome.

Second, I like the (presented) ease of use approach.

Imagine a topic map application for some specific domain that was as matter-of-fact as what I quote above.

Something to think about.

July 2, 2013

A resource for fully calibrated NASA data

Filed under: Astroinformatics,Data — Patrick Durusau @ 12:46 pm

A resource for fully calibrated NASA data by Scott Fleming, an astronomer at Space Telescope Science Institute.

From the post:

The Mikulski Archive for Space Telescopes (MAST) maintains, among other things, a database of fully calibrated, community-contributed spectra, catalogs, models, and images from UV and optical NASA missions. These High Level Science Products (HLSPs) range from individual objects to wide-field surveys from MAST missions such as Hubble, Kepler, GALEX, FUSE, and Swift UVOT. Some well-known surveys archived as HLSPs include CANDELS, CLASH, GEMS, GOODS, PHAT, the Hubble Ultra Deep Fields, the ACS Survey of Galactic Globular Clusters. (Acronym help here: DOOFAS). And it’s not just Hubble projects: we have HLSPs from GALEX, FUSE, and IUE, to name a few, and some of the HLSPs include data from other missions or ground-based observations. A complete listing can be found on our HLSP main page.

How do I navigate the HLSP webpages?

Each HLSP has a webpage that, in most cases, includes a description of the project, relevant documentation, and previews of data. For example, the GOODS HLSP page has links to the current calibrated and mosaiced FITS data files, the multi-band source catalog, a Cutout Tool for use with images, a Browse page where you can view multi-color, drizzled images, and a collection of external links related to the GOODS survey.

You can search many HLSPs based on target name or coordinates. If you’ve ever used the MAST search forms to access HST, Kepler, or GALEX data, this will look familiar. The search form is great for checking whether your favorite object is part of a MAST HLSP. You can also upload a list of objects through the “File Upload Form” link if you want to check multiple targets. You may also access several of the Hubble-based HLSPs through the Hubble Legacy Archive (HLA). Click on “advanced search” in green, then in the “Proposal ID” field, enter the name of the HLSP product to search for, e.g., “CLASH”. A complete list of HLSPs available through the HLA may be found here where you can also click on the links in the Project Name column to initiate a search within that HLSP.

(…)

More details follow on how to contribute your data.

I suggest following @MAST_News for updates on data and software!

June 28, 2013

Pricing Dirty Data

Filed under: Data,Data Quality — Patrick Durusau @ 3:00 pm

Putting a Price on the Value of Poor Quality Data by Dylan Jones.

From the post:

When you start out learning about data quality management, you invariably have to get your head around the cost impact of bad data.

One of the most common scenarios is the mail order catalogue business case. If you have a 5% conversion rate on your catalogue orders and the average order price is £20 – and if you have 100,000 customer contacts – then you know that with perfect-quality data you should be netting about £100,000 per mail campaign.

However, we all know that data is never perfect. So if 20% of your data is inaccurate or incomplete and the catalogue cannot be delivered, then you’ll only make £80,000.

I always see the mail order scenario as the entry-level data quality business case as it’s common throughout textbooks, but there is another case I prefer: that of customer churn, which I think is even more compelling.

(…)
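A quick sketch of the mail order arithmetic Dylan describes, using the figures quoted above:

```python
contacts = 100_000        # customer contacts in the mailing
conversion_rate = 0.05    # 5% of delivered catalogues convert to an order
average_order = 20.0      # average order value in pounds

def revenue(bad_data_rate):
    """Expected revenue when a share of the records is undeliverable."""
    deliverable = contacts * (1 - bad_data_rate)
    return deliverable * conversion_rate * average_order

print(revenue(0.0))    # 100000.0 -> perfect data
print(revenue(0.20))   #  80000.0 -> 20% bad records cost £20,000 per campaign
```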

The absence of the impact of dirty data as a line item in the budget makes it difficult to argue for better data.

Dylan finds a way to relate dirty data to something of concern to every commercial enterprise, customers.

How much customers spend and how long they are retained can be translated into line items (negative ones) in the budget.

Suggestions on how to measure the impact of a topic maps-based solution for delivery of information to customers?

May 30, 2013

Distributing the Edit History of Wikipedia Infoboxes

Filed under: Data,Dataset,Wikipedia — Patrick Durusau @ 7:44 pm

Distributing the Edit History of Wikipedia Infoboxes by Enrique Alfonseca.

From the post:

Aside from its value as a general-purpose encyclopedia, Wikipedia is also one of the most widely used resources to acquire, either automatically or semi-automatically, knowledge bases of structured data. Much research has been devoted to automatically building disambiguation resources, parallel corpora and structured knowledge from Wikipedia. Still, most of those projects have been based on single snapshots of Wikipedia, extracting the attribute values that were valid at a particular point in time. So about a year ago we compiled and released a data set that allows researchers to see how data attributes can change over time.

(…)

For this reason, we released, in collaboration with Wikimedia Deutschland e.V., a resource containing all the edit history of infoboxes in Wikipedia pages. While this was already available indirectly in Wikimedia’s full history dumps, the smaller size of the released dataset will make it easier to download and process this data. The released dataset contains 38,979,871 infobox attribute updates for 1,845,172 different entities, and it is available for download both from Google and from Wikimedia Deutschland’s Toolserver page. A description of the dataset can be found in our paper WHAD: Wikipedia Historical Attributes Data, accepted for publication at the Language Resources and Evaluation journal.

How much data do you need beyond the infoboxes of Wikipedia?

And knowing what the values were in the past … isn’t that like knowing prior identifiers for subjects?

re3data.org

Filed under: Data,Dataset,Semantic Diversity — Patrick Durusau @ 1:12 pm

re3data.org

From the post:

An increasing number of universities and research organisations are starting to build research data repositories to allow permanent access in a trustworthy environment to data sets resulting from research at their institutions. Due to varying disciplinary requirements, the landscape of research data repositories is very heterogeneous. This makes it difficult for researchers, funding bodies, publishers, and scholarly institutions to select an appropriate repository for storage of research data or to search for data.

The re3data.org registry allows the easy identification of appropriate research data repositories, both for data producers and users. The registry covers research data repositories from all academic disciplines. Information icons display the principal attributes of a repository, allowing users to identify the functionalities and qualities of a data repository. These attributes can be used for multi-faceted searches, for instance to find a repository for geoscience data using a Creative Commons licence.

By April 2013, 338 research data repositories were indexed in re3data.org. 171 of these are described by a comprehensive vocabulary, which was developed by involving the data repository community (http://doi.org/kv3).

The re3data.org search can be found at: http://www.re3data.org
The information icons are explained at: http://www.re3data.org/faq

Does this sound like any of these?

DataOne

The Dataverse Network Project

IOGDS: International Open Government Dataset Search

PivotPaths: a Fluid Exploration of Interlinked Information Collections

Quandl [> 2 million financial/economic datasets]

Just to name five (5) that came to mind right off hand?

Addressing the heterogeneous nature of data repositories by creating another, semantically different data repository, seems like a non-solution to me.

What would be useful would be to create a mapping of this “new” classification, which I assume works for some group of users, against the existing classifications.

That would allow users of the “new” classification to access data in existing repositories, without having to learn their classification systems.
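To make that concrete, a toy sketch of such a mapping (the labels below are invented, not taken from re3data.org or any other registry):

```python
# Invented labels: map one registry's subject terms onto another's.
new_to_existing = {
    "Geosciences": "Earth Science",
    "Life Sciences": "Biology",
}

def translate(subject, mapping):
    """Look up a subject from the 'new' classification in an existing one."""
    return mapping.get(subject, subject)   # fall back to the original label

print(translate("Geosciences", new_to_existing))   # Earth Science
print(translate("Chemistry", new_to_existing))     # Chemistry (no mapping yet)
```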

The heterogeneous nature of information is never vanquished but we can incorporate it into our systems.

May 25, 2013

Semantics as Data

Filed under: Data,Semantics — Patrick Durusau @ 4:28 pm

Semantics as Data by Oliver Kennedy.

From the post:

Something I’ve been getting drawn to more and more is the idea of computation as data.

This is one of the core precepts in PL and computation: any sort of computation can be encoded as data. Yet, this doesn’t fully capture the essence of what I’ve been seeing. Sure you can encode computation as data, but then what do you do with it? How do you make use of the fact that semantics can be encoded?

Let’s take this question from another perspective. In Databases, we’re used to imposing semantics on data. Data has meaning because we chose to give it meaning. The number 100,000 is meaningless, until I tell you that it’s the average salary of an employee at BigCorporateCo. Nevertheless, we can still ask questions in the abstract. Whatever semantics you use, 100,000 < 120,000. We can create abstractions (query languages) that allow us to ask questions about data, regardless of their semantics.

By comparison, an encoded computation carries its own semantics. This makes it harder to analyze, as the nature of those semantics is limited only by the type of encoding used to store the computation. But this doesn’t stop us from asking questions about the computation.

The Computation’s Effects

The simplest thing we can do is to ask a question about what it will compute. These questions span the range from the trivial to the typically intractable. For example, we can ask about…

  • … what the computation will produce given a specific input, or a specific set of inputs.
  • … what inputs will produce a given (range of) output(s).
  • … whether a particular output is possible.
  • … whether two computations are equivalent.

One particularly fun example in this space is Oracle’s Expression type [1]. An Expression stores (as a datatype) an arbitrary boolean expression with variables. The result of evaluating this expression on a given valuation of the variables can be injected into the WHERE clause of any SELECT statement. Notably, Expression objects can be indexed based on variable valuations. Given 3 such expressions: (A = 3), (A = 5), (A = 7), we can build an index to identify which expressions are satisfied for a particular valuation of A.

I find this beyond cool. Not only can Expression objects themselves be queried, it’s actually possible to build index structures to accelerate those queries.

Those familiar with probabilistic databases will note some convenient parallels between the expression type and Condition Columns used in C-Tables. Indeed, the concepts are almost identical. A C-Table encodes the semantics of the queries that went into its construction. When we compute a confidence in a C-Table (or row), what we’re effectively asking about is the fraction of the input space that the C-Table (row) produces an output for.
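A toy sketch of the idea (plain Python, nothing like Oracle’s actual Expression type or its index structures): store predicates as data, then ask which of them a given valuation satisfies.

```python
# Store boolean expressions as data (here: simple equality predicates on A).
expressions = {
    "e1": lambda A: A == 3,
    "e2": lambda A: A == 5,
    "e3": lambda A: A == 7,
}

def satisfied_by(valuation):
    """Return the names of the stored expressions satisfied by this valuation."""
    return [name for name, expr in expressions.items() if expr(**valuation)]

print(satisfied_by({"A": 5}))   # ['e2']
print(satisfied_by({"A": 4}))   # []
```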

At every level of semantics there is semantic diversity.

Whether it is code or data, there are levels of semantics, each with semantic diversity.

You don’t have to resolve all semantic diversity, just enough to give you an advantage over others.

May 19, 2013

UNESCO Publications and Data (Open Access)

Filed under: Data,Government,Government Data — Patrick Durusau @ 8:49 am

UNESCO to make its publications available free of charge as part of a new Open Access policy

From the post:

The United Nations Education Scientific and Cultural Organisation (UNESCO) has announced that it is making available to the public free of charge its digital publications and data. This comes after UNESCO has adopted an Open Access Policy, becoming the first agency within the United Nations to do so.

The new policy implies that anyone can freely download, translate, adapt, and distribute UNESCO’s publications and data. The policy also states that from July 2013, hundreds of downloadable digital UNESCO publications will be available to users through a new Open Access Repository with a multilingual interface. The policy also seeks to apply retroactively to works that have already been published.

There’s a treasure trove of information for mapping, say, against the New York Times historical archives.

If presidential libraries weren’t concerned with helping former administration officials avoid accountability, digitizing presidential libraries for complete access, would be another great treasure trove.

May 4, 2013

Bad Data Report

Filed under: Bibliography,Citation Practices,Data — Patrick Durusau @ 2:40 pm

The accuracy of references in PhD theses: a case study by Fereydoon Azadeh and Reyhaneh Vaez.

Abstract:

Background

Inaccurate references and citations cause confusion, distrust in the accuracy of a report, waste of time and unnecessary financial charges for libraries, information centres and researchers.

Objectives

The aim of the study was to establish the accuracy of article references in PhD theses from the Tehran and Tabriz Universities of Medical Sciences and their compliance with the Vancouver style.

Methods

We analysed 357 article references in the Tehran and 347 in the Tabriz. Six bibliographic elements were assessed: authors’ names, article title, journal title, publication year, volume and page range. Referencing errors were divided into major and minor.

Results

Sixty two percent of references in the Tehran and 53% of those in the Tabriz were erroneous. In total, 164 references in the Tehran and 136 in the Tabriz were complete without error. Of 357 reference articles in the Tehran, 34 (9.8%) were in complete accordance with the Vancouver style, compared with none in the Tabriz. Accuracy of referencing did not differ significantly between the two groups, but compliance with the Vancouver style was significantly better in the Tehran.

Conclusions

The accuracy of referencing was not satisfactory in both groups, and students need to gain adequate instruction in appropriate referencing methods.

Now that’s bad data!

I have noticed errors in CS paper citations, but not at rates as high as those reported here.

The ACM Digital Library could report, for a given paper or conference, the number of unknown citations, with a list for checking.

May 2, 2013

Create and Manage Data: Training Resources

Filed under: Archives,Data,Preservation — Patrick Durusau @ 2:07 pm

Create and Manage Data: Training Resources

From the webpage:

Our Managing and Sharing Data: Training Resources present a suite of flexible training materials for people who are charged with training researchers and research support staff in how to look after research data.

The Training Resources complement the UK Data Archive’s popular guide on ‘Managing and Sharing Data: best practice for researchers’, the most recent version published in May 2011.

They  have been designed and used as part of the Archive’s daily work in supporting ESRC applicants and award holders and have been made possible by a grant from the ESRC Researcher Development Initiative (RDI).

The Training Resources are modularised following the UK Data Archive’s seven key areas of managing and sharing data:

  • sharing data – why and how
  • data management planning for researchers and research centres
  • documenting data
  • formatting data
  • storing data, including data security, data transfer, encryption, and file sharing
  • ethics and consent
  • data copyright

Each section contains:

  • introductory powerpoint(s)
  • presenter’s guide – where necessary
  • exercises and introduction to exercises
  • quizzes
  • answers

The materials are presented as used in our own training courses  and are mostly geared towards social scientists. We anticipate trainers will create their own personalised and more context-relevant example, for example by discipline, country, relevant laws and regulations.

You can download individual modules from the relevant sections or download the whole resource in pdf format. Updates to pages were last made on 20 June 2012.

Download all resources.

Quite an impressive set of materials that will introduce you to some aspects of research data in the UK. Not all but some aspects.

What you don’t learn here you will pickup from interaction with people actively engaged with research data.

But it will give you a head start on understanding the research data community.

Unlike some technologies, topic maps are more about a community’s world view than the world view of topic maps.

April 30, 2013

The Dataverse Network Project

Filed under: Data,Dataverse Network — Patrick Durusau @ 1:48 pm

The Dataverse Network Project sponsored by the Institute for Quantitative Social Science, Harvard University.

Described on its homepage:

A repository for research data that takes care of long term preservation and good archival practices, while researchers can share, keep control of and get recognition for their data.

Dataverses currently in operation:

One shortfall I hope is corrected quickly is the lack of searching across instances of the Dataverse software.

For example, if I go to UC Davis and choose the Center for Poverty Research dataverse, I can find: “The Research Supplemental Poverty Measure Public Use Research Files” by Kathleen Short (a study).

But, if I search at the Harvard Dataverse Advanced Search by “Kathleen Short,” or “The Research Supplemental Poverty Measure Public Use Research Files,” I get no results.

An isolated dataverse is more of a data island than a dataverse.

We have lots of experience with data islands. It’s time for something different.

PS: Semantic integration issues need to be addressed as well.

Harvard Dataverse Network

Filed under: Data,Dataverse Network — Patrick Durusau @ 1:23 pm

Harvard Dataverse Network

From the webpage:

The Harvard Dataverse Network is open to all scientific data from all disciplines worldwide. It includes the world’s largest collection of social science research data. If you would like to upload your research data, first create a dataverse and then create a study. If you already have a dataverse, log in to add new studies.

Sharing of data that underlies published research.

Dataverses (520 of those) contain studies (52,289) which contain files (722,615).

For example, following the link for the Tom Clark dataverse, provides a listing of five (5) studies, ordered by their global ids.

Following the link to the Locating Supreme Court Opinions in Doctrine Space study, defaults to detailed cataloging information for the study.

The interface is under active development.

One feature that I hope is added soon is the ability to browse dataverses by author and self-assigned subjects.

Searching works, but is more reliable if you know the correct search terms to use.

I didn’t see any plans to deal with semantic ambiguity/diversity.

Quandl – Update

Filed under: Data,Dataset — Patrick Durusau @ 4:52 am

Quandl

When I last wrote about Quandl, they were at over 2,000,000 datasets.

Following a recent link to their site, I found they are now at over 5,000,000 data sets.

No mean feat, but among the questions that remain:

How do I judge the interoperability of data sets?

Where do I find the information needed to make data sets interoperable?

And just as importantly,

Where do I write down information I discovered or created to make a data set interoperable? (To avoid doing the labor over again.)

