Archive for the ‘Data’ Category
Thursday, May 30th, 2013
Distributing the Edit History of Wikipedia Infoboxes by Enrique Alfonseca.
From the post:
Aside from its value as a general-purpose encyclopedia, Wikipedia is also one of the most widely used resources to acquire, either automatically or semi-automatically, knowledge bases of structured data. Much research has been devoted to automatically building disambiguation resources, parallel corpora and structured knowledge from Wikipedia. Still, most of those projects have been based on single snapshots of Wikipedia, extracting the attribute values that were valid at a particular point in time. So about a year ago we compiled and released a data set that allows researchers to see how data attributes can change over time.
(…)
For this reason, we released, in collaboration with Wikimedia Deutschland e.V., a resource containing all the edit history of infoboxes in Wikipedia pages. While this was already available indirectly in Wikimedia’s full history dumps, the smaller size of the released dataset will make it easier to download and process this data. The released dataset contains 38,979,871 infobox attribute updates for 1,845,172 different entities, and it is available for download both from Google and from Wikimedia Deutschland’s Toolserver page. A description of the dataset can be found in our paper WHAD: Wikipedia Historical Attributes Data, accepted for publication at the Language Resources and Evaluation journal.
How much data do you need beyond the infoboxes of Wikipedia?
And knowing what values were in the past … isn’t that like knowing prior identifiers for subjects?
Posted in Data, Dataset, Wikipedia | No Comments »
Thursday, May 30th, 2013
re3data.org
From the post:
An increasing number of universities and research organisations are starting to build research data repositories to allow permanent access in a trustworthy environment to data sets resulting from research at their institutions. Due to varying disciplinary requirements, the landscape of research data repositories is very heterogeneous. This makes it difficult for researchers, funding bodies, publishers, and scholarly institutions to select an appropriate repository for storage of research data or to search for data.
The re3data.org registry allows the easy identification of appropriate research data repositories, both for data producers and users. The registry covers research data repositories from all academic disciplines. Information icons display the principal attributes of a repository, allowing users to identify the functionalities and qualities of a data repository. These attributes can be used for multi-faceted searches, for instance to find a repository for geoscience data using a Creative Commons licence.
By April 2013, 338 research data repositories were indexed in re3data.org. 171 of these are described by a comprehensive vocabulary, which was developed by involving the data repository community (http://doi.org/kv3).
The re3data.org search can be found at: http://www.re3data.org
The information icons are explained at: http://www.re3data.org/faq
Does this sound like any of these?:
DataOne
The Dataverse Network Project
IOGDS: International Open Government Dataset Search
PivotPaths: a Fluid Exploration of Interlinked Information Collections
Quandl [> 2 million financial/economic datasets]
Just to name five (5) that came to mind right off hand?
Addressing the heterogeneous nature of data repositories by creating another, semantically different data repository, seems like a non-solution to me.
What would be useful would be to create a mapping of this “new” classification, which I assume works for some group of users, against the existing classifications.
That would allow users of the “new” classification to access data in existing repositories, without having to learn their classification systems.
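As a rough illustration of that kind of mapping, here is a minimal Python sketch. Every name in it is hypothetical: the repository labels ("repo_a", "repo_b") and all subject terms are invented for the example and are not taken from re3data.org or any of the repositories listed above.

```python
# Hypothetical mapping from terms in a "new" classification to the subject
# terms used by existing repositories. All names here are invented.
NEW_TO_EXISTING = {
    "geosciences": {
        "repo_a": ["Earth Science"],
        "repo_b": ["geology", "geophysics"],
    },
    "economics": {
        "repo_a": ["Economics"],
        "repo_b": ["econ"],
    },
}


def translate(term, repository):
    """Return the repository's own terms for a term in the new classification.

    An empty list means no mapping has been recorded yet.
    """
    return NEW_TO_EXISTING.get(term, {}).get(repository, [])


print(translate("geosciences", "repo_b"))
```

A user who knows only the new vocabulary can then search an existing repository with that repository's terms, without learning its classification system.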
The heterogeneous nature of information is never vanquished but we can incorporate it into our systems.
Posted in Data, Dataset, Semantic Diversity | No Comments »
Saturday, May 25th, 2013
Semantics as Data by Oliver Kennedy.
From the post:
Something I’ve been getting drawn to more and more is the idea of computation as data.
This is one of the core precepts in PL and computation: any sort of computation can be encoded as data. Yet, this doesn’t fully capture the essence of what I’ve been seeing. Sure you can encode computation as data, but then what do you do with it? How do you make use of the fact that semantics can be encoded?
Let’s take this question from another perspective. In Databases, we’re used to imposing semantics on data. Data has meaning because we chose to give it meaning. The number 100,000 is meaningless, until I tell you that it’s the average salary of an employee at BigCorporateCo. Nevertheless, we can still ask questions in the abstract. Whatever semantics you use, 100,000 < 120,000. We can create abstractions (query languages) that allow us to ask questions about data, regardless of their semantics.
By comparison, an encoded computation carries its own semantics. This makes it harder to analyze, as the nature of those semantics is limited only by the type of encoding used to store the computation. But this doesn’t stop us from asking questions about the computation.
The Computation’s Effects
The simplest thing we can do is to ask a question about what it will compute. These questions span the range from the trivial to the typically intractable. For example, we can ask about…
- … what the computation will produce given a specific input, or a specific set of inputs.
- … what inputs will produce a given (range of) output(s).
- … whether a particular output is possible.
- … whether two computations are equivalent.
One particularly fun example in this space is Oracle’s Expression type [1]. An Expression stores (as a datatype) an arbitrary boolean expression with variables. The result of evaluating this expression on a given valuation of the variables can be injected into the WHERE clause of any SELECT statement. Notably, Expression objects can be indexed based on variable valuations. Given 3 such expressions: (A = 3), (A = 5), (A = 7), we can build an index to identify which expressions are satisfied for a particular valuation of A.
I find this beyond cool. Not only can Expression objects themselves be queried, it’s actually possible to build index structures to accelerate those queries.
Those familiar with probabilistic databases will note some convenient parallels between the expression type and Condition Columns used in C-Tables. Indeed, the concepts are almost identical. A C-Table encodes the semantics of the queries that went into its construction. When we compute a confidence in a C-Table (or row), what we’re effectively asking about is the fraction of the input space that the C-Table (row) produces an output for.
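To make the two ideas in that excerpt concrete, here is a small Python sketch. It is not Oracle's Expression type and not a C-Table implementation. It assumes expressions are simple equality tests on one variable, and it assumes "confidence" means the fraction of a finite, equally likely set of input values that satisfy an expression. The names (build_index, satisfied, confidence) are made up for the illustration.

```python
from collections import defaultdict

# Stored expressions from the post: (A = 3), (A = 5), (A = 7).
# Each is represented as (variable, required value); this is an assumption.
expressions = {"e1": ("A", 3), "e2": ("A", 5), "e3": ("A", 7)}


def build_index(exprs):
    """Index expression names by the (variable, value) pair that satisfies them."""
    index = defaultdict(set)
    for name, (var, value) in exprs.items():
        index[(var, value)].add(name)
    return index


def satisfied(index, valuation):
    """Look up which stored expressions hold for a given valuation."""
    hits = set()
    for var, value in valuation.items():
        hits |= index.get((var, value), set())
    return hits


def confidence(exprs, name, domain):
    """Fraction of the (finite, uniform) input domain satisfying one expression."""
    _, value = exprs[name]
    matches = sum(1 for v in domain if v == value)
    return matches / len(domain)


index = build_index(expressions)
print(satisfied(index, {"A": 5}))
print(confidence(expressions, "e1", range(1, 11)))
```

The lookup never evaluates the expressions one by one, which is the point of indexing them. The confidence function is the crudest possible reading of "fraction of the input space."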
At every level of semantics there is semantic diversity.
Whether it is code or data, there are levels of semantics, each with semantic diversity.
You don’t have to resolve all semantic diversity, just enough to give you an advantage over others.
Posted in Data, Semantics | No Comments »
Sunday, May 19th, 2013
UNESCO to make its publications available free of charge as part of a new Open Access policy
From the post:
The United Nations Educational, Scientific and Cultural Organization (UNESCO) has announced that it is making available to the public free of charge its digital publications and data. This comes after UNESCO has adopted an Open Access Policy, becoming the first agency within the United Nations to do so.
The new policy implies that anyone can freely download, translate, adapt, and distribute UNESCO’s publications and data. The policy also states that from July 2013, hundreds of downloadable digital UNESCO publications will be available to users through a new Open Access Repository with a multilingual interface. The policy seeks also to apply retroactively to works that have been published.
There’s a treasure trove of information for mapping, say against the New York Times historical archives.
If presidential libraries weren’t concerned with helping former administration officials avoid accountability, digitizing them for complete access would yield another great treasure trove.
Posted in Data, Government, Government Data | No Comments »
Saturday, May 4th, 2013
The accuracy of references in PhD theses: a case study by Fereydoon Azadeh and Reyhaneh Vaez.
Abstract:
Background
Inaccurate references and citations cause confusion, distrust in the accuracy of a report, waste of time and unnecessary financial charges for libraries, information centres and researchers.
Objectives
The aim of the study was to establish the accuracy of article references in PhD theses from the Tehran and Tabriz Universities of Medical Sciences and their compliance with the Vancouver style.
Methods
We analysed 357 article references in the Tehran and 347 in the Tabriz. Six bibliographic elements were assessed: authors’ names, article title, journal title, publication year, volume and page range. Referencing errors were divided into major and minor.
Results
Sixty two percent of references in the Tehran and 53% of those in the Tabriz were erroneous. In total, 164 references in the Tehran and 136 in the Tabriz were complete without error. Of 357 reference articles in the Tehran, 34 (9.8%) were in complete accordance with the Vancouver style, compared with none in the Tabriz. Accuracy of referencing did not differ significantly between the two groups, but compliance with the Vancouver style was significantly better in the Tehran.
Conclusions
The accuracy of referencing was not satisfactory in both groups, and students need to gain adequate instruction in appropriate referencing methods.
Now that’s bad data!
I have noticed errors in CS paper citations, but not at the rate reported here.
The ACM Digital Library could report for a given paper or conference the number of unknown citations, with a list, for checking.
Posted in Bibliography, Citation Practices, Data | No Comments »
Thursday, May 2nd, 2013
Create and Manage Data: Training Resources
From the webpage:
Our Managing and Sharing Data: Training Resources present a suite of flexible training materials for people who are charged with training researchers and research support staff in how to look after research data.
The Training Resources complement the UK Data Archive’s popular guide on ‘Managing and Sharing Data: best practice for researchers’, the most recent version published in May 2011.
They have been designed and used as part of the Archive’s daily work in supporting ESRC applicants and award holders and have been made possible by a grant from the ESRC Researcher Development Initiative (RDI).
The Training Resources are modularised following the UK Data Archive’s seven key areas of managing and sharing data:
- sharing data – why and how
- data management planning for researchers and research centres
- documenting data
- formatting data
- storing data, including data security, data transfer, encryption, and file sharing
- ethics and consent
- data copyright
Each section contains:
- introductory powerpoint(s)
- presenter’s guide – where necessary
- exercises and introduction to exercises
- quizzes
- answers
The materials are presented as used in our own training courses and are mostly geared towards social scientists. We anticipate trainers will create their own personalised and more context-relevant example, for example by discipline, country, relevant laws and regulations.
You can download individual modules from the relevant sections or download the whole resource in pdf format. Updates to pages were last made on 20 June 2012.
Download all resources.
Quite an impressive set of materials that will introduce you to some aspects of research data in the UK. Not all but some aspects.
What you don’t learn here you will pick up from interaction with people actively engaged with research data.
But it will give you a head start on understanding the research data community.
Unlike some technologies, topic maps are more about a community’s world view than the world view of topic maps.
Posted in Archives, Data, Preservation | No Comments »
Tuesday, April 30th, 2013
The Dataverse Network Project sponsored by the Institute for Quantitative Social Science, Harvard University.
Described on its homepage:
A repository for research data that takes care of long term preservation and good archival practices, while researchers can share, keep control of and get recognition for their data.
Dataverses currently in operation:
One shortfall I hope is corrected quickly is the lack of searching across instances of the Dataverse software.
For example, if I go to UC Davis and choose the Center for Poverty Research dataverse, I can find: “The Research Supplemental Poverty Measure Public Use Research Files” by Kathleen Short (a study).
But, if I search at the Harvard Dataverse Advanced Search by “Kathleen Short,” or “The Research Supplemental Poverty Measure Public Use Research Files,” I get no results.
An isolated dataverse is more of a data island than a dataverse.
We have lots of experience with data islands. It’s time for something different.
PS: Semantic integration issues need to be addressed as well.
Posted in Data, Dataverse Network | No Comments »
Tuesday, April 30th, 2013
Harvard Dataverse Network
From the webpage:
The Harvard Dataverse Network is open to all scientific data from all disciplines worldwide. It includes the world’s largest collection of social science research data. If you would like to upload your research data, first create a dataverse and then create a study. If you already have a dataverse, log in to add new studies.
Sharing of data that underlies published research.
Dataverses (520 of those) contain studies (52,289) which contain files (722,615).
For example, following the link for the Tom Clark dataverse provides a listing of five (5) studies, ordered by their global ids.
Following the link to the Locating Supreme Court Opinions in Doctrine Space study brings up, by default, detailed cataloging information for the study.
The interface is under active development.
One feature that I hope is added soon is the ability to browse dataverses by author and self-assigned subjects.
Searching works, but is more reliable if you know the correct search terms to use.
I didn’t see any plans to deal with semantic ambiguity/diversity.
Posted in Data, Dataverse Network | No Comments »
Tuesday, April 30th, 2013
Quandl
When I last wrote about Quandl, they were at over 2,000,000 datasets.
Following a recent link to their site, I found they are now at over 5,000,000 datasets.
No mean feat, but among the questions that remain:
How do I judge the interoperability of data sets?
Where do I find the information needed to make data sets interoperable?
And just as importantly,
Where do I write down information I discovered or created to make a data set interoperable? (To avoid doing the labor over again.)
Posted in Data, Dataset | No Comments »
Friday, April 26th, 2013
The Wikidata revolution is here: enabling structured data on Wikipedia by Tilman Bayer.
From the post:
A year after its announcement as the first new Wikimedia project since 2006, Wikidata has now begun to serve the over 280 language versions of Wikipedia as a common source of structured data that can be used in more than 25 million articles of the free encyclopedia.
By providing Wikipedia editors with a central venue for their efforts to collect and vet such data, Wikidata leads to a higher level of consistency and quality in Wikipedia articles across the many language editions of the encyclopedia. Beyond Wikipedia, Wikidata’s universal, machine-readable knowledge database will be freely reusable by anyone, enabling numerous external applications.
“Wikidata is a powerful tool for keeping information in Wikipedia current across all language versions,” said Wikimedia Foundation Executive Director Sue Gardner. “Before Wikidata, Wikipedians needed to manually update hundreds of Wikipedia language versions every time a famous person died or a country’s leader changed. With Wikidata, such new information, entered once, can automatically appear across all Wikipedia language versions. That makes life easier for editors and makes it easier for Wikipedia to stay current.”
This is a great source of curated data!
Posted in Data, Wikidata, Wikipedia | No Comments »
Friday, April 26th, 2013
Once Under Wraps, Supreme Court Audio Trove Now Online
From the post:
On Wednesday, the U.S. Supreme Court heard oral arguments in the final cases of the term, which began last October and is expected to end in late June after high-profile rulings on gay marriage, affirmative action and the Voting Rights Act.
Audio from Wednesday’s arguments will be available at week’s end at the court’s website, but that’s a relatively new development at an institution that has historically been somewhat shuttered from public view.
The court has been releasing audio during the same week as arguments only since 2010. Before that, audio from one term generally wasn’t available until the beginning of the next term. But the court has been recording its arguments for nearly 60 years, at first only for the use of the justices and their law clerks, and eventually also for researchers at the National Archives, who could hear — but couldn’t duplicate — the tapes. As a result, until the 1990s, few in the public had ever heard recordings of the justices at work.
But as of just a few weeks ago, all of the archived historical audio — which dates back to 1955 — has been digitized, and almost all of those cases can now be heard and explored at an online archive called the Oyez Project.
A truly incredible resource for U.S. history in general and legal history in particular.
The transcripts and tapes are synchronized so your task, if you are interested, is to map these resources to other historical accounts and resources.
The only disappointment is that the recordings begin with the October term of 1955. One of the most well known cases of the 20th century, Brown v. Board of Education, was argued in 1952 and re-argued in 1953. Hearing Thurgood Marshall argue that case would be a real treat.
I first saw this at: NPR: oyez.org finishes Supreme Court oral arguments project.
Posted in Data, History, Law, Law - Sources | No Comments »
Thursday, April 25th, 2013
A different take on data skepticism by Beau Cronin.
From the post:
Recently, the Mathbabe (aka Cathy O’Neil) vented some frustration about the pitfalls in applying even simple machine learning (ML) methods like k-nearest neighbors. As data science is democratized, she worries that naive practitioners will shoot themselves in the foot because these tools can offer very misleading results. Maybe data science is best left to the pros? Mike Loukides picked up this thread, calling for healthy skepticism in our approach to data and implicitly cautioning against a “cargo cult” approach in which data collection and analysis methods are blindly copied from previous efforts without sufficient attempts to understand their potential biases and shortcomings.
…
…Well, I would argue that all ML methods are not created equal with regard to their safety. In fact, it is exactly some of the simplest (and most widely used) methods that are the most dangerous.
Why? Because these methods have lots of hidden assumptions. Well, maybe the assumptions aren’t so much hidden as nodded-at-but-rarely-questioned. A good analogy might be jumping to the sentencing phase of a criminal trial without first assessing guilt: asking “What is the punishment that best fits this crime?” before asking “Did the defendant actually commit a crime? And if so, which one?” As another example of a simple-yet-dangerous method, k-means clustering assumes a value for k, the number of clusters, even though there may not be a “good” way to divide the data into this many buckets. Maybe seven buckets provides a much more natural explanation than four. Or maybe the data, as observed, is truly undifferentiated and any effort to split it up will result in arbitrary and misleading distinctions. Shouldn’t our methods ask these more fundamental questions as well?
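The point about k is easy to demonstrate. Below is a minimal one-dimensional k-means in plain Python, written for this illustration (it is not Beau's code and not any library's implementation). The data is drawn uniformly at random, so by construction it has no cluster structure, yet the method returns k centers for whatever k it is handed.

```python
import random


def kmeans_1d(points, k, iterations=20, seed=0):
    """Plain Lloyd-style k-means on a list of numbers.

    Returns (centers, clusters). The caller must supply k; nothing here
    asks whether k groups are a sensible description of the data.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers picked from the data
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Undifferentiated data: 200 values spread uniformly over [0, 1].
data_rng = random.Random(42)
points = [data_rng.uniform(0, 1) for _ in range(200)]

for k in (4, 7):
    centers, clusters = kmeans_1d(points, k)
    print(k, [len(c) for c in clusters])
```

Four buckets or seven, the algorithm obliges either way. Nothing in the output signals that the split is arbitrary.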
Beau makes several good points on questioning data methods.
I would extend those “…more fundamental questions…” to data as well.
Data, at least as far as I know, doesn’t drop from the sky. It is collected, generated, sometimes both, by design.
That design had some reason for collecting that data, in some particular way and in a given format.
Like methods, data stands mute with regard to those designs: what choices were made, by whom, and for what reason.
Giving voice to what can be known about methods and data falls to human users.
Posted in Algorithms, Data, Data Models, Data Quality | No Comments »
Tuesday, April 23rd, 2013
Resources and Readings for Big Data Week DC Events
This is Big Data week in DC and Data Community DC has put together a list of books, articles, and posts to keep you busy all week.
Very cool!
Posted in BigData, Data, Data Mining, Natural Language Processing | No Comments »
Saturday, April 20th, 2013
Data Storytelling: The Ultimate Collection of Resources by Zach Gemignani.
From the post:
The hot new concept in data visualization is “data storytelling”; some are calling it the next evolution of visualization (I’m one of them). However, we’re early in the discussion and there are more questions than answers:
- Is data storytelling more than a catchy phrase?
- Where does data storytelling fit into the broader landscape of data exploration, visualization, and presentation?
- How can the traditional tools of storytelling improve how we communicate with data?
- Is it more about story-telling or story-finding?
Many of the bright minds in the data visualization field have started to tackle these questions — and it is something that we’ve been exploring at Juice in our work. Below you’ll find a collection of some of the best blog posts, presentations, research papers, and other resources that take on this topic.
I count ten (10) blog posts, four (4) presentations, five (5) papers and eight (8) tools, examples and other resources.
Get yourself a fresh cup of coffee. You are going to be here a while.
PS: I don’t know that “data storytelling” is new or if the last century or so suffered a real drought in “data storytelling.”
Medieval cathedrals were exercises in storytelling but a modern/literate audience fails to appreciate them as designed.
Tags: Data Storytelling
Posted in Communication, Data, Data Storytelling | No Comments »
Saturday, April 20th, 2013
Data Computation Fundamentals by Daniel Kaplan and Libby Shoop.
From the first lesson:
Teaching the Grammar of Data
Twenty years ago, science students could get by with a working knowledge of a spreadsheet program. Those days are long gone, says Danny Kaplan, DeWitt Wallace Professor of Mathematics and Computer Science. “Excel isn’t going to cut it,” he says. “In today’s world, students can’t escape big data. Though it won’t be easy to teach it, it will only get harder as they move into their professional training.”
To that end, Kaplan and computer science professor Libby Shoop have developed a one-credit class called Data Computation Fundamentals, which is being offered beginning this semester. Though Kaplan doesn’t pretend the course can address all the complexities of specific software packages, he does hope it will provide a framework that students can apply when they come across databases or data-reliant programs in biology, chemistry, and physics. “We believe we can give students that grammar of data that they need to use these modern capabilities,” he says.
Not quite “have developed.” Should say, “are developing, in conjunction with a group of about 20 students.”
Data literacy impacts the acceptance and use of data and tools for using data.
Teaching people to read and write is not a threat to commercial authors.
By the same token, teaching people to use data is not a threat to competent data analysts.
Help the authors and yourself by reviewing the course and offering comments for its improvement.
I first saw this at: A Course in Data and Computing Fundamentals.
Posted in Data, Data Science, R | No Comments »
Friday, April 19th, 2013
Preliminary evaluation of the CellFinder literature curation pipeline for gene expression in kidney cells and anatomical parts by Mariana Neves, Alexander Damaschun, Nancy Mah, Fritz Lekschas, Stefanie Seltmann, Harald Stachelscheid, Jean-Fred Fontaine, Andreas Kurtz, and Ulf Leser. (Database (2013) 2013 : bat020 doi: 10.1093/database/bat020)
Abstract:
Biomedical literature curation is the process of automatically and/or manually deriving knowledge from scientific publications and recording it into specialized databases for structured delivery to users. It is a slow, error-prone, complex, costly and, yet, highly important task. Previous experiences have proven that text mining can assist in its many phases, especially, in triage of relevant documents and extraction of named entities and biological events. Here, we present the curation pipeline of the CellFinder database, a repository of cell research, which includes data derived from literature curation and microarrays to identify cell types, cell lines, organs and so forth, and especially patterns in gene expression. The curation pipeline is based on freely available tools in all text mining steps, as well as the manual validation of extracted data. Preliminary results are presented for a data set of 2376 full texts from which >4500 gene expression events in cell or anatomical part have been extracted. Validation of half of this data resulted in a precision of ∼50% of the extracted data, which indicates that we are on the right track with our pipeline for the proposed task. However, evaluation of the methods shows that there is still room for improvement in the named-entity recognition and that a larger and more robust corpus is needed to achieve a better performance for event extraction.
Database URL: http://www.cellfinder.org/.
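For readers who don't work with the term daily, precision here is the share of validated extractions that turned out to be correct. A minimal Python sketch follows. The counts passed in are invented round numbers chosen to land on 50%; they are not the paper's figures.

```python
def precision(true_positives, false_positives):
    """Precision = correct extractions / all extractions that were validated."""
    validated = true_positives + false_positives
    if validated == 0:
        raise ValueError("no validated extractions")
    return true_positives / validated


# Hypothetical counts, not taken from the CellFinder paper.
print(precision(50, 50))  # 0.5
```

At 50% precision, every second extracted event has to be thrown out by a human curator.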
Another extremely useful data curation project.
Do you get the impression that curation projects will continue to be outrun by data production?
And that will be the case, even with machine assistance?
Is there an alternative to falling further and further behind?
Such as allowing some content (CNN?) to simply go uncurated forever? Or the same for government documents/reports?
I am sure we all have different suggestions for what data to dump alongside the road to make room for the “important” stuff.
Suggestions on solutions other than simply dumping data?
Posted in Curation, Data, Entity Resolution, Named Entity Mining | No Comments »
Thursday, April 18th, 2013
Four decades of US terror attacks listed and detailed by Simon Rogers.
I was disappointed to read:
The horrors of the Boston Marathon explosions have focussed attention on terror attacks in the United States. But how common are they?
The Global Terrorism Database has recorded terror attacks across the world – with data from 1970 covering up to the end of 2011. It’s a huge dataset: over 104,000 attacks, including around 2,600 in the US – and its collection is funded by an agency of the US government: the Science and Technology Directorate of the US Department of Homeland Security through a Center of Excellence program based at the University of Maryland.
There’s a lot of methodology detailed on the site and several definitions of what is terrorism. At its root, the GTD says that terrorism is:
The threatened or actual use of illegal force and violence by a non-state actor to attain a political, economic, religious, or social goal through fear, coercion, or intimidation
I thought from the headlines that there would be a listing of four decades of US terror attacks against other peoples, countries and other groups.
A ponderous list, one the US has labored long and hard on over the past several decades.
A data set that contrasts “terror” attacks in the US with US terrorist attacks against others would make a better data set.
Starting just after WWII.
Posted in Data | No Comments »
Thursday, April 18th, 2013
Data: Continuous vs. Categorical by Robert Kosara.
From the post:
Data comes in a number of different types, which determine what kinds of mapping can be used for them. The most basic distinction is that between continuous (or quantitative) and categorical data, which has a profound impact on the types of visualizations that can be used.
The main distinction is quite simple, but it has a lot of important consequences. Quantitative data is data where the values can change continuously, and you cannot count the number of different values. Examples include weight, price, profits, counts, etc. Basically, anything you can measure or count is quantitative.
Categorical data, in contrast, is for those aspects of your data where you make a distinction between different groups, and where you typically can list a small number of categories. This includes product type, gender, age group, etc.
Both quantitative and categorical data have some finer distinctions, but I will ignore those for this posting. What is more important, is: why do those make a difference for visualization?
I like the use of visualization to reinforce the notion of difference between continuous and categorical data.
Makes me wonder about using visualization to explore the use of different data types for detecting subject sameness.
It may seem trivial to use the TMDM’s sameness of subject identifiers (simple string matching) to say two or more topics represent the same subject.
But what if subject identifiers match but other properties, say gender (modeled as an occurrence), do not?
That would illustrate a mistake in the use of a subject identifier, but also a weakness in relying on a subject identifier (data type URI) for subject identity.
That data type relies only on string matching for identification purposes, which may or may not agree with your subject sameness requirements.
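Here is a small Python sketch of that situation. It is a simplification, not a TMDM implementation: topics are plain dictionaries, gender is a plain key rather than a typed occurrence, and the identifiers under example.org are made up.

```python
from collections import defaultdict

# Hypothetical topics. t1 and t2 share a subject identifier but disagree
# on gender, which is the case described above.
topics = [
    {"id": "t1",
     "subject_identifiers": {"http://example.org/id/pat-smith"},
     "gender": "female"},
    {"id": "t2",
     "subject_identifiers": {"http://example.org/id/pat-smith"},
     "gender": "male"},
    {"id": "t3",
     "subject_identifiers": {"http://example.org/id/lee-jones"},
     "gender": "male"},
]


def merge_candidates(topics):
    """Group topics whose subject identifiers match as exact strings."""
    by_identifier = defaultdict(list)
    for topic in topics:
        for identifier in topic["subject_identifiers"]:
            by_identifier[identifier].append(topic)
    return {si: ts for si, ts in by_identifier.items() if len(ts) > 1}


def conflicts(group, prop):
    """Return the differing values of prop within a merge group, if any."""
    values = {t[prop] for t in group if prop in t}
    return values if len(values) > 1 else set()


for identifier, group in merge_candidates(topics).items():
    clash = conflicts(group, "gender")
    if clash:
        print(identifier, "would merge topics with conflicting gender:", sorted(clash))
```

String matching alone would merge t1 and t2. Checking one more property before merging is enough to flag the pair for a human to look at.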
Posted in Data, Data Types | No Comments »
Wednesday, April 17th, 2013
Practical tools for exploring data and models by Hadley Alexander Wickham. (PDF)
From the introduction:
This thesis describes three families of tools for exploring data and models. It is organised in roughly the same way that you perform a data analysis. First, you get the data in a form that you can work with; Section 1.1 introduces the reshape framework for restructuring data, described fully in Chapter 2. Second, you plot the data to get a feel for what is going on; Section 1.2 introduces the layered grammar of graphics, described in Chapter 3. Third, you iterate between graphics and models to build a succinct quantitative summary of the data; Section 1.3 introduces strategies for visualising models, discussed in Chapter 4. Finally, you look back at what you have done, and contemplate what tools you need to do better in the future; Chapter 5 summarises the impact of my work and my plans for the future.
The tools developed in this thesis are firmly based in the philosophy of exploratory data analysis (Tukey, 1977). With every view of the data, we strive to be both curious and sceptical. We keep an open mind towards alternative explanations, never believing we have found the best model. Due to space limitations, the following papers only give a glimpse at this philosophy of data analysis, but it underlies all of the tools and strategies that are developed. A fuller data analysis, using many of the tools developed in this thesis, is available in Hobbs et al. (To appear).
Has a focus on R tools, including ggplot2 and Wilkinson’s The Grammar of Graphics.
The “…never believing we have found the best model” approach works for me!
You?
I first saw this at Data Scholars.
Posted in Data, Data Mining, Data Models, Exploratory Data Analysis | No Comments »
Tuesday, April 16th, 2013
The Costs and Profits of Poor Data Quality by Jim Harris.
From the post:
Continuing the theme of my two previous posts, which discussed when it’s okay to call data quality as good as it needs to get and when perfect data quality is necessary, in this post I want to briefly discuss the costs — and profits — of poor data quality.
Loraine Lawson interviewed Ted Friedman of Gartner Research about How to Measure the Cost of Data Quality Problems, such as the costs associated with reduced productivity, redundancies, business processes breaking down because of data quality issues, regulatory compliance risks, and lost business opportunities. David Loshin blogged about the challenge of estimating the cost of poor data quality, noting that many estimates, upon close examination, seem to rely exclusively on anecdotal evidence.
As usual, Jim does a very good job of illustrating costs and profits from poor data quality.
I have a slightly different question:
What could you know about data to spot that it is of poor quality?
It is one thing to find out after a spaceship crashes that poor data quality was responsible, but it would be better to spot the error beforehand. As in before the launch.
Probably data-specific, but are there any general types of information that would help you spot poor quality data?
Before you are 1,000 meters off the lunar surface.
Posted in Data, Data Quality | No Comments »
Saturday, April 13th, 2013
Wikileaks: Kissinger Cables
The code behind the Public Library of US Diplomacy.
Another rich source of information for anyone creating a mapping of relationships and events in the early 1970s.
My only puzzle over Wikileaks is their apparent focus on US diplomatic cables.
Where are the diplomatic cables of the former government in Egypt? Or the USSR? Or of any of the many existing regimes around the globe?
Surely those aren’t more difficult to obtain than those of the US?
Perhaps that would make an interesting topic map.
Those who could be exposed by Wikileaks but aren’t.
I first saw this as: Wikileaks ProjectK Code (Github) on Nat Torkington’s Four short links: 12 April 2013.
Posted in Data, Wikileaks | No Comments »
Saturday, April 6th, 2013
Ultimate library challenge: taming the internet by Jill Lawless.
From the post:
Capturing the unruly, ever-changing internet is like trying to pin down a raging river. But the British Library is going to try.
For centuries, the library has kept a copy of every book, pamphlet, magazine and newspaper published in Britain. Starting on Saturday, it will also be bound to record every British website, e-book, online newsletter and blog in a bid to preserve the nation’s “digital memory”.
As if that’s not a big enough task, the library also has to make this digital archive available to future researchers – come time, tide or technological change.
The library says the work is urgent. Ever since people began switching from paper and ink to computers and mobile phones, material that would fascinate future historians has been disappearing into a digital black hole. The library says firsthand accounts of everything from the 2005 London transit bombings to Britain’s 2010 election campaign have already vanished.
“Stuff out there on the web is ephemeral,” said Lucie Burgess, the library’s head of content strategy. “The average life of a web page is only 75 days, because websites change, the contents get taken down.
“If we don’t capture this material, a critical piece of the jigsaw puzzle of our understanding of the 21st century will be lost.”
For more details, see Jill’s post, Click to save the nation’s digital memory (British Library press release), or 100 websites: Capturing the digital universe (sample of results of archiving with only 100 sites).
The content gathered by the project will be made available to the public.
A welcome venture, particularly since the results will be made available to the public.
An unanswerable question but I do wonder how we would view Greek drama if all of it had been preserved?
Hundreds if not thousands of plays were written and performed every year.
The Complete Greek Drama lists only forty-seven (47) that have survived to this day.
If wholesale preservation is the first step, how do we preserve paths to what’s worth reading in a data labyrinth as a second step?
I first saw this in a tweet by Jason Ronallo.
Posted in Data, Indexing, Preservation, Search Data, WWW | No Comments »
Wednesday, April 3rd, 2013
Intrade Archive: Data for Posterity by Panos Ipeirotis.
From the post:
A few years back, I have done some work on prediction markets. For this line of research, we have been collecting data from Intrade, to perform our experimental analysis. Some of the data is available through the Intrade Archive, a web app that I wrote in order to familiarize myself with the Google App Engine.
In the last few weeks, though, after the effective shutdown of Intrade, I started receiving requests on getting access to the data stored in the Intrade Archive. So, after popular demand, I gathered all the data from the Intrade Archive, and also all the past data that I had about all the Intrade contracts going back to 2003, and I put them all on GitHub for everyone to access and download.
If you don’t know about Intrade, see: How Intrade Works.
Not sure why you would need the data but it is unusual enough to merit notice.
Posted in Data, Finance Services, Prediction | No Comments »
Thursday, March 28th, 2013
Analytics Best Practices: The Analytical Sandbox by Rick Sherman.
From the post:
So this situation sounds familiar, and you are wondering if you need an analytical sandbox…
The goal of an analytical sandbox is to enable business people to conduct discovery and situational analytics. This platform is targeted for business analysts and “power users” who are the go-to people that the entire business group uses when they need reporting help and answers. This target group is the analytical elite of the enterprise.
The analytical elite have been building their own makeshift sandboxes, referred to as data shadow systems or spreadmarts. The intent of the analytical sandbox is to provide the dedicated storage, tools and processing resources to eliminate the need for the data shadow systems.
Rick outlines what he thinks is needed for an analytical sandbox.
What would you include in a topic map sandbox?
Posted in Analytics, Data, Data Analysis | No Comments »
Tuesday, March 26th, 2013
Forecast for Developers
From the webpage:
The same API that powers Forecast.io and Dark Sky for iOS can provide accurate short-term and long-term weather predictions to your business, application, or crazy idea.
We’re developers too, and we like playing with new APIs, so we want you to be able to try ours hassle-free: all you need is an email address.
First thousand API calls a day are free.
Every 10,000 API calls after that are $1.
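For a quick feel of what that pricing means, here is a back-of-the-envelope Python sketch. It assumes billing is prorated per call beyond the free allowance, which the page does not state (they may well bill in whole blocks of 10,000), and the function and constant names are my own.

```python
FREE_CALLS_PER_DAY = 1_000   # "First thousand API calls a day are free."
BLOCK_SIZE = 10_000          # "Every 10,000 API calls after that are $1."
PRICE_PER_BLOCK_USD = 1.00

def daily_cost_usd(calls: int) -> float:
    """Estimated cost for one day, ASSUMING prorated billing per call."""
    billable = max(0, calls - FREE_CALLS_PER_DAY)
    return billable * PRICE_PER_BLOCK_USD / BLOCK_SIZE

light_day = daily_cost_usd(500)      # under the free allowance
busy_day = daily_cost_usd(51_000)    # 50,000 billable calls
```

Under that assumption a day with 51,000 calls comes to about $5, and anything at or below 1,000 calls costs nothing.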
It could be useful/amusing to merge personal weather observations based on profile characteristics.
Like a recommendation system except for how you are going to experience the weather.
Posted in Data, Weather Data | No Comments »
Friday, March 22nd, 2013
The Shape of Data by Jesse Johnson.
From the “about” page:
Whether your goal is to write data intensive software, use existing software to analyze large, high dimensional data sets, or to better understand and interact with the experts who do these things, you will need a strong understanding of the structure of data and how one can try to understand it. On this blog, I plan to explore and explain the basic ideas that underlie modern data analysis from a very intuitive and minimally technical perspective: by thinking of data sets as geometric objects.
When I began learning about machine learning and data mining, I found that the intuition I had formed while studying geometry was extremely valuable in understanding the basic concepts and algorithms. My main obstacle has been to figure out what types of problems others are interested in solving, and what types of solutions would make the most difference. I hope that by sharing what I know (and what I continue to learn) from my own perspective, others will help me to figure out what are the major questions that drive this field.
A new blog that addresses the topology of data, in an accessible manner.
Posted in Data, Mathematics, Topology | No Comments »
Friday, March 22nd, 2013
Making It Happen: Sustainable Data Preservation and Use by Anita de Waard.
Great set of overview slides on why research data should be preserved.
Not to mention making the case that semantic diversity, in systems for capturing research data, between researchers, etc., needs to be addressed by any proffered solution.
If you don’t know Anita de Waard’s work, search for “Anita de Waard” on Slideshare.
As of today, I am getting one hundred and forty (140) presentations.
All of which you will find useful on a variety of data related topics.
Posted in Data, Data Preservation, Preservation | No Comments »
Friday, March 22nd, 2013
American Geophysical Union (AGU)
The mission of the AGU:
The purpose of the American Geophysical Union is to promote discovery in Earth and space science for the benefit of humanity.
While I was hunting down information on DataONE, I ran across the AGU site.
As in all disciplines, data analysis, collection, collation, sharing, etc. are ongoing concerns at the AGU.
My interest is more in the data techniques than the subject matter.
I am seeking to avoid re-inventing the wheel and to learn new insights that have yet to reach more familiar areas.
Posted in Data, Geophysical, Science | No Comments »
Thursday, March 21st, 2013
Data.ac.uk
From the website:
This is a landmark site for academia providing a single point of contact for linked open data development. It not only provides access to the know-how and tools to discuss and create linked data and data aggregation sites, but also enables access to, and the creation of, large aggregated data sets providing powerful and flexible collections of information.
Here at Data.ac.uk we’re working to inform national standards and assist in the development of national data aggregation subdomains.
I can’t imagine a greater contrast to my poor web authoring skills than this website.
But having said that, I think you will be as disappointed as I was when you start looking for data on this “landmark site.”
There is some but not nearly enough to match the promise of such a cleverly designed website.
Perhaps they are hoping that someday RDF data (they also offer comma and tab delimited versions) will catch up to the site design.
I first saw this in a tweet by Frank van Harmelen.
Posted in Data, Open Data, RDF | No Comments »