Archive for the ‘Data Source’ Category


Tuesday, October 18th, 2011


From the website:

Projects in rOpenSci fall into two categories: those for working with the scientific literature, and those for working directly with the databases. Visit the active development hub of each project on github, where you can see and download source-code, see updates, and follow or join the developer discussions of issues. Most of the packages work through an API provided by the resource (database, paper archive) to access data and bring it within reach of R’s powerful manipulation.

The project started this past summer but has already collected some tutorials and data sets.

A good opportunity to learn some R, as well as to talk up the notion of re-using scientific data in new ways. Don’t jump right into the recursion of subject identity as it relates to data, data structures and the subjects both represent. 😉 YOU MAY THINK THAT, but what you say is: How do you know when you are talking about the same subject across data sets? Would that be useful to you? (Note the strategy of asking the user rather than explaining their problem to them first. Explaining their problem for them, in terms I understand, is usually my strategy, so this is a reminder to me not to do that!)

Federal Election Commission Campaign Data Analysis

Saturday, October 15th, 2011

Federal Election Commission Campaign Data Analysis by Dave Fauth.

From the post:

This post is inspired by Marko Rodriguez’ excellent post on a Graph-Based Movie Recommendation engine. I will use many of the same concepts that he describes in his post in order to load the data into Neo4J and then begin to analyze the data. This post will focus on the data loading. Follow-on posts will look at further analysis based on the relationships.


The Federal Election Commission has made campaign contribution data publicly available for download here. The FEC has provided campaign finance maps on its home page. The Sunlight Foundation has created the Influence Explorer to provide similar analysis.

This post and follow-on posts will look at analyzing the Campaign Data using the graph database Neo4j, and the graph traversal language Gremlin. This post will go about showing the data preparation, the data modeling and then loading into Neo4J.
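To give a feel for the data modeling step, here is a minimal sketch (not Dave’s actual code; the column names and sample rows are invented) of turning FEC-style contribution records into nodes and relationships ready for loading into Neo4j:

```python
import csv
import io

# Toy FEC-style extract: contributor, candidate committee, amount.
# (Hypothetical headers; the real FEC bulk files use different columns.)
sample = """contributor,candidate,amount
"SMITH, JOHN",OBAMA FOR AMERICA,500
"DOE, JANE",ROMNEY FOR PRESIDENT,250
"SMITH, JOHN",ROMNEY FOR PRESIDENT,100
"""

nodes, edges = set(), []
for row in csv.DictReader(io.StringIO(sample)):
    nodes.add(("Contributor", row["contributor"]))
    nodes.add(("Candidate", row["candidate"]))
    edges.append((row["contributor"], row["candidate"], int(row["amount"])))

# Each edge would become a relationship in Neo4j, e.g. in Cypher:
#   MERGE (p:Contributor {name: $contributor})
#   MERGE (c:Candidate   {name: $candidate})
#   MERGE (p)-[:CONTRIBUTED_TO {amount: $amount}]->(c)

print(len(nodes), len(edges))  # 4 nodes, 3 contribution edges
```

In a real load you would stream the FEC bulk files and execute the statements (Cypher or Gremlin) against a running Neo4j instance.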

I think the advantage that Dave’s work will have over the Sunlight Foundation’s “Influence Explorer” is that the “Influence Explorer” has a fairly simple model: candidate gets money, therefore candidate is owned by contributor. To some degree true, but how does that work when both sides of an issue are contributing money?

Tracing out the webs of influence that lead to particular positions is going to take something like Neo4j, primed with campaign contribution information but then decorated with other relationships and actors.

dmoz: computers: artificial intelligence

Friday, October 14th, 2011

dmoz: computers: artificial intelligence

This morning I ran across this listing of resources, some 1,294 as of today.

Amusing to note that despite the category being “Artificial Intelligence,” “Programming Languages” shows “(0).”

Before you leap to the defense of dmoz, yes, I know that if you follow the “Programming Languages” link, you will find numerous Lisp resources (as well as others).

Three problems:

First, it isn’t immediately obvious that you should follow “Programming Languages” to find Lisp. After all, it says “(0).” What does that usually mean?

Second, the coarse granularity of such a resource listing enables easier navigation, but at the expense of detail. Surely, now that we are past printed text, we can create “views” on the fly that serve the need to navigate as well as varying needs for detail or different navigations.

Third, and most importantly from my perspective, how do you stay aware of new materials and find old materials at these sites? RSS feeds can help with changes but don’t gather similar reports together and certainly don’t help with material already posted.

Another rich lode of resources where delivery could be greatly improved.


Thursday, October 13th, 2011


From the website:

Numbrary is a free online service dedicated to finding, using and sharing numbers on the web.

With 26,475 data tables from the US Department of Labor, I get to Producer Price Indexes (3428 items) and then to Commodities (WPU 101), and there is very nice access to the underlying data.

Except that I don’t know how that data should (could?) be reconciled with other data. Or what that “other” data would be, save for the “See Also” links on the webpage, and I don’t know why I should see that data as well.

Beyond just my lack of experience with economic data, this may illustrate something about “transparency” in government.

Can a government be said to be “transparent” if it provides data that is no more “transparent” to voters than the lack of data?

What burden does it have to make data more than simply accessible, but also meaningful? (I am mindful of the credit disclosure laws that provided foot faults for those wishing to pursue members of the credit industry but that did not make credit rate disclosures meaningful.)

Still, a useful source of data that I commend to your attention.

Peter Skomoroch – Delicious

Thursday, October 13th, 2011

Peter Skomoroch – Delicious

As of today, 7845 links to data and data sources.

A prime candidate to illustrate that there is no shortage of data, but a serious shortage of meaningful navigation of data.

In Depth with Campaign Finance Data

Thursday, October 13th, 2011

In Depth with Campaign Finance Data by Ethan Phelps-Goodman.


Influence Explorer and TransparencyData are the Sunlight Foundation’s two main sources for data on money and influence in politics. Both sites are warehouses for a variety of datasets, including campaign finance, lobbying, earmarks, federal spending and various other corporate accountability datasets. The underlying data is the same for both sites, but the presentation is very different. Influence Explorer takes the most important or prominent entities in the data–such as state and federal politicians, well-known individuals, and large companies and organizations–and gives each its own page with easy to understand charts and graphs. TransparencyData, on the other hand, gives searchable access to the raw records that make up each Influence Explorer page. Influence Explorer can answer questions like, “who was the top donor to Obama’s presidential campaign?” TransparencyData lets you dig down into the details of every single donation to that campaign.

If you are interested in campaign finance data, this is a very good starting point. At least you can get a sense of the difficulty in simply tracking the money. I think you will find that money can buy access, but that isn’t the same thing as influence. That is more complicated.

Topic maps can help in several ways. First, there is the ability to consolidate information from a variety of sources so that no one person has to try to assemble all the pieces. Second, the use of associations can help you discover patterns in relationships that may uncover some hidden (or relatively so) avenues of influence or access. Not to mention that being able to trade up information with others may help you build a better portfolio of data for when you go calling to exercise some influence.

Where to find data to use with R

Wednesday, October 12th, 2011

Where to find data to use with R

From the post:

Hardly a day goes by without someone or something reminding me that we are drowning in a sea of data (a bummer day ):, or that the new hero is the data scientist (a Yes! let’s go make some money kind of day!!). This morning I read “…Google grew from processing 100 terabytes of data a day with MapReduce in 2004 to processing 20 petabytes a day with MapReduce in 2008.” (Lin and Dyer, Data-Intensive Text Processing with MapReduce: Morgan & Claypool, 2010, p. 1) Assuming linear growth, that would mean Google did about 400 terabytes during the 15 minutes it took me to check my email. Even if Google is getting more than its fair share, data should be everywhere, more data than I could ever need, more than I could process, more than I could ever imagine.

So, how come every time I go to write a blog post or try some new stats I can never find any data? A few hours ago I Googled “free data sets” and got over 74,000,000 hits, but it looks as if it’s going to be another evening of me with iris. What’s wrong here? At the root, it’s a deep problem that gets at the essence of data. What are data anyway? My answer: data are structured information. Part of the structure includes meta-information describing the intention and the integrity with which the data were collected. When looking for a data set, even for some purpose that is not that important, we all want some evidence that the data were either collected with intentions that are similar to our intentions to use the data or that the data can be re-purposed. Moreover, we need to establish some comfort level that the data were not collected to deceive, that they are reasonably representative, reasonably randomized, reasonably unbiased, etc. The more importance we place on our project, the more we tighten up on these requirements. This is not all philosophy. I think that focusing on intentions and integrity provides some practical guidance on where to search for data on the internet.
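As a quick check of the quoted rate (decimal units, linear growth assumed), 20 petabytes per day comes out nearer 200 TB per 15 minutes than 400, but the same order of magnitude:

```python
# Back-of-the-envelope check of the 20 PB/day figure quoted above.
TB_PER_PB = 1000                       # decimal units
tb_per_day = 20 * TB_PER_PB            # 20,000 TB/day
tb_per_15_min = tb_per_day / (24 * 4)  # 96 fifteen-minute slices per day
print(round(tb_per_15_min))            # → 208
```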

If you are using R and need data, here is a first stop. Note the author is maintaining a list of such data sources.

Springer MARC Records

Saturday, October 1st, 2011

Springer MARC Records

From the webpage:

Springer offers two options for MARC records for Springer eBook collections:

1. Free Springer MARC records, SpringerProtocols MARC records & eBook Title Lists

  • Available free of charge
  • Generated using Springer metadata containing most common fields
  • Pick, download and install Springer MARC records in 4 easy steps

2. Free OCLC MARC records

  • Available free of charge
  • More enhanced MARC records
  • Available through OCLC WORLDCAT service

This looks like very good topic map fodder.

I saw this at all things cataloged.


Friday, September 9th, 2011


A data-as-a-service site that offers access to data (no downloads) via APIs. It has help for authors on preparing their data, APIs for the data, etc. Currently in beta.

I mention it because data as service is one model for delivery of topic map content so the successes, problems and usage of Kasabi may be important milestones to watch.

True, Lexis/Nexis, WestLaw, and any number of other commercial vendors have sold access to data in the past, but it was mostly dumb data. That is, you had to contribute something to it to make it meaningful. We are in the early stages, but I think a market for data that works with my data is developing.

The options to download citations in formats that fit particular bibliographic programs are an impoverished example of delivered data working with local data.

Not quite the vision for the Semantic Web but it isn’t hard to imagine your calendaring program showing links to current news stories about your appointments. You have to supply the reasoning to cancel the appointment with the bank president just arrested for securities fraud and to increase your airline reservations to two (2).

Federal Register (US)

Friday, September 2nd, 2011

Federal Register (US)

From the developers’ webpage for the Federal Register (US):

Project Source Code is a fully open source project; on GitHub you can find the source code for the main site, the chef cookbooks for maintaining the servers, and the WordPress themes and configuration. We welcome your contributions and feedback.


While the API is still a work in progress, we’ve designed it to be as easy-to-use as possible:

  • It comes pre-processed; the data provided is a combination of data from the GPO MODS (metadata) files and the GPO bulkdata files and has gone through our cleanup procedures.
  • We’re using JSON as a lighter-weight, more web-friendly data transfer format
  • No API keys are needed; all you need is an HTTP client or browser.
  • The API is fully RESTful; URLs are provided to navigate to the full details or to the next page of results (HATEOAS).
  • A simple JSONP interface is also possible; simply add a `callback=foo` CGI parameter to the end of any URL to have the results be ready for cross-domain JavaScript consumption

See the webpage for Endpoints, Search Functionality, Ruby API Client and Usage Restrictions.
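Since no API keys are needed, trying the API is just a matter of building a URL. A minimal sketch (the endpoint path and parameter names are my recollection of the docs; check the Endpoints page before relying on them):

```python
from urllib.parse import urlencode

# Assumed v1 documents endpoint; verify against the Endpoints page.
BASE = "https://www.federalregister.gov/api/v1/documents.json"

def search_url(term, per_page=5, callback=None):
    """Build a Federal Register search URL; pass callback=... for JSONP."""
    params = {"conditions[term]": term, "per_page": per_page}
    if callback:
        params["callback"] = callback  # JSONP, as described above
    return BASE + "?" + urlencode(params)

print(search_url("clean air", callback="foo"))
# An HTTP GET on this URL returns JSON (or JSONP with callback=) including
# URLs for the next page of results, per the HATEOAS design noted above.
```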

For those of you who are unfamiliar with the Federal Register:

The Office of the Federal Register informs citizens of their rights and obligations, documents the actions of Federal agencies, and provides a forum for public participation in the democratic process. Our publications provide access to a wide range of Federal benefits and opportunities for funding and contain comprehensive information about the various activities of the United States Government. In addition, we administer the Electoral College for Presidential elections and the Constitutional amendment process.

The Federal Register is updated daily by 6 a.m. and is published Monday through Friday, except Federal holidays, and consists of four types of entries.

  • Presidential Documents, including Executive orders and proclamations.
  • Rules and Regulations, including policy statements and interpretations of rules.
  • Proposed Rules, including petitions for rulemaking and other advance proposals.
  • Notices, including scheduled hearings and meetings open to the public, grant applications, administrative orders, and other announcements of government actions.

We recommend reading the “Learn” pages of this site for more on the structure and value of the Federal Register and for an overview of the regulatory process.

Or as it says on their homepage: “The Daily Journal of the United States Government.”

IRE: Investigative Reporters and Editors

Monday, August 29th, 2011

IRE: Investigative Reporters and Editors

The IRE sponsors the census data that I pointed out earlier.

From the about page:

Investigative Reporters and Editors, Inc. is a grassroots nonprofit organization dedicated to improving the quality of investigative reporting.

IRE was formed in 1975 to create a forum in which journalists throughout the world could help each other by sharing story ideas, news gathering techniques and news sources.

Mission Statement

The mission of Investigative Reporters and Editors is to foster excellence in investigative journalism, which is essential to a free society. We accomplish this by:

  • Providing training, resources and a community of support to investigative journalists.
  • Promoting high professional standards.
  • Protecting the rights of investigative journalists.
  • Ensuring the future of IRE.

They are a membership-based organization, and for $70 (US) per year you get access to a number of data sets that have been collected by the organization or culled from public sources. It doesn’t hurt to check for existing sources of data before you go to the trouble of extracting it yourself.

The other reason to mention them is that news organizations seem to like finding connections between people, between people and fraudulent activities, people and sex workers, and other connections that are the bread and butter of topic maps. Particularly when topic maps are combined and new connections become apparent.

So, this is a place where topic maps, or at least the results of using topic maps (not the same thing), may find a friendly reception.

Suggestions of other likely places to pitch either topic maps or the results of using topic maps most welcome!


Monday, August 1st, 2011


NASA has shared data and software for years but now has a shiny new website and, to be fair, some introductions to make use of the material easier.

I don’t have a citation for it, but Jim Gray (Microsoft) was reported to say that astronomy data was great because there was so much of it and it was free.

There is a lot of mapping possible twixt and tween astronomy data sets, both historic and recent, so it is a ripe area for exploration with topic maps.


NASA’s Open Government Site Built On Open Source, an InformationWeek post on the NASA site.

Why InformationWeek mentions Object Oriented Data Technology (OODT) and Disqus but provides no links to the same, I cannot say.

Admittedly I don’t do enough linking for concepts, etc., but I do try to put in links to projects and the like.

Who’s Your Daddy?

Sunday, July 3rd, 2011

Who’s Your Daddy? (Genealogy and Corruption, American Style)

NPR (National Public Radio) News broadcast the opinion this morning that Brits are marginally less corrupt than Americans. Interesting question. Was Bonnie less corrupt than Clyde? Debate at your leisure but the story did prompt me to think of an excellent resource for tracking both U.S. and British style corruption.

It is probably all the talk of lineage in the news lately, but why not use the genealogy records that are gathered so obsessively to track the soft corruption of influence?

Just another data set to overlay on elected, appointed, and hired positions, lobbyists, disclosure statements, contributions, known sightings, congressional legislation and administrative regulations, etc. Could lead to a “Who’s Your Daddy?” segment on NPR where employment or contracts are questioned naming names. That would be interesting.

It also seems more likely to be effective than the “disclose your corruption” sunlight approach. Corruption is never confessed, it has to be rooted out.

World Bank Data

Friday, July 1st, 2011

World Bank Data

Available through other portals, the World Bank offers access to over 7,000 indicators at its site, along with widgets for displaying the data.

While the World Bank Data website is well done and a step towards “transparency,” it does not address the need for “transparency” in terms of financial auditing.

Take for example the Uganda – Financial Sector DPC Project. Admittedly it is only $50M but given it has a forty (40) year term with a ten (10) year grace period, who will be able to say with any certainty what happened to the funds in question?

If there were a mapping between the financial systems that disburse these funds and the financial systems in Uganda, then, on whatever basis the information is updated, the World Bank would know and could assure others of the fate of the funds in question.

Granted, I am assuming that different institutions and countries have different financial systems and that uniformity of such applications or systems isn’t realistic. But it should certainly be possible to set up and maintain mappings between such systems. I suspect that mappings to banks and other financial institutions should be made as well, to enable off-site auditing of any and all transactions.

Lest it seem like I am picking on World Bank recipients, I would recommend such mapping/auditing practices for all countries before approval of big ticket items like defense budgets. The fact that an auditing mapping fails in a following year is an indication something was changed for a reason. Once it is understood that changes attract attention and attention uncovers fraud, unexpected maintenance is unlikely to be an issue.
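A toy illustration of the kind of mapping/auditing I have in mind (every field name, project ID and amount here is invented): normalize records from one institution’s schema onto the other’s, then flag amounts that fail to reconcile.

```python
# Hypothetical disbursement records in two different schemas.
world_bank = [{"project": "P123", "tranche": 1, "usd": 5_000_000}]
recipient = [{"proj_id": "P123", "installment": 1, "amount_usd": 4_750_000}]

# The mapping between the two schemas, maintained as data.
field_map = {"proj_id": "project", "installment": "tranche", "amount_usd": "usd"}

def normalize(record, mapping):
    """Rename a record's fields onto the target schema."""
    return {mapping.get(k, k): v for k, v in record.items()}

mismatches = []
for wb, rc in zip(world_bank, (normalize(r, field_map) for r in recipient)):
    if wb["usd"] != rc["usd"]:  # a failed reconciliation attracts attention
        mismatches.append((wb["project"], wb["usd"] - rc["usd"]))

print(mismatches)  # [('P123', 250000)]
```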


Friday, July 1st, 2011


From the About page:

What is ScraperWiki?

There’s lots of useful data on the internet – crime statistics, government spending, missing kittens…

But getting at it isn’t always easy. There’s a table here, a report there, a few web pages, PDFs, spreadsheets… And it can be scattered over thousands of different places on the web, making it hard to see the whole picture and the story behind it. It’s like trying to build something from Lego when someone has hidden the bricks all over town and you have to find them before you can start building!

To get at data, programmers write bits of code called ‘screen scrapers’, which extract the useful bits so they can be reused in other apps, or rummaged through by journalists and researchers. But these bits of code tend to break, get thrown away or forgotten once they have been used, and so the data is lost again. Which is bad.

ScraperWiki is an online tool to make that process simpler and more collaborative. Anyone can write a screen scraper using the online editor. In the free version, the code and data are shared with the world. Because it’s a wiki, other programmers can contribute to and improve the code.
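A “screen scraper” in the sense described above can be very small. A sketch using only the Python standard library, run against an inline sample page with invented numbers (a real scraper would fetch the HTML with urllib.request.urlopen):

```python
from html.parser import HTMLParser

# Inline stand-in for a web page with a statistics table.
page = """
<table>
  <tr><td>Burglary</td><td>1234</td></tr>
  <tr><td>Theft</td><td>5678</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collect the text of each <td>, grouped by <tr>."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_td = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
        elif tag == "tr":          # row finished: save and start a new one
            self.rows.append(self.row)
            self.row = []
    def handle_data(self, data):
        if self.in_td:
            self.row.append(data.strip())

scraper = TableScraper()
scraper.feed(page)
print(scraper.rows)  # [['Burglary', '1234'], ['Theft', '5678']]
```

Once the data is out of the page and into a structure like this, it can be reused in other apps, which is exactly the step that tends to get thrown away.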

Something to keep an eye on and whenever possible, to contribute to.

People make data difficult to access for a reason. Let’s disappoint them.

Sunday, June 26th, 2011

An easy source of US government regulations, which you can then use to demonstrate how your topic map application either maps the regulation into a legal environment or maps named individuals who win or lose under the regulation (in or out of government).

Pew Research raw survey data now available

Sunday, May 29th, 2011

Pew Research raw survey data now available

Actually the data sets pointed to by FlowingData are part of the Pew Internet & American Life Project.

For all Pew raw data sets, see: Pew Research Center The Databank

Data is available in the following formats:

  1. Raw survey data file in both SPSS and comma-delimited (.csv) formats. To protect the privacy of respondents, telephone numbers, county of residence and zip code have been removed from all public data files.
  2. Cross tabulation file of questions with basic demographics in Word format. Standard demographic categories include sex, race, age, household income, educational attainment, parental status and geographic location (i.e. urban/rural/suburban).
  3. Survey instrument/questionnaire in Word format. The survey questionnaire provides question and response labels for the raw data file. It also includes all interviewer prompts and programming filters for outside researchers who would like to see how our questions are constructed or use our questions in their own surveys.
  4. Topline data file in Word format that includes trend data to previous surveys in which we have asked each question, where applicable.
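Loading the comma-delimited raw data file (format 1 above) is straightforward. A sketch with invented column names and respondents (the real variables are defined in the survey instrument/questionnaire that accompanies each data set):

```python
import csv
import io

# Inline stand-in for a Pew raw survey .csv download.
raw = """respid,sex,age,educ
1001,Female,34,College grad
1002,Male,52,HS grad
"""

rows = list(csv.DictReader(io.StringIO(raw)))

by_sex = {}
for r in rows:  # simple demographic tally of the kind the cross tabs report
    by_sex[r["sex"]] = by_sex.get(r["sex"], 0) + 1

print(by_sex)  # {'Female': 1, 'Male': 1}
```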

As far as I know, the use of topic maps with survey and other data to create “profiles” of particular communities remains unexplored. We may not be able to predict the actions of any individual, but probabilistic predictions about members of a group may be close enough. Interesting. Predicting the actions of any individual may be NP-hard, but it is also irrelevant for most purposes.


Friday, May 27th, 2011


A search engine for data and statistics.

I was puzzled by results containing mostly PDF files until I read:

Zanran doesn’t work by spotting wording in the text and looking for images – it’s the other way round. The system examines millions of images and decides for each one whether it’s a graph, chart or table – whether it has numerical content.

Admittedly you may have difficulty re-using such data but finding it is a big first step. You can then contact the source for the data in a more re-usable form.

From Hints & Helps:

Language. English only please… for now.
Phrase search. You can use double quotes to make phrases (e.g. “mobile phones”).
Vocabulary. We have only limited synonyms – please try different words in your query. And we don’t spell-check … yet.

From the website:

Zanran helps you to find ‘semi-structured’ data on the web. This is the numerical data that people have presented as graphs and tables and charts. For example, the data could be a graph in a PDF report, or a table in an Excel spreadsheet, or a barchart shown as an image in an HTML page. This huge amount of information can be difficult to find using conventional search engines, which are focused primarily on finding text rather than graphs, tables and bar charts.

Put more simply: Zanran is Google for data.

Well said.

Open government sites scrapped due to budget cuts

Wednesday, May 25th, 2011

Open government sites scrapped due to budget cuts

This isn’t so much surprising as it is disappointing. We now know the priority that “open” government has in U.S. budgetary discussions.

I could go on at length about this decision, the people who made it, complete with speculation on their motives, morals and parentage. Unfortunately, that would not restore the funding nor would it be a useful exercise.

As an alternative, let me suggest that everyone select one or two of the data sets that are already available and do something interesting. Something that will catch the imagination of the average citizen. Then credit these government sites as the sources and gently point out that with more funding, there would be more data. And hence more interesting things to see.

Asking someone at the agencies that produce data could result in interesting suggestions. They may lack the time, resources, personnel to do something really creative but with their ideas and your talents…, well, the result could interest the agency and the public. These agencies are the ones fighting on the inside of the public budget process for funding.

What data sets and ideas for those data sets do you think would have the most appeal or impact?

Health: Public-Use Data Files and Documentation

Monday, May 23rd, 2011

Health: Public-Use Data Files and Documentation

While looking for other data files, I ran across this resource.

Public health is always a popular topic (sorry!).

Stats of the Union tells health stories in America

Monday, May 9th, 2011

Stats of the Union tells health stories in America

From news of an iPad app that:

maps the status of health in America. Browse, pan, zoom, and explore through a number of demographics and breakdowns.

I don’t have an iPad (or iPhone) but both are venues of opportunity for topic maps.

It isn’t hard to imagine a topic map that takes the same information in Stats of the Union and adds in data that correlates obesity with the density of fast-food restaurants, making zoning decisions for the same a matter of public health.

To answer the question: “Why are you fat?” with a localized “McDonald’s, Wendy’s, Arby’s, etc.”

Nice visualizations from what I could see on the video.

Just a thought, to personalize the obesity app, you could map in frequent customers who are, ahem, extra large sizes. (With their consent of course. I wouldn’t bother asking McDonalds.)

Perhaps a new slogan: Topic maps, focusing information to a sharp point.

What do you think?

Data visualizations for a changing world

Friday, February 18th, 2011

Data visualizations for a changing world introduced the Google Public Data Explorer.

From the website:

The Google Public Data Explorer makes large datasets easy to explore, visualize and communicate. As the charts and maps animate over time, the changes in the world become easier to understand. You don’t have to be a data expert to navigate between different views, make your own comparisons, and share your findings.
Explore the data

Students, journalists, policy makers and everyone else can play with the tool to create visualizations of public data, link to them, or embed them in their own webpages. Embedded charts and links can update automatically so you’re always sharing the latest available data…

A deeply useful achievement but one that leaves a lot of hard to specify semantics on the cutting room floor.

For example:

  1. What subjects are being visualized? That is, what would I look for additional information about if I wanted to create another visualization?
  2. What relationship between the subjects is being illustrated? That would help in looking for more information relevant to that relationship.
  3. How can I specify either #1 or #2 so that I can pass that along to someone else? (Asynchronous communication or recordation of insights)

PS: Thanks to Benjamin Bock for reminding me to cover this and the data exploration language from Google.

How It Works – The “Musical Brain”

Sunday, February 13th, 2011

How It Works – The “Musical Brain”

I found this following the links in the Million Song Dataset post.

One aspect, among others, that I found interesting, was the support for multiple ID spaces.

I am curious about the claim it works by:

Analyzing every song on the web to extract key, tempo, rhythm and timbre and other attributes — understanding every song in the same way a musician would describe it

Leaving aside the ambitious claims about NLP processing made elsewhere on that page, I find it curious that there is a uniform method for describing music.

Or perhaps they mean that the “Musical Brain” uses only one description uniformly across the music it evaluates. I can buy that. And it could well be a useful exercise.

At least from the perspective of generating raw data that could then be mapped to other nomenclatures used by musicians.

I wonder if Rolling Stone uses the same nomenclature as the “Musical Brain”? I will have to check.

Suggestions for other music description languages? Mappings to the one used by the “Musical Brain?”

BTW, before I forget, the “Musical Brain” offers a free API (for non-commercial use) to its data.

Would appreciate hearing about your experiences with the API.

First, you need to Get the Data – Post

Wednesday, February 9th, 2011

First, you need to Get the Data is a post by Mathew Hurst about a site for asking questions about data sets (and getting answers).

A couple of the questions just to give you an idea about the site:

  • How can I compile a log of Wikipedia articles by date of creation?
  • Are there any indexes of available data sets?

There are useful answers to both of those questions.

Before starting off to build a data set, this is one site to check first.

A listing of sites to check for existing data sets would make a useful chapter in a book on topic maps.

Baltimore – Semi-Transparent or Semi-Opaque?

Thursday, January 27th, 2011

Open Baltimore is leading the way towards semi-transparent or semi-opaque government.

You be the judge.

The City of Baltimore is leading in placing hundreds of data sets online.

But is that being semi-transparent or semi-opaque?

Data sets I would like to see:

  • City contracts, their amounts, and who was successful at bidding on them.
  • Successful bidders not just as corporate names: who owns them? Who works there? What lawyers represent them?
  • What are the relationships, personal, business, etc., between staff, elected officials and anyone who does business with the city?
  • Same questions for school, fire, police and other departments.
  • Code violations, what are they, which inspectors write them, for what locations?
  • Arrests made of who, by which officers, for what crimes, locations and times.
  • etc. (these are illustrations and not an exhaustive list)

Make no mistake, I am grateful for the information the city has already provided.

What they have provided took a lot of work and will be useful for a number of purposes.

But I don’t want people to think that a large number of data sets means transparency.

Transparency involves questions of relevant data and meaningful ways to evaluate it and to connect it to other data.


Thursday, January 27th, 2011

Another free data source. (Commercial plans also available.)

A large number of data sources, and what looks like a friendly number of free API calls while you are building an application.

Observation: Finding one data source or project seems to lead to several others in the same area.

Definitely worth a visit.

PS: The abundance of online data sources opens the door to semantic mappings (can you say topic maps?) that enhance the value of these data sets.

Such as resolving the semantic impedance between the data sets.

Topic map artifacts as commercial products.

The trick is going to be discovering (and resolving) semantic impedances that people are willing to pay to avoid.

DataMarket – Drill Down/Annotate?

Wednesday, January 26th, 2011


From the website:

Find and understand data.

Visualize the world’s economy, societies, nature, and industries, and gain new insights.

100 million time series from the most important data providers, such as the UN, World Bank and Eurostat.

I have just registered for the free account and have started poking about.

This looks deeply awesome!

In addition to being a source of data for analytical tools, I see an opportunity for topic maps to enable a drill-down capacity for such displays.

After all, any point in a time series came from somewhere; for most such data, it should be traceable back to a source file, report, or questionnaire.

And from that file, report, or questionnaire, it should be traceable further back to its author, and further still to the persons reported upon or questioned.

This site definitely has potential for real growth, particularly if they offer tools that enable drill down into data sets to source materials as well as to annotate points in a data set with other materials. Topic maps would excel at both.
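As a rough illustration of the drill-down idea (the class, fields, and sample values are invented, not DataMarket's API), each point in a series could carry its own provenance trail and accept annotations:

```python
from dataclasses import dataclass, field

@dataclass
class DataPoint:
    year: int
    value: float
    source_doc: str                              # report/questionnaire the value came from
    annotations: list = field(default_factory=list)

series = [
    DataPoint(2009, 4.1, "un_yearbook_2009.pdf"),
    DataPoint(2010, 4.4, "un_yearbook_2010.pdf"),
]

def drill_down(point):
    # from the displayed point back to its source material
    return point.source_doc

def annotate(point, note):
    point.annotations.append(note)

annotate(series[1], "Revised estimate; see erratum.")
print(drill_down(series[1]), series[1].annotations)
```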


  1. Register for a free account.
  2. Choose any two data sets and create two visualizations (use screen capture to capture the graphic).
  3. What information would you want to drill down to find or that you would want to use to annotate data points in either visualization? (3-5 pages, no citations)

MIMI Merge Process

Wednesday, January 19th, 2011

Michigan Molecular Interactions

From the website:

MiMI provides access to the knowledge and data merged and integrated from numerous protein interactions databases. It augments this information from many other biological sources. MiMI merges data from these sources with “deep integration” (see The MiMI Merge Process section) into its single database. A simple yet powerful user interface enables you to query the database, freeing you from the onerous task of having to know the data format or having to learn a query language. MiMI allows you to query all data, whether corroborative or contradictory, and specify which sources to utilize.

MiMI displays results of your queries in easy-to-browse interfaces and provides you with workspaces to explore and analyze the results. Among these workspaces is an interactive network of protein-protein interactions displayed in Cytoscape and accessed through MiMI via a MiMI Cytoscape plug-in.

MiMI gives you access to more information than you can get from any one protein interaction source such as:

  • Vetted data on genes, attributes, interactions, literature citations, compounds, and annotated text extracts through natural language processing (NLP)
  • Linkouts to integrated NCIBI tools to: analyze overrepresented MeSH terms for genes of interest, read additional NLP-mined text passages, and explore interactive graphics of networks of interactions
  • Linkouts to PubMed and NCIBI’s MiSearch interface to PubMed for better relevance rankings
  • Querying by keywords, genes, lists or interactions
  • Provenance tracking
  • Quick views of missing information across databases.
I found the site looking for tracking of provenance after merging and then saw the following description of merging:

MIMI Merge Process

Protein interaction data exists in a number of repositories. Each repository has its own data format, molecule identifier, and supplementary information. MiMI assists scientists searching through this overwhelming amount of protein interaction data. MiMI gathers data from well-known protein interaction databases and deep-merges the information.

Utilizing an identity function, molecules that may have different identifiers but represent the same real-world object are merged. Thus, MiMI allows the user to retrieve information from many different databases at once, highlighting complementary and contradictory information.

There are several steps needed to create the final MiMI dataset. They are:

  1. The original source datasets are obtained, and transformed into the MiMI schema, except KEGG, NCBI Gene, Uniprot, Ensembl.
  2. Molecules that can be rolled into a gene are annotated to that gene record.
  3. Using all known identifiers of a merged molecule, sources such as Organelle DB or miBLAST, are queried to annotate specific molecular fields.
  4. The resulting dataset is loaded into a relational database.

Because this is an automated process, and no curation occurs, any errors or misnomers in the original data sources will also exist in MiMI. For example, if a source indicates that the organism is unknown, MiMI will as well.

If you find that a molecule has been incorrectly merged under a gene record, please contact us immediately. Because MiMI is completely automatically generated, and there is no data curation, it is possible that we have merged molecules with gene records incorrectly. If made aware of the error, we can and will correct the situation. Please report any problems of this kind to

Tracking provenance is going to be a serious requirement for mission critical, financial and medical usage topic maps.
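The merge-with-provenance idea described above might be sketched roughly like this (identifiers, source names, and fields are invented for illustration): records are merged whenever they share any known identifier, and a provenance list records which source contributed to each merged record:

```python
records = [
    {"source": "DIP",  "ids": {"P12345"},           "organism": "yeast"},
    {"source": "BIND", "ids": {"P12345", "YAL001"}, "function": "kinase"},
    {"source": "HPRD", "ids": {"Q99999"},           "organism": "human"},
]

def merge_records(records):
    merged = []   # each entry: {"ids": set, "fields": {}, "provenance": []}
    for rec in records:
        # identity function: any shared identifier means the same molecule
        hit = next((m for m in merged if m["ids"] & rec["ids"]), None)
        if hit is None:
            hit = {"ids": set(), "fields": {}, "provenance": []}
            merged.append(hit)
        hit["ids"] |= rec["ids"]
        hit["provenance"].append(rec["source"])
        for k, v in rec.items():
            if k not in ("source", "ids"):
                hit["fields"][k] = v
    return merged

merged = merge_records(records)
print(len(merged), merged[0]["provenance"])
```

Note the last-writer-wins behavior on fields: with no curation, contradictory values from different sources silently overwrite one another, which is exactly why the provenance list matters.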

Scraping for Journalism: A Guide for Collecting Data

Monday, January 17th, 2011

Scraping for Journalism: A Guide for Collecting Data by Dan Nguyen at ProPublica.

I know, it says Journalism in the title. So just substitute topic map wherever you see journalism. 😉

Scraping is a good way to collect data for topic maps or that other activity.

I saw the reference and thought I should pass it on.
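As a small taste of what the guide covers, here is a stdlib-only sketch of the first step in most scrapes: pulling links out of HTML. The sample page is inline; a real scraper would fetch pages with urllib.request and mind the site's terms of use.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

sample = ('<p><a href="/data/contracts.csv">Contracts</a> '
          '<a href="/data/permits.csv">Permits</a></p>')
parser = LinkCollector()
parser.feed(sample)
print(parser.links)
```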