Archive for the ‘Data Source’ Category

Bitly Social Data APIs

Wednesday, January 9th, 2013

Bitly Social Data APIs by Hilary Mason.

From the post:

We just released a bunch of social data analysis APIs over at bitly. I’m really excited about this, as it’s offering developers the power to use social data in a way that hasn’t been available before. There are three types of endpoints and each one is awesome for a different reason.

First, we share the analysis that we do at the link level….

Second, we’ve opened up access to a realtime search engine. …

Finally, we asked the question — what is the world paying attention to right now?…”bursting phrases”…

See Hilary’s post for the details, or even better, take a shot at the APIs!
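A minimal sketch of hitting one of these endpoints, assuming a v3-style URL layout and an OAuth access token; the endpoint path and parameter name here are assumptions, so check the bitly docs for the real ones:

```python
from urllib.parse import urlencode

API_ROOT = "https://api-ssl.bitly.com/v3"  # assumed base URL

def bursting_phrases_url(access_token):
    """Build a request URL for the realtime 'bursting phrases' endpoint.

    The path and parameter name follow v3 API conventions but are
    assumptions; verify against the bitly documentation.
    """
    return "%s/realtime/bursting_phrases?%s" % (
        API_ROOT, urlencode({"access_token": access_token}))

print(bursting_phrases_url("TOKEN"))
# A real call would then be: urllib.request.urlopen(bursting_phrases_url(token))
```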

I first saw this in a tweet by Dave Fauth.

New EU Data Portal [Transparency/Innovation?]

Wednesday, December 26th, 2012

EU Commission unwraps public beta of open data portal with 5800+ datasets, ahead of Jan 2013 launch by Robin Wauters.

The EU Data Portal.

From the post:

Good news for open data lovers in the European Union and beyond: the European Commission on Christmas Eve quietly pushed live the public beta version of its all-new open data portal.

For the record: open data is general information that can be freely used, re-used and redistributed by anyone. In this case, it concerns all the information that public bodies in the European Union produce, collect or pay for (it’s similar to the United States government’s Data.gov).

This could include geographical data, statistics, meteorological data, data from publicly funded research projects, and digitised books from libraries.

The post also quotes the portal website as saying:

This portal is about transparency, open government and innovation. The European Commission Data Portal provides access to open public data from the European Commission. It also provides access to data of other Union institutions, bodies, offices and agencies at their request.

The published data can be downloaded by everyone interested to facilitate reuse, linking and the creation of innovative services. Moreover, this Data Portal promotes and builds literacy around Europe’s data.

Eurostat is the largest data contributor so signs of “transparency” should be there, if anywhere.

The first twenty (20) data sets from Eurostat are:

  • Quarterly cross-trade road freight transport by type of transport (1 000 t, Mio Tkm)
  • Turnover by residence of client and by employment size class for div 72 and 74
  • Generation of waste by sector
  • Standardised incidence rate of accidents at work by economic activity, severity and age
  • At-risk-of-poverty rate of older people, by age and sex (Source: SILC)
  • Telecommunication services: Access to networks (1 000)
  • Production of environmentally harmful chemicals, by environmental impact class
  • Fertility indicators
  • Area under wine-grape vine varieties broken down by vine variety, age of the vines and NUTS 2 regions – Romania
  • Severe material deprivation rate by most frequent activity status (population aged 18 and over)
  • Government bond yields, 10 years’ maturity – monthly data
  • Material deprivation for the ‘Economic strain’ and ‘Durables’ dimensions, by number of item (Source: SILC)
  • Participation in non-formal taught activities within (or not) paid hours by sex and working status
  • Number of persons by working status within households and household composition (1 000)
  • Percentage of all enterprises providing CVT courses, by type of course and size class
  • EU Imports from developing countries by income group
  • Extra-EU imports of feedingstuffs: main EU partners
  • Production and international trade of foodstuffs: Fresh fish and fish products
  • General information about the enterprises
  • Agricultural holders

When I think of government “transparency,” I think of:

  • Who is making the decisions?
  • What are their relationships to the people asking for the decisions? School, party, family, social, etc.
  • What benefits are derived from the decisions?
  • Who benefits from those decisions?
  • What are the relationships between those who benefit and those who decide?
  • Remembering it isn’t the “EU” that makes a decision for good or ill for you.

    Some named individual or group of named individuals, with input from other named individuals, with who they had prior relationships, made those decisions.

    Transparency in government would name the names and relationships of those individuals.

    BTW, I would be very interested to learn what sort of “innovation” you can derive from any of the first twenty (20) data sets listed above.

    The holidays may have exhausted my imagination because I am coming up empty.


FindTheData

Monday, November 19th, 2012

FindTheData


    From the about page:

    At FindTheData, we present you with the facts stripped of any marketing influence so that you can make quick and informed decisions. We present the facts in easy-to-use tables with smart filters, so that you can decide what is best.

    Too often, marketers and pay-to-play sites team up to present carefully crafted advertisements as objective “best of” lists. As a result, it has become difficult and time consuming to distinguish objective information from paid placements. Our goal is to become a trusted source in assisting you in life’s important decisions.

    FindTheData is organized into 9 broad categories

    Each category includes dozens of Comparisons from smartphones to dog breeds. Each Comparison consists of a variety of listings and each listing can be sorted by several key filters or compared side-by-side.

    Traditional search is a great hammer but sometimes you need a wrench.

    Currently search can find any piece of information across hundreds of billions of Web pages, but when you need to make a decision whether it’s choosing the right college or selecting the best financial advisor, you need information structured in an easily comparable format. FindTheData does exactly that. We help you compare apples-to-apples data, side-by-side, on a wide variety of products & services.

    If you think in the same categories as the authors, sorta like using LCSH, you are in like Flynn. If you don’t, well, your mileage may vary.

    While some people may find it convenient to have tables and sorts pre-set for them, it would be nice to be able to download the data files.

    Still, you may find it useful to browse for datasets that are new to you.


    JournalTOCs

    Wednesday, October 24th, 2012

    JournalTOCs


    Most publishers have TOC services for new issues of their journals.

    JournalTOCs aggregates TOCs from publishers and maintains a searchable database of their TOC postings.

    A database that is accessible via a free API I should add.

    The API should be a useful way to add journal articles to a topic map, particularly when you want to add selected articles and not entire issues.
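As a sketch of that workflow, here is a hedged example of fetching a journal's TOC and pulling out article titles; the endpoint pattern is an assumption based on the JournalTOCs site and should be verified there:

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

def toc_url(issn, user_email):
    """Assumed endpoint pattern; check JournalTOCs for the current form."""
    return ("http://www.journaltocs.ac.uk/api/journals/%s?output=articles&user=%s"
            % (quote(issn), quote(user_email)))

def article_titles(rss_text):
    """Pull item titles out of an RSS feed returned by the API."""
    root = ET.fromstring(rss_text)
    return [item.findtext("title") for item in root.iter("item")]

# Stand-in for a real API response:
sample = """<rss><channel>
  <item><title>An article worth mapping</title></item>
  <item><title>Another article</title></item>
</channel></rss>"""
print(article_titles(sample))
```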

    I am looking forward to using and exploring JournalTOCs.

    Suggest you do the same.

    The Cell: An Image Library

    Thursday, August 9th, 2012

    The Cell: An Image Library

    For the casual user, an impressive collection of cell images.

    For the professional user, the advanced search page gives you an idea of the depth of images in this collection.

    A good source of images for curated (not “mash up”) alignment with other materials. Such as instructional resources on biology or medicine.

    Olympic medal winners: every one since 1896 as open data

    Thursday, July 5th, 2012

    Olympic medal winners: every one since 1896 as open data

    The Guardian Datablog has posted Olympic medal winner data for download.

    Admitting to some preference I was pleased to see that OpenDocument Format was one of the download choices. 😉

    It may just be my ignorance of Olympic events but it seems odd for the gender of competitors to be listed along with the gender of the event?

    A brief history of Olympic Sports (from Wikipedia). Military patrol was a demonstration sport in 1928, 1936 and 1948. Is that likely to make a return in 2016? Or would terrorist spotting be more appropriate?

    Open Content (Index Data)

    Tuesday, June 12th, 2012

    Open Content

    From the webpage:

    The searchable indexes below expose public domain ebooks, open access digital repositories, Wikipedia articles, and miscellaneous human-cataloged Internet resources. Through standard search protocols, you can make these resources part of your own information portals, federated search systems, catalogs etc. Connection instructions for SRU and Z39.50 are provided. If you have comments, questions, or suggestions for resources you would like us to add, please contact us, or consider joining the mailing list. This service is powered by Index Data’s Zebra and Metaproxy.

    Looking around after reading the post on the interview with Sebastian Hammer on Federated Search I found this listing of resources.

    Database name (#records): Description

    • gutenberg (22,194): Project Gutenberg. High-quality clean-text ebooks, some audio-books.
    • oaister (9,988,376): OAIster. A union catalog of digital resources, chiefly open archives of journals, etc.
    • oca-all (135,673): All of the ebooks made available by the Internet Archive as part of the Open Content Alliance (OCA). Includes high-quality, searchable PDFs, online book-readers, audio books, and much more. Excludes the Gutenberg sub-collection, which is available as a separate database.
    • oca-americana (49,056): The American collection of the Open Content Alliance.
    • oca-iacl (669): The Internet Archive Children’s Library. Books for children from around the world.
    • oca-opensource (2,616): Collection of community-contributed books at the Internet Archive.
    • oca-toronto (37,241): The Canadian Libraries collection of the Open Content Alliance.
    • oca-universallibrary (30,888): The Universal Library, a digitization project founded at Carnegie-Mellon University. Content hosted at the Internet Archive.
    • wikipedia (1,951,239): Titles and abstracts from Wikipedia, the open encyclopedia.
    • wikipedia-da (66,174): The Danish Wikipedia. Many thanks to Fujitsu Denmark for their support for the indexing of the national Wikipedias.
    • wikipedia-sv (243,248): The Swedish Wikipedia.
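Because SRU is plain HTTP, any of the databases above can be queried with a hand-built URL. A sketch, with a placeholder base URL (the real host and database path come from Index Data's connection instructions):

```python
from urllib.parse import urlencode

def sru_search_url(base, query, max_records=5):
    """Build an SRU 1.1 searchRetrieve URL for one of the listed indexes.

    The base URL here is a placeholder; substitute the host and database
    path from the Index Data connection instructions.
    """
    params = urlencode({
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": query,
        "maximumRecords": max_records,
    })
    return "%s?%s" % (base, params)

print(sru_search_url("http://example.org/gutenberg", "dc.title=whale"))
```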

    Latency is an issue but I wonder what my reaction would be if a search quickly offered 3 or 4 substantive resources and invited me to read/manipulate them, while it seeks additional information/data?

    Most of the articles you see cited in this blog aren’t the sort of thing you can skim and some take more than one pass to jell.

    I suppose I could be offered 50 highly relevant articles in milliseconds but I am not capable of assimilating them that quickly.

    So how many resources have been wasted to give me a capacity I can’t effectively use?

    From Tweets to Results: How to obtain, mine, and analyze Twitter data

    Thursday, May 31st, 2012

    From Tweets to Results: How to obtain, mine, and analyze Twitter data by Derek Ruths (McGill University).


    Since Twitter’s creation in 2006, it has become one of the most popular microblogging platforms in the world. By virtue of its popularity, the relative structural simplicity of Twitter posts, and a tendency towards relaxed privacy settings, Twitter has also become a popular data source for research on a range of topics in sociology, psychology, political science, and anthropology. Nonetheless, despite its widespread use in the research community, there are many pitfalls when working with Twitter data.

    In this day-long workshop, we will lead participants through the entire Twitter-based research pipeline: from obtaining Twitter data all the way through performing some of the sophisticated analyses that have been featured in recent high-profile publications. In the morning, we will cover the nuts and bolts of obtaining and working with a Twitter dataset including: using the Twitter API, the firehose, and rate limits; strategies for storing and filtering Twitter data; and how to publish your dataset for other researchers to use. In the afternoon, we will delve into techniques for analyzing Twitter content including constructing retweet, mention, and follower networks; measuring the sentiment of tweets; and inferring the gender of users from their profiles and unstructured text.

    We assume that participants will have little to no prior experience with mining Twitter or other social network datasets. As the workshop will be interactive, participants are encouraged to bring a laptop. Code examples and exercises will be given in Python, thus participants should have some familiarity with the language. However, all concepts and techniques covered will be language-independent, so any individual with some background in scripting or programming will benefit from the workshop.
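As a taste of the afternoon material, a mention network can be built from nothing more than (author, text) pairs. A toy sketch; real input would come from the Twitter API, not a hard-coded list:

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")

def mention_edges(tweets):
    """Count directed author -> mentioned-user edges from (author, text) pairs."""
    edges = Counter()
    for author, text in tweets:
        for mentioned in MENTION.findall(text):
            edges[(author, mentioned.lower())] += 1
    return edges

sample = [
    ("alice", "Great slides from @bob on rate limits"),
    ("alice", "cc @bob @carol"),
    ("carol", "agreed with @Bob"),
]
print(mention_edges(sample))
```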

    Any plans to use Twitter feeds for your topic maps?

    I first saw a reference to this workshop at: Do you haz teh (twitter) codez? by Ricard Nielson.

    The Data Lifecycle, Part One: Avroizing the Enron Emails

    Friday, May 25th, 2012

    The Data Lifecycle, Part One: Avroizing the Enron Emails by Russell Jurney.

    From the post:

    Series Introduction

    This is part one of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

    The Berkeley Enron Emails

    In this project we will convert a MySQL database of Enron emails into Avro document format for analysis on Hadoop with Pig. Complete code for this example is available here on GitHub.

    Email is a rich source of information for analysis by many means. During the investigation of the Enron scandal of 2001, 517,431 messages from 114 inboxes of key Enron executives were collected. These emails were published and have become a common dataset for academics to analyze document collections and social networks. Andrew Fiore and Jeff Heer at UC Berkeley have cleaned this email set and provided it as a MySQL archive.
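For a sense of what "Avroizing" involves, here is a guessed-at Avro record schema for one email, written as a Python dict; the actual schema used in Jurney's series may differ:

```python
import json

# A minimal Avro record schema for one email, guessed from the post's
# description; field names here are illustrative assumptions.
email_schema = {
    "type": "record",
    "name": "Email",
    "fields": [
        {"name": "message_id", "type": "string"},
        {"name": "from",       "type": "string"},
        {"name": "to",         "type": {"type": "array", "items": "string"}},
        {"name": "date",       "type": "string"},
        {"name": "subject",    "type": ["null", "string"]},
        {"name": "body",       "type": ["null", "string"]},
    ],
}
print(json.dumps(email_schema, indent=2))
# With the avro (or fastavro) package installed, this schema can be used to
# serialize rows pulled from the MySQL archive into an Avro container file.
```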

    We hope that this dataset can become a sort of common set for examples and questions, as anonymizing one’s own data in public forums can make asking questions and getting quality answers tricky and time consuming.

    More information about the Enron Emails is available online.

    Covering the data lifecycle in any detail is a rare event.

    To do so with a meaningful data set is even rarer.

    You will get the maximum benefit from this series by “playing along” and posting your comments and observations.

    1 Billion Pages Visited In 2012

    Wednesday, May 23rd, 2012

    The ClueWeb12 project reports:

    The Lemur Project is creating a new web dataset, tentatively called ClueWeb12, that will be a companion or successor to the ClueWeb09 web dataset. This new dataset is expected to be ready for distribution in June 2012. Dataset construction consists of crawling the web for about 1 billion pages, web page filtering, and organization into a research-ready dataset.

    The crawl was initially seeded with 2,820,500 unique URLs. This list was generated by taking the 10 million ClueWeb09 URLs that had the highest PageRank scores, and then removing any page that was not in the top 90% of pages as ranked by Waterloo spam scores (i.e., least likely to be spam). Two hundred sixty-two (262) seeds were added from the most popular sites in English-speaking countries, as reported by Alexa. The number of sites selected from each country depended on its relative population size, for example, United States (71.0%), United Kingdom (14.0%), Canada (7.7%), Australia (5.2%), Ireland (3.8%), and New Zealand (3.7%). Finally, Charles Clark, University of Waterloo, provided 5,950 seeds specific to travel sites.

    A blacklist was used to avoid sites that are reported to distribute pornography, malware, and other material that would not be useful in a dataset intended to support a broad range of research on information retrieval and natural language understanding. The blacklist was obtained from a commercial managed URL blacklist service, which was downloaded on 2012-02-03. The crawler blacklist consists of URLs in the malware, phishing, spyware, virusinfected, filehosting and filesharing categories. Also included in the blacklist is a small number (currently less than a dozen) of sites that opted out of the crawl.

    The crawled web pages will be filtered to remove certain types of pages, for example, pages that a text classifier identifies as non-English, pornography, or spam. The dataset will contain a file that identifies each url that was removed and why it was removed. The web graph will contain all pages visited by the crawler, and will include information about redirected links.

    The crawler captures an average of 10-15 million pages (and associated images, etc.) per day. Its progress is documented in a daily progress report.
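The per-country seed split can be reconstructed from the percentages quoted above; a sketch (these are relative shares, not exact counts, so the real ClueWeb12 numbers may round differently):

```python
# Divide the 262 English-speaking-country seeds in proportion to the
# listed relative-population percentages.
shares = {
    "United States": 71.0, "United Kingdom": 14.0, "Canada": 7.7,
    "Australia": 5.2, "Ireland": 3.8, "New Zealand": 3.7,
}
total = sum(shares.values())
seeds = {c: round(262 * pct / total) for c, pct in shares.items()}
print(seeds, sum(seeds.values()))
```

Note the rounded counts need not sum to exactly 262; any real allocation scheme has to decide where the remainder goes.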

    Will we see search engine ads boasting: X billion pages crawled?

    Health Care Cost Institute

    Tuesday, May 22nd, 2012

    Health Care Cost Institute

    I can’t give you a clean URL but on Monday (21 May 2012), the Washington Post ran a story on the Health Care Cost Institute, which had the following quotes:

    This morning a new nonprofit called the Health Care Cost Institute will roll out a database of 5 billion health insurance claims (all stripped of the individual health plan’s identity, to address privacy concerns).

    This is the first study to use the HCCI data, although more are in the works. Gaynor has been inundated with about 130 requests from health policy researchers to use the database. While his team sifts through those, three approved studies are already tackling big health policy questions.

    “There is immense interest in gaining access,” says HCCI executive director David Newman. “We’re having trouble keeping up with that.” (emphasis added)

    Sorry, that went by a little fast. The data has already been scrubbed, so why make the Health Care Cost Institute a choke point for it?

    Spin it up to one or more clouds that support free public storage for data sets of public interest.

    The problem of sorting through access requests is solved.

    Just maybe researchers will want to address other questions, ones that aren’t necessarily about costs. And/or combine this data with other data. Like data on local pollution. (Although you would need historical data to make that work.)

    Mapping this data set to other data sets could only magnify its importance.

    Many thanks are owed to the Health Care Cost Institute for securing the data set.

    But our thanks should not include electing the HCCI as censor of uses of this data set.

    TREC Document Review Project on Hiatus, Recommind Asked to Withdraw

    Monday, May 14th, 2012

    TREC Document Review Project on Hiatus, Recommind Asked to Withdraw

    From the post:

    TREC Legal Track — part of the U.S. government’s Text Retrieval Conference — announced last week that the 2012 edition of its annual document review project for testing new systems is canceled, while prominent e-discovery software company Recommind confirmed that it’s been asked to leave the project for prematurely sharing results.

    These difficulties highlight the need for:

    • open data sets and
    • protocols for reporting of results as they occur.

    That requires a data set with relevance judgments and other work.

    Have you thought about the: Open Relevance Project at the Apache Foundation?

    Email archives from Apache projects, the backbone of the web as we know it, are ripe for your contributions.

    Let me be the first to ask Recommind to join in building a public data set for everyone.

    Trawling the web for socioeconomic data? Look no further than Knoema

    Thursday, May 10th, 2012

    Trawling the web for socioeconomic data? Look no further than Knoema

    From the Guardian Datablog, John Burn-Murdoch writes:

    A joint venture by Russian and Indian technology professionals aims to be the YouTube of data. Knoema, which launched last month and is marketed by its creators as “your personal knowledge highway”, combines data-gathering with presentation to create an online bank of socioeconomic and environmental data-sets.

    The website’s homepage shows a selection of the topics on which Knoema has collected data. Among the categories are broad fields such as commodities and energy, but also more specialised collections including sexual exploitation and biofuels.

    [graphics omitted]

    Within each subject-area you can find one or more ‘dashboards’ – simple yet comprehensive presentations of data for a given topic, with all source-material documented. Knoema also provides choropleth maps for many of the datasets where figures are given for geographical areas.

    ‘Commodity passports’ are another format in which Knoema offers some of its data. These give a detailed breakdown of production, consumption, imports, exports and market prices for a diverse range of products and materials including apples, cotton and natural gas.

    Resource listings following the site review, including the Guardian’s world government data gateway and other resources.

    CNN Transcript Collection (2000-2012)

    Thursday, May 10th, 2012

    CNN Transcript Collection (2000-2012)

    From the webpage:

    For over a decade, CNN (Cable News Network) has been providing transcripts of shows, events and newscasts from its broadcasts. The archive has been maintained and the text transcripts have been dependably available. This is a just-in-case grab of the years of transcripts for later study and historical research.

    Suggested transcript sources for other broadcast media?

    Seen at Nathan Yau’s Flowing Data.

    46 Research APIs: DataUnison, Mendeley, LexisNexis and Zotero

    Sunday, April 29th, 2012

    46 Research APIs: DataUnison, Mendeley, LexisNexis and Zotero by Wendell Santos.

    From the post:

    Our API directory now includes 46 research APIs. The newest is the Globus Online Transfer API. The most popular, in terms of mashups, is the Mendeley API. We list 3 Mendeley mashups. Below you’ll find some more stats from the directory, including the entire list of research APIs.

    I did see an API that accepts Greek strings and returns Latin transliteration. Oh, doesn’t interest you. 😉

    There are a number of bibliography, search and related tools.

    I am sure you will find something to enhance an academic application of topic maps.

    Campaign Finance Data in Real Time

    Saturday, March 10th, 2012

    Campaign Finance Data in Real Time by Derek Willis.

    From the post:

    Political campaigns can change every day. The Campaign Finance API now does a better job of keeping pace.

    We worked with ProPublica, one of the heaviest users of the API, to make the API more real-time, and to surface more data, such as itemized contributions for every presidential candidate and “super PAC”.

    When the API was launched, most of the data it served up was updated every week or, in some cases, on a daily basis. But we work for news organizations, and what is news right now can be old news tomorrow. Committees that raise and spend money influencing federal elections are filing reports every day, not just on the day that reports are due.

    If you are mapping political data, the New York Times is a real treasure trove of information.

    Read this post for more details on real time campaign finance data.

    Particle Physics – Stanford

    Monday, February 13th, 2012

    Leonard Susskind lectures on particle physics. Like astronomy (both optical and radio), particle physics was a leading source of “big data” before there was “big data.”

    Particle Physics: Basic Concepts

    Particle Physics: Standard Model

    Interesting in its own right, another field for testing data mining software.

    Finding Data on the Internet

    Tuesday, February 7th, 2012

    Finding Data on the Internet

    From the post:

    What I would like is a nice list of all of credible sources on the Internet for finding data to use with R projects. I know that this is a crazy idea, not well formulated (what are data after all) and loaded with absurd computational and theoretical challenges. (Why can’t I just google “data R” and get what I want?) So, what can I do? As many people are also out there doing, I can begin to make lists (in many cases lists of lists) on a platform that is stable enough to survive and grow, and perhaps encourage others to help with the effort.

    Here follows a list of data sources that may easily be imported into R. If an (R) appears after source this means that the data are already in R format or there exist R commands for directly importing the data from R. (See for some code.) Otherwise, I have limited the list to data sources for which there is a reasonably simple process for importing csv files. What follows is a list of data sources organized into categories that are not mutually exclusive but which reflect what’s out there.

    Useful listing of data sources for R, but you could use them with any SQL, NoSQL, SQL-NoSQL hybrid, or topic map as well.
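For readers outside R, the same CSV sources import just as easily elsewhere; a minimal Python pattern (swap the in-memory string for urllib.request.urlopen on a real URL):

```python
import csv
import io

# Tiny stand-in for a downloaded CSV file:
raw = """country,year,value
FR,2011,3.2
DE,2011,4.1
"""
rows = list(csv.DictReader(io.StringIO(raw)))
values = [float(r["value"]) for r in rows]
print(rows[0]["country"], sum(values))
```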

    Title probably should be: “Data Found on the Internet.” Finding data is a more difficult proposition.

    Curious: Is there a “data crawler” that attempts to crawl websites of governments and the usual suspects for new data sets?

    Web Data Commons

    Tuesday, January 24th, 2012

    Web Data Commons: Extracting Structured Data from the Common Web Crawl

    From the post:

    Web Data Commons will extract all Microformat, Microdata and RDFa data that is contained in the Common Crawl corpus and will provide the extracted data for free download in the form of RDF-quads as well as CSV-tables for common entity types (e.g. product, organization, location, …).

    We are finished with developing the software infrastructure for doing the extraction and will start an extraction run for the complete Common Crawl corpus once the new 2012 version of the corpus becomes available in February. For testing our extraction framework, we have extracted structured data out of 1% of the currently available Common Crawl corpus dating October 2010. The results of this extraction run are provided below. We will provide the data from the complete 2010 corpus together with the data from the 2012 corpus in order to enable comparisons on how data provision has evolved within the last two years.
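The RDF-quad downloads are line-oriented N-Quads, so a quick look needs only a loose line splitter; a sketch (use a real parser such as RDFLib for anything serious):

```python
import re

# Very loose N-Quads line splitter; adequate for eyeballing dumps only.
QUAD = re.compile(r'^(\S+)\s+(\S+)\s+(.+?)\s+(\S+)\s*\.\s*$')

def parse_quad(line):
    """Split one N-Quads line into (subject, predicate, object, graph)."""
    m = QUAD.match(line)
    if not m:
        raise ValueError("not an N-Quads line: %r" % line)
    return m.groups()

line = ('<http://example.org/p1> <http://schema.org/name> '
        '"Acme Widget" <http://example.org/page.html> .')
print(parse_quad(line))
```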

    An interesting mining of open data.

    The ability to perform comparisons on data over time is particularly interesting.

    USASpending.gov

    Tuesday, January 24th, 2012

    USASpending.gov

    This website is required by the “Federal Funding Accountability and Transparency Act (Transparency Act).”

    The FAQ describes its purpose as:

    To provide the public with information about how their tax dollars are spent. Citizens have a right and a need to understand where tax dollars are spent. Collecting data about the various types of contracts, grants, loans, and other types of spending in our government will provide a broader picture of the Federal spending processes, and will help to meet the need of greater transparency. The ability to look at contracts, grants, loans, and other types of spending across many agencies, in greater detail, is a key ingredient to building public trust in government and credibility in the professionals who use these agreements.

    An amazing amount of data which can be searched or browsed in a number of ways.

    It is missing one ingredient that would change it from an amazing information resource to a game changing information resource, you.

    The site can only report information known to the federal government and covered by the Transparency Act.

    For example, it can’t report on family or personal relationships between various parties to contracts or even offer good (or bad) information on performance on contracts or methods used by contractors.

    However, a topic map (links into this site are stable) could combine this information with other information quite easily.

    I ran across this site in Analyzing US Government Contract Awards in R by Vik Paruchuri. A very good article that scratches the surface of mining this content.

    Monthly Twitter activity for all members of the U.S. Congress

    Wednesday, January 11th, 2012

    Monthly Twitter activity for all members of the U.S. Congress by Drew Conway.

    From the post:

    Many months ago I blogged about the research that John Myles White and I are conducting on using Twitter data to estimate an individual’s political ideology. As I mentioned then, we are using the Twitter activity of members of the U.S. Congress to build a training data set for our model. A large part of the effort for this project has gone into designing a system to systematically collect the Twitter data on the members of the U.S. Congress.

    Today I am pleased to announce that we have worked out most of the bugs, and now have a reliable data set upon which to build. Better still, we are ready to share. Unlike our old system, the data now lives on a live CouchDB database, and can be queried for specific research tasks. We have combined all of the data available from Twitter’s search API with the information on each member from Sunlight Foundation’s Congressional API.
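Since the data lives in CouchDB, it can be queried over plain HTTP; a sketch of building a view query URL, with made-up database and view names (the post doesn't give the real ones):

```python
import json
from urllib.parse import urlencode, quote

def view_url(server, db, design, view, key=None):
    """Build a CouchDB _view URL; keys are JSON-encoded per CouchDB's API."""
    url = "%s/%s/_design/%s/_view/%s" % (server, quote(db), design, view)
    if key is not None:
        url += "?" + urlencode({"key": json.dumps(key)})
    return url

print(view_url("http://localhost:5984", "congress_tweets",
               "tweets", "by_member", key="SenJohnMcCain"))
# Fetching that URL (urllib.request.urlopen) returns the view rows as JSON.
```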

    Looks like an interesting data set to match up to the ages of addresses doesn’t it?

    ProgrammableWeb – New APIs

    Thursday, December 22nd, 2011

    70 New APIs: Google Affiliate Network, Visual Search and Mobile App Sales Tracking by Wendell Santos.

    In a post dated 18 December 2011, ProgrammableWeb reports:

    This week we had 70 new APIs added to our API directory including an audio fingerprinting service, sentiment analysis and analytics service, affiliate marketing network, mobile app sales tracking service, visual search service and an eCommerce service. In addition we covered a “mobile engagement” platform adding revenue analytics to their service. Below are more details on each of these new APIs.

    I have a question: ProgrammableWeb lists 4657 APIs (as of 22 December 2011, about 6:30 PM East Coast time) with seven (7) filters: Keywords, Category, Company, Protocols/Styles, Data Format, Date, Managed By. How easy/hard is that to use? Care to guess where the break point will come in terms of ease of use?

    For example, choosing “government” as a category results in 154 APIs, a very uneven listing that runs from Leipzig city data to Brazilian election candidate information to words used in the U.S. Congress. Minimal organization by country would be nice.

    Introducing the Events API

    Sunday, December 18th, 2011

    Introducing the Events API by Brian Balser.

    From the post:

    This past November, The New York Times launched the Arts & Entertainment Guide, an interactive guide to noteworthy cultural events in and around New York City. The application lets you browse through a hand-selected listing of events, customizing the view based on date range, category and location.

    At our annual Hack Day we made the Event Listings API, used by the interactive guide, publicly available to the developer community on the NYTimes Developer Network. The API supports three types of search: spatial, faceted and full-text. Each can be used separately or in conjunction in order to find events by different sets of criteria.

    If the twenty-two million metro area population doesn’t sound like a large enough market, consider that New York City is projected to have fifty (50) million visitors in 2012.

    Topic maps that merge data from this feed and conference websites seem a likely early use of this data. But more creative uses are certainly possible.

    What would you suggest?


    Factlab

    Thursday, November 24th, 2011

    Factlab

    From the webpage:

    Factlab collects official stats from around the world, bringing together the World Bank, UN, the EU and the US Census Bureau. How does it work for you – and what can you do with the data?

    From the guardian in the UK.

    Very impressive and interactive site.

    Don’t agree with their philosophical assumptions about “facts,” but none the less, a number of potential clients do. So long as they are paying the freight, facts they are. 😉

    Department of Numbers

    Thursday, October 27th, 2011

    Department of Numbers

    From the webpage:

    The Department of Numbers contextualizes public data so that individuals can form independent opinions on everyday social and economic matters.

    A possible source of both data and analysis of public interest. It should be easier to attract attention to topic maps that address current issues.

    Royal Society Journal Archive

    Wednesday, October 26th, 2011

    Royal Society Journal Archive – Free Permanent Access

    From the announcement:

    The Royal Society has today announced that its world-famous historical journal archive – which includes the first ever peer-reviewed scientific journal – has been made permanently free to access online.

    So, if you search for information using modern terminology, are you going to pick up materials from 10, 50, 100, 300 years ago?

    Where do you think the break point will be on terminology?

    Here’s my suggestion:

    1. We will pair off in two person teams. The A teams will research a subject and record their queries, along with an estimate of when the language changed.
    2. The B teams will take the A team results and try to show that the estimate for when the language changed is incorrect. (too early, too late, never, etc.)
    3. Prizes will be awarded for the best results as well as the most interesting queries, subjects and odd facts learned along the way.
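    The A-team step could be sketched as expanding a modern query with period vocabulary before searching the archive. The term pairs and cutoff dates below are illustrative guesses on my part, not researched values; the exercise is precisely to find out where the real break points are.

    ```python
    # Map a modern term to (older term, rough year before which the older
    # term dominates). All entries here are hypothetical placeholders.
    PERIOD_TERMS = {
        "physics": [("natural philosophy", 1850)],
        "scientist": [("natural philosopher", 1834)],
        "oxygen": [("dephlogisticated air", 1790)],
    }

    def query_variants(term, year):
        """Return the search terms worth trying for material from a given year."""
        variants = [term]
        for old_term, cutoff in PERIOD_TERMS.get(term, []):
            if year <= cutoff:
                variants.append(old_term)
        return variants

    print(query_variants("physics", 1780))
    print(query_variants("physics", 1900))
    ```

    The B teams would then attack the cutoff years in `PERIOD_TERMS`, moving them earlier or later as the archive evidence demands.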


    The Guardian Datablog

    Tuesday, October 25th, 2011

    The Guardian Datablog

    From the Guardian in the UK. If you don’t know the Guardian, you are missing a real treat.

    The Datablog offers visualizations of facts that otherwise may be difficult to grasp or that become more compelling in graphic form.

    Browse around and you will find a number of interesting resources, such as a listing of all the visualizations for the last 2 years and information on how they make data available.

    Some news outlets in the U.S., such as the New York Times, have similar efforts but I don’t know of any that are quite this good. Suggestions anyone?

    This and similar resources should give you ideas on how to visualize information to discover information and subjects for your topic maps as well as ways that you can present topic map data more effectively to your users.

    1. Choose one visualization from the Guardian and explain what advantages it offers over a simple table layout of the same information. 2-3 pages (no citations)
    2. What other information sets could be effectively displayed using a technique similar to #1? What would be different about it over table display? 2-3 pages (no citations)
    3. What are the limitations of the visualization you have chosen for #2? 2-3 pages (no citations)

    Processing every Wikipedia article

    Saturday, October 22nd, 2011

    Processing every Wikipedia article by Gareth Lloyd.

    From the post:

    I thought it might be worth writing a quick follow up to the Wikipedia Visualization piece. Being able to parse and process all of Wikipedia’s articles in a reasonable amount of time opens up fantastic opportunities for data mining and analysis. What’s more, it’s easy once you know how.

    An alternative method for accessing and parsing Wikipedia data. The visualization piece itself probably deserves a separate post.
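    The usual trick for processing every article in reasonable time is to stream the dump rather than load it. A minimal sketch using the standard library's incremental XML parser; the inline sample mimics the MediaWiki dump structure (real dumps add an XML namespace and ship as `.xml.bz2`, so you would point `iter_articles` at `bz2.open(path, "rb")` and handle the namespace):

    ```python
    import xml.etree.ElementTree as ET
    from io import BytesIO

    # Tiny stand-in for a MediaWiki dump file.
    SAMPLE = b"""<mediawiki>
      <page><title>Alpha</title><revision><text>First article.</text></revision></page>
      <page><title>Beta</title><revision><text>Second article.</text></revision></page>
    </mediawiki>"""

    def iter_articles(source):
        """Yield (title, text) for each <page>, keeping memory use flat."""
        for _, elem in ET.iterparse(source, events=("end",)):
            if elem.tag == "page":
                yield elem.findtext("title"), elem.findtext("revision/text")
                elem.clear()  # discard the processed page element

    titles = [title for title, _ in iter_articles(BytesIO(SAMPLE))]
    print(titles)
    ```

    The `elem.clear()` call is what makes a multi-gigabyte dump tractable: without it, the tree for every page you have already seen stays in memory.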


    New food web dataset

    Tuesday, October 18th, 2011

    New food web dataset

    From the post:

    So, there is a new food web dataset out that was put in Ecological Archives here, and I thought I would play with it. The food web is from Otago Harbour, an intertidal mudflat ecosystem in New Zealand. The web contains 180 nodes, with 1,924 links.

    Fun stuff…

    Interesting visuals but do you find that they help you understand the data?

    That is an important question for topic map visualization, because you can make the nodes and associations between them jump, jitter, blink (shades of the browser wars), or zoom in and out. If I am playing “Idiotfield 3” or the like, that might be interesting.

    But the question for topic maps, or any information system, is whether the visualization helps me find or understand the underlying data.

    What do you think here? Data is available. What would you do differently?
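    One alternative to flashy visuals is to start with basic structural statistics. A sketch in plain Python; the toy edge list (resource → consumer) stands in for the Otago Harbour web's 180 nodes and 1,924 links:

    ```python
    from collections import Counter

    # Toy food web: each link is (resource, consumer).
    links = [("algae", "snail"), ("algae", "crab"), ("snail", "crab"), ("crab", "gull")]

    species = {s for link in links for s in link}
    S, L = len(species), len(links)
    connectance = L / S ** 2  # fraction of possible directed links realized

    consumers = Counter(consumer for _, consumer in links)  # diet breadth per consumer

    print(f"S={S}, L={L}, connectance={connectance:.3f}")
    print(consumers.most_common())
    ```

    On the real dataset the same two lines give S=180, L=1924, and a connectance of about 0.059, a number that summarizes the web in a way no amount of jittering nodes can.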

    Ecological Society of America (ESA) – Ecological Archives

    Tuesday, October 18th, 2011

    Ecological Society of America (ESA) – Ecological Archives

    If you are interested in ecological data for use with topic maps, this looks like a good place to start.

    The available digital files/supplements to published papers go back to 1982.

    Published by the Ecological Society of America.