Data Source « Another Word For It

January 9, 2013

Bitly Social Data APIs

Filed under: Bitly,Data Source,Search Data — Patrick Durusau @ 12:02 pm

From the post:

We just released a bunch of social data analysis APIs over at bitly. I’m really excited about this, as it’s offering developers the power to use social data in a way that hasn’t been available before. There are three types of endpoints and each one is awesome for a different reason.

First, we share the analysis that we do at the link level….

Second, we’ve opened up access to a realtime search engine. …

Finally, we asked the question — what is the world paying attention to right now?…”bursting phrases”…

See Hilary’s post for the details, or even better, take a shot at the APIs!

I first saw this in a tweet by Dave Fauth.

Comments Off

December 26, 2012

New EU Data Portal [Transparency/Innovation?]

Filed under: Data,Data Source,EU,Transparency — Patrick Durusau @ 2:30 pm

EU Commission unwraps public beta of open data portal with 5800+ datasets, ahead of Jan 2013 launch by Robin Wauters.

The EU Data Portal.

From the post:

Good news for open data lovers in the European Union and beyond: the European Commission on Christmas Eve quietly pushed live the public beta version of its all-new open data portal.

For the record: open data is general information that can be freely used, re-used and redistributed by anyone. In this case, it concerns all the information that public bodies in the European Union produce, collect or pay for (it’s similar to the United States government’s Data.gov).

This could include geographical data, statistics, meteorological data, data from publicly funded research projects, and digitised books from libraries.

The post also quotes the portal website as saying:

This portal is about transparency, open government and innovation. The European Commission Data Portal provides access to open public data from the European Commission. It also provides access to data of other Union institutions, bodies, offices and agencies at their request.

The published data can be downloaded by everyone interested to facilitate reuse, linking and the creation of innovative services. Moreover, this Data Portal promotes and builds literacy around Europe’s data.

Eurostat is the largest data contributor so signs of “transparency” should be there, if anywhere.

The first twenty (20) data sets from Eurostat are:

Quarterly cross-trade road freight transport by type of transport (1 000 t, Mio Tkm)
Turnover by residence of client and by employment size class for div 72 and 74
Generation of waste by sector
Standardised incidence rate of accidents at work by economic activity, severity and age
At-risk-of-poverty rate of older people, by age and sex (Source: SILC)
Telecommunication services: Access to networks (1 000)
Production of environmentally harmful chemicals, by environmental impact class
Fertility indicators
Area under wine-grape vine varieties broken down by vine variety, age of the vines and NUTS 2 regions – Romania
Severe material deprivation rate by most frequent activity status (population aged 18 and over)
Government bond yields, 10 years’ maturity – monthly data
Material deprivation for the ‘Economic strain’ and ‘Durables’ dimensions, by number of item (Source: SILC)
Participation in non-formal taught activities within (or not) paid hours by sex and working status
Number of persons by working status within households and household composition (1 000)
Percentage of all enterprises providing CVT courses, by type of course and size class
EU Imports from developing countries by income group
Extra-EU imports of feedingstuffs: main EU partners
Production and international trade of foodstuffs: Fresh fish and fish products
General information about the enterprises
Agricultural holders

When I think of government “transparency,” I think of:

Who is making the decisions?

What are their relationships to the people asking for the decisions? School, party, family, social, etc.

What benefits are derived from the decisions?

Who benefits from those decisions?

What are the relationships between those who benefit and those who decide?

Remembering it isn’t the “EU” that makes a decision for good or ill for you.

Some named individual or group of named individuals, with input from other named individuals with whom they had prior relationships, made those decisions.

Transparency in government would name the names and relationships of those individuals.

BTW, I would be very interested to learn what sort of “innovation” you can derive from any of the first twenty (20) data sets listed above.

The holidays may have exhausted my imagination because I am coming up empty.

November 19, 2012

FindTheData

Filed under: Data,Data Source,Dataset — Patrick Durusau @ 7:03 pm

FindTheData

From the about page:

At FindTheData, we present you with the facts stripped of any marketing influence so that you can make quick and informed decisions. We present the facts in easy-to-use tables with smart filters, so that you can decide what is best.

Too often, marketers and pay-to-play sites team up to present carefully crafted advertisements as objective “best of” lists. As a result, it has become difficult and time consuming to distinguish objective information from paid placements. Our goal is to become a trusted source in assisting you in life’s important decisions.

FindTheData is organized into 9 broad categories

Each category includes dozens of Comparisons from smartphones to dog breeds. Each Comparison consists of a variety of listings and each listing can be sorted by several key filters or compared side-by-side.

Traditional search is a great hammer but sometimes you need a wrench.

Currently search can find any piece of information across hundreds of billions of Web pages, but when you need to make a decision whether it’s choosing the right college or selecting the best financial advisor, you need information structured in an easily comparable format. FindTheData does exactly that. We help you compare apples-to-apples data, side-by-side, on a wide variety of products & services.

If you think in the same categories as the authors, sorta like using LCSH, you are in like Flint. If you don’t, well, your mileage may vary.

While some people may find it convenient to have tables and sorts pre-set for them, it would be nice to be able to download the data files.

Still, you may find it useful to browse for datasets that are new to you.

October 24, 2012

JournalTOCs

Filed under: Data Source,Library,Library software,Publishing — Patrick Durusau @ 4:02 pm

JournalTOCs

Most publishers have TOC services for new issues of their journals.

JournalTOCs aggregates TOCs from publishers and maintains a searchable database of their TOC postings.

A database that is accessible via a free API, I should add.

The API should be a useful way to add journal articles to a topic map, particularly when you want to add selected articles and not entire issues.

I am looking forward to using and exploring JournalTOCs.

Suggest you do the same.

August 9, 2012

The Cell: An Image Library

Filed under: Bioinformatics,Biomedical,Data Source,Medical Informatics — Patrick Durusau @ 3:50 pm

The Cell: An Image Library

For the casual user, an impressive collection of cell images.

For the professional user, the advanced search page gives you an idea of the depth of images in this collection.

A good source of images for curated (not “mash up”) alignment with other materials, such as instructional resources on biology or medicine.

July 5, 2012

Olympic medal winners: every one since 1896 as open data

Filed under: Data,Data Mining,Data Source — Patrick Durusau @ 5:21 am

Olympic medal winners: every one since 1896 as open data

The Guardian Datablog has posted Olympic medal winner data for download.

Admitting to some preference, I was pleased to see that OpenDocument Format was one of the download choices.

It may just be my ignorance of Olympic events, but it seems odd for the gender of competitors to be listed along with the gender of the event.

A brief history of Olympic Sports (from Wikipedia). Military patrol was a demonstration sport in 1928, 1936 and 1948. Is that likely to make a return in 2016? Or would terrorist spotting be more appropriate?

June 12, 2012

Open Content (Index Data)

Filed under: Data,Data Source — Patrick Durusau @ 3:20 pm

Open Content

From the webpage:

The searchable indexes below expose public domain ebooks, open access digital repositories, Wikipedia articles, and miscellaneous human-cataloged Internet resources. Through standard search protocols, you can make these resources part of your own information portals, federated search systems, catalogs etc. Connection instructions for SRU and Z39.50 are provided. If you have comments, questions, or suggestions for resources you would like us to add, please contact us, or consider joining the mailing list. This service is powered by Index Data’s Zebra and Metaproxy.

Looking around after reading the post on the interview with Sebastian Hammer on Federated Search, I found this listing of resources.

Database name #records Description

gutenberg 22194 Project Gutenberg. High-quality clean-text ebooks, some audio-books.

oaister 9988376 OAIster. A union catalog of digital resources, chiefly open archives of journals, etc.

oca-all 135673 All of the ebooks made available by the Internet Archive as part of the Open Content Alliance (OCA). Includes high-quality, searchable PDFs, online book-readers, audio books, and much more. Excludes the Gutenberg sub-collection, which is available as a separate database.

oca-americana 49056 The American Libraries collection of the Open Content Alliance.

oca-iacl 669 The Internet Archive Children’s Library. Books for children from around the world.

oca-opensource 2616 Collection of community-contributed books at the Internet Archive.

oca-toronto 37241 The Canadian Libraries collection of the Open Content Alliance.

oca-universallibrary 30888 The Universal Library, a digitization project founded at Carnegie-Mellon University. Content hosted at the Internet Archive.

wikipedia 1951239 Titles and abstracts from Wikipedia, the open encyclopedia.

wikipedia-da 66174 The Danish Wikipedia. Many thanks to Fujitsu Denmark for their support for the indexing of the national Wikipedias.

wikipedia-sv 243248 The Swedish Wikipedia.
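
For readers who have not used SRU before: a searchRetrieve request is an ordinary HTTP GET with a handful of standard query parameters. The sketch below only builds such a URL and sends nothing. The host `sru.example.org` is a placeholder (the real connection details are in the instructions on the Open Content page), and appending a database name from the listing as a path segment is my assumption, not a documented Index Data layout.

```python
from urllib.parse import urlencode, quote

# Hypothetical base URL: substitute the host from the connection instructions.
BASE_URL = "http://sru.example.org"


def sru_search_url(database, cql_query, maximum_records=10, start_record=1):
    """Build an SRU 1.1 searchRetrieve URL for one database.

    Assumes (unverified) that the database name is a path segment of the URL.
    """
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,  # a CQL query string
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return f"{BASE_URL}/{quote(database)}?{urlencode(params)}"


# Example: ask the (assumed) gutenberg database for up to five title matches.
url = sru_search_url("gutenberg", 'dc.title = "moby dick"', maximum_records=5)
print(url)
```

Asking for a few records at a time, and paging with `startRecord`, fits the question that follows about how many results a reader can actually use.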

Latency is an issue but I wonder what my reaction would be if a search quickly offered 3 or 4 substantive resources and invited me to read/manipulate them, while it seeks additional information/data?

Most of the articles you see cited in this blog aren’t the sort of thing you can skim and some take more than one pass to jell.

I suppose I could be offered 50 highly relevant articles in milliseconds, but I am not capable of assimilating them that quickly.

So how many resources have been wasted to give me a capacity I can’t effectively use?

May 31, 2012

From Tweets to Results: How to obtain, mine, and analyze Twitter data

Filed under: Data Source,Data Streams,Tweets — Patrick Durusau @ 4:16 pm

From Tweets to Results: How to obtain, mine, and analyze Twitter data by Derek Ruths (McGill University).

Description:

Since Twitter’s creation in 2006, it has become one of the most popular microblogging platforms in the world. By virtue of its popularity, the relative structural simplicity of Twitter posts, and a tendency towards relaxed privacy settings, Twitter has also become a popular data source for research on a range of topics in sociology, psychology, political science, and anthropology. Nonetheless, despite its widespread use in the research community, there are many pitfalls when working with Twitter data.

In this day-long workshop, we will lead participants through the entire Twitter-based research pipeline: from obtaining Twitter data all the way through performing some of the sophisticated analyses that have been featured in recent high-profile publications. In the morning, we will cover the nuts and bolts of obtaining and working with a Twitter dataset including: using the Twitter API, the firehose, and rate limits; strategies for storing and filtering Twitter data; and how to publish your dataset for other researchers to use. In the afternoon, we will delve into techniques for analyzing Twitter content including constructing retweet, mention, and follower networks; measuring the sentiment of tweets; and inferring the gender of users from their profiles and unstructured text.

We assume that participants will have little to no prior experience with mining Twitter or other social network datasets. As the workshop will be interactive, participants are encouraged to bring a laptop. Code examples and exercises will be given in Python, thus participants should have some familiarity with the language. However, all concepts and techniques covered will be language-independent, so any individual with some background in scripting or programming will benefit from the workshop.
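
To give a flavor of what "constructing a mention network" can mean, here is a minimal sketch in Python, the language the workshop uses. It is not taken from the workshop materials. The tweets are made-up (author, text) pairs, the regular expression is a simplification of Twitter's real username rules, and a retweet of the form "RT @user" is simply counted as a mention.

```python
import re
from collections import Counter

# Made-up sample data as (author, text) pairs; real tweets would come from the Twitter API.
TWEETS = [
    ("alice", "@bob have you seen the new dataset?"),
    ("bob", "@alice yes, and @carol is already mining it"),
    ("carol", "RT @alice: new dataset released"),
    ("alice", "thanks @bob!"),
]

# Simplified pattern: an @ sign followed by word characters.
MENTION = re.compile(r"@(\w+)")


def mention_edges(tweets):
    """Count directed (author -> mentioned user) edges, skipping self-mentions."""
    edges = Counter()
    for author, text in tweets:
        for mentioned in MENTION.findall(text):
            mentioned = mentioned.lower()
            if mentioned != author:
                edges[(author, mentioned)] += 1
    return edges


edges = mention_edges(TWEETS)
for (source, target), weight in sorted(edges.items()):
    print(f"{source} -> {target} (weight {weight})")
```

The weighted edge list is the kind of raw material that could be merged into a topic map, with users as topics and mentions as associations.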

Any plans to use Twitter feeds for your topic maps?

I first saw a reference to this workshop at: Do you haz teh (twitter) codez? by Ricard Nielson.

May 25, 2012

The Data Lifecycle, Part One: Avroizing the Enron Emails

Filed under: Avro,Data Source,Hadoop — Patrick Durusau @ 4:41 am

The Data Lifecycle, Part One: Avroizing the Enron Emails by Russell Jurney.

From the post:

Series Introduction

This is part one of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

The Berkeley Enron Emails

In this project we will convert a MySQL database of Enron emails into Avro document format for analysis on Hadoop with Pig. Complete code for this example is available here on GitHub.

Email is a rich source of information for analysis by many means. During the investigation of the Enron scandal of 2001, 517,431 messages from 114 inboxes of key Enron executives were collected. These emails were published and have become a common dataset for academics to analyze document collections and social networks. Andrew Fiore and Jeff Heer at UC Berkeley have cleaned this email set and provided it as a MySQL archive.

We hope that this dataset can become a sort of common set for examples and questions, as anonymizing one’s own data in public forums can make asking questions and getting quality answers tricky and time consuming.

More information about the Enron Emails is available:

Document Classification on Enron Email Dataset

Enron Scandal on Wikipedia

UC Berkeley Enron Email Analysis

Covering the data lifecycle in any detail is a rare event.

To do so with a meaningful data set is even rarer.

You will get the maximum benefit from this series by “playing along” and posting your comments and observations.

Comments (1)

May 23, 2012

1 Billion Pages Visited In 2012

Filed under: ClueWeb2012,Data Source,Lemur Project — Patrick Durusau @ 6:08 pm

The ClueWeb12 project reports:

The Lemur Project is creating a new web dataset, tentatively called ClueWeb12, that will be a companion or successor to the ClueWeb09 web dataset. This new dataset is expected to be ready for distribution in June 2012. Dataset construction consists of crawling the web for about 1 billion pages, web page filtering, and organization into a research-ready dataset.

…

The crawl was initially seeded with 2,820,500 unique URLs. This list was generated by taking the 10 million ClueWeb09 urls that had the highest PageRank scores, and then removing any page that was not in the top 90% of pages as ranked by Waterloo spam scores (i.e., least likely to be spam). Two hundred sixty-two (262) seeds were added from the most popular sites in English-speaking countries, as reported by Alexa. The number of sites selected from each country depended on its relative population size, for example, United States (71.0%), United Kingdom (14.0%), Canada (7.7%), Australia (5.2%), Ireland (3.8%), and New Zealand (3.7%). Finally, Charles Clark, University of Waterloo, provided 5,950 seeds specific to travel sites.
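
The first two steps of that seed selection (rank by PageRank, then drop likely spam) can be sketched in a few lines of Python. This is my reading of the description, not the Lemur Project's code. The toy pages are invented, and I am assuming the spam score is a percentile where higher means less likely to be spam, so that "top 90%" becomes a cutoff of 10.

```python
def select_seeds(pages, top_n, min_spam_percentile=10):
    """Pick crawl seeds from (url, pagerank, spam_percentile) tuples.

    Assumption: spam_percentile runs 0-99 and higher means LESS likely spam,
    so keeping the "top 90%" means keeping percentile >= 10.
    """
    # Step 1: keep the top_n pages by PageRank score.
    by_pagerank = sorted(pages, key=lambda page: page[1], reverse=True)[:top_n]
    # Step 2: remove pages that fall below the spam-percentile cutoff.
    return [url for url, _, pct in by_pagerank if pct >= min_spam_percentile]


# Invented toy data; the real input was 10 million ClueWeb09 URLs.
PAGES = [
    ("http://a.example/", 0.9, 95),
    ("http://b.example/", 0.8, 3),   # high PageRank but very likely spam
    ("http://c.example/", 0.7, 60),
    ("http://d.example/", 0.1, 99),  # clean, but not in the PageRank top 3
]

seeds = select_seeds(PAGES, top_n=3)
print(seeds)
```

The order of the two steps matters: filtering spam first and then taking the top pages by PageRank would yield a different seed list.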

A blacklist was used to avoid sites that are reported to distribute pornography, malware, and other material that would not be useful in a dataset intended to support a broad range of research on information retrieval and natural language understanding. The blacklist was obtained from a commercial managed URL blacklist service, URLBlacklist.com, which was downloaded on 2012-02-03. The crawler blacklist consists of urls in the malware, phishing, spyware, virusinfected, filehosting and filesharing categories. Also included in the blacklist is a small number (currently less than a dozen) of sites that opted out of the crawl.

…

The crawled web pages will be filtered to remove certain types of pages, for example, pages that a text classifier identifies as non-English, pornography, or spam. The dataset will contain a file that identifies each url that was removed and why it was removed. The web graph will contain all pages visited by the crawler, and will include information about redirected links.

The crawler captures an average of 10-15 million pages (and associated images, etc) per day. Its progress is documented in a daily progress report.

Are there any search engine ads that boast “X billion pages crawled”?

May 22, 2012

Health Care Cost Institute

Filed under: Data Source,Health care — Patrick Durusau @ 2:37 pm

Health Care Cost Institute

I can’t give you a clean URL but on Monday (21 May 2012), the Washington Post ran a story on the Health Care Cost Institute, which had the following quotes:

This morning a new nonprofit called the Health Care Cost Institute will roll out a database of 5 billion health insurance claims (all stripped of the individual health plan’s identity, to address privacy concerns).

…

This is the first study to use the HCCI data, although more are in the works. Gaynor has been inundated with about 130 requests from health policy researchers to use the database. While his team sifts through those, three approved studies are already tackling big health policy questions.

…

“There is immense interest in gaining access,” says HCCI executive director David Newman. “We’re having trouble keeping up with that.” (emphasis added)

Sorry, that went by a little fast. The data has already been scrubbed, so why the choke point of the Health Care Cost Institute on the data?

Spin it up to one or more clouds that support free public storage for data sets of public interest.

The problem of sorting through access requests is solved.

Just maybe researchers will want to address other questions, ones that aren’t necessarily about costs. And/or combine this data with other data. Like data on local pollution. (Although you would need historical data to make that work.)

Mapping this data set to other data sets could only magnify its importance.

Many thanks are owed to the Health Care Cost Institute for securing the data set.

But our thanks should not include electing the HCCI as censor of uses of this data set.

May 14, 2012

TREC Document Review Project on Hiatus, Recommind Asked to Withdraw

Filed under: Data Mining,Data Source,Open Relevance Project,TREC — Patrick Durusau @ 12:47 pm

TREC Document Review Project on Hiatus, Recommind Asked to Withdraw

From the post:

TREC Legal Track — part of the U.S. government’s Text Retrieval Conference — announced last week that the 2012 edition of its annual document review project for testing new systems is canceled, while prominent e-discovery software company Recommind confirmed that it’s been asked to leave the project for prematurely sharing results.

These difficulties highlight the need for:

open data sets and
protocols for reporting of results as they occur.

That requires a data set with relevance judgments and other work.

Have you thought about the: Open Relevance Project at the Apache Foundation?

Email archives from Apache projects, the backbone of the web as we know it, are ripe for your contributions.

Let me be the first to ask Recommind to join in building a public data set for everyone.

May 10, 2012

Trawling the web for socioeconomic data? Look no further than Knoema

Filed under: Data Source,News,Socioeconomic Data — Patrick Durusau @ 2:21 pm

Trawling the web for socioeconomic data? Look no further than Knoema

From the Guardian Datablog, John Burn-Murdoch writes:

A joint venture by Russian and Indian technology professionals aims to be the Youtube of data. Knoema which launched last month and is marketed by its creators as “your personal knowledge highway”, combines data-gathering with presentation to create an online bank of socioeconomic and environmental data-sets.

The website’s homepage shows a selection of the topics on which Knoema has collected data. Among the categories are broad fields such as commodities and energy, but also more specialised collections including sexual exploitation and biofuels.

[graphics omitted]

Within each subject-area you can find one or more ‘dashboards’ – simple yet comprehensive presentations of data for a given topic, with all source-material documented. Knoema also provides choropleth maps for many of the datasets where figures are given for geographical areas.

‘Commodity passports‘ are another format in which Knoema offers some of its data. These give a detailed breakdown of production, consumption, imports, exports and market prices for a diverse range of products and materials including apples, cotton and natural gas.

Resource listings follow the site review, including the Guardian’s world government data gateway and other resources.

Comments (1)

CNN Transcript Collection (2000-2012)

Filed under: Data Source,News — Patrick Durusau @ 2:03 pm

CNN Transcript Collection (2000-2012)

From the webpage:

For over a decade, CNN (Cable News Network) has been providing transcripts of shows, events and newscasts from its broadcasts. The archive has been maintained and the text transcripts have been dependably available at transcripts.cnn.com. This is a just-in-case grab of the years of transcripts for later study and historical research.

Suggested transcript sources for other broadcast media?

Seen at Nathan Yau’s Flowing Data.

April 29, 2012

46 Research APIs: DataUnison, Mendeley, LexisNexis and Zotero

Filed under: Data Source,LexisNexis,Zotero — Patrick Durusau @ 3:37 pm

46 Research APIs: DataUnison, Mendeley, LexisNexis and Zotero by Wendell Santos.

From the post:

Our API directory now includes 46 research APIs. The newest is the Globus Online Transfer API. The most popular, in terms of mashups, is the Mendeley API. We list 3 Mendeley mashups. Below you’ll find some more stats from the directory, including the entire list of research APIs.

I did see an API that accepts Greek strings and returns Latin transliteration. Oh, that doesn’t interest you?

There are a number of bibliography, search and related tools.

I am sure you will find something to enhance an academic application of topic maps.

March 10, 2012

Campaign Finance Data in Real Time

Filed under: Data Source,Politics — Patrick Durusau @ 8:21 pm

Campaign Finance Data in Real Time by DEREK WILLIS.

From the post:

Political campaigns can change every day. The Campaign Finance API now does a better job of keeping pace.

We worked with ProPublica, one of the heaviest users of the API, to make the API more real-time, and to surface more data, such as itemized contributions for every presidential candidate and “super PAC”.

When the API was launched, most of the data it served up was updated every week or, in some cases, on a daily basis. But we work for news organizations, and what is news right now can be old news tomorrow. Committees that raise and spend money influencing federal elections are filing reports every day, not just on the day that reports are due.

If you are mapping political data, the New York Times is a real treasure trove of information.

Read this post for more details on real time campaign finance data.

February 13, 2012

Particle Physics – Stanford

Filed under: Data Source,Particle Physics — Patrick Durusau @ 8:19 pm

Leonard Susskind lectures on particle physics. Like astronomy (both optical and radio), particle physics was a leading source of “big data” before there was “big data.”

Particle Physics: Basic Concepts

Particle Physics: Standard Model

Interesting in its own right, and another field for testing data mining software.

February 7, 2012

Finding Data on the Internet

Filed under: Data,Data Source,R — Patrick Durusau @ 4:31 pm

Finding Data on the Internet

From the post:

What I would like is a nice list of all of credible sources on the Internet for finding data to use with R projects. I know that this is a crazy idea, not well formulated (what are data after all) and loaded with absurd computational and theoretical challenges. (Why can’t I just google “data R” and get what I want?) So, what can I do? As many people are also out there doing, I can begin to make lists (in many cases lists of lists) on a platform that is stable enough to survive and grow, and perhaps encourage others to help with the effort.

Here follows a list of data sources that may easily be imported into R. If an (R) appears after source this means that the data are already in R format or there exist R commands for directly importing the data from R. (See http://www.quantmod.com/examples/intro/ for some code.) Otherwise, I have limited the list to data sources for which there is a reasonably simple process for importing csv files. What follows is a list of data sources organized into categories that are not mutually exclusive but which reflect what’s out there.

A useful listing of data sources for R, but you could use them with any SQL, NoSQL, SQL-NoSQL hybrid, or topic map as well.

Title probably should be: “Data Found on the Internet.” Finding data is a more difficult proposition.

Curious: Is there a “data crawler” that attempts to crawl websites of governments and the usual suspects for new data sets?

January 24, 2012

Web Data Commons

Filed under: Common Crawl,Data Source — Patrick Durusau @ 3:42 pm

Web Data Commons: Extracting Structured Data from the Common Web Crawl

From the post:

Web Data Commons will extract all Microformat, Microdata and RDFa data that is contained in the Common Crawl corpus and will provide the extracted data for free download in the form of RDF-quads as well as CSV-tables for common entity types (e.g. product, organization, location, …).

We are finished with developing the software infrastructure for doing the extraction and will start an extraction run for the complete Common Crawl corpus once the new 2012 version of the corpus becomes available in February. For testing our extraction framework, we have extracted structured data out of 1% of the currently available Common Crawl corpus dating October 2010. The results of this extraction run are provided below. We will provide the data from the complete 2010 corpus together with the data from the 2012 corpus in order to enable comparisons on how data provision has evolved within the last two years.

An interesting mining of open data.

The ability to perform comparisons on data over time is particularly interesting.

USAspending.gov

Filed under: Data Source,Government Data — Patrick Durusau @ 3:40 pm

USAspending.gov

This website is required by the “Federal Funding Accountability and Transparency Act (Transparency Act).”

The FAQ describes its purpose as:

To provide the public with information about how their tax dollars are spent. Citizens have a right and a need to understand where tax dollars are spent. Collecting data about the various types of contracts, grants, loans, and other types of spending in our government will provide a broader picture of the Federal spending processes, and will help to meet the need of greater transparency. The ability to look at contracts, grants, loans, and other types of spending across many agencies, in greater detail, is a key ingredient to building public trust in government and credibility in the professionals who use these agreements.

An amazing amount of data which can be searched or browsed in a number of ways.

It is missing one ingredient that would change it from an amazing information resource to a game-changing information resource: you.

The site can only report information known to the federal government and covered by the Transparency Act.

For example, it can’t report on family or personal relationships between various parties to contracts or even offer good (or bad) information on performance on contracts or methods used by contractors.

However, a topic map (links into this site are stable) could combine this information with other information quite easily.

I ran across this site in Analyzing US Government Contract Awards in R by Vik Paruchuri. A very good article that scratches the surface of mining this content.

January 11, 2012

Monthly Twitter activity for all members of the U.S. Congress

Filed under: Data Source,Government Data,Tweets — Patrick Durusau @ 8:04 pm

Monthly Twitter activity for all members of the U.S. Congress by Drew Conway.

From the post:

Many months ago I blogged about the research that John Myles White and I are conducting on using Twitter data to estimate an individual’s political ideology. As I mentioned then, we are using the Twitter activity of members of the U.S. Congress to build a training data set for our model. A large part of the effort for this project has gone into designing a system to systematically collect the Twitter data on the members of the U.S. Congress.

Today I am pleased to announce that we have worked out most of the bugs, and now have a reliable data set upon which to build. Better still, we are ready to share. Unlike our old system, the data now lives on a live CouchDB database, and can be queried for specific research tasks. We have combined all of the data available from Twitter’s search API with the information on each member from Sunlight Foundation’s Congressional API.

Looks like an interesting data set to match up to the ages of addresses doesn’t it?

December 22, 2011

ProgrammableWeb – New APIs

Filed under: Data Source,Mashups — Patrick Durusau @ 6:40 pm

70 New APIs: Google Affiliate Network, Visual Search and Mobile App Sales Tracking by Wendell Santos.

In a post dated 18 December 2011, ProgrammableWeb reports:

This week we had 70 new APIs added to our API directory including an audio fingerprinting service, sentiment analysis and analytics service, affiliate marketing network, mobile app sales tracking service, visual search service and an eCommerce service. In addition we covered a “mobile engagement” platform adding revenue analytics to their service. Below are more details on each of these new APIs.

I have a question: ProgrammableWeb lists 4657 APIs (as of 22 December 2011, about 6:30 PM East Coast time) with six (6) filters, Keywords, Category, Company, Protocols/Styles, Data Format, Date, Managed By. How easy/hard is that to use? Care to guess where the break point will come in terms of ease of use?

For example, choosing “government” as a category results in 154 APIs. The result is a very uneven listing, from Leipzig city data to Brazilian election candidate information to words used in the U.S. Congress. Minimal organization by country would be nice.

Comments Off

December 18, 2011

Introducing the Events API

Filed under: Data Source,New York,News — Patrick Durusau @ 8:48 pm

Introducing the Events API by Brian Balser.

From the post:

This past November, The New York Times launched the Arts & Entertainment Guide, an interactive guide to noteworthy cultural events in and around New York City. The application lets you browse through a hand-selected listing of events, customizing the view based on date range, category and location.

At our annual Hack Day we made the Event Listings API, used by the interactive guide, publicly available to the developer community on the NYTimes Developer Network. The API supports three types of search: spatial, faceted and full-text. Each can be used separately or in conjunction in order to find events by different sets of criteria.
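To make the idea of combining the three search types concrete, here is a small in-memory sketch. It is not the Event Listings API: the event records, field names and the `search` function are all made up for illustration, and the spatial filter is a plain great-circle (haversine) distance check.

```python
from math import asin, cos, radians, sin, sqrt

# Made-up sample records; names, coordinates and text are illustrative only.
EVENTS = [
    {"name": "Event A", "category": "Jazz", "lat": 40.7360, "lon": -74.0017,
     "text": "Late-night quartet set"},
    {"name": "Event B", "category": "Art", "lat": 40.7614, "lon": -73.9776,
     "text": "Gallery show of new sculpture"},
    {"name": "Event C", "category": "Jazz", "lat": 40.6782, "lon": -73.9442,
     "text": "Brunch with a live trio"},
]


def km_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres (haversine formula)."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def search(events, near=None, radius_km=None, category=None, query=None):
    """Apply whichever of the spatial, faceted and full-text filters are given."""
    results = []
    for event in events:
        if near is not None and radius_km is not None:
            if km_between(near[0], near[1], event["lat"], event["lon"]) > radius_km:
                continue  # spatial filter: too far from the search point
        if category is not None and event["category"] != category:
            continue  # faceted filter: wrong category
        if query is not None and query.lower() not in event["text"].lower():
            continue  # full-text filter: query string not found
        results.append(event["name"])
    return results


nearby_jazz = search(EVENTS, near=(40.73, -74.0), radius_km=5, category="Jazz")
trio_events = search(EVENTS, query="trio")
print(nearby_jazz, trio_events)
```

Each filter is optional, so the same function covers a single search type or any combination of the three.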

If a metro area population of twenty-two (22) million doesn’t sound like a large enough market, consider that New York City is projected to have fifty (50) million visitors in 2012.

Topic maps that merge data from this feed and conference websites seem a likely early use of this data. But more creative uses are certainly possible.

What would you suggest?

Comments Off

November 24, 2011

FactLab

Filed under: Data,Data Source,Interface Research/Design — Patrick Durusau @ 3:55 pm

FactLab

From the webpage:

Factlab collects official stats from around the world, bringing together the World Bank, UN, the EU and the US Census Bureau. How does it work for you – and what can you do with the data?

From the Guardian in the UK.

Very impressive and interactive site.

I don’t agree with their philosophical assumptions about “facts,” but nonetheless, a number of potential clients do. So long as they are paying the freight, facts they are.

Comments Off

October 27, 2011

Department of Numbers

Filed under: Data Source,Marketing — Patrick Durusau @ 4:45 pm

Department of Numbers

From the webpage:

The Department of Numbers contextualizes public data so that individuals can form independent opinions on everyday social and economic matters.

Possible source for both data and analysis that is of public interest. Thinking it will be easier to attract attention to topic maps that address current issues.

Comments Off

October 26, 2011

Royal Society Journal Archive

Filed under: Data Source — Patrick Durusau @ 3:22 pm

Royal Society Journal Archive – Free Permanent Access

From the announcement:

The Royal Society has today announced that its world-famous historical journal archive – which includes the first ever peer-reviewed scientific journal – has been made permanently free to access online.

So, if you search for information using modern terminology, are you going to pick up materials from 10, 50, 100, 300 years ago?

Where do you think the break point will be on terminology?

Here’s my suggestion:

1. We will pair off in two person teams. The A teams will research a subject and record their queries, along with an estimate for when the language changes.
2. The B teams will take the A team results and try to show that the estimate for when the language changed is incorrect (too early, too late, never, etc.).
3. Prizes will be awarded for the best results as well as the most interesting queries, subjects and odd facts learned along the way.

Comments Off

October 25, 2011

Datablog

Filed under: Data Source,Visualization — Patrick Durusau @ 7:33 pm

Datablog

From the Guardian in the UK. If you don’t know the Guardian, you are missing a real treat.

The Datablog offers visualizations of facts that otherwise may be difficult to grasp or that become more compelling in graphic form.

Browse around and you will find a number of interesting resources, such as a listing of all the visualizations for the last 2 years and information on how they make data available.

Some news outlets in the U.S., such as the New York Times, have similar efforts but I don’t know of any that are quite this good. Suggestions anyone?

This and similar resources should give you ideas on how to visualize information to discover information and subjects for your topic maps as well as ways that you can present topic map data more effectively to your users.

1. Choose one visualization from the Guardian and explain what advantages it offers over a simple table layout of the same information. 2-3 pages (no citations)
2. What other information sets could be effectively displayed using a technique similar to #1? What would be different about it over table display? 2-3 pages (no citations)
3. What are the limitations of the visualization you have chosen for #2? 2-3 pages (no citations)

Comments Off

October 22, 2011

Processing every Wikipedia article

Filed under: Data Mining,Data Source — Patrick Durusau @ 3:17 pm

Processing every Wikipedia article by Gareth Lloyd.

From the post:

I thought it might be worth writing a quick follow up to the Wikipedia Visualization piece. Being able to parse and process all of Wikipedia’s articles in a reasonable amount of time opens up fantastic opportunities for data mining and analysis. What’s more, it’s easy once you know how.

An alternative method for accessing and parsing Wikipedia data. I probably need to do a separate post on the Visualization post.
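As a rough idea of what processing every article can look like, here is a minimal streaming-parse sketch using Python's standard library. It is not necessarily the method in Lloyd's post. The inline sample is a simplified stand-in for a Wikipedia XML dump: real dumps declare an XML namespace, so the tag names would need the namespace prefix, and the `iter_articles` helper is my own name.

```python
import io
import xml.etree.ElementTree as ET

# Simplified stand-in for a dump file: no namespace, two tiny pages.
SAMPLE = b"""<mediawiki>
  <page><title>Alpha</title><revision><text>First article body.</text></revision></page>
  <page><title>Beta</title><revision><text>Second, longer article body.</text></revision></page>
</mediawiki>"""


def iter_articles(stream):
    """Yield (title, text) pairs one page at a time without loading the whole file."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title")
            text = elem.findtext("revision/text") or ""
            yield title, text
            elem.clear()  # drop the page's children so memory use stays flat


articles = list(iter_articles(io.BytesIO(SAMPLE)))
for title, text in articles:
    print(title, len(text))
```

Because `iterparse` hands back each element as soon as its closing tag is read, the same loop works on a multi-gigabyte file opened with `open(path, "rb")`.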

Enjoy!

Comments Off

October 18, 2011

New food web dataset

Filed under: Data Source,Graphs,R,Visualization — Patrick Durusau @ 2:41 pm

New food web dataset

From the post:

So, there is a new food web dataset out that was put in Ecological Archives here, and I thought I would play with it. The food web is from Otago Harbour, an intertidal mudflat ecosystem in New Zealand. The web contains 180 nodes, with 1,924 links.

Fun stuff…

Interesting visuals but do you find that they help you understand the data?

Important question for visualizing topic maps because you can make the nodes and associations between them jump, jitter, blink (shades of the browser wars), or zoom in and out. OK, so if I am playing “Idiotfield 3” or whatever that might be interesting.

But the question for topic maps or any information system is whether it helps me find or understand the underlying data?

What do you think here? Data is available. What would you do differently?
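One thing I might do differently is start with a couple of summary numbers before any animation. A minimal sketch, using only the node and link counts quoted above and assuming the usual food-web definitions of link density (L/S) and directed connectance (L/S²); the function names are mine, not from the post.

```python
def link_density(links, nodes):
    """Average number of links per node (L / S)."""
    return links / nodes


def connectance(links, nodes):
    """Fraction of all possible directed links that are realized (L / S^2)."""
    return links / nodes ** 2


# Counts quoted in the post for the Otago Harbour food web.
NODES = 180
LINKS = 1924

density = link_density(LINKS, NODES)
conn = connectance(LINKS, NODES)
print(f"link density: {density:.2f}, connectance: {conn:.4f}")
```

Two numbers will not replace a picture, but they give a baseline for judging whether a visualization adds anything.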

Comments Off

Ecological Society of America (esa) – Ecological Archives

Filed under: Data Source — Patrick Durusau @ 2:41 pm

Ecological Society of America (esa) – Ecological Archives

If you are interested in ecological data for use with topic maps, this looks like a good place to start.

The available digital files/supplements to published papers go back to 1982.

Published by the Ecological Society of America.

Comments Off


Database name	#records	Description
gutenberg	22194	Project Gutenberg. High-quality clean-text ebooks, some audio-books.
oaister	9988376	OAIster. A Union catalog of digital resources, chiefly open archives of journals, etc.
oca-all	135673	All of the ebooks made available by the Internet Archive as part of the Open Content Alliance (OCA). Includes high-quality, searchable PDFs, online book-readers, audio books, and much more. Excludes the Gutenberg sub-collection, which is available as a separate database.
oca-americana	49056	The American Libraries collection of the Open Content Alliance.
oca-iacl	669	The Internet Archive Children’s Library. Books for children from around the world.
oca-opensource	2616	Collection of community-contributed books at the Internet Archive.
oca-toronto	37241	The Canadian Libraries collection of the Open Content Alliance.
oca-universallibrary	30888	The Universal Library, a digitization project founded at Carnegie Mellon University. Content hosted at the Internet Archive.
wikipedia	1951239	Titles and abstracts from Wikipedia, the open encyclopedia.
wikipedia-da	66174	The Danish Wikipedia. Many thanks to Fujitsu Denmark for their support for the indexing of the national Wikipedias.
wikipedia-sv	243248	The Swedish Wikipedia.

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity
