Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 30, 2011

Case study with Data blogs, from 300 to 1000

Filed under: Dataset — Patrick Durusau @ 7:04 pm

Case study with Data blogs, from 300 to 1000

From the post:

We looked at how we could help increase the size of a list of Top300 Data blogs fast.

Initial idea from Marshall Kirkpatrick.

Project turnaround: 1 day. Here is the initial list of 300 bloggers:

Great study in how to take an initial data set and expand it quickly.

Data: Making a List of the Top 300 Blogs about Data, Who Did We Miss?

Filed under: Data,Dataset — Patrick Durusau @ 7:04 pm

Data: Making a List of the Top 300 Blogs about Data, Who Did We Miss? by Marshall Kirkpatrick.

From the post:

Dear friends and neighbors, as part of my ongoing practice of using robots and algorithms to make grandiose claims about topics I know too little about, I have enlisted a small army of said implements of journalistic danger to assemble the above collection of blogs about data. I used a variety of methods to build the first half of the list, then scraped all the suggestions from this Quora discussion to flesh out the second half. Want to see if your blog is on this list? Control-F and search for its name or URL and your browser will find it if it’s there.

Why data? Because we live in a time when the amount of data being produced is exploding and it presents incredible opportunities for software developers and data analysts. Opportunities to build new products and services, but also to discover patterns. Those patterns will represent further opportunities for innovation, or they’ll illuminate injustices, or they’ll simply delight us with a greater sense of self-awareness than we had before. (I was honored to have some of my thoughts on data as a platform cited in this recent Slate write-up on the topic, if you’re interested in a broader discussion.) Data is good, and these are the leading people I’ve found online who are blogging about it.

A bit dated now but instructive for the process of mining and then ranking the blogs. There are any number of subject areas that await similar treatment.

  1. What subject area would interest you enough to collect the top 100 or 300 blogs?
  2. Would collecting and ranking be enough to be useful? For what purposes? Where would that fail?
  3. How would you envision topic maps making a difference for such a collection of blogs?

October 13, 2011

Numbrary

Filed under: Data Source,Dataset — Patrick Durusau @ 7:00 pm

Numbrary

From the website:

Numbrary is a free online service dedicated to finding, using and sharing numbers on the web.

With 26,475 data tables from the US Department of Labor, I get to Producer Price Indexes (3428 items) and then to Commodities (WPU 101) and there is very nice access to the underlying data:

http://numbrary.com/sources/10d891fc1320-produce-price-index-commodi.

Except that I don’t know how that data should (could?) be reconciled with other data. Or what that “other” data would be, save for the “See Also” links on the webpage, and I don’t know why I should see that data either.

Beyond just my lack of experience with economic data, this may illustrate something about “transparency” in government.

Can a government be said to be “transparent” if it provides data that is no more “transparent” to voters than the lack of data?

What burden does it have to make data not just accessible but also meaningful? (I am mindful of the credit disclosure laws that provided foot faults for those wishing to pursue members of the credit industry but that did not make credit rate disclosures meaningful.)

Still, a useful source of data that I commend to your attention.

Peter Skomoroch – Delicious

Filed under: Data Source,Dataset — Patrick Durusau @ 6:59 pm

Peter Skomoroch – Delicious

As of today, 7845 links to data and data sources.

A prime candidate to illustrate that there is no shortage of data, but a serious shortage of meaningful navigation of data.

In Depth with Campaign Finance Data

Filed under: Data Source,Dataset — Patrick Durusau @ 6:57 pm

In Depth with Campaign Finance Data by Ethan Phelps-Goodman.

Introduction

Influence Explorer and TransparencyData are the Sunlight Foundation’s two main sources for data on money and influence in politics. Both sites are warehouses for a variety of datasets, including campaign finance, lobbying, earmarks, federal spending and various other corporate accountability datasets. The underlying data is the same for both sites, but the presentation is very different. Influence Explorer takes the most important or prominent entities in the data–such as state and federal politicians, well-known individuals, and large companies and organizations–and gives each its own page with easy to understand charts and graphs. TransparencyData, on the other hand, gives searchable access to the raw records that make up each Influence Explorer page. Influence Explorer can answer questions like, “who was the top donor to Obama’s presidential campaign?” TransparencyData lets you dig down into the details of every single donation to that campaign.

If you are interested in campaign finance data, this is a very good starting point. At least you can get a sense for the difficulty of simply tracking the money. I think you will find that money can buy access, but that isn’t the same thing as influence. That’s more complicated.

Topic maps can help in several ways. First, there is the ability to consolidate information from a variety of sources so no one person has to try to assemble all the pieces. Second, the use of associations can help you discover patterns in relationships that may uncover some hidden (or relatively so) avenues of influence or access. Not to mention that being able to trade information with others may help you build a better portfolio of data for when you go calling to exercise some influence.
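
To make the association point concrete, here is a minimal Python sketch (using networkx; the records and field names are hypothetical stand-ins for what TransparencyData returns) that treats each donation as an association and looks for donors shared between recipients:

    import networkx as nx

    # Hypothetical donation records of the sort TransparencyData can return.
    donations = [
        {"donor": "Acme PAC", "recipient": "Candidate A", "amount": 5000},
        {"donor": "Acme PAC", "recipient": "Candidate B", "amount": 2500},
        {"donor": "J. Smith", "recipient": "Candidate A", "amount": 1000},
    ]

    # Each donation becomes an association (edge) between donor and recipient.
    g = nx.Graph()
    for d in donations:
        g.add_edge(d["donor"], d["recipient"], amount=d["amount"])

    # Donors shared by two recipients hint at common avenues of access.
    shared = set(g.neighbors("Candidate A")) & set(g.neighbors("Candidate B"))
    print(shared)  # {'Acme PAC'}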

October 12, 2011

Where to find data to use with R

Filed under: Data Source,Dataset,R — Patrick Durusau @ 4:37 pm

Where to find data to use with R

From the post:

Hardly a day goes by without someone or something reminding me that we are drowning in a sea of data (a bummer day ):, or that the new hero is the data scientist (a Yes! let’s go make some money kind of day!!). This morning I read “…Google grew from processing 100 terabytes of data a day with MapReduce in 2004 to processing 20 petabytes a day with MapReduce in 2008.” (Lin and Dyer, Data-Intensive Text Processing with MapReduce: Morgan & Claypool, 2010, p. 1) Assuming linear growth, that would mean Google did about 400 terabytes during the 15 minutes it took me to check my email. Even if Google is getting more than its fair share, data should be everywhere, more data than I could ever need, more than I could process, more than I could ever imagine.

So, how come every time I go to write a blog post or try some new stats I can never find any data? A few hours ago I Googled “free data sets” and got over 74,000,000 hits, but it looks as if it’s going to be another evening of me with iris. What’s wrong here? At the root, it’s a deep problem that gets at the essence of data. What are data anyway? My answer: data are structured information. Part of the structure includes meta-information describing the intention and the integrity with which the data were collected. When looking for a data set, even for some purpose that is not that important, we all want some evidence that the data were either collected with intentions that are similar to our intentions for using the data or that the data can be re-purposed. Moreover, we need to establish some comfort level that the data were not collected to deceive, that they are reasonably representative, reasonably randomized, reasonably unbiased, etc. The more importance we place on our project, the more we tighten up on these requirements. This is not all philosophy. I think that focusing on intentions and integrity provides some practical guidance on where to search for data on the internet.

If you are using R and need data, here is a first stop. Note the author is maintaining a list of such data sources.

October 9, 2011

Open Relevance Project

Filed under: Dataset,Relevance — Patrick Durusau @ 6:40 pm

Open Relevance Project

From the website:

What Is the Open Relevance Project?

The Open Relevance Project (ORP) is a new Apache Lucene sub-project aimed at making materials for doing relevance testing for Information Retrieval (IR), Machine Learning and Natural Language Processing (NLP) into open source.

Our initial focus is on creating collections, judgments, queries and tools for the Lucene ecosystem of projects (Lucene Java, Solr, Nutch, Mahout, etc.) that can be used to judge relevance in a free, repeatable manner.

One dataset that needs attention from this project is the Apache Software Foundation Public Mail Archives, which is accessible on the Amazon cloud.

Project work products would benefit Apache software users, vendors with Apache software bases, historians, sociologists and others interested in the dynamics, technical and otherwise, of software development.

I am willing to try to learn cloud computing and the skills necessary to turn this dataset into a test collection. Are you?

Apache Software Foundation Public Mail Archives

Filed under: Cloud Computing,Dataset — Patrick Durusau @ 6:40 pm

Apache Software Foundation Public Mail Archives

From the webpage:

Submitted By: Grant Ingersoll
US Snapshot ID (Linux/Unix): snap-17f7f476
Size: 200 GB
License: Public Domain (See http://apache.org/foundation/public-archives.html)
Source: The Apache Software Foundation (http://www.apache.org)
Created On: August 15, 2011 10:00 PM GMT
Last Updated: August 15, 2011 10:00 PM GMT

A collection of all publicly available mail archives from the Apache Software Foundation (ASF), taken on July 11, 2011.

This collection contains all publicly available email archives from the ASF’s 80+ projects (http://mail-archives.apache.org/mod_mbox/), including mailing lists such as Apache HTTPD Server, Apache Tomcat, Apache Lucene and Solr, Apache Hadoop and many more.

Generally speaking, most projects have at least three lists: user, dev and commits, but some have more, some have less. The user lists are where users of the software ask questions on usage, while the dev list usually contains discussions on the development of the project (code, releases, etc.)

The commit lists usually consist of automated notifications sent by the various ASF version control tools, like Subversion or CVS, and contain information about changes made to the project’s source code.

Both tarballs and per project sets are available in the snapshot. The tarballs are organized according to project name. Thus, a-d.tar.gz contains all ASF projects that begin with the letters a, b, c or d, such as abdera.apache.org. Files within the project are usually gzipped mbox files. (I split the first paragraph into several paragraphs for readability reasons.)

Rather meager documentation for a 200 GB data set, don’t you think? I think a first step would be to create basic documentation on what projects are present, their mailing lists, and some basic statistical counts to serve as reference points.
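
As a first step toward those reference counts, here is a minimal Python sketch (the directory layout after unpacking a tarball is an assumption; adjust the glob to what you actually find) that counts messages per gzipped mbox file using the standard “From ” separator line:

    import glob
    import gzip
    from collections import Counter

    counts = Counter()
    # Assumed layout after unpacking a-d.tar.gz: gzipped mbox files per project.
    for path in glob.glob("a-d/**/*.gz", recursive=True):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            # In mbox format, each message begins with a "From " separator line.
            counts[path] = sum(1 for line in f if line.startswith("From "))

    for path, n in counts.most_common(10):
        print(n, path)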

You have been waiting for a motivation to “get into” cloud computing? Well, now you have the motivation and an interesting dataset!

Ancestry.com Forum Dataset

Filed under: Dataset — Patrick Durusau @ 6:40 pm

Ancestry.com Forum Dataset

From the post:

The Ancestry.com Forum Dataset was created with the cooperation of Ancestry.com in an effort to promote research on information retrieval, language technologies, and social network analysis. It contains a full snapshot of the Ancestry.com online forum, boards.ancestry.com, from July 2010. This message board is large, with over 22 million messages, over 3.5 million authors, and active participation for over ten years.

In addition to the document collection, queries from Ancestry.com’s query log and pairwise preference relevance judgements for a message thread retrieval task using this online forum are distributed.

This webpage describes the dataset, gives instructions for obtaining the dataset, and describes the supplemental data to use for thread search information retrieval experiments. Further details of the dataset can be found in the tech report describing the collection.

Contact: Jonathan Elsas.


Document Collection

The Ancestry.com Online Forum document collection is a full snapshot of the online forum, boards.ancestry.com from July 2010.

Number of Messages: 22,054,728
Number of Threads: 9,040,958
Number of Sub-forums: 165,358
Number of Unique Authors: 3,775,670
Message Date Range: December 1995 – July 2010
Size: 5 GB (compressed)

The documents distributed in the collection are in the TRECTEXT SGML format, similar to other collections used at the Text REtrieval Conference.

As you will read, creation of a dataset for use as a test set is a non-trivial project.

Curious, what questions would you ask of such a dataset? Or perhaps better, what tools would you use to ask those questions and why?
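
As to tools: the TRECTEXT format is simple enough that you can iterate over documents with a few lines of Python. A minimal sketch (DOCNO and TEXT are the conventional TRECTEXT fields; consult the collection’s tech report for the actual tag set, and the file name here is hypothetical):

    import re

    DOC = re.compile(r"<DOC>(.*?)</DOC>", re.S)
    FIELD = re.compile(r"<(DOCNO|TEXT)>(.*?)</\1>", re.S)

    def iter_trectext(path):
        """Yield a {field: value} dict for each <DOC> element in a TRECTEXT file."""
        with open(path, encoding="utf-8", errors="replace") as f:
            data = f.read()
        for m in DOC.finditer(data):
            yield {name: value.strip() for name, value in FIELD.findall(m.group(1))}

    for doc in iter_trectext("ancestry-sample.sgml"):  # hypothetical file name
        print(doc.get("DOCNO"))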

Grant Ingersoll mentioned this collection in email on the openrelevance-dev@apache.org mailing list.

October 6, 2011

Sandbox from YAHOO! Research

Filed under: Dataset — Patrick Durusau @ 5:34 pm

Sandbox from YAHOO! Research

Saw this on Machine Learning (Theory) as reported by Lihong Li.

The data sets sound really great, but then I read:

Eligibility:

Yahoo! is pleased to make these datasets available to researchers who are advancing the state of knowledge and understanding in web sciences. The datasets are only available for academic use by faculty and university researchers who agree to the Data Sharing Agreement.

To be eligible to receive Webscope data you must:

  • Be a faculty member, research employee or student from an accredited university
  • Send the data request from an accredited university .edu (or equivalent domain name for international universities) email address
  • Ensure that your request has been acknowledged by your Department Chair

We are not able to share data with:

  • Commercial entities
  • Employees of commercial entities with university appointment
  • Research institutions not affiliated with a research university

Note: You must have a Yahoo! account to apply for Webscope datasets.

I think I can pass everything except “employees of commercial entities with university appointment” since I am an adjunct faculty member and work outside the university as my primary means of support.

This reads like someone who doesn’t want to share data trying to think of foot-faults to build into the sharing process. Such as “acknowledged by your Department Chair.” Acknowledged to whom? By what means? Is once enough?

I can understand reasonable restrictions, say non-commercial use, attribution on publication, contribution of improvements back to the community, etc., but the user community deserves better rules than these.

September 24, 2011

Introducing Fech

Filed under: Dataset,Marketing — Patrick Durusau @ 6:58 pm

Introducing Fech by Michael Strickland.

From the post:

Ten years ago, the Federal Election Commission introduced electronic filing for political committees that raise and spend money to influence elections to the House and the White House. The filings contain aggregate information about a committee’s work (what it has spent, what it owes) and more detailed listings of its interactions with the public (who has donated to it, who it has paid for services).

Journalists who work with these filings need to extract their data from complex text files that can reach hundreds of megabytes. Turning a new set into usable data involves using the F.E.C.’s data dictionaries to match all the fields to their positions in the data. But the available fields have changed over time, and subsequent versions don’t always match up. For example, finding a committee’s total operating expenses in version 7 means knowing to look in column 52 of the “F3P” line. It used to be found at column 50 in version 6, and at column 44 in version 5. To make this process faster, my co-intern Evan Carmi and I created a library to do that matching automatically.

Fech (think “F.E.C.h,” say “fetch”), is a Ruby gem that abstracts away any need to map data points to their meanings by hand. When you give Fech a filing, it checks to see which version of the F.E.C.’s software generated it. Then, when you ask for a field like “total operating expenses,” Fech knows how to retrieve the proper value, no matter where in the filing that particular software version stores it.

At present Fech only parses presidential filings but can be extended to other filings.
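
Fech itself is Ruby, but the version-mapping idea at its core is easy to sketch in Python. The column positions below are the ones quoted above for “total operating expenses” on the F3P line; everything else is hypothetical:

    # Field -> {filing software version -> 1-based column position} for the
    # F3P summary line. Only the positions quoted above are filled in; a real
    # map would be generated from the F.E.C. data dictionaries.
    F3P_COLUMNS = {
        "total_operating_expenses": {5: 44, 6: 50, 7: 52},
    }

    def field(row, name, version):
        """Look up a named F3P field in a parsed row, whatever the version."""
        position = F3P_COLUMNS[name][version]
        return row[position - 1]  # quoted positions are 1-based

    # The same call works for a version 5, 6 or 7 filing:
    #   value = field(parsed_f3p_line, "total_operating_expenses", version=7)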

OK, so now it is easier to get campaign finance information. Now what?

So members of Congress live in the pockets of their largest supporters. Is that news?

How would you use topic maps to make that news? Serious question.

Or how would you use topic maps to make that extraction a value-add when combined with other New York Times content?


Update: Fech 1.1 Released.

September 23, 2011

Free and Public Data Sets

Filed under: Dataset — Patrick Durusau @ 6:23 pm

Free and Public Data Sets

Some of these will be familiar, some not.

I am aware of a number of government sites that offer a variety of data sets. What I don’t know of is a list of data sets by characteristics. That would include subject matter, format, age, etc.

Suggestions?

September 19, 2011

TCP Text Creation Partnership

Filed under: Concept Drift,Dataset,Language — Patrick Durusau @ 7:51 pm

TCP Text Creation Partnership

From the “mission” page:

The Text Creation Partnership’s primary objective is to produce standardized, digitally-encoded editions of early print books. This process involves a labor-intensive combination of manual keyboard entry (from digital images of the books’ original pages), the addition of digital markup (conforming to guidelines set by a text encoding standard-setting body known as the TEI), and editorial review.

The chief sources of the TCP’s digital images are database products marketed by commercial publishers. These include Proquest’s Early English Books Online (EEBO), Gale’s Eighteenth Century Collections Online (ECCO), and Readex’s Evans Early American Imprints. Idiosyncrasies in early modern typography make these collections very difficult to convert into searchable, machine-readable text using common scanning techniques (i.e., Optical Character Recognition). Through the TCP, commercial publishers and over 150 different libraries have come together to fund the conversion of these cultural heritage materials into enduring, digitally dynamic editions.

To date, the EEBO-TCP project has converted over 25,000 books. ECCO- and EVANS-TCP have converted another 7,000+ books. A second phase of EEBO-TCP production aims to convert the remaining 44,000 unique monograph titles in the EEBO corpus by 2015, and all of the TCP texts are scheduled to enter the public domain by 2020.

Several thousand titles from the 18th century collection are already available to the general public.

I mention this as a source of texts for testing search software against semantic drift, the sort of drift that occurs in any living language. To say nothing of the changing mores of our interpretation of languages with no native speakers remaining to defend them.
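
One cheap way to put these texts to work: expand modern query terms with period spellings and measure how recall changes. A minimal Python sketch (the variant table is illustrative; a real one would be mined from the TCP texts themselves):

    # A few genuine early-modern spellings, as an illustrative variant table.
    VARIANTS = {
        "virtue": ["vertue"],
        "music": ["musick"],
        "public": ["publick"],
    }

    def expand(query_terms):
        """Expand modern terms with early-modern spellings for recall testing."""
        expanded = []
        for term in query_terms:
            expanded.append(term)
            expanded.extend(VARIANTS.get(term, []))
        return expanded

    print(expand(["public", "virtue"]))  # ['public', 'publick', 'virtue', 'vertue']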

Introducing CorporateGroupings

Filed under: Data,Dataset,Fuzzy Matching — Patrick Durusau @ 7:51 pm

Introducing CorporateGroupings: where fuzzy concepts meet legal entities

From the webpage:

One of the key issues when you’re looking at any big company is what are the constituent parts – because these days a company of any size is pretty much never a single legal entity, but a web of companies, often spanning multiple jurisdictions.

Sometimes this is done because the company’s operations are in different territories, sometimes because the company is a conglomerate of different companies – an educational book publisher and a financial newspaper, for example. Sometimes it’s done to limit the company’s tax liability, or for other legal reasons (e.g. to benefit from a jurisdiction’s rules & regulations compared with the ‘parent’ company’s jurisdiction).

Whatever the reason, getting a handle on the constituent parts is pretty tricky, whether you’re a journalist, a campaigner, a government tax official or a competitor, and making it public is trickier still, meaning the same research is duplicated again and again. And while we may all want to ultimately surface in detail the complex cross-holdings of shareholdings between the different companies, that goal is some way off, not least because it’s not always possible to discover the shareholders of a company.

….

So you must make do with reading annual reports and trawling company registries around the world, and hoping you don’t miss any. We like to think OpenCorporates has already made this quite a bit easier, meaning that a single search for Tesco returns hundreds of results from around the world, not just those in the UK, or some other individual jurisdiction. But what about where the companies don’t include the group in the name, and how do you surface the information you’ve found for the rest of the world?

The solution to both, we think, is Corporate Groupings, a way of describing a grouping of companies without having to say exactly what legal form that relationship takes (it may be a subsidiary of a subsidiary, for example). In short, it’s what most humans (i.e. non tax-lawyers) think of when they think of a large company – whether it’s a HSBC, Halliburton or HP.

This could have legs.

Not to mention that what is a separate subject to you (a subsidiary) may be encompassed by a larger subject to me. Both are valid from a certain point of view.
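
The matching step is where the “fuzzy” part lives. A minimal sketch with Python’s difflib (the names and threshold are illustrative; production matching would need normalization of suffixes like Ltd/PLC and much more care):

    from difflib import SequenceMatcher

    candidates = [
        "TESCO PLC",
        "Tesco Stores Limited",
        "Tesco Ireland Ltd",
        "Halliburton Energy Services",
    ]

    def grouping_members(grouping, names, threshold=0.5):
        """Yield names that plausibly belong to a corporate grouping."""
        key = grouping.lower()
        for name in names:
            score = SequenceMatcher(None, key, name.lower()).ratio()
            if score >= threshold or key in name.lower():
                yield name, round(score, 2)

    print(list(grouping_members("Tesco", candidates)))
    # [('TESCO PLC', 0.71), ('Tesco Stores Limited', 0.4), ('Tesco Ireland Ltd', 0.45)]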

September 14, 2011

FigShare

Filed under: Data,Dataset — Patrick Durusau @ 6:59 pm

FigShare

From the website:

Scientific publishing as it stands is an inefficient way to do science on a global scale. A lot of time and money is being wasted by groups around the world duplicating research that has already been carried out. FigShare allows you to share all of your data, negative results and unpublished figures. In doing this, other researchers will not duplicate the work, but instead may publish with your previously wasted figures, or offer collaboration opportunities and feedback on preprint figures.

There wasn’t a category on the site for CS data sets, or rather for the results of processing/searching data sets.

Would that be the same thing?

Thinking it would be interesting to have examples of data analysis that failed along with the data sets in question. Or at least pointers to the data sets.

September 10, 2011

GTD – Global Terrorism Database

Filed under: Authoring Topic Maps,Data,Data Integration,Data Mining,Dataset — Patrick Durusau @ 6:08 pm

GTD – Global Terrorism Database

From the homepage:

The Global Terrorism Database (GTD) is an open-source database including information on terrorist events around the world from 1970 through 2010 (with annual updates planned for the future). Unlike many other event databases, the GTD includes systematic data on domestic as well as international terrorist incidents that have occurred during this time period and now includes more than 98,000 cases.

While chasing down a paper that didn’t make the cut I ran across this data source.

Lacking an agreed-upon definition of terrorism (see Chomsky, for example), you may or may not find what you consider to be incidents of terrorism in this dataset.

Nevertheless, it is a dataset of events of popular interest and can be used to attract funding for your data integration project using topic maps.

August 29, 2011

RuSSIR/EDBT 2011 Summer School

Filed under: Dataset,Machine Learning — Patrick Durusau @ 6:25 pm

RuSSIR/EDBT 2011 Summer School

A machine learning contest, with task descriptions and training set data.

RuSSIR machine learning contest winners’ presentations

Contest tasks are described on http://bit.ly/russir2011. Results are presented in the previous post: http://bit.ly/pr6bSz

Yura Perov: http://dl.dropbox.com/u/1572852/RussirResults/yura_perov_ideas_for_practical_task.pptx

Dmitry Kan and Ivan Golubev: http://dl.dropbox.com/u/1572852/RussirResults/Russir_regression-task-ivan_dima.pptx

Nikita Zhiltsov: http://dl.dropbox.com/u/1572852/RussirResults/nzhiltsov_task2.pdf

Census.IRE.org

Filed under: Dataset — Patrick Durusau @ 6:23 pm

Census.IRE.org

From the website:

Investigative Reporters and Editors is pleased to announce the next phase in our ongoing Census project, designed to provide journalists with a simpler way to access 2010 Census data so they can spend less time importing and managing the data and more time exploring and reporting the data. The project is the result of work by journalists from The Chicago Tribune, The New York Times, USA Today, CNN, the Spokesman-Review (Spokane, Wash.) and the University of Nebraska-Lincoln, funded through generous support from the Donald W. Reynolds Journalism Institute at the Missouri School of Journalism.

You can download bulk data as well as census data in JSON format.

You can browse data by:

Census tracts: Can vary in size but average 4,000 people. Designed to remain relatively stable across decades to allow statistical comparisons. Boundaries defined by local officials using Census Bureau rules.

Places: 1. What most people call cities or towns; a locality incorporated under state law that acts as a local government. 2. An unincorporated area that is well-known locally, defined by state officials under Census Bureau rules and called a “census designated place”; “CDP” is added to the end of the name.

Counties (parishes in LA): The primary subdivisions of states. To cover the full country, this includes Virginia’s cities and Baltimore, St. Louis and Carson City, Nev., which sit outside counties; the District of Columbia; and the boroughs, census areas and related areas in Alaska.

County Subdivisions: There are 2 basic kinds: 1. In 29 states, they have at least some governmental powers and are called minor civil divisions (MCDs). Their names may include variations on “township,” “borough,” “district,” “precinct,” etc. In 12 of those 29 states, they operate as full-purpose local governments: CT, MA, ME, MI, MN, NH, NJ, NY, PA, RI, VT, WI. 2. In states where there are no MCDs, county subdivisions are primarily statistical entities known as census county divisions. Their names end in “CCD.”

[State and USA.]

Great source of census information for use with other data, even proprietary data in your topic map.

August 22, 2011

Public Dataset Catalogs Faceted Browser

Filed under: Dataset,Facets,Linked Data,RDF — Patrick Durusau @ 7:42 pm

Public Dataset Catalogs Faceted Browser

A faceted browser for the catalogs, not their content.

Filter on coverage, location, country (not sure how location and country usefully differ), catalog status (seems to mix status and data type), and managed by.

Do be aware that as the little green balloons disappear with your selection, more of the coloring of the map itself appears.

I mention that because at first it seemed the map was being colored based on the facets I chose. For example, Europe is suddenly dark green when I choose the United States in the filter. Confusing at first, and it makes me wonder: why use a map with underlying coloration anyway? A white map with borders would be a better display background for the green balloons indicating catalog locations.

BTW, if you visit a catalog and then use the back button, all your filters are reset. Not a problem now with a small set of filters and only 100 catalogs, but should this resource continue to grow, that could become a usability issue.

August 20, 2011

WordNet Data > 10.3 Billion Unique Values

Filed under: Dataset,Linguistics,WordNet — Patrick Durusau @ 8:08 pm

WordNet Data > 10.3 Billion Unique Values

Wanted to draw your attention to some WordNet data files.

From the readme.TXT file in the directory:

As of August 19, 2011 pairwise measures for all nouns using the path measure are available. This file is named WordNet-noun-noun-path-pairs.tar. It is approximately 120 GB compressed. In this file you will find 146,312 files, one for each noun sense. Each file consists of 146,313 lines, where each line (except the first) contains a WordNet noun sense and the similarity to the sense featured in that particular file. Doing the math here, you find that each .tar file contains about 21,000,000,000 pairwise similarity values. Note that these are symmetric (sim (A,B) = sim (B,A)) so you have around 10 billion unique values.

We are currently running wup, res, and lesk, but do not have an estimated date of availability yet.

BTW, on verb data:

These files were created with WordNet::Similarity version 2.05 using WordNet 3.0. They show all the pairwise verb-verb similarities found in WordNet according to the path, wup, lch, lin, res, and jcn measures. The path, wup, and lch are path-based, while res, lin, and jcn are based on information content.

As of March 15, 2011 pairwise measures for all verbs using the six measures above are available, each in their own .tar file. Each *.tar file is named as WordNet-verb-verb-MEASURE-pairs.tar, and is approx 2.0 – 2.4 GB compressed. In each of these .tar files you will find 25,047 files, one for each verb sense. Each file consists of 25,048 lines, where each line (except the first) contains a WordNet verb sense and the similarity to the sense featured in that particular file. Doing the math here, you find that each .tar file contains about 625,000,000 pairwise similarity values. Note that these are symmetric (sim (A,B) = sim (B,A)) so you have a bit more than 300 million unique values.
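
The arithmetic is easy to check. A quick sanity pass over both claims, with the counts taken from the file descriptions above:

    # Nouns: 146,312 senses, each file pairing one sense with every sense.
    nouns = 146_312
    print(f"{nouns * nouns:,} noun-noun values")               # ~21.4 billion
    print(f"{nouns * (nouns + 1) // 2:,} unique (symmetric)")  # ~10.7 billion

    # Verbs: 25,047 senses per measure.
    verbs = 25_047
    print(f"{verbs * verbs:,} verb-verb values")               # ~627 million
    print(f"{verbs * (verbs + 1) // 2:,} unique (symmetric)")  # ~314 million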

August 9, 2011

Solr Powered ISFDB

Filed under: Dataset,Lucene,Solr — Patrick Durusau @ 7:58 pm

Solr Powered ISFDB

The first in a series of posts on Solr and the ISFDB. (Try Solr-ISFDB for all the posts.)

ISFDB = Internet Speculative Fiction Database.

A bit over 650,000 documents when this series started last January, so we aren’t talking “big data,” but it’s a fun data set. And the lessons to be learned here will stand us in good stead with much larger data sets.

I haven’t read all the posts yet but did notice comments about modeling relationships. As I work through the posts, I will see how close (or far away) that modeling comes to a topic maps approach.

Working through something like this won’t hurt in terms of preparing for Lucene/Solr certification either. Haven’t decided on that but until we have a topic map certification it would not hurt.
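
If you want to follow along, queries are easy to reproduce from any HTTP client. A minimal Python sketch against a local Solr instance (the core name “isfdb” and the “title” field are assumptions; match them to however you load the data):

    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    # Standard Solr select handler; assumes a local core named "isfdb".
    params = urlencode({"q": "title:foundation", "wt": "json", "rows": 5})
    with urlopen(f"http://localhost:8983/solr/isfdb/select?{params}") as resp:
        result = json.load(resp)

    for doc in result["response"]["docs"]:
        print(doc.get("title"))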

July 29, 2011

1.USA.gov hackathon this Friday (July 29, 2011)

Filed under: Dataset — Patrick Durusau @ 7:45 pm

1.USA.gov hackathon this Friday (July 29, 2011)

Drew Conway reports:

On Friday, July 29, 2011 USA.gov will host its first ever open data/hack day event. As I am a New Yorker, I am very excited to be participating at the NYC satellite event, but I wanted to pass along this information to those of you who may not have seen it yet, or wish to participate at one of the other locations. Here is the pertinent information from the official announcement:

Apologies for the late notice, but I assume the data is still going to be available:

In March, we announced a new URL shortening service called 1.USA.gov. 1.USA.gov automatically creates .gov URLs whenever you use bitly to shorten a URL that ends in .gov or .mil. We created this service to make it easy for people to know when a short URL will lead to official, and trustworthy, government information.

Data is created every time someone clicks on a 1.USA.gov link, which happens about 56,000 times each day. Together, these clicks show what government information people are sharing with their friends and networks. No one has ever had such a broad view of how government information is viewed and shared online.

Today, we’re excited to announce that all of the data created by 1.USA.gov clicks is freely available through the Developers page on USA.gov. We want as many people as possible to benefit from the insights we get from 1.USA.gov.

Doesn’t 56,000 times a day sound a little low? I don’t doubt the numbers but I am curious about the lack of uptake.

Does anyone have numbers on the other URL shortening services for comparison?
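
If you want to check uptake yourself, the published click records are one JSON object per line. A minimal Python sketch that tallies which .gov hosts are being shared (the file name is hypothetical; the “u” field carries the long URL in the published records):

    import json
    from collections import Counter
    from urllib.parse import urlparse

    hosts = Counter()
    with open("usagov_clicks.jsonl", encoding="utf-8") as f:  # hypothetical dump
        for line in f:
            record = json.loads(line)
            if "u" in record:  # "u" holds the long (.gov or .mil) URL
                hosts[urlparse(record["u"]).netloc] += 1

    for host, n in hosts.most_common(10):
        print(n, host)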

July 3, 2011

Who’s Your Daddy?

Filed under: Data Source,Dataset,Marketing,Mashups,Social Graphs,Social Networks — Patrick Durusau @ 7:30 pm

Who’s Your Daddy? (Genealogy and Corruption, American Style)

NPR (National Public Radio) News broadcast the opinion this morning that Brits are marginally less corrupt than Americans. Interesting question. Was Bonnie less corrupt than Clyde? Debate at your leisure but the story did prompt me to think of an excellent resource for tracking both U.S. and British style corruption.

It is probably all the talk of lineage in the news lately, but why not use the genealogy records that are gathered so obsessively to track the soft corruption of influence?

Just another data set to overlay on elected, appointed, and hired positions, lobbyists, disclosure statements, contributions, known sightings, congressional legislation and administrative regulations, etc. Could lead to a “Who’s Your Daddy?” segment on NPR where employment or contracts are questioned, naming names. That would be interesting.

It also seems more likely to be effective than the “disclose your corruption” sunlight approach. Corruption is never confessed; it has to be rooted out.

June 25, 2011

HackReduce Data

Filed under: Conferences,Dataset,Hadoop,MapReduce — Patrick Durusau @ 8:49 pm

HackReduce Data

Data sets, and instructions for using them, for the Hack/Reduce Big Data hackathon.


Always nice to have data of interest to a user community when demonstrating topic maps.

June 19, 2011

Open Government Data 2011 wrap-up

Filed under: Conferences,Dataset,Government Data,Public Data — Patrick Durusau @ 7:35 pm

Open Government Data 2011 wrap-up by Lutz Maicher.

From the post:

On June 16, 2011 the OGD 2011 – the first Open Data Conference in Austria – took place. Thanks to a lot of preliminary work by the Semantic Web Company, the topic of open (government) data is very hot in Austria, especially in Vienna and Linz. Hence 120 attendees (see the list here) for the first conference is a real success. Congrats to the organizers. And congrats to the community, which made the conference a very vital and interesting event.

If there is a Second Open Data Conference, it is a venue where topic maps should put in an appearance.

PublicData.EU Launched During DAA

Filed under: Dataset,Government Data,Public Data — Patrick Durusau @ 7:33 pm

PublicData.EU Launched During DAA

From the post:

During the Digital Agenda Assembly this week in Brussels the new portal PublicData.EU was launched in beta. This is a step aimed to make public data easier to find across the EU. As it says on the ‘about’ page:

“In order to unlock the potential of digital public sector information, developers and other prospective users must be able to find datasets they are interested in reusing. PublicData.eu will provide a single point of access to open, freely reusable datasets from numerous national, regional and local public bodies throughout Europe.

Information about European public datasets is currently scattered across many different data catalogues, portals and websites in many different languages, implemented using many different technologies. The kinds of information stored about public datasets may vary from country to country, and from registry to registry. PublicData.eu will harvest and federate this information to enable users to search, query, process, cache and perform other automated tasks on the data from a single place. This helps to solve the “discoverability problem” of finding interesting data across many different government websites, at many different levels of government, and across the many governments in Europe.

In addition to providing access to official information about datasets from public bodies, PublicData.eu will capture (proposed) edits, annotations, comments and uploads from the broader community of public data users. In this way, PublicData.eu will harness the social aspect of working with data to create opportunities for mass collaboration. For example, a web developer might download a dataset, convert it into a new format, upload it and add a link to the new version of the dataset for others to use. From fixing broken URLs or typos in descriptions to substantive comments or supplementary documentation about using the datasets, PublicData.eu will provide up to date information for data users, by data users.”

PublicData.EU is built by the Open Knowledge Foundation as part of the LOD2 project. “PublicData.eu is powered by CKAN, a data catalogue system used by various institutions and communities to manage open data. CKAN and all its components are open source software and used by a wide community of catalogue operators from across Europe, including the UK Government’s data.gov.uk portal.”

Here’s a European marketing opportunity for topic maps. How would a topic map solution be different from what is offered here? (There are similar opportunities in the US as well.)
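
Since PublicData.eu is powered by CKAN, one answer starts from CKAN’s standard search API. A minimal Python sketch (shown against CKAN’s public demo endpoint; substitute the catalogue you actually want to query):

    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    # CKAN's package_search action returns matching datasets as JSON.
    base = "https://demo.ckan.org/api/3/action/package_search"
    with urlopen(f"{base}?{urlencode({'q': 'spending', 'rows': 5})}") as resp:
        result = json.load(resp)

    print(result["result"]["count"], "datasets match")
    for pkg in result["result"]["results"]:
        print("-", pkg["title"])

A topic map layer would go further than keyword search: it could say on what basis two catalogue entries describe the same dataset, which is exactly the "discoverability problem" the post describes.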

June 6, 2011

Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2011)

Filed under: Challenges,Conferences,Dataset,Semantic Web — Patrick Durusau @ 1:57 pm

Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2011)

Full-day workshop in conjunction with the 10th International Semantic Web Conference 2011, 23/24 October 2011, Bonn, Germany

Important Dates

Deadline for paper submission: 8 August 2011 23:59 (11:59pm) Hawaii time
Notification of Acceptance: 29 August 2011 23:59 (11:59pm) Hawaii time
Camera-ready version: 8 September 2011
Workshop: 23 or 24 October 2011

Abstract:

The goal of DeRiVE 2011 is to strengthen the participation of the semantic web community in the recent surge of research on the use of events as a key concept for representing knowledge and organizing and structuring media on the web. The workshop invites contributions to three central questions, and the goal is to formulate answers to these questions that advance and reflect the current state of understanding of events in the semantic web. Each submission will be expected to address at least one question explicitly, and, if possible, include a system demonstration. We have released an event challenge dataset for use in the preparation of contributions, with the goal of supporting a shared understanding of their impact. A prize will be awarded for the best use(s) of the dataset; but the use of other datasets will also be allowed.

See the CFP for questions papers must address.

Also note the anticipated release of a dataset:

We will release a dataset of event data. In addition to regular papers, we invite everybody to submit a Data Challenge paper describing work on this dataset. We welcome analyses, extensions, alignments or modifications of the dataset, as well as applications and demos. The best Data Challenge paper will get a prize.

The dataset consists of over 100,000 events from three sources: the music website Last.fm, and the entertainment websites upcoming.yahoo.com and eventful.com. All three are represented in the LODE schema. Next to events, they contain artists, venues, and location and time information. Some links between the instances of the three datasets are provided.

Suggestions for modeling events in topic maps?
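
As a baseline for that exercise, here is roughly what one event looks like when built against the LODE schema with rdflib (the atPlace/atTime/involvedAgent property names are from the LODE ontology; the event itself is invented):

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    LODE = Namespace("http://linkedevents.org/ontology/")
    EX = Namespace("http://example.org/")

    g = Graph()
    event = EX["event/42"]
    g.add((event, RDF.type, LODE.Event))
    g.add((event, LODE.atPlace, EX["venue/paradiso"]))          # where
    g.add((event, LODE.involvedAgent, EX["artist/radiohead"]))  # who
    g.add((event, LODE.atTime, Literal("2011-10-23", datatype=XSD.date)))  # when

    print(g.serialize(format="turtle"))

A topic map rendering would make the event a topic and the where/who/when properties associations, with scope available to track which of the three source sites contributed each statement.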

June 5, 2011

Kasabi

Filed under: Dataset,Graphs,Marketing,Topic Maps — Patrick Durusau @ 3:21 pm

Kasabi

A dataset collection, curation and interface website that is currently in a public beta.

Summarized in part as:

Search, Browse, Explore

You can browse through the catalog to find datasets based on their category, or search via keywords. From each dataset’s homepage you can quickly find useful information about its provenance, licensing and a snapshot of useful metrics such as when the dataset was last updated.

Using the Explore tools will get you deeper into the dataset: drilling down into detailed documentation and sample data.

Datasets and APIs

Every dataset in Kasabi has a range of core APIs listed right on the dataset homepage or discoverable through the search and browse tools. Choose the API that best supports what you need to do, whether it’s a search over the data or more complex queries. Subscribe to an API to immediately gain access using your API key. Your dashboard lists all your subscribed APIs, and each has a useful reference card of parameters and response formats available from its homepage. Need more detailed docs? We have those too.

Contribute APIs

Can’t find an API that matches your application? In Kasabi, you can contribute your own using our API building tools. These tools let developers create customised RESTful APIs that capture ways of querying or navigating across a dataset, producing results in a variety of built-in and custom formats. All contributed APIs are listed in the catalog, along with automatically generated documentation, allowing them to be shared with the Kasabi community.

The Contribute APIs feature looks quite interesting, particularly since all the datasets are stored as separate graph databases.

A bit more from the FAQ on custom APIs:

A custom API allows you tailor access to the dataset. This custom access will then be suited to your particular application or user community. By creating and maintaining a custom API over the data, you won’t be constrained by the default APIs provided by Kasabi or the data owner.

By allowing the developer community to share its skills in ways other than just creating applications, Kasabi lets us broaden the definition of data curation to cover APIs and access as well as the data itself.

Only fifty-nine (59) datasets as of June 4, 2011, with a definite UK flavor, but I expect that will grow fairly quickly. The usual suspects (the CIA World Factbook, BBC, New York Times, DBpedia) are all present. More than enough information to make topic map interfaces interesting. The principal advantage of topic map interfaces is the ability to specify a basis for a mapping, thereby enabling other researchers to follow or not, as they choose.

May 27, 2011

Zanran

Filed under: Data Source,Dataset,Search Engines — Patrick Durusau @ 12:36 pm

Zanran

A search engine for data and statistics.

I was puzzled by results containing mostly PDF files until I read:

Zanran doesn’t work by spotting wording in the text and looking for images – it’s the other way round. The system examines millions of images and decides for each one whether it’s a graph, chart or table – whether it has numerical content.

Admittedly you may have difficulty re-using such data but finding it is a big first step. You can then contact the source for the data in a more re-usable form.

From Hints & Helps:

Language. English only please… for now.
Phrase search. You can use double quotes to make phrases (e.g. “mobile phones”).
Vocabulary. We have only limited synonyms – please try different words in your query. And we don’t spell-check … yet.

From the website:

Zanran helps you to find ‘semi-structured’ data on the web. This is the numerical data that people have presented as graphs and tables and charts. For example, the data could be a graph in a PDF report, or a table in an Excel spreadsheet, or a barchart shown as an image in an HTML page. This huge amount of information can be difficult to find using conventional search engines, which are focused primarily on finding text rather than graphs, tables and bar charts.

Put more simply: Zanran is Google for data.

Well said.
