Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

May 25, 2011

Open government sites scrapped due to budget cuts

Filed under: Data Source,Dataset — Patrick Durusau @ 1:24 pm

Open government sites scrapped due to budget cuts

This isn’t so much surprising as it is disappointing. We now know the priority that “open” government has in U.S. government budgetary discussions.

I could go on at length about this decision, the people who made it, complete with speculation on their motives, morals and parentage. Unfortunately, that would not restore the funding nor would it be a useful exercise.

As an alternative, let me suggest that everyone select one or two of the data sets that are already available and do something interesting. Something that will catch the imagination of the average citizen. Then credit these government sites as the sources and gently point out that with more funding, there would be more data. And hence more interesting things to see.

Asking someone at the agencies that produce data could result in interesting suggestions. They may lack the time, resources, or personnel to do something really creative but with their ideas and your talents…, well, the result could interest the agency and the public. These agencies are the ones fighting on the inside of the public budget process for funding.

What data sets and ideas for those data sets do you think would have the most appeal or impact?

May 23, 2011

Health: Public-Use Data Files and Documentation

Filed under: Data Source,Dataset — Patrick Durusau @ 7:45 pm

Health: Public-Use Data Files and Documentation

While looking for other data files, I ran across this resource.

Public health is always a popular topic (sorry!).

May 21, 2011

FamilySearch.org

Filed under: Dataset,Marketing — Patrick Durusau @ 5:17 pm

FamilySearch.org

After locating the census record abstracts for record linkage, it occurred to me to look for census records for other countries.

Which fairly quickly put me out at family history sites.

FamilySearch.org looks like one of the better ones.

Pointers to very diverse sets of records which should provide grist for any matching algorithms as well as modeling issues for other information.

I am not familiar with the software in this area but my impression is that a lot of effort has gone into even the free stuff, so poor UIs or poorly performing apps need not apply. Topic maps are going to have to offer a real value add to get traction in this area.

If you investigate or are in the family history area, post a note on whether current software allows merging family histories together.

opencorporates

Filed under: Authoring Topic Maps,Dataset — Patrick Durusau @ 5:14 pm

opencorporates – The Open Corporate Database of the World

An “alpha” status project that is collecting corporate registration/report information from around the world.

As of 21 May 2011, 12,678,041 companies.

Five US states plus District of Columbia, United Kingdom, Netherlands and a scattering of others.

This is a useful data source, provided the corporations of interest fall in a covered jurisdiction.

The following video illustrates the usefulness of this site:

How to use OpenCorporates to match companies in Google Refine

Certainly looks like a useful tool for populating a topic map to me!

That may be the ultimate value of all the Linked Data efforts. Being the step before reconciliation of information into a reliable form for merger with other reconciled information. At some point raw information has to be gathered together and a rough cut gathering with Linked Data is as good as any other method.

May 20, 2011

Integrated Public Use Microdata Series (IPUMS-USA)

Filed under: Dataset,Record Linkage — Patrick Durusau @ 4:07 pm

Integrated Public Use Microdata Series (IPUMS-USA)

Lars Marius asked about some test data files for his Duke 0.1 release.

A lot of record linkage work is on medical records so there are disclosure agreements/privacy concerns, etc.

Just poking around for sample data sets and ran across this site.

From the website:

IPUMS-USA is a project dedicated to collecting and distributing United States census data. Its goals are to:

  • Collect and preserve data and documentation
  • Harmonize data
  • Disseminate the data absolutely free!

Goes back to the 1850 US Census and comes forward.

More data sets than I can easily describe and more are being produced.

Occurs to me that this could be good data for testing topic map techniques.
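As a rough illustration of what such testing might look like, here is a minimal record linkage sketch in Python. The field names (`name`, `birth_year`, `state`), the sample records and the 0.85 threshold are all invented for illustration. Nothing here is taken from IPUMS-USA or from Duke, and the matching rule is only one naive possibility.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a 0..1 similarity ratio for two names, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def same_person(rec_a, rec_b, threshold=0.85):
    """Guess whether two census-style records describe the same person.

    Hypothetical rule: birth year and state must match exactly, and the
    names must be similar enough to survive spelling variation.
    """
    if rec_a["birth_year"] != rec_b["birth_year"]:
        return False
    if rec_a["state"] != rec_b["state"]:
        return False
    return name_similarity(rec_a["name"], rec_b["name"]) >= threshold


def link(records_a, records_b, threshold=0.85):
    """Return index pairs (i, j) where records_a[i] appears to match records_b[j]."""
    return [
        (i, j)
        for i, a in enumerate(records_a)
        for j, b in enumerate(records_b)
        if same_person(a, b, threshold)
    ]


# Invented sample records, not real census data
census_1850 = [
    {"name": "Jonathan Smith", "birth_year": 1820, "state": "GA"},
    {"name": "Mary Jones", "birth_year": 1831, "state": "OH"},
]
census_1860 = [
    {"name": "Mary Jones", "birth_year": 1831, "state": "NY"},
    {"name": "Jonathon Smith", "birth_year": 1820, "state": "GA"},
]

matches = link(census_1850, census_1860)
```

Note that the exact-match rule on state would miss anyone who moved between censuses, which is exactly the sort of modeling question real data would force into the open.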

Enjoy!

Seevl

Filed under: Dataset,Interface Research/Design,Linked Data,Music Retrieval,Semantic Web — Patrick Durusau @ 4:04 pm

Seevl: Reinventing Music Discovery

If you are interested in music or interfaces, this is a must stop location!

Simple search box.

I tried searching for artists, albums, types of music.

In addition to search results you also get suggestions of related information.

The Why is this related? link for related information was particularly interesting. It offers a “why” additional information was offered for a particular search result.

Developers can access their data for non-commercial uses for free.

The simplicity of the interface was a real plus.

May 19, 2011

Search Your Gmail Messages with ElasticSearch and Ruby

Filed under: Dataset,ElasticSearch,Search Data,Search Engines,Search Interface — Patrick Durusau @ 3:26 pm

Search Your Gmail Messages with ElasticSearch and Ruby

From the website:

If you’d like to check out ElasticSearch, there’s already lots of options where to get the data to feed it with. You can use a Twitter or Wikipedia river to fill it with gigabytes of public data, or you can feed it very quickly with some RSS feeds.

But, let’s get a bit personal, shall we? Let’s feed it with your own e-mail, imported from your own Gmail account.

A useful way to teach basic searching.

After all, a search of Wikipedia or Twitter may return impressive results, but are they correct results?

Hard for a user to say because both Wikipedia and Twitter are large enough that verification (other than by other programs) of search results isn’t possible.

Assuming your Gmail inbox is smaller than Wikipedia, you should be able to recognize which results are “correct” and which ones look “off.”

And you may learn some Ruby in the bargain.

Not a bad day’s work. 😉


PS: You may want to try the links on mining Twitter, Wikipedia and RSS feeds with ElasticSearch.

May 18, 2011

Datalift

Filed under: Dataset,Linked Data,Semantic Web — Patrick Durusau @ 6:42 pm

Datalift (also available in French)

From the webpage:

Datalift brings raw structured data coming from various formats (relational databases, CSV, XML, …) to semantic data interlinked on the Web of Data.

Datalift is an experimental research project funded by the French national research agency. Its goal is to develop a platform to publish and interlink datasets on the Web of data. Datalift will both publish datasets coming from a network of partners and data providers and propose a set of tools for easing the datasets publication process.

A few steps to data heaven

The project will provide tools allowing to facilitate each step of the publication process:

  • selecting ontologies for publishing data
  • converting data to the appropriate format (RDF using the selected ontology)
  • publishing the linked data
  • interlinking data with other data sources
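The conversion step in the list above can be sketched without any RDF library. This is only a minimal sketch and not Datalift's tooling: the base URI `http://example.org/`, the column names and the one-predicate-per-column mapping are assumptions made for illustration. A real project would map columns to terms from a selected ontology rather than to invented predicates.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical base URI, not a Datalift one


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(text, id_column):
    """Turn each CSV row into one N-Triples line per non-empty column.

    Assumes the id values and column names are already safe to embed
    in an IRI (no spaces or reserved characters).
    """
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = "<%sresource/%s>" % (BASE, row[id_column])
        for column, value in row.items():
            if column == id_column or not value:
                continue  # the id becomes the subject; empty cells yield no triple
            predicate = "<%sproperty/%s>" % (BASE, column)
            lines.append('%s %s "%s" .' % (subject, predicate, escape_literal(value)))
    return lines


# Invented sample data
sample = "id,name,country\n1,Datalift,France\n2,Example,\n"
triples = csv_to_ntriples(sample, "id")
```

Everything comes out as a plain string literal here. Interlinking with other data sources would mean emitting IRIs rather than literals for values such as the country.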

The project is funded for three years so it needs to hit the ground running.

I am sure they would appreciate useful feedback.

May 15, 2011

Data First

Filed under: Dataset,Topic Maps — Patrick Durusau @ 5:56 pm

The Death of Open Data?

Not recent (April 2011) but an interesting article from Technology Review (MIT) on the impact of budget cuts on Obama’s Open Data initiatives.

Speaking of data.gov, Rufus Pollock (director of the Open Knowledge Foundation) says:

Pollock says what’s most concerning about cutting the Electronic Government Fund is that it represents a turn away from Obama’s open-government policies. “The website is really great, but the crucial thing is the actual data,” he says. Though data.gov is a symbol whose loss would be painful, the real question is whether the U.S. government will continue to make its data more accessible and useful, with or without it.

In an era of budget reductions, it needs to be data first (including documentation) and agency website search/presentations later, if at all.

If you know anyone with an effective voice in this debate, suggest to them data first, agency presentation/spin only after the data has been posted for download.

May 3, 2011

Introducing Druid: Real-Time Analytics at a Billion Rows Per Second

Filed under: Data Structures,Dataset — Patrick Durusau @ 1:17 pm

Introducing Druid: Real-Time Analytics at a Billion Rows Per Second

A general overview of Druid and the choices that led up to it.

The next post is said to have details about the architecture, etc.

From what I read here, the holding of all data in memory is one critical part of the solution.

That and having data that can be held in smallish cells.

Tossing blobs, ASCII or binary, into cells, might cause a problem.

Won’t know until the software is available for use by a diverse audience.

I mention it here as an example of defining data sets and requirements in such a way that scalable architectures can be developed, for that particular set of requirements.

There is nothing wrong with having a solution that works best for a particular application.

Ballpoint pens are wonderful writing devices but fail miserably as hammers.

A software or technology solution that works for your problem is far more valuable than software that solves the general case but not yours.

April 6, 2011

OPEN! Government Data

Filed under: Dataset — Patrick Durusau @ 6:21 pm

OPEN! Government Data

Another listing of government data sets and other materials.

Relevant for topic maps as more grist for a topic map mill.

May have a certain sense of urgency in the United States as several of the government sponsored data sites will be going dark later this year. Budget cuts.

Why the transparency minded Obama administration and the secretive opposition would agree on less government transparency isn’t clear.

I note that agreement only to point out that if you are going to copy data currently available for later use in topic maps, the time to do so is now.

*****
PS: Not that access to data = transparency but in the absence of data, there isn’t even a basis for transparency.

List of European Open Data Catalogues

Filed under: Dataset — Patrick Durusau @ 6:20 pm

List of European Open Data Catalogues

From the website:

Following is a list of open data catalogues from around European member states, sorted by country. This list is very much a work in progress.

EU oriented listing of open data catalogues.

publicdata.eu — Europe’s public data

Filed under: Dataset — Patrick Durusau @ 6:19 pm

publicdata.eu — Europe’s public data

Noticed this in the Open Data Challenge materials and thought it merited a separate entry.

Deserves a visit, if for no other reason than the home page that lists “Places” as: United Kingdom, England, Wales, Scotland, Northern Ireland, and, International.

Where “International” includes the United States, Australia, Afghanistan, and oh, yes, the rest of Europe.

The first fourteen entries from International will give you an idea of the range of the data sets:

* German federal budget (OffenerHaushalt)
* 2000 U.S. Census in RDF (rdfabout.com)
* 32000 Naples Florida Businesses in KML format
* Airborne Antarctic Ozone Experiment (AAOE-87)
* AcaWiki
* Acupuncture & Moxibustion in London
* Asian Development Bank (ADB) – Statistical Database System (SDBS)
* Addgene
* Adopt a Roadside (Victoria, Australia)
* Advances in Dental Research
* Aegean Archaeomalacology
* Afghanistan Election Data
* Agricultural and forestry exports from New Zealand
* AGROVOC

Open Data Challenge

Filed under: Contest,Dataset — Patrick Durusau @ 6:19 pm

Open Data Challenge

EU residents and organizations with operations in the EU can compete in four basic categories:

  • Ideas – Anyone can suggest an idea for projects which reuse public information to do something interesting or useful.
  • Apps – Teams of developers can submit working applications which reuse public information.
  • Visualisations – Designers, artists and others can submit interesting or insightful visual representations of public information.
  • Datasets – Public bodies can submit newly opened up datasets, or developers can submit derived datasets which they’ve cleaned up, or linked together

Runs 5 April to 5 June, 2011

See the site for various rules and details.

April 2, 2011

Pathogen Portal

Filed under: Bioinformatics,Biomedical,Dataset — Patrick Durusau @ 5:34 pm

Pathogen Portal, The Bioinformatics Resource Centers Portal.

From the website:

Pathogen Portal is a repository linking to the Bioinformatics Resource Centers (BRCs) sponsored by the National Institute of Allergy and Infectious Diseases (NIAID) and maintained by The Virginia Bioinformatics Institute. The BRCs are providing web-based resources to scientific community conducting basic and applied research on organisms considered potential agents of biowarfare or bioterrorism or causing emerging or re-emerging diseases.

Motherlode of resources and datasets on “…potential agents of biowarfare or bioterrorism….”

I read an article years ago in Popular Science about smearing punji stakes with water buffalo excrement. A primitive, but effective, form of biowarfare.

I suppose that would fall in the realm of applied research for purposes of a topic map.

March 25, 2011

Open-source Data Science Toolkit

Filed under: Dataset,Geographic Data,Geographic Information Retrieval,Software — Patrick Durusau @ 4:32 pm

Open-source Data Science Toolkit

From Flowingdata.com:

Pete Warden does the data community a solid and wraps up a collection of open-source tools in the Data Science Toolkit to parse, geocode, and process data.

Mostly geographic material but some other interesting tools, such as extracting the “main” story from a document. (It has never encountered one of my longer email exchanges with Newcomb. 😉 )

It is interesting to me that so many tools and data sets related to geography appear so regularly.

GIS (geographic information systems) can be very hard but perhaps they are easier than the semantic challenges of say medical or legal literature.

That is, it is easier to say “here you are” with regard to a geographic system than to locate a subject in a conceptual space which has been partially captured by a document.

Suspect the difference in hardness could only be illustrated by example and not by some test. Will have to give that some thought.

Open Data is not Transparency

Filed under: Dataset — Patrick Durusau @ 4:31 pm

Open Data is not Transparency

From the blog:

There are many encouraging signs of late in the general area of open data. However, one thing that has to be kept in mind with this movement is that open data is only part of transparency – it is necessary but not sufficient. If the data is not understandable by the intended audience (and open data suggests a very broad audience) then there is no transparency. The information and knowledge locked in the data will be hiding in plain sight.

This thought suggests that any open data movement has to be combined with a ‘plain English’ (or ‘plain ‘) programme and an investment in data literacy. In addition, to take the whole movement to its obvious conclusion, there should be some well defined success criteria. What is the answer to the question: what happens to whom when citizens experience open data?

For a similar take, see my: Baltimore – Semi-Transparent or Semi-Opaque?

What surprises me is that, in response to the Cablegate scandal, the State Department did not simply start dumping all their daily output to the web. In all its inconsistent formats, vocabularies, etc.

The ensuing flood of data would effectively hide any secrets they may have far more effectively than any security protocol.

I can imagine the news conference now: “You found a document that said what? Imagine that!”

With no attribution it could be anyone from the janitor writing a novel on their lunch break to Hillary finally saying good-bye to Bill.

And that would allow the reservation of “top-secret” for things like launch codes, where is the red button, stuff like that.

March 21, 2011

KDD Cup

Filed under: Dataset,Examples,Music Retrieval — Patrick Durusau @ 8:48 am

KDD Cup

From the website:

People have been fascinated by music since the dawn of humanity. A wide variety of music genres and styles has evolved, reflecting diversity in personalities, cultures and age groups. It comes as no surprise that human tastes in music are remarkably diverse, as nicely exhibited by the famous quotation: “We don’t like their sound, and guitar music is on the way out” (Decca Recording Co. rejecting the Beatles, 1962).

Yahoo! Music has amassed billions of user ratings for musical pieces. When properly analyzed, the raw ratings encode information on how songs are grouped, which hidden patterns link various albums, which artists complement each other, and above all, which songs users would like to listen to.

Such an exciting analysis introduces new scientific challenges. The KDD Cup contest releases over 300 million ratings performed by over 1 million anonymized users. The ratings are given to different types of items-songs, albums, artists, genres-all tied together within a known taxonomy.

Important dates:

March 15, 2011 Competition begins

June 30, 2011 Competition ends

July 3, 2011 Winners notified

August 21, 2011 Workshop

An interesting data set that focuses on machine learning and prediction.

Equally interesting would be merging this data set with other music data sets.

March 19, 2011

Busses Come In Threes, Why Do Proofs Come In Two’s? – Post

Filed under: Dataset,Mathematics Indexing — Patrick Durusau @ 5:53 pm

Busses Come In Threes, Why Do Proofs Come In Two’s?

Dick Lipton, at Gödel’s Lost Letter explores:

Why do theorems get proved independently at the same time

Jacques Hadamard and Charles-Jean de la Vallée Poussin, Neil Immerman and Robert Szelepcsenyi, Steve Cook and Leonid Levin, Georgy Egorychev and Dmitry Falikman, Sanjeev Arora and Joseph Mitchell, are pairs of great researchers. Each pair proved some wonderful theorem, yet they did this in each case independently and at almost the same time.

Interesting in its own right but I mention it here to raise the issue of the use of topic maps to bridge the use of different nomenclatures.

Would that increase the incidence of discovery of independent proofs of theorems?

Even harder to answer: Would bridging different nomenclatures increase the incidence of independent proofs of theorems?

Thinking that all such proofs need not be of famous theorems.

Could have independent proofs of lesser theorems as well.

The 2010 Mathematics Subject Classification is no doubt very useful but too crude to assist in the discovery of duplicate proofs (or, beyond general areas to look, of proofs altogether).

March 15, 2011

WeatherSpark

Filed under: Dataset,Mashups — Patrick Durusau @ 5:33 am

WeatherSpark

Courtesy of Flowingdata.com, the WeatherSpark site is a graphic and historical representation of weather conditions.

From the site:

WeatherSpark is a new type of weather website, with interactive weather graphs that allow you to pan and zoom through the entire history of any weather station on earth.

Get multiple forecasts for the current location, overlaid on records and averages to put it all in context.

Unlike some mashups it is fairly apparent what is being used as a binding point. Which would make re-use of this data easier.

For example, if I were looking for weak points in a transportation system, I would take the traffic accident/delay records and then map them against the weather records from this site.

Thereby enabling predictions of when and where disruptive activity would have the greatest multiplier effect from natural weather conditions, time of day, etc.
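To make the binding point concrete, here is a minimal Python sketch of that kind of mapping. It assumes the binding point is a (weather station, date) pair, which is my guess at how such records would line up. The station code, dates, conditions and delay figures are all invented for illustration and do not come from WeatherSpark.

```python
from collections import defaultdict

# Invented weather observations keyed by the assumed binding point: (station, date)
weather = {
    ("KATL", "2011-01-10"): "snow",
    ("KATL", "2011-01-11"): "clear",
    ("KATL", "2011-01-12"): "snow",
}

# Invented traffic delay records: (nearest station, date, delay in minutes)
delays = [
    ("KATL", "2011-01-10", 95),
    ("KATL", "2011-01-11", 10),
    ("KATL", "2011-01-12", 120),
    ("KATL", "2011-01-13", 15),  # no weather record for this day
]


def average_delay_by_condition(delays, weather):
    """Join delays to weather on (station, date) and average the delay per condition."""
    totals = defaultdict(lambda: [0, 0])  # condition -> [sum of minutes, record count]
    for station, date, minutes in delays:
        condition = weather.get((station, date))
        if condition is None:
            continue  # skip delay records with no matching weather observation
        totals[condition][0] += minutes
        totals[condition][1] += 1
    return {cond: total / count for cond, (total, count) in totals.items()}


averages = average_delay_by_condition(delays, weather)
```

Because the join key is explicit, adding time of day or another source is a matter of widening the key, not re-deriving how the two data sets relate.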

March 12, 2011

UK Science, Media, Railway Data Dump!

Filed under: Dataset — Patrick Durusau @ 6:48 pm

Documentation for collections data from Science Museum, National Media Museum, National Railway Museum (NMSI) released as CSV was the original title.

OK, so I took some liberties with the title.

It is one thing to have an interesting data set. It is quite another to get enough attention to encourage its use.

Pass this along to science, media and railroad sites and lists. I am sure some of the partisans there will be interested.

Questions: (Remember, I promised to return to these.)

  1. Choose one of the collections. Describe the topic map you would create with the data. (4-6 pages, no citations)
  2. What aspects of your topic map make it easier to incorporate additional information from other sources? (4-6 pages, no citations)
  3. Outline your design of an interface for delivery of content from your topic map. (4-6 pages, no citations)
  4. For extra credit, up to and including no final, create your topic map. (subject to instructor approval)

March 7, 2011

Microsoft Academic Search

Filed under: Dataset,Search Engines — Patrick Durusau @ 7:09 am

Microsoft Academic Search

I ran across a reference to this search engine in a thread bitching about ranking of publications, etc.

I suppose so, but my first reaction was that of a kid in a candy store.

Hard to know of:

  • Algorithms & Theory
  • Artificial Intelligence
  • Bioinformatics & Computational Biology
  • Computer Education
  • Computer Vision
  • Databases
  • Data Mining
  • Distributed & Parallel Computing
  • Graphics
  • Hardware & Architecture
  • Human-Computer Interaction
  • Information Retrieval
  • Machine Learning & Pattern Recognition
  • Multimedia
  • Natural Language & Speech
  • Networks & Communications
  • Operating Systems
  • Programming Languages
  • Real-Time & Embedded Systems
  • Scientific Computing
  • Security & Privacy
  • Simulation
  • Software Engineering
  • World Wide Web
  • Computer Science Overall
  • Other Domains Overall

…which to choose first!

As far as the critics of this site, I have to agree it isn’t everything it could be.

But that is a good thing because it leaves Microsoft and everyone else something to strive for.

I don’t have any illusions about corporate entities, including Microsoft.

But, all of them have people working for them who do good work, that benefits the public interest, and who are doing so while working for a corporate entity.

I know that because I know people who work for a number of the larger software corporate entities.

I am sure you know some of them too.

NPTEL – Computer Science

Filed under: CS Lectures,Dataset — Patrick Durusau @ 7:08 am

NPTEL – Computer Science

An extensive set of computer science lectures courtesy of a joint venture of the Indian Institutes of Technology and the Indian Institute of Science.

I listed this both as a CS lecture and a dataset as it occurs to me that it would be really useful to have a topic map of online CS courses.

If a student doesn’t “get” a concept when explained in one lecture, another approach, by another lecturer, could turn the trick.

Not something I am going to get to soon but the type of thing I need to create as a framework to capture that sort of information as I encounter it.

Or even better a framework to which others could contribute to a map as they find such resources.

Seed it with courses from NPTEL, MIT, Stanford and maybe a couple of other places. Enough to make it worthwhile on its own.

Something to think about.

Shout out if you are interested or want to take the lead.

March 4, 2011

Table competition at ICDAR 2011

Filed under: Dataset,Subject Identity — Patrick Durusau @ 10:40 am

I first noticed this item at Matthew Hurst’s blog Table Competition at ICDAR 2011.

As a markup person with some passing familiarity with table encoding issues, this is just awesome!

Update: March 10, 2011 Competition registration, which consists of expressing interest in competing, by email, to the competition organisers

The basic description is OK:

Motivation: Tables are a prominent element of communication in documents, often containing information that would take many a paragraph to write otherwise. The first step to table understanding is to draw the table’s physical model, i.e. identify its location and component cells, rows and columns. Several authors have dedicated themselves to these tasks, using diverse methods, however it is difficult to know which methods work best under which circumstance because of the diverse testing conditions used by each. This competition aims at addressing this lacuna in our field.

Tasks: This competition will involve two independent sub-competitions. Authors may choose to compete for one task or the other or both.

1. Table location sub-competition:

This task consists of identifying which lines in the document belong to one same table area or not;

2. Table segmentation sub-competition:

This task consists of identifying which column the cells of each table belong to, i.e. identifying which cells belong to one same column. Each cell should be attributed a start and end column index (which will be different from each other for spanning cells). Identifying row spanning cells is not relevant for this competition.

But what I think will excite markup folks (and possibly topic map advocates) is the description of the data sets:

Description of the datasets: We have gathered 22 PDF financial statements. Our documents have lengths varying between 13 and 235 pages with very diverse page layouts, for example, pages can be organised in one or two columns and page headers and footers are included; each document contains between 3 and 162 tables. In Appendix A, we present some examples of pages in our dataset with tables that we consider hard to locate or segment. We randomly chose 19 documents for training and 3 for validation; our tougher cases turned out to be in the training set.

We then converted all files to ASCII using the pdftotext Linux utility (Red Hat Linux 7.2 (Enigma), October 22, 2001, Linux 2.4.7-10, pdftotext version 0.92, copyright 1996-2000 Derek B. Noonburg). As a result of the conversion, each line of each document became a line of ASCII, which when imported into a database becomes a record in a relational table. Apart from this, we collected an extra 19 PDF financial statements to form the test set; these were converted into ASCII using the same tool as the training set.

Table 1 underneath shows the resulting dimensions of the datasets and how they compare to those used by other authors (Wang et al. (2002)’s tables were automatically generated and Pinto et al. (2003)’s belong to the same government statistics website). The sizes of the datasets in other papers are not distant from ours. An exception would be Cafarella et al. (2008), who created the first large repository of HTML tables, with 154 million tables. These consist of non-marked up HTML tables detected using Wang and Hu (2002)’s algorithm, which is naturally subject to mistakes.

We have then manually created the ground-truth for this data, which involved: a) identifying which lines belong to tables and which do not; b) for each line, identifying how it should be clipped into cells; c) for each cell, identifying which table column it belongs to.
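For a feel of steps (a) and (b), here is a naive baseline sketched in Python. It is not the organisers' method or any competitor's. It simply assumes that two or more consecutive spaces mark a cell boundary in the pdftotext output and that a line with at least three cells is probably part of a table. Both assumptions and the sample lines are mine.

```python
import re

# A cell is a run of text in which words are separated by single spaces;
# two or more spaces are assumed to mark a cell boundary.
CELL = re.compile(r"\S+(?: \S+)*")


def clip_cells(line):
    """Return (start, end, text) for each cell found in one line of ASCII."""
    return [(m.start(), m.end(), m.group()) for m in CELL.finditer(line)]


def looks_like_table_line(line, min_cells=3):
    """Naive location heuristic: enough cells on one line suggests a table row."""
    return len(clip_cells(line)) >= min_cells


# Invented sample lines in the style of a financial statement
table_line = "Total revenue" + " " * 6 + "1,234" + " " * 4 + "1,100"
prose_line = "The company reported higher revenue this year."

cells = clip_cells(table_line)
```

The start and end offsets are kept because step (c), assigning cells to columns, would most likely have to compare offsets across neighbouring lines. Spanning cells and right-aligned numbers are where a baseline this simple would start to fail.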

Whether you choose to compete or not, this should prove to be very interesting.

Sorry, left off the dates from the original post:

Important dates:

  • February 26, 2011 Training set is made available on the Competition Website
  • March 10, 2011 Competition registration, which consists of expressing interest in competing, by email, to the competition organisers
  • May 13, 2011 Validation set is made available on the Competition Website
  • May 15, 2011 Submission of results by competitors, which should be executable files; if at all impossible, the test data will be given out to competitors, but results must be submitted within no more than one hour (negotiable)
  • June 15, 2011 Submission of summary paper for ICDAR’s proceedings, already including the identification of the competition’s winner
  • September, 2011 Test set is made available on the Competition Website
  • September, 2011 Announcement of the results will be made during ICDAR’2011, the competition session

Benchmark: Python Machine Learning – Post

Filed under: Dataset,Machine Learning — Patrick Durusau @ 5:49 am

Benchmark for several Python machine learning packages

From the website:

We compare computation time for a few algorithms implemented in the major machine learning toolkits accessible in Python. We use the Madelon data set (Guyon 2004), 4400 instances and 500 attributes, that can be used in supervised and unsupervised settings and is quite large, but small enough for most algorithms to run.

Useful site for a couple of reasons:

1) A cross-check to make sure I have some of the major Python machine learning packages listed.

2) Another reminder that we don’t have similar test sets of data for topic maps.

The first one I can check and remedy fairly quickly.

The second one is going to take more thought, planning and mostly effort. 😉

Suggestions/comments?

March 3, 2011

Wikipedia Page Traffic Statistics Dataset

Filed under: Dataset,Topic Maps,Uncategorized — Patrick Durusau @ 9:46 am

Wikipedia Page Traffic Statistics Dataset

Data Wrangling reports a 320 GB sample data set of Wikipedia traffic.

Thoughts on similar sample data sets for topic maps?

Sizes, subjects, complexity?

LOS on data.networkedplanet.com – Post

Filed under: Dataset — Patrick Durusau @ 9:22 am

LOS on data.networkedplanet.com opines that http://data.norge.no could be better and outlines some principles as guidance for making it, or similar efforts, better.

Sorry, Networked Planet Blog = Graham Moore and/or Kal Ahmed to most of the topic map regulars.

I am not real sure what LOS stands for…, loan origination solution perhaps? A quick search gives 3.5 million “hits” so I am not going to try to sort it out. Maybe Networked Planet will clear up that mystery in an upcoming post.

I would be more concerned with publication of identifiers, along with when those identifiers should be applied to particular subjects (read properties), than with ensuring that all identifiers be URLs. But if one is playing to the Semantic Web niche market, I suppose that is good advice.

It was just the other day that I mentioned the 100+ million non-URL identifiers that are nearly universally used in chemistry and related fields. I am on the lookout for similar, curated sets of identifiers, so please post. Oh, you know, there is that German publisher that curates chemical structures as search criteria as well. I will go run them down for later this week.

More on the issue of identifier advice to follow.

March 2, 2011

Cussing in Commits – Follow Up Topic Map Project?

Filed under: Dataset,Humor — Patrick Durusau @ 10:24 am

Cussing in Commits: Which Programming Language Inspires the Most Swearing? is a deeply amusing chart based on analysis of one million GitHub commit messages. Tracks the use of profanity in commit messages by programming language for the project.

Oh, the topic map follow up project?

Grab a similar number of commits and create topics and associations. Be imaginative. Create topics for geographic locations of committers. Time of day of commits. Pre or post .0 releases, etc.

Tracking one dimension, such as cussing by language can be amusing. Having the ability to create intersections between dimensions via associations, that could be quite useful. Here is a fun data set to explore.
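Here is a minimal Python sketch of the difference between one dimension and an intersection of dimensions. The commit records, their fields (`language`, `hour`, `message`) and the two-word stand-in profanity list are invented for illustration. The original analysis surely used real GitHub data and a longer word list.

```python
from collections import Counter

# Invented commit records; the field names are hypothetical
commits = [
    {"language": "C++", "hour": 2, "message": "damn pointer arithmetic"},
    {"language": "C++", "hour": 14, "message": "fix build"},
    {"language": "Ruby", "hour": 2, "message": "crap, forgot the migration"},
    {"language": "C++", "hour": 2, "message": "this damn template again"},
    {"language": "Ruby", "hour": 15, "message": "add tests"},
]

PROFANITY = {"damn", "crap"}  # mild stand-in list for illustration


def is_cussing(message):
    """True if any word in the message, minus edge punctuation, is on the list."""
    return any(word.strip(".,!?") in PROFANITY for word in message.lower().split())


def count_by(commits, *dimensions):
    """Count cussing commits for each combination of the given dimensions."""
    counts = Counter()
    for commit in commits:
        if is_cussing(commit["message"]):
            counts[tuple(commit[d] for d in dimensions)] += 1
    return counts


by_language = count_by(commits, "language")                   # one dimension
by_language_and_hour = count_by(commits, "language", "hour")  # an intersection
```

In topic map terms each dimension value would be a topic and each counted combination an association. Adding committer location or release status is just another key in the record.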

February 24, 2011

Challenge to the Opera Topic Map?

Filed under: Dataset,Music Retrieval — Patrick Durusau @ 12:25 pm

Well, not quite. Needs a topic mapping step but…, you are a little closer than before.

Data mining & Hip Hop

Dataminingtools.net reports:

Tahir Hemphill data mined 30 years of hip-hop lyrics to provide a searchable index of the genre’s lexicon.

The project analyzes the lyrics of over 40,000 songs for metaphors, similes, cultural references, phrases, memes and socio-political ideas.[Project] The project is one of its kind with a huge potential offering to the hip hop world, not only can you visualize the artists’ careers but also have deeper analysis into their world where you can potentially patternize their music.

See the post for more material and links.

ICWSM 2011 Data Challenge

Filed under: Conferences,Data Mining,Dataset — Patrick Durusau @ 12:21 pm

ICWSM 2011 Data Challenge

From the website:

The ICWSM 2011 Data Challenge introduces a brand-new dataset, the 2011 ICWSM Spinn3r dataset. This dataset includes blogs from Spinn3r over a 33 day period, from January 13th, 2011 through February 14th, 2011. See here for details on how to obtain the collection.

Since the new collection spans some rather extraordinary world events, this year introduces a specific task: to locate significant posts in the collection which are relevant to the revolutions in Tunisia and Egypt. The criterion for “significant relevance” is that the post is worthy of being shared by you, an observer, with a friend. To participate in the task, we will ask that you submit a ranked list of items in the collection, and we will do some form of relevance judgments and scoring in time for the conference.

The data challenge will culminate at ICWSM 2011 with a special workshop. To participate in the workshop, you must submit a 3-page short paper in PDF format and bring a poster to present at the workshop. The short papers will not be reviewed, but the workshop organizers will select a small panel of speakers based on the submissions. The short paper/poster can describe your participation in the shared task, OR ALTERNATIVELY other compelling work you have performed WITH THE 2011 DATASET.

Submissions will be due on April 22, 2011. Details on the submission process will be posted soon.

Oh, just briefly about the collection:

The dataset consists of over 386 million blog posts, news articles, classifieds, forum posts and social media content between January 13th and February 14th. It spans events such as the Tunisian revolution and the Egyptian protests (see http://en.wikipedia.org/wiki/January_2011 for a more detailed list of events spanning the dataset’s time period).

If you are going to be in Barcelona (the conference location), why not submit an entry using topic maps?
