Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 18, 2011

DataLift

Filed under: Dataset,Linked Data,Ontology,RDF — Patrick Durusau @ 5:12 am

DataLift

The DataLift project will no doubt produce some useful tools and output but reading its self-description:

The project will provide tools to facilitate each step of the publication process:

  1. selecting ontologies for publishing data
  2. converting data to the appropriate format (RDF using the selected ontology)
  3. publishing the linked data
  4. interlinking data with other data sources

I am struck by how futile the effort sounds in the face of petabytes of data flow, where the semantics of that data are changing, along with the semantics of the other data with which it might be interlinked.

The nearest imagery I can come up with is trying to direct the flow of a tsunami with a roll of paper towels.

It is certainly brave (I forgo usage of the other term) to try, but ultimately it isn't very productive.

First, any scheme that starts with conversion to a particular format is an automatic loser.

The source format is itself composed of subjects that are discarded by the conversion process.
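To make that concrete, here is a minimal sketch of step 2 in Python with rdflib, using an invented example.org vocabulary (the vocabulary and URIs are assumptions, not anything DataLift prescribes). Watch what survives: the cell values do, while the header row, the column order, and the file itself stop being addressable subjects.

    import csv
    import io

    from rdflib import Graph, Literal, Namespace, URIRef

    # Assumed vocabulary, standing in for whatever ontology step 1 selects.
    EX = Namespace("http://example.org/vocab/")

    # A stand-in for an existing CSV source.
    csv_data = io.StringIO("id,name,population\n1,Leipzig,593145\n")

    g = Graph()
    for row in csv.DictReader(csv_data):
        s = URIRef(f"http://example.org/city/{row['id']}")
        g.add((s, EX.name, Literal(row["name"])))
        g.add((s, EX.population, Literal(int(row["population"]))))

    # The triples keep the values; the header row, column order, and the
    # file itself no longer exist as subjects anyone can talk about.
    print(g.serialize(format="turtle"))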

Moreover, what if we disagree about the conversion?

Remember all the semantic diversity that gave rise to this problem? Where did it get off to?

Second, the interlinking step introduces brittleness into the process.

Brittle both in terms of the ontology that any particular dataset must follow and in terms of the resolution of any linkage.

Other data sources can only be linked in if they use the correct ontology and format. And that assumes they are reachable.

I hope the project does well, but at best it will result in another semantic flavor to be integrated using topic maps.

*****
PS: The use of "data heaven" betrays the religious nature of the Linked Data movement. I don't object to Linked Data. What I object to is the missionary conversion aspects of Linked Data.

DSPL: Dataset Publishing Language

DSPL: Dataset Publishing Language

From the website:

DSPL is the Dataset Publishing Language, a representation language for the data and metadata of datasets. Datasets described in this format can be processed by Google and visualized in the Google Public Data Explorer.

Features:

  • Use existing data: Just add an XML metadata file to your existing CSV data files
  • Powerful visualizations: Unleash the full capabilities of the Google Public Data Explorer, including the animated bar chart, motion chart, and map visualization
  • Linkable concepts: Link to concepts in other datasets or create your own that others can use
  • Multi-language: Create datasets with metadata in any combination of languages
  • Geo-enabled: Make your data mappable by adding latitude and longitude data to your concept definitions. For even easier mapping, link to Google’s canonical geographic concepts.
  • Fully open: Freely use the DSPL format in your own applications
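To make "just add an XML metadata file" concrete, here is a rough sketch that generates a DSPL-style metadata skeleton for an existing CSV file. The element names and namespace follow my reading of the DSPL documentation and should be verified against the published schema; a real dataset also needs concepts and slices filled in by hand.

    import csv
    import xml.etree.ElementTree as ET

    # Namespace as given in the DSPL docs (verify against the schema).
    DSPL_NS = "http://schemas.google.com/dspl/2010"

    def dspl_skeleton(csv_path):
        # The CSV header row names the columns of the table.
        with open(csv_path, newline="") as f:
            headers = next(csv.reader(f))

        ET.register_namespace("", DSPL_NS)
        root = ET.Element(f"{{{DSPL_NS}}}dspl")
        tables = ET.SubElement(root, f"{{{DSPL_NS}}}tables")
        table = ET.SubElement(tables, f"{{{DSPL_NS}}}table", id="my_table")
        for h in headers:
            # Column types (string/integer/date/...) must be set by hand.
            ET.SubElement(table, f"{{{DSPL_NS}}}column", id=h, type="string")
        data = ET.SubElement(table, f"{{{DSPL_NS}}}data")
        file_el = ET.SubElement(data, f"{{{DSPL_NS}}}file", format="csv")
        file_el.text = csv_path
        return ET.ElementTree(root)

    dspl_skeleton("my_data.csv").write("metadata.xml", encoding="utf-8",
                                       xml_declaration=True)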

For the details, see the tutorial and developer guide on the DSPL site.

A couple quick observations:

DSPL is geared towards data that can be captured in CSV files. Those are considerable and interesting datasets, but only a slice of all data.

On a quick scan of the tutorial and developer guide, DSPL does not appear to provide a way to specify properties for topics.

Nor does it appear to provide a way to specify when (or why) topics could be merged with one another.

Plus marks for enabling navigation by topics, but that is like complimenting a nautical map for having compass directions, isn't it?

I think this could be a very good tool for investigating data, or even for showing clients "but if you had a topic map" sorts of illustrations.

This one is moving up in the stack, virtual as well as actual, of reading materials on my desk.

February 11, 2011

Million Song Dataset

Filed under: Dataset — Patrick Durusau @ 1:01 pm

Million Song Dataset.

Yes, one million song dataset.

A 280 GB dataset. Site suggests you ask someone you know if they already have a copy. Not your average music download.

Amendment: There is no music included in this download. My reference to music download was sarcasm.

From the website:

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

Its purposes are:

  • To encourage research on algorithms that scale to commercial sizes
  • To provide a reference dataset for evaluating research
  • As a shortcut alternative to creating a large dataset with The Echo Nest’s API
  • To help new researchers get started in the MIR field

The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, however, that sample audio can be fetched from services like 7digital, using code we provide.

The Million Song Dataset is a collaborative project between The Echo Nest and LabROSA. It is supported in part by the NSF.
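The per-track files are distributed as HDF5, so a few lines of h5py pull out derived features. A sketch only: the group and field names below are my recollection of the dataset layout and should be checked against the documentation (the project's site also offers helper code for reading these files, which is the safer route).

    import h5py

    # "example_track.h5" stands in for one of the million per-track files.
    with h5py.File("example_track.h5", "r") as h5:
        meta = h5["metadata"]["songs"][0]      # compound row of track metadata
        analysis = h5["analysis"]["songs"][0]  # compound row of audio analysis
        print(meta["title"], meta["artist_name"], analysis["tempo"])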

Two things to notice:

  1. Not a small data set (remember the post about dealing with data?)
  2. National Science Foundation funding on #1.

Note the combination: big data + funding. Nuff said?

February 8, 2011

Topic Mapping BoingBoing Data?

Filed under: Dataset,Examples,Marketing — Patrick Durusau @ 6:15 am

A recent entry on Simon Willison's blog, How we made an API for BoingBoing in an evening, caught my eye.

It was based on the release of eleven years' worth of posts from BoingBoing, which you can download at: Eleven years' worth of Boing Boing posts in one file!

Curious what subjects you would choose first for creating a topic map of this data set?

And having chosen them, how would you manage their identity to make it easier on yourself to incorporate other blog content?
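One way to start answering both questions: survey the dump for candidate subjects and count them before committing to a design. A sketch, assuming the download is an XML export whose post elements carry author children; the real element names should be checked against the file itself.

    import collections
    import xml.etree.ElementTree as ET

    counts = collections.Counter()
    # Stream the file rather than load eleven years of posts at once.
    for _, elem in ET.iterparse("boingboing_posts.xml"):
        if elem.tag == "post":          # assumed element name
            author = elem.findtext("author")
            if author:
                counts[author] += 1
            elem.clear()                # keep memory use flat

    # The most frequent candidates are the first subjects worth identifying.
    for author, n in counts.most_common(10):
        print(n, author)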

I am mindful of Robert Barta's approach of "no data before its time" for incorporation into a topic map.

Would that make a difference in your design of the topic map or only in the infrastructure that supports it?

January 23, 2011

geocommons

Filed under: Dataset,Geographic Information Retrieval,Mapping,Maps — Patrick Durusau @ 9:27 pm

geocommons

A very impressive resource for mapping data against a common geographic background.

Works for a lot of reasons, not the least of which is the amount of effort that has gone into the site and its tools.

But I think having a common frame of reference, that is, geographic locations, simplifies the problem addressed by topic maps.

That is, the data is seen through the common lens of geographic boundaries and/or locations.

To make it closer to the problem faced by topic maps, what if geographic locations had to be brought into focus, before data could be mapped against them?

That seems to me to be the harder problem.

December 17, 2010

Google Books Ngram Viewer

Filed under: Dataset,Software — Patrick Durusau @ 4:33 pm

Google Books Ngram Viewer

From the website:

Scholars interested in topics such as philosophy, religion, politics, art and language have employed qualitative approaches such as literary and critical analysis with great success. As more of the world’s literature becomes available online, it’s increasingly possible to apply quantitative methods to complement that research. So today Will Brockman and I are happy to announce a new visualization tool called the Google Books Ngram Viewer, available on Google Labs. We’re also making the datasets backing the Ngram Viewer, produced by Matthew Gray and intern Yuan K. Shen, freely downloadable so that scholars will be able to create replicable experiments in the style of traditional scientific discovery.

Since 2004, Google has digitized more than 15 million books worldwide. The datasets we’re making available today to further humanities research are based on a subset of that corpus, weighing in at 500 billion words from 5.2 million books in Chinese, English, French, German, Russian, and Spanish. The datasets contain phrases of up to five words with counts of how often they occurred in each year.

Tracing shifts in language usage will help topic map designers create maps for historical materials that require less correction by users.
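Tracing one term, for example, is a small filter job over the downloaded files. The sketch below assumes the tab-separated layout described for these datasets (ngram, year, match count, page count, volume count); verify the columns against the README of the files you download.

    import csv

    def yearly_counts(path, term):
        # Sum match_count per year for one ngram.
        counts = {}
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t"):
                ngram, year, match_count = row[0], int(row[1]), int(row[2])
                if ngram == term:
                    counts[year] = counts.get(year, 0) + match_count
        return counts

    # File name is a placeholder for one of the downloadable 1-gram files.
    print(yearly_counts("googlebooks-eng-1gram.csv", "semantics"))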

One wonders if the extracts can be traced back to particular works.

That would enable a map developed for these extracts to be used with the scanned texts themselves.

December 14, 2010

USA Today API

Filed under: Data Source,Dataset,News — Patrick Durusau @ 5:05 am

USA Today API

The nice folks at www.programmableweb.com reported today that USA Today has opened up its article archive going back to 2004.

From the story USA Today Expands APIs to Include Articles Back to 2004:

The dataset contains all web stories going back to 2004, as well as blog posts, newspaper stories, and even wire feeds.

Questions:

Use the USA Today archive and another news archive of your choice to answer the following questions:

  1. Find a series of news stories about some major event and compare the language used.
  2. Could you find the stories in both archives using the same language? (2-3 pages, pointers to the news sources)
  3. What stories about the event would require different language to find? (2-3 pages, pointers to the news sources)

The point of this exercise is to develop examples of where creative searching is going to find more resources than using the typical search terms.

It will also illustrate the semantic limitations of current search engines.

December 13, 2010

USA Today Best-Selling Books API

Filed under: Books,Data Source,Dataset — Patrick Durusau @ 8:45 am

USA Today Best-Selling Books API

From the website:

USA Today’s Best-Selling Books API provides a method for developers to retrieve USA Today’s weekly compiled list of the nation’s best-selling books, which is published each Thursday. In addition, developers can also retrieve archived lists since the book list’s launch on Thursday, Oct. 28, 1993. The Best-Selling Books API can also be used to retrieve a title’s history on the list and metadata about each title.

Available metadata:

  • Author. Contains one or more names of the authors, illustrators, editors or other creators of the book.
  • BookListAppearances. The number of weeks a book has appeared in the Top 150, regardless of ISBN.
  • BriefDescription. A summary of the book. Contains indicators of the book’s class (fiction or non-fiction) and format (hardcover, paperback, e-book). If a title is available in multiple formats, the format noted is the one selling the most copies that week.
  • CategoryID. Code for book category type.
  • CategoryName. Text of book category type.
  • Class. Specifies whether the book is fiction or non-fiction.
  • FirstBookListAppearance. The date of the list when the particular ISBN first appeared in the top 150.
  • FormatName. Specifies whether the ISBN is assigned to a hardcover, paperback or e-book edition.
  • HighestRank. The highest position on the list achieved by this book, regardless of ISBN.
  • ISBN. The book’s 13- or 10-digit ISBN. The ISBN for a title in a given week is the ISBN of the version (hardcover, paperback or e-book) that sold the most copies that week.
  • MostRecentBooksListAppearance. The date of the list when the particular ISBN last appeared in the top 150.
  • Rank. The book’s rank on the list.
  • RankHistories. The weekly history of the ISBN fetched.
  • RankLastWeek. The book’s rank on the prior week’s list if it appeared. Books not on the previous week’s list are designated with a “0”.
  • Title. The book title. Titles are generally reported as specified by publishers and as they appear on the book’s cover.
  • TitleAPIUrl. URL to retrieve the list history for that ISBN. Note that the ISBN refers to the version of the title that sold the most copies that week if multiple formats were available for sale. Sales from other ISBNs assigned to that title may be included; we do not provide the other ISBNs each week.
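Since the field names above are documented, mapping a returned item onto a local record is mechanical. A sketch, where the endpoint, authentication, and response envelope are assumptions; only the field names come from the documentation quoted above.

    from dataclasses import dataclass

    @dataclass
    class BestSeller:
        title: str
        author: str
        isbn: str
        rank: int
        rank_last_week: int   # "0" means not on the prior week's list
        highest_rank: int

    def from_api_item(item):
        # Field names per the documented metadata above.
        return BestSeller(
            title=item["Title"],
            author=item["Author"],
            isbn=item["ISBN"],
            rank=int(item["Rank"]),
            rank_last_week=int(item["RankLastWeek"]),
            highest_rank=int(item["HighestRank"]),
        )

    print(from_api_item({
        "Title": "Example", "Author": "A. Writer", "ISBN": "9780000000000",
        "Rank": "1", "RankLastWeek": "0", "HighestRank": "1",
    }))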

Questions:

  1. Would you use a topic map to dynamically display this information to library patrons? If so, which parts? (2-3 pages, no citations)
  2. What information would you want to use to supplement this information? How would you map it to this information? (2-3 pages, no citations)
  3. What information would you include for library staff and not patrons? (if any) (2-3 pages, no citations)

10×10 – Words and Photos

Filed under: Data Source,Dataset,Subject Identity — Patrick Durusau @ 7:38 am

10×10

From the website:

10×10 (‘ten by ten’) is an interactive exploration of words and photos based on news stories during a particular hour. The 10×10 site displays 100 photos, each photo representative of a word used in many news stories published during the current hour. The 10×10 site maintains an archive of these photos and words back to 2004. The 10×10 API is organized like directories, with the year, month, day and hour. Retrieve the words list for a particular hour, then get the photos that correspond to those words.
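The directory-style layout suggests retrieval along the lines of the sketch below. The base URL and file naming are assumptions; check the 10×10 API page for the real paths.

    from urllib.request import urlopen

    BASE = "http://tenbyten.org/data"  # assumed host path

    def words_for_hour(year, month, day, hour):
        # One words list per hour, organized year/month/day/hour.
        url = f"{BASE}/{year:04d}/{month:02d}/{day:02d}/{hour:02d}/words.txt"
        with urlopen(url) as resp:
            return resp.read().decode("utf-8").splitlines()

    print(words_for_hour(2010, 12, 13, 7))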

A preset mapping of words to photos, but nothing would prevent an application from offering additional photos.

Not to mention enabling merging based on the recognition of photos.*

Replication of merging could be an issue if based on image recognition.

On the other hand, I am not sure replication of merging would be any less certain than asking users to make merging decisions based on textual content.

Reliable replication of merging is possible only when our mechanical servants are given rules to apply.

****
*Leaving aside replication of merging issues (which may not be an operational requirement), facial recognition, perhaps supplemented by human operator confirmation, could be an interesting component of mass acquisition of images, say at border entry/exit points.

Not that border personnel need be given access to such information, à la SIPRNet (Secret Internet Protocol Router Network) systems, but a topic map could simply signal an order to detain, follow, or get their phone number.

Simply dumping data into systems doesn’t lead to more “connect the dot” moments.

Topic maps may be a way to lead to more such moments, depending upon the skill of their construction and your analysts. (inquiries welcome)

December 12, 2010

Europeana: think culture

Filed under: Data Source,Dataset,Museums — Patrick Durusau @ 8:04 pm

Europeana: think culture

More than 14.6 million items from over 1500 organizations.

Truly an embarrassment of riches for anyone writing a topic map about Europe, its history, literature, influence on other parts of the world, etc.

I have just begun to explore the site and its interfaces. Will report back from time to time.

You can create your own tags but creation of an account requires the following agreement:

I understand that My Europeana gives me the opportunity to create tags for any item I wish. I agree that I will not create any tags that could be considered libelous, harmful, threatening, unlawful, defamatory, infringing, abusive, inflammatory, harassing, pornographic, obscene, fraudulent, invasive of privacy or publicity rights, hateful, or racially, ethnically or otherwise objectionable. By clicking this box I agree to abide by this agreement, and understand that if I don’t my membership of My Europeana will be terminated.

Just so you know.

Questions:

  1. Select ten (10) artifacts to be integrated with local resources, using a topic map. Create a topic map. (The artifacts can be occurrences but associations provide richer opportunities.)
  2. Select one of the projects on the Thought Lab page and review it.
  3. What would you suggest as an improvement to the project you selected in #2? (3-5 pages, citations)

The European Library

Filed under: Data Source,Dataset,Library — Patrick Durusau @ 8:03 pm

The European Library

Free access to 48 European national libraries with materials in 35 languages.

Not to mention 25 million pages of scanned material read by OCR.

Collection access via several interfaces.

Any library with web access should be able to offer its multilingual patrons, European ones anyway, primary materials in the languages of their preference.

Topic map mavens will no doubt want to push further than hyperlinks to language-specific collections.

Questions:

Assume your library director has grown tired of “topic maps would…” type suggestions and asks you for a proposal to use topic maps to integrate part of the European Library materials into the local collection.

  1. How would you choose the parts of each collection to be part of the topic map? (2-3 pages, no citations)
  2. What other members of the library staff would you involve in planning the proposal/prototype? (2-3 pages, with attention to the skill sets needed)
  3. Outline your prototype topic map and create a small but workable topic map to demonstrate the mapping you propose to use. (3-5 pages, no citations. Topic map only, without a custom interface. A custom interface is a very necessary step for a successful topic map deployment, but beyond the time we have here.)

Copac

Filed under: Data Source,Dataset,Library — Patrick Durusau @ 7:59 pm

Copac

From the website:

Copac is a freely available library catalogue, giving access to the merged online catalogues of many major UK and Irish academic and National libraries, as well as increasing numbers of specialist libraries.

Copac has c.36 million records, representing the merged holdings of:

  • members of the Research Libraries UK (RLUK). This includes the catalogues of the British Library, the National Library of Scotland, and the National Library of Wales / Llyfrgell Genedlaethol Cymru.
  • increasing numbers of specialist libraries with collections of national research interest, as well as records for specialist collections held in UK academic libraries.

Copac offers four interfaces:

  • Web interface (+ plugins, including one for Facebook)
  • Z39.50
  • OpenURL
  • SRU
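Of those, SRU is the easiest to show: it is plain HTTP carrying a CQL query. A sketch, where the base URL is an assumption (take it from Copac's documentation) and the dc.title index name may vary by server; the operation, version, query, and maximumRecords parameters are standard SRU 1.1.

    from urllib.parse import urlencode
    from urllib.request import urlopen

    SRU_BASE = "http://copac.ac.uk/sru"  # assumed endpoint

    params = urlencode({
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": 'dc.title = "topic maps"',   # CQL; index names vary
        "maximumRecords": "5",
    })

    with urlopen(f"{SRU_BASE}?{params}") as resp:
        print(resp.read()[:500])  # raw XML; parse records as needed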

Questions:

  1. OK, so a topic map can merge your local library records with those in Copac. Why? What is your use case for that merging? (3-5 pages, no citations)
  2. What other data would you argue should be linked to Copac records using topic maps? (3-5 pages, no citations)
  3. What APIs at http://www.programmableweb.com would you use with Copac? Why? (3-5 pages, no citations)

ProgrammableWeb

Filed under: Data Source,Dataset,Topic Maps — Patrick Durusau @ 6:02 pm

ProgrammableWeb

As of 12 December 2010, 2479 APIs and 5429 Mashups.

Care to guess what a search on topic maps returns?

If your guess was 1 you would be correct.

Surely there is at least one API or Mashup out of the 7,908 listed that is a candidate for topic map #2?

I am writing this as much as a note-home-from-the-teacher to myself as to anyone else.

Topic maps are a fundamentally different approach to semantic integration.

Not the usual rewrite/convert-to-the-new-shiny-orthodox-format approach, rushed out before it gets supplanted (or is found wanting).

Topic maps offer a number of interesting capabilities:

  • no need for a universal and unique identifier
  • “double-ended” nature of associations that binds players together (you don’t have to remember to write the relationship both ways)
  • complete logical lock-step universe not required prior to authoring (or afterwards)
  • supports multiple overlapping views of the same data
  • …and other advantages.
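The first capability above is easy to show in miniature: topics merge whenever any of their subject identifiers overlap, so no single universal ID is ever required. A sketch only, not an implementation of the full Topic Maps data model.

    class Topic:
        def __init__(self, name, identifiers):
            self.names = {name}
            self.identifiers = set(identifiers)

        def merge(self, other):
            # Merging is set union: names and identifiers pool.
            self.names |= other.names
            self.identifiers |= other.identifiers

    a = Topic("Mozart, Wolfgang Amadeus", {"http://example.org/ids/mozart"})
    b = Topic("W. A. Mozart",
              {"http://example.org/ids/mozart",
               "http://dbpedia.org/resource/Wolfgang_Amadeus_Mozart"})

    # Any shared identifier triggers a merge; neither topic needed to be
    # born with a universal ID.
    if a.identifiers & b.identifiers:
        a.merge(b)

    print(sorted(a.names), sorted(a.identifiers))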

But, my saying it to readers of this blog is preaching to the choir!

Surely some of us know relatives, former employers, parole officers who are not already sold on topic maps.

Please forward this post or tweet it to them.

Questions:

Search for APIs or mashups of interest to you.

  1. Which APIs or mashups interest you as sources for topic map material? Why? (2-3 pages, no citations)
  2. Are there materials outside these you would want to point to or include in your topic map? (2-3 pages, citations/pointers)
  3. How would you test your topic map? Not for syntactic correctness, but for inclusion of resources, terminology, etc. (3-5 pages, citations)

Outside.in Hyperlocal News API

Filed under: Data Source,Dataset — Patrick Durusau @ 6:00 pm

Outside.in Hyperlocal News API

From the website:

The Outside.in API lets you easily integrate hyperlocal news in your sites and applications by providing recent news stories and blog posts for any neighborhood, ZIP code, city, or state in the United States.

A news aggregation site that offers free developer accounts (with daily limits on access).

Follows > 54,000 RSS feeds.

Questions:

  1. What subjects would a topic map for the postal code where you live include? What information would you use from this service? (2-3 pages)
  2. What subjects would a topic map for the region where you live include? What information would you use from this service? (2-3 pages)
  3. What subjects would a topic map for the country where you live include? What information would you use from this service? (2-3 pages)

If it sounds like you weren’t given enough room for all the subjects you would want to include, consider that no topic map, dictionary, encyclopedia, etc., is ever complete.

Editorial choices always have to be made. This is an exercise to give you an opportunity to make those choices and then discuss them with your classmates. (Instead of your director or perhaps a board of supervisors.)

Daylife Developer

Filed under: Data Source,Dataset,Software — Patrick Durusau @ 5:54 pm

Daylife Developer

News aggregation and analysis service.

Offers free developer access to their API, capped at 5,000 calls per day.

From the website:

Have an idea for the next big news application? Build a great app using the Daylife API, then we’ll market it to our clients and give you 70% of the proceeds from any sales. Learn more.

I started to not mention this site so I could keep the 70% to myself but there is more than one great news app using topic maps. 😉

Oh, but that means creating an app.

An app that uses topic maps to deliver substantively different and useful aggregation of news.

Both of those are critical requirements.

The app must be substantively different, delivering a unique value-add from the use of topic maps. Something the user can't get somewhere else.

The app must be useful, delivering a value-add that some community finds valuable. A community willing to pay for that usefulness.

See you at Daylife Developer?

******
PS: Send pointers to similar resources to: patrick@durusau.net.

The more resources become available, including aggregation services, the greater the opportunity for topic maps!

November 23, 2010

Querying the British National Bibliography

Filed under: British National Bibliography,Dataset,RDF,Semantic Web,SPARQL — Patrick Durusau @ 9:40 am

Querying the British National Bibliography

From the webpage:

Following up on the earlier announcement that the British Library has made the British National Bibliography available under a public domain dedication, the JISC Open Bibliography project has worked to make this data more useable.

The data has been loaded into a Virtuoso store that is queryable through the SPARQL Endpoint, and the URIs that we have assigned each record use the ORDF software to make them dereferenceable, supporting content auto-negotiation as well as embedding RDFa in the HTML representation.

The data contains some 3 million individual records and some 173 million triples. …

The data is also available for local processing but it isn’t much of a “web” if the first step is to always download a local copy of the data.
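Querying the endpoint remotely avoids the download. A sketch with SPARQLWrapper, where the endpoint URL is a placeholder (take the real one from the project's announcement) and dct:title is a guess at the vocabulary in use; adjust both to match the actual data.

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://bnb.example.org/sparql")  # assumed endpoint
    sparql.setQuery("""
        PREFIX dct: <http://purl.org/dc/terms/>
        SELECT ?book ?title WHERE {
            ?book dct:title ?title .
            FILTER(CONTAINS(LCASE(STR(?title)), "topic maps"))
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["book"]["value"], row["title"]["value"])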

It should be interesting to watch for projects that combine the results of queries against this data with the results of other queries against other data sets. Particularly if those other data sets follow different metadata regimes.

Isn’t that the indexing problem all over again?

Questions:

  1. What data set would you want to combine with British National Bibliography (BNB)?
  2. What issues do you see arising from combining the BNB with your data set? (3-5 pages, no citations)
  3. Combining the BNB with another data set. (project)

November 9, 2010

XML Data Repository

Filed under: Authoring Topic Maps,Dataset — Patrick Durusau @ 4:00 pm

XML Data Repository.

Data in XML format for testing augmented authoring or search tools.

October 20, 2010

Variations/FRBR: Variations as a Testbed for the FRBR Conceptual Model

Filed under: Dataset,FRBR,Search Interface,Searching — Patrick Durusau @ 3:18 am

FRBRized data in XML for free download!

Approximately 80,000 bibliographic records for musical recordings and 105,000 or so for scores.

Be sure to take a look at the search interface and submit suggestions.

From the post:

The Variations/FRBR [1] project at Indiana University has released bulk downloads of metadata for the sound recordings presented in our Scherzo [2] music discovery system in a FRBRized XML format. The downloadable data includes FRBR Work, Expression, Manifestation, Person, and Corporate Body records, along with the structural and responsibility relationships connecting them. While this is still an incomplete representation of FRBR and FRAD, we hope that the release of this data will aid others that are studying or working with FRBR. This XML data conforms to the “efrbr” set of XML Schemas [3] created for this project.

The XML data may be downloaded from http://vfrbr.info/data/1.0/index.shtml, and comments/questions may be directed to vfrbr@dlib.indiana.edu.

One caveat to those who seek to use this data: we plan to continue improving our FRBRization algorithm into the future and have not yet implemented a way to keep entity identifiers consistent between new data loads. Therefore we cannot at this time guarantee the Work with the identifier http://vfrbr.info/work/30001, for example, will have the same identifier in the future. Therefore this data at this time should be considered highly experimental.

Many thanks to the Institute of Museum and Library Services for funding the V/FRBR project.

Also, if you’re interested in FRBR, please do check out our experimental discovery system, Scherzo [2]. We’re very interested in your feedback!

Jenn

[1] V/FRBR project home page (http://vfrbr.info); FRBR report
(http://www.ifla.org/en/publications/functional-requirements-for-bibliographic-records)

[2] Scherzo (http://vfrbr.info/search)

[3] V/FRBR project XML Schemas (http://vfrbr.info/schemas/1.0/index.shtml)

Information shamelessly stolen from Last Week in FRBR #33.

October 5, 2010

Re-Using Linked Data

Filed under: Authoring Topic Maps,Dataset,Linked Data,Topic Maps — Patrick Durusau @ 9:24 am

The German national library released its authority records as linked data.

News and reference services have content management systems that don’t use URIs, so how do they link up public linked data with their private data?

In a way that they can share the resulting linked data within their organization?

Exploration question: What mapping facilities exist in popular CMS systems for mapping linked data to local data?

I don’t know the answer to that but will be finding out.

In the meantime, if you know your CMS system cannot do such a mapping, consider using topic maps. (topicmaps.org)

Topic maps can create linked data that is not subject to the limitation of using URIs.

Grist For Topic Map Mills: German National Library – Authority Files

Filed under: Dataset,Linked Data,RDF,Semantic Web — Patrick Durusau @ 6:09 am

German National Library – Authority Files (linked data)

A post from Lars Svensson announced the release of authority files from the German National Library:

The German National Library (DNB) has published the German library authority files as linked data. The dataset consists of 1.8 Mill differentiated persons from the PND (Personennamendatei, Name authority file), 187.000 subject headings from the SWD (Schlagwortnormdatei, Subject headings authority file), 1.3 Mill corporate bodies from the GKD (Gemeinsame Körperschaftsdatei, Corporate Body Authority file), and 51,000 classes from the German translation of the Dewey Decimal Classification (DDC).

Library students should take particular note of the subject heading and Dewey Decimal Classification materials.

For topic mappers, this is another set of identifiers that can be mapped between the datasets in the linked data cloud, as well as to those that don't use URIs as identifiers (the vast majority of data).

This will also be of interest to the linked data community.

September 22, 2010

Consultative Committee for Space Data Systems (CCSDS)

Filed under: Dataset,Space Data,Subject Identity — Patrick Durusau @ 8:15 pm

Consultative Committee for Space Data Systems (CCSDS) is a collaborative effort to create standards for space data.

Interesting because:

  1. Space exploration gets funding from governments
  2. Subjects for mapping in a variety of formats, etc.

Assuming that agreement can be reached on the format for a mission, the question remains how do we integrate that data with articles, books, presentations, data from other missions or sources, and/or analysis of other data?

That agreement can be reached on a format for one mission, or even one set of data, is just a starting point for a more complicated conversation.

September 19, 2010

Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data

Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data

Destined to be a deeply influential resource.

Read the paper, use the Chem2Bio2RDF application for a week, then answer these questions:

  1. Choose three (3) subjects that are identified in this framework.
  2. For each subject, how is it identified in this framework?
  3. For each subject, have you seen it in another framework or system?
  4. For each subject seen in another framework/system, how was it identified there?

Extra credit: What one thing would you change about any of the identifications in this system? Why?

September 16, 2010

UCI Machine Learning Datasets

Filed under: Authoring Topic Maps,Dataset,Interface Research/Design — Patrick Durusau @ 4:12 am

UCI Machine Learning Datasets. A collection of 194 datasets (as of 2010/09/14) for machine learning.

Re-purpose these to develop and test interfaces that assist in authoring topic maps.
