Archive for the ‘RDF’ Category

A Trillion Triples in Perspective

Saturday, May 18th, 2013

Mozart Meets MapReduce by Isaac Lopez.

From the post:

Big data has been around since the beginning of time, says Thomas Paulmichl, founder and CEO of Sigmaspecto, who says that what has changed is how we process the information. In a talk during Big Data Week, Paulmichl encouraged people to open up their perspective on what big data is, and how it can be applied.

During the talk, he admonished people to take a human element into big data. Paulmichl demonstrated this by examining the work of musical prodigy, Mozart – who Paulmichl noted is appreciated greatly by both music scientists, as well as the common music listener.

“When Mozart makes choices on writing a piece of work, the number of choices that he has and the kind of neural algorithms that his brain goes through to choose things is infinitesimally higher than what we call big data – it’s really small data in comparison,” he said.

Taking Mozart’s The Magic Flute as an example, Paulmichl discussed the framework that Mozart used to make his choices by examining a music sheet outlining the number of bars, the time signature, the instrument and singer voicing.

“So from his perspective, he sits down, and starts to make what we as data scientists call quantitative choices,” explained Paulmichl. “Do I put a note here, down here, do I use a different instrument; do I use a parallel voicing for different violins – so these are all metrics that his brain has to decide.”

Exploring the mathematics of the music, Paulmichl concluded that in looking at The Magic Flute, Mozart had 4.72391E+21 creative variations (and then some) that he could have taken with the direction of it over the course of the piece. “We’re not talking about a trillion dataset; we’re talking about a sextillion or more,” he says adding that this is a very limited cut of the quantitative choice that his brain makes at every composition point.

“[A] sextillion or more…” puts the question of processing a trillion triples into perspective.
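To put a number on that perspective, here is a small sketch of the arithmetic. The 4.72391E+21 figure is Paulmichl's; reading "a trillion" as the short-scale 10^12 is my assumption, and the function name is only for illustration.

```python
# Compare Paulmichl's count of creative variations with a trillion triples.
# Assumption: "trillion" means the short-scale 10**12.
TRILLION = 10**12

# 4.72391E+21 written as an exact integer (472391 * 10**16).
variations = 472_391 * 10**16


def times_larger(big: int, small: int) -> int:
    """Return how many whole copies of `small` fit into `big`."""
    return big // small


ratio = times_larger(variations, TRILLION)
print(f"{ratio:,} trillion-triple datasets")  # prints "4,723,910,000 trillion-triple datasets"
```

In other words, under that reading the variation count is roughly 4.7 billion times the size of a trillion-triple store.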

Another musical analogy?

Triples are the one finger version of Jingle Bells*:

*The gap is greater than the video represents but it is still amusing.

Does your analysis/data have one finger subtlety?

A self-updating road map of The Cancer Genome Atlas

Friday, May 17th, 2013

A self-updating road map of The Cancer Genome Atlas by David E. Robbins, Alexander Grüneberg, Helena F. Deus, Murat M. Tanik and Jonas S. Almeida. (Bioinformatics (2013) 29 (10): 1333-1340. doi: 10.1093/bioinformatics/btt141)

Abstract:

Motivation: Since 2011, The Cancer Genome Atlas’ (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months.

Results: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0). Specifically, this engine uses JavaScript in conjunction with the World Wide Web Consortium’s (W3C) Resource Description Framework (RDF), and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory. The resulting index may be queried using SPARQL, and enables file-level provenance annotations as well as discovery of arbitrary subsets of files, based on their metadata, using web standard languages. In turn, these abilities enhance the reproducibility and distribution of novel results delivered as elements of a web-based computational ecosystem. The development of the TCGA Roadmap engine was found to provide specific clues about how biomedical big data initiatives should be exposed as public resources for exploratory analysis, data mining and reproducible research. These specific design elements align with the concept of knowledge reengineering and represent a sharp departure from top-down approaches in grid initiatives such as CaBIG. They also present a much more interoperable and reproducible alternative to the still pervasive use of data portals.

Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at http://bit.ly/TCGARoadmap. A video tutorial is available at http://bit.ly/TCGARoadmapTutorial.
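The "fluidity" in the Motivation paragraph is easier to appreciate with the doubling claim worked out. A minimal sketch, assuming the file count keeps doubling every 7 months exactly as stated (the abstract says "approximately", so treat the results as rough):

```python
# Growth implied by "doubles approximately every 7 months".
# Assumption: clean exponential growth with a fixed 7-month doubling period.

def growth_factor(months: float, doubling_period: float = 7.0) -> float:
    """Multiplicative growth in file count after `months`."""
    return 2.0 ** (months / doubling_period)


for months in (7, 12, 14, 28):
    print(f"after {months:2d} months: x{growth_factor(months):.2f}")
```

A road map built once and left alone would miss more than two thirds of the files within a year, which is the case for making it self-updating.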

Curious how the granularity of required semantics and the uniformity of the underlying data set impact the choice of semantic approaches?

And does access to data files present different challenges than say access to research publications in the same field?

Non-Adoption of Semantic Web, Reason #1002

Monday, May 13th, 2013

Kingsley Idehen offers yet another explanation/excuse for non-adoption of the semantic web in On Hybrid Relational Databases. Interview with Kingsley Uyi Idehen by Roberto V. Zicari.

The highlight of this interview reads:

The only obstacle to Semantic Web technologies in the enterprise lies in better articulation of the value proposition in a manner that reflects the concerns of enterprises. For instance, the non disruptive nature of Semantic Web technologies with regards to all enterprise data integration and virtualization initiatives has to be the focal point

You may recall Kingsley’s demonstration of the non-complexity of authoring for the Semantic Web in The Semantic Web Is Failing — But Why? (Part 3).

Could it be users sense the “lock-in” of RDF/Semantic Web?

Q14. Big Data Analysis: could you connect Virtuoso with Hadoop? How does Virtuoso relate to commercial data analytics platforms, e.g. Hadapt, Vertica?

Kingsley Uyi Idehen: You can integrate data managed by Hadoop based ETL workflows via ODBC or Web Services driven by Hadoop clusters that expose RESTful interaction patterns for data access. As for how Virtuoso relates to the likes of Vertica re., analytics, this is about Virtuoso being the equivalent of Vertica plus the added capability of RDF based data management, Linked Data Deployment, and share-nothing clustering. There is no job that Vertica performs that Virtuoso can’t perform.

There are several jobs that Virtuoso can perform that Vertica, VoltDB, Hadapt, and many other NoSQL and NewSQL simply cannot perform with regards to scalable, high-performance RDF data management and Linked Data deployment. Remember, RDF based Linked Data is all about data management and data access without any kind of platform lock-in. Virtuoso locks you into a value proposition (performance and scale) not the platform itself. (emphasis added to last sentence)

It’s comforting to know RDF/Semantic Web “lock-in” has our best interest at heart.

See Kingsley dodging the next question on Virtuoso’s ability to scale:

Q15. Do you also benchmark loading trillions of RDF triples? Do you have current benchmark results? How much time does it take to query them?

Kingsley Uyi Idehen: As per my earlier responses, there is no shortage of benchmark material for Virtuoso.

The benchmarks are also based on realistic platform configurations unlike the RDBMS patterns of the past which compromised the utility of TPC benchmarks.

Full Disclosure: I haven’t actually counted all of Kingsley’s reasons for non-adoption of the Semantic Web. The number I assign here may be high or low.

The ChEMBL database as linked open data

Thursday, May 9th, 2013

The ChEMBL database as linked open data by Egon L Willighagen, Andra Waagmeester, Ola Spjuth, Peter Ansell, Antony J Williams, Valery Tkachenko, Janna Hastings, Bin Chen and David J Wild. (Journal of Cheminformatics 2013, 5:23 doi:10.1186/1758-2946-5-23).

Abstract:

Background Making data available as Linked Data using Resource Description Framework (RDF) promotes integration with other web resources. RDF documents can natively link to related data, and others can link back using Uniform Resource Identifiers (URIs). RDF makes the data machine-readable and uses extensible vocabularies for additional information, making it easier to scale up inference and data analysis.

Results This paper describes recent developments in an ongoing project converting data from the ChEMBL database into RDF triples. Relative to earlier versions, this updated version of ChEMBL-RDF uses recently introduced ontologies, including CHEMINF and CiTO; exposes more information from the database; and is now available as dereferencable, linked data. To demonstrate these new features, we present novel use cases showing further integration with other web resources, including Bio2RDF, Chem2Bio2RDF, and ChemSpider, and showing the use of standard ontologies for querying.

Conclusions We have illustrated the advantages of using open standards and ontologies to link the ChEMBL database to other databases. Using those links and the knowledge encoded in standards and ontologies, the ChEMBL-RDF resource creates a foundation for integrated semantic web cheminformatics applications, such as the presented decision support.

You already know about the fragility of ontologies so no need to repeat that rant here.

Having material encoded with an ontology, on the other hand, after vetting, can be a source that you wrap with a topic map.

So all that effort isn’t lost.

5 heuristics for writing better SPARQL queries

Wednesday, April 3rd, 2013

5 heuristics for writing better SPARQL queries by Paul Groth.

From the post:

In the context of the Open PHACTS and the Linked Data Benchmark Council projects, Antonis Loizou and I have been looking at how to write better SPARQL queries. In the Open PHACTS project, we’ve been writing super complicated queries to integrate multiple data sources and from experience we realized that different combinations and factors can dramatically impact performance. With this experience, we decided to do something more systematic and test how different techniques we came up with mapped to database theory and worked in practice. We just submitted a paper for review on the outcome. You can find a preprint (On the Formulation of Performant SPARQL Queries) on arxiv.org at http://arxiv.org/abs/1304.0567. The abstract is below. The fancy graphs are in the paper.

Paper Abstract:

The combination of the flexibility of RDF and the expressiveness of SPARQL provides a powerful mechanism to model, integrate and query data. However, these properties also mean that it is nontrivial to write performant SPARQL queries. Indeed, it is quite easy to create queries that tax even the most optimised triple stores. Currently, application developers have little concrete guidance on how to write “good” queries. The goal of this paper is to begin to bridge this gap. It describes 5 heuristics that can be applied to create optimised queries. The heuristics are informed by formal results in the literature on the semantics and complexity of evaluating SPARQL queries, which ensures that queries following these rules can be optimised effectively by an underlying RDF store. Moreover, we empirically verify the efficacy of the heuristics using a set of openly available datasets and corresponding SPARQL queries developed by a large pharmacology data integration project. The experimental results show improvements in performance across 6 state-of-the-art RDF stores.

Just in case you have to query RDF stores as part of your topic map work.

Be aware that: The effectiveness of your SPARQL query will vary based on the RDF Store.

Or as the authors say:

SPARQL, due to its expressiveness, provides a plethora of different ways to express the same constraints, thus, developers need to be aware of the performance implications of the combination of query formulation and RDF Store. This work provides empirical evidence that can help developers in designing queries for their selected RDF Store. However, this raises questions about the effectiveness of writing complex generic queries that work across open SPARQL endpoints available in the Linked Open Data Cloud. We view the optimisation of queries independent of underlying RDF Store technology as a critical area of research to enable the most effective use of these endpoints. (page 21)

I hope their research is successful.

Varying performance, especially as reported in their paper, doesn’t bode well for cross-RDF Store queries.
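A toy illustration of why formulation matters. This is not one of the paper's five heuristics, nor how any particular RDF store evaluates queries. It is a naive nested-loop matcher over made-up data, in which the same two triple patterns return the same answer but generate very different amounts of intermediate work depending on their order. All names and data are hypothetical.

```python
# Naive left-to-right evaluation of triple patterns over an in-memory list.
# Assumption: terms starting with "?" are variables, everything else is a constant.
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def match(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # Bind the variable, or check it agrees with an earlier binding.
            if extended.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return extended


def evaluate(patterns: List[Triple], data: List[Triple]) -> Tuple[List[Binding], int]:
    """Return (solutions, total number of bindings produced across all steps)."""
    bindings: List[Binding] = [{}]
    work = 0
    for pattern in patterns:
        next_bindings = []
        for b in bindings:
            for triple in data:
                extended = match(pattern, triple, b)
                if extended is not None:
                    next_bindings.append(extended)
        work += len(next_bindings)
        bindings = next_bindings
    return bindings, work


# Hypothetical data: 1000 compounds, only one of which targets "kinaseX".
data: List[Triple] = [(f"c{i}", "type", "Compound") for i in range(1000)]
data.append(("c42", "targets", "kinaseX"))

broad_first = [("?c", "type", "Compound"), ("?c", "targets", "kinaseX")]
selective_first = list(reversed(broad_first))

for name, query in (("broad first", broad_first), ("selective first", selective_first)):
    solutions, work = evaluate(query, data)
    print(f"{name}: {len(solutions)} solution(s), {work} bindings produced")
```

A real store has indexes and a query planner, so it may well reorder these patterns for you. The point of the quoted passage is that you cannot count on every store doing so equally well.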

Dydra

Tuesday, March 26th, 2013

Dydra

From the webpage:

Dydra

Dydra is a cloud-based graph database. Whether you’re using existing social network APIs or want to build your own, Dydra treats your customers’ social graph as exactly that.

With Dydra, your data is natively stored as a property graph, directly representing the relationships in the underlying data.

Expressive

With Dydra, you access and update your data via an industry-standard query language specifically designed for graph processing, SPARQL. It’s easy to use and we provide a handy in-browser query editor to help you learn.

Despite my misgivings about RDF (Simple Web Semantics), if you want to investigate RDF and SPARQL, Dydra would be a good way to get your feet wet.

You can get an idea of the skill level required by RDF/SPARQL.

Currently in beta, free with some resource limitations.

I particularly liked the line:

We manage every piece of the data store, including versioning, disaster recovery, performance, and more. You just use it.

RDF/SPARQL skills will remain a barrier but Dydra does its best to make those the only barriers you will face. (And has reduced some of those.)

Definitely worth your attention, whether you simply want to practice on RDF/SPARQL as a data source or have other uses for it.

I first saw this in a tweet by Stian Danenbarger.

Congratulations! You’re Running on OpenCalais 4.7!

Monday, March 25th, 2013

Congratulations! You’re Running on OpenCalais 4.7!

From the post:

This morning we upgraded OpenCalais to release 4.7. Our focus with 4.7 was on a significant improvement in the detection and disambiguation of companies as well as some behind-the-scenes tune-ups and bug fixes.

If your content contains company names you should already be seeing a significant improvement in detection and disambiguation. While company detection has always been very good in OpenCalais, now it’s great.

If you’re one of our high-volume commercial clients (1M+ transactions per day), we’ll be rolling out your upgrade toward the end of the month.

And, remember, you can always drop by the OpenCalais viewer for a quick test or exploration of OpenCalais with zero programming involved.

If you don’t already know OpenCalais:

From a user perspective it’s pretty simple: You hand the Web Service unstructured text (like news articles, blog postings, your term paper, etc.) and it returns semantic metadata in RDF format. What’s happening in the background is a little more complicated.

Using natural language processing and machine learning techniques, the Calais Web Service examines your text and locates the entities (people, places, products, etc.), facts (John Doe works for Acme Corporation) and events (Jane Doe was appointed as a Board member of Acme Corporation). Calais then processes the entities, facts and events extracted from the text and returns them to the caller in RDF format.

Please also check out the Calais blog and forums to see where Calais is headed. Significant development activities include the ability for downstream content consumers to retrieve previously generated metadata using a Calais-provided GUID, additional input languages, and user-defined processing extensions.

Did I mention it is a free service up to 50,000 submissions a day? (see the license terms for details)

OpenCalais won’t capture every entity or relationship known to you but it will do a lot of the rote work for you. You can then fill in the specialized parts.

Tensors and Their Applications…

Saturday, March 23rd, 2013

Tensors and Their Applications in Graph-Structured Domains by Maximilian Nickel and Volker Tresp. (Slides.)

Along with the slides, you will like the abstract and bibliography found at: Machine Learning on Linked Data: Tensors and their Applications in Graph-Structured Domains.

Abstract:

Machine learning has become increasingly important in the context of Linked Data as it is an enabling technology for many important tasks such as link prediction, information retrieval or group detection. The fundamental data structure of Linked Data is a graph. Graphs are also ubiquitous in many other fields of application, such as social networks, bioinformatics or the World Wide Web. Recently, tensor factorizations have emerged as a highly promising approach to machine learning on graph-structured data, showing both scalability and excellent results on benchmark data sets, while matching perfectly to the triple structure of RDF. This tutorial will provide an introduction to tensor factorizations and their applications for machine learning on graphs. By the means of concrete tasks such as link prediction we will discuss several factorization methods in-depth and also provide necessary theoretical background on tensors in general. Emphasis is put on tensor models that are of interest to Linked Data, which will include models that are able to factorize large-scale graphs with millions of entities and known facts or models that can handle the open-world assumption of Linked Data. Furthermore, we will discuss tensor models for temporal and sequential graph data, e.g. to analyze social networks over time.
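The remark about "matching perfectly to the triple structure of RDF" can be made concrete: each predicate becomes one slice of a three-way array, and a triple (s, p, o) sets a single cell to 1. A minimal sketch with made-up triples, using plain nested lists rather than a numerical library and stopping short of any factorization. Treating every object as an entity is a simplification of mine.

```python
# Build a relation x subject x object adjacency tensor from triples.
# Assumption: every subject and object is treated as an entity (no literals).
from typing import List, Tuple

Triple = Tuple[str, str, str]


def triples_to_tensor(triples: List[Triple]):
    """Return (tensor, entities, relations); tensor[k][i][j] == 1 iff
    (entities[i], relations[k], entities[j]) is among the triples."""
    entities = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    relations = sorted({p for _, p, _ in triples})
    e_index = {e: i for i, e in enumerate(entities)}
    r_index = {r: k for k, r in enumerate(relations)}
    n = len(entities)
    tensor = [[[0] * n for _ in range(n)] for _ in relations]
    for s, p, o in triples:
        tensor[r_index[p]][e_index[s]][e_index[o]] = 1
    return tensor, entities, relations


# Hypothetical example data.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksFor", "acme"),
]
tensor, entities, relations = triples_to_tensor(triples)

for relation, frontal_slice in zip(relations, tensor):
    print(relation)
    for row in frontal_slice:
        print("  ", row)
```

Link prediction, in this picture, amounts to guessing which of the zero cells should really be ones.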

Devising a system to deal with the heterogeneous nature of linked data.

Just skimming the slides I could see, this looks very promising.

I first saw this in a tweet by Stefano Bertolo.


Update: I just got an email from Maximilian Nickel and he has altered the transition between slides. Working now!

From slide 53 forward is pure gold for topic map purposes.

Heavy sledding but let me give you one statement from the slides that should capture your interest:

Instance matching: Ranking of entities by their similarity in the entity-latent-component space.

Although written about linked data, not limited to linked data.
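As a rough picture of what that statement means: assume a factorization has already produced a latent vector for each entity, and instance matching becomes a matter of ranking the other entities by how close they sit to the one you care about. The vectors below are invented for illustration, not taken from the slides, and cosine similarity is my choice of measure. The slides may use something else.

```python
# Rank entities by similarity in a (hypothetical) latent-component space.
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_matches(query: str, latent: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Return the other entities sorted from most to least similar to `query`."""
    q = latent[query]
    scored = [(name, cosine(q, vec)) for name, vec in latent.items() if name != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up latent vectors; a real model would learn these from the data.
latent = {
    "W. A. Mozart": [0.9, 0.1, 0.0],
    "Wolfgang Amadeus Mozart": [0.8, 0.2, 0.1],
    "Leopold Mozart": [0.5, 0.5, 0.1],
    "Acme Corporation": [0.0, 0.1, 0.9],
}

for name, score in rank_matches("W. A. Mozart", latent):
    print(f"{score:.3f}  {name}")
```

Where you draw the line between "same subject" and "merely similar" is left to you, which is the configurable part.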

What is more, Maximilian offers proof that the technique scales!

Complex, configurable, scalable determination of subject identity!

[Update: deleted note about issues with slides, which read: (Slides for ISWC 2012 tutorial, Chrome is your best bet. Even better bet, Chrome on Windows. Chrome on Ubuntu crashed every time I tried to go to slide #15. Windows gets to slide #46 before failing to respond. I have written to inquire about the slides.)]

A Distributed Graph Engine…

Friday, March 22nd, 2013

A Distributed Graph Engine for Web Scale RDF Data by Kai Zeng, Jiacheng Yang, Haixun Wang, Bin Shao and Zhongyuan Wang.

Abstract:

Much work has been devoted to supporting RDF data. But state-of-the-art systems and methods still cannot handle web scale RDF data effectively. Furthermore, many useful and general purpose graph-based operations (e.g., random walk, reachability, community discovery) on RDF data are not supported, as most existing systems store and index data in particular ways (e.g., as relational tables or as a bitmap matrix) to maximize one particular operation on RDF data: SPARQL query processing. In this paper, we introduce Trinity.RDF, a distributed, memory-based graph engine for web scale RDF data. Instead of managing the RDF data in triple stores or as bitmap matrices, we store RDF data in its native graph form. It achieves much better (sometimes orders of magnitude better) performance for SPARQL queries than the state-of-the-art approaches. Furthermore, since the data is stored in its native graph form, the system can support other operations (e.g., random walks, reachability) on RDF graphs as well. We conduct comprehensive experimental studies on real life, web scale RDF data to demonstrate the effectiveness of our approach.

From the conclusion:

We propose a scalable solution for managing RDF data as graphs in a distributed in-memory key-value store. Our query processing and optimization techniques support SPARQL queries without relying on join operations, and we report performance numbers of querying against RDF datasets of billions of triples. Besides scalability, our approach also has the potential to support queries and analytical tasks that are far more advanced than SPARQL queries, as RDF data is stored as graphs. In addition, our solution only utilizes basic (distributed) key-value store functions and thus can be ported to any in-memory key-value store.

A result that is:

  • scalable
  • goes beyond SPARQL
  • can be ported to any in-memory key-value store

Merits a very close read.
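To see why native graph storage makes something like reachability come naturally, here is a toy sketch. A plain Python dict stands in for the distributed in-memory key-value store, and the layout (each subject keyed to its outgoing predicate/object pairs) is my simplification, not Trinity.RDF's actual storage scheme.

```python
# RDF triples as adjacency lists in a key-value store, plus a reachability check.
# Assumption: a dict stands in for the key-value store; the data is made up.
from collections import defaultdict, deque
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def build_store(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Key each subject to the list of its outgoing (predicate, object) edges."""
    store = defaultdict(list)
    for s, p, o in triples:
        store[s].append((p, o))
    return dict(store)


def reachable(store: Dict[str, List[Tuple[str, str]]], start: str, goal: str) -> bool:
    """Breadth-first search: can `goal` be reached from `start` along any edges?"""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for _, neighbor in store.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


triples = [
    ("a", "linksTo", "b"),
    ("b", "linksTo", "c"),
    ("d", "cites", "a"),
]
store = build_store(triples)
print(reachable(store, "d", "c"))  # True: d -> a -> b -> c
print(reachable(store, "c", "a"))  # False: c has no outgoing edges
```

Each step of the search is a single key lookup, with no joins in sight. That is the flavor of the conclusion's claim, if not the substance of their engine.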

Makes me curious what other data models would work better if cast as graphs?

I first saw this in a tweet by Juan Sequeda.

Striking a Blow for Complexity

Thursday, March 21st, 2013

The W3C struck a blow for complexity today.

Its blog entry, entitled Eleven SPARQL 1.1 Specifications are W3C Recommendations, reads:

The SPARQL Working Group has completed development of its full-featured system for querying and managing data using the flexible RDF data model. It has now published eleven Recommendations for SPARQL 1.1, detailed in SPARQL 1.1 Overview. SPARQL 1.1 extends the 2008 Recommendation for SPARQL 1.0 by adding features to the query language such as aggregates, subqueries, negation, property paths, and an expanded set of functions and operators. Beyond the query language, SPARQL 1.1 adds other features that were widely requested, including update, service description, a JSON results format, and support for entailment reasoning. Learn more about the Semantic Web Activity.

I can’t wait for the movie version starring IBM’s Watson playing sudden death Jeopardy against Bob Ducharme, category SPARQL.

I’m betting on Bob!

Data.ac.uk

Thursday, March 21st, 2013

Data.ac.uk

From the website:

This is a landmark site for academia providing a single point of contact for linked open data development. It not only provides access to the know-how and tools to discuss and create linked data and data aggregation sites, but also enables access to, and the creation of, large aggregated data sets providing powerful and flexible collections of information.
Here at Data.ac.uk we’re working to inform national standards and assist in the development of national data aggregation subdomains.

I can’t imagine a greater contrast between my poor web authoring skills and a website than this one.

But having said that, I think you will be as disappointed as I was when you start looking for data on this “landmark site.”

There is some but not nearly enough to match the promise of such a cleverly designed website.

Perhaps they are hoping that someday RDF data (they also offer comma and tab delimited versions) will catch up to the site design.

I first saw this in a tweet by Frank van Harmelen.

Freebase Data Dumps

Thursday, March 21st, 2013

Freebase Data Dumps

From the webpage:

Data Dumps are a downloadable version of the data in Freebase. They constitute a snapshot of the data stored in Freebase and the Schema that structures it, and are provided under the same CC-BY license.

Full data dumps of every fact and assertion in Freebase are available as RDF and are updated every week. Deltas are not available at this time.

Total triples: 585 million
Compressed size: 14 GB
Uncompressed size: 87 GB
Data Format: Turtle RDF
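Some back-of-the-envelope arithmetic on those figures, in case you are sizing a disk before downloading. The numbers are the ones quoted above; reading "GB" as decimal gigabytes (10^9 bytes) is my assumption.

```python
# Rough sizing from the published Freebase dump figures.
# Assumption: GB means 10**9 bytes.
GB = 10**9

total_triples = 585_000_000
compressed_gb = 14
uncompressed_gb = 87

compression_ratio = uncompressed_gb / compressed_gb
bytes_per_triple = uncompressed_gb * GB / total_triples

print(f"compression ratio: about {compression_ratio:.1f} to 1")       # about 6.2 to 1
print(f"uncompressed size: about {bytes_per_triple:.0f} bytes/triple")  # about 149 bytes/triple
```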

I first saw this in a tweet by Thomas Steiner.

Open Annotation Data Model

Tuesday, March 19th, 2013

Open Annotation Data Model

Abstract:

The Open Annotation Core Data Model specifies an interoperable framework for creating associations between related resources, annotations, using a methodology that conforms to the Architecture of the World Wide Web. Open Annotations can easily be shared between platforms, with sufficient richness of expression to satisfy complex requirements while remaining simple enough to also allow for the most common use cases, such as attaching a piece of text to a single web resource.

An Annotation is considered to be a set of connected resources, typically including a body and target, where the body is somehow about the target. The full model supports additional functionality, enabling semantic annotations, embedding content, selecting segments of resources, choosing the appropriate representation of a resource and providing styling hints for consuming clients.
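To get a feel for the weight of the core model, here is a minimal sketch of the body/target structure as triples. The oa:Annotation class and the oa:hasBody and oa:hasTarget properties come from the Open Annotation vocabulary. The example URIs and the helper functions are made up, and everything beyond the bare body/target pair (selectors, styling hints, provenance) is left out.

```python
# A bare-bones Open Annotation: one annotation, one body, one target.
# Assumption: all three resources are identified by URIs (no embedded content).
from typing import List, Tuple

OA = "http://www.w3.org/ns/oa#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

Triple = Tuple[str, str, str]


def annotation_triples(annotation: str, body: str, target: str) -> List[Triple]:
    """Return the three triples that say: `body` is somehow about `target`."""
    return [
        (annotation, RDF_TYPE, OA + "Annotation"),
        (annotation, OA + "hasBody", body),
        (annotation, OA + "hasTarget", target),
    ]


def to_ntriples(triples: List[Triple]) -> str:
    """Serialize URI-only triples in N-Triples style, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Hypothetical URIs, for illustration only.
example = annotation_triples(
    "http://example.org/anno/1",
    "http://example.org/notes/comment-1",
    "http://example.org/pages/article",
)
print(to_ntriples(example))
```

Three triples for the simplest case is not bad. The weight comes in with everything the "full model" adds on top.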

My first encounter with this proposal so I need to compare it to my Simple Web Semantics.

At first blush, the Open Annotation Core Model looks a lot heavier than Simple Web Semantics.

I need to reform my blog posts into a formal document and perhaps attach a comparison as an annex.

Semantic Search Over The Web (SSW 2013)

Monday, March 18th, 2013

3rd International Workshop on Semantic Search Over The Web (SSW 2013)

Dates:

Abstract Papers submission: May 31, 2013 – 15:00 (3:00 pm) EDT
(Short) Full Paper submission: June 7, 2013 – 15:00 (3:00 pm) EDT
Author notification: July 19, 2013
Camera-ready copy due: August 2, 2013
Workshop date: During VLDB (Aug 26 – Aug 30)

From the webpage:

We are witnessing a smooth evolution of the Web from a worldwide information space of linked documents to a global knowledge base, composed of semantically interconnected resources. To date, the correlated and semantically annotated data available on the web amounts to 25 billion RDF triples, interlinked by around 395 million RDF links. The continuous publishing and the integration of the plethora of semantic datasets from companies, government and public sector projects is leading to the creation of the so-called Web of Knowledge. Each semantic dataset contributes to extend the global knowledge and increases its reasoning capabilities. As a matter of facts, researchers are now looking with growing interest to semantic issues in this huge amount of correlated data available on the Web. Many progresses have been made in the field of semantic technologies, from formal models to repositories and reasoning engines. While the focus of many practitioners is on exploiting such semantic information to contribute to IR problems from a document centric point of view, we believe that such a vast, and constantly growing, amount of semantic data raises data management issues that must be faced in a dynamic, highly distributed and heterogeneous environment such as the Web.

The third edition of the International Workshop on Semantic Search over the Web (SSW) will discuss about data management issues related to the search over the web and the relationships with semantic web technologies, proposing new models, languages and applications.

The research issues can be summarized by the following problems:

  • How can we model and efficiently access large amounts of semantic web data?
  • How can we effectively retrieve information exploiting semantic web technologies?
  • How can we employ semantic search in real world scenarios?

The SSW Workshop invites researchers, engineers, service developers to present their research and works in the field of data management for semantic search. Papers may deal with methods, models, case studies, practical experiences and technologies.

Apologies for the uncertainty of the workshop date. (There is confusion about the date on the workshop site, one place says the 26th, the other the 30th. Check before you make reservation/travel arrangements.)

I differ with the organizers on some issues but on the presence of: “…data management issues that must be faced in a dynamic, highly distributed and heterogeneous environment such as the Web,” there is no disagreement.

That’s the trick isn’t it? In any confined or small group setting, just about any consistent semantic solution will work.

The hurly-burly of a constant stream of half-heard, partially understood communications across distributed and heterogeneous systems tests the true mettle of semantic solutions.

Not a quest for perfect communication but “good enough.”

LDBC – Second Technical User Community (TUC) Meeting

Monday, March 18th, 2013

LDBC: Linked Data Benchmark Council – Second Technical User Community (TUC) Meeting – 22/23rd April 2013.

From the post:

The LDBC consortium are pleased to announce the second Technical User Community (TUC) meeting.

This will be a two day event in Munich on the 22/23rd April 2013.

The event will include:

  • Introduction to the objectives and progress of the LDBC project.
  • Description of the progress of the benchmarks being evolved through Task Forces.
  • Users explaining their use-cases and describing the limitations they have found in current technology.
  • Industry discussions on the contents of the benchmarks.

All users of RDF and graph databases are welcome to attend. If you are interested, please contact: ldbc AT ac DOT upc DOT edu.

Further meeting details at the post.

Beacons of Availability

Sunday, March 17th, 2013

From Records to a Web of Library Data – Pt3 Beacons of Availability by Richard Wallis.

Beacons of Availability

As I indicated in the first of this series, there are descriptions of a broader collection of entities, than just books, articles and other creative works, locked up in the Marc and other records that populate our current library systems. By mining those records it is possible to identify those entities, such as people, places, organisations, formats and locations, and model & describe them independently of their source records.

As I discussed in the post that followed, the library domain has often led in the creation and sharing of authoritative datasets for the description of many of these entity types. Bringing these two together, using URIs published by the Hubs of Authority, to identify individual relationships within bibliographic metadata published as RDF by individual library collections (for example the British National Bibliography, and WorldCat) is creating Library Linked Data openly available on the Web.

Why do we catalogue? is a question, I often ask, with an obvious answer – so that people can find our stuff. How does this entification, sharing of authorities, and creation of a web of library linked data help us in that goal. In simple terms, the more libraries can understand what resources each other hold, describe, and reference, the more able they are to guide people to those resources. Sounds like a great benefit and mission statement for libraries of the world but unfortunately not one that will nudge the needle on making library resources more discoverable for the vast majority of those that can benefit from them.

I have lost count of the number of presentations and reports I have seen telling us that upwards of 80% of visitors to library search interfaces start in Google. A similar weight of opinion can be found that complains how bad Google, and the other search engines, are at representing library resources. You will get some balancing opinion, supporting how good Google Book Search and Google Scholar are at directing students and others to our resources. Yet I am willing to bet that again we have another 80-20 equation or worse about how few, of the users that libraries want to reach, even know those specialist Google services exist. A bit of a sorry state of affairs when the major source of searching for our target audience, is also acknowledged to be one of the least capable at describing and linking to the resources we want them to find!

Library linked data helps solve both the problem of better description and findability of library resources in the major search engines. Plus it can help with the problem of identifying where a user can gain access to that resource to loan, download, view via a suitable license, or purchase, etc.

I am an ardent sympathizer with the goal of helping people find “our stuff.”

I don’t disagree with the description of Google as: “…the major source of searching for our target audience, is also acknowledged to be one of the least capable at describing and linking to the resources we want them to find!”

But in all fairness to Google, I would remind you of Drabenstott’s research that found for the Library of Congress subject headings:

Overall percentages of correct meanings for subject headings in the original order of subdivisions were as follows:

children 32%
adults 40%
reference 53%
technical services librarians 56%

The Library of Congress subject headings have been around for more than a century and librarians arrive at the correct meaning just over half the time.

Let’s not wait more than a century to test the claim:*

Library linked data helps solve both the problem of better description and findability of library resources in the major search engines.


* By “test” I don’t mean the sort of study, “…we recruited twelve LIS students but one had to leave before the study was complete….”

I am using “test” in the sense of a well designed and organized social science project with professional assistance from social scientists, UI test designers and the like.

I think OCLC is quite sincere in its promotion of linked data, but effectiveness is an empirical question, not one of sincerity.

Algebraix Data Achieves Unrivaled Semantic Benchmark Performance

Saturday, March 16th, 2013

Algebraix Data Achieves Unrivaled Semantic Benchmark Performance by Angela Guess.

From the post:

Algebraix Data Corporation today announced its SPARQL Server(TM) RDF database successfully executed all 17 of its queries on the SP2 benchmark up to one billion triples on one computer node. The SP2 benchmark is the most computationally complex for testing SPARQL performance and no other vendor has reported results for all queries on data sizes above five million triples.

Furthermore, SPARQL Server demonstrated linear performance in total SP2Bench query time on data sets from one million to one billion triples. These latest dramatic results are made possible by algebraic optimization techniques that maximize computing resource utilization.

“Our outstanding SPARQL Server performance is a direct result of the algebraic techniques enabled by our patented Algebraix technology,” said Charlie Silver, CEO of Algebraix Data. “We are investing heavily in the development of SPARQL Server to continue making substantial additional functional, performance and scalability improvements.”

Pretty much a copy of the press release from Algebraix.

You may find:

Doing the Math: The Algebraix DataBase Whitepaper: What it is, how it works, why we need it (PDF) by Robin Bloor, PhD

ALGEBRAIX Technology Mathematics Whitepaper (PDF), by Algebraix Data

and,

Granted Patents

more useful.

BTW, The SP²Bench SPARQL Performance Benchmark will be useful as well.

Algebraix listed its patents but I supplied the links. Why the links were missing at Algebraix I cannot say.

If the claim “…no other vendor has reported results for all queries on data sizes above five million triples…” is correct, isn’t scaling an issue for SPARQL?

From Records to a Web of Library Data – Pt2 Hubs of Authority

Saturday, March 16th, 2013

From Records to a Web of Library Data – Pt2 Hubs of Authority by Richard Wallis.

From the post:

Hubs of Authority

Libraries, probably because of their natural inclination towards cooperation, were ahead of the game in data sharing for many years. The moment computing technology became practical, in the late sixties, cooperative cataloguing initiatives started all over the world either in national libraries or cooperative organisations. Two from personal experience come to mind, BLCMP started in Birmingham, UK in 1969 eventually evolved in to the leading Semantic Web organisation Talis, and in 1967 Dublin, Ohio saw the creation of OCLC. Both in their own way having had significant impact on the worlds of libraries, metadata, and the web (and me!).

One of the obvious impacts of inter-library cooperation over the years has been the authorities, those sources of authoritative names for key elements of bibliographic records. A large number of national libraries have such lists of agreed formats for author and organisational names. The Library of Congress has in addition to its name authorities, subjects, classifications, languages, countries etc. Another obvious success in this area is VIAF, the Virtual International Authority File, which currently aggregates over thirty authority files from all over the world – well used and recognised in library land, and increasingly across the web in general as a source of identifiers for people & organisations.

These, Linked Data enabled, sources of information are developing importance in their own right, as a natural place to link to, when asserting the thing, person, or concept you are identifying in your data. As Sir Tim Berners-Lee’s fourth principle of Linked Data tells us to “Include links to other URIs. so that they can discover more things”. VIAF in particular is becoming such a trusted, authoritative, source of URIs that there is now a VIAFbot responsible for interconnecting Wikipedia and VIAF to surface hundreds of thousands of relevant links to each other. A great hat-tip to Max Klein, OCLC Wikipedian in Residence, for his work in this area.

I don’t deny that VIAF is a very useful tool but if you search for the personal name “Marilyn Monroe,” it returns:

1. Miller, Arthur, 1915-2005
National Library of Australia National Library of the Czech Republic National Diet Library (Japan) Deutsche Nationalbibliothek RERO (Switzerland) SUDOC (France) Library and Archives Canada National Library of Israel (Latin) National Library of Sweden NUKAT Center (Poland) Bibliothèque nationale de France Biblioteca Nacional de España Library of Congress/NACO

Miller, Arthur (Arthur Asher), 1915-2005
National Library of the Netherlands-test

Miller, Arthur, 1915-
Vatican Library Biblioteca Nacional de Portugal

ميلر، ارثر، 1915-2005 م.
Bibliotheca Alexandrina (Egypt)

Miller, Arthur
Wikipedia (en)-test

מילר, ארתור, 1915-2005
National Library of Israel (Hebrew)

2. Monroe, Marilyn, 1926-1962
National Library of Israel (Latin) National Library of the Czech Republic National Diet Library (Japan) Deutsche Nationalbibliothek SUDOC (France) Library and Archives Canada National Library of Australia National Library of Sweden NUKAT Center (Poland) Bibliothèque nationale de France Biblioteca Nacional de España Library of Congress/NACO

Monroe, Marilyn
National Library of the Netherlands-test Wikipedia (en)-test RERO (Switzerland)

Monroe, Marilyn American actress, model, and singer, 1926-1962
Getty Union List of Artist Names

Monroe, Marilyn, pseud.
Biblioteca Nacional de Portugal

3. DiMaggio, Joe, 1914-1999
Library of Congress/NACO Bibliothèque nationale de France

Di Maggio, Joe 1914-1999
Deutsche Nationalbibliothek

Di Maggio, Joseph Paul, 1914-1999
National Diet Library (Japan)

DiMaggio, Joe, 1914-
National Library of Australia

Dimaggio, Joseph Paul, 1914-1999
SUDOC (France)

DiMaggio, Joe (Joseph Paul), 1914-1999
National Library of the Netherlands-test

Dimaggio, Joe
Wikipedia (en)-test

4. Monroe, Marilyn
Deutsche Nationalbibliothek

5. Hurst-Monroe, Marlene
Library of Congress/NACO

6. Wolf, Marilyn Monroe
Deutsche Nationalbibliothek

Maybe Sir Tim is right, users “…can discover more things.”

Some of them are related, some of them are not.

DRM/WWW, Wealth/Salvation: Theological Parallels

Thursday, March 14th, 2013

Cory Doctorow misses a teaching moment in his: What I wish Tim Berners-Lee understood about DRM.

Cory says:

Whenever Berners-Lee tells the story of the Web’s inception, he stresses that he was able to invent the Web without getting any permission. He uses this as a parable to explain the importance of an open and neutral Internet.

The “…without getting any permission” was a principle for Tim Berners-Lee when he was inventing the Web.

A principle then, not now.

Evidence? The fundamentals of RDF have been mired in the same model for fourteen (14) years. Impeding the evolution of the “Semantic” Web. Whatever its merits.

Another example? HTML5 violates prior definitions of URL in order to widen the reach of HTML5. (URL Homonym Problem: A Topic Map Solution)

Same “principle” as DRM support, expanding the label of “WWW” beyond what early supporters would recognize as the WWW.

HTML5 rewriting of URL and DRM support are membership building exercises.

The teaching moment comes from early Christian history.

You may (or may not) recall the parable of the rich young ruler (Matthew 19:16-30), where a rich young man asks Jesus what he must do to be saved.

Jesus replies:

One thing you still lack. Sell all that you have and distribute to the poor, and you will have treasure in heaven; and come, follow me.

And for the first hundred or more years of Christianity, so far as can be known, that rule, divesting yourself of property, was followed.

Until Clement of Alexandria. Clement took the position that indeed the rich could retain their goods, so long as they used them charitably. (Now there’s a loophole!)

That created two paths to salvation: one for anyone foolish enough to take the Bible at its word and another for anyone who wanted to call themselves Christian, without any inconvenience or discomfort.

Following Clement of Alexandria, Tim Berners-Lee is creating two paths to the WWW.

One for people who are foolish enough to innovate and share information, the innovation model of the WWW that Cory speaks so highly of.

Another path for people (DRM crowd) who neither spin nor toil but who want to burden everyone who does.

Membership as a principle isn’t surprising considering how TBL sees himself in the mirror:

TBL as WWW Pope

Data Catalog Vocabulary (DCAT) [Last Call ends 08 April 2013]

Tuesday, March 12th, 2013

Data Catalog Vocabulary (DCAT)

Abstract:

DCAT is an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web. This document defines the schema and provides examples for its use.

By using DCAT to describe datasets in data catalogs, publishers increase discoverability and enable applications easily to consume metadata from multiple catalogs. It further enables decentralized publishing of catalogs and facilitates federated dataset search across sites. Aggregated DCAT metadata can serve as a manifest file to facilitate digital preservation.

If you have comments, now would be a good time to finish them up for submission.

I first saw this in a tweet by Sandro Hawke.

RDF Data Cube Vocabulary [Last Call ends 08 April 2013]

Tuesday, March 12th, 2013

RDF Data Cube Vocabulary

Abstract:

There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts. The Data Cube vocabulary provides a means to do this using the W3C RDF (Resource Description Framework) standard. The model underpinning the Data Cube vocabulary is compatible with the cube model that underlies SDMX (Statistical Data and Metadata eXchange), an ISO standard for exchanging and sharing statistical data and metadata among organizations. The Data Cube vocabulary is a core foundation which supports extension vocabularies to enable publication of other aspects of statistical data flows or other multi-dimensional data sets.

If you have comments, now would be a good time to finish them up for submission.

I first saw this in a tweet by Sandro Hawke.

SPIN: SPARQL Inferencing Notation

Tuesday, March 12th, 2013

SPIN: SPARQL Inferencing Notation

From the webpage:

SPIN is a W3C Member Submission that has become the de-facto industry standard to represent SPARQL rules and constraints on Semantic Web models. SPIN also provides meta-modeling capabilities that allow users to define their own SPARQL functions and query templates. Finally, SPIN includes a ready to use library of common functions.

SPIN in Five Slides.

In case you encounter SPARQL rules and constraints.

I first saw this in a tweet by Stian Danenbarger.

prefix.cc

Tuesday, March 12th, 2013

prefix.cc: namespace lookup for RDF developers

From the about page:

The intention of this service is to simplify a common task in the work of RDF developers: remembering and looking up URI prefixes.

You can look up prefixes from the search box on the homepage, or directly by typing URLs into your browser bar, such as http://prefix.cc/foaf or http://prefix.cc/foaf,dc,owl.ttl.

New prefix mappings can be added by anyone. If multiple conflicting URIs are submitted for the same namespace, visitors can vote for the one they consider the best. You are only allowed one vote or namespace submission per day.

For n3, xml, rdfa, sparql, the result interface shows the URI prefixes in use.

But if there is more than one URI prefix, different URI prefixes appear with each example.

Don’t multiple, conflicting URI prefixes seem problematic to you?
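To make the problem concrete, here is a minimal sketch of what "conflicting" means for a prefix table. It is not prefix.cc's implementation or API. The submissions and vote counts below are invented for illustration, and the function names are hypothetical. The only facts borrowed from the wild are the namespace URIs themselves, and pairing both Dublin Core namespaces with the prefix "dc" is an assumption made for the example.

```python
from collections import defaultdict

# Hypothetical submissions: (prefix, namespace URI, votes). The vote counts
# are invented for illustration and are not taken from prefix.cc.
SUBMISSIONS = [
    ("foaf", "http://xmlns.com/foaf/0.1/", 12),
    ("dc", "http://purl.org/dc/elements/1.1/", 9),
    ("dc", "http://purl.org/dc/terms/", 4),
]


def group_by_prefix(submissions):
    """Map each prefix to every (namespace URI, votes) pair submitted for it."""
    grouped = defaultdict(list)
    for prefix, uri, votes in submissions:
        grouped[prefix].append((uri, votes))
    return dict(grouped)


def conflicting_prefixes(submissions):
    """Return only the prefixes that map to more than one namespace URI."""
    return {
        prefix: [uri for uri, _ in candidates]
        for prefix, candidates in group_by_prefix(submissions).items()
        if len(candidates) > 1
    }


def preferred_uri(submissions, prefix):
    """Pick the highest-voted URI for a prefix (one possible policy, assumed here)."""
    candidates = group_by_prefix(submissions)[prefix]
    return max(candidates, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    print(conflicting_prefixes(SUBMISSIONS))
    print(preferred_uri(SUBMISSIONS, "dc"))
```

Voting picks a winner for display, but any document written with the losing expansion still means something different by the same prefix, which is the problem the question above is pointing at.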

Precursors to Simple Web Semantics

Thursday, February 21st, 2013

A couple of precursors to Simple Web Semantics have been brought to my attention.

Wanted to alert you so you can consider these prior/current approaches while evaluating Simple Web Semantics.

The first one was from Rob Weir (IBM), who suggested I look at “smart tags” from Microsoft and sent the link to Smart tags (Wikipedia).

The second one was from Nick Howard (a math wizard I know) who pointed out the similarity to bookmarklets. On that see: Bookmarklet (Wikipedia).

I will be diving deeper into both of these technologies.

Not so much a historical study but what did/did not work, etc.

Other suggestions, directions, etc. are most welcome!

I have another refinement to the syntax that I will be posting tomorrow.

Literature Survey of Graph Databases

Tuesday, February 19th, 2013

Literature Survey of Graph Databases by Bryan Thompson.

I can understand Danny Bickson, Literature survey of graph databases, being excited about the coverage of GraphChi in this survey.

However, there are other names you will recognize as well (TOC order):

  • RDF3X
  • Diplodocus
  • GraphChi
  • YARS2
  • 4store
  • Virtuoso
  • Bigdata
  • SHARD
  • Graph partitioning
  • Accumulo
  • Urika
  • Scalable RDF query processing on clusters and supercomputers (a system with no name at Rensselaer Polytechnic)

As you can tell from the system names, the survey focuses on processing of RDF.

In reviewing one system, Bryan remarks:

Only small data sets were considered (100s of millions of edges). (emphasis added)

I think that captures the focus of the paper better than any comment I can make.

A must read for graph heads!

Simple Web Semantics – Index Post

Monday, February 18th, 2013

Sam Hunting suggested that I add indexes to the Simple Web Semantics posts to facilitate navigating from one to the other.

It occurred to me that having a single index page could also be useful.

The series began with:

Reasoning why something isn’t working is important to know before proposing a solution.

I have gotten good editorial feedback on the proposal and will be posting a revision in the next couple of days.

Nothing substantially different but clearer and more precise.

If you have any comments or suggestions, please make them at your earliest convenience.

I am always open to comments but the sooner they arrive the sooner I can make improvements.

Simple Web Semantics (SWS) – Syntax Refinement

Sunday, February 17th, 2013

In Saving the “Semantic” Web (part 5), the only additional HTML syntax I proposed was:

<meta name="dictionary" content="URI">

in the <head> element of an HTML document.

(Where you would locate the equivalent declaration of a URI dictionary in other document formats will vary.)

But that sets the URI dictionary for an entire document.

What if you want more fine grained control over the URI dictionary for a particular URI?

It would be possible to do something complicated with namespaces, containers, scope, etc. but the simpler solution would be:

<a dictionary="URI" href="yourURI">

Either the URI is governed by the declaration for the entire page or it has a declared dictionary URI.

Or to summarize the HTML syntax of SWS at this point:

<meta name="dictionary" content="URI">

<a dictionary="URI" href="yourURI">
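The rule "page-level dictionary unless the link declares its own" can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's standard html.parser rather than anything browser-side, the class and function names are hypothetical, and the example.org URIs are placeholders.

```python
from html.parser import HTMLParser


class DictionaryResolver(HTMLParser):
    """Collect each <a href> together with the dictionary that governs it."""

    def __init__(self):
        super().__init__()
        self.page_dictionary = None  # from <meta name="dictionary" content="...">
        self.links = []              # (href, per-link dictionary or None)

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == "dictionary":
            self.page_dictionary = attributes.get("content")
        elif tag == "a" and "href" in attributes:
            self.links.append((attributes["href"], attributes.get("dictionary")))

    def resolved(self):
        """Pair each href with its own dictionary, else the page-level one."""
        return [(href, own or self.page_dictionary) for href, own in self.links]


def resolve_dictionaries(html):
    """Parse an HTML string and return (href, governing dictionary) pairs."""
    parser = DictionaryResolver()
    parser.feed(html)
    parser.close()
    return parser.resolved()


# Placeholder document: one link inherits the page dictionary, one overrides it.
SAMPLE = """
<html><head><meta name="dictionary" content="http://example.org/dict/page"></head>
<body>
<a href="http://example.org/subject/1">one</a>
<a dictionary="http://example.org/dict/special" href="http://example.org/subject/2">two</a>
</body></html>
"""

if __name__ == "__main__":
    for href, dictionary in resolve_dictionaries(SAMPLE):
        print(href, "->", dictionary)
```

The fallback is applied after parsing, so the result does not depend on whether the meta element appears before or after the links.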

NBA Stats Like Never Before [No RDF/Linked Data/Topic Maps In Sight]

Saturday, February 16th, 2013

NBA Stats Like Never Before by Timo Elliott.

From the post:

The National Basketball Association today unveiled a new site for fans of games statistics: NBA.com/stats, powered by SAP Analytics technology. The multi-year marketing partnership between SAP and the NBA was announced six months ago:

“We are constantly researching new and emerging technologies in an effort to provide our fans with new ways to connect with our game,” said NBA Commissioner David Stern. “SAP is a leader in providing innovative software solutions and an ideal partner to provide a dynamic and comprehensive statistical offering as fans interact with NBA basketball on a global basis.”

“SAP is honored to partner with the NBA, one of the world’s most respected sports organizations,” said Bill McDermott, co-CEO, SAP. “Through SAP HANA, fans will be able to experience the NBA as never before. This is a slam dunk for SAP, the NBA and the many fans who will now have access to unprecedented insight and analysis.”

The free database contains every box score of every game played since the league’s inception in 1946, including graphical displays of players shooting tendencies.

To the average fan NBA.com/Stats delivers information that is of immediate interest to them, not their computers.

Another way to think about it:

Computers don’t make purchasing decisions, users do.

Something to think about when deciding on your next semantic technology.

Saving the “Semantic” Web (part 5)

Friday, February 15th, 2013

Simple Web Semantics

For what it’s worth, what follows in this post is a partial, non-universal and useful only in some cases proposal.

That has been forgotten by this point but in my defense, I did try to warn you. ;-)

1. Division of Semantic Labor

The first step towards useful semantics on the web must be a division of semantic labor.

I won’t recount the various failures of the Semantic Web, topic maps and other initiatives to “educate” users on how they should encode semantics.

All such efforts have failed, are failing now and will fail.

That is not a negative comment on users.

In another life I advocated tools that would enable biblical scholars to work in XML, without having to learn angle-bang syntax. It wasn’t for lack of intelligence, most of them were fluent in five or six ancient languages.

They were focused on being biblical scholars and had no interest in learning the minutiae of XML encoding.

After many years, due to a cast of hundreds if not thousands, OpenOffice, OpenDocument Format (ODF) and XML editing became available to ordinary users.

Not the fine tuned XML of the Text Encoding Initiative (TEI) or DocBook, but having a 50 million plus user share is better than being in the 5 to 6 digit range.

Users have not succeeded in authoring structured data, such as RDF, but have demonstrated competence at authoring <a> elements with URIs.

I propose the following division of semantic labor:

Users – Responsible for identification of subjects in content they author, using URIs in the <a> element.

Experts – Responsible for annotation (by whatever means) of URIs that can be found in <a> elements in content.

2. URIs as Pointers into a Dictionary

One of the comments in this series pointed out that URIs are like “pointers into a dictionary.” I like that imagery and it is easier to understand than the way I intended to say it.

If you think of words as pointers into a dictionary, how many dictionaries does a word point into?

And contrast your answer with the number of dictionaries into which a URI points?

If we are going to use URIs as “pointers into a dictionary,” then there should be no limit on the number of dictionaries into which they can point.

A URI can be posed to any number of dictionaries as a query, with possibly different results from each dictionary.
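As a rough sketch of that idea, assume each dictionary is nothing more than a mapping from URI to a few properties. The two dictionaries, their contents and the example.org URI below are all invented for illustration, and ask_all is a hypothetical helper name.

```python
# Two hypothetical dictionaries, modeled as plain mappings from URI to
# key/value properties. Every entry here is invented for illustration.
NEWS_DICTIONARY = {
    "http://example.org/id/42": {"label": "Example subject", "kind": "article pointers"},
}
SCHOOL_DICTIONARY = {
    "http://example.org/id/42": {"label": "Example subject", "kind": "curated lessons"},
}
EMPTY_DICTIONARY = {}


def ask_all(uri, dictionaries):
    """Pose one URI to every named dictionary; omit dictionaries with no entry."""
    answers = {}
    for name, dictionary in dictionaries.items():
        entry = dictionary.get(uri)
        if entry is not None:
            answers[name] = entry
    return answers


if __name__ == "__main__":
    chosen = {
        "news": NEWS_DICTIONARY,
        "school": SCHOOL_DICTIONARY,
        "empty": EMPTY_DICTIONARY,
    }
    for name, entry in ask_all("http://example.org/id/42", chosen).items():
        print(name, entry)
```

The URI never changes. Only the choice of dictionaries does, and nothing limits how many can be asked.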

3. Of Dictionaries

Take the URI http://data.nytimes.com/47271465269191538193 as an example of a URI that can appear in a dictionary.

If you follow that URI, you will notice a couple of things:

  1. It isn’t content suitable for primary or secondary education.
  2. The content is limited to that of the New York Times.
  3. The content of the NYT consists of article pointers

Not to mention it is a “pull” interface that requires effort on the part of users, as opposed to a “push” interface that reduces that effort.

What if rather than “following” the URI http://data.nytimes.com/47271465269191538193, you could submit that same URI to another dictionary, one that had different information?

A dictionary that for that URI returns:

  1. Links to content suitable for primary or secondary education.
  2. Broader content than just New York Times.
  3. Curated content and not just article pointers

Just as we have semantic diversity:

URI dictionaries shall not be required to use a particular technology or paradigm.

4. Immediate Feedback

Whether you will admit it or not, we have all coded HTML and then loaded it in a browser to see the results.

That’s called “immediate feedback” and made HTML, the early versions anyway, extremely successful.

When <a> elements with URIs are used to identify subjects, how can we duplicate that “immediate feedback” experience?

My suggestion is that users encode in the <head> of their documents a meta element that reads:

<meta name="dictionary" content="URI">

And insert either JavaScript or jQuery code that creates an array of all the URIs in the document, passes those URIs to the dictionary specified by the user and then displays a set of values when a user mouses over a particular URI.

Think of it as being the equivalent of spell checking except for subjects. You could even call it “subject checking.”

For most purposes, dictionaries should only return 3 or 4 key/value pairs, enough for users to verify their choice of a URI. With an option to see more information.
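The "subject checking" loop can be sketched as follows. It is written in Python rather than in-browser JavaScript purely so it can run standalone, and it is an illustration, not a reference implementation. The lookup function is supplied by the caller, the fake dictionary entries are invented, the limit of four pairs is an assumption taken from the "3 or 4" above, and all names are hypothetical.

```python
from html.parser import HTMLParser
from itertools import islice


class LinkCollector(HTMLParser):
    """Gather the page-level dictionary URI and every <a href> in a document."""

    def __init__(self):
        super().__init__()
        self.dictionary_uri = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == "dictionary":
            self.dictionary_uri = attributes.get("content")
        elif tag == "a" and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def subject_check(html, lookup, limit=4):
    """Return {href: first `limit` properties} for every link in the document.

    `lookup(dictionary_uri, href)` is a caller-supplied function; how it
    reaches the dictionary (HTTP, file, database) is deliberately left open.
    """
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    report = {}
    for href in collector.hrefs:
        properties = lookup(collector.dictionary_uri, href) or {}
        report[href] = dict(islice(properties.items(), limit))
    return report


# Invented entries standing in for whatever a real dictionary would return.
FAKE_ENTRIES = {
    "http://example.org/subject/1": {
        "label": "Example subject",
        "type": "Person",
        "language": "en",
        "source": "example dictionary",
        "note": "fifth property, trimmed from the preview",
    },
}


def fake_lookup(dictionary_uri, href):
    """Stand-in for a dictionary query; ignores dictionary_uri in this sketch."""
    return FAKE_ENTRIES.get(href)


PAGE = """
<head><meta name="dictionary" content="http://example.org/dict"></head>
<a href="http://example.org/subject/1">known</a>
<a href="http://example.org/subject/unknown">unknown</a>
"""

if __name__ == "__main__":
    for href, preview in subject_check(PAGE, fake_lookup).items():
        print(href, preview)
```

An empty preview for a link is itself feedback: the chosen dictionary knows nothing about that URI, so the author may want a different URI or a different dictionary.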

True enough, I haven’t asked for users to say which of those properties identify the subject in question and I don’t intend to. That lies in the domain of experts.

The inline URI mechanism lends itself to automatic insertion of URIs, which users could then verify capture their meaning. (Wikifier is a good example, assuming you have a dictionary based on Wikipedia URIs.)

Users should be able to choose the dictionaries they prefer for identification of subjects. Further, users should be able to verify their identifications from observing properties associated with a URI.

5. Incentives, Economic and Otherwise

There are economic and other incentives that arise from “Simple Web Semantics.”

First, divorcing URI dictionaries from any particular technology will create an easy on-ramp for dictionary creators to offer as many or as few services as they choose. Users can vote with their feet on which URI dictionaries meet their needs.

Second, divorcing URIs from their sources creates the potential for economic opportunities and competition in the creation of URI dictionaries. Dictionary creators can serve up definitions for popular URIs, along with pointers to other content, free and otherwise.

Third, giving users the right to choose their URI dictionaries is a step towards returning democracy to the WWW.

Fourth, giving users immediate feedback based on URIs they choose, makes users the judges of their own semantics, again.

Fifth, with the rise of URI dictionaries, the need to maintain URIs, “cool” or otherwise, simply disappears. No one maintains the existence of words. We have dictionaries.

There are technical refinements that I could suggest but I wanted to draw the proposal in broad strokes and improve it based on your comments.

Comments/Suggestions?

PS: As I promised at the beginning, this proposal does not address many of the endless complexities of semantic integration. If you need a different solution, for a different semantic integration problem, you know where to find me.


Saving the “Semantic” Web (part 4)

Wednesday, February 13th, 2013

Democracy vs. Aristocracy

Part of a recent comment on this series reads:

What should we have been doing instead of the semantic web? ISO Topic Maps? There is some great work in there, but has it been a better success?

That is an important question and I wanted to capture it outside of comments on a prior post.

Earlier in this series of posts I pointed out the success of HTML, especially when contrasted with Semantic Web proposals.

Let me hasten to add the same observation is true for ISO Topic Maps (HyTime or later versions).

The critical difference between HTML (the early and quite serviceable versions) and Semantic Web/Topic Maps is that the former democratizes communication and the latter fosters a technical aristocracy.

Every user who can type, and some who hunt-n-peck, can author HTML and publish their content for others around the world to read, discuss, etc.

That is a very powerful and democratizing notion about content creation.

The previous guardians, gate keepers, insiders, and their familiars, who didn’t add anything of value to prior publication processes, are still reeling from the blow.

Even as old aristocracies crumble, new ones evolve.

Technical aristocracies for example. A phrase relevant to both the Semantic Web and ISO Topic Maps.

Having tasted freedom, the crowds aren’t as accepting of the lash/leash as they once were. Nor of the aristocracies who would wield them. Nor should they be.

Which makes me wonder: Why the emphasis on creating dumbed down semantics for computers?

We already have billions of people who are far more competent semantically than computers.

Where are our efforts to enable them to traverse the different semantics of other users?

Such as the semantics of the aristocrats who have anointed themselves to labor on their behalf?

If you have guessed that I have little patience with aristocracies, you are right in one.

I came by that aversion honestly.

I practiced law in a civilian jurisdiction for a decade. A specialist language, law, can be more precise, but it also excludes others from participation. The same was true when I studied theology and ANE languages. A bit later, in markup technologies (then SGML/HyTime), the same lesson was repeated. ODF and topic maps, where I work now, are two more specialized languages.

Yet a reasonably intelligent person can discuss issues in any of those fields, if they can get past the language barriers aristocrats take so much comfort in maintaining.

My answer to what we should be doing is:

Looking for ways to enable people to traverse and enjoy the semantic diversity that accounts for the richness of the human experience.

PS: Computers have a role to play in that quest, but a subordinate one.