Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 7, 2013

Creating Knowledge out of Interlinked Data…

Filed under: Linked Data,LOD,Open Data,Semantic Web — Patrick Durusau @ 6:55 pm

Creating Knowledge out of Interlinked Data – STATISTICAL OFFICE WORKBENCH by Bert Van Nuffelen and Karel Kremer.

From the slides:

LOD2 is a large-scale integrating project co-funded by the European Commission within the FP7 Information and Communication Technologies Work Programme. This 4-year project comprises leading Linked Open Data technology researchers, companies, and service providers. Coming from across 12 countries the partners are coordinated by the Agile Knowledge Engineering and Semantic Web Research Group at the University of Leipzig, Germany.

LOD2 will integrate and syndicate Linked Data with existing large-scale applications. The project shows the benefits in the scenarios of Media and Publishing, Corporate Data intranets and eGovernment.

LOD2 Stack Release 3.0 overview

Connecting the dots: Workbench for Statistical Office

In case you are interested:

LOD2 homepage

Ubuntu 12.04 Repository

VM User / Password: lod2demo / lod2demo

LOD2 blog

The LOD2 project expires in August of 2014.

Linked Data is going to be around, one way or the other, for quite some time.

My suggestion: Grab the last VM from LOD2 and a copy of its OS, and store them in a location that migrates as data systems change.

November 5, 2013

Implementations of RDF 1.1?

Filed under: RDF,Semantic Web — Patrick Durusau @ 5:31 pm

W3C Invites Implementations of five Candidate Recommendations for version 1.1 of the Resource Description Framework (RDF)

From the post:

The RDF Working Group today published five Candidate Recommendations for version 1.1 of the Resource Description Framework (RDF), a widespread and stable technology for data interoperability:

  • RDF 1.1 Concepts and Abstract Syntax defines the basics which underlie all RDF syntaxes and systems. It provides for general data interoperability.
  • RDF 1.1 Semantics defines the precise semantics of RDF data, supporting use with a wide range of “semantic” or “knowledge” technologies.
  • RDF 1.1 N-Triples defines a simple line-oriented syntax for serializing RDF data. N-Triples is a minimalist subset of Turtle.
  • RDF 1.1 TriG defines an extension to Turtle (aligned with SPARQL) for handling multiple RDF Graphs in a single document.
  • RDF 1.1 N-Quads defines an extension to N-Triples for handling multiple RDF Graphs in a single document.

All of these technologies are now stable and ready to be widely implemented. Each specification (except Concepts) has an associated Test Suite and includes a link to an Implementation Report showing how various software currently fares on the tests. If you maintain RDF software, please review these specifications, update your software if necessary, and (if relevant) send in test results as explained in the Implementation Report.

RDF 1.1 is a refinement to the 2004 RDF specifications, designed to simplify and improve RDF without breaking existing deployments.
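
If you want to see how these serializations relate in practice, here is a minimal sketch using Python's rdflib (my choice of library, not something the W3C post mentions) that round-trips one triple through N-Triples, TriG and N-Quads:

```python
# Minimal sketch, assuming rdflib 6+ is installed (pip install rdflib).
from rdflib import Dataset, URIRef

# One triple in N-Triples, the line-oriented subset of Turtle.
nt_data = '<http://example.org/book1> <http://purl.org/dc/terms/title> "RDF 1.1 Primer" .\n'

ds = Dataset()
g = ds.graph(URIRef("http://example.org/graph1"))  # a named graph inside the dataset
g.parse(data=nt_data, format="nt")

# TriG extends Turtle with named graphs; N-Quads extends N-Triples the same way.
print(ds.serialize(format="trig"))
print(ds.serialize(format="nquads"))
```

The data model is the same in every case; only the syntax changes, which is the point of the "minimalist subset" and "extension" language in the W3C descriptions.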

In case you are curious about where this lies in the W3C standards process, see: W3C Technical Report Development Process.

I never had a reason to ferret it out before but now that I did, I wanted to write it down.

Has it been ten (10) years since RDF 1.0?

I think it has been fifteen (15) years since the Semantic Web was going to teach the web to sing or whatever the slogan was.

Like all legacy data, RDF will never go away; later systems will have to account for it.

Like COBOL I suppose.

October 18, 2013

Semantic Web adoption and the users

Filed under: Marketing,Semantic Web — Patrick Durusau @ 6:58 pm

Semantic Web adoption and the users by Lars Marius Garshol.

From the post:

A hot topic at ESWC 2013, and many other places besides, was the issue of Semantic Web adoption, which after a decade and a half is still less than it should be. The thorny question is: what can be done about it? David Karger did a keynote on the subject at ESWC 2013 where he argued that the Semantic Web can help users manage their data. I think he’s right, but that this is only a very narrow area of application. In any case, end users are not the people we should aim for if adoption of Semantic Web technologies is to be the goal.

End users and technology

In a nutshell, end users do not adopt technology, they choose tools. They find an application they think solves their problem, then buy or install that. They want to keep track of their customers, so they buy a CRM tool. What technology the tool is based on is something they very rarely care about, and rightly so, as it’s the features of the tool itself that generally matters to them.

Thinking about comparable cases may help make this point more clearly. How did relational databases succeed? Not by appealing to end users. When you use an RDBMS-based application today, are users aware what’s under the hood? Very rarely. Similarly with XML. At heart it’s a very simple technology, but even so it was not end users who bought into it, but rather developers, consultants, software vendors, and architects.

If the Semantic Web technologies ever succeed, it will be by appealing to the same groups. Unfortunately, the community is doing a poor job of that now.

Lars describes the developer community as being a hard sell for technology, in part because it is inherently conservative.

But what is the one thing that both users and developers have in common?

Would you say that both users and developers are lazy?

Semantic technologies of all types take more effort, more thinking, than the alternatives. Even rote tagging takes some effort. Useful tagging/annotation takes a good bit more.

Is adoption of semantic technologies, or should I say non-adoption of semantic technologies, another example of Kahneman’s System 1?

You may recall the categories are:

  • System 1: Fast, automatic, frequent, emotional, stereotypic, subconscious
  • System 2: Slow, effortful, infrequent, logical, calculating, conscious

In the book, Thinking, Fast and Slow, Kahneman makes a compelling case that without a lot of effort, we all tend to lapse into System 1.

If that is the case, Lars’s statement:

[Users] find an application they think solves their problem, then buy or install that.

could be pointing us in the right direction.

Users aren’t interested in building a solution (all semantic technologies) but are interested in buying a solution.

By the same token:

Developers aren’t interested in understanding and documenting semantics.

All of which makes me curious:

Why do semantic technology advocates resist producing semantic products of interest to users or developers?

Or is producing a semantic technology easier than producing a semantic product?

October 10, 2013

MarkLogic Rolls Out the Red Carpet for…

Filed under: MarkLogic,RDF,Semantic Web — Patrick Durusau @ 7:09 pm

MarkLogic Rolls Out the Red Carpet for Semantic Triples by Alex Woodie.

From the post:

You write a query with great care, and excitedly hit the “enter” button, only to see a bunch of gobbledygook spit out on the screen. MarkLogic says the chances of this happening will decrease thanks to the new RDF Triple Store feature that it formally introduced today with the launch of version 7 of its eponymous NoSQL database.

The capability to store and search semantic triples in MarkLogic 7 is one of the most compelling new features of the new NoSQL database. The concept of semantic triples is central to the Resource Description Framework (RDF) way of storing and searching for information. Instead of relating information in a database using an “entity-relationship” or “class diagram” model, the RDF framework enables links between pieces of data to be searched using the “subject-predicate-object” concept, which more closely corresponds to the way humans think and communicate.

The real power of this approach becomes evident when one considers the hugely disparate nature of information on the Internet. An RDF powered application can build links between different pieces of data, and effectively “learn” from the connections created by the semantic triples. This is the big (and as yet unrealized) pipe dream of the semantic Web.

RDF has been around for a while, and while you probably wouldn’t call it mainstream, there are a handful of applications using this approach. What makes MarkLogic’s approach unique is that it’s storing the semantic triples–the linked data–right inside the main NoSQL database, where it can make use of all the rich data and metadata stored in documents and other semi-structured files that NoSQL databases like MarkLogic are so good at storing.

This approach puts semantic triples right where it can do the most good. “Until now there has been a disconnect between the incredible potential of semantics and the value organizations have been able to realize,” states MarkLogic’s senior vice president of product strategy, Joe Pasqua.

“Managing triples in dedicated triple stores allowed people to see connections, but the original source of that data was disconnected, ironically losing context,” he continues. “By combining triples with a document store that also has built-in querying and APIs for delivery, organizations gain the insights of triples while connecting the data to end users who can search documents with the context of all the facts at their fingertips.”

A couple of things caught my eye in this post.

First, the comment that:

RDF has been around for a while, and while you probably wouldn’t call it mainstream, there are a handful of applications using this approach.

I can’t disagree so why would MarkLogic make RDF support a major feature of this release?

Second, the next sentence reads:

What makes MarkLogic’s approach unique is that it’s storing the semantic triples–the linked data–right inside the main NoSQL database, where it can make use of all the rich data and metadata stored in documents and other semi-structured files that NoSQL databases like MarkLogic are so good at storing.

I am reading that to mean that if you store all the documents in which triples appear, along with the triples, you have more context. Yes?

Trivially true, but I’m not sure how significant an advantage that would be. Shouldn’t all that “contextual” metadata be included with the triples?

But I haven’t gotten a copy of version 7 so that’s all speculation on my part.

If you have a copy of MarkLogic 7, care to comment?

Thanks!

September 19, 2013

Context Aware Searching

Filed under: Context,RDF,Searching,Semantic Graph,Semantic Web — Patrick Durusau @ 9:53 am

Scaling Up Personalized Query Results for Next Generation of Search Engines

From the post:

North Carolina State University researchers have developed a way for search engines to provide users with more accurate, personalized search results. The challenge in the past has been how to scale this approach up so that it doesn’t consume massive computer resources. Now the researchers have devised a technique for implementing personalized searches that is more than 100 times more efficient than previous approaches.

At issue is how search engines handle complex or confusing queries. For example, if a user is searching for faculty members who do research on financial informatics, that user wants a list of relevant webpages from faculty, not the pages of graduate students mentioning faculty or news stories that use those terms. That’s a complex search.

“Similarly, when searches are ambiguous with multiple possible interpretations, traditional search engines use impersonal techniques. For example, if a user searches for the term ‘jaguar speed,’ the user could be looking for information on the Jaguar supercomputer, the jungle cat or the car,” says Dr. Kemafor Anyanwu, an assistant professor of computer science at NC State and senior author of a paper on the research. “At any given time, the same person may want information on any of those things, so profiling the user isn’t necessarily very helpful.”

Anyanwu’s team has come up with a way to address the personalized search problem by looking at a user’s “ambient query context,” meaning they look at a user’s most recent searches to help interpret the current search. Specifically, they look beyond the words used in a search to associated concepts to determine the context of a search. So, if a user’s previous search contained the word “conservation” it would be associated with concepts like “animals” or “wildlife” and even “zoos.” Then, a subsequent search for “jaguar speed” would push results about the jungle cat higher up in the results — and not the automobile or supercomputer. And the more recently a concept has been associated with a search, the more weight it is given when ranking results of a new search.

I rather like the contrast of ambiguous searches being resolved with “impersonal techniques.”

The paper, Scaling Concurrency of Personalized Semantic Search over Large RDF Data by Haizhou Fu, Hyeongsik Kim, and Kemafor Anyanwu, has this abstract:

Recent keyword search techniques on Semantic Web are moving away from shallow, information retrieval-style approaches that merely find “keyword matches” towards more interpretive approaches that attempt to induce structure from keyword queries. The process of query interpretation is usually guided by structures in data, and schema and is often supported by a graph exploration procedure. However, graph exploration-based interpretive techniques are impractical for multi-tenant scenarios for large database because separate expensive graph exploration states need to be maintained for different user queries. This leads to significant memory overhead in situations of large numbers of concurrent requests. This limitation could negatively impact the possibility of achieving the ultimate goal of personalizing search. In this paper, we propose a lightweight interpretation approach that employs indexing to improve throughput and concurrency with much less memory overhead. It is also more amenable to distributed or partitioned execution. The approach is implemented in a system called “SKI” and an experimental evaluation of SKI’s performance on the DBPedia and Billion Triple Challenge datasets show orders-of-magnitude performance improvement over existing techniques.

If you are interested in scaling issues for topic maps, note the use of indexing as opposed to graph exploration techniques in this paper.

Also consider mining “discovered” contexts that lead to “better” results from the viewpoint of users. Those could be the seeds for serializing those contexts as topic maps.

Perhaps even directly applicable to work by researchers, librarians, intelligence analysts.

Seasoned searchers use richer contexts in searching than the average user, and if those contexts are captured, they could enrich the search contexts of the average user.
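
The press release does not give the algorithm, but the recency-weighted "ambient query context" idea is easy to sketch. A toy version in Python; the concept lists, decay factor and scoring below are my inventions for illustration, not taken from the NC State paper:

```python
# Toy sketch of recency-weighted "ambient query context" re-ranking.
# The concept associations, decay factor and scoring are invented for
# illustration; they are not taken from the NC State paper.

RECENT_SEARCHES = [                      # most recent first
    ["conservation", "wildlife", "animals", "zoos"],
    ["engine tuning", "cars"],
]
DECAY = 0.5                              # weight halves with each older search

def context_weights(history):
    """Give each concept a weight that decays with how long ago it was seen."""
    weights = {}
    for age, concepts in enumerate(history):
        for concept in concepts:
            weights[concept] = max(weights.get(concept, 0.0), DECAY ** age)
    return weights

def rerank(results, history):
    """Boost results whose associated concepts match the ambient context."""
    weights = context_weights(history)
    return sorted(
        results,
        key=lambda r: sum(weights.get(c, 0.0) for c in r["concepts"]),
        reverse=True,
    )

results = [
    {"title": "Jaguar the car", "concepts": ["cars", "speed"]},
    {"title": "Jaguar the cat", "concepts": ["animals", "wildlife", "speed"]},
    {"title": "Jaguar the supercomputer", "concepts": ["computing", "speed"]},
]
print([r["title"] for r in rerank(results, RECENT_SEARCHES)])
# With "conservation" as the most recent context, the jungle cat comes out on top.
```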

August 24, 2013

Missing Layer in the Semantic Web Stack!

Filed under: Humor,Semantic Web — Patrick Durusau @ 2:48 pm

Samantha Bail has discovered a missing layer in the Semantic Web Stack!

Revised Semantic Web Stack

In topic maps we call that semantic diversity. 😉

July 24, 2013

Integrating Linked Data into Discovery [Solr Success for Topic Maps?]

Filed under: Linked Data,Semantic Web,Topic Maps — Patrick Durusau @ 2:06 pm

Integrating Linked Data into Discovery by Götz Hatop.

Abstract:

Although the Linked Data paradigm has evolved from a research idea to a practical approach for publishing structured data on the web, the performance gap between currently available RDF data stores and the somewhat older search technologies could not be closed. The combination of Linked Data with a search engine can help to improve ad-hoc retrieval. This article presents and documents the process of building a search index for the Solr search engine from bibliographic records published as linked open data.

Götz makes an interesting contrast between the Semantic Web and Solr:

In terms of the fast evolving technologies in the web age, the Semantic Web can already be called an old stack. For example, RDF was originally recommended by the W3C on the February 22, 1999. Greenberg [8] points out many similarities between libraries and the Semantic Web: Both have been invented as a response to information abundance, their mission is grounded in service and information access, and libraries and the Semantic Web both benefit from national and international standards. Nevertheless, the technologies defined within the Semantic Web stack are not well established in libraries today, and the Semantic Web community is not fully aware of the skills, talent, and knowledge that catalogers have and which may be of help to advance the Semantic Web.

On the other hand, the Apache Solr [9] search system has taken the library world by storm. From Hathi Trust to small collections, Solr has become the search engine of choice for libraries. It is therefore not surprising, that the VuFind discovery system uses Solr for its purpose, and is not built upon a RDF triple store. Fortunately, the software does not make strong assumptions about the underlying index structure and can coexist with non-MARC data as soon as these data are indexed conforming to the scheme provided by VuFind.

The lack of “…strong assumptions about the underlying index structure…” enables users to choose their own indexing strategies.

That is, an indexing strategy is not forced on all users.

You could just as easily say that no built-in semantics are forced on users by Solr.
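
To make that concrete, here is a rough sketch of the pipeline Götz describes: bibliographic RDF in, flattened Solr documents out. The SPARQL pattern, the Solr core name and the field names are my assumptions, not the article's or VuFind's actual schema:

```python
# Rough sketch: flatten bibliographic RDF into Solr documents.
# The SPARQL pattern, Solr core name ("biblio") and field names are
# assumptions for illustration; the real VuFind schema differs.
import requests
from rdflib import Graph

g = Graph()
g.parse("records.ttl", format="turtle")   # hypothetical linked open data dump

rows = g.query("""
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?record ?title ?creator WHERE {
        ?record dct:title ?title .
        OPTIONAL { ?record dct:creator ?creator }
    }
""")

docs = [
    {"id": str(row.record), "title": str(row.title), "author": str(row.creator or "")}
    for row in rows
]

# Solr's update handler accepts a JSON array of documents.
requests.post(
    "http://localhost:8983/solr/biblio/update?commit=true",
    json=docs,
    timeout=30,
)
```

The point survives the sketch: nothing in Solr cares whether the documents came from MARC, RDF or anything else, so the indexing strategy stays in the hands of the indexer.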

Want Solr success for topic maps?

Free users from built-in semantics. Enable them to use topic maps to map their models, their way.

Or do we fear the semantics of others?

July 8, 2013

Detecting Semantic Overlap and Discovering Precedents…

Detecting Semantic Overlap and Discovering Precedents in the Biodiversity Research Literature by Graeme Hirst, Nadia Talent, and Sara Scharf.

Abstract:

Scientific literature on biodiversity is longevous, but even when legacy publications are available online, researchers often fail to search it adequately or effectively for prior publications; consequently, new research may replicate, or fail to adequately take into account, previously published research. The mechanisms of the Semantic Web and methods developed in contemporary research in natural language processing could be used, in the near-term future, as the basis for a precedent-finding system that would take the text of an author’s early draft (or a submitted manuscript) and find potentially related ideas in published work. Methods would include text-similarity metrics that take different terminologies, synonymy, paraphrase, discourse relations, and structure of argumentation into account.

Footnote one (1) of the paper gives an idea of the problem the authors face:

“Natural history scientists work in fragmented, highly distributed and parochial communities, each with domain specific requirements and methodologies [Scoble 2008]. Their output is heterogeneous, high volume and typically of low impact, but with a citation half-life that may run into centuries” (Smith et al. 2009). “The cited half-life of publications in taxonomy is longer than in any other scientific discipline, and the decay rate is longer than in any scientific discipline” (Moritz 2005). Unfortunately, we have been unable to identify the study that is the basis for Moritz’s remark.

The paper explores in detail issues that have daunted various search techniques, when the material is available in electronic format at all.

The authors make a general proposal for addressing these issues, with mention of the Semantic Web, but omit from their plan:

The other omission is semantic interpretation into a logical form, represented in XML, that draws on ontologies in the style of the original Berners-Lee, Hendler, and Lassila (2001) proposal for the Semantic Web. The problem with logical-form representation is that it implies a degree of precision in meaning that is not appropriate for the kind of matching we are proposing here. This is not to say that logical forms would be useless. On the contrary, they are employed by some approaches to paraphrase and textual entailment (section 4.1 above) and hence might appear in the system if only for that reason; but even so, they would form only one component of a broader and somewhat looser kind of semantic representation.

That’s the problem with the Semantic Web in a nutshell:

The problem with logical-form representation is that it implies a degree of precision in meaning that is not appropriate for the kind of matching we are proposing here.

What if I want to be logically precise sometimes but not others?

What if I want to be more precise in some places and less precise in others?

What if I want to have different degrees or types of imprecision?

With topic maps the question is: How im/precise do you want to be?

June 25, 2013

The Problem with RDF and Nuclear Power

Filed under: Marketing,RDF,Semantic Web — Patrick Durusau @ 3:09 pm

The Problem with RDF and Nuclear Power by Manu Sporny.

Manu starts his post:

Full disclosure: I am the chair of the RDFa Working Group, the JSON-LD Community Group, a member of the RDF Working Group, as well as other Semantic Web initiatives. I believe in this stuff, but am critical about the path we’ve been taking for a while now.

(…)

RDF shares a number of these similarities with nuclear power. RDF is one of the best data modeling mechanisms that humanity has created. Looking into the future, there is no equally-powerful, viable alternative. So, why has progress been slow on this very exciting technology? There was no public mis-information campaign, so where did this negative view of RDF come from?

In short, RDF/XML was the Semantic Web’s 3 Mile Island incident. When it was released, developers confused RDF/XML (bad) with the RDF data model (good). There weren’t enough people and time to counter-act the negative press that RDF was receiving as a result of RDF/XML and thus, we are where we are today because of this negative perception of RDF. Even Wikipedia’s page on the matter seems to imply that RDF/XML is RDF. Some purveyors of RDF think that the public perception problem isn’t that bad. I think that when developers hear RDF, they think: “Not in my back yard”.

The solution to this predicament: Stop mentioning RDF and the Semantic Web. Focus on tools for developers. Do more dogfooding.

Over the years I have become more and more agnostic towards data models.

The real question for any data model is whether it fits your requirements. What other test would you have?

When merging data held in different data models, or in data models that don’t recognize the same subject identified differently, subject identity and its management come into play.

Subject identity and its management not being an area that has only one answer for any particular problem.

Manu does have concrete suggestions, which apply equally to advancing topic maps, either as a practice of subject identity or as a particular data model:

  1. The message shouldn’t be about the technology. It should be about the problems we have today and a concrete solution on how to address those problems.
  2. Demonstrate real value. Stop talking about the beauty of RDF, theoretical value, or design. Deliver production-ready, open-source software tools.
  3. Build a network of believers by spending more of your time working with Web developers and open-source projects to convince them to publish Linked Data. Dogfood our work.

A topic map version of those suggestions:

  1. The message shouldn’t be about the technology. It should be about the problems we have today and a concrete solution on how to address those problems.
  2. Demonstrate real value. Stop talking about the beauty of topic maps, theoretical value, or design. Deliver high quality content from merging diverse data sources. (Tools will take care of themselves if the content is valuable enough.)
  3. Build a network of customers by spending more of your time using topic maps to distinguish your content from content from the average web sewer.

As an information theorist I should be preaching to myself. Yes?

😉

As the semantic impedance of the “Semantic Web,” “big data,” and the “NSA Data Cloud” increases, the opportunities for competitive, military, and industrial advantage from reliable semantic integration will increase.

Looking for showcase opportunities.

Suggestions?

June 13, 2013

Visual Data Web

Filed under: Graphics,Semantic Web,Visualization — Patrick Durusau @ 12:37 pm

Visual Data Web

From the website:

This website provides an overview of our attempts to a more visual Data Web.

The term Data Web refers to the evolution of a mainly document-centric Web toward a more data-oriented Web. In its narrow sense, the term describes pragmatic approaches of the Semantic Web, such as RDF and Linked Data. In a broader sense, it also includes less formal data structures, such as microformats, microdata, tagging, and folksonomies.

The term Visual Data Web reflects our goal of making the Data Web visually more experienceable, also for average Web users with little to no knowledge about the underlying technologies. This website presents developments, related publications, and current activities to generate new ideas, methods, and tools that help making the Data Web easier accessible, more visible, and thus more attractive.

The recent NSA scandal underlined the smallness of “web scale.” The NSA data was orders of magnitude greater than “web scale.”

Still, experimenting with visualization, even on “web scale” data, may lead to important lessons on visualization.

I first saw this in a tweet by Stian Danenbarger.

June 11, 2013

Rya: A Scalable RDF Triple Store for the Clouds

Filed under: Cloud Computing,RDF,Rya,Semantic Web — Patrick Durusau @ 2:06 pm

Rya: A Scalable RDF Triple Store for the Clouds by Roshan Punnoose, Adina Crainiceanu, and David Rapp.

Abstract:

Resource Description Framework (RDF) was designed with the initial goal of developing metadata for the Internet. While the Internet is a conglomeration of many interconnected networks and computers, most of today’s best RDF storage solutions are confined to a single node. Working on a single node has significant scalability issues, especially considering the magnitude of modern day data. In this paper we introduce a scalable RDF data management system that uses Accumulo, a Google Bigtable variant. We introduce storage methods, indexing schemes, and query processing techniques that scale to billions of triples across multiple nodes, while providing fast and easy access to the data through conventional query mechanisms such as SPARQL. Our performance evaluation shows that in most cases, our system outperforms existing distributed RDF solutions, even systems much more complex than ours.

Based on Accumulo (open-source NoSQL database by the NSA).

Interesting re-thinking of indexing of triples.
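
If you have not read the paper, the re-thinking amounts (roughly) to storing each triple under several orderings so that any bound/unbound query pattern becomes a range scan. A toy, in-memory Python sketch of that idea, not Rya's actual Accumulo code:

```python
# Toy, in-memory sketch of permutation-based triple indexing in the spirit
# of Rya. This is plain Python, not Rya's Accumulo implementation.
from collections import defaultdict

class TripleIndex:
    def __init__(self):
        # One index per useful permutation of (subject, predicate, object).
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def match(self, s=None, p=None, o=None):
        """Answer a triple pattern from whichever index has its prefix bound."""
        if s is not None and p is not None:
            return {(s, p, obj) for obj in self.spo[s][p] if o in (None, obj)}
        if p is not None and o is not None:
            return {(subj, p, o) for subj in self.pos[p][o]}
        if o is not None and s is not None:
            return {(s, pred, o) for pred in self.osp[o][s]}
        # Remaining patterns fall back to a scan over the SPO index.
        return {(subj, pred, obj)
                for subj, preds in self.spo.items()
                for pred, objs in preds.items()
                for obj in objs
                if s in (None, subj) and p in (None, pred) and o in (None, obj)}

idx = TripleIndex()
idx.add("ex:alice", "foaf:knows", "ex:bob")
idx.add("ex:alice", "foaf:name", '"Alice"')
print(idx.match(s="ex:alice", p="foaf:knows"))   # served from the SPO index
print(idx.match(p="foaf:knows", o="ex:bob"))     # served from the POS index
```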

Future work includes owl:sameAs, owl:inverseOf and other inferencing rules.

Certainly a project to watch.

June 3, 2013

Coming soon: new, expanded edition of “Learning SPARQL”

Filed under: RDF,Semantic Web,SPARQL — Patrick Durusau @ 9:49 am

Coming soon: new, expanded edition of “Learning SPARQL” by Bob DuCharme.

From the post:

55% more pages! 23% fewer mentions of the semantic web!


I’m very pleased to announce that O’Reilly will make the second, expanded edition of my book Learning SPARQL available sometime in late June or early July. The early release “raw and unedited” version should be available this week.

I wonder if Bob is going to start an advertising trend with “fewer mentions of the semantic web?”

😉

Looking forward to the update!

Not that I care about SPARQL all that much but I’ll learn something.

Besides, I have liked Bob’s writing style from back in the SGML days.

June 1, 2013

4Store

Filed under: 4store,RDF,Semantic Web — Patrick Durusau @ 4:00 pm

4Store

From the about page:

4store is a database storage and query engine that holds RDF data. It has been used by Garlik as their primary RDF platform for three years, and has proved itself to be robust and secure.

4store’s main strengths are its performance, scalability and stability. It does not provide many features over and above RDF storage and SPARQL queries, but if you are looking for a scalable, secure, fast and efficient RDF store, then 4store should be on your shortlist.

This was mentioned in a report by Bryan Thompson so I wanted to capture a link to the original site.

The latest tarball is dated 10 July 2012.

May 21, 2013

Beyond Enterprise Search…

Filed under: Linked Data,MarkLogic,Searching,Semantic Web — Patrick Durusau @ 2:49 pm

Beyond Enterprise Search… by adamfowleruk.

From the post:

Searching through all your content is fine – until you get a mountain of it with similar content, differentiated only by context. Then you’ll need to understand the meaning within the content. In this post I discuss how to do this using semantic techniques…

Organisations today have realised that for certain applications it is useful to have a consolidated search approach over several catalogues. This is most often the case when customers can interact with several parts of the company – sales, billing, service, delivery, fraud checks.

This approach is commonly called Enterprise Search, or Search and Discovery, which is where your content across several repositories is indexed in a separate search engine. Typically this indexing occurs some time after the content is added. In addition, it is not possible for a search engine to understand the full capabilities of every content system. This means complex mappings are needed between content, meta data and security. In some cases, this may be retrofitted with custom code as the systems do not support a common vocabulary around these aspects of information management.

Content Search

We are all used to content search, so much so that for today’s teenagers a search bar with a common (‘Google like’) grammar is expected. This simple yet powerful interface allows us to search for content (typically web pages and documents) that contain all the words or phrases that we need. Often this is broadened by the use of a thesaurus and word stemming (plays and played stems to the verb play), and combined with some form of weighting based on relative frequency within each unit of content.

Other techniques are also applied. Metadata is extracted or implied – author, date created, modified, security classification, Dublin Core descriptive data. Classification tools can be used (either at the content store or search indexing stages) to perform entity extraction (Cheese is a food stuff) and enrichment (Sheffield is a place with these geospatial co-ordinates). This provides a greater level of description of the term being searched for over and above simple word terms.

Using these techniques, additional search functionality can be provided. Search for all shops visible on a map using a bounding box, radius or polygon geospatial search. Return only documents where these words are within 6 words of each other. Perhaps weight some terms as more important than others, or optional.

These techniques are provided by many of the Enterprise class search engines out there today. Even Open Source tools like Lucene and Solr are catching up with this. They have provided access to information where before we had to rely on Information and Library Services staff to correctly classify incoming documents manually, as they did back in the paper bound days of yore.

Content search only gets you so far though.
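
Before getting to the semantic pitch, it is worth noting that a couple of the techniques Adam lists (proximity and term weighting) are ordinary query syntax in Lucene/Solr. A quick sketch against a local Solr core, where the core name ("docs") and the field list are my assumptions:

```python
# Sketch of proximity and term-boost queries against a local Solr core.
# The core name ("docs") and the field list are assumptions for illustration.
import requests

params = {
    # Words within 6 positions of each other, with "faculty" boosted over other terms.
    "q": '"financial informatics"~6 AND faculty^2',
    "fl": "id,title,score",
    "rows": 10,
    "wt": "json",
}
response = requests.get("http://localhost:8983/solr/docs/select", params=params, timeout=30)
for doc in response.json()["response"]["docs"]:
    print(doc.get("title"), doc.get("score"))
```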

I was amening with the best of them until Adam reached the part about MarkLogic 7 going to add Semantic Web capabilities. 😉

I didn’t see any mention of linked data replicating the semantic diversity that currently exists in data stores.

Making data more accessible isn’t going to make it less diverse.

Although making data more accessible may drive the development of ways to manage semantic diversity.

So perhaps there is a useful side to linked data after all.

May 18, 2013

A Trillion Triples in Perspective

Filed under: BigData,Music,RDF,Semantic Web — Patrick Durusau @ 10:11 am

Mozart Meets MapReduce by Isaac Lopez.

From the post:

Big data has been around since the beginning of time, says Thomas Paulmichl, founder and CEO of Sigmaspecto, who says that what has changed is how we process the information. In a talk during Big Data Week, Paulmichl encouraged people to open up their perspective on what big data is, and how it can be applied.

During the talk, he admonished people to take a human element into big data. Paulmichl demonstrated this by examining the work of musical prodigy, Mozart – who Paulmichl noted is appreciated greatly by both music scientists, as well as the common music listener.

“When Mozart makes choices on writing a piece of work, the number of choices that he has and the kind of neural algorithms that his brain goes through to choose things is infinitesimally higher that what we call big data – it’s really small data in comparison,” he said.

Taking Mozart’s The Magic Flute as an example, Paulmichl, discussed the framework that Mozart used to make his choices by examining a music sheet outlining the number of bars, the time signature, the instrument and singer voicing.

“So from his perspective, he sits down, and starts to make what we as data scientists call quantitative choices,” explained Paulmichl. “Do I put a note here, down here, do I use a different instrument; do I use a parallel voicing for different violins – so these are all metrics that his brain has to decide.”

Exploring the mathematics of the music, Paulmichl concluded that in looking at The Magic Flute, Mozart had 4.72391E+21 creative variations (and then some) that he could have taken with the direction of it over the course of the piece. “We’re not talking about a trillion dataset; we’re talking about a sextillion or more,” he says adding that this is a very limited cut of the quantitative choice that his brain makes at every composition point.

“[A] sextillion or more…” puts the question of processing a trillion triples into perspective.

Another musical analogy?

Triples are the one finger version of Jingle Bells* (video omitted).

*The gap is greater than the video represents but it is still amusing.

Does your analysis/data have one finger subtlety?

May 17, 2013

A self-updating road map of The Cancer Genome Atlas

Filed under: Bioinformatics,Biology,Biomedical,Medical Informatics,RDF,Semantic Web,SPARQL — Patrick Durusau @ 4:33 pm

A self-updating road map of The Cancer Genome Atlas by David E. Robbins, Alexander Grüneberg, Helena F. Deus, Murat M. Tanik and Jonas S. Almeida. (Bioinformatics (2013) 29 (10): 1333-1340. doi: 10.1093/bioinformatics/btt141)

Abstract:

Motivation: Since 2011, The Cancer Genome Atlas’ (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months.

Results: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0). Specifically, this engine uses JavaScript in conjunction with the World Wide Web Consortium’s (W3C) Resource Description Framework (RDF), and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory. The resulting index may be queried using SPARQL, and enables file-level provenance annotations as well as discovery of arbitrary subsets of files, based on their metadata, using web standard languages. In turn, these abilities enhance the reproducibility and distribution of novel results delivered as elements of a web-based computational ecosystem. The development of the TCGA Roadmap engine was found to provide specific clues about how biomedical big data initiatives should be exposed as public resources for exploratory analysis, data mining and reproducible research. These specific design elements align with the concept of knowledge reengineering and represent a sharp departure from top-down approaches in grid initiatives such as CaBIG. They also present a much more interoperable and reproducible alternative to the still pervasive use of data portals.

Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at http://bit.ly/TCGARoadmap. A video tutorial is available at http://bit.ly/TCGARoadmapTutorial.
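
If you have not hit a SPARQL endpoint from code before, the pattern is short. A sketch with Python's SPARQLWrapper; the endpoint URL and the query are placeholders of mine, not the project's actual endpoint (which sits behind the shortened link above):

```python
# Minimal sketch of querying a SPARQL endpoint such as the TCGA Roadmap's.
# The endpoint URL and the query are placeholders for illustration only.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/tcga/sparql")   # hypothetical endpoint
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    SELECT ?file ?property ?value WHERE {
        ?file ?property ?value .
    } LIMIT 10
""")

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["file"]["value"], binding["property"]["value"])
```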

Curious how the granularity of required semantics and the uniformity of the underlying data set impact the choice of semantic approaches?

And does access to data files present different challenges than, say, access to research publications in the same field?

May 13, 2013

Putting Linked Data on the Map

Filed under: Linked Data,Semantic Web — Patrick Durusau @ 2:32 pm

Putting Linked Data on the Map by Richard Wallis.

In fairness to Linked Data/Semantic Web, I really should mention this post by one of its more mainstream advocates:

Show me an example of the effective publishing of Linked Data – That, or a variation of it, must be the request I receive more than most when talking to those considering making their own resources available as Linked Data, either in their enterprise, or on the wider web.

There are some obvious candidates. The BBC for instance, makes significant use of Linked Data within its enterprise. They built their fantastic Olympics 2012 online coverage on an infrastructure with Linked Data at its core. Unfortunately, apart from a few exceptions such as Wildlife and Programmes, we only see the results in a powerful web presence. The published data is only visible within their enterprise.

Dbpedia is another excellent candidate. From about 2007 it has been a clear demonstration of Tim Berners-Lee’s principles of using URIs as identifiers and providing information, including links to other things, in RDF – it is just there at the end of the dbpedia URIs. But for some reason developers don’t seem to see it as a compelling example. Maybe it is influenced by the Wikipedia effect – interesting but built by open data geeks, so not to be taken seriously.

A third example, which I want to focus on here, is Ordnance Survey. Not generally known much beyond the geographical patch they cover, Ordnance Survey is the official mapping agency for Great Britain. Formally a government agency, they are best known for their incredibly detailed and accurate maps that are the standard accessory for anyone doing anything in the British countryside. A little less known is that they also publish information about post-code areas, parish/town/city/county boundaries, parliamentary constituency areas, and even European regions in Britain. As you can imagine, these all don’t neatly intersect, which makes the data about them a great case for a graph based data model and hence for publishing as Linked Data. Which is what they did a couple of years ago.

The reason I want to focus on their efforts now, is that they have recently beta released a new API suite, which I will come to in a moment. But first I must emphasise something that is often missed.

Linked Data is just there – without the need for an API the raw data (described in RDF) is ‘just there to consume’. With only standard [http] web protocols, you can get the data for an entity in their dataset by just doing a http GET request on the identifier…

(images omitted)
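
The "just there to consume" point is easy to demonstrate from code: an HTTP GET with an RDF Accept header is the whole client. A minimal sketch with requests and rdflib; treat the identifier below as illustrative rather than a guaranteed live Ordnance Survey URI:

```python
# Minimal sketch of consuming Linked Data "as is": HTTP GET plus content negotiation.
# The URI below is illustrative; substitute any Linked Data identifier you care about.
import requests
from rdflib import Graph

uri = "http://data.ordnancesurvey.co.uk/id/7000000000037256"
response = requests.get(uri, headers={"Accept": "text/turtle"}, timeout=30)
response.raise_for_status()

g = Graph()
g.parse(data=response.text, format="turtle")

for subject, predicate, obj in g:
    print(subject, predicate, obj)
```

Consumption really is that simple; as I say below, it is the authoring side that isn't.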

Richard does a great job describing the Linked Data APIs from the Ordnance Survey.

My only quibble is with his point:

Linked Data is just there – without the need for an API the raw data (described in RDF) is ‘just there to consume’.

True enough but it omits the authoring side of Linked Data.

Or understanding the data to be consumed.

With HTML, authoring hyperlinks was only marginally more difficult than “using” hyperlinks.

And the consumption of a hyperlink, beyond mime types, was unconstrained.

So linked data isn’t “just there.”

It’s there with an authoring burden that remains unresolved and that constrains consumption, should you decide to follow “standard [http] web protocols” and Linked Data.

I am sure the Ordnance Survey Linked Data and other Linked Data resources Richard mentions will be very useful, to some people in some contexts.

But pretending Linked Data is easier than it is, will not lead to improved Linked Data or other semantic solutions.

Non-Adoption of Semantic Web, Reason #1002

Filed under: RDF,Semantic Web,Virtuoso — Patrick Durusau @ 8:22 am

Kingsley Idehen offers yet another explanation/excuse for non-adoption of the semantic web in On Hybrid Relational Databases. Interview with Kingsley Uyi Idehen by Roberto V. Zicari.

The highlight of this interview reads:

The only obstacle to Semantic Web technologies in the enterprise lies in better articulation of the value proposition in a manner that reflects the concerns of enterprises. For instance, the non disruptive nature of Semantic Web technologies with regards to all enterprise data integration and virtualization initiatives has to be the focal point

You may recall Kingsley’s demonstration of the non-complexity of authoring for the Semantic Web in The Semantic Web Is Failing — But Why? (Part 3).

Could it be users sense the “lock-in” of RDF/Semantic Web?

Q14. Big Data Analysis: could you connect Virtuoso with Hadoop? How does Virtuoso relate to commercial data analytics platforms, e.g. Hadapt, Vertica?

Kingsley Uyi Idehen: You can integrate data managed by Hadoop based ETL workflows via ODBC or Web Services driven by Hadoop clusters that expose RESTful interaction patterns for data access. As for how Virtuoso relates to the likes of Vertica re. analytics, this is about Virtuoso being the equivalent of Vertica plus the added capability of RDF based data management, Linked Data Deployment, and share-nothing clustering. There is no job that Vertica performs that Virtuoso can’t perform.

There are several jobs that Virtuoso can perform that Vertica, VoltDB, Hadapt, and many other NoSQL and NewSQL simply cannot perform with regards to scalable, high-performance RDF data management and Linked Data deployment. Remember, RDF based Linked Data is all about data management and data access without any kind of platform lock-in. Virtuoso locks you into a value proposition (performance and scale) not the platform itself. (emphasis added to last sentence)

It’s comforting to know RDF/Semantic Web “lock-in” has our best interest at heart.

See Kingsley dodging the next question on Virtuoso’s ability to scale:

Q15. Do you also benchmark loading trillion of RDF triples? Do you have current benchmark results? How much time does it take to querying them?

Kingsley Uyi Idehen: As per my earlier responses, there is no shortage of benchmark material for Virtuoso.

The benchmarks are also based on realistic platform configurations unlike the RDBMS patterns of the past which compromised the utility of TPC benchmarks.

Full Disclosure: I haven’t actually counted all of Kingsley’s reasons for non-adoption of the Semantic Web. The number I assign here may be high or low.

May 9, 2013

The ChEMBL database as linked open data

Filed under: Cheminformatics,Linked Data,RDF,Semantic Web — Patrick Durusau @ 1:51 pm

The ChEMBL database as linked open data by Egon L Willighagen, Andra Waagmeester, Ola Spjuth, Peter Ansell, Antony J Williams, Valery Tkachenko, Janna Hastings, Bin Chen and David J Wild. (Journal of Cheminformatics 2013, 5:23 doi:10.1186/1758-2946-5-23).

Abstract:

Background Making data available as Linked Data using Resource Description Framework (RDF) promotes integration with other web resources. RDF documents can natively link to related data, and others can link back using Uniform Resource Identifiers (URIs). RDF makes the data machine-readable and uses extensible vocabularies for additional information, making it easier to scale up inference and data analysis.

Results This paper describes recent developments in an ongoing project converting data from the ChEMBL database into RDF triples. Relative to earlier versions, this updated version of ChEMBL-RDF uses recently introduced ontologies, including CHEMINF and CiTO; exposes more information from the database; and is now available as dereferencable, linked data. To demonstrate these new features, we present novel use cases showing further integration with other web resources, including Bio2RDF, Chem2Bio2RDF, and ChemSpider, and showing the use of standard ontologies for querying.

Conclusions We have illustrated the advantages of using open standards and ontologies to link the ChEMBL database to other databases. Using those links and the knowledge encoded in standards and ontologies, the ChEMBL-RDF resource creates a foundation for integrated semantic web cheminformatics applications, such as the presented decision support.

You already know about the fragility of ontologies so no need to repeat that rant here.

Material encoded with an ontology, on the other hand, once vetted, can be a source that you wrap with a topic map.

So all that effort isn’t lost.

May 7, 2013

VIVO (Update)

Filed under: Semantic Web,VIVO — Patrick Durusau @ 2:43 pm

Seeing the webinar on VIVO, Semantic Mashups Across Large, Heterogeneous Institutions:… reminded me it had been more than a year since I had visited the VIVO site (VIVO – An interdisciplinary national network).

I visited VIVO today and they appear to have the same seven (7) institutional sponsors as they did in March of 2012.

I did not find a list of other institutions that have downloaded and installed VIVO.

The University of Pennsylvania Veterinary Library has, but I did not see any others.

Does anyone else have a feel for the use of this project?

Thanks!

Semantic Mashups Across Large, Heterogeneous Institutions:…

Filed under: Semantic Web,VIVO — Patrick Durusau @ 2:33 pm

Semantic Mashups Across Large, Heterogeneous Institutions: Experiences from the VIVO Service

May 22, 2013
1:00 – 2:30 p.m. (Eastern Time)

About the webinar:

VIVO is a semantic web application focused on discovering researchers and research publications in the life sciences. The service, which uses open-source software originally developed and implemented at Cornell University, operates by harvesting data about researcher interests, activities, and accomplishments from academic, administrative, professional, and funding sources. Using a built-in, editable ontology for describing things such as People, Courses, and Publications, data is transformed into a Semantic-Web-compliant form. VIVO provides automated and self-updating processes for improving data quality and authenticity. Starting with a classic Google-style search box, VIVO users can browse search results structured around people, research interests, courses, publications, and the like — data that can be exposed for re-use by other systems in a machine-readable format.

This webinar, held by a veteran at the Albert R. Mann Library Information Technology Services department at Cornell, where the VIVO project was born, presents the perspective of a software developer on the practicalities of building a high-quality Semantic-Web search service on existing data maintained in dozens of formats and software platforms at large, diverse institutions. The talk will highlight services that leverage the Semantic Web platform in innovative ways, e.g., for finding researchers based on the text content of a particular Web page and for visualizing networks of collaboration across institutions.

The webinar is hosted by NISO, so you may want to check the registration fees before deciding to attend. Or have a member of your department attend and share what they learn.

May 6, 2013

FuturICT:… [No Semantics?]

Filed under: BigData,Collaboration,Semantic Diversity,Semantic Web — Patrick Durusau @ 10:28 am

FuturICT: Participatory Computing for Our Complex World

From the FuturICT FET Flagship Project Summary:

FuturICT is a visionary project that will deliver new science and technology to explore, understand and manage our connected world. This will inspire new information and communication technologies (ICT) that are socially adaptive and socially interactive, supporting collective awareness.

Revealing the hidden laws and processes underlying our complex, global, socially interactive systems constitutes one of the most pressing scientific challenges of the 21st Century. Integrating complexity science with ICT and the social sciences, will allow us to design novel robust, trustworthy and adaptive technologies based on socially inspired paradigms. Data from a variety of sources will help us to develop models of techno-socioeconomic systems. In turn, insights from these models will inspire a new generation of socially adaptive, self-organised ICT systems. This will create a paradigm shift and facilitate a symbiotic co-evolution of ICT and society. In response to the European Commission’s call for a ‘Big Science’ project, FuturICT will build a largescale, pan European, integrated programme of research which will extend for 10 years and beyond.

Did you know that the term “semantic” appears only twice in the FuturICT Project Outline? And both times as part of the phrase “semantic web”?

Not a word of how models, data sources, paradigms, etc., with different semantics are going to be wedded into a coherent whole.

View it as an opportunity to deliver FuturICT results using topic maps beyond this project.

April 28, 2013

Agile Knowledge Engineering and Semantic Web (AKSW)

Filed under: RDFa,Semantic Web — Patrick Durusau @ 3:28 pm

Agile Knowledge Engineering and Semantic Web (AKSW)

From the webpage:

The Research Group Agile Knowledge Engineering and Semantic Web (AKSW) is hosted by the Chair of Business Information Systems (BIS) of the Institute of Computer Science (IfI) / University of Leipzig as well as the Institute for Applied Informatics (InfAI).

Goals

  • Development of methods, tools and applications for adaptive Knowledge Engineering in the context of the Semantic Web
  • Research of underlying Semantic Web technologies and development of fundamental Semantic Web tools and applications
  • Maturation of strategies for fruitfully combining the Social Web paradigms with semantic knowledge representation techniques

AKSW is committed to the free software, open source, open access and open knowledge movements.

Complete listing of projects.

I have mentioned several of these projects before. On seeing a reminder of the latest release of RDFaCE (RDFa Content Editor), I thought I should post on the common source of those projects.

April 27, 2013

The Motherlode of Semantics, People

Filed under: Conferences,Crowd Sourcing,Semantic Web,Semantics — Patrick Durusau @ 8:08 am

1st International Workshop on “Crowdsourcing the Semantic Web” (CrowdSem2013)

Submission deadline: July 12, 2013 (23:59 Hawaii time)

From the post:

1st International Workshop on “Crowdsourcing the Semantic Web” in conjunction with the 12th International Semantic Web Conference (ISWC 2013), 21-25 October 2013, in Sydney, Australia. This interactive workshop takes stock of the emergent work and charts the research agenda with interactive sessions to brainstorm ideas and potential applications of collective intelligence to solving AI hard semantic web problems.

The Global Brain Semantic Web—a Semantic Web interleaving a large number of human and machine computation—has great potential to overcome some of the issues of the current Semantic Web. In particular, semantic technologies have been deployed in the context of a wide range of information management tasks in scenarios that are increasingly significant in both technical (data size, variety and complexity of data sources) and economical terms (industries addressed and their market volume). For many of these tasks, machine-driven algorithmic techniques aiming at full automation do not reach a level of accuracy that many production environments require. Enhancing automatic techniques with human computation capabilities is becoming a viable solution in many cases. We believe that there is huge potential at the intersection of these disciplines – large scale, knowledge-driven, information management and crowdsourcing – to solve technically challenging problems purposefully and in a cost effective manner.

I’m encouraged.

The Semantic Web is going to start asking the entities (people) that originate semantics about semantics.

Going to the motherlode of semantics.

Now to see what they do with the answers.

April 24, 2013

Brain: … [Topic Naming Constraint Reappears]

Filed under: Bioinformatics,OWL,Semantic Web — Patrick Durusau @ 1:41 pm

Brain: biomedical knowledge manipulation by Samuel Croset, John P. Overington and Dietrich Rebholz-Schuhmann. (Bioinformatics (2013) 29 (9): 1238-1239. doi: 10.1093/bioinformatics/btt109)

Abstract:

Summary: Brain is a Java software library facilitating the manipulation and creation of ontologies and knowledge bases represented with the Web Ontology Language (OWL).

Availability and implementation: The Java source code and the library are freely available at https://github.com/loopasam/Brain and on the Maven Central repository (GroupId: uk.ac.ebi.brain). The documentation is available at https://github.com/loopasam/Brain/wiki.

Contact: croset@ebi.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Odd how things like the topic naming constraint show up in unexpected contexts. 😉

This article may be helpful if you are required to create or read OWL based data.

But as I read the article I saw:

The names (short forms) of OWL entities handled by a Brain object have to be unique. It is for instance not possible to add an OWL class, such as http://www.example.org/Cell to the ontology if an OWL entity with the short form ‘Cell’ already exists.

The explanation?

Despite being in contradiction with some Semantic Web principles, this design prevents ambiguous queries and hides as much as possible the cumbersome interaction with prefixes and Internationalized Resource Identifiers (IRI).
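
Concretely, the constraint amounts to a registry from short form to IRI that refuses duplicates. A few lines of plain Python (not Brain's actual Java API) make the trade-off visible:

```python
# Plain-Python illustration of a "unique short form" registry, in the spirit
# of the constraint described above; this is not Brain's actual API.
class ShortFormRegistry:
    def __init__(self):
        self._by_short_form = {}

    def add(self, iri):
        short_form = iri.rstrip("/").rsplit("/", 1)[-1]   # crude short-form extraction
        if short_form in self._by_short_form:
            raise ValueError(
                f"Short form {short_form!r} is already bound to "
                f"{self._by_short_form[short_form]!r}"
            )
        self._by_short_form[short_form] = iri
        return short_form

registry = ShortFormRegistry()
registry.add("http://www.example.org/Cell")
try:
    registry.add("http://another.example.org/Cell")
except ValueError as err:
    print(err)   # the second 'Cell' is refused: unambiguous, but only by exclusion
```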

I suppose, but doesn’t ambiguity exist in the mind of the user? That is, they use a term that can have more than one meaning?

Having unique terms simply means inventing odd terms that no user will know.

Rather than unambiguous, isn’t that unfound?

April 21, 2013

Collaborative annotation… [Human + Machine != Semantic Monotony]

Collaborative annotation for scientific data discovery and reuse by Kirk Borne. (Borne, K. (2013), Collaborative annotation for scientific data discovery and reuse. Bul. Am. Soc. Info. Sci. Tech., 39: 44–45. doi: 10.1002/bult.2013.1720390414)

Abstract:

Human classification alone, unable to handle the enormous quantity of project data, requires the support of automated machine-based strategies. In collaborative annotation, humans and machines work together, merging editorial strengths in semantics and pattern recognition with the machine strengths of scale and algorithmic power. Discovery informatics can be used to generate common data models, taxonomies and ontologies. A proposed project of massive scale, the Large Synoptic Survey Telescope (LSST) project, will systematically observe the southern sky over 10 years, collecting petabytes of data for analysis. The combined work of professional and citizen scientists will be needed to tag the discovered astronomical objects. The tag set will be generated through informatics and the collaborative annotation efforts of humans and machines. The LSST project will demonstrate the development and application of a classification scheme that supports search, curation and reuse of a digital repository.

A persuasive call to arms to develop “collaborative annotation:”

Humans and machines working together to produce the best possible classification label(s) is collaborative annotation. Collaborative annotation is a form of human computation [1]. Humans can see patterns and semantics (context, content and relationships) more quickly, accurately and meaningfully than machines. Human computation therefore applies to the problem of annotating, labeling and classifying voluminous data streams.

And more specifically for the Large Synoptic Survey Telescope (LSST):

The discovery potential of this data collection would be enormous, and its long-term value (through careful data management and curation) would thus require (for maximum scientific return) the participation of scientists and citizen scientists as well as science educators and their students in a collaborative knowledge mark-up (annotation and tagging) data environment. To meet this need, we envision a collaborative tagging system called AstroDAS (Astronomy Distributed Annotation System). AstroDAS is similar to existing science knowledge bases, such as BioDAS (Biology Distributed Annotation System, www.biodas.org).

As you might expect, semantic diversity is going to be present with “collaborative annotation.”

Semantic Monotony (aka Semantic Web) has failed for machines alone.

No question it will fail for humans + machines.

Are you ready to step up to the semantic diversity of collaborative annotation (humans + machines)?
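
Purely as a sketch (invented names and tags, not AstroDAS code), preserving semantic diversity in collaborative annotation can be as simple as keeping every contributor’s vocabulary, with its provenance, instead of collapsing everything into one “correct” label:

import java.util.*;

public class CollaborativeAnnotationSketch {
    public static void main(String[] args) {
        // Hypothetical tags for a single observed object, keyed by who supplied them
        Map<String, Set<String>> tagsBySource = new LinkedHashMap<>();
        tagsBySource.put("pipeline-classifier", Set.of("transient", "sn-candidate"));
        tagsBySource.put("citizen-scientist",   Set.of("supernova?", "bright flash"));
        tagsBySource.put("astronomer",          Set.of("SN Ia", "needs spectrum"));

        // Every vocabulary survives, and you can still see who said what
        tagsBySource.forEach((source, tags) -> System.out.println(source + " -> " + tags));
    }
}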

March 19, 2013

Open Annotation Data Model

Filed under: Annotation,RDF,Semantic Web,W3C — Patrick Durusau @ 10:34 am

Open Annotation Data Model

Abstract:

The Open Annotation Core Data Model specifies an interoperable framework for creating associations between related resources, annotations, using a methodology that conforms to the Architecture of the World Wide Web. Open Annotations can easily be shared between platforms, with sufficient richness of expression to satisfy complex requirements while remaining simple enough to also allow for the most common use cases, such as attaching a piece of text to a single web resource.

An Annotation is considered to be a set of connected resources, typically including a body and target, where the body is somehow about the target. The full model supports additional functionality, enabling semantic annotations, embedding content, selecting segments of resources, choosing the appropriate representation of a resource and providing styling hints for consuming clients.

This is my first encounter with this proposal, so I need to compare it to my Simple Web Semantics.

At first blush, the Open Annotation Core Model looks a lot heavier than Simple Web Semantics.

I need to rework my blog posts into a formal document and perhaps attach a comparison as an annex.
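
For concreteness, here is a minimal body/target annotation in the oa: vocabulary, sketched with Apache Jena (the resource URIs are placeholders):

import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.RDF;

public class OpenAnnotationSketch {
    public static void main(String[] args) {
        String OA = "http://www.w3.org/ns/oa#";
        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("oa", OA);

        // Placeholder URIs: a comment (body) about a web page (target)
        Resource body = m.createResource("http://example.org/comment1");
        Resource target = m.createResource("http://example.org/page1");

        m.createResource("http://example.org/anno1")
                .addProperty(RDF.type, m.createResource(OA + "Annotation"))
                .addProperty(m.createProperty(OA, "hasBody"), body)
                .addProperty(m.createProperty(OA, "hasTarget"), target);

        m.write(System.out, "TURTLE");
    }
}

Even this skeleton hints at why it feels heavier than Simple Web Semantics: three resources and a typed node before any content appears.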

March 18, 2013

Semantic Search Over The Web (SSW 2013)

Filed under: Conferences,RDF,Semantic Diversity,Semantic Graph,Semantic Search,Semantic Web — Patrick Durusau @ 2:00 pm

3rd International Workshop on Semantic Search Over The Web (SSW 2013)

Dates:

Abstract Papers submission: May 31, 2013 – 15:00 (3:00 pm) EDT
(Short) Full Paper submission: June 7, 2013 – 15:00 (3:00 pm) EDT
Author notification: July 19, 2013
Camera-ready copy due: August 2, 2013
Workshop date: During VLDB (Aug 26 – Aug 30)

From the webpage:

We are witnessing a smooth evolution of the Web from a worldwide information space of linked documents to a global knowledge base, composed of semantically interconnected resources. To date, the correlated and semantically annotated data available on the web amounts to 25 billion RDF triples, interlinked by around 395 million RDF links. The continuous publishing and the integration of the plethora of semantic datasets from companies, government and public sector projects is leading to the creation of the so-called Web of Knowledge. Each semantic dataset contributes to extend the global knowledge and increases its reasoning capabilities. As a matter of facts, researchers are now looking with growing interest to semantic issues in this huge amount of correlated data available on the Web. Many progresses have been made in the field of semantic technologies, from formal models to repositories and reasoning engines. While the focus of many practitioners is on exploiting such semantic information to contribute to IR problems from a document centric point of view, we believe that such a vast, and constantly growing, amount of semantic data raises data management issues that must be faced in a dynamic, highly distributed and heterogeneous environment such as the Web.

The third edition of the International Workshop on Semantic Search over the Web (SSW) will discuss about data management issues related to the search over the web and the relationships with semantic web technologies, proposing new models, languages and applications.

The research issues can be summarized by the following problems:

  • How can we model and efficiently access large amounts of semantic web data?
  • How can we effectively retrieve information exploiting semantic web technologies?
  • How can we employ semantic search in real world scenarios?

The SSW Workshop invites researchers, engineers, service developers to present their research and works in the field of data management for semantic search. Papers may deal with methods, models, case studies, practical experiences and technologies.

Apologies for the uncertainty about the workshop date. (There is confusion on the workshop site: one place says the 26th, another the 30th. Check before you make reservations or travel arrangements.)

I differ with the organizers on some issues but on the presence of: “…data management issues that must be faced in a dynamic, highly distributed and heterogeneous environment such as the Web,” there is no disagreement.

That’s the trick, isn’t it? In any confined or small group setting, just about any consistent semantic solution will work.

The hurly-burly of a constant stream of half-heard, partially understood communications across distributed and heterogeneous systems tests the true mettle of semantic solutions.

Not a quest for perfect communication but “good enough.”

March 17, 2013

Beacons of Availability

Filed under: Linked Data,LOD,RDF,Semantic Web — Patrick Durusau @ 10:39 am

From Records to a Web of Library Data – Pt3 Beacons of Availability by Richard Wallis.

Beacons of Availability

As I indicated in the first of this series, there are descriptions of a broader collection of entities, than just books, articles and other creative works, locked up in the Marc and other records that populate our current library systems. By mining those records it is possible to identify those entities, such as people, places, organisations, formats and locations, and model & describe them independently of their source records.

As I discussed in the post that followed, the library domain has often led in the creation and sharing of authoritative datasets for the description of many of these entity types. Bringing these two together, using URIs published by the Hubs of Authority, to identify individual relationships within bibliographic metadata published as RDF by individual library collections (for example the British National Bibliography, and WorldCat) is creating Library Linked Data openly available on the Web.

Why do we catalogue? is a question, I often ask, with an obvious answer – so that people can find our stuff. How does this entification, sharing of authorities, and creation of a web of library linked data help us in that goal. In simple terms, the more libraries can understand what resources each other hold, describe, and reference, the more able they are to guide people to those resources. Sounds like a great benefit and mission statement for libraries of the world but unfortunately not one that will nudge the needle on making library resources more discoverable for the vast majority of those that can benefit from them.

I have lost count of the number of presentations and reports I have seen telling us that upwards of 80% of visitors to library search interfaces start in Google. A similar weight of opinion can be found that complains how bad Google, and the other search engines, are at representing library resources. You will get some balancing opinion, supporting how good Google Book Search and Google Scholar are at directing students and others to our resources. Yet I am willing to bet that again we have another 80-20 equation or worse about how few, of the users that libraries want to reach, even know those specialist Google services exist. A bit of a sorry state of affairs when the major source of searching for our target audience, is also acknowledged to be one of the least capable at describing and linking to the resources we want them to find!

Library linked data helps solve both the problem of better description and findability of library resources in the major search engines. Plus it can help with the problem of identifying where a user can gain access to that resource to loan, download, view via a suitable license, or purchase, etc.
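
To make that concrete, here is a minimal sketch (Apache Jena; the identifiers are invented, not real WorldCat or VIAF URIs) of a bibliographic description pointing at an authority URI for its creator:

import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.DCTerms;

public class LibraryLinkedDataSketch {
    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();

        // Invented identifiers standing in for an authority hub URI and a work URI
        Resource author = m.createResource("http://authority.example.org/person/12345");
        m.createResource("http://catalog.example.org/work/67890")
                .addProperty(DCTerms.title, "An Example Title")
                .addProperty(DCTerms.creator, author);

        m.write(System.out, "TURTLE");
    }
}

The payoff Wallis is after comes when many collections reuse the same creator URI, so statements about that person accumulate across the web rather than inside one silo.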

I am an ardent sympathizer with helping people find “our stuff.”

I don’t disagree with the description of Google as: “…the major source of searching for our target audience, is also acknowledged to be one of the least capable at describing and linking to the resources we want them to find!”

But in all fairness to Google, I would remind you of Drabenstott’s research on Library of Congress subject headings, which found:

Overall percentages of correct meanings for subject headings in the original order of subdivisions were as follows:

children 32%
adults 40%
reference 53%
technical services librarians 56%

The Library of Congress subject classification has been around for more than a century and just over half of the librarians can use it correctly.

Let’s not wait more than a century to test the claim:*

Library linked data helps solve both the problem of better description and findability of library resources in the major search engines.


* By “test” I don’t mean the sort of study, “…we recruited twelve LIS students but one had to leave before the study was complete….”

I am using “test” in the sense of a well-designed, well-organized social science project with professional assistance from social scientists, UI test designers and the like.

I think OCLC is quite sincere in its promotion of linked data, but effectiveness is an empirical question, not one of sincerity.

March 16, 2013

Algebraix Data Achieves Unrivaled Semantic Benchmark Performance

Filed under: Benchmarks,RDF,Semantic Web — Patrick Durusau @ 7:24 pm

Algebraix Data Achieves Unrivaled Semantic Benchmark Performance by Angela Guess.

From the post:

Algebraix Data Corporation today announced its SPARQL Server(TM) RDF database successfully executed all 17 of its queries on the SP2 benchmark up to one billion triples on one computer node. The SP2 benchmark is the most computationally complex for testing SPARQL performance and no other vendor has reported results for all queries on data sizes above five million triples.

Furthermore, SPARQL Server demonstrated linear performance in total SP2Bench query time on data sets from one million to one billion triples. These latest dramatic results are made possible by algebraic optimization techniques that maximize computing resource utilization.

“Our outstanding SPARQL Server performance is a direct result of the algebraic techniques enabled by our patented Algebraix technology,” said Charlie Silver, CEO of Algebraix Data. “We are investing heavily in the development of SPARQL Server to continue making substantial additional functional, performance and scalability improvements.”

Pretty much a copy of the press release from Algebraix.

You may find:

Doing the Math: The Algebraix DataBase Whitepaper: What it is, how it works, why we need it (PDF) by Robin Bloor, PhD

ALGEBRAIX Technology Mathematics Whitepaper (PDF), by Algebraix Data

and,

Granted Patents

more useful.

BTW, The SP²Bench SPARQL Performance Benchmark will be useful as well.
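
If you want to see what timing a single SPARQL query looks like, here is a minimal sketch with Apache Jena (the file name and query are placeholders, not the SP²Bench data or one of its 17 queries):

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;

public class BenchmarkTimingSketch {
    public static void main(String[] args) {
        // Placeholder file name -- SP2Bench generates its own DBLP-style documents
        Model model = RDFDataMgr.loadModel("sp2bench-1m.n3");

        // Placeholder query, not one of the 17 SP2Bench queries
        String queryString = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }";

        long start = System.nanoTime();
        try (QueryExecution qe = QueryExecutionFactory.create(queryString, model)) {
            ResultSetFormatter.out(qe.execSelect());
        }
        System.out.println("Query time: " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}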

Algebraix listed its patents but I supplied the links. Why the links were missing at Algebraix I cannot say.

If the claim that “…no other vendor has reported results for all queries on data sizes above five million triples…” is correct, isn’t scaling an issue for SPARQL?
