Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 25, 2014

OCLC Preview 194 Million…

Filed under: Linked Data,LOD,OCLC,WorldCat — Patrick Durusau @ 4:07 pm

OCLC Preview 194 Million Open Bibliographic Work Descriptions by Richard Wallis.

From the post:

I have just been sharing a platform, at the OCLC EMEA Regional Council Meeting in Cape Town South Africa, with my colleague Ted Fons. A great setting for a great couple of days of the OCLC EMEA membership and others sharing thoughts, practices, collaborative ideas and innovations.

Ted and I presented our continuing insight into The Power of Shared Data, and the evolving data strategy for the bibliographic data behind WorldCat. If you want to see a previous view of these themes you can check out some recordings we made late last year on YouTube, from Ted – The Power of Shared Data – and me – What the Web Wants.

Today, demonstrating on-going progress towards implementing the strategy, I had the pleasure to preview two upcoming significant announcements on the WorldCat data front:

  1. The release of 194 Million Linked Data Bibliographic Work descriptions
  2. The WorldCat Linked Data Explorer interface

A preview release to be sure but one worth following!

Particularly with 194 million bibliographic work descriptions!

See Richard’s post for the details.

February 22, 2014

Getty Art & Architecture Thesaurus Now Available

Filed under: Architecture,Art,Linked Data,Museums,Thesaurus — Patrick Durusau @ 8:36 pm

Art & Architecture Thesaurus Now Available as Linked Open Data by James Cuno.

From the post:

We’re delighted to announce that today, the Getty has released the Art & Architecture Thesaurus (AAT)® as Linked Open Data. The data set is available for download at vocab.getty.edu under an Open Data Commons Attribution License (ODC BY 1.0).

The Art & Architecture Thesaurus is a reference of over 250,000 terms on art and architectural history, styles, and techniques. It’s one of the Getty Research Institute’s four Getty Vocabularies, a collection of databases that serves as the premier resource for cultural heritage terms, artists’ names, and geographical information, reflecting over 30 years of collaborative scholarship. The other three Getty Vocabularies will be released as Linked Open Data over the coming 18 months.

In recent months the Getty has launched the Open Content Program, which makes thousands of images of works of art available for download, and the Virtual Library, offering free online access to hundreds of Getty Publications backlist titles. Today’s release, another collaborative project between our scholars and technologists, is the next step in our goal to make our art and research resources as accessible as possible.

What’s Next

Over the next 18 months, the Research Institute’s other three Getty Vocabularies—The Getty Thesaurus of Geographic Names (TGN)®, The Union List of Artist Names®, and The Cultural Objects Name Authority (CONA)®—will all become available as Linked Open Data. To follow the progress of the Linked Open Data project at the Research Institute, see their page here.

A couple of points of particular interest:

Getty documentation says this is the first industrial application of ISO 25964, Information and documentation – Thesauri and interoperability with other vocabularies.

You will probably want to read AAT Semantic Representation rather carefully.

A great source of data and interesting reading on the infrastructure as well.
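If you want to poke at the data yourself, a few lines of Python against the vocab.getty.edu SPARQL endpoint should do. A sketch only: whether labels surface as plain skos:prefLabel or via SKOS-XL is exactly the sort of detail the AAT Semantic Representation document covers, so treat the query shape as an assumption.

    # Sketch: querying the AAT release via its SPARQL endpoint.
    # Requires: pip install SPARQLWrapper
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://vocab.getty.edu/sparql")
    # Illustrative query: AAT concepts whose label mentions "fresco".
    sparql.setQuery("""
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?concept ?label WHERE {
      ?concept skos:prefLabel ?label .
      FILTER(CONTAINS(LCASE(STR(?label)), "fresco"))
    } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["concept"]["value"], row["label"]["value"])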

I first saw this in a tweet by Semantic Web Company.

February 5, 2014

SNB Graph Generator

Filed under: Benchmarks,Linked Data — Patrick Durusau @ 3:47 pm

Social Network Benchmark (SNB) Graph Generator by Peter Boncz.

Slides from FOSDEM2014.

Be forewarned, the slides are difficult to read due to heavy background images.

Slide 17 will be of interest because it computes the “…similarity of two nodes based on their (correlated) properties.” (Rhymes with “merging.”) Computationally expensive.

Slide 18: disregard nodes with too large a similarity distance.

Slide 41 points to:

github.com/ldbc

ldbc.eu:8090/display/TUC

And a truncated link that I think points to:

LDBC_Status of the Semantic Publishing Benchmark.pdf, but it is difficult to say because that link opens a page of fifteen (15) PDF files.

If you select “download all” it will deliver the files to you in one zip file.

February 1, 2014

Advertising RDF and Linked Data:… [Where’s the beef?]

Filed under: EU,Linked Data,RDF — Patrick Durusau @ 5:23 pm

Advertising RDF and Linked Data: SPARQL Queries on EU Data

From the webpage:

This is a collection of SPARQL queries on EU data that shows benefits of converting it to RDF and linking it, i.e. queries that reveal non-trivial information that would have been hard to reconstruct by hunting it down over separate/unlinked data sources.

At first I thought this would be a cool demonstration of the use of SPARQL, with the queries as links and more fully set forth below.

Nada. I suspect the non-working hyperlinks in the list of queries were meant to be internal links to the fuller exposition of the queries.

Then, when I got to the queries, the only one that promises:

Link to query result: http://www4.wiwiss.fu-berlin.de/eures/sparql

Returns a 404.

The other links appear to point to webpages that give a SPARQL query which, if I had a SPARQL client, I could paste in to see the result.

I would mirror the question:

Effort of obtaining those results without RDFizing and linking:

with:

Effort to see “…benefits of converting [EU data] to RDF and linking it” without a SPARQL client, very high/impossible.
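In fairness, a SPARQL “client” need not be a heavy thing: over the standard SPARQL protocol it is a few lines of HTTP. A minimal sketch (the endpoint URL is a placeholder, since the post’s own links are dead):

    # Sketch of a minimal SPARQL client: HTTP GET per the SPARQL protocol.
    import requests

    endpoint = "http://example.org/sparql"  # hypothetical endpoint
    query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"

    resp = requests.get(
        endpoint,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
    )
    resp.raise_for_status()
    for binding in resp.json()["results"]["bindings"]:
        print({k: v["value"] for k, v in binding.items()})

Still, expecting every curious visitor to write even that much before seeing any benefit is the problem.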

That’s not just a criticism of RDF. Topic maps made a different mistake but it had the same impact.

The question for any user is “where’s the beef?” What am I gaining? Now, not some unknown number of tomorrows from now. Today!

PS: The EU data cloud has dropped the “Linked Open Data Around-the-Clock” moniker I reported in September of 2011. Same place, different branding. I suspect that is why governments like the web so much. Implementing newspeak policy is just a save away.

January 29, 2014

Applying linked data approaches to pharmacology:…

Applying linked data approaches to pharmacology: Architectural decisions and implementation by Alasdair J.G. Gray, et al.

Abstract:

The discovery of new medicines requires pharmacologists to interact with a number of information sources ranging from tabular data to scientific papers, and other specialized formats. In this application report, we describe a linked data platform for integrating multiple pharmacology datasets that form the basis for several drug discovery applications. The functionality offered by the platform has been drawn from a collection of prioritised drug discovery business questions created as part of the Open PHACTS project, a collaboration of research institutions and major pharmaceutical companies. We describe the architecture of the platform focusing on seven design decisions that drove its development with the aim of informing others developing similar software in this or other domains. The utility of the platform is demonstrated by the variety of drug discovery applications being built to access the integrated data.

An alpha version of the OPS platform is currently available to the Open PHACTS consortium and a first public release will be made in late 2012, see http://www.openphacts.org/ for details.

The paper acknowledges that present database entries lack semantics.

A further challenge is the lack of semantics associated with links in traditional database entries. For example, the entry in UniProt for the protein “kinase C alpha type homo sapien” contains a link to the Enzyme database record, which has complementary data about the same protein and thus the identifiers can be considered as being equivalent. One approach to resolve this, proposed by Identifiers.org, is to provide a URI for the concept which contains links to the database records about the concept [27]. However, the UniProt entry also contains a link to the DrugBank compound “Phosphatidylserine”. Clearly, these concepts are not identical as one is a protein and the other a chemical compound. The link in this case is representative of some interaction between the compound and the protein, but this is left to a human to interpret. Thus, for successful data integration one must devise strategies that address such inconsistencies within the existing data.

I would have said databases lack properties to identify the subjects in question but there is little difference in the outcome of our respective positions, i.e., we need more semantics to make robust use of existing data.

Perhaps even more importantly, the paper treats “equality” as context dependent:

Equality is context dependent

Datasets often provide links to equivalent concepts in other datasets. These result in a profusion of “equivalent” identifiers for a concept. Identifiers.org provide a single identifier that links to all the underlying equivalent dataset records for a concept. However, this constrains the system to a single view of the data, albeit an important one.

A novel approach to instance level links between the datasets is used in the OPS platform. Scientists care about the types of links between entities: different scientists will accept concepts being linked in different ways and for different tasks they are willing to accept different forms of relationships. For example, when trying to find the targets that a particular compound interacts with, some data sources may have created mappings to gene rather than protein identifiers: in such instances it may be acceptable to users to treat gene and protein IDs as being in some sense equivalent. However, in other situations this may not be acceptable and the OPS platform needs to allow for this dynamic equivalence within a scientific context. As a consequence, rather than hard coding the links into the datasets, the OPS platform defers the instance level links to be resolved during query execution by the Identity Mapping Service (IMS). Thus, by changing the set of dataset links used to execute the query, different interpretations over the data can be provided.

Opaque mappings between datasets, i.e., mappings that don’t assign properties to source and target and then say what properties or conditions must be met for the mapping to be valid, are of little use. Rely on opaque mappings at your own risk.

On the other hand, I fully agree that equality is context dependent and the choice of the criteria for equivalence should be left up to users. I suppose in that sense if users wanted to rely on opaque mappings, that would be their choice.
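To make that concrete, here is a toy sketch, mine and not the Open PHACTS IMS API, of stand-off linksets applied at query time so the same identifier resolves differently under different lenses (the identifiers echo the UniProt example above):

    # Toy model: stand-off linksets keyed by lens, applied at query time.
    LINKSETS = {
        "strict": {("uniprot:P17252", "enzyme:2.7.11.13")},  # same protein
        "gene-as-protein": {("uniprot:P17252", "ensembl:ENSG00000154229")},
    }

    def equivalents(identifier, lenses):
        """Identifiers equivalent to `identifier` under the chosen lenses."""
        out = {identifier}
        for lens in lenses:
            for a, b in LINKSETS.get(lens, set()):
                if a in out:
                    out.add(b)
                elif b in out:
                    out.add(a)
        return out

    # Same data, different answers depending on the lens:
    print(equivalents("uniprot:P17252", ["strict"]))
    print(equivalents("uniprot:P17252", ["strict", "gene-as-protein"]))

Swap the linkset selection and the “same” query returns a different world, which is the paper’s point.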

Exciting as the paper is, it discusses architectural decisions, so we are not yet at the point of debating these issues in detail. It promises to be a lively discussion!

January 17, 2014

Three Linked Data Vocabularies

Filed under: Linked Data,Vocabularies — Patrick Durusau @ 7:27 pm

Three Linked Data Vocabularies are W3C Recommendations

From the post:

Three Recommendations were published today to enhance data interoperability, especially in government data. Each one specifies an RDF vocabulary (a set of properties and classes) for conveying a particular kind of information:

  • The Data Catalog (DCAT) Vocabulary is used to provide information about available data sources. When data sources are described using DCAT, it becomes much easier to create high-quality integrated and customized catalogs including entries from many different providers. Many national data portals are already using DCAT.
  • The Data Cube Vocabulary brings the cube model underlying SDMX (Statistical Data and Metadata eXchange, a popular ISO standard) to Linked Data. This vocabulary enables statistical and other regular data, such as measurements, to be published and then integrated and analyzed with RDF-based tools.
  • The Organization Ontology provides a powerful and flexible vocabulary for expressing the official relationships and roles within an organization. This allows for interoperation of personnel tools and will support emerging socially-aware software.

More vocabularies for mapping into their respective areas, backwards for pre-existing vocabularies and forward for vocabularies that succeed them.
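To make DCAT less abstract, here is a minimal, invented catalog entry using properties from the Recommendation, parsed with rdflib to show it is ordinary RDF:

    # Sketch: a minimal DCAT dataset description (dataset and URLs invented).
    # Requires: pip install rdflib
    from rdflib import Graph

    ttl = """
    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct:  <http://purl.org/dc/terms/> .

    <http://example.gov/dataset/budget-2014>
        a dcat:Dataset ;
        dct:title "Proposed budget, fiscal year 2014" ;
        dcat:keyword "budget", "spending" ;
        dcat:distribution [
            a dcat:Distribution ;
            dcat:downloadURL <http://example.gov/files/budget-2014.csv> ;
            dct:format "text/csv"
        ] .
    """
    g = Graph()
    g.parse(data=ttl, format="turtle")
    print(len(g), "triples")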

January 16, 2014

JSON-LD Is A W3C Recommendation

Filed under: JSON,Linked Data,LOD,RDF — Patrick Durusau @ 1:53 pm

JSON-LD Is A W3C Recommendation

From the post:

The RDF Working Group has published two Recommendations today:

  • JSON-LD 1.0. JSON is a useful data serialization and messaging format. This specification defines JSON-LD, a JSON-based format to serialize Linked Data. The syntax is designed to easily integrate into deployed systems that already use JSON, and provides a smooth upgrade path from JSON to JSON-LD. It is primarily intended to be a way to use Linked Data in Web-based programming environments, to build interoperable Web services, and to store Linked Data in JSON-based storage engines.
  • JSON-LD 1.0 Processing Algorithms and API. This specification defines a set of algorithms for programmatic transformations of JSON-LD documents. Restructuring data according to the defined transformations often dramatically simplifies its usage. Furthermore, this document proposes an Application Programming Interface (API) for developers implementing the specified algorithms.

It would make a great question on a markup exam to ask whether JSON reminds you more of the “Multicode Basic Concrete Syntax” or a “Variant Concrete Syntax.” For either answer, explain.

In any event, you will be encountering JSON-LD so these recommendations will be helpful.
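If you want a taste before it lands in your stack: JSON-LD is ordinary JSON plus an @context that maps keys to URIs. A sketch using the pyld library (the document itself is invented):

    # Sketch: a JSON-LD document and its expansion to full URIs.
    # Requires: pip install pyld
    from pyld import jsonld

    doc = {
        "@context": {
            "name": "http://xmlns.com/foaf/0.1/name",
            "homepage": {"@id": "http://xmlns.com/foaf/0.1/homepage",
                         "@type": "@id"},
        },
        "@id": "http://example.org/people/patrick",
        "name": "Patrick",
        "homepage": "http://example.org/",
    }

    # Expansion replaces the shorthand keys with the mapped URIs:
    print(jsonld.expand(doc))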

December 26, 2013

The Case for Linking World Law Data

Filed under: Law,Linked Data — Patrick Durusau @ 4:40 pm

The Case for Linking World Law Data by Sergio Puig and Enric G. Torrents.

Abstract:

The present paper advocates for the creation of a federated, hybrid database in the cloud, integrating law data from all available public sources in one single open access system – adding, in the process, relevant meta-data to the indexed documents, including the identification of social and semantic entities and the relationships between them, using linked open data techniques and standards such as RDF. Examples of potential benefits and applications of this approach are also provided, including, among others, experiences from our previous research, in which data integration, graph databases and social and semantic networks analysis were used to identify power relations, litigation dynamics and cross-reference patterns both intra and inter-institutionally, covering most of the world’s international economic courts.

From the conclusion:

We invite any individual and organization to join in and participate in this open endeavor, to shape together this project, Neocodex, aspiring to replicate the impact that Justinian’s Corpus Juris Civilis, the original Codex, had in the legal systems of the Early Middle Ages.

Yes, well, I can’t say the authors lack for ambition. 😉

As you know, the Corpus Juris Civilis has heavily influenced the majority of legal jurisdictions today. (Civil Law)

Do be mindful that the OASIS Legal Citation Markup (LegalCiteM) TC is having its organizational meeting on 12th February 2014, in case you are interested in yet another legal citation effort.

Why anyone thinks we need another legal citation system, that leaves the previous one on the cutting room floor, is beyond me.

Yes, a new legal citation system might be non-proprietary, royalty-free, web-based, etc., but without picking up current citation practices, it will also be dead on arrival (DOA).

Dandelion’s New Bloom:…

Filed under: Knowledge Graph,Linked Data — Patrick Durusau @ 4:18 pm

Dandelion’s New Bloom: A Family Of Semantic Text Analysis APIs by Jennifer Zaino.

From the post:

Dandelion, the service from SpazioDati whose goal is to deliver linked and enriched data for apps, has recently introduced a new suite of products related to semantic text analysis.

Its dataTXT family of semantic text analysis APIs includes dataTXT-NEX, a named entity recognition API that links entities in the input sentence with Wikipedia and DBpedia and, in turn, with the Linked Open Data cloud, and dataTXT-SIM, an experimental semantic similarity API that computes the semantic distance between two short sentences. TXT-CL (now in beta) is a categorization service that classifies short sentences into user-defined categories, says SpazioDati CEO Michele Barbera.

“The advantage of the dataTXT family compared to existing text analysis’ tools is that dataTXT relies neither on machine learning nor NLP techniques,” says Barbera. “Rather it relies entirely on the topology of our underlying knowledge graph to analyze the text.” Dandelion’s knowledge graph merges together several Open Community Data sources (such as DBpedia) and private data collected and curated by SpazioDati. It’s still in private beta and not yet publicly accessible, though plans are to gradually open up portions of the graph in the future via the service’s upcoming Datagem APIs, “so that developers will be able to access the same underlying structured data by linking their own content with dataTXT APIs or by directly querying the graph with the Datagem APIs; both of them will return the same resource identifiers,” Barbera says. (See the Semantic Web Blog’s initial coverage of Dandelion here, including additional discussion of its knowledge graph.)

The line, “…dataTXT relies neither on machine learning nor NLP techniques,…[r]ather it relies entirely on the topology of our underlying knowledge graph to analyze the text,” caught my eye.

In private beta now but I am interested in how well it works against data in the wild.

December 14, 2013

JITA Classification System of Library and Information Science

Filed under: Classification,Library,Linked Data — Patrick Durusau @ 5:00 pm

JITA Classification System of Library and Information Science

From the post:

JITA is a classification schema of Library and Information Science (LIS). It is used by E-LIS, an international open repository for scientific papers in Library and Information Science, for indexing and searching. Currently JITA is available in English and has been translated into 14 languages (tr, el, nl, cs, fr, it, ro, ca, pt, pl, es, ar, sv, ru). JITA is also accessible as Linked Open Data, containing 3500 triples.

You had better enjoy triples before link rot overtakes them.

Today CSV, tomorrow JSON?

How long do you think the longest lived triple will last?

December 4, 2013

To fairly compare…

Filed under: Benchmarks,Graphs,Linked Data — Patrick Durusau @ 3:27 pm

LDBC D3.3.1 Use case analysis and choke point analysis Coordinators: Alex Averbuch and Norbert Martinez.

From the introduction:

Due largely to the Web, an exponentially increasing amount of data is generated each year. Moreover, a significant fraction of this data is unstructured, or semi-structured at best. This has meant that traditional data models are becoming increasingly restrictive and unsuitable for many application domains – the relational model in particular has been criticized for its lack of semantics. These trends have driven development of alternative database technologies, including graph databases.

The proliferation of applications dealing with complex networks has resulted in an increasing number of graph database deployments. This, in turn, has created demand for a means by which to compare the characteristics of different graph database technologies, such as: performance, data model, query expressiveness, as well as general functional and non-functional capabilities.

To fairly compare these technologies it is essential to first have a thorough understanding of graph data models, graph operations, graph datasets, graph workloads, and the interactions between all of these. (emphasis added)

In this rather brief report, the LDBC (Linked Data Benchmark Council) gives a thumbnail sketch of the varieties of graphs, graph databases, graph query languages, along with some summary use cases. To their credit, unlike some graph vendors, they do understand what is meant by a hyperedge. (see p.8)

On the other hand, they retreat from the full generality of graph models to “directed attributed multigraphs,” before evaluating any of the graph alternatives. (also at p.8)

It may be a personal prejudice but I would prefer to see fuller development of use cases and requirements before restricting the solution space.

Particularly since new developments in graph theory and/or technology are a weekly if not daily occurrence.

Premature focus on “unsettled” technology could result in a benchmark for yesterday’s version of graph technology.

Interesting, I suppose, but not terribly useful.

December 2, 2013

Vidi Competition [Closes 14th February 2014]

Filed under: Contest,Linked Data — Patrick Durusau @ 4:41 pm

Vidi Competition by Marieke Guy. (A public email notice I received today.)

At the start of November the LinkedUp Project launched the second in our LinkedUp Challenge – the Vidi Competition.

For the Vidi Competition we are inviting you to design and build innovative and robust prototypes and demos for tools that analyse and/or integrate open web data for educational purposes. The competition will run from 4th November 2013 till 14th February 2014. Prizes (up to €3,000 for first) will be awarded at the European Semantic Web Conference in Crete, Greece in May 2014. You can find out full details on the LinkedUp Challenge Website.

For this Competition we have one open track and two focused tracks that may guide teams or provide inspiration.

We’ve recently published blog posts on the tracks:

  • Pathfinder: Using linked data to ease access to recommendations and guidance
  • Simplificator: Using linked data to add context to domain-specific resources

There is also a blog post detailing the technical support we can offer.

We’d like to complement these posts with an online webinar which will introduce LinkedUp and the Vidi Competition. There will also be an opportunity to ask our technical support team questions and find out more about the data sets available. The webinar will take approximately 45 minutes and will be recorded.

The webinar is still in planning but is likely to take place in the next couple of weeks, if you are interested in participating please register your email address and we will share times with you.

A collection of suggested data sources can be found at the LinkedUp Data Repository

The overall theme of the competition:

We’re inviting you to design and build innovative and robust prototypes and demos for tools that analyse and/or integrate open web data for educational purposes. You can submit your Web application, App, analysis toolkit, documented API or any other tool that connects, exploits or analyses open or linked data and that addresses real educational needs. Your tool still may contain some bugs, as long as it has a stable set of features and you have some proof that it can be deployed on a realistic scale.

You could approach this competition several ways:

  1. Do straight linked data as a credential of your ability to produce and use linked data.
  2. Do straight linked data and supplement it with a topic map, either separately or as part of the competition.
  3. Create a solution (topic maps and/or linked data) and approach people with an interest in these resources.

A regular reader of this blog recently reminded me people are not shopping for topic maps (or linked data) but for results. (That’s #3 in my list.)

November 7, 2013

Creating Knowledge out of Interlinked Data…

Filed under: Linked Data,LOD,Open Data,Semantic Web — Patrick Durusau @ 6:55 pm

Creating Knowledge out of Interlinked Data – STATISTICAL OFFICE WORKBENCH by Bert Van Nuffelen and Karel Kremer.

From the slides:

LOD2 is a large-scale integrating project co-funded by the European Commission within the FP7 Information and Communication Technologies Work Programme. This 4-year project comprises leading Linked Open Data technology researchers, companies, and service providers. Coming from across 12 countries the partners are coordinated by the Agile Knowledge Engineering and Semantic Web Research Group at the University of Leipzig, Germany.

LOD2 will integrate and syndicate Linked Data with existing large-scale applications. The project shows the benefits in the scenarios of Media and Publishing, Corporate Data intranets and eGovernment.

LOD2 Stack Release 3.0 overview

Connecting the dots: Workbench for Statistical Office

In case you are interested:

LOD2 homepage

Ubuntu 12.04 Repository

VM User / Password: lod2demo / lod2demo

LOD2 blog

The LOD2 project expires in August of 2014.

Linked Data is going to be around, one way or the other, for quite some time.

My suggestion: grab the last VM from LOD2 and a copy of its OS, and store them in a location that migrates as data systems change.

November 5, 2013

Implementations of Data Catalog Vocabulary

Filed under: DCAT,Government Data,Linked Data — Patrick Durusau @ 5:45 pm

Implementations of Data Catalog Vocabulary

From the post:

The Government Linked Data (GLD) Working Group today published the Data Catalog Vocabulary (DCAT) as a Candidate Recommendation. DCAT allows governmental and non-governmental data catalogs to publish their entries in a standard machine-readable format so they can be managed, aggregated, and presented in other catalogs.

Originally developed at DERI, DCAT has evolved with input from a variety of stakeholders and is now stable and ready for widespread use. If you have a collection of data sources, please consider publishing DCAT metadata for it, and if you run a data catalog or portal, please consider making use of DCAT metadata you find. The Working Group is eager to receive comments and reports of use at public-gld-comments@w3.org and is maintaining an Implementation Report.

If you know anyone in the United States government, please suggest this to them.

The more time the U.S. government spends on innocuous data, the less time it has to spy on its citizens and the citizens and governments of other countries.

I say innocuous data because I have yet to see any government release information that would discredit the current regime.

That wasn’t true for the Pentagon Papers, the Watergate tapes, or the Snowden releases.

Can you think of any voluntary release of data by any government that discredited a current regime?

The reason for secrecy isn’t to protect techniques or sources.

Guess whose incompetence would be exposed by transparency?

October 12, 2013

NYCPedia

Filed under: Encyclopedia,Linked Data — Patrick Durusau @ 7:21 pm

NYCPedia

From the about page:

NYCpedia is a new data encyclopedia about New York City.

This is a beta preview, so bear with us as we work out the bugs, add tons more features and add new data.

NYCpedia is organized so you can search for information about a borough, neighborhood, or zip code. From there you can find insights about jobs, education, healthy living, real estate, transportation and more. We pull up-to-date information from open data sources and link it up so it’s easier to explore, but you can always check out the original source. We are constantly looking to add new data sources, so if you know of a great dataset that should be in NYCpedia, let us know.

Need data services for your NYC-based business, non-profit, or academic institution? Contact us to find out how you can link your organization’s data to NYCpedia.

Based on the PediaCities platform, whose about page says:

Ontodia created the PediaCities platform to curate, organize, and link data about cities. Check out our first PediaCities knowledgebase at NYCpedia.com for a demonstration of what clean linked data looks like. Ontodia was founded in 2012 by Joel Natividad and Sami Baig following their success at NYCBigApps 3.0, where they won the Grand Prize for NYCFacets. The PediaCities platform, with NYCpedia as the first PediaCity, is our attempt to add value on top of NYC’s incredible open data ecosystem.

I was disappointed until I got deep enough in the map.

Try: http://nyc.pediacities.com/Resource/CommunityStats/10006, which is the 10006 zip code.

It’s clean and easy to navigate; not all the data possible, but targeted at the usual user.

I suspect a fairly homogeneous data set but I can’t say for sure.

Probably because it is in beta, there did not appear to be any non-English interfaces. I suspect that will be an early addition if it isn’t already on the development map.

BTW, if you are interested in data from New York City, try NYC Open Data with over 1100 data sets currently available.

September 27, 2013

Semantic Search and Linked Open Data Special Issue

Filed under: Linked Data,Semantic Search — Patrick Durusau @ 12:33 pm

Semantic Search and Linked Open Data Special Issue

Paper submission: 15 December 2013
Notice of review results: 15 February 2014
Revisions due: 31 March 2014
Publication: Aslib Proceedings, issue 5, 2014.

From the call:

The opportunities and challenges of Semantic Search from theoretical and practical, conceptual and empirical perspectives. We are particularly interested in papers that place carefully conducted studies into the wider framework of current Semantic Search research in the broader context of Linked Open Data. Topics of interest include but are not restricted to:

  • The history of semantic search –  the latest techniques and technology developments in the last 1000 years
  • Technical approaches to semantic search : linguistic/NLP, probabilistic, artificial intelligence, conceptual/ontological
  • Current trends in Semantic Search, including best practice, early adopters, and cultural heritage
  • Usability and user experience; Visualisation; and techniques and technologies in the practice for Semantic Search
  • Quality criteria and Impact of norms and standardisation similar to ISO 25964 “Thesauri for information retrieval”
  • Cross-industry collaboration and standardisation
  • Practical problems in brokering consensus and agreement – defining concepts, terms and classes, etc
  • Curation and management of ontologies
  • Differences between web-scale, enterprise scale, and collection-specific scale techniques
  • Evaluation of Semantic Search solutions, including comparison of data collection approaches
  • User behaviour including evolution of norms and conventions; Information behaviour; and Information literacy
  • User surveys; usage scenarios and case studies

Papers should clearly connect their studies to the wider body of Semantic Search scholarship, and spell out the implications of their findings for future research. In general, only research-based submissions including case studies and best practice will be considered. Viewpoints, literature reviews or general reviews are generally not acceptable.

See the post for submission requirements, etc.

I am encouraged by the inclusion of:

The history of semantic search –  the latest techniques and technology developments in the last 1000 years

Wondering who will take up the gauntlet on that topic?

August 30, 2013

OpenAGRIS 0.9 released:…

Filed under: Agriculture,Data Mining,Linked Data,Open Data — Patrick Durusau @ 7:25 pm

OpenAGRIS 0.9 released: new functionalities, resources & look by Fabrizio Celli.

From the post:

The AGRIS team has released OpenAGRIS 0.9, a new version of the Web application that aggregates information from different Web sources to expand the AGRIS knowledge, providing as much data as possible about a topic or a bibliographical resource within the agricultural domain.

OpenAGRIS 0.9 contains new functionalities and resources, and received a new interface in English and Spanish, with French, Arabic, Chinese and Russian translations on their way.

Mission: To make information on agricultural research globally available, interlinked with other data resources (e.g. DBPedia, World Bank, Geopolitical Ontology, FAO fisheries dataset, AGRIS serials dataset etc.) following Linked Open Data principles, allowing users to access the full text of a publication and all the information the Web holds about a specific research area in the agricultural domain (1).

Curious what agricultural experts make of this resource?

As of today, the site claims 5,076,594 records. And with all the triple bulking up, some 134,276,804 triples based on those records.

That works out to roughly 26 triples per record (134,276,804 ÷ 5,076,594 ≈ 26.4).

Which is no mean feat, but I wonder about the granularity of the information being offered.

That is, how useful is it to find 10,000 resources when each will take an hour to read?

More granular retrieval, that is, retrieval far below the level of a file or document, is going to be necessary to avoid repetitive human data mining.

Repetitive human data mining being one of the earmarks of today’s search technology.

August 10, 2013

Chinese Agricultural Thesaurus published as Linked Open Data

Filed under: Agriculture,Linked Data — Patrick Durusau @ 2:56 pm

Chinese Agricultural Thesaurus published as Linked Open Data

From the post:

CAT is the largest agricultural domain thesaurus in China, which is held and maintained by AII of CAAS. CAT was the important fruit of more than 100 professionals’ six years of hard work. International and national standards were adopted while designing and constructing CAT. CAT covers areas including agriculture, forestry, biology, etc. It is organized in 40 main categories and contains more than 63 thousand concepts and most of them have English translation. In addition, CAT includes more than 130 thousand semantic relationships such as Use, UF, BT, NT and RT.

Not my favorite format but at least you can avoid a lot of tedious data entry.

Transformation and adding properties will take some effort but not as much as starting from scratch.

July 24, 2013

Integrating Linked Data into Discovery [Solr Success for Topic Maps?]

Filed under: Linked Data,Semantic Web,Topic Maps — Patrick Durusau @ 2:06 pm

Integrating Linked Data into Discovery by Götz Hatop.

Abstract:

Although the Linked Data paradigm has evolved from a research idea to a practical approach for publishing structured data on the web, the performance gap between currently available RDF data stores and the somewhat older search technologies could not be closed. The combination of Linked Data with a search engine can help to improve ad-hoc retrieval. This article presents and documents the process of building a search index for the Solr search engine from bibliographic records published as linked open data.

Götz makes an interesting contrast between the Semantic Web and Solr:

In terms of the fast evolving technologies in the web age, the Semantic Web can already be called an old stack. For example, RDF was originally recommended by the W3C on February 22, 1999. Greenberg [8] points out many similarities between libraries and the Semantic Web: Both have been invented as a response to information abundance, their mission is grounded in service and information access, and libraries and the Semantic Web both benefit from national and international standards. Nevertheless, the technologies defined within the Semantic Web stack are not well established in libraries today, and the Semantic Web community is not fully aware of the skills, talent, and knowledge that catalogers have and which may be of help to advance the Semantic Web.

On the other hand, the Apache Solr [9] search system has taken the library world by storm. From Hathi Trust to small collections, Solr has become the search engine of choice for libraries. It is therefore not surprising, that the VuFind discovery system uses Solr for its purpose, and is not built upon a RDF triple store. Fortunately, the software does not make strong assumptions about the underlying index structure and can coexist with non-MARC data as soon as these data are indexed conforming to the scheme provided by VuFind.

The lack of “…strong assumptions about the underlying index structure…” enables users to choose their own indexing strategies.

That is, an indexing strategy is not forced on all users.

You could just as easily say that no built-in semantics are forced on users by Solr.

Want Solr success for topic maps?

Free users from built-in semantics. Enable them to use topic maps to map their models, their way.

Or do we fear the semantics of others?
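As a footnote on how little Solr imposes: you post documents whose fields encode your model, and Solr just indexes them. A sketch assuming a local Solr with a core named “catalog” and the stock dynamic-field rules (*_s string, *_ss multivalued string):

    # Sketch: indexing documents whose fields carry *our* identity model.
    import requests

    docs = [{
        "id": "work-1",
        "title_s": "Corpus Juris Civilis",
        # Topic-map-ish identity lives in our fields, not in the engine:
        "subject_identifiers_ss": ["http://example.org/id/justinian-code"],
        "type_s": "legal-code",
    }]
    resp = requests.post(
        "http://localhost:8983/solr/catalog/update?commit=true",
        json=docs,
    )
    resp.raise_for_status()
    print(resp.json()["responseHeader"]["status"])  # 0 on success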

Crafting Linked Open Data for Cultural Heritage:…

Filed under: Linked Data,Music,Music Retrieval — Patrick Durusau @ 1:17 pm

Crafting Linked Open Data for Cultural Heritage: Mapping and Curation Tools for the Linked Jazz Project by M. Cristina Pattuelli, Matt Miller, Leanora Lange, Sean Fitzell, and Carolyn Li-Madeo.

Abstract:

This paper describes tools and methods developed as part of Linked Jazz, a project that uses Linked Open Data (LOD) to reveal personal and professional relationships among jazz musicians based on interviews from jazz archives. The overarching aim of Linked Jazz is to explore the possibilities offered by LOD to enhance the visibility of cultural heritage materials and enrich the semantics that describe them. While the full Linked Jazz dataset is still under development, this paper presents two applications that have laid the foundation for the creation of this dataset: the Mapping and Curator Tool, and the Transcript Analyzer. These applications have served primarily for data preparation, analysis, and curation and are representative of the types of tools and methods needed to craft linked data from digital content available on the web. This paper discusses these two domain-agnostic tools developed to create LOD from digital textual documents and offers insight into the process behind the creation of LOD in general.

The Linked Data Jazz Name Directory:

consists of 8,725 unique names of jazz musicians as N-Triples.

It’s a starting place if you want to create a topic map about Jazz.
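Loading it is about all the setup you need (the file name here is hypothetical; rdflib is pip-installable):

    # Sketch: load the name directory and count candidate topics.
    from rdflib import Graph

    g = Graph()
    g.parse("linked-jazz-names.nt", format="nt")
    print(len(set(g.subjects())), "potential topics")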

Although, do be aware the Center for Arts and Cultural Policy Studies at Princeton University reports:

Although national estimates of the number of jazz musicians are unavailable, the Study of Jazz Artists 2001 estimated the number of jazz musicians in three metropolitan jazz hubs — New York, San Francisco, and New Orleans — at 33,003, 18,733, and 1,723, respectively. [A total of 53,459. How Many Jazz Musicians Are There?]

And that is only for one point in time. It does not include jazz musicians who perished before the estimate was made.

Much work remains to be done.

June 30, 2013

Preservation Vocabularies [3 types of magnetic storage medium?]

Filed under: Archives,Library,Linked Data,Vocabularies — Patrick Durusau @ 12:30 pm

Preservation Datasets

From the webpage:

The Linked Data Service is to provide access to commonly found standards and vocabularies promulgated by the Library of Congress. This includes data values and the controlled vocabularies that house them. Below are descriptions of each preservation vocabulary derived from the PREMIS standard. Inside each, a search box allows you to search the vocabularies individually.

New preservation vocabularies from the Library of Congress.

Your mileage will vary with these vocabularies.

Take storage for example.

As we all learned in school, there are only three kinds of magnetic “storage medium:”

  • hard disk
  • magnetic tape
  • TSM

😉

In case you don’t recognize TSM, it stands for IBM Tivoli Storage Manager.

Hmmmm, what about the twenty (20) types of optical disks?

Or other forms of magnetic media? Such as thumb drives, floppy disks, etc.

I pick “storage medium” at random.

Take a look at some of the other vocabularies and let me know what you think.

Please include links to more information in case the LOC decides to add more entries to its vocabularies.

I first saw this at: 21 New Preservation Vocabularies available at id.loc.gov.

June 29, 2013

Linked Data Glossary

Filed under: Glossary,Linked Data — Patrick Durusau @ 3:44 pm

Linked Data Glossary

Abstract:

This document is a glossary of terms defined and used to describe Linked Data, and its associated vocabularies and Best Practices. This document, published by the W3C Government Linked Data Working Group as a Working Group Note, is intended to help information management professionals, Web developers, scientists and the general public better understand publishing structured data using Linked Data Principles.

A glossary of one hundred and thirty-two terms used with Linked Data.

June 18, 2013

Shortfall of Linked Data

Filed under: Linked Data,LOD,Semantics,WWW — Patrick Durusau @ 8:58 am

Preparing a presentation, I stumbled upon a graphic illustration of why we need better semantic techniques for the average author:

Linked Data in 2011:

[LOD cloud diagram omitted]

Versus the WWW:

[WWW visualization omitted]

This must be why you don’t see any updated linked data clouds. The comparison is too shocking.

Particularly when you remember the WWW itself is only part of a much larger data cloud. (Ask the NSA about the percentages.)

Data is being produced every day, pushing us further and further behind with regard to its semantics. (And making the linked data cloud an even smaller percentage of all data.)

Authors have semantics in mind when they write.

The question is how to capture those semantics in machine readable form as nearly as seamlessly as authors write?

Suggestions?

June 3, 2013

Content-Negotiation for WorldCat

Filed under: Linked Data,WorldCat — Patrick Durusau @ 2:44 pm

Content-Negotiation for WorldCat by Richard Wallis.

From the post:

I am pleased to share with you a small but significant step on the Linked Data journey for WorldCat and the exposure of data from OCLC.

Content-negotiation has been implemented for the publication of Linked Data for WorldCat resources.

For those immersed in the publication and consumption of Linked Data, there is little more to say. However I suspect there are a significant number of folks reading this who are wondering what the heck I am going on about. It is a little bit techie but I will try to keep it as simple as possible.

Back last year, a linked data representation of each (of the 290+ million) WorldCat resources was embedded in its web page on the WorldCat site. For full details check out that announcement but in summary:

  • All resource pages include Linked Data
  • Human visible under a Linked Data tab at the bottom of the page
  • Embedded as RDFa within the page html
  • Described using the Schema.org vocabulary
  • Released under an ODC-BY open data license

That is all still valid – so what’s new from now?

That same data is now available in several machine readable RDF serialisations. RDF is RDF, but dependent on your use it is easier to consume as RDFa, or XML, or JSON, or Turtle, or as triples.

In many Linked Data presentations, including some of mine, you will hear the line “As I click on the link in a web browser we are seeing an html representation. However if I was a machine I would be getting XML or another format back.” This is the mechanism in the http protocol that makes that happen.

I use WorldCat often. It enables readers to search for a book at their local library or to order online.
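If you want to watch the content-negotiation happen, it is just an Accept header on the same URI. A sketch (the OCLC number is a placeholder; the media types follow Richard’s list):

    # Sketch: one URI, different representations via the Accept header.
    import requests

    uri = "http://www.worldcat.org/oclc/41266045"  # hypothetical record

    for accept in ["text/turtle", "application/rdf+xml"]:
        resp = requests.get(uri, headers={"Accept": accept})
        print(accept, "->", resp.status_code,
              resp.headers.get("Content-Type"))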

June 1, 2013

Asset Description Metadata Schema (ADMS)

Filed under: ADMS,Linked Data — Patrick Durusau @ 7:41 pm

Asset Description Metadata Schema (ADMS)

Abstract:

The Asset Description Metadata Schema (ADMS) is a common way to describe semantic interoperability assets making it possible for everyone to search and discover them once shared through the forthcoming federation of asset repositories.

Please consult the ADMS brochure for further introduction.

The 1.0 version was released 18 April 2012.

The 1.0 version was contributed to the W3C Government Linked Data (GLD) Working Group.

Wikipedia reports that its ADMS page, http://en.wikipedia.org/wiki/Asset_Description_Metadata_Schema, is an “orphan” page.

That is, no other pages link to it.

Just in case you are looking for a weekend project.

May 21, 2013

Beyond Enterprise Search…

Filed under: Linked Data,MarkLogic,Searching,Semantic Web — Patrick Durusau @ 2:49 pm

Beyond Enterprise Search… by adamfowleruk.

From the post:

Searching through all your content is fine – until you get a mountain of it with similar content, differentiated only by context. Then you’ll need to understand the meaning within the content. In this post I discuss how to do this using semantic techniques…

Organisations today have realised that for certain applications it is useful to have a consolidated search approach over several catalogues. This is most often the case when customers can interact with several parts of the company – sales, billing, service, delivery, fraud checks.

This approach is commonly called Enterprise Search, or Search and Discovery, which is where your content across several repositories is indexed in a separate search engine. Typically this indexing occurs some time after the content is added. In addition, it is not possible for a search engine to understand the full capabilities of every content system. This means complex mappings are needed between content, metadata and security. In some cases, this may be retrofitted with custom code as the systems do not support a common vocabulary around these aspects of information management.

Content Search

We are all used to content search, so much so that for today’s teenagers a search bar with a common (‘Google like’) grammar is expected. This simple yet powerful interface allows us to search for content (typically web pages and documents) that contain all the words or phrases that we need. Often this is broadened by the use of a thesaurus and word stemming (plays and played stems to the verb play), and combined with some form of weighting based on relative frequency within each unit of content.

Other techniques are also applied. Metadata is extracted or implied – author, date created, modified, security classification, Dublin Core descriptive data. Classification tools can be used (either at the content store or search indexing stages) to perform entity extraction (Cheese is a food stuff) and enrichment (Sheffield is a place with these geospatial co-ordinates). This provides a greater level of description of the term being searched for over and above simple word terms.

Using these techniques, additional search functionality can be provided. Search for all shops visible on a map using a bounding box, radius or polygon geospatial search. Return only documents where these words are within 6 words of each other. Perhaps weight some terms as more important than others, or optional.

These techniques are provided by many of the Enterprise class search engines out there today. Even Open Source tools like Lucene and Solr are catching up with this. They have provided access to information where before we had to rely on Information and Library Services staff to correctly classify incoming documents manually, as they did back in the paper bound days of yore.

Content search only gets you so far though.

I was amening with the best of them until Adam reached the part about MarkLogic 7 going to add Semantic Web capabilities. 😉

I didn’t see any mention of linked data replicating the semantic diversity that currently exists in data stores.

Making data more accessible isn’t going to make it less diverse.

Although making data more accessible may drive the development of ways to manage semantic diversity.

So perhaps there is a useful side to linked data after all.

May 13, 2013

Putting Linked Data on the Map

Filed under: Linked Data,Semantic Web — Patrick Durusau @ 2:32 pm

Putting Linked Data on the Map by Richard Wallis.

In fairness to Linked Data/Semantic Web, I really should mention this post by one of its more mainstream advocates:

Show me an example of the effective publishing of Linked Data – That, or a variation of it, must be the request I receive more than most when talking to those considering making their own resources available as Linked Data, either in their enterprise, or on the wider web.

There are some obvious candidates. The BBC for instance, makes significant use of Linked Data within its enterprise. They built their fantastic Olympics 2012 online coverage on an infrastructure with Linked Data at its core. Unfortunately, apart from a few exceptions such as Wildlife and Programmes, we only see the results in a powerful web presence. The published data is only visible within their enterprise.

Dbpedia is another excellent candidate. From about 2007 it has been a clear demonstration of Tim Berners-Lee’s principles of using URIs as identifiers and providing information, including links to other things, in RDF – it is just there at the end of the dbpedia URIs. But for some reason developers don’t seem to see it as a compelling example. Maybe it is influenced by the Wikipedia effect – interesting but built by open data geeks, so not to be taken seriously.

A third example, which I want to focus on here, is Ordnance Survey. Not generally known much beyond the geographical patch they cover, Ordnance Survey is the official mapping agency for Great Britain. Formally a government agency, they are best known for their incredibly detailed and accurate maps that are the standard accessory for anyone doing anything in the British countryside. A little less known is that they also publish information about post-code areas, parish/town/city/county boundaries, parliamentary constituency areas, and even European regions in Britain. As you can imagine, these all don’t neatly intersect, which makes the data about them a great case for a graph based data model and hence for publishing as Linked Data. Which is what they did a couple of years ago.

The reason I want to focus on their efforts now, is that they have recently beta released a new API suite, which I will come to in a moment. But first I must emphasise something that is often missed.

Linked Data is just there – without the need for an API the raw data (described in RDF) is ‘just there to consume’. With only standard [http] web protocols, you can get the data for an entity in their dataset by just doing a http GET request on the identifier…

(images omitted)

Richard does a great job describing the Linked Data APIs from the Ordnance Survey.

My only quibble is with his point:

Linked Data is just there – without the need for an API the raw data (described in RDF) is ‘just there to consume’.

True enough but it omits the authoring side of Linked Data.

Or understanding the data to be consumed.

With HTML, authoring hyperlinks was only marginally more difficult than “using” hyperlinks.

And the consumption of a hyperlink, beyond mime types, was unconstrained.

So linked data isn’t “just there.”

It’s there with an authoring burden that remains unresolved and that constrains consumption, should you decide to follow “standard [http] web protocols” and Linked Data.

I am sure the Ordnance Survey Linked Data and other Linked Data resources Richard mentions will be very useful, to some people in some contexts.

But pretending Linked Data is easier than it is, will not lead to improved Linked Data or other semantic solutions.

May 9, 2013

The ChEMBL database as linked open data

Filed under: Cheminformatics,Linked Data,RDF,Semantic Web — Patrick Durusau @ 1:51 pm

The ChEMBL database as linked open data by Egon L Willighagen, Andra Waagmeester, Ola Spjuth, Peter Ansell, Antony J Williams, Valery Tkachenko, Janna Hastings, Bin Chen and David J Wild. (Journal of Cheminformatics 2013, 5:23 doi:10.1186/1758-2946-5-23).

Abstract:

Background Making data available as Linked Data using Resource Description Framework (RDF) promotes integration with other web resources. RDF documents can natively link to related data, and others can link back using Uniform Resource Identifiers (URIs). RDF makes the data machine-readable and uses extensible vocabularies for additional information, making it easier to scale up inference and data analysis.

Results This paper describes recent developments in an ongoing project converting data from the ChEMBL database into RDF triples. Relative to earlier versions, this updated version of ChEMBL-RDF uses recently introduced ontologies, including CHEMINF and CiTO; exposes more information from the database; and is now available as dereferenceable, linked data. To demonstrate these new features, we present novel use cases showing further integration with other web resources, including Bio2RDF, Chem2Bio2RDF, and ChemSpider, and showing the use of standard ontologies for querying.

Conclusions We have illustrated the advantages of using open standards and ontologies to link the ChEMBL database to other databases. Using those links and the knowledge encoded in standards and ontologies, the ChEMBL-RDF resource creates a foundation for integrated semantic web cheminformatics applications, such as the presented decision support.

You already know about the fragility of ontologies so no need to repeat that rant here.

Having material encoded with an ontology, on the other hand, after vetting, can be a source that you wrap with a topic map.

So all that effort isn’t lost.

May 7, 2013

Linked CSV (Encouraging Development)

Filed under: CSV,Linked Data — Patrick Durusau @ 1:35 pm

Linked CSV by Jeni Tennison.

Abstract:

Many open data sets are essentially tables, or sets of tables, which follow the same regular structure. This document describes a set of conventions for CSV files that enable them to be linked together and to be interpreted as RDF.

An encouraging observation in the draft:

Linked CSV is built around the concept of using URIs to name things. Every record, column, and even slices of data, in a linked CSV file is addressable using URI Identifiers for the text/csv Media Type. For example, if the linked CSV file is accessed at http://example.org/countries, the first record in the CSV file above, which happens to be the first data line within the linked CSV file (which describes Andorra) is addressable with the URI:

http://example.org/countries#row:0

However, this addressing merely identifies the records within the linked CSV file, not the entities that the record describes. This distinction is important for two reasons:

  • a single entity may be described by multiple records within the linked CSV file
  • addressing entities and records separately enables us to make statements about the source of the information within a particular record

By default, each data line describes an entity, each entity is described by a single data line, and there is no way to address the entities. However, adding a $id column enables entities to be given identifiers. These identifiers are always URIs, and they are interpreted relative to the location of the linked CSV file. The $id column may be positioned anywhere but by convention it should be the first column (unless there is a # column, in which case it should be the second). For example:
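Jeni’s own example is not reproduced in the excerpt above, so here is a hypothetical linked CSV along the quoted lines — a $id column of relative URIs naming the entities each data line describes:

    # Sketch: reading a linked CSV whose $id column names the entities.
    import csv, io
    from urllib.parse import urljoin

    data = ("$id,name,population\n"
            "#andorra,Andorra,84000\n"
            "#france,France,65350000\n")

    base = "http://example.org/countries"  # where the CSV would live
    for row in csv.DictReader(io.StringIO(data)):
        entity = urljoin(base, row["$id"])  # e.g. http://example.org/countries#andorra
        print(entity, row["name"], row["population"])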

Hopefully Jeni is setting a trend in Linked Data circles of distinguishing locations from entities.

I first saw this in Christophe Lalanne’s A bag of tweets / April 2013.

April 28, 2013

Scientific Lenses over Linked Data… [Operational Equivalence]

Scientific Lenses over Linked Data: An approach to support task specific views of the data. A vision. by Christian Brenninkmeijer, Chris Evelo, Carole Goble, Alasdair J G Gray, Paul Groth, Steve Pettifer, Robert Stevens, Antony J Williams, and Egon L Willighagen.

Abstract:

Within complex scientific domains such as pharmacology, operational equivalence between two concepts is often context-, user- and task-specific. Existing Linked Data integration procedures and equivalence services do not take the context and task of the user into account. We present a vision for enabling users to control the notion of operational equivalence by applying scientific lenses over Linked Data. The scientific lenses vary the links that are activated between the datasets which affects the data returned to the user.

Two additional quotes from this paper should convince you of the importance of this work:

We aim to support users in controlling and varying their view of the data by applying a scientific lens which governs the notions of equivalence applied to the data. Users will be able to change their lens based on the task and role they are performing rather than having one fixed lens. To support this requirement, we propose an approach that applies context dependent sets of equality links. These links are stored in a stand-off fashion so that they are not intermingled with the datasets. This allows for multiple, context-dependent, linksets that can evolve without impact on the underlying datasets and support differing opinions on the relationships between data instances. This flexibility is in contrast to both Linked Data and traditional data integration approaches. We look at the role personae can play in guiding the nature of relationships between the data resources and the desired effects of applying scientific lenses over Linked Data.

and,

Within scientific datasets it is common to find links to the “equivalent” record in another dataset. However, there is no declaration of the form of the relationship. There is a great deal of variation in the notion of equivalence implied by the links both within a dataset’s usage and particularly across datasets, which degrades the quality of the data. The scientific user personae have very different needs about the notion of equivalence that should be applied between datasets. The users need a simple mechanism by which they can change the operational equivalence applied between datasets. We propose the use of scientific lenses.

Obvious questions:

Does your topic map software support multiple operational equivalences?

Does your topic map interface enable users to choose “lenses” (I like lenses better than roles) to view equivalence?

Does your topic map software support declaring the nature of equivalence?

I first saw this in the slide deck: Scientific Lenses: Supporting Alternative Views of the Data by Alasdair J G Gray at: 4th Open PHACTS Community Workshop.

BTW, the notion of equivalence being represented by “links” reminds me of a comment Peter Neubauer (Neo4j) once made to me, saying that equivalence could be modeled as edges. Imagine typing equivalence edges. Will have to think about that some more.
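Here is a quick sketch of what typed equivalence edges could look like, with a “lens” as nothing more than an edge filter (identifiers and types invented for illustration; networkx is pip-installable):

    # Sketch: equivalence as typed edges; a lens filters which types count.
    import networkx as nx

    G = nx.Graph()
    G.add_edge("db1:aspirin", "db2:acetylsalicylic-acid",
               equivalence="exact-match")
    G.add_edge("db1:aspirin", "db3:salicylates",
               equivalence="broad-match")

    def lens(graph, accepted):
        """Subgraph keeping only equivalence edges the task accepts."""
        keep = [(u, v) for u, v, d in graph.edges(data=True)
                if d["equivalence"] in accepted]
        return graph.edge_subgraph(keep)

    strict = lens(G, {"exact-match"})
    print(list(nx.connected_components(strict)))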

