Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 22, 2013

…electronic laboratory notebook records

Filed under: Cheminformatics,ELN Integration,Science,Semantics — Patrick Durusau @ 7:29 pm

First steps towards semantic descriptions of electronic laboratory notebook records by Simon J Coles, Jeremy G Frey, Colin L Bird, Richard J Whitby and Aileen E Day.

Abstract:

In order to exploit the vast body of currently inaccessible chemical information held in Electronic Laboratory Notebooks (ELNs) it is necessary not only to make it available but also to develop protocols for discovery, access and ultimately automatic processing. An aim of the Dial-a-Molecule Grand Challenge Network is to be able to draw on the body of accumulated chemical knowledge in order to predict or optimize the outcome of reactions. Accordingly the Network drew up a working group comprising informaticians, software developers and stakeholders from industry and academia to develop protocols and mechanisms to access and process ELN records. The work presented here constitutes the first stage of this process by proposing a tiered metadata system of knowledge, information and processing where each in turn addresses a) discovery, indexing and citation b) context and access to additional information and c) content access and manipulation. A compact set of metadata terms, called the elnItemManifest, has been derived and caters for the knowledge layer of this model. The elnItemManifest has been encoded as an XML schema and some use cases are presented to demonstrate the potential of this approach.
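
Without seeing the schema it would be rash to guess the actual elnItemManifest terms, but the "knowledge layer" idea, a compact set of discovery, indexing and citation metadata per ELN record, has roughly this shape. The element names and values below are hypothetical placeholders of mine, not the published vocabulary (that is in the schema linked at the end of this post):

    import xml.etree.ElementTree as ET

    # Hypothetical knowledge-layer record for one ELN entry; the real term set
    # is defined by the elnItemManifest XML schema, not by these names.
    manifest = ET.Element("elnItemManifest")
    for tag, value in [
        ("identifier", "doi:10.xxxx/eln.12345"),     # citation
        ("title", "Suzuki coupling, catalyst screen, run 7"),
        ("creator", "A. Chemist"),
        ("created", "2013-06-14"),
        ("keywords", "Suzuki coupling; palladium"),  # discovery and indexing
    ]:
        ET.SubElement(manifest, tag).text = value

    print(ET.tostring(manifest, encoding="unicode"))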

And the current state of electronic laboratory notebooks:

It has been acknowledged at the highest level [15] that “research data are heterogeneous, often classified and cited with disparate schema, and housed in distributed and autonomous databases and repositories. Standards for descriptive and structural metadata will help establish a common framework for understanding data and data structures to address the heterogeneity of datasets.” This is equally the case with the data held in ELNs. (citing: 15. US National Science Board report, Digital Research Data Sharing and Management, Dec 2011 Appendix F Standards and interoperability enable data-intensive science. http://www.nsf.gov/nsb/publications/2011/nsb1124.pdf, accessed 10/07/2013.)

It is trivially true that: “…a common framework for understanding data and data structures …[would] address the heterogeneity of datasets.”

Yes, yes a common framework for data and data structures would solve the heterogeneity issues with datasets.

What is surprising is that no one had that idea up until now. 😉

I won’t recite the history of failed attempts at common frameworks for data and data structures here. To the extent that communities do adopt common practices or standards, those do help. Unfortunately there have never been any universal ones.

Or should I say there have never been any proposals for universal frameworks that succeeded in becoming universal? That’s more accurate. We have not lacked for proposals for universal frameworks.

That isn’t to say this is a bad proposal. But it will be only one of many proposals for integrating electronic laboratory notebook records, which leaves the further task of integrating those integration systems still to be done.

BTW, if you are interested in further details, see the article and the XML schema at: http://www.dial-a-molecule.org/wp/blog/2013/08/elnitemmanifest-a-metadata-schema-for-accessing-and-processing-eln-records/.

December 15, 2013

Aberdeen – 1398 to Present

Filed under: Archives,Government Data,History,Semantics — Patrick Durusau @ 8:58 pm

A Text Analytic Approach to Rural and Urban Legal Histories

From the post:

Aberdeen has the earliest and most complete body of surviving records of any Scottish town, running in near-unbroken sequence from 1398 to the present day. Our central focus is on the ‘provincial town’, especially its articulations and interactions with surrounding rural communities, infrastructure and natural resources. In this multi-disciplinary project, we apply text analytical tools to digitised Aberdeen Burgh Records, which are a UNESCO listed cultural artifact. The meaningful content of the Records is linguistically obscured, so must be interpreted. Moreover, to extract and reuse the content with Semantic Web and linked data technologies, it must be machine readable and richly annotated. To accomplish this, we develop a text analytic tool that specifically relates to the language, content, and structure of the Records. The result is an accessible, flexible, and essential precursor to the development of Semantic Web and linked data applications related to the Records. The applications will exploit the artifact to promote Aberdeen Burgh and Shire cultural tourism, curriculum development, and scholarship.

The scholarly objective of this project is to develop the analytic framework, methods, and resource materials to apply a text analytic tool to annotate and access the content of the Burgh records. Amongst the text analytic issues to address in historical perspective are: the identification and analysis of legal entities, events, and roles; and the analysis of legal argumentation and reasoning. Amongst the legal historical issues are: the political and legal culture and authority in the Burgh and Shire, particularly pertaining to the management and use of natural resources. Having an understanding of these issues and being able to access them using Semantic Web/linked data technologies will then facilitate exploitation in applications.

This project complements a distinct, existing collaboration between the Aberdeen City & Aberdeenshire Archives (ACAA) and the University (Connecting and Projecting Aberdeen’s Burgh Records, jointly led by Andrew Mackillop and Jackson Armstrong) (the RIISS Project), which will both make a contribution to the project (see details on application form). This multi-disciplinary application seeks funding from Dot.Rural chiefly for the time of two specialist researchers: a Research Fellow to interpret the multiple languages, handwriting scripts, archaic conventions, and conceptual categories emerging from these records; and subcontracting the A-I to carry out the text analytic and linked data tasks on a given corpus of previously transcribed council records, taking the RF’s interpretation as input.

Now there’s a project for tracking changing semantics over the hills and valleys of time!

Will be interesting to see how they capture semantics that are alien to our own.

Or how they preserve relationships between ancient semantic concepts.

December 14, 2013

Everything is Editorial:…

Filed under: Algorithms,Law,Legal Informatics,Search Algorithms,Searching,Semantics — Patrick Durusau @ 7:57 pm

Everything is Editorial: Why Algorithms are Hand-Made, Human, and Not Just For Search Anymore by Aaron Kirschenfeld.

From the post:

Down here in Durham, NC, we have artisanal everything: bread, cheese, pizza, peanut butter, and of course coffee, coffee, and more coffee. It’s great—fantastic food and coffee, that is, and there is no doubt some psychological kick from knowing that it’s been made carefully by skilled craftspeople for my enjoyment. The old ways are better, at least until they’re co-opted by major multinational corporations.

Aside from making you either hungry or jealous, or perhaps both, why am I talking about fancy foodstuffs on a blog about legal information? It’s because I’d like to argue that algorithms are not computerized, unknowable, mysterious things—they are produced by people, often painstakingly, with a great deal of care. Food metaphors abound, helpfully I think. Algorithms are the “special sauce” of many online research services. They are sets of instructions to be followed and completed, leading to a final product, just like a recipe. Above all, they are the stuff of life for the research systems of the near future.

Human Mediation Never Went Away

When we talk about algorithms in the research community, we are generally talking about search or information retrieval (IR) algorithms. A recent and fascinating VoxPopuLII post by Qiang Lu and Jack Conrad, “Next Generation Legal Search – It’s Already Here,” discusses how these algorithms have become more complicated by considering factors beyond document-based, topical relevance. But I’d like to step back for a moment and head into the past for a bit to talk about the beginnings of search, and the framework that we have viewed it within for the past half-century.

Many early information-retrieval systems worked like this: a researcher would come to you, the information professional, with an information need, that vague and negotiable idea which you would try to reduce to a single question or set of questions. With your understanding of Boolean search techniques and your knowledge of how the document corpus you were searching was indexed, you would then craft a search for the computer to run. Several hours later, when the search was finished, you would be presented with a list of results, sometimes ranked in order of relevance and limited in size because of a lack of computing power. Presumably you would then share these results with the researcher, or perhaps just turn over the relevant documents and send him on his way. In the academic literature, this was called “delegated search,” and it formed the background for the most influential information retrieval studies and research projects for many years—the Cranfield Experiments. See also “On the History of Evaluation in IR” by Stephen Robertson (2008).

In this system, literally everything—the document corpus, the index, the query, and the results—were mediated. There was a medium, a middle-man. The dream was to some day dis-intermediate, which does not mean to exhume the body of the dead news industry. (I feel entitled to this terrible joke as a former journalist… please forgive me.) When the World Wide Web and its ever-expanding document corpus came on the scene, many thought that search engines—huge algorithms, basically—would remove any barrier between the searcher and the information she sought. This is “end-user” search, and as algorithms improved, so too would the system, without requiring the searcher to possess any special skills. The searcher would plug a query, any query, into the search box, and the algorithm would present a ranked list of results, high on both recall and precision. Now, the lack of human attention, evidenced by the fact that few people ever look below result 3 on the list, became the limiting factor, instead of the lack of computing power.

[Image: delegated search]

The only problem with this is that search engines did not remove the middle-man—they became the middle-man. Why? Because everything, whether we like it or not, is editorial, especially in reference or information retrieval. Everything, every decision, every step in the algorithm, everything everywhere, involves choice. Search engines, then, are never neutral. They embody the priorities of the people who created them and, as search logs are analyzed and incorporated, of the people who use them. It is in these senses that algorithms are inherently human.

A delightful piece on search algorithms that touches at the heart of successful search and/or data integration.

Its first three words capture the issue: Everything is Editorial….

Despite the pretensions of scholars, sages and rogues, everything is editorial; there are no universal semantic primitives.

For convenience in data processing we may choose to treat some tokens as semantic primitives, but that is always a choice that we make.

Once you make that leap, it comes as no surprise that owl:sameAs wasn’t used the same way by everyone who used it.

See: When owl:sameAs isn’t the Same: An Analysis of Identity Links on the Semantic Web by Harry Halpin, Ivan Herman, and Patrick J. Hayes, for one take on the confusion around owl:sameAs.

If you are interested in moving beyond opaque keyword searching, consider Aaron’s post carefully.

November 24, 2013

Not all citations are equal:… [Semantic Triage?]

Filed under: Citation Analysis,Citation Practices,Semantics,Topic Maps — Patrick Durusau @ 4:06 pm

Not all citations are equal: identifying key citations automatically by Daniel Lemire.

From the post:

Suppose that you are researching a given issue. Maybe you have a medical condition or you are looking for the best algorithm to solve your current problem.

A good heuristic is to enter reasonable keywords in Google Scholar. This will return a list of related research papers. If you are lucky, you may even have access to the full text of these research papers.

Is that good enough? No.

Scholarship, on the whole, tends to improve with time. More recent papers incorporate the best ideas from past work and correct mistakes. So, if you have found a given research paper, you’d really want to also get a list of all papers building on it…

Thankfully, a tool like Google Scholar allows you to quickly access a list of papers citing a given paper.

Great, right? So you just pick your research paper and review the papers citing them.

If you have ever done this work, you know that most of your effort will be wasted. Why? Because most citations are shallow. Almost none of the citing papers will build on the paper you picked. In fact, many researchers barely even read the papers that they cite.

Ideally, you’d want Google Scholar to automatically tell apart the shallow citations from the real ones.

The paper of the same title is due to appear in JASIST.

The abstract:

The importance of a research article is routinely measured by counting how many times it has been cited. However, treating all citations with equal weight ignores the wide variety of functions that citations perform. We want to automatically identify the subset of references in a bibliography that have a central academic influence on the citing paper. For this purpose, we examine the effectiveness of a variety of features for determining the academic influence of a citation.

By asking authors to identify the key references in their own work, we created a dataset in which citations were labeled according to their academic influence. Using automatic feature selection with supervised machine learning, we found a model for predicting academic influence that achieves good performance on this dataset using only four features.

The best features, among those we evaluated, were features based on the number of times a reference is mentioned in the body of a citing paper. The performance of these features inspired us to design an influence-primed h-index (the hip-index). Unlike the conventional h-index, it weights citations by how many times a reference is mentioned. According to our experiments, the hip-index is a better indicator of researcher performance than the conventional h-index.
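
To make the idea concrete, here is a minimal sketch (mine, not the authors’) of a conventional h-index alongside a mention-weighted variant. The weighting rule below, counting only references mentioned more than once, is a stand-in for whatever weighting the paper actually uses:

    def h_index(citation_counts):
        # Largest h such that at least h papers have >= h citations each.
        counts = sorted(citation_counts, reverse=True)
        return sum(1 for rank, c in enumerate(counts, start=1) if c >= rank)

    def hip_index(mention_counts_per_paper):
        # mention_counts_per_paper[i] = mention counts, one per paper citing paper i.
        # Weight citations by how often the reference is mentioned in the citing body;
        # here a citation only counts if mentioned more than once (a hypothetical rule).
        weighted = [sum(1 for m in mentions if m > 1)
                    for mentions in mention_counts_per_paper]
        return h_index(weighted)

    print(h_index([10, 8, 5, 4, 3]))               # 4
    print(hip_index([[3, 1, 1], [2, 2], [1, 1]]))  # 1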

What I find intriguing is the potential for this type of research to enable a type of semantic triage when creating topic maps or other semantic resources.

If only three out of thirty citations in a paper are determined to be “influential,” why should I use scarce resources to capture the other twenty-seven as completely as the influential ones?

The corollary to Daniel’s “not all citations are equal,” is that “not all content is equal.”

We already make those sort of choices when we select some citations from the larger pool of possible citations.

I’m just suggesting that we make that decision explicit when creating semantic resources.

PS: I wonder how Daniel’s approach would work with opinions rendered in legal cases. Courts often cite an entire block of prior decisions but no particular rule or fact from any of them. It could reduce the overhead of tracking influential prior case decisions.

November 3, 2013

Penguins in Sweaters…

Filed under: Searching,Semantics,Serendipity — Patrick Durusau @ 8:38 pm

Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content by Ilaria Bordino, Yelena Mejova and Mounia Lalmas.

Abstract:

In many cases, when browsing the Web users are searching for specific information or answers to concrete questions. Sometimes, though, users find unexpected, yet interesting and useful results, and are encouraged to explore further. What makes a result serendipitous? We propose to answer this question by exploring the potential of entities extracted from two sources of user-generated content – Wikipedia, a user-curated online encyclopedia, and Yahoo! Answers, a more unconstrained question/answering forum – in promoting serendipitous search. In this work, the content of each data source is represented as an entity network, which is further enriched with metadata about sentiment, writing quality, and topical category. We devise an algorithm based on lazy random walk with restart to retrieve entity recommendations from the networks. We show that our method provides novel results from both datasets, compared to standard web search engines. However, unlike previous research, we find that choosing highly emotional entities does not increase user interest for many categories of entities, suggesting a more complex relationship between topic matter and the desirable metadata attributes in serendipitous search.

From the introduction:

A system supporting serendipity must provide results that are surprising, semantically cohesive, i.e., relevant to some information need of the user, or just interesting. In this paper, we tackle the question of what makes a result serendipitous.
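
The retrieval step the abstract mentions, a random walk with restart over an entity network, is easy to sketch. A toy version in Python (my own illustration, with a made-up five-entity graph, not the authors’ “lazy” variant or their enriched networks):

    import numpy as np

    # Minimal random walk with restart (RWR) over a tiny, illustrative entity graph.
    entities = ["penguin", "sweater", "oil spill", "antarctica", "knitting"]
    adj = np.array([[0, 1, 1, 1, 0],
                    [1, 0, 1, 0, 1],
                    [1, 1, 0, 0, 0],
                    [1, 0, 0, 0, 0],
                    [0, 1, 0, 0, 0]], dtype=float)
    P = adj / adj.sum(axis=1, keepdims=True)   # row-stochastic transition matrix

    def rwr(seed, restart=0.15, iters=100):
        r = np.zeros(len(entities)); r[seed] = 1.0   # restart distribution
        p = r.copy()
        for _ in range(iters):
            p = (1 - restart) * P.T @ p + restart * r
        return p

    scores = rwr(entities.index("penguin"))
    for name, s in sorted(zip(entities, scores), key=lambda x: -x[1]):
        print(f"{name}: {s:.3f}")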

Serendipity, now that would make a very interesting product demonstration!

In particular if the search results were interesting to the client.

I must admit when I saw the first part of the title I was expecting an article on Linux. 😉


October 21, 2013

Denotational Semantics

Filed under: Denotational Semantics,Programming,Semantics — Patrick Durusau @ 3:53 pm

Denotational Semantics: A Methodology for Language Development by David A. Schmidt.

From the Preface:

Denotational semantics is a methodology for giving mathematical meaning to programming languages and systems. It was developed by Christopher Strachey’s Programming Research Group at Oxford University in the 1960s. The method combines mathematical rigor, due to the work of Dana Scott, with notational elegance, due to Strachey. Originally used as an analysis tool, denotational semantics has grown in use as a tool for language design and implementation.

This book was written to make denotational semantics accessible to a wider audience and to update existing texts in the area. I have presented the topic from an engineering viewpoint, emphasizing the descriptional and implementational aspects. The relevant mathematics is also included, for it gives rigor and validity to the method and provides a foundation for further research.

The book is intended as a tutorial for computing professionals and as a text for university courses at the upper undergraduate or beginning graduate level. The reader should be acquainted with discrete structures and one or more general purpose programming languages. Experience with an applicative-style language such as LISP, ML, or Scheme is also helpful.

You can document the syntax of a programming language using some variation of BNF.

Documenting the semantics of a programming language is a bit tougher.

Denotational semantics is one approach. Other approaches include: Axiomatic semantics and Operational semantics.
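
To give a flavour of the denotational approach, here is a toy sketch in Python: each syntactic form of a tiny expression language is mapped to a mathematical function from environments to values. (An illustration only; Schmidt’s book develops the real machinery, domains and all.)

    # Each "meaning" is a function from an environment (dict of variable values)
    # to a number -- the denotation of the expression.
    def num(n):      return lambda env: n
    def var(x):      return lambda env: env[x]
    def add(e1, e2): return lambda env: e1(env) + e2(env)
    def let(x, e1, e2):
        # Denotation of "let x = e1 in e2": evaluate e2 in an extended environment.
        return lambda env: e2({**env, x: e1(env)})

    # [[ let y = 3 + x in y + y ]] in the environment { x: 4 }
    prog = let("y", add(num(3), var("x")), add(var("y"), var("y")))
    print(prog({"x": 4}))   # 14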

Even if you are not interested in proving the formal correctness of a program, the mental discipline required by any of these approaches is useful.

Semantics and Delivery of Useful Information [Bills Before the U.S. House]

Filed under: Government,Government Data,Law,Semantics — Patrick Durusau @ 2:23 pm

Lars Marius Garshol pointed out in Semantic Web adoption and the users that the question “What do semantic technologies do better than non-semantic technologies?” has yet to be answered.

Tim O’Reilly tweeted about Madison Federal today, a resource that raises the semantic versus non-semantic technology question.

In a nutshell, Madison Federal has all the bills pending before the U.S. House of Representatives online.

If you login with Facebook, you can:

  • Add a bill edit / comment
  • Enter a community suggestion
  • Enter a community comment
  • Subscribe to future edits/comments on a bill

So far, so good.

You can pick any bill but the one I chose as an example is: Postal Executive Accountability Act.

I will quote just a few lines of the bill:

2. Limits on executive pay

    (a) Limitation on compensation Section 1003 of title 39, United States Code, 
         is amended:

         (1) in subsection (a), by striking the last sentence; and
         (2) by adding at the end the following:

             (e)
                  (1) Subject to paragraph (2), an officer or employee of the Postal 
                      Service may not be paid at a rate of basic pay that exceeds 
                      the rate of basic pay for level II of the Executive Schedule 
                      under section 5312 of title 5.

What would be the first thing you want to know?

Hmmm, what about subsection (a) of section 1003 of title 39 of the United States Code, since we are striking its last sentence?

39 USC § 1003 – Employment policy [Legal Information Institute], which reads:

(a) Except as provided under chapters 2 and 12 of this title, section 8G of the Inspector General Act of 1978, or other provision of law, the Postal Service shall classify and fix the compensation and benefits of all officers and employees in the Postal Service. It shall be the policy of the Postal Service to maintain compensation and benefits for all officers and employees on a standard of comparability to the compensation and benefits paid for comparable levels of work in the private sector of the economy. No officer or employee shall be paid compensation at a rate in excess of the rate for level I of the Executive Schedule under section 5312 of title 5.

OK, so now we know that (1) is striking:

No officer or employee shall be paid compensation at a rate in excess of the rate for level I of the Executive Schedule under section 5312 of title 5.

Semantics? No, just a hyperlink.

For the added text, we want to know what is meant by:

… rate of basic pay that exceeds the rate of basic pay for level II of the Executive Schedule under section 5312 of title 5.

The Legal Information Institute is already ahead of Congress because their system provides the hyperlink we need: 5312 of title 5.

If you notice something amiss when you follow that link, congratulations! You have discovered your first congressional typo and/or error.

5312 of title 5 defines Level I of the Executive Schedule, which includes the Secretary of State, Secretary of the Treasury, Secretary of Defense, Attorney General and others. Base rate for Executive Schedule Level I is $199,700.

On the other hand, 5313 of title 5 defines Level II of the Executive Schedule, which includes Department of Agriculture, Deputy Secretary of Agriculture; Department of Defense, Deputy Secretary of Defense, Secretary of the Army, Secretary of the Navy, Secretary of the Air Force, Under Secretary of Defense for Acquisition, Technology and Logistics; Department of Education, Deputy Secretary of Education; Department of Energy, Deputy Secretary of Energy and others. Base rate for Executive Schedule Level II is $178,700.

Assuming someone catches or comments that 5312 should be 5313, top earners at the Postal Service may be about to take a $21,000.00 pay reduction.

We got all that from mechanical hyperlinks, no semantic technology required.

Where you might need semantic technology is when reading 39 USC § 1003 – Employment policy [Legal Information Institute] where it says (in part):

…It shall be the policy of the Postal Service to maintain compensation and benefits for all officers and employees on a standard of comparability to the compensation and benefits paid for comparable levels of work in the private sector of the economy….

Some questions:

Question: What are “comparable levels of work in the private sector of the economy?”

Question: On what basis is work for the Postal Service compared to work in the private economy?

Question: Examples of comparable jobs in the private economy and their compensation?

Question: What policy or guideline documents have been developed by the Postal Service for evaluation of Postal Service vs. work in the private economy?

Question: What studies have been done, by whom, using what practices, comparing compensation for Postal Service work to work in the private economy?

That would be a considerable amount of information with what I suspect would be a large amount of duplication as reports or studies are cited by numerous sources.

Semantic technology would be necessary for the purpose of deduping and navigating such a body of information effectively.

Pick a bill. Where would you put the divide between mechanical hyperlinks and semantic technologies?

PS: You may remember that the House of Representatives had their own “post office” which they ran as a slush fund. The thought of the House holding someone “accountable” is too bizarre for words.

October 10, 2013

Semantics: The Next Big Issue in Big Data

Filed under: BigData,Semantics — Patrick Durusau @ 3:44 pm

Semantics: The Next Big Issue in Big Data by Glen Fest.

From the post:

The use of semantics often is a way to evade the issue at hand (i.e., Bill Clinton’s parsed definition of “is”). But in David Saul’s world of bank compliance and regulation, it’s something that can help get right to the heart of the matter.

Saul, the chief scientist at State Street Corp. in Boston, views the technology of semantics—in which data is structured in ways that it can be shared easily between bank divisions, institutions and regulators—as an ends to better understand and manage big-bank risk profiles.

“By bringing all of this data together with the semantic models, we’re going to be able to ask the questions you need to ask to prepare regulatory reporting,” as well as internal risk calculations, Saul promised at a recent forum held at the New York offices of SWIFT, the Society for Worldwide Interbank Financial Telecommunication. Saul’s championing of semantics technology was part of a wider-ranging panel discussion on the role of technology in helping banks meet the current and forthcoming compliance demands of global regulators. “That’s really what we’re doing: trying to pull risk information from a variety of different systems and platforms, written at different times by different people,” Saul says.

To bridge the underlying data, the Financial Industry Business Ontology (FIBO), a group that Saul participates in, is creating the common terms and data definitions that will put banks and regulators on the same semantic page.

What’s ironic is in the same post you find:

Semantics technology already is a proven concept as an underlying tool of the Web that requires common data formats for sites to link to one another, says Saul. At large global banks, common data infrastructure is still in most cases a work in progress, if it’s underway at all. Legacy departmental divisions have allowed different (and incompatible) data sets and systems to evolve internally, leaving banks with the heavy chore of accumulating and repurposing data for both compliance reporting and internal risk analysis.

The inability to automate or reuse data across silos is at the heart of banks’ big-data dilemma—or as Saul likes to call it, a “smart data” predicament.

I’m not really sure what having a “common data format” has to do with linking between data sets. Most websites use something close to HTML, but that doesn’t mean they can be usefully linked together.

Not to mention the “legacy departmental divisions.” What is going to happen to them and their data?

How “semantics technology” is going to deal with different and incompatible data sets isn’t clear. Change all the recorded data retroactively? How far back?

If you have any contacts in the banking industry, tell them the FIBO proposal sounds like a bad plan.

October 1, 2013

Recursive Deep Models for Semantic Compositionality…

Filed under: Machine Learning,Modeling,Semantic Vectors,Semantics,Sentiment Analysis — Patrick Durusau @ 4:12 pm

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank by Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng and Christopher Potts.

Abstract:

Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of composition. To remedy this, we introduce a Sentiment Treebank. It includes fine grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality. To address them, we introduce the Recursive Neural Tensor Network. When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single sentence positive/negative classification from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag of features baselines. Lastly, it is the only model that can accurately capture the effect of contrastive conjunctions as well as negation and its scope at various tree levels for both positive and negative phrases.

You will no doubt want to see the webpage with the demo.

Along with possibly the data set and the code.
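
For orientation, the composition step at the heart of the Recursive Neural Tensor Network combines two child vectors with both a tensor and a matrix. A minimal numpy sketch of that single step, with toy dimensions and random, untrained parameters:

    import numpy as np

    d = 4                         # word vector dimensionality (toy size)
    rng = np.random.default_rng(0)
    V = rng.standard_normal((d, 2 * d, 2 * d))   # tensor, one slice per output dimension
    W = rng.standard_normal((d, 2 * d))          # standard recursive-NN matrix

    def compose(a, b):
        # RNTN composition of two child vectors:
        # p = tanh( [a;b]^T V^[1:d] [a;b] + W [a;b] )
        c = np.concatenate([a, b])               # stacked children, shape (2d,)
        tensor_term = np.array([c @ V[k] @ c for k in range(d)])
        return np.tanh(tensor_term + W @ c)

    parent = compose(rng.standard_normal(d), rng.standard_normal(d))
    print(parent.shape)   # (4,)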

I was surprised by “fine-grained sentiment labels” meaning:

  1. Positive
  2. Somewhat positive
  3. Neutral
  4. Somewhat negative
  5. Negative

But then for many purposes, subject recognition on that level of granularity may be sufficient.

September 24, 2013

Rumors of Legends (the TMRM kind?)

Filed under: Bioinformatics,Biomedical,Legends,Semantics,TMRM,XML — Patrick Durusau @ 3:42 pm

BioC: a minimalist approach to interoperability for biomedical text processing (numerous authors, see the article).

Abstract:

A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/.

From the introduction:

With the proliferation of natural language text, text mining has emerged as an important research area. As a result many researchers are developing natural language processing (NLP) and information retrieval tools for text mining purposes. However, while the capabilities and the quality of tools continue to grow, it remains challenging to combine these into more complex systems. Every new generation of researchers creates their own software specific to their research, their environment and the format of the data they study; possibly due to the fact that this is the path requiring the least labor. However, with every new cycle restarting in this manner, the sophistication of systems that can be developed is limited. (emphasis added)

That is the experience with creating electronic versions of the Hebrew Bible. Every project has started from a blank screen, requiring re-proofing of the same text, etc. As a result, there is no electronic encoding of the masora magna (think long margin notes). Duplicated effort has a real cost to scholarship.

The authors stray into legend land when they write:

Our approach to these problems is what we would like to call a ‘minimalist’ approach. How ‘little’ can one do to obtain interoperability? We provide an extensible mark-up language (XML) document type definition (DTD) defining ways in which a document can contain text, annotations and relations. Major XML elements may contain ‘infon’ elements, which store key-value pairs with any desired semantic information. We have adapted the term ‘infon’ from the writings of Devlin (1), where it is given the sense of a discrete item of information. An associated ‘key’ file is necessary to define the semantics that appear in tags such as the infon elements. Key files are simple text files where the developer defines the semantics associated with the data. Different corpora or annotation sets sharing the same semantics may reuse an existing key file, thus representing an accepted standard for a particular data type. In addition, key files may describe a new kind of data not seen before. At this point we prescribe no semantic standards. BioC users are encouraged to create their own key files to represent their BioC data collections. In time, we believe, the most useful key files will develop a life of their own, thus providing emerging standards that are naturally adopted by the community.
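
To make the “infon” idea concrete, here is a minimal sketch of writing and reading such key-value annotations with standard Python XML tooling. The element and key names are loosely modeled on the description above, so check the DTD at bioc.sourceforge.net for the exact structure:

    import xml.etree.ElementTree as ET

    # Build a minimal BioC-style fragment: a document with one passage and one
    # annotation, each carrying "infon" key-value pairs (names are illustrative).
    doc = ET.Element("document")
    ET.SubElement(doc, "id").text = "PMC12345"
    passage = ET.SubElement(doc, "passage")
    ET.SubElement(passage, "infon", key="type").text = "abstract"
    ET.SubElement(passage, "text").text = "BRCA1 mutations are linked to breast cancer."
    ann = ET.SubElement(passage, "annotation", id="A1")
    ET.SubElement(ann, "infon", key="type").text = "gene"
    ET.SubElement(ann, "text").text = "BRCA1"

    xml_bytes = ET.tostring(doc)

    # Reading it back: the semantics of each infon key live in the separate key file.
    for infon in ET.fromstring(xml_bytes).iter("infon"):
        print(infon.get("key"), "=", infon.text)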

The “key files” don’t specify subject identities for the purposes of merging. But defining the semantics of data is a first step in that direction.

I like the idea of popular “key files” (read legends) taking on a life of their own due to their usefulness. An economic activity based on reducing the friction in using or re-using data. That should have legs.

BTW, don’t overlook the author’s data and code, available at: http://bioc.sourceforge.net/.

September 16, 2013

Self Organizing Maps

Filed under: Self Organizing Maps (SOMs),Semantics — Patrick Durusau @ 4:48 pm

Self Organizing Maps by Giuseppe Vettigli.

From the post:

The Self Organizing Maps (SOM), also known as Kohonen maps, are a type of Artificial Neural Networks able to convert complex, nonlinear statistical relationships between high-dimensional data items into simple geometric relationships on a low-dimensional display. In a SOM the neurons are organized in a bidimensional lattice and each neuron is fully connected to all the source nodes in the input layer. An illustration of the SOM by Haykin (1996) is the following

If you are looking for self organizing maps using Python, this is the right place.
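
If you just want the mechanics, the core SOM update, pulling a small lattice of weight vectors toward random inputs, fits in a short numpy sketch (toy parameters of my choosing; Vettigli’s post builds a fuller implementation):

    import numpy as np

    rng = np.random.default_rng(42)
    grid_w, grid_h, dim = 8, 8, 3                    # 8x8 lattice of 3-D weight vectors
    weights = rng.random((grid_w, grid_h, dim))
    coords = np.argwhere(np.ones((grid_w, grid_h)))  # lattice coordinates, shape (64, 2)

    data = rng.random((500, dim))                    # toy input data (e.g. RGB colours)
    sigma0, lr0, iters = 3.0, 0.5, 2000

    for t in range(iters):
        x = data[rng.integers(len(data))]
        # Best matching unit: the neuron whose weight vector is closest to x.
        bmu = np.unravel_index(np.argmin(((weights - x) ** 2).sum(axis=2)),
                               (grid_w, grid_h))
        # Neighbourhood width and learning rate decay over time.
        sigma = sigma0 * np.exp(-t / iters)
        lr = lr0 * np.exp(-t / iters)
        dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=1).reshape(grid_w, grid_h)
        h = np.exp(-dist2 / (2 * sigma ** 2))[..., None]
        weights += lr * h * (x - weights)

    print(weights.shape)   # the trained 8x8x3 map, viewable as an image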

As with all mathematical techniques, SOMs require the author to bridge the gap between semantics and discrete values for processing.

An iffy process at best.

September 11, 2013

Input Requested: Survey on Legislative XML

Filed under: Law - Sources,Legal Informatics,Semantics — Patrick Durusau @ 5:15 pm

Input Requested: Survey on Legislative XML

A request for survey participants who are familiar with XML and law. To comment on the Crown Legislative Markup Language (CLML) which is used for the content at: legislation.gov.uk.

Background:

By way of background, the Crown Legislation Mark-up Language (CLML) is used to represent UK legislation in XML. It’s the base format for all legislation published on the legislation.gov.uk website. We make both the schema and all our data freely available for anyone to use, or re-use, under the UK government’s Open Government Licence. CLML is currently expressed as a W3C XML Schema which is owned and maintained by The National Archives. A version of the schema can be accessed online at http://www.legislation.gov.uk/schema/legislation.xsd . Legislation as CLML XML can be accessed from the website using the legislation.gov.uk API. Simply add “/data.xml” to any legislation content page, e.g. http://www.legislation.gov.uk/ukpga/2010/1/data.xml .
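
A minimal sketch of pulling one bill’s CLML with Python (assuming the requests library is installed; I make no assumptions about CLML element names, the sketch just inspects what comes back):

    import requests
    import xml.etree.ElementTree as ET

    # Fetch one item of legislation as CLML by appending /data.xml, as described above.
    url = "http://www.legislation.gov.uk/ukpga/2010/1/data.xml"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()

    root = ET.fromstring(resp.content)
    print("Root element:", root.tag)

    # List distinct element names (namespaces stripped) to get a feel for the schema.
    names = {el.tag.split("}")[-1] for el in root.iter()}
    print(sorted(names)[:20])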

Why is this important for topic maps?

Would you believe that the markup semantics of CLML are different from the semantics of United States Legislative Markup (USLM)?

That’s just markup syntax differences. Hard to say what substantive semantic variations are in the laws themselves.

Mapping legal semantics becomes important when the United States claims extraterritorial jurisdiction for the application of its laws.

Or when the United States uses its finance laws to inflict harm on others. (Treasury’s war: the unleashing of a new era of financial warfare by Juan Carlos Zarate.)

Mapping legal semantics won’t make U.S. claims any less extreme but may help convince others of a clear and present danger.

August 29, 2013

DSLs and Towers of Abstraction

Filed under: DSL,Logic,Mathematics,Semantics — Patrick Durusau @ 6:04 pm

DSLs and Towers of Abstraction by Gershom Bazerman.

From the description:

This talk will sketch some connections at the foundations of semantics (of programming languages, logics, formal systems in general). In various degrees of abbreviation, we will present Galois Connections, Lawvere Theories, adjoint functors and their relationship to syntax and semantics, and the core notion behind abstract interpretation. At each step we’ll draw connections, trying to show why these are good tools to think with even as we’re solving real world problems and building tools and libraries others will find simple and elegant to use.
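
As one tiny, concrete instance of the Galois-connection idea (my example, not the speaker’s): the classic sign abstraction from abstract interpretation, where abstraction and concretization are the two halves of the connection and abstract arithmetic is sound but lossy.

    # Abstraction: map a set of integers to a sign ("neg", "zero", "pos", "top", "bottom").
    def alpha(ints):
        signs = {"neg" if n < 0 else "zero" if n == 0 else "pos" for n in ints}
        if not signs:
            return "bottom"
        return signs.pop() if len(signs) == 1 else "top"

    # Concretization: map a sign back to the (possibly infinite) set it stands for,
    # represented here as a predicate over integers.
    GAMMA = {
        "bottom": lambda n: False,
        "neg":    lambda n: n < 0,
        "zero":   lambda n: n == 0,
        "pos":    lambda n: n > 0,
        "top":    lambda n: True,
    }

    # Abstract multiplication: sound but imprecise, exactly the trade-off
    # abstract interpretation formalizes.
    def abs_mul(a, b):
        if "bottom" in (a, b):  return "bottom"
        if "zero" in (a, b):    return "zero"
        if "top" in (a, b):     return "top"
        return "pos" if a == b else "neg"

    print(alpha({-3, -7}))                     # neg
    print(abs_mul(alpha({-3}), alpha({4})))    # neg
    print(GAMMA[abs_mul("pos", "pos")](12))    # True: 3*4 is covered by "pos"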

Further reading:

logicmatters.net/resources/pdfs/Galois.pdf
dpmms.cam.ac.uk/~martin/Research/Publications/2007/hp07.pdf
tac.mta.ca/tac/reprints/articles/5/tr5abs.html

If your mind has gotten flabby over the summer, this presentation will start to get it back in shape.

You may get swept along in the speaker’s enthusiasm.

Very high marks!

August 17, 2013

AT4AM: The XML Web Editor Used By…

Filed under: Editor,EU,Semantics — Patrick Durusau @ 4:27 pm

AT4AM: The XML Web Editor Used By Members Of European Parliament

From the post:

AT4AM – Authoring Tool for Amendments – is a web editor provided to Members of European Parliament (MEPs) that has greatly improved the drafting of amendments at European Parliament since its introduction in 2010.

The tool, developed by the Directorate for Innovation and Technological Support of European Parliament (DG ITEC) has replaced a system based on a collection of macros developed in MS Word and specific ad hoc templates.

Moving beyond guessing the semantics of an author depends upon those semantics being documented at the point of creation.

Having said that, I think we all acknowledge that for the average user, RDF and its kin were DOA.

Interfaces such as AT4AM, if they can be extended to capture the semantics of their authors, would be a step in the right direction.

BTW, see the AT4AM homepage, complete with live demo.

August 16, 2013

Semantic Computing of Moods…

Filed under: Music,Music Retrieval,Semantics,Tagging — Patrick Durusau @ 4:46 pm

Semantic Computing of Moods Based on Tags in Social Media of Music by Pasi Saari, Tuomas Eerola. (IEEE Transactions on Knowledge and Data Engineering, 2013; : 1 DOI: 10.1109/TKDE.2013.128)

Abstract:

Social tags inherent in online music services such as Last.fm provide a rich source of information on musical moods. The abundance of social tags makes this data highly beneficial for developing techniques to manage and retrieve mood information, and enables study of the relationships between music content and mood representations with data substantially larger than that available for conventional emotion research. However, no systematic assessment has been done on the accuracy of social tags and derived semantic models at capturing mood information in music. We propose a novel technique called Affective Circumplex Transformation (ACT) for representing the moods of music tracks in an interpretable and robust fashion based on semantic computing of social tags and research in emotion modeling. We validate the technique by predicting listener ratings of moods in music tracks, and compare the results to prediction with the Vector Space Model (VSM), Singular Value Decomposition (SVD), Nonnegative Matrix Factorization (NMF), and Probabilistic Latent Semantic Analysis (PLSA). The results show that ACT consistently outperforms the baseline techniques, and its performance is robust against a low number of track-level mood tags. The results give validity and analytical insights for harnessing millions of music tracks and associated mood data available through social tags in application development.

These results make me wonder whether tagging represents the average semantic resolution that users want.

Obviously a musician or musicologist would want far finer and sharper distinctions, at least for music of interest to them. Or substitute the domain of your choice. Domain experts want precision, while the average user muddles along with coarser divisions.

We already know from Karen Drabenstott’s work (Subject Headings and the Semantic Web) that library classification systems are too complex for the average user and even most librarians.

On the other hand, we all have some sense of the wasted time and effort caused by the uncharted semantic sea where Google and others practice catch and release with semantic data.

Some of the unanswered questions that remain:

How much semantic detail is enough?

For which domains?

Who will pay for gathering it?

What economic model is best?

August 7, 2013

Extremely Large Images: Considerations for Contemporary Approach

Filed under: Astroinformatics,Semantics — Patrick Durusau @ 6:53 pm

Extremely Large Images: Considerations for Contemporary Approach by Bruce Berriman.

From the post:

This is the title of a paper by Kitaeff, Wicenec, Wu and Taubman recently posted on astro-ph. The paper addresses the issues of accessing and interacting with very large data-cube images that will be produced by the next generation of radio telescopes such as the Square Kilometer Array (SKA), the Low Frequency Array for Radio Astronomy (LOFAR) and others. Individual images may be TB-sized, and one SKA Reference Mission Project, “Galaxy Evolution in the Nearby Universe: HI Observations,” will generate individual images of 70-90 TB each.

Data sets this large cannot reside on local disks, even with anticipated advances in storage and network technology. Nor will any new lossless compression techniques that preserve the low S/N of the data save the day, for the act of decompression will impose excessive computational demands on servers and clients.

(emphasis added)

Yes, you read that correctly: “generate individual images of 70-90 TB each.”

Looks like the SW/WWW is about to get a whole lot smaller, comparatively speaking.

But the data you will be encountering will be getting larger. A lot larger.

Bear in mind that the semantics we associate with data will be getting larger as well.

Read that carefully, especially the part about “…we associate with data…”

Data may appear to have intrinsic semantics, but only because we project semantics onto it without acknowledging the projection.

The more data we have, the more space there is for semantic projection, by everyone who views the data.

Whose views/semantics do you want to capture?

August 5, 2013

Semantic Parsing with Combinatory Categorial Grammars

Filed under: Parsing,Semantics — Patrick Durusau @ 10:19 am

Semantic Parsing with Combinatory Categorial Grammars by Yoav Artzi, Nicholas FitzGerald and Luke Zettlemoyer.

Slides from an ACL tutorial, 2013. Three hundred and fifty-one (351) slides.

You may want to also visit: The University of Washington Semantic Parsing Framework v1.3 site where you can download source or binary files.

The ACL wiki introduces combinatory categorial grammars with:

Combinatory Categorial Grammar (CCG) is an efficiently parseable, yet linguistically expressive grammar formalism. It has a completely transparent interface between surface syntax and underlying semantic representation, including predicate-argument structure, quantification and information structure. CCG relies on combinatory logic, which has the same expressive power as the lambda calculus, but builds its expressions differently.

The first linguistic and psycholinguistic arguments for basing the grammar on combinators were put forth by Mark Steedman and Anna Szabolcsi. More recent proponents of the approach are Jacobson and Baldridge. For example, the combinator B (the compositor) is useful in creating long-distance dependencies, as in “Who do you think Mary is talking about?” and the combinator W (the duplicator) is useful as the lexical interpretation of reflexive pronouns, as in “Mary talks about herself”. Together with I (the identity mapping) and C (the permutator) these form a set of primitive, non-interdefinable combinators. Jacobson interprets personal pronouns as the combinator I, and their binding is aided by a complex combinator Z, as in “Mary lost her way”. Z is definable using W and B.

CCG is known to define the same language class as tree-adjoining grammar, linear indexed grammar, and head grammar, and is said to be mildly context-sensitive.

One of the key publications of CCG is The Syntactic Process by Mark Steedman. There are various efficient parsers available for CCG.
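
The combinators named above are easy to write down as higher-order functions. A minimal Python sketch, using curried one-argument functions purely for illustration:

    # The primitive combinators mentioned above, written as curried functions.
    I = lambda x: x                            # identity
    B = lambda f: lambda g: lambda x: f(g(x))  # compositor: B f g x = f (g x)
    C = lambda f: lambda x: lambda y: f(y)(x)  # permutator: C f x y = f y x
    W = lambda f: lambda x: f(x)(x)            # duplicator: W f x = f x x

    add = lambda a: lambda b: a + b
    print(B(len)(str)(12345))   # 5: compose len with str
    print(C(add)(1)(10))        # 11: arguments swapped
    print(W(add)(7))            # 14: argument duplicated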

The ACL wiki page also lists other software packages and references.

Machine parsing/searching are absolute necessities if you want to create topic maps on a human scale. (Web Scale? Or do you want to try for human scale?)

To surpass current search results, build correction/interaction with users directly into your interface, so that search results “get smarter” the more your interface is used.

In contrast to the pagerank/lemming approach to document searching.

August 3, 2013

Semantic Search… [Call for Papers]

Filed under: Marketing,Publishing,Semantics,Topic Maps — Patrick Durusau @ 3:52 pm

Semantic Search – Call for Papers for special issue of Aslib Journal of Information Management by Fran Alexander.

From the post:

I am currently drafting the Call for Papers for a special issue of the Aslib Journal of Information Management (formerly Aslib Proceedings) which I am guest editing alongside Dr Ulrike Spree from the University of Hamburg.

Ulrike is the academic expert, while I am providing the practitioner perspective. I am very keen to include practical case studies, so if you have an interesting project or comments on a project but have never written an academic paper before, don’t be put off. I will be happy to advise on style, referencing, etc.

Suggested Topics

Themes Ulrike is interested in include:

  • current trends in semantic search
  • best practice – how far along the road from ‘early adopters’ to ‘mainstream users’ has semantic search gone so far
  • usability of semantic search
  • visualisation and semantic search
  • the relationship between new trends in knowledge organisation and semantic search, such as vocabulary norms (like ISO 25964 “Thesauri for information retrieval“) and the potential of semantic search from a more critical perspective – what, for example, are the criteria for judging quality?

Themes I am interested in include:

  • the history of semantic search – how the latest techniques and technologies have come out of developments over the last 5, 10, 20, 100, 2000… years
  • how semantic search techniques and technologies are being used in practice
  • how semantic technologies are fostering a need for cross-industry collaboration and standardization
  • practical problems in brokering consensus and agreement – defining terms and classes, etc.
  • differences between web-scale, enterprise scale, and collection-specific scale techniques
  • curation and management of ontologies.

However, we are open to suggestions, especially as it is such a broad topic, there are so many aspects that could be covered.

Fran doesn’t mention a deadline but I will ask and update here when I get it.

Sounds like a venue that would welcome papers on topic maps.

Yes?

August 2, 2013

Frontiers in Massive Data Analysis

Filed under: BigData,Semantics — Patrick Durusau @ 3:42 pm

Frontiers in Massive Data Analysis from the National Research Council.

Two of the conclusions reached by the NRC read:

  • While there are many sources of data that are currently fueling the rapid growth in data volume, a few forms of data create particularly interesting challenges. First, much current data involves human language and speech, and increasingly the goal with such data is to extract aspects of the semantic meaning underlying the data. Examples include sentiment analysis, topic models of documents, relational modeling, and the full-blown semantic analyses required by question-answering systems. Second, video and image data are increasingly prevalent, creating a range of challenges in large-scale compression, image processing, computational vision, and semantic analysis. Third, data are increasingly labeled with geo-spatial and temporal tags, creating challenges in maintaining coherence across spatial scales and time. Fourth, many data sets involve networks and graphs, with inferential questions hinging on semantically rich notions such as “centrality” and “influence.” The deeper analyses required by data sources such as these involve difficult and unsolved problems in artificial intelligence and the mathematical sciences that go beyond near-term issues of scaling existing algorithms. The committee notes, however, that massive data itself can provide new leverage on such problems, with machine translation of natural language a frequently cited example.
  • Massive data analysis creates new challenges at the interface between humans and computers. As just alluded to, many data sets require semantic understanding that is currently beyond the reach of algorithmic approaches and for which human input is needed. This input may be obtained from the data analyst, whose judgment is needed throughout the data analysis process, from the framing of hypotheses to the management of trade-offs (e.g., errors versus time) to the selection of questions to pursue further. It may also be obtained from crowdsourcing, a potentially powerful source of inputs that must be used with care, given the many kinds of errors and biases that can arise. In either case, there are many challenges that need to be faced in the design of effective visualizations and interfaces and, more generally, in linking human judgment with data analysis algorithms. [Emphasis added]

Sounds like they are singing the topic maps song!

I will be mining this volume for more quotes, etc.

I first saw this at: Frontiers in Massive Data Analysis.

July 15, 2013

Corporate Culture Clash:…

Filed under: Communication,Diversity,Heterogeneous Data,Language,Marketing,Semantics — Patrick Durusau @ 3:05 pm

Corporate Culture Clash: Getting Data Analysts and Executives to Speak the Same Language by Drew Rockwell

From the post:

A colleague recently told me a story about the frustration of putting in long hours and hard work, only to be left feeling like nothing had been accomplished. Architecture students at the university he attended had scrawled their frustrations on the wall of a campus bathroom…“I wanted to be an architect, but all I do is create stupid models,” wrote students who yearned to see their ideas and visions realized as staples of metropolitan skylines. I’ve heard similar frustrations expressed by business analysts who constantly face the same uphill battle. In fact, in a recent survey we did of 600 analytic professionals, some of the biggest challenges they cited were “getting MBAs to accept advanced methods”, getting executives to buy into the potential of analytics, and communicating with “pointy-haired” bosses.

So clearly, building the model isn’t enough when it comes to analytics. You have to create an analytics-driven culture that actually gets everyone paying attention, participating and realizing what analytics has to offer. But how do you pull that off? Well, there are three things that are absolutely critical to building a successful, analytics-driven culture. Each one links to the next and bridges the gap that has long divided analytics professionals and business executives.

Some snippets to attract you to this “must read:”

(…)
In the culinary world, they say you eat with your eyes before your mouth. A good visual presentation can make your mouth water, while a bad one can kill your appetite. The same principle applies when presenting data analytics to corporate executives. You have to show them something that stands out, that they can understand and that lets them see with their own eyes where the value really lies.
(…)
One option for agile integration and analytics is data discovery – a type of analytic approach that allows business people to explore data freely so they can see things from different perspectives, asking new questions and exploring new hypotheses that could lead to untold benefits for the entire organization.
(…)
If executives are ever going to get on board with analytics, the cost of their buy-in has to be significantly lowered, and the ROI has to be clear and substantial.
(…)

I did pick the most topic map “relevant” quotes but they are as valid for topic maps as for any other approach.

Seeing from different perspectives sounds like on-the-fly merging to me.

How about you?

July 10, 2013

Naming Conventions for Naming Things

Filed under: Names,Semantics — Patrick Durusau @ 3:36 pm

Naming Conventions for Naming Things by David Loshin.

From the post:

In a recent email exchange with a colleague, I have been discussing two aspects of metadata: naming conventions and taxonomies. Just as a reminder, “taxonomy” refers to the practice of organization and classification, and in this context it refers to the ways that concepts are defined and how the real-world things referred to by those concepts are logically grouped together. After pondering the email thread, which was in reference to documenting code lists and organizing the codes within particular classes, I was reminded of a selection from Lewis Carroll’s book Through the Looking Glass, at the point where the White Knight is leaving Alice in her continued journey to become a queen.

At that point, the White Knight proposes to sing Alice a song to comfort her as he leaves, and in this segment they discuss the song he plans to share:

Any of you who have been following the discussion of “default semantics” in the XTM group at LinkedIn should appreciate this post.

Your default semantics are very unlikely to be my default semantics.

What I find hard to believe is that prior different semantics are acknowledged in one breath and then a uniform semantic is proposed in the next.

Seems to me that prior semantic diversity is a good sign that today we have semantic diversity. A semantic diversity that will continue into an unlimited number of tomorrows.

Yes?

If so, shouldn’t we empower users to choose their own semantics? As opposed to ours?

July 8, 2013

Detecting Semantic Overlap and Discovering Precedents…

Detecting Semantic Overlap and Discovering Precedents in the Biodiversity Research Literature by Graeme Hirst, Nadia Talent, and Sara Scharf.

Abstract:

Scientific literature on biodiversity is longevous, but even when legacy publications are available online, researchers often fail to search it adequately or effectively for prior publications; consequently, new research may replicate, or fail to adequately take into account, previously published research. The mechanisms of the Semantic Web and methods developed in contemporary research in natural language processing could be used, in the near-term future, as the basis for a precedent-finding system that would take the text of an author’s early draft (or a submitted manuscript) and find potentially related ideas in published work. Methods would include text-similarity metrics that take different terminologies, synonymy, paraphrase, discourse relations, and structure of argumentation into account.

Footnote one (1) of the paper gives an idea of the problem the authors face:

“Natural history scientists work in fragmented, highly distributed and parochial communities, each with domain specific requirements and methodologies [Scoble 2008]. Their output is heterogeneous, high volume and typically of low impact, but with a citation half-life that may run into centuries” (Smith et al. 2009). “The cited half-life of publications in taxonomy is longer than in any other scientific discipline, and the decay rate is longer than in any scientific discipline” (Moritz 2005). Unfortunately, we have been unable to identify the study that is the basis for Moritz’s remark.

The paper explores in detail issues that have daunted various search techniques, when the material is available in electronic format at all.

The authors make a general proposal for addressing these issues that mentions the Semantic Web but deliberately omits one of its signature elements from their plan:

The other omission is semantic interpretation into a logical form, represented in XML, that draws on ontologies in the style of the original Berners-Lee, Hendler, and Lassila (2001) proposal for the Semantic Web. The problem with logical-form representation is that it implies a degree of precision in meaning that is not appropriate for the kind of matching we are proposing here. This is not to say that logical forms would be useless. On the contrary, they are employed by some approaches to paraphrase and textual entailment (section 4.1 above) and hence might appear in the system if only for that reason; but even so, they would form only one component of a broader and somewhat looser kind of semantic representation.

That’s the problem with the Semantic Web in a nutshell:

The problem with logical-form representation is that it implies a degree of precision in meaning that is not appropriate for the kind of matching we are proposing here.

What if I want to be logically precise sometimes but not others?

What if I want to be more precise in some places and less precise in others?

What if I want to have different degrees or types of imprecision?

With topic maps the question is: How im/precise do you want to be?
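
As a rough illustration of what a dial for im/precision might look like, here is a toy matcher (Python) with a small synonym table and an adjustable threshold. The synonyms, sentences and threshold are made up for illustration; this is not the authors' system.

    # Toy text-similarity matcher with adjustable precision.
    # The synonym table, sentences and threshold are illustrative assumptions only.
    SYNONYMS = {"taxon": "species", "flora": "plants"}

    def normalize(text):
        tokens = [t.strip(".,;:").lower() for t in text.split()]
        return {SYNONYMS.get(t, t) for t in tokens if t}

    def similarity(a, b):
        """Jaccard similarity over synonym-normalized token sets."""
        sa, sb = normalize(a), normalize(b)
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    def matches(a, b, threshold=0.3):
        """Raise the threshold for more precision, lower it for more recall."""
        return similarity(a, b) >= threshold

    draft = "Distribution of the taxon across wetland flora"
    prior = "Wetland plants and the distribution of this species"
    print(similarity(draft, prior), matches(draft, prior, threshold=0.3))

The threshold is the dial: one user can demand near-exact overlap, another can settle for loose topical similarity, over the same data.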

July 7, 2013

Proceedings of the 3rd Workshop on Semantic Publishing

Filed under: Publishing,Semantics — Patrick Durusau @ 7:54 pm

Proceedings of the 3rd Workshop on Semantic Publishing edited by: Alexander García Castro, Christoph Lange, Phillip Lord, and Robert Stevens.

Table of Contents

Research Papers

  1. Twenty-Five Shades of Greycite: Semantics for Referencing and Preservation, by Phillip Lord.
  2. Systematic Reviews as an Interface to the Web of (Trial) Data: using PICO as an Ontology for Knowledge Synthesis in Evidence-based Healthcare Research, by Chris Mavergames.
  3. Towards Linked Research Data: an Institutional Approach, by Najko Jahn, Florian Lier, Thilo Paul-Stueve, Christian Pietsch, Philipp Cimiano.
  4. Repurposing Benchmark Corpora for Reconstructing Provenance, by Sara Magliacane.
  5. Connections across Scientific Publications based on Semantic Annotations, by Leyla Jael García Castro, Rafael Berlanga, Dietrich Rebholz-Schuhmann, Alexander Garcia.
  6. Towards the Automatic Identification of the Nature of Citations, by Angelo Di Iorio, Andrea Giovanni Nuzzolese, Silvio Peroni.
  7. How Reliable is Your Workflow: Monitoring Decay in Scholarly Publications, by José Manuel Gómez-Pérez, Esteban García-Cuesta, Jun Zhao, Aleix Garrido, José Enrique Ruiz.

Polemics (published externally)

  1. Flash Mob Science, Open Innovation and Semantic Publishing, by Hal Warren, Bryan Dennis, Eva Winer.
  2. Science, Semantic Web and Excuses, by Idafen Santana Pérez, Daniel Garijo, Oscar Corcho.
  3. Polemic on Future of Scholarly Publishing/Semantic Publishing, by Chris Mavergames.
  4. Linked Research, by Sarven Capadisli.

The whole proceedings can also be downloaded as a single file (PDF, including title pages, preface, and table of contents).

Some reading to start your week!

June 30, 2013

The DNA Data Deluge

Filed under: BigData,Genomics,Semantics — Patrick Durusau @ 5:47 pm

The DNA Data Deluge by Michael C. Schatz & Ben Langmead.

From the post:

We’re still a long way from having anything as powerful as a Web search engine for sequencing data, but our research groups are trying to exploit what we already know about cloud computing and text indexing to make vast sequencing data archives more usable. Right now, agencies like the National Institutes of Health maintain public archives containing petabytes of genetic data. But without easy search methods, such databases are significantly underused, and all that valuable data is essentially dead. We need to develop tools that make each archive a useful living entity the way that Google makes the Web a useful living entity. If we can make these archives more searchable, we will empower researchers to pose scientific questions over much larger collections of data, enabling greater insights.

A very accessible article that makes a strong case for the “DNA Data Deluge.” Literally.

The deluge of concern to the authors is raw genetic data.

They don’t address how we will connect genetic data to the semantic quagmire of clinical data and research publications.

Genetic knowledge disconnected from clinical experience will be interesting but not terribly useful.

If you want more complex data requirements, include other intersections with our genetic makeup, such as pollution, additives, lifestyle, etc.
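
For a flavor of the “text indexing” the authors mention, here is a toy k-mer index (Python) over made-up reads. Real sequence indexes, such as the Burrows-Wheeler-based structures behind tools like Bowtie, are vastly more compact, but the lookup idea is the same.

    # Toy k-mer index over made-up sequences. Real sequence-search systems
    # use far more compact index structures than this dictionary.
    from collections import defaultdict

    reads = {
        "read1": "ACGTACGTGA",
        "read2": "TTGACGTACC",
    }

    def build_kmer_index(sequences, k=4):
        """Map every k-length substring to the sequences that contain it."""
        index = defaultdict(set)
        for name, seq in sequences.items():
            for i in range(len(seq) - k + 1):
                index[seq[i:i + k]].add(name)
        return index

    index = build_kmer_index(reads, k=4)
    print(sorted(index["ACGT"]))  # reads containing the 4-mer ACGT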

June 27, 2013

Getting $erious about $emantics

Filed under: Finance Services,Marketing,Semantics — Patrick Durusau @ 6:31 pm

State Street’s Chief Scientist on How to Tame Big Data Using Semantics by Bryan Yurcan.

From the post in Bank Systems & Technology:

Financial institutions are accumulating data at a rapid pace. Between massive amounts of internal information and an ever-growing pool of unstructured data to deal with, banks’ data management and storage capabilities are being stretched thin. But relief may come in the form of semantic databases, which could be the next evolution in how banks manage big data, says David Saul, Chief Scientist for Boston-based State Street Corp.

The semantic data model associates a meaning to each piece of data to allow for better evaluation and analysis, Saul notes, adding that given their ability to analyze relationships, semantic databases are particularly well-suited for the financial services industry.

“Our most important asset is the data we own and the data we act as a custodian for,” he says. “A lot of what we do for our customers, and what they do with the information we deliver to them, is aggregate data from different sources and correlate it to make better business decisions.”

Semantic technology, notes Saul, is based on the same technology “that all of us use on the World Wide Web, and that’s the concept of being able to hyperlink from one location to another location. Semantic technology does the same thing for linking data.”

Using a semantic database, each piece of data has a meaning associated with it, says Saul. For example, a typical data field might be a customer name. Semantic technology knows where that piece of information is in both the database and unstructured data, he says. Semantic data would then allow a financial institution to create a report or dashboard that shows all of their interactions with that customer.

“The way it’s done now, you write data extract programs and create a repository,” he says. “There’s a lot of translation that’s required.”

Semantic data can also be greatly beneficial for banks in conducting risk calculations for regulatory requirements, Saul adds.

“That is something regulators are constantly looking for us to do, they want to know what our total exposure is to a particular customer or geographic area,” he says. “That requires quite a bit of development effort, which equals time and money. With semantic technology, once you describe the data sources, you can do that very, very quickly. You don’t have to write new extract programs.”

(…)

When banks and their technology people start talking about semantics, you know serious opportunities abound.

A growing awareness of the value of the semantics of data and data structures can’t help but create market opportunities for topic maps.

Big data needs big semantics!
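
To make Saul's “linking data” point concrete, here is a toy sketch using the Python rdflib library (the URIs, systems and amounts are invented, not any bank's data model): facts recorded by different systems become triples, and one query pulls together everything known about a single customer.

    # Toy illustration of linking data from different systems as triples,
    # then aggregating everything known about one customer in a single query.
    # URIs, fields, and values are made up; this is not any bank's data model.
    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/bank/")
    g = Graph()
    g.bind("ex", EX)

    # Facts contributed by two hypothetical systems.
    g.add((EX.txn1, EX.customer, EX.acme))
    g.add((EX.txn1, EX.system, Literal("payments")))
    g.add((EX.loan7, EX.customer, EX.acme))
    g.add((EX.loan7, EX.system, Literal("lending")))
    g.add((EX.loan7, EX.exposure, Literal(250000)))

    query = """
    PREFIX ex: <http://example.org/bank/>
    SELECT ?record ?system WHERE {
        ?record ex:customer ex:acme ;
                ex:system ?system .
    }
    """
    for row in g.query(query):
        print(row.record, row.system)

No extract programs, no new repository: the two systems' facts are simply described against shared identifiers and queried in place.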

June 26, 2013

Semantic Queries. Who Knew?

Filed under: MarkLogic,Semantics — Patrick Durusau @ 10:31 am

The New Generation of Database Technology Includes Semantics and Search. David Gorbet, VP of Engineering for MarkLogic, chatted with Bloor Group Principal Robin Bloor in a recent Briefing Room.

From near the end of the interview:

There’s still a lot of opportunity to light up new scenarios for our customers. That’s why we’re excited about our semantics capabilities in MarkLogic 7. We believe that semantics technology is the next generation of search and discovery, allowing queries based on the concepts you’re looking for and not just the words and phrases. MarkLogic 7 will be the only database to allow semantics queries combined with document search and element/value queries all in one place. Our customers are excited about this.

Need to watch the marketing literature from MarkLogic for riffs and themes to repeat for topic map-based solutions.

Not to mention that topic maps can point into, and add semantics to, existing data stores and their contents.

Re-using current data stores sounds more attractive than ripping out all your data to migrate to another platform.

Yes?
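
For what “semantics queries combined with document search and element/value queries” might look like in miniature, here is a generic Python sketch. It is not MarkLogic's API, just an illustration of the kind of combined filtering Gorbet describes, with made-up documents and fields.

    # Generic illustration (not MarkLogic's API) of combining a concept-level
    # constraint, a full-text constraint, and an element/value constraint
    # over the same set of documents.
    documents = [
        {"id": "d1", "text": "Quarterly risk review for Acme", "year": 2013,
         "concepts": {"risk", "customer:acme"}},
        {"id": "d2", "text": "Acme picnic announcement", "year": 2013,
         "concepts": {"hr"}},
    ]

    def search(docs, concept=None, phrase=None, **values):
        """Yield ids of documents passing all three kinds of constraint."""
        for doc in docs:
            if concept and concept not in doc["concepts"]:
                continue
            if phrase and phrase.lower() not in doc["text"].lower():
                continue
            if any(doc.get(k) != v for k, v in values.items()):
                continue
            yield doc["id"]

    print(list(search(documents, concept="risk", phrase="acme", year=2013)))  # ['d1']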

June 18, 2013

Shortfall of Linked Data

Filed under: Linked Data,LOD,Semantics,WWW — Patrick Durusau @ 8:58 am

Preparing a presentation, I stumbled upon a graphic illustration of why we need better semantic techniques for the average author:

Linked Data in 2011: [image: the Linked Open Data (LOD) cloud diagram]

Versus the WWW: [image: a visualization of the World Wide Web]

This must be why you don’t see any updated linked data clouds. The comparison is too shocking.

Particularly when you remember the WWW itself is only part of a much larger data cloud. (Ask the NSA about the percentages.)

Data is being produced every day, pushing us further and further behind with regard to its semantics. (And making the linked data cloud an even smaller percentage of all data.)

Authors have semantics in mind when they write.

The question is: how do we capture those semantics in machine-readable form nearly as seamlessly as authors write?

Suggestions?
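
One rough direction, sketched below in Python with a made-up [[label|identifier]] notation: let authors flag subjects inline as they write and harvest the assertions afterward. The notation is purely illustrative, not a standard anyone has adopted.

    # Very rough sketch: harvest subject identifiers from a lightweight,
    # made-up [[label|identifier]] notation an author types while writing.
    import re

    PATTERN = re.compile(r"\[\[([^|\]]+)\|([^\]]+)\]\]")

    draft = (
        "We compared [[aspirin|http://example.org/drug/aspirin]] with "
        "[[ASA|http://example.org/drug/aspirin]] in two cohorts."
    )

    def harvest(text):
        """Return (label, identifier) pairs found in the authored text."""
        return PATTERN.findall(text)

    for label, identifier in harvest(draft):
        print(label, "->", identifier)

Two different labels, one identifier: the author's own knowledge that “aspirin” and “ASA” are the same subject is captured with almost no extra effort.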

June 10, 2013

When will my computer understand me?

Filed under: Language,Markov Decision Processes,Semantics,Translation — Patrick Durusau @ 2:57 pm

When will my computer understand me?

From the post:

It’s not hard to tell the difference between the “charge” of a battery and criminal “charges.” But for computers, distinguishing between the various meanings of a word is difficult.

For more than 50 years, linguists and computer scientists have tried to get computers to understand human language by programming semantics as software. Driven initially by efforts to translate Russian scientific texts during the Cold War (and more recently by the value of information retrieval and data analysis tools), these efforts have met with mixed success. IBM’s Jeopardy-winning Watson system and Google Translate are high profile, successful applications of language technologies, but the humorous answers and mistranslations they sometimes produce are evidence of the continuing difficulty of the problem.

Our ability to easily distinguish between multiple word meanings is rooted in a lifetime of experience. Using the context in which a word is used, an intrinsic understanding of syntax and logic, and a sense of the speaker’s intention, we intuit what another person is telling us.

“In the past, people have tried to hand-code all of this knowledge,” explained Katrin Erk, a professor of linguistics at The University of Texas at Austin focusing on lexical semantics. “I think it’s fair to say that this hasn’t been successful. There are just too many little things that humans know.”

Other efforts have tried to use dictionary meanings to train computers to better understand language, but these attempts have also faced obstacles. Dictionaries have their own sense distinctions, which are crystal clear to the dictionary-maker but murky to the dictionary reader. Moreover, no two dictionaries provide the same set of meanings — frustrating, right?

Watching annotators struggle to make sense of conflicting definitions led Erk to try a different tactic. Instead of hard-coding human logic or deciphering dictionaries, why not mine a vast body of texts (which are a reflection of human knowledge) and use the implicit connections between the words to create a weighted map of relationships — a dictionary without a dictionary?

“An intuition for me was that you could visualize the different meanings of a word as points in space,” she said. “You could think of them as sometimes far apart, like a battery charge and criminal charges, and sometimes close together, like criminal charges and accusations (“the newspaper published charges…”). The meaning of a word in a particular context is a point in this space. Then we don’t have to say how many senses a word has. Instead we say: ‘This use of the word is close to this usage in another sentence, but far away from the third use.'”

Before you jump to the post looking for the code, be aware that Erk is working with a 10,000-dimensional space to analyze her data.
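
For the intuition only, here is a drastically scaled-down sketch (Python, toy context counts rather than anything like a 10,000-dimensional model): each use of “charge” becomes a vector of surrounding words, and cosine similarity puts the criminal sense nearer to the accusation sense than to the battery sense.

    # Drastically scaled-down sketch: represent each use of "charge" by the
    # words around it and compare uses with cosine similarity. Toy data only.
    import math
    from collections import Counter

    uses = {
        "battery": Counter("the battery charge lasted all day".split()),
        "criminal": Counter("the criminal charge was filed in court".split()),
        "accusation": Counter("the newspaper reported the charge filed against him".split()),
    }

    def cosine(a, b):
        """Cosine similarity between two context-word count vectors."""
        dot = sum(a[t] * b[t] for t in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    print(round(cosine(uses["criminal"], uses["accusation"]), 2))  # noticeably higher
    print(round(cosine(uses["criminal"], uses["battery"]), 2))     # noticeably lower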

The most recent paper: Montague Meets Markov: Deep Semantics with Probabilistic Logical Form (2013)

Abstract:

We combine logical and distributional representations of natural language meaning by transforming distributional similarity judgments into weighted inference rules using Markov Logic Networks (MLNs). We show that this framework supports both judging sentence similarity and recognizing textual entailment by appropriately adapting the MLN implementation of logical connectives. We also show that distributional phrase similarity, used as textual inference rules created on the fly, improves its performance.

June 6, 2013

Vocabulary Management at W3C (Draft) [ontology and vocabulary as synonyms]

Filed under: Ontology,Semantics,Vocabularies — Patrick Durusau @ 8:51 am

Vocabulary Management at W3C (Draft)

From the webpage:

One of the major stumbling blocks in deploying RDF has been the difficulty data providers have in determining which vocabularies to use. For example, a publisher of scientific papers who wants to embed document metadata in the web pages about each paper has to make an extensive search to find the possible vocabularies and gather the data to decide which among them are appropriate for this use. Many vocabularies may already exist, but they are difficult to find; there may be more than one on the same subject area, but it is not clear which ones have a reasonable level of stability and community acceptance; or there may be none, i.e. one may have to be developed in which case it is unclear how to make the community know about the existence of such a vocabulary.

There have been several attempts to create vocabulary catalogs, indexes, etc. but none of them has gained a general acceptance and few have remained up for very long. The latest notable attempt is LOV, created and maintained by Bernard Vatant (Mondeca) and Pierre-Yves Vandenbussche (DERI) as part of the DataLift project. Other application areas have more specific, application-dependent catalogs; e.g., the HCLS community has established such application-specific “ontology portals” (vocabulary hosting and/or directory services) as NCBO and OBO. (Note that for the purposes of this document, the terms “ontology” and “vocabulary” are synonyms.) Unfortunately, many of the cataloging projects in the past relied on a specific project or some individuals and they became, more often than not, obsolete after a while.

Initially (1999-2003) W3C stayed out of this process, waiting to see if the community would sort out this issue by itself. We hoped to see the emergence of an open market for vocabularies, including development tools, reviews, catalogs, consultants, etc. When that did not emerge, we decided to begin offering ontology hosting (on www.w3.org) and we began the Ontaria project (with DARPA funding) to provide an ontology directory service. Implementation of these services was not completed, however, and project funding ended in 2005. After that, W3C took no active role until the emergence of schema.org and the eventual creation of the Web Schemas Task Force of the Semantic Web Interest Group. WSTF was created both to provide an open process for schema.org and as a general forum for people interested in developing vocabularies. At this point, we are contemplating taking a more active role supporting the vocabulary ecosystem. (emphasis added)

The W3C proposal fails to address two issues with vocabularies:

1. Vocabularies are not the origin of the meanings of terms they contain.

Awful, according to yet another master of the king’s English quoted by Fries, could only mean awe-inspiring.

But it was not so. “The real meaning of any word,” argued Fries, “must be finally determined, not by its original meaning, its source or etymology, but by the content given the word in actual practical usage…. Even a hardy purist would scarcely dare pronounce a painter’s masterpiece awful, without explanations.” (The Story of Ain’t by David Skinner, HarperCollins 2012, page 47)

Vocabularies represent some community of semantic practice but that brings us to the second problem the W3C proposal ignores.

2. The meanings of terms in a vocabulary are not stable, universal, or self-evident.

The problem with most vocabularies is that they have no way to signal the context, community, or other information that would help distinguish the meaning a term has in one vocabulary from the meaning it has in another.

A human reader may intuit context and other clues from a vocabulary and use those factors when comparing the vocabulary to a text.

Computers, on the other hand, know no more than they have been told.

Vocabularies need to move beyond being simple tokens and represent terms with structures that capture some of the information a human reader knows intuitively about those terms.

Otherwise vocabularies will remain mute records of some socially defined meaning, but we won’t know which ones.
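
As a crude sketch of what “terms with structures” could mean, here is a Python toy in which a term carries its community and scope alongside the token, and a naive scorer uses that structure to pick among senses. The fields and the scoring rule are assumptions for illustration only.

    # Crude sketch: a vocabulary term as a structured record rather than a bare
    # token, plus a naive scorer that uses the structure to pick among senses.
    # Field names and the scoring rule are made up for illustration.
    TERMS = [
        {"token": "cell", "community": "biology",
         "scope": {"membrane", "nucleus", "organism", "tissue"}},
        {"token": "cell", "community": "telecom",
         "scope": {"tower", "handoff", "coverage", "network"}},
    ]

    def best_sense(token, context_words):
        """Pick the sense whose scope overlaps the usage context the most."""
        candidates = [t for t in TERMS if t["token"] == token]
        return max(candidates, key=lambda t: len(t["scope"] & set(context_words)))

    usage = "the cell tower lost coverage during the storm".split()
    print(best_sense("cell", usage)["community"])  # telecom

Even this much structure lets a program do what a human reader does intuitively: use the surrounding context to decide which community's meaning is in play.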

May 25, 2013

Semantics as Data

Filed under: Data,Semantics — Patrick Durusau @ 4:28 pm

Semantics as Data by Oliver Kennedy.

From the post:

Something I’ve been getting drawn to more and more is the idea of computation as data.

This is one of the core precepts in PL and computation: any sort of computation can be encoded as data. Yet, this doesn’t fully capture the essence of what I’ve been seeing. Sure you can encode computation as data, but then what do you do with it? How do you make use of the fact that semantics can be encoded?

Let’s take this question from another perspective. In Databases, we’re used to imposing semantics on data. Data has meaning because we chose to give it meaning. The number 100,000 is meaningless, until I tell you that it’s the average salary of an employee at BigCorporateCo. Nevertheless, we can still ask questions in the abstract. Whatever semantics you use, 100,000 < 120,000. We can create abstractions (query languages) that allow us to ask questions about data, regardless of their semantics.

By comparison, an encoded computation carries its own semantics. This makes it harder to analyze, as the nature of those semantics is limited only by the type of encoding used to store the computation. But this doesn’t stop us from asking questions about the computation.

The Computation’s Effects

The simplest thing we can do is to ask a question about what it will compute. These questions span the range from the trivial to the typically intractable. For example, we can ask about…

  • … what the computation will produce given a specific input, or a specific set of inputs.
  • … what inputs will produce a given (range of) output(s).
  • … whether a particular output is possible.
  • … whether two computations are equivalent.

One particularly fun example in this space is Oracle’s Expression type [1]. An Expression stores (as a datatype) an arbitrary boolean expression with variables. The result of evaluating this expression on a given valuation of the variables can be injected into the WHERE clause of any SELECT statement. Notably, Expression objects can be indexed based on variable valuations. Given 3 such expressions: (A = 3), (A = 5), (A = 7), we can build an index to identify which expressions are satisfied for a particular valuation of A.

I find this beyond cool. Not only can Expression objects themselves be queried, it’s actually possible to build index structures to accelerate those queries.

Those familiar with probabilistic databases will note some convenient parallels between the expression type and Condition Columns used in C-Tables. Indeed, the concepts are almost identical. A C-Table encodes the semantics of the queries that went into its construction. When we compute a confidence in a C-Table (or row), what we’re effectively asking about is the fraction of the input space that the C-Table (row) produces an output for.
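
For readers who want to play with the idea, here is a pocket-sized analogue in plain Python (not Oracle's Expression type): predicates stored as data, evaluated against valuations, and gathered into a simple index from values of A to the expressions they satisfy.

    # Pocket-sized analogue of "expressions as data" (not Oracle's Expression
    # type): store predicates as data, evaluate them against a valuation, and
    # build a simple index from values to the expressions they satisfy.
    import operator
    from collections import defaultdict

    OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}

    # Each expression is data: (name, variable, operator, constant).
    expressions = [
        ("e1", "A", "=", 3),
        ("e2", "A", "=", 5),
        ("e3", "A", ">", 4),
    ]

    def satisfied(expr, valuation):
        """Evaluate one stored expression against a variable valuation."""
        _, var, op, const = expr
        return OPS[op](valuation[var], const)

    def index_by_valuation(exprs, candidate_values):
        """Map each candidate valuation of A to the expressions it satisfies."""
        idx = defaultdict(list)
        for value in candidate_values:
            for expr in exprs:
                if satisfied(expr, {"A": value}):
                    idx[value].append(expr[0])
        return dict(idx)

    print(index_by_valuation(expressions, [3, 5, 7]))
    # {3: ['e1'], 5: ['e2', 'e3'], 7: ['e3']}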

At every level of semantics there is semantic diversity.

Whether it is code or data, there are levels of semantics, each with semantic diversity.

You don’t have to resolve all semantic diversity, just enough to give you an advantage over others.
