Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 26, 2013

*SEM 2013 […Independence to be Semantically Diverse]

Filed under: Conferences,Natural Language Processing,Semantics — Patrick Durusau @ 1:41 pm

*SEM 2013 : The 2nd Joint Conference on Lexical and Computational Semantics

Dates:

When Jun 13, 2013 – Jun 14, 2013
Where Atlanta GA, USA
Submission Deadline Mar 15, 2013
Notification Due Apr 12, 2013
Final Version Due Apr 21, 2013

From the call:

The main goal of *SEM is to provide a stable forum for the growing number of NLP researchers working on different aspects of semantic processing, which has been scattered over a large array of small workshops and conferences.

Topics of interest include, but are not limited to:

  • Formal and linguistic semantics
  • Cognitive aspects of semantics
  • Lexical semantics
  • Semantic aspects of morphology and semantic processing of morphologically rich languages
  • Semantic processing at the sentence level
  • Semantic processing at the discourse level
  • Semantic processing of non-propositional aspects of meaning
  • Textual entailment
  • Multiword expressions
  • Multilingual semantic processing
  • Social media and linguistic semantics

*SEM 2013 will feature a distinguished panel on Deep Language Understanding.

*SEM 2013 hosts the shared task on Semantic Textual Similarity.

Another workshop to join the array of “…small workshops and conferences.” 😉

Not a bad thing. Communities grow up around conferences, and the people you see at one are rarely at another.

Diversity of communities, dare I say semantics?, isn’t a bad thing. It is a reflection of our diversity and we should stop beating ourselves up over it.

Our machines are capable of being uniformly monotonous. But that is because they lack the independence to be diverse on their own.

Why would anyone want to emulate being a machine?

Multi-tasking with joint semantic spaces

Filed under: Music,Music Retrieval,Semantics — Patrick Durusau @ 1:40 pm

Paper of the Day (Po’D): Multi-tasking with joint semantic spaces by Bob L. Sturm.

From the post:

Hello, and welcome to the Paper of the Day (Po’D): Multi-tasking with joint semantic spaces edition. Today’s paper is: J. Weston, S. Bengio and P. Hamel, “Multi-tasking with joint semantic spaces for large-scale music annotation and retrieval,” J. New Music Research, vol. 40, no. 4, pp. 337-348, 2011.

This article proposes and tests a novel approach (pronounced MUSCLES but written MUSLSE) for describing a music signal along multiple directions, including semantically meaningful ones. This work is especially relevant since it applies to problems that remain unsolved, such as artist identification and music recommendation (in fact the first two authors are employees of Google). The method proposed in this article models a song (or a short excerpt of a song) as a triple in three vector spaces learned from a training dataset: one vector space is created from artists, one created from tags, and the last created from features of the audio. The benefit of using vector spaces is that they bring quantitative and well-defined machinery, e.g., projections and distances.

MUSCLES attempts to learn each vector space together so as to preserve (dis)similarity. For instance, vectors mapped from artists that are similar (e.g., Britney Spears and Christina Aguilera) should point in nearly the same direction; while those that are not similar (e.g., Engelbert Humperdinck and The Rubberbandits), should be nearly orthogonal. Similarly, so should vectors mapped from tags that are semantically close (e.g., “dark” and “moody”), and semantically disjoint (e.g., “teenage death song” and “NYC”). For features extracted from the audio, one hopes the features themselves are comparable, and are able to reflect some notion of similarity at least at the surface level of the audio. MUSCLES takes this a step further to learn the vector spaces so that one can take inner products between vectors from different spaces — which is definitely a novel concept in music information retrieval.
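The triple-of-vector-spaces idea is easy to sketch. Below is a toy illustration, not the paper's training procedure: the dimensions, names, and random "learned" embeddings are all invented. Artists, tags, and projected audio features live in one shared space, and relevance across spaces is an inner product:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4  # toy embedding dimensionality; a real system would use far more

# Hypothetical learned embeddings: one vector per artist and per tag,
# plus a linear map that projects raw audio features into the same space.
artist_vecs = {"artist_a": rng.normal(size=DIM), "artist_b": rng.normal(size=DIM)}
tag_vecs = {"dark": rng.normal(size=DIM), "upbeat": rng.normal(size=DIM)}
audio_projection = rng.normal(size=(DIM, 8))  # 8 raw audio features -> DIM

def embed_audio(features):
    """Project raw audio features into the shared semantic space."""
    return audio_projection @ features

def score(vec_a, vec_b):
    """Relevance of one item to another: inner product across spaces."""
    return float(vec_a @ vec_b)

# Rank tags for a clip: embed the audio, score it against every tag vector.
clip = embed_audio(rng.normal(size=8))
ranked = sorted(tag_vecs, key=lambda t: score(clip, tag_vecs[t]), reverse=True)
print(ranked)  # best-matching tag first
```

The payoff of the shared space is exactly this: `score` does not care whether its arguments came from the artist, tag, or audio side.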

Bob raises a number of interesting issues but here’s one that bites:

A further problem is that MUSCLES judges similarity by magnitude inner product. In such a case, if “sad” and “happy” point in exact opposite directions, then MUSCLES will say they are highly similar.

Ouch! For all the “precision” of vector spaces, there are non-apparent biases lurking therein.
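Bob's point is easy to verify numerically. In this toy demo (vectors invented), judging similarity by the magnitude of the inner product collapses "exact opposites" and "near duplicates" into nearly the same score, while the signed cosine keeps them apart:

```python
import numpy as np

sad = np.array([1.0, 0.0])
happy = -sad                  # exactly the opposite direction
moody = np.array([0.9, 0.1])  # nearly the same direction as "sad"

def magnitude_sim(a, b):
    """Similarity as the magnitude of the inner product."""
    return abs(float(a @ b))

def cosine_sim(a, b):
    """Signed cosine similarity."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Magnitude of the inner product cannot tell opposites from near-duplicates:
print(magnitude_sim(sad, happy))  # 1.0 -- "highly similar"
print(magnitude_sim(sad, moody))  # 0.9
# The signed cosine preserves the distinction:
print(cosine_sim(sad, happy))     # -1.0
print(cosine_sim(sad, moody))     # ~0.99
```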

For your convenience:

Multi-tasking with joint semantic spaces for large-scale music annotation and retrieval (full text)

Abstract:

Music prediction tasks range from predicting tags given a song or clip of audio, predicting the name of the artist, or predicting related songs given a song, clip, artist name or tag. That is, we are interested in every semantic relationship between the different musical concepts in our database. In realistically sized databases, the number of songs is measured in the hundreds of thousands or more, and the number of artists in the tens of thousands or more, providing a considerable challenge to standard machine learning techniques. In this work, we propose a method that scales to such datasets which attempts to capture the semantic similarities between the database items by modelling audio, artist names, and tags in a single low-dimensional semantic embedding space. This choice of space is learnt by optimizing the set of prediction tasks of interest jointly using multi-task learning. Our single model learnt by training on the joint objective function is shown experimentally to have improved accuracy over training on each task alone. Our method also outperforms the baseline methods tried and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where the semantic space captures well the similarities of interest.

Just to tempt you into reading the article, consider the following passage:

Artist and song similarity is at the core of most music recommendation or playlist generation systems. However, music similarity measures are subjective, which makes it difficult to rely on ground truth. This makes the evaluation of such systems more complex. This issue is addressed in Berenzweig (2004) and Ellis, Whitman, Berenzweig, and Lawrence (2002). These tasks can be tackled using content-based features or meta-data from human sources. Features commonly used to predict music similarity include audio features, tags and collaborative filtering information.

Meta-data such as tags and collaborative filtering data have the advantage of considering human perception and opinions. These concepts are important to consider when building a music similarity space. However, meta-data suffers from a popularity bias, because a lot of data is available for popular music, but very little information can be found on new or less known artists. In consequence, in systems that rely solely upon meta-data, everything tends to be similar to popular artists. Another problem, known as the cold-start problem, arises with new artists or songs for which no human annotation exists yet. It is then impossible to get a reliable similarity measure, and is thus difficult to correctly recommend new or less known artists.

“…[H]uman perception[?]…” Is there some other form I am unaware of? Some other measure of similarity than our own? Recalling that vector spaces are a pale mockery of our more subtle judgments.

Suggestions?

January 22, 2013

Content-Based Image Retrieval at the End of the Early Years

Content-Based Image Retrieval at the End of the Early Years by Arnold W.M. Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh Jain. (Smeulders, A.W.M.; Worring, M.; Santini, S.; Gupta, A.; Jain, R.; , “Content-based image retrieval at the end of the early years,” Pattern Analysis and Machine Intelligence, IEEE Transactions on , vol.22, no.12, pp.1349-1380, Dec 2000
doi: 10.1109/34.895972)

Abstract:

Presents a review of 200 references in content-based image retrieval. The paper starts with discussing the working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap. Subsequent sections discuss computational steps for image retrieval systems. Step one of the review is image processing for retrieval sorted by color, texture, and local geometry. Features for retrieval are discussed next, sorted by: accumulative and global features, salient points, object and shape features, signs, and structural combinations thereof. Similarity of pictures and objects in pictures is reviewed for each of the feature types, in close connection to the types and means of feedback the user of the systems is capable of giving by interaction. We briefly discuss aspects of system engineering: databases, system architecture, and evaluation. In the concluding section, we present our view on: the driving force of the field, the heritage from computer vision, the influence on computer vision, the role of similarity and of interaction, the need for databases, the problem of evaluation, and the role of the semantic gap.

Excellent survey article from 2000 (not 2002 as per the Ostermann paper).

I think you will appreciate the treatment of the “semantic gap,” both in terms of its description as well as ways to address it.

If you are using annotated images in your topic map application, definitely a must read.
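For a feel of the "sorted by color" end of the survey, here is a minimal sketch, using synthetic images rather than any system from the paper, of retrieval by color histogram intersection:

```python
import numpy as np

def color_histogram(img, bins=8):
    """Normalized per-channel histogram; img is an HxWx3 uint8 array."""
    hists = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def histogram_intersection(h1, h2):
    """1.0 for identical color distributions, lower for diverging ones."""
    return float(np.minimum(h1, h2).sum())

rng = np.random.default_rng(1)
reddish = rng.integers(0, 64, size=(32, 32, 3)).astype(np.uint8)
reddish[..., 0] += 180  # push the red channel up
bluish = rng.integers(0, 64, size=(32, 32, 3)).astype(np.uint8)
bluish[..., 2] += 180   # push the blue channel up

sim_same = histogram_intersection(color_histogram(reddish), color_histogram(reddish))
sim_diff = histogram_intersection(color_histogram(reddish), color_histogram(bluish))
print(sim_same, sim_diff)  # an image matches itself perfectly; a different palette scores lower
```

Histograms are exactly the kind of "accumulative and global feature" the survey discusses: cheap and orderless, which is also why they leave the semantic gap wide open.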

January 8, 2013

Data Integration Is Now A Business Problem – That’s Good

Filed under: Data Integration,Marketing,Semantics — Patrick Durusau @ 11:43 am

Data Integration Is Now A Business Problem – That’s Good by John Schmidt.

From the post:

Since the advent of middleware technology in the mid-1990s, data integration has been primarily an IT-led technical problem. Business leaders had their hands full focusing on their individual silos and were happy to delegate the complex task of integrating enterprise data and creating one version of the truth to IT. The problem is that there is now too much data that is highly fragmented across myriad internal systems, customer/supplier systems, cloud applications, mobile devices and automatic sensors. Traditional IT-led approaches whereby a project is launched involving dozens (or hundreds) of staff to address every new opportunity are just too slow.

The good news is that data integration challenges have become so large, and the opportunities for competitive advantage from leveraging data are so compelling, that business leaders are stepping out of their silos to take charge of the enterprise integration task. This is good news because data integration is largely an agreement problem that requires business leadership; technical solutions alone can’t fully solve the problem. It also shifts the emphasis for financial justification of integration initiatives from IT cost-saving activities to revenue-generating and business process improvement initiatives. (emphasis added)

I think the key point for me is the bolded line: data integration is largely an agreement problem that requires business leadership; technical solutions alone can’t fully solve the problem.

Data integration never was a technical problem, not really. It just wasn’t important enough for leaders to create agreements to solve it.

Like a lack of sharing between U.S. intelligence agencies. Which is still the case, twelve years this next September 11th as a matter of fact.

Topic maps can capture data integration agreements, but only if users have the business leadership to reach them.

Could be a very good year!
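The "agreement problem" framing can be made concrete: once two silos agree on which fields co-refer, the agreement itself is just data a machine can apply. A sketch, with hypothetical field names:

```python
# A hypothetical agreement between two silos: which fields mean the same thing.
# Reaching this mapping is the hard (leadership) part; applying it is trivial.
AGREEMENT = {
    "crm":     {"cust_id": "customer", "tel": "phone"},
    "billing": {"customerNumber": "customer", "phoneNo": "phone"},
}

def normalize(record, source):
    """Rename a silo's fields to the agreed shared vocabulary."""
    mapping = AGREEMENT[source]
    return {mapping.get(k, k): v for k, v in record.items()}

crm_rec = {"cust_id": "42", "tel": "555-0101"}
billing_rec = {"customerNumber": "42", "phoneNo": "555-0101"}

# Once normalized to the agreement, the two silos' records line up.
print(normalize(crm_rec, "crm") == normalize(billing_rec, "billing"))  # True
```

Notice that nothing in the code is clever; all the value is in the `AGREEMENT` table, which only the business side can supply.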

December 26, 2012

Semantic Assistants Wiki-NLP Integration

Filed under: Natural Language Processing,Semantics,Wiki — Patrick Durusau @ 3:27 pm

Natural Language Processing for MediaWiki: First major release of the Semantic Assistants Wiki-NLP Integration

From the post:

We are happy to announce the first major release of our Semantic Assistants Wiki-NLP integration. This is the first comprehensive open source solution for bringing Natural Language Processing (NLP) to wiki users, in particular for wikis based on the well-known MediaWiki engine and its Semantic MediaWiki (SMW) extension. It can run any NLP pipeline deployed in the General Architecture for Text Engineering (GATE), brokered as web services through the Semantic Assistants server. This allows you to bring novel text mining assistants to wiki users, e.g., for automatically structuring wiki pages, answering questions in natural language, quality assurance, entity detection, summarization, among others. The results of the NLP analysis are written back to the wiki, allowing humans and AI to work collaboratively on wiki content. Additionally, semantic markup understood by the SMW extension can be automatically generated from NLP output, providing semantic search and query functionalities.

Features:

  • Light-weight MediaWiki Extension
  • NLP Pipeline Independent Architecture
  • Flexible Wiki Input Handling
  • Flexible NLP Result Handling
  • Semantic Markup Generation
  • Wiki-independent Architecture

A promising direction for creation of author-curated text!
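To make the "Semantic Markup Generation" feature concrete: `[[Property::Value]]` is genuine Semantic MediaWiki annotation syntax, but the entity list and the shape of the NLP output below are invented for illustration, not the integration's actual API:

```python
# Hypothetical NLP output: surface form -> detected entity type.
nlp_entities = {"Montreal": "City", "GATE": "Software"}

def annotate(text, entities):
    """Wrap each detected entity in Semantic MediaWiki property markup."""
    for surface, etype in entities.items():
        text = text.replace(surface, f"[[{etype}::{surface}]]")
    return text

page = "The workshop in Montreal demonstrated GATE pipelines."
annotated = annotate(page, nlp_entities)
print(annotated)
# The workshop in [[City::Montreal]] demonstrated [[Software::GATE]] pipelines.
```

Once the markup is written back, SMW's query machinery can answer questions like "which pages mention a City?" with no further human effort.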

December 25, 2012

Static and Dynamic Semantics of NoSQL Languages […Combining Operators…]

Filed under: NoSQL,Query Language,Semantics — Patrick Durusau @ 3:59 pm

Static and Dynamic Semantics of NoSQL Languages (PDF) by Véronique Benzaken, Giuseppe Castagna, Kim Nguyễn and Jérôme Siméon.

Abstract:

We present a calculus for processing semistructured data that spans differences of application area among several novel query languages, broadly categorized as “NoSQL”. This calculus lets users define their own operators, capturing a wider range of data processing capabilities, whilst providing a typing precision so far typical only of primitive hard-coded operators. The type inference algorithm is based on semantic type checking, resulting in type information that is both precise, and flexible enough to handle structured and semistructured data. We illustrate the use of this calculus by encoding a large fragment of Jaql, including operations and iterators over JSON, embedded SQL expressions, and co-grouping, and show how the encoding directly yields a typing discipline for Jaql as it is, namely without the addition of any type definition or type annotation in the code.

From the conclusion:

On the structural side, the claim is that combining recursive records and pairs by unions, intersections, and negations suffices to capture all possible structuring of data, covering a palette ranging from comprehensions, to heterogeneous lists mixing typed and untyped data, through regular expression types and XML schemas. Therefore, our calculus not only provides a simple way to give a formal semantics to, reciprocally compare, and combine operators of different NoSQL languages, but also offers a means to equip these languages, in their current definition (i.e., without any type definition or annotation), with precise type inference.

With lots of work in between the abstract and conclusion.

The capacity to combine operators of different NoSQL languages sounds relevant to a topic maps query language.

Yes?
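For intuition only, and emphatically not the paper's calculus: you can play with types over semistructured data as predicates, where union, intersection, and negation become boolean combinators, much as the conclusion describes:

```python
# Toy "semantic types" as predicates over JSON-like values -- an intuition
# pump for unions, intersections, and negations of types, nothing more.
def is_int(v):
    return isinstance(v, int) and not isinstance(v, bool)

def is_str(v):
    return isinstance(v, str)

def union(*types):        return lambda v: any(t(v) for t in types)
def intersection(*types): return lambda v: all(t(v) for t in types)
def negation(t):          return lambda v: not t(v)

def record(**fields):
    """A record type: a dict where each named field matches its type."""
    return lambda v: (isinstance(v, dict)
                      and all(k in v and t(v[k]) for k, t in fields.items()))

# Heterogeneous data: an id may be an int or a string, but never a bool.
id_type = intersection(union(is_int, is_str),
                       negation(lambda v: isinstance(v, bool)))
person = record(id=id_type, name=is_str)

print(person({"id": 7, "name": "a"}))     # True
print(person({"id": True, "name": "a"}))  # False
```

The paper's contribution is doing this with static inference rather than runtime checks, but the combinators carry the same meaning.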

I first saw this in a tweet by Computer Science.

December 21, 2012

The Twitter of Babel: Mapping World Languages through Microblogging Platforms

Filed under: Diversity,Semantic Diversity,Semantics — Patrick Durusau @ 5:38 pm

The Twitter of Babel: Mapping World Languages through Microblogging Platforms by Delia Mocanu, Andrea Baronchelli, Bruno Gonçalves, Nicola Perra, Alessandro Vespignani.

Abstract:

Large scale analysis and statistics of socio-technical systems that just a few short years ago would have required the use of consistent economic and human resources can nowadays be conveniently performed by mining the enormous amount of digital data produced by human activities. Although a characterization of several aspects of our societies is emerging from the data revolution, a number of questions concerning the reliability and the biases inherent to the big data “proxies” of social life are still open. Here, we survey worldwide linguistic indicators and trends through the analysis of a large-scale dataset of microblogging posts. We show that available data allow for the study of language geography at scales ranging from country-level aggregation to specific city neighborhoods. The high resolution and coverage of the data allows us to investigate different indicators such as the linguistic homogeneity of different countries, the touristic seasonal patterns within countries and the geographical distribution of different languages in multilingual regions. This work highlights the potential of geolocalized studies of open data sources to improve current analysis and develop indicators for major social phenomena in specific communities.

So, rather on the surface homogeneous languages, users can use their own natural, heterogeneous languages, which we can analyze as such?

Cool!

Semantic and linguistic heterogeneity has persisted from the original Tower of Babel until now.

The smart money will be riding on managing semantic and linguistic heterogeneity.

Other money can fund emptying the semantic ocean with a tea cup.
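The measurement at the heart of the paper reduces to counting detected languages per geographic cell. A toy sketch with invented posts (the paper works at far finer resolution, down to city neighborhoods):

```python
from collections import Counter, defaultdict

# Hypothetical geolocated posts: (latitude, longitude, detected language).
posts = [
    (45.50, -73.57, "fr"), (45.51, -73.56, "en"), (45.50, -73.55, "fr"),
    (40.71, -74.01, "en"), (40.72, -74.02, "es"), (40.71, -74.01, "en"),
]

def grid_cell(lat, lon, size=1.0):
    """Bucket coordinates into a coarse grid cell."""
    return (round(lat // size), round(lon // size))

by_cell = defaultdict(Counter)
for lat, lon, lang in posts:
    by_cell[grid_cell(lat, lon)][lang] += 1

# A simple homogeneity indicator: share of the dominant language per cell.
for cell, counts in sorted(by_cell.items()):
    top_lang, top_n = counts.most_common(1)[0]
    print(cell, top_lang, round(top_n / sum(counts.values()), 2))
```

Heterogeneity is preserved rather than averaged away: each cell keeps its full language distribution, not just a single label.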

December 17, 2012

Go3R [Searching for Alternatives to Animal Testing]

Go3R

A semantic search engine for finding alternatives to animal testing.

I mention it as an example of a search interface that assists the user in searching.

The help documentation is a bit sparse if you are looking for an opportunity to contribute to such a project.

I did locate some additional information on the project, all usefully with the same title to make locating it “easy.” 😉

[Introduction] Knowledge-based semantic search engine for alternative methods to animal experiments

[PubMed – entry] Go3R – semantic Internet search engine for alternative methods to animal testing by Sauer UG, Wächter T, Grune B, Doms A, Alvers MR, Spielmann H, Schroeder M. (ALTEX. 2009;26(1):17-31).

Abstract:

Consideration and incorporation of all available scientific information is an important part of the planning of any scientific project. As regards research with sentient animals, EU Directive 86/609/EEC for the protection of laboratory animals requires scientists to consider whether any planned animal experiment can be substituted by other scientifically satisfactory methods not entailing the use of animals or entailing less animals or less animal suffering, before performing the experiment. Thus, collection of relevant information is indispensable in order to meet this legal obligation. However, no standard procedures or services exist to provide convenient access to the information required to reliably determine whether it is possible to replace, reduce or refine a planned animal experiment in accordance with the 3Rs principle. The search engine Go3R, which is available free of charge under http://Go3R.org, runs up to become such a standard service. Go3R is the world-wide first search engine on alternative methods building on new semantic technologies that use an expert-knowledge based ontology to identify relevant documents. Due to Go3R’s concept and design, the search engine can be used without lengthy instructions. It enables all those involved in the planning, authorisation and performance of animal experiments to determine the availability of non-animal methodologies in a fast, comprehensive and transparent manner. Thereby, Go3R strives to significantly contribute to the avoidance and replacement of animal experiments.

[ALTEX entry – full text available] Go3R – Semantic Internet Search Engine for Alternative Methods to Animal Testing
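The ontology-driven part of such a search engine can be sketched as query expansion: a query term matches documents through its ontology synonyms and narrower terms as well. All terms below are invented for illustration, not Go3R's actual ontology:

```python
# A tiny hypothetical ontology fragment: term -> synonyms and narrower terms.
ONTOLOGY = {
    "skin irritation test": {
        "synonyms": ["dermal irritation assay"],
        "narrower": ["reconstructed human epidermis test"],
    },
}

def expand(query):
    """Expand a query with its ontology synonyms and narrower terms."""
    entry = ONTOLOGY.get(query, {})
    return [query] + entry.get("synonyms", []) + entry.get("narrower", [])

def search(query, documents):
    """Return documents mentioning any expanded form of the query."""
    terms = expand(query)
    return [d for d in documents if any(t in d for t in terms)]

docs = [
    "Validation of a reconstructed human epidermis test protocol.",
    "Unrelated toxicology report.",
]
hits = search("skin irritation test", docs)
print(hits)  # finds the first document via the narrower term
```

The expert knowledge lives entirely in the ontology table; the retrieval code stays dumb, which is the point of the design.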

December 16, 2012

Why There Shouldn’t Be A Single Version Of The Truth

Filed under: Diversity,Semantics — Patrick Durusau @ 8:35 pm

Why There Shouldn’t Be A Single Version Of The Truth by Chuck Hollis.

From the post:

Legacy thinking can get you in trouble in so many ways. The challenge is that — well — there’s so much of it around.

Maxims that seemed to make logical sense in one era quickly become the intellectual chains that hold so many of us back. Personally, I’ve come to enjoy blowing up conventional wisdom to make room for emerging realities.

I’m getting into more and more customer discussions with progressive IT organizations that are seriously contemplating building platforms and services that meet the broad goal of “analytically enabling the business” — business analytics as service, if you will.

The problem? The people in charge have done things a certain way for a very long time. And the new, emerging requirements are forcing them to go back and seriously reconsider some of their most deeply-held assumptions.

Like having “one version of the truth”. I’ve seen multiple examples of it get in the way of organizations who need to be doing more with their data.

As usual, a highly entertaining and well illustrated essay from Chuck.

Chuck makes the case for enough uniformity to enable communication but enough diversity to generate new ideas and interesting discussions.

November 24, 2012

Consistency through semantics

Filed under: Consistency,Semantics,Software — Patrick Durusau @ 2:13 pm

Consistency through semantics by Oliver Kennedy.

From the post:

When designing a distributed system, one of the first questions anyone asks is what kind of consistency model to use. This is a fairly nuanced question, as there isn’t really one right answer. Do you enforce strong consistency and accept the resulting latency and communication overhead? Do you use locking, and accept the resulting throughput limitations? Or do you just give up and use eventual consistency and accept that sometimes you’ll end up with results that are just a little bit out of sync.

It’s this last bit that I’d like to chat about today, because it’s actually quite common in a large number of applications. This model is present in everything from user-facing applications like Dropbox to SVN/GIT, to back-end infrastructure systems like Amazon’s Dynamo and Yahoo’s PNUTs. Often, especially in non-critical applications, latency and throughput are more important than dealing with the possibility that two simultaneous updates will conflict.

So what happens when this dreadful possibility does come to pass? Clearly the system can’t grind to a halt, and often just randomly discarding one of these updates is the wrong thing to do. So what happens? The answer is common across most of these systems: They punt to the user.

Intuitively, this is the right thing to do. The user sees the big picture. The user knows best how to combine these operations. The user knows what to do, so on those rare occurrences where the system can’t handle it, the user can.

But why is this the right thing to do? What does the user have that the infrastructure doesn’t?

Take the time to read the rest of Oliver’s post.

He distinguishes rather nicely between applications and users.
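The "punt to the user" pattern in miniature — a sketch, not any particular system's merge logic: resolve automatically when only one side changed, and hand both candidates to the user only on a true concurrent conflict:

```python
def merge(base, a, b, ask_user):
    """Merge two replicas' values; punt to the user only on a true conflict."""
    if a == b:
        return a
    if a == base:  # only replica b changed
        return b
    if b == base:  # only replica a changed
        return a
    # Concurrent updates to the same value: the system cannot decide,
    # so it hands both candidates to the user.
    return ask_user(a, b)

# Non-conflicting cases resolve automatically:
print(merge("v0", "v0", "v1", ask_user=None))  # 'v1'
# A genuine conflict punts:
print(merge("v0", "v1", "v2", ask_user=lambda a, b: f"{a}+{b}"))  # 'v1+v2'
```

The `ask_user` callback is where the user's "big picture" knowledge — semantics the infrastructure does not have — enters the system.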

November 22, 2012

W3C Community and Business Groups

Filed under: Semantics,W3C — Patrick Durusau @ 5:53 am

W3C Community and Business Groups

A listing of current Community and Business Groups at the W3C. W3C membership is not required to join but you do need a free W3C account.

Several are relevant to semantics and semantic integration and are avenues for meeting other people interested in those topics.

November 16, 2012

Taming Big Data Is Not a Technology Issue [Knuth Exercise Rating]

Filed under: BigData,Semantic Diversity,Semantics — Patrick Durusau @ 4:50 am

Taming Big Data Is Not a Technology Issue by Bill Franks.

From the post:

One thing that has struck me recently is that most of the focus when discussing big data is upon the technologies involved. The consensus seems to be that the biggest challenge with big data is a technological one, yet I don’t believe this to be the case. Sure, there are challenges today for organizations using big data, but, I would like to submit to you that technology is not the biggest problem. In fact, technology may be one of the easiest problems to solve when it comes time to tame big data.

The fact is that there are tools and technologies out there that can handle virtually all of the big data needs of the vast majority of organizations. As of today, you can find products and solutions that do whatever you need to do with big data. Technology itself is not the problem.

Then, what are the issues? The real problems are with resource availability, skills, process change, politics, and culture. While the technologies to solve your problems may be out there just waiting for you to implement them, it isn’t quite that easy, is it? You have to get budget, you have to do an implementation, you have to get your people up to speed on how to use the tools, you have to get buy in from various stakeholders, and you have to push against a culture averse to change.

The technology is right there, but you are unable to effectively put it to work. It FEELS like a technology issue since technology is front and center. However, it is really the cultural, people, and political issues surrounding the technology that are the problem. Let me illustrate with an example.

A refreshing view at the drive to build technology to “solve” the big data problem.

Once terabytes of data are accessible the moment they enter the data stream, for real-time, reactive analysis, with n-dimensional graphic representations as a matter of course, the “big data” problem will still be the “big data” problem.

The often cited “volume, velocity, variety” characterization of “big data” covers surface issues that, in one manner or another, can be addressed using technology. Now.

A deeper, more persistent problem is that users expect their data, big or small, to have semantics. Whether express or implied. That problem, along with the others cited by Franks, has no technological solution.

Because semantics originate with us and not with our machines.

By all means, we need to solve the technology issues around “big data,” but that only gives us a start towards working on the more difficult problems, problems that originate with us.

A much harder “programming” exercise. I suspect on Knuth’s scale of exercises, an 80 or 90.

October 26, 2012

Open Source Natural Language Spell-Checker [Disambiguation at the point of origin.]

Automattic Open Sources Natural Language Spell-Checker After the Deadline by Jolie O’Dell.

I am sure the original headline made sense to its author, but I wonder how a natural language processor would react to it?

My reaction, being innocent of any prior knowledge of the actors or the software was: What deadline? Reading it as a report of a missed deadline.

It is almost a “who’s on first” type headline. The software’s name is “After the Deadline.”

That confusion resolved, I read:

Matt Mullenweg has just announced on his blog that WordPress parent company Automattic is open sourcing After the Deadline, a natural-language spell-checking plugin for WordPress and TinyMCE that was only recently ushered into the Automattic fold.

Scarcely seven weeks after its acquisition was announced, After the Deadline’s core technology is being released under the GPL. Moreover, writes Mullenweg, “There’s also a new jQuery API that makes it easy to integrate with any text area.”

Interested parties can check out this demo or read the tech overview and grab the source code here.

I can use spelling/grammar suggestions. Particularly since I make the same mistakes over and over again.

Does that also mean I talk about the same subjects/entities over and over again? Or at least a limited range of subjects/entities?

Imagine a user configurable subject/entity “checker” that annotated recognized subjects/entities with an <a> element. Enabling the user to accept/reject the annotation.

Disambiguation at the point of origin.

The title of the original article could become:

“<a href="http://automattic.com/">Automattic</a> Open Sources Natural Language Spell-Checker <a href="http://www.afterthedeadline.com/">After the Deadline</a>”

Seems less ambiguous to me.

Certainly less ambiguous to a search engine.

You?
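The imagined subject/entity checker, in its crudest form, is a dictionary of known subjects plus a pass that wraps each occurrence in an <a> element, leaving accept/reject to the author. The two URLs are the ones from the example above; everything else is a sketch:

```python
# Known subjects -> identifying URLs (the two from the headline above).
SUBJECTS = {
    "Automattic": "http://automattic.com/",
    "After the Deadline": "http://www.afterthedeadline.com/",
}

def annotate(text):
    """Wrap each recognized subject in an <a> element for author review."""
    # Longest names first, so "After the Deadline" wins over any substring.
    for name in sorted(SUBJECTS, key=len, reverse=True):
        text = text.replace(name, f'<a href="{SUBJECTS[name]}">{name}</a>')
    return text

headline = "Automattic Open Sources Natural Language Spell-Checker After the Deadline"
annotated = annotate(headline)
print(annotated)
```

A real checker would need word-boundary matching and a review step, but even this much moves disambiguation to the point of origin.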

October 25, 2012

Data Preparation: Know Your Records!

Filed under: Data,Data Quality,Semantics — Patrick Durusau @ 10:25 am

Data Preparation: Know Your Records! by Dean Abbott.

From the post:

Data preparation in data mining and predictive analytics (dare I also say Data Science?) rightfully focuses on how the fields in one's data should be represented so that modeling algorithms either will work properly or at least won't be misled by the data. These data preprocessing steps may involve filling missing values, reining in the effects of outliers, transforming fields so they better comply with algorithm assumptions, binning, and much more. In recent weeks I've been reminded how important it is to know your records. I've heard this described in many ways, four of which are:
  • the unit of analysis
  • the level of aggregation
  • what a record represents
  • unique description of a record

A bit further on Dean reminds us:

What isn’t always obvious is when our assumptions about the data result in unexpected results. What if we expect the unit of analysis to be customerID/Session but there are duplicates in the data? Or what if we had assumed customerID/Session data but it was in actuality customerID/Day data (where ones customers typically have one session per day, but could have a dozen)? (emphasis added)

Obvious once Dean says it, but how often do you question assumptions about data?

Do you know what impact incorrect assumptions about data will have on your operations?

If you investigate your assumptions about data, where do you record your observations?

Or will you repeat the investigation with every data dump from a particular source?

Describing data “in situ” could benefit you six months from now or your successor. (The data and or its fields would be treated as subjects in a topic map.)
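Dean's customerID/Session example can be turned into a mechanical check: before modeling, assert that the assumed unit of analysis really is unique. A sketch with invented records:

```python
from collections import Counter

# Hypothetical records assumed to be one row per (customerID, session).
records = [
    {"customerID": "c1", "session": "s1", "pages": 5},
    {"customerID": "c1", "session": "s2", "pages": 2},
    {"customerID": "c1", "session": "s2", "pages": 7},  # violates the assumption
]

def check_grain(rows, key_fields):
    """Return keys that appear more than once at the assumed grain."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in rows)
    return {k: n for k, n in counts.items() if n > 1}

dupes = check_grain(records, ("customerID", "session"))
print(dupes)  # {('c1', 's2'): 2} -- the data is NOT at customerID/session grain
```

Recording the outcome of a check like this is exactly the kind of "describing data in situ" observation worth keeping for your successor.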

October 23, 2012

Jurimetrics (Modern Uses of Logic in Law (MULL))

Filed under: Law,Legal Informatics,Logic,Semantics — Patrick Durusau @ 10:48 am

Jurimetrics (Modern Uses of Logic in Law (MULL))

From the about page:

Jurimetrics, The Journal of Law, Science, and Technology (ISSN 0897-1277), published quarterly, is the journal of the American Bar Association Section of Science & Technology Law and the Center for Law, Science & Innovation. Click here to view the online version of Jurimetrics.

Jurimetrics is a forum for the publication and exchange of ideas and information about the relationships between law, science and technology in all areas, including:

  • Physical, life and social sciences
  • Engineering, aerospace, communications and computers
  • Logic, mathematics and quantitative methods
  • The uses of science and technology in law practice, adjudication and court and agency administration
  • Policy implications and legislative and administrative control of science and technology.

Jurimetrics was first published in 1959 under the leadership of Layman Allen as Modern Uses of Logic in Law (MULL). The current name was adopted in 1966. Jurimetrics is the oldest journal of law and science in the United States, and it enjoys a circulation of more than 8,000, which includes all members of the ABA Section of Science & Technology Law.

I just mentioned this journal in Wyner et al.: An Empirical Approach to the Semantic Representation of Laws, but wanted to also capture its earlier title, Modern Uses of Logic in Law (MULL), because I am likely to search for it as well.

I haven’t looked at the early issues in some years but as I recall, they were quite interesting.

Wyner et al.: An Empirical Approach to the Semantic Representation of Laws

Filed under: Language,Law,Legal Informatics,Machine Learning,Semantics — Patrick Durusau @ 10:37 am

Wyner et al.: An Empirical Approach to the Semantic Representation of Laws

Legal Informatics brings news of Dr. Adam Wyner’s paper, An Empirical Approach to the Semantic Representation of Laws, and quotes the abstract as:

To make legal texts machine processable, the texts may be represented as linked documents, semantically tagged text, or translated to formal representations that can be automatically reasoned with. The paper considers the latter, which is key to testing consistency of laws, drawing inferences, and providing explanations relative to input. To translate laws to a form that can be reasoned with by a computer, sentences must be parsed and formally represented. The paper presents the state-of-the-art in automatic translation of law to a machine readable formal representation, provides corpora, outlines some key problems, and proposes tasks to address the problems.

The paper originated at Project IMPACT.

If you haven’t looked at semantics and the law recently, this is a good opportunity to catch up.

I have only skimmed the paper and its references but am already looking for online access to early issues of Jurimetrics (a journal by the American Bar Association) that addressed such issues many years ago.

Should be fun to see what has changed and by how much. What issues remain and how they are viewed today.

October 22, 2012

HBase Futures

Filed under: Hadoop,HBase,Hortonworks,Semantics — Patrick Durusau @ 2:28 pm

HBase Futures by Devaraj Das.

From the post:

As we have said here, Hortonworks has been steadily increasing our investment in HBase. HBase's adoption has been increasing in the enterprise. To continue this trend, we feel HBase needs investments in the areas of:

  1. Reliability and High Availability (all data always available, and recovery from failures is quick)
  2. Autonomous operation (minimum operator intervention)
  3. Wire compatibility (to support rolling upgrades across a couple of versions at least)
  4. Cross data-center replication (for disaster recovery)
  5. Snapshots and backups (be able to take periodic snapshots of certain/all tables and be able to restore them at a later point if required)
  6. Monitoring and Diagnostics (which regionserver is hot or what caused an outage)

Probably just a personal prejudice but I would have mentioned semantics in that list.

You?

New version of Get-Another-Label available

Filed under: Crowd Sourcing,Mechanical Turk,oDesk,Semantics — Patrick Durusau @ 8:49 am

New version of Get-Another-Label available by Panos Ipeirotis.

From the post:

I am often asked what type of technique I use for evaluating the quality of the workers on Mechanical Turk (or on oDesk, or …). Do I use gold tests? Do I use redundancy?

Well, the answer is that I use both. In fact, I use the code “Get-Another-Label” that I have developed together with my PhD students and a few other developers. The code is publicly available on Github.

We have updated the code recently, to add some useful functionality, such as the ability to pass (for evaluation purposes) the true answers for the different tasks, and get back answers about the quality of the estimates of the different algorithms.
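The gold-plus-redundancy approach Panos describes can be illustrated with a toy sketch. To be clear, this is a hypothetical illustration with made-up data and function names, not the Get-Another-Label API: estimate each worker's accuracy from the gold questions, then weight their redundant votes on the remaining tasks.

```python
from collections import defaultdict

def worker_accuracy(answers, gold):
    """Estimate each worker's quality from the gold questions only."""
    acc = {}
    for worker, responses in answers.items():
        graded = [(q, a) for q, a in responses.items() if q in gold]
        if graded:
            acc[worker] = sum(gold[q] == a for q, a in graded) / len(graded)
    return acc

def weighted_vote(answers, acc, task):
    """Combine redundant answers, weighting each worker by estimated accuracy."""
    votes = defaultdict(float)
    for worker, responses in answers.items():
        if task in responses:
            # Unknown workers get a neutral 0.5 weight.
            votes[responses[task]] += acc.get(worker, 0.5)
    return max(votes, key=votes.get)

# "g1"/"g2" are gold questions with known answers; "t1" is a real task.
answers = {
    "w1": {"g1": "cat", "g2": "dog", "t1": "cat"},
    "w2": {"g1": "cat", "g2": "cat", "t1": "dog"},
    "w3": {"g1": "cat", "g2": "dog", "t1": "cat"},
}
gold = {"g1": "cat", "g2": "dog"}

acc = worker_accuracy(answers, gold)
print(acc["w2"])                          # 0.5 (missed one gold question)
print(weighted_vote(answers, acc, "t1"))  # cat
```

The real code does considerably more (iterative quality estimation, cost-sensitive labeling), but the two ingredients, gold tests and redundancy, combine in essentially this way.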

Panos continues his series on the use of crowd sourcing.

Just a thought experiment at the moment but could semantic gaps between populations be “discovered” by use of crowd sourcing?

That is, create tasks that require “understanding” some implicit semantic and then collect the answers.

There being no “incorrect” answers, only answers that reflect the differing perceptions of the semantics of the task.

A way to get away from using small groups of college students for such research? (Nothing against small groups of college students but they best represent small groups of college students. May need a broader semantic range.)
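One way to make the thought experiment concrete: collect free-form answers to the same task from two populations and measure how far apart their answer distributions sit. Everything below (the populations, the answers, the function names) is hypothetical illustration, not a claim about any existing study.

```python
from collections import Counter

def answer_distribution(answers):
    """Turn a list of free-form answers into a probability distribution."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def total_variation(p, q):
    """Half the L1 distance between distributions: 0 = identical, 1 = disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

# Hypothetical responses to "what does 'table' mean here?" from two groups.
students = ["furniture"] * 8 + ["data table"] * 2
analysts = ["data table"] * 7 + ["furniture"] * 3

gap = total_variation(answer_distribution(students),
                      answer_distribution(analysts))
print(round(gap, 2))  # 0.5 -- a sizable semantic gap between the populations
```

A large gap would not mean either group is "wrong"; it would locate exactly the tasks where the two populations read different semantics into the same words.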

A Strong ARM for Big Data [Semantics Not Included]

Filed under: BigData,HPC,Semantics — Patrick Durusau @ 4:00 am

A Strong ARM for Big Data (Datanami – Sponsored Content by Calxeda)

From the post:

Burgeoning data growth is one of the foremost challenges facing IT and businesses today. Multiple analyst groups, including Gartner, have reported that information volume is growing at a minimum rate of 59 percent annually. At the same time, companies increasingly are mining this data for invaluable business insight that can give them a competitive advantage.

The challenge the industry struggles with is figuring out how to build cost-effective infrastructures so data scientists can derive these insights for their organizations to make timely, more intelligent decisions. As data volumes continue their explosive growth and algorithms to analyze and visualize that data become more optimized, something must give.

Past approaches that primarily relied on using faster, larger systems just are not able to keep pace. There is a need to scale-out, instead of scaling-up, to help in managing and understanding Big Data. As a result, this has focused new attention on different technologies such as in-memory databases, I/O virtualization, high-speed interconnects, and software frameworks such as Hadoop.

To take full advantage of these network and software innovations requires re-examining strategies for compute hardware. For maximum performance, a well-balanced infrastructure based on densely packed, power-efficient processors coupled with fast network interconnects is needed. This approach will help unlock applications and open new opportunities in business and high performance computing (HPC). (emphasis added)

I like powerful hardware as much as the next person. Either humming within earshot or making the local grid blink when it comes online.

Still, hardware/software tools for big data need to come with the warning label: “Semantics not included.”

To soften the disappointment when big data appliances and/or software arrive and the bottom line stays the same, or gets worse.

Effective use of big data, the kind that improves your bottom line, requires semantics: your semantics.

October 18, 2012

A Glance at Information-Geometric Signal Processing

Filed under: Image Processing,Image Recognition,Semantics — Patrick Durusau @ 2:18 pm

A Glance at Information-Geometric Signal Processing by Frank Nielsen.

Slides from the MAHI workshop (Methodological Aspects of Hyperspectral Imaging).

From the workshop homepage:

The scope of the MAHI workshop is to explore new pathways that can potentially lead to breakthroughs in the extraction of the informative content of hyperspectral images. It will bring together researchers involved in hyperspectral image processing and in various innovative aspects of data processing.

Images, their informational content and the tools to analyze them have semantics too.

Do Presidential Debates Approach Semantic Zero?

ReConstitution recreates debates through transcripts and language processing by Nathan Yau.

From Nathan’s post:

Part data visualization, part experimental typography, ReConstitution 2012 is a live web app linked to the US Presidential Debates. During and after the three debates, language used by the candidates generates a live graphical map of the events. Algorithms track the psychological states of Romney and Obama and compare them to past candidates. The app allows the user to get beyond the punditry and discover the hidden meaning in the words chosen by the candidates.

The visualization does not answer the thorny experimental question: Do presidential debates approach semantic zero?

Well, maybe the technique will improve by the next presidential election.

In the meantime, it was an impressive display of real time processing and analysis of text.

Imagine such an interface that was streaming text for you to choose subjects, associations between subjects, and the like.

Not trying to perfectly code any particular stretch of text but interacting with the flow of the text.

There are goals other than approaching semantic zero.

October 15, 2012

People and Process > Prescription and Technology

Filed under: Project Management,Semantic Diversity,Semantics,Software — Patrick Durusau @ 3:55 pm

Factors that affect software systems development project outcomes: A survey of research by Laurie McLeod and Stephen G. MacDonell. ACM Computing Surveys (CSUR), Volume 43, Issue 4, October 2011, Article No. 24, DOI: 10.1145/1978802.1978803.

Abstract:

Determining the factors that have an influence on software systems development and deployment project outcomes has been the focus of extensive and ongoing research for more than 30 years. We provide here a survey of the research literature that has addressed this topic in the period 1996–2006, with a particular focus on empirical analyses. On the basis of this survey we present a new classification framework that represents an abstracted and synthesized view of the types of factors that have been asserted as influencing project outcomes.

As with most survey work, particularly ones that summarize 177 papers, this is a long article, some fifty-six pages.

Let me try to tempt you into reading it by quoting from Angelica de Antonio’s review of it (in Computing Reviews, Oct. 2012):

An interesting discussion about the very concept of project outcome precedes the survey of factors, and an even more interesting discussion follows it. The authors stress the importance of institutional context in which the development project takes place (an aspect almost neglected in early research) and the increasing evidence that people and process have a greater effect on project outcomes than technology. A final reflection on why projects still continue to fail, even if we seem to know the factors that lead to success, raises a question on the utility of prescriptive factor-based research and leads to considerations that could inspire future research. (emphasis added)

Before you run off to the library or download a copy of the survey, two thoughts to keep in mind:

First, if “people and process” are more important than technology, where should we place the emphasis in projects involving semantics?

Second, if “prescription” can’t cure project failure, what are its chances with semantic diversity?

Thoughts?

October 14, 2012

Tech That Protects The President, Part 1: Data Mining

Filed under: Data Mining,Natural Language Processing,Semantics — Patrick Durusau @ 3:41 pm

Tech That Protects The President, Part 1: Data Mining by Alex Popescu.

From the post:

President Obama’s appearance at the Democratic National Convention in September took place amid a rat’s nest of perils. But the local Charlotte, North Carolina, police weren’t entirely on their own. They were aided by a sophisticated data mining system that helped them identify threats and react to them quickly. (Part 1 of a 3-part series about the technology behind presidential security.)

The Charlotte-Mecklenburg police used software from IxReveal to monitor the Internet for associations between Obama, the DNC, and potential threats. The company's program, known as uReveal, combs news articles, status updates, blog posts, and discussion forum comments. But it doesn't simply search for keywords. It works on concepts defined by the user and uses natural language processing to analyze plain English based on meaning and context, taking into account slang and sentiment. If it detects something amiss, the system sends real-time alerts.

“We are able to read and alert almost as fast as [information] comes on the Web, as opposed to other systems where it takes hours,” said Bickford, vice president of operations of IxReveal.

In the past, this kind of task would have required large numbers of people searching and then reading huge volumes of information and manually highlighting relevant references. “Normally you have to take information like an email and shove it into a database,” Bickford explained. “Someone has to physically read it or do a keyword search.”

uReveal, on the other hand, lets machines do the reading, tracking, and analysis. “If you apply our patented technology and natural language processing capability, you can actually monitor that information for specific keywords and phrases based on meaning and context,” he says. The software can differentiate between a Volkswagen bug, a computer bug and an insect bug, Bickford explained – or, more to the point, between a reference to fire from a gun barrel and one to fire in a fireplace.

Bickford says the days of people slaving over sifting through piles of data, or ETL (extract, transform and load) data processing capabilities are over. “It’s just not supportable.”

I understand product promotion but do you think potential assassins are publishing letters to the editor, blogging or tweeting about their plans or operational details?

Granted, contract killers in Georgia are caught when someone tries to hire an undercover police officer as a “hit” man.

Does that expectation of dumbness apply in other cases as well?

Or, is searching large amounts of data like the drunk looking for his keys under the street light?

A case of “the light is better here?”

October 9, 2012

A Semantic Look at the Presidential Debates

Filed under: Debate,Natural Language Processing,Politics,Semantics — Patrick Durusau @ 3:30 pm

A Semantic Look at the Presidential Debates

Warning: For entertainment purposes only.*

Angela Guess reports:

Luca Scagliarini of Expert System reports, “This week's presidential debate is being analyzed across the web on a number of fronts, from a factual analysis of what was said, to the number of tweets it prompted. Instead, we used our Cogito semantic engine to analyze the transcript of the debate through a semantic and linguistic lens. Cogito extracted the responses by question, breaking sentences down to their granular detail. This analysis allows us to look at the individual language elements to better understand what was said, as well as how the combined effect of word choice, sentence structure and sentence length might be interpreted by the audience.”

The full post: Presidential Debates 2012: Semantically speaking

*I don’t doubt the performance of the Cogito engine, just the semantics, if any, of the target content. šŸ˜‰

A Good Example of Semantic Inconsistency [C-Suite Appropriate]

Filed under: Marketing,Semantic Diversity,Semantic Inconsistency,Semantics — Patrick Durusau @ 10:27 am

A Good Example of Semantic Inconsistency by David Loshin.

You can guide users through the intellectual minefield of Frege, Peirce, Russell, Carnap, Sowa and others to illustrate the need for topic maps, with stunning (as in daunting) graphics.

Or, you can use David’s story:

I was at an event a few weeks back talking about data governance, and a number of the attendees were from technology or software companies. I used the term “semantic inconsistency” and one of the attendees asked me to provide an example of what I meant.

Since we had been discussing customers, I thought about it for a second and then asked him what his definition was of a customer. He said that a customer was someone who had paid the company money for one of their products. I then asked if anyone in the audience was on the support team, and one person raised his hand. I asked him for a definition, and he said that a customer is someone to whom they provide support.

I then posed this scenario: the company issued a 30-day evaluation license to a prospect with full support privileges. Since the prospect had not paid any money for the product, according to the first definition that individual was not a customer. However, since that individual was provided full support privileges, according to the second definition that individual was a customer.

Within each silo, the associated definition is sound, but the underlying data sets are not compatible. An attempt to extract the two customer lists and merge them together into a single list will lead to inconsistent results. This may be even worse if separate agreements dictate how long a purchaser is granted full support privileges; this may lead to many inconsistencies across those two data sets.
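David's scenario is easy to reproduce in code. A minimal sketch, with hypothetical field names, showing the same record classified differently by each silo's perfectly sound definition:

```python
def is_customer_sales(record):
    """Sales silo: a customer is someone who has paid for a product."""
    return record["amount_paid"] > 0

def is_customer_support(record):
    """Support silo: a customer is someone entitled to support."""
    return record["has_support"]

# The 30-day evaluation license: full support privileges, no payment.
prospect = {"name": "Eval Corp", "amount_paid": 0, "has_support": True}

print(is_customer_sales(prospect))    # False -- not a customer to sales
print(is_customer_support(prospect))  # True  -- a customer to support
```

A naive merge of the two "customer" lists silently conflates two different subjects that happen to share a name, which is exactly the situation topic maps are built to make explicit.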

Illustrating “semantic inconsistency,” one story at a time.

What’s your 250 – 300 word semantic inconsistency story?

PS: David also points to webinar that will be of interest. Visit his post.

October 7, 2012

The Forgotten Mapmaker: Nokia… [Lessons for Semantic Map Making?]

Filed under: Mapping,Maps,Semantics — Patrick Durusau @ 7:57 pm

The Forgotten Mapmaker: Nokia Has Better Maps Than Apple and Maybe Even Google by Alexis C. Madrigal.

What’s Nokia’s secret? Twelve billion probe data points a month, including data from FedEx and other logistic companies.

Notice that the logistic companies are not collecting mapping data per se, they are delivering goods.

Nokia is building maps based on data collected for another purpose, a completely routine and unrelated purpose to map making.

Does that suggest something to you about semantic map making?

That we need to capture semantics as users travel through data for other purposes?

If I knew what those opportunities were I would have put them at the top of this post. Suggestions?

PS: Sam Hunting pointed me towards this article.

Broken Telephone Game of Defining Software and UI Requirements [And Semantics]

Filed under: Project Management,Requirements,Semantics — Patrick Durusau @ 7:33 pm

The Broken Telephone Game of Defining Software and UI Requirements by Martin Crisp.

Martin is writing in a UI context but the lesson he teaches is equally applicable to any part of software/project management. (Even U.S. federal government big data projects.)

His counsel is not one of despair; he outlines solutions that can lessen the impact of the broken telephone game.

But it is up to you to recognize the game that is afoot and to react accordingly.

From the post:

The broken telephone game is played all over the world. In it, according to Wikipedia, “one person whispers a message to another, which is passed through a line of people until the last player announces the message to the entire group. Errors typically accumulate in the retellings, so the statement announced by the last player differs significantly, and often amusingly, from the one uttered by the first.”

This game is also played inadvertently by a large number of organizations seeking to define software and UI requirements, using information passed from customers, to business analysts, to UI/UX designers, to developers and testers.

Hereā€™s a typical example:

  • The BA or product owner elicits requirements from a customer and writes them down, often as a feature list and use cases.
  • The use cases are interpreted by the UI/UX team to develop UI mockups and storyboards.
  • Testing interprets the storyboards, mockups, and use cases to develop test cases,
  • Also, the developers will try to interpret the use cases, mockups, and storyboards to actually write the code.

As with broken telephone, at each handoff of information the original content is altered. The resulting approach includes a lot of re-work and escalating project costs due to combinations of the following:

  • Use cases don't properly represent customer requirements.
  • UI/UX design is not consistent with the use cases.
  • Incorrect test cases create false bugs.
  • Missed test cases result in undiscovered bugs.
  • Developers build features that don't meet customer needs.

The further down the broken telephone line the original requirements get, the more distorted they become. For this reason, UI storyboards, test cases, and code typically require a lot of reworking as requirements are misunderstood or improperly translated by the time they get to the UI and testing teams.

October 6, 2012

Perseus Gives Big Humanities Data Wings

Filed under: Humanities,Marketing,Semantics — Patrick Durusau @ 1:23 pm

Perseus Gives Big Humanities Data Wings by Ian Armas Foster.

From the post:

“How do we think about the human record when our brains are not capable of processing all the data in isolation?” asked Professor Gregory Crane of students in a lecture hall at the University of Kansas.

But when he posed this question, Crane wasnā€™t referencing modern big data to a bunch of computer science majors. Rather, he was discussing data from ancient texts with a group of those studying the humanities (and one computer science major).

Crane, a professor of classics, adjunct professor of computer science, and chair of Technology and Entrepreneurship at Tufts University, spoke about the efforts of the Perseus Project, a project whose goals include storing and analyzing ancient texts with an eye toward building a global humanities model.

(video omitted)

The next step in the humanities is to create what Crane calls “a dialogue among civilizations.” With regard to the study of the humanities, it is to connect those studying classical Greek with those studying classical Latin, Arabic, and even Chinese. Just as physicists want to model the universe, Crane wants to model the progression of intelligence and art on a global scale throughout human history.

… (a bit later)

Surprisingly, the biggest barrier is not actually the amount of space occupied by the data of the ancient texts, but rather the language barriers. Currently, the Perseus Project covers over a trillion words, but those words are split up into 400 languages. To give a specific example, Crane presented a 12th century Arabic document. It was pristine and easily readable, to anyone who can read ancient Arabic.

Substitute “semantic” for “language” in “language barriers” and I think the comment is right on the mark.

Assuming that you could read the “12th century Arabic document” and understand its semantics, where would you record your reading to pass it along to others?

Say you spot the name of a well known 12th century figure. Must every reader duplicate your feat of reading and understanding the document to make that same discovery?

Or can we preserve your “discovery” for other readers?

Topic maps anyone?

October 2, 2012

Apache Stanbol graduates to Top-Level Project

Filed under: Semantics,Stanbol — Patrick Durusau @ 4:15 pm

Apache Stanbol graduates to Top-Level Project

From the post:

The Apache Software Foundation (ASF) has announced that Apache Stanbol has graduated from project incubation. Stanbol is an open source Java stack designed to interface with a content management system (CMS) to enhance it with semantic information. With the elevation to a Top-Level Project, the ASF recognises that the project’s community has been “well-governed” according to the foundation’s principles and follows “The Apache Way” for running a project.

Stanbol is a modular collection of reasoning engines, content enhancers and components to manage rules and metadata for content fed into the framework, all wrapped with a RESTful API and orchestrated within an Apache Felix OSGi container. A CMS adapter allows the system to connect to content management systems from which it can extract data to use in evaluating and developing rules and annotations.

The RESTful API can then be used to provide semantic information for content from a different source based upon information the server has previously analysed. Stanbol is more of a collection of reusable components than a complete solution for semantic searching, however. It is designed to work alongside CMS systems and existing search software.

I suppose too much *nix experience has made me suspicious of “complete solutions” for anything. Components, particularly interchangeable ones, seem a lot more robust.

September 30, 2012

Twitter Semantics Opportunity?

Filed under: Semantics,Tweets — Patrick Durusau @ 8:25 pm

Carl Bialik (Wall Street Journal) writes in Timing Twitter about the dangers of reading too much into tweet statistics and then says:

She [Twitter spokeswoman Elaine Filadelfo] noted that the company is being conservative in its counting, and that the true counts likely are higher than the ones reported by Twitter. For instance, the company didn't include “Ryan” in its search terms for the Republican convention, to avoid picking up tweets about, say, Ryan Gosling rather than those about Republican vice-presidential candidate Paul Ryan. And it has no way to catch tweets such as “beautiful dress” that are referring to presenters' outfits during the Emmy Awards telecast. “You follow me during the Emmys, and you know I'm talking about the Emmys,” Filadelfo said of the hypothetical “beautiful dress” tweet. But Twitter doesn't know that and doesn't count that tweet.

Twitter may not “know” about the Emmys (they need to get out more) but certainly followers on Twitter did.

Followers probably bright enough to know which presenter was being identified in the tweet.

Imagine a crowd sourced twitter application where you follow particular people and add semantics to their tweets.
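As a sketch of the idea (all identifiers and URLs below are hypothetical): followers bind tweets to subject identifiers, so later counts can find the “beautiful dress” tweet that keyword matching misses.

```python
# Hypothetical crowd-sourced annotation layer: followers attach subject
# identifiers to tweets whose semantics Twitter's keyword counts miss.
tweets = [
    {"id": 1, "text": "beautiful dress"},
    {"id": 2, "text": "Ryan nailed that speech"},
]

# Follower-supplied annotations: tweet id -> subject identifier.
annotations = {
    1: "http://example.org/subject/emmy-awards-2012",
    2: "http://example.org/subject/paul-ryan",
}

def count_subject(tweets, annotations, subject):
    """Count tweets bound to a subject by crowd annotation, not by keyword."""
    return sum(1 for t in tweets if annotations.get(t["id"]) == subject)

emmys = "http://example.org/subject/emmy-awards-2012"
print(count_subject(tweets, annotations, emmys))  # 1
```

No string in the first tweet mentions the Emmys; the follower's annotation supplies the context that Twitter's own counting, by Filadelfo's admission, cannot.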

It might not return big bucks for the people adding semantics, but if they were donating their time to an organization or group, it could reach critical mass.

We can keep waiting for computers to become less dumb, or we can pitch in to cover the semantic gap.

What do you think?
