Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 18, 2014

From Frequency to Meaning: Vector Space Models of Semantics

Filed under: Meaning,Semantic Vectors,Semantics,Vector Space Model (VSM) — Patrick Durusau @ 6:31 pm

From Frequency to Meaning: Vector Space Models of Semantics by Peter D. Turney and Patrick Pantel.

Abstract:

Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term–document, word–context, and pair–pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.

At forty-eight (48) pages with a thirteen (13) page bibliography, this survey of vector space models (VSMs) of semantics should keep you busy for a while. You will have to fill in VSM developments since 2010 but mastery of this paper will certainly give you the foundation to do so. Impressive work.

I do disagree with the authors when they say:

Computers understand very little of the meaning of human language.

Truth be told, I would say:

Computers have no understanding of the meaning of human language.

What happens with a VSM of semantics is that we as human readers choose a model we think represents semantics we see in a text. Our computers blindly apply that model to text and report the results. We as human readers choose results that we think are closer to the semantics we see in the text, and adjust the model accordingly. Our computers then blindly apply the adjusted model to the text again and so on. At no time does the computer have any “understanding” of the text or of the model that it is applying to the text. Any “understanding” in such a model is from a human reader who adjusted the model based on their perception of the semantics of a text.
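
To make that division of labor concrete, here is a minimal sketch of a term–document VSM with cosine similarity (Python/NumPy). The toy corpus and the choice of cosine as the similarity measure are my illustration, not taken from the paper; the point is that every “semantic” judgment in it was made by the person who wrote it.

```python
import numpy as np

# Toy corpus chosen by a human to illustrate the model.
docs = {
    "doc1": "cats chase mice",
    "doc2": "dogs chase cats",
    "doc3": "stocks rise on earnings",
}

# Term-document matrix: rows are terms, columns are documents.
terms = sorted({w for text in docs.values() for w in text.split()})
matrix = np.array([[text.split().count(t) for text in docs.values()] for t in terms])

def cosine(u, v):
    # The similarity measure is a human choice; the machine just does arithmetic.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

cols = {name: matrix[:, i] for i, name in enumerate(docs)}
print(cosine(cols["doc1"], cols["doc2"]))  # higher: shared vocabulary
print(cosine(cols["doc1"], cols["doc3"]))  # zero: no shared terms
```

The output looks like a semantic judgment, but everything “semantic” about it (the corpus, the weighting, the measure) was decided by a human before the machine ran.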

I don’t dispute that VSMs have been incredibly useful and like the authors, I think there is much mileage left in their development for text processing. That is not the same thing as imputing “understanding” of human language to devices that in fact have none at all. (full stop)

Enjoy!

I first saw this in a tweet by Christopher Phipps.

PS: You probably recall that VSMs are based on creating a metric space for semantics, which has no preordained metric space of its own. Transitioning from a non-metric space to a metric space isn’t subject to validation, at least in my view.

September 16, 2014

Life Is Random: Biologists now realize that “nature vs. nurture” misses the importance of noise

Filed under: Description Logic,RDF,Semantic Web,Semantics — Patrick Durusau @ 7:52 pm

Life Is Random: Biologists now realize that “nature vs. nurture” misses the importance of noise by Cailin O’Connor.

From the post:

Is our behavior determined by genetics, or are we products of our environments? What matters more for the development of living things—internal factors or external ones? Biologists have been hotly debating these questions since shortly after the publication of Darwin’s theory of evolution by natural selection. Charles Darwin’s half-cousin Francis Galton was the first to try to understand this interplay between “nature and nurture” (a phrase he coined) by studying the development of twins.

But are nature and nurture the whole story? It seems not. Even identical twins brought up in similar environments won’t really be identical. They won’t have the same fingerprints. They’ll have different freckles and moles. Even complex traits such as intelligence and mental illness often vary between identical twins.

Of course, some of this variation is due to environmental factors. Even when identical twins are raised together, there are thousands of tiny differences in their developmental environments, from their position in the uterus to preschool teachers to junior prom dates.

But there is more to the story. There is a third factor, crucial to development and behavior, that biologists overlooked until just the past few decades: random noise.

In recent years, noise has become an extremely popular research topic in biology. Scientists have found that practically every process in cells is inherently, inescapably noisy. This is a consequence of basic chemistry. When molecules move around, they do so randomly. This means that cellular processes that require certain molecules to be in the right place at the right time depend on the whims of how molecules bump around. (bold emphasis added)

Is another word for “noise” chaos?

The sort of randomness that impacts our understanding of natural languages? That leads us to use different words for the same thing and the same word for different things?

The next time you see a semantically deterministic system be sure to ask if they have accounted for the impact of noise on the understanding of people using the system. 😉

To be fair, no system can, but the pretense that noise doesn’t exist in some semantic environments (think description logic, RDF) is more than a little annoying.

You might want to start following the work of Cailin O’Connor (University of California, Irvine, Logic and Philosophy of Science).

Disclosure: I have always had a weakness for philosophy of science so your mileage may vary. This is real philosophy of science and not the strained cries of “science” you see on most mailing list discussions.

I first saw this in a tweet by John Horgan.

Getting Started with S4, The Self-Service Semantic Suite

Filed under: Entity Resolution,Natural Language Processing,S4,Semantics,SPARQL — Patrick Durusau @ 7:15 pm

Getting Started with S4, The Self-Service Semantic Suite by Marin Dimitrov.

From the post:

Here’s how S4 developers can get started with The Self-Service Semantic Suite. This post provides you with practical information on the following topics:

  • Registering a developer account and generating API keys
  • RESTful services & free tier quotas
  • Practical examples of using S4 for text analytics and Linked Data querying

Ontotext is up front about the limitations on the “free” service:

  • 250 MB of text processed monthly (via the text analytics services)
  • 5,000 SPARQL queries monthly (via the LOD SPARQL service)

The number of pages in a megabyte of text varies depending on the content, but assuming a working average of one (1) megabyte = five hundred (500) pages of text, you can analyze up to one hundred and twenty-five thousand (125,000) pages of text a month. Chump change for serious NLP but it is a free account.

The post goes on to detail two scenarios:

  • Annotate a news document via the News analytics service
  • Send a simple SPARQL query to the Linked Data service

Learn how effective entity recognition and SPARQL are with data of interest to you, at a minimum of investment.
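
If you want a sense of what calling a RESTful annotation service of this kind looks like before signing up, here is a hedged sketch in Python. The endpoint URL, payload fields and authentication scheme below are placeholders of my own; consult Ontotext’s S4 documentation for the actual values.

```python
import requests

# Placeholder credentials and endpoint -- substitute the values from your S4
# developer account and Ontotext's documentation.
API_KEY = "your-api-key"
API_SECRET = "your-api-secret"
ENDPOINT = "https://example.ontotext.com/s4/news"  # hypothetical URL

payload = {
    "document": "Greenpeace flew its airship over the Bluffdale, UT, data center.",
    "documentType": "text/plain",  # hypothetical field names
}

resp = requests.post(ENDPOINT, json=payload, auth=(API_KEY, API_SECRET),
                     headers={"Accept": "application/json"})
resp.raise_for_status()
print(resp.json())  # expected: entities, types and offsets found in the text
```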

I first saw this in a tweet by Tony Agresta.

August 7, 2014

Ebola: “Highly contagious…” or…

Filed under: Semantics,Topic Maps — Patrick Durusau @ 2:30 pm

NPR has developed a disturbing range of semantics for the current Ebola crisis.

Consider these two reports, one on August 7th and one on August 2nd, 2014.

Aug. 7th: Officials Fear Ebola Will Spread Across Nigeria

Dave Greene – Reports on there only being two or three cases in Lagos, but Nigeria is declaring a state of emergency because Ebola is “…highly contagious….”

Aug. 2nd: Atlanta Hospital Prepares To Treat 2 Ebola Patients

Jim Burress – comments on a news conference at Emory:

“He downplayed any threat to public safety because the virus can only be spread through close contact with an infected person.”

To me, “highly contagious” and “close contact with an infected person” are worlds apart. Why the shift in semantics in only five days?

Curious if you have noticed this or other shifting semantics around the Ebola outbreak from other news outlets?

Not that I would advocate any one “true” semantic for the crisis but I wonder who would benefit from an Ebola-fear panic in Nigeria? Or who would benefit from no panic and a possible successful treatment for Ebola?

Working on the assumption that semantics vary depending on who benefits from a particular semantic.

Topic maps could help you “out” the beneficiaries. Or help you plan to conceal connections to the beneficiaries, depending upon your business model.


Update: A close friend pointed me to: FILOVIR: Scientific Resource for Research on Filoviruses. Website, twitter feed, etc. In case you are looking for a current feed of Ebola information, both public and professional.

July 31, 2014

How To Create Semantic Confusion

Filed under: Cypher,Neo4j,Semantics — Patrick Durusau @ 10:38 am

Merge: to cause (two or more things, such as two companies) to come together and become one thing : to join or unite (one thing) with another (http://www.merriam-webster.com/dictionary/merge)

Do you see anything in common between that definition of merge and:

  • It ensures that a pattern exists in the graph by creating it if it does not exist already
  • It will not use partially existing (unbound) patterns- it will attempt to match the entire pattern and create the entire pattern if missing
  • When unique constraints are defined, MERGE expects to find at most one node that matches the pattern
  • It also allows you to define what should happen based on whether data was created or matched

The quote is from Cypher MERGE Explained by Luanne Misquitta. Great post if you want to understand the operation of Cypher “merge,” which has nothing in common with the term “merge” in English.
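
If you want to see the behavior Luanne describes for yourself, here is a minimal sketch using the neo4j Python driver (connection details are placeholders). Run it twice: the second run matches the existing pattern instead of creating a duplicate, which is the graph-pattern sense of MERGE, not the English one.

```python
from neo4j import GraphDatabase

# Placeholder connection details for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # MERGE matches the whole pattern if it exists, otherwise creates it.
    session.run(
        "MERGE (p:Person {name: $name}) "
        "ON CREATE SET p.created = timestamp() "
        "ON MATCH SET p.lastSeen = timestamp()",
        name="Luanne",
    )

driver.close()
```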

Want to create semantic confusion?

Choose a well-known term and define new and unrelated semantics for it. That creates a demand for training and tutorials, as well as confused users.

I first saw this in a tweet by GraphAware.

July 30, 2014

Semantic Investment Needs A Balance Sheet Line Item

Filed under: Marketing,Semantics — Patrick Durusau @ 6:19 pm

The Hidden Shareholder Boost From Information Assets by Douglas Laney.

From the post:

It’s hard today not to see the tangible, economic benefits of information all around us: Walmart uses social media trend data to entice online shoppers to purchase 10 percent to 15 percent more stuff; Kraft spinoff Mondelez grew revenue by $100 million through improved in-store promotion configurations using detailed store, chain, product, stock and pricing data; and UPS saves more than $50 million, delivers 35 percent more packages per year and has doubled driver wages by continually collecting and analyzing more than 200 data points per truck along with GPS data to reduce accidents and miles driven.

Even businesses from small city zoos to mom-and-pop coffee shops to wineries are collecting, crushing and consuming data to yield palpable revenue gains or expense reductions. In addition, some businesses beyond the traditional crop of data brokers monetize their information assets directly by selling or trading them for goods or services.

Yet while as a physical asset, technology is easily given a value attribution and represented on balance sheets; information is treated as an asset also ran or byproduct of the IT department. Your company likely accounts for and manages your office furniture with greater discipline than your information assets. Why? Because accounting standards in place since the beginning of the information age more than 50 years ago continue to be based on 250-year-old Industrial Age realities. (emphasis in original)

Does your accounting system account for your investment in semantics?

Here are some ways to find out:

  • For any ETL project in the last year, can your accounting department detail how much was spent discovering the semantics of the ETL data?
  • For any data re-used for an ETL project in the last three years, can your accounting department detail how much was spent duplicating the work of the prior ETL?
  • Can your accounting department produce a balance sheet showing your current investment in the semantics of your data?
  • Can your accounting department produce a balance sheet showing the current value of your information?

If the answer is “no,” to any of those questions, is your accounting department meeting your needs in the information age?

Douglas has several tips for getting people’s attention for the $$$ you have invested in information.

Is information an investment or an unknown loss on your books?

July 23, 2014

Supplying Missing Semantics? (IKWIM)

Filed under: Semantics — Patrick Durusau @ 3:22 pm

Chris Ford, in Creating music with Clojure and Overtone, uses an example of a harmonic sound missing its first two harmonic components; yet when we hear it, our ears supply the missing components. Quite spooky when you first see it, but there is no doubt that the components are quite literally “missing” from the result.

Which makes me wonder, do we generally supply semantics, appropriately or inappropriately, to data?

Unless it is written in an unknown script, we “know” what data must mean, based on what we would mean by such data.

Using “data” in the broadest sense to include all recorded information.

Even unknown scripts don’t stop us from assigning our “meanings” to texts. I will have to run down some of the 17th century works on Egyptian Hieroglyphics at some point.

Entertaining and according to current work on historical Egyptian, not even close to what we now understand the texts to mean.

The “I know what it means” (IKWIM) syndrome may be the biggest single barrier to all semantic technologies. Capturing the semantics of texts is always an expensive proposition and if I already IKWIM, then why bother?

And if you do capture something I already know, it can then be shared with others. Another disincentive for capturing semantics.

To paraphrase a tweet I saw today by no-hacker-news

Why take 1 minute to document when others can waste a day guessing?

July 16, 2014

Which gene did you mean?

Filed under: Annotation,Semantics,Tagging — Patrick Durusau @ 3:38 pm

Which gene did you mean? by Barend Mons.

Abstract:

Computational Biology needs computer-readable information records. Increasingly, meta-analysed and pre-digested information is being used in the follow up of high throughput experiments and other investigations that yield massive data sets. Semantic enrichment of plain text is crucial for computer aided analysis. In general people will think about semantic tagging as just another form of text mining, and that term has quite a negative connotation in the minds of some biologists who have been disappointed by classical approaches of text mining. Efforts so far have tried to develop tools and technologies that retrospectively extract the correct information from text, which is usually full of ambiguities. Although remarkable results have been obtained in experimental circumstances, the wide spread use of information mining tools is lagging behind earlier expectations. This commentary proposes to make semantic tagging an integral process to electronic publishing.

From within the post:

If all words had only one possible meaning, computers would be perfectly able to analyse texts. In reality however, words, terms and phrases in text are highly ambiguous. Knowledgeable people have few problems with these ambiguities when they read, because they use context to disambiguate ‘on the fly’. Even when fed a lot of semantically sound background information, however, computers currently lag far behind humans in their ability to interpret natural language. Therefore, proper semantic tagging of concepts in texts is crucial to make Computational Biology truly viable. Electronic Publishing has so far only scratched the surface of what is needed.

Open Access publication shows great potential, and is essential for effective information mining, but it will not achieve its full potential if information continues to be buried in plain text. Having true semantic mark up combined with open access for mining is an urgent need to make possible a computational approach to life sciences.

Creating semantically enriched content as part and parcel of the publication process should be a winning strategy.

First, for current data, estimates of what others will be searching for should not be hard to find out. That will help focus tagging on the material users are seeking. Second, a current and growing base of enriched material will help answer questions about the return on enriching material.
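
As a rough sketch of what publication-time tagging could look like, here is a standoff-annotation toy in Python. The record format and the identifier are my own illustration, not a proposal from the paper; the point is that the ambiguity is resolved once, by the author, instead of repeatedly by every text-mining pipeline downstream.

```python
# The sentence is published as-is; a separate record pins the ambiguous
# token to an unambiguous identifier chosen by the author.
sentence = "Mutations in CAT are associated with acatalasemia."

annotations = [
    {
        "start": sentence.index("CAT"),
        "end": sentence.index("CAT") + len("CAT"),
        "surface": "CAT",
        # "CAT" could name a gene, an animal or a scan; the tag removes the ambiguity.
        "concept": "NCBIGene:847",  # illustrative identifier for the human catalase gene
        "type": "gene",
    }
]

for a in annotations:
    print(sentence[a["start"]:a["end"]], "->", a["concept"])
```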

Other suggestions for BMC Bioinformatics?

…[S]emantically enriched open pharmacological space…

Filed under: Bioinformatics,Biomedical,Drug Discovery,Integration,Semantics — Patrick Durusau @ 2:25 pm

Scientific competency questions as the basis for semantically enriched open pharmacological space development by Kamal Azzaoui, et al. (Drug Discovery Today, Volume 18, Issues 17–18, September 2013, Pages 843–852)

Abstract:

Molecular information systems play an important part in modern data-driven drug discovery. They do not only support decision making but also enable new discoveries via association and inference. In this review, we outline the scientific requirements identified by the Innovative Medicines Initiative (IMI) Open PHACTS consortium for the design of an open pharmacological space (OPS) information system. The focus of this work is the integration of compound–target–pathway–disease/phenotype data for public and industrial drug discovery research. Typical scientific competency questions provided by the consortium members will be analyzed based on the underlying data concepts and associations needed to answer the questions. Publicly available data sources used to target these questions as well as the need for and potential of semantic web-based technology will be presented.

Pharmacology may not be your space but this is a good example of what it takes for semantic integration of resources in a complex area.

Despite the “…you too can be a brain surgeon with our new web-based app…” from various sources, semantic integration has been, is and will remain difficult under the best of circumstances.

I don’t say that to discourage anyone but to avoid the let-down when integration projects don’t provide easy returns.

It is far better to plan for incremental and measurable benefits along the way than to fashion grandiose goals that are ever receding on the horizon.

I first saw this in a tweet by ChemConnector.

July 12, 2014

Dilemmas in a General Theory of Planning [“Wicked” Problems]

Filed under: Problem Solving,Semantics — Patrick Durusau @ 1:39 pm

Dilemmas in a General Theory of Planning by Horst W. J. Rittel and Melvin M. Webber.

Abstract:

The search for scientific bases for confronting problems of social policy is bound to fail, because of the nature of these problems. They are “wicked” problems, whereas science has developed to deal with “tame” problems. Policy problems cannot be definitively described. Moreover, in a pluralistic society there is nothing like the undisputable public good; there is no objective definition of equity; policies that respond to social problems cannot be meaningfully correct or false; and it makes no sense to talk about “optimal solutions” to social problems unless severe qualifications are imposed first. Even worse, there are no “solutions” in the sense of definitive and objective answers.

If you have heard the phrase, “wicked” problems, here is your chance to read the paper that originated that phrase.

Rittel and Webber identify ten (10) properties of wicked problems, allowing for more to exist:

  1. There is no definite formulation of a wicked problem
  2. Wicked problems have no stopping rule
  3. Solutions to wicked problems are not true-or-false, but good-or-bad
  4. There is no immediate and no ultimate test of a solution to a wicked problem
  5. Every solution to a wicked problem is a “one-shot operation”; because there is no opportunity to learn by trial-and-error, every attempt counts significantly
  6. Wicked problems do not have an enumerable (or an exhaustively describable) set of potential solutions, nor is there a well-described set of permissible operations that may be incorporated into the plan
  7. Every wicked problem is essentially unique
  8. Every wicked problem can be considered to be a symptom of another problem
  9. The existence of a discrepancy representing a wicked problem can be explained in numerous ways. The choice of explanation determines the nature of the problem’s resolution.
  10. The planner has no right to be wrong

Important paper to read. It will help you spot “tame” solutions and their assumptions when posed as answers to “wicked” problems.

I first saw this in a tweet by Chris Diehl.

July 9, 2014

SAMUELS [English Historical Semantic Tagger]

Filed under: Humanities,Linguistics,Semantics,Tagging — Patrick Durusau @ 1:13 pm

SAMUELS (Semantic Annotation and Mark-Up for Enhancing Lexical Searches)

From the webpage:

The SAMUELS project (Semantic Annotation and Mark-Up for Enhancing Lexical Searches) is funded by the Arts and Humanities Research Council in conjunction with the Economic and Social Research Council (grant reference AH/L010062/1) from January 2014 to April 2015. It will deliver a system for automatically annotating words in texts with their precise meanings, disambiguating between possible meanings of the same word, ultimately enabling a step-change in the way we deal with large textual data. It uses the Historical Thesaurus of English as its core dataset, and will provide for each word in a text the Historical Thesaurus reference code for that concept. Textual data tagged in this way can then be accurately searched and precisely investigated, producing results which can be automatically aggregated at a range of levels of precision. The project also draws on a series of research sub-projects which will employ the software thus developed, testing and validating the utility of the SAMUELS tagger as a tool for wide-ranging further research.
….

To really appreciate this project, visit SAMUELS English Semantic Tagger Test Site.

There you can enter up to 2000 English words and select lower/upper year boundaries!

Just picking a text at random, ;-), I chose:

Greenpeace flew its 135-foot-long thermal airship over the Bluffdale, UT, data center early Friday morning, carrying the message: “NSA Illegal Spying Below” along with a link steering people to a new web site, StandAgainstSpying.org, which the three groups launched with the support of a separate, diverse coalition of over 20 grassroots advocacy groups and Internet companies. The site grades members of Congress on what they have done, or often not done, to rein in the NSA.

Some terms and their Semtag3 codes by time period:

1500-1600:

  • congress: C09d01 [Sexual intercourse]; E07e16 [Inclination]; E08e12 [Movement towards a thing/person/position]
  • data: 04.10[Unrecognised]
  • thermal: 04.10[Unrecognised]
  • UT: 04.10[Unrecognised]
  • web: B06a07 [Disorders of eye/vision]; B22h08 [Class Arachnida (spiders, scorpions)]; B10 [Biological Substance];

1700-1800

  • congress: S06k17a [Diplomacy]; C09d01 [Sexual intercourse]; E07e16 [Inclination];
  • data: 04.10[Unrecognised]
  • thermal: 04.10[Unrecognised]
  • UT: 04.10[Unrecognised]
  • web: B06a07 [Disorders of eye/vision]; B22h08 [Class Arachnida (spiders, scorpions)]; B10 [Biological Substance];

1800-1900

  • congress: S06k17a [Diplomacy]; C09d01 [Sexual intercourse]; O07 [Conversation];
  • data: H55a [Attestation, witness, evidence];
  • thermal: A04b02 [Spring]; C09a [Sexual desire]; D03c02 [Heat];
  • UT: 04.10[Unrecognised]
  • web: B06a07 [Disorders of eye/vision]; B06d01 [Deformities of specific parts]; B25d [Tools and implements];

1900-2000

  • congress: S06k17a [Diplomacy]; C09d01 [Sexual intercourse]; O07 [Conversation];
  • data: F04v04 [Data]; H55a [Attestation, witness, evidence]; W05 [Information];
  • thermal: A04b02 [Spring]; B28b [Types/styles of clothing]; D03c02 [Heat];
  • UT: 04.10[Unrecognised]
  • web: B06d01 [Deformities of specific parts]; B22h08 [Class Arachnida (spiders, scorpions)]; B10 [Biological Substance];

2000-2014

  • congress: 04.10[Unrecognised]
  • data: 04.10[Unrecognised]
  • thermal: 04.10[Unrecognised]
  • UT: 04.10[Unrecognised]
  • web: 04.10[Unrecognised]

I am assuming that the “04.10[Unrecognised]” for all terms in 2000-2014 means there is no usage data for that time period.

I have never heard anyone deny that meanings of words change over time and domain.

What remains a mystery is why the value-add of documenting the meanings of words isn’t obvious.

I say “words,” but I should be saying “data.” Remember the loss of the $125 million Mars Climate Orbiter: one system read a value as “pounds of force” and another read the same data as “newtons.” In that scenario, ET doesn’t get to call home.
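
A toy sketch of that failure mode, with invented numbers but the real conversion factor, shows how cheap the missing documentation would have been:

```python
# One component emits a thrust impulse in pound-force seconds; another consumes
# it assuming newton-seconds. The numbers are invented, the factor is not.
LBF_TO_N = 4.4482216  # pound-force to newtons

def ground_software_output():
    # Computed in pound-force seconds, but the units live only in the author's head.
    return 12.4

def navigation_software(impulse_newton_seconds):
    # The consumer assumes SI units, as its interface specification required.
    return impulse_newton_seconds

sent = ground_software_output()        # 12.4 lbf*s, undocumented
used = navigation_software(sent)       # silently treated as 12.4 N*s
correct = sent * LBF_TO_N              # what should have been passed

print(f"used {used:.1f} N*s instead of {correct:.1f} N*s "
      f"(off by a factor of {correct / used:.2f})")
```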

So let’s rephrase the question to: Why isn’t the value-add of documenting the semantics of data obvious?

Suggestions?

July 2, 2014

Frege in Space:…

Filed under: Linguistics,Semantics — Patrick Durusau @ 10:09 am

Frege in Space: A Program of Compositional Distributional Semantics by Marco Baroni, Raffaela Bernardi, Roberto Zamparelli.

Abstract:

The lexicon of any natural language encodes a huge number of distinct word meanings. Just to understand this article, you will need to know what thousands of words mean. The space of possible sentential meanings is infinite: In this article alone, you will encounter many sentences that express ideas you have never heard before, we hope. Statistical semantics has addressed the issue of the vastness of word meaning by proposing methods to harvest meaning automatically from large collections of text (corpora). Formal semantics in the Fregean tradition has developed methods to account for the infinity of sentential meaning based on the crucial insight of compositionality, the idea that meaning of sentences is built incrementally by combining the meanings of their constituents. This article sketches a new approach to semantics that brings together ideas from statistical and formal semantics to account, in parallel, for the richness of lexical meaning and the combinatorial power of sentential semantics. We adopt, in particular, the idea that word meaning can be approximated by the patterns of co-occurrence of words in corpora from statistical semantics, and the idea that compositionality can be captured in terms of a syntax-driven calculus of function application from formal semantics.

At one hundred and ten (110) pages this is going to take a while to read and even longer to digest. What I have read so far is both informative and surprisingly, for the subject area, quite pleasant reading.
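
If you want a feel for the core move before committing to 110 pages, here is a toy sketch of the lexical-function idea the authors build on: nouns are distributional vectors, an adjective is a matrix (a function on noun vectors), and composition is matrix-vector multiplication. All numbers below are made up for illustration.

```python
import numpy as np

# Toy distributional vectors for nouns (made-up co-occurrence dimensions).
moon = np.array([1.0, 5.0, 0.5])
dog  = np.array([4.0, 0.5, 3.0])

# In the lexical-function approach an adjective is not a vector but a function,
# here a matrix learned from corpus phrases, mapping noun vectors to phrase vectors.
red = np.array([[0.9, 0.1, 0.0],
                [0.0, 1.0, 0.2],
                [0.1, 0.0, 1.1]])

red_moon = red @ moon  # composed meaning of "red moon"
red_dog  = red @ dog   # composed meaning of "red dog"
print(red_moon, red_dog)
```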

Thoughts about building up a subject identification by composition?

Enjoy!

I first saw this in a tweet by Stefano Bertolo.

May 13, 2014

Bringing machine learning and compositional semantics together

Filed under: Machine Learning,Semantics — Patrick Durusau @ 6:24 pm

Bringing machine learning and compositional semantics together by Percy Liang and Christopher Potts.

Abstract:

Computational semantics has long been seen as a field divided between logical and statistical approaches, but this divide is rapidly eroding, with the development of statistical models that learn compositional semantic theories from corpora and databases. This paper presents a simple discriminative learning framework for defining such models and relating them to logical theories. Within this framework, we discuss the task of learning to map utterances to logical forms (semantic parsing) and the task of learning from denotations with logical forms as latent variables. We also consider models that use distributed (e.g., vector) representations rather than logical ones, showing that these can be seen as part of the same overall framework for understanding meaning and structural complexity.

My interest is in how computational semantics can illuminate issues in semantics. It has been my experience that the transition from natural language to more formal (and less robust) representations draws out semantic issues, such as ambiguity, that lurk unnoticed in natural language texts.

With right at seven pages of references, you will have no shortage of reading material on compositional semantics.

I first saw this in a tweet by Chris Brockett.

May 8, 2014

Avoid Philosophy?

Filed under: Humanities,Philosophy,Semantics — Patrick Durusau @ 2:00 pm

Why Neil deGrasse Tyson is a philistine by Damon Linker.

From the post:

Neil deGrasse Tyson may be a gifted popularizer of science, but when it comes to humanistic learning more generally, he is a philistine. Some of us suspected this on the basis of the historically and theologically inept portrayal of Giordano Bruno in the opening episode of Tyson’s reboot of Carl Sagan’s Cosmos.

But now it’s been definitively demonstrated by a recent interview in which Tyson sweepingly dismisses the entire history of philosophy. Actually, he doesn’t just dismiss it. He goes much further — to argue that undergraduates should actively avoid studying philosophy at all. Because, apparently, asking too many questions “can really mess you up.”

Yes, he really did say that. Go ahead, listen for yourself, beginning at 20:19 — and behold the spectacle of an otherwise intelligent man and gifted teacher sounding every bit as anti-intellectual as a corporate middle manager or used-car salesman. He proudly proclaims his irritation with “asking deep questions” that lead to a “pointless delay in your progress” in tackling “this whole big world of unknowns out there.” When a scientist encounters someone inclined to think philosophically, his response should be to say, “I’m moving on, I’m leaving you behind, and you can’t even cross the street because you’re distracted by deep questions you’ve asked of yourself. I don’t have time for that.”

“I don’t have time for that.”

With these words, Tyson shows he’s very much a 21st-century American, living in a perpetual state of irritated impatience and anxious agitation. Don’t waste your time with philosophy! (And, one presumes, literature, history, the arts, or religion.) Only science will get you where you want to go! It gets results! Go for it! Hurry up! Don’t be left behind! Progress awaits!

There are many ways to respond to this indictment. One is to make the case for progress in philosophical knowledge. This would show that Tyson is wrong because he fails to recognize the real advances that happen in the discipline of philosophy over time.

….

I remember thinking the first episode of Tyson’s Cosmos was rather careless with its handling of Bruno and the Enlightenment. But at the time I thought that was due to it being a “popular” presentation and not meant to be precise in every detail.

Damon has an excellent defense of philosophy and for that you should read his post.

I have a more pragmatic reason for recommending both philosophy in particular and the humanities in general to CS majors: asking “deep questions” will cost you far less time than the programming effort you will waste without them.

For example, why have intelligent, even gifted, CS types tried repeatedly to solve the issues of naming by proposing universal naming systems?

You don’t have to be very aware to know that naming systems are like standards. If you don’t like this one, make up another one.

That being the case, what makes anyone think their naming system will displace all others for any significant period of time? Considering there has never been a successful one.

Oh, I forgot: if you don’t know any philosophy (one place this issue gets discussed) or the humanities in general, you won’t be exposed to the long history of discussions of language and naming, and the failures recorded there.

I would urge CS types to read and study both philosophy and the humanities for purely pragmatic reasons. CS pioneers were able to write the first FORTRAN compiler not because they had taken a compiler MOOC but because they had studied mathematics, linguistics, language, philosophy, history, etc.

Are you a designer (CS pioneers were) or are you a mechanic?

PS: If you are seriously interested in naming issues, my first suggestion would be to read The Search for the Perfect Language by Umberto Eco. It’s not all that needs to be read in this area but it is easily accessible.

I first saw this in a tweet by Christopher Phipps.

May 5, 2014

…immediate metrical meaning

Filed under: Metric Spaces,Semantics — Patrick Durusau @ 2:49 pm

Topology Fact tweeted today:

‘It’s not so easy to free oneself from the idea that coordinates must have an immediate metrical meaning.’ — Albert Einstein

In searching for that quote I found:

The simple fact is that in general relativity, coordinates are essentially arbitrary systems of markers chosen to distinguish one event from another. This gives us great freedom in how we define coordinates…. The relationship between the coordinate differences separating events and the corresponding intervals of time or distance that would be measured by a specified observer must be worked out using the metric of the spacetime. (Relativity, Gravitation and Cosmology by Robert J. A. Lambourne, page 155)

Let’s re-write the first sentence by Lambourne to read:

The simple fact is that in semantics, terms are essentially arbitrary systems of markers chosen to distinguish one semantic from another.

Just to make clear that sets of terms have no external metric of semantic distance or closeness that separate them.

And re-write the second sentence to read:

The relationship between the term separating semantics and the corresponding semantic intervals would be measured by a specified observer.

I have omitted some words and added others to emphasize that “semantic intervals” have no metric other than as assigned and observed by some specified observer.

True, the original quote goes on to say: “…using the metric of the spacetime.” But spacetime has a generally accepted metric that has proven itself both accurate and useful since the early 20th century. So far as I know, despite contentions to the contrary, there is no similar metric for semantics.

In particular there is no general semantic metric that obtains across all observers.
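
For reference, here are the textbook axioms a genuine metric d would have to satisfy (a standard definition, not something from the Einstein or Lambourne quotes). Nothing about terms and their meanings hands us such a d, so any semantic “distance” in use was chosen by some observer.

```latex
\begin{align*}
  d(x,y) &\ge 0, \quad d(x,y) = 0 \iff x = y && \text{(non-negativity, identity)} \\
  d(x,y) &= d(y,x)                           && \text{(symmetry)} \\
  d(x,z) &\le d(x,y) + d(y,z)                && \text{(triangle inequality)}
\end{align*}
```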

Something to bear in mind when semantic distances are being calculated with great “precision” between terms. Most pocket calculators can be fairly precise. But being precise isn’t the same thing as being correct.

April 29, 2014

Question-answering system and method based on semantic labeling…

Filed under: Patents,Semantics — Patrick Durusau @ 6:57 pm

Question-answering system and method based on semantic labeling of text documents and user questions

From the patent:

A question-answering system for searching exact answers in text documents provided in the electronic or digital form to questions formulated by user in the natural language is based on automatic semantic labeling of text documents and user questions. The system performs semantic labeling with the help of markers in terms of basic knowledge types, their components and attributes, in terms of question types from the predefined classifier for target words, and in terms of components of possible answers. A matching procedure makes use of mentioned types of semantic labels to determine exact answers to questions and present them to the user in the form of fragments of sentences or a newly synthesized phrase in the natural language. Users can independently add new types of questions to the system classifier and develop required linguistic patterns for the system linguistic knowledge base.

Another reason to hope the United States Supreme Court goes nuclear on processes and algorithms.

That’s not an opinion on this patent but on the cloud that all process/algorithm patents cast on innovation.

I first saw this at: IHS Granted Key Patent for Proprietary, Next-Generation Search Technology by Angela Guess.

April 17, 2014

Reproducible Research/(Mapping?)

Filed under: Mapping,Research Methods,Science,Semantics,Topic Maps — Patrick Durusau @ 2:48 pm

Implementing Reproducible Research edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng.

From the webpage:

In many of today’s research fields, including biomedicine, computational tools are increasingly being used so that the results can be reproduced. Researchers are now encouraged to incorporate software, data, and code in their academic papers so that others can replicate their research results. Edited by three pioneers in this emerging area, this book is the first one to explore this groundbreaking topic. It presents various computational tools useful in research, including cloud computing, data repositories, virtual machines, R’s Sweave function, XML-based programming, and more. It also discusses legal issues and case studies.

There is a growing concern over the ability of scientists to reproduce the published results of other scientists. The Economist rang one of many alarm bells when it published: Trouble at the lab [Data Skepticism].

From the introduction to Reproducible Research:

Literate statistical programming is a concept introduced by Rossini () that builds on the idea of literate programming as described by Donald Knuth. With literate statistical programming, one combines the description of a statistical analysis and the code for doing the statistical analysis into a single document. Subsequently, one can take the combined document and produce either a human-readable document (i.e. PDF) or a machine readable code file. An early implementation of this concept was the Sweave system of Leisch which uses R as its programming language and LATEX as its documentation language (). Yihui Xie describes his knitr package which builds substantially on Sweave and incorporates many new ideas developed since the initial development of Sweave. Along these lines, Tanu Malik and colleagues describe the Science Object Linking and Embedding framework for creating interactive publications that allow authors to embed various aspects of computational research in document, creating a complete research compendium.

Of course, we all cringe when we read that a drug company can reproduce only 1/4 of 67 “seminal” studies.

What has me curious is why we don’t have the same reaction when enterprise IT systems require episodic remapping, which requires the mappers to relearn what was known at the time of the last remapping. We all know that enterprise (and other) IT systems change and evolve, but practically speaking, no effort is made to capture the knowledge that would reduce the time, cost and expense of every future remapping.

We can see the expense and danger of science not being reproducible, but when our own enterprise data mappings are not reproducible, that’s just the way things are.

Take inspiration from the movement towards reproducible science and work towards reproducible semantic mappings.

I first saw this in a tweet by Victoria Stodden.

April 13, 2014

Will Computers Take Your Job?

Filed under: Data Analysis,Humor,Semantics — Patrick Durusau @ 10:42 am

Probability that computers will take away your job posted by Jure Leskovec.

[Chart: probability that computers will take away your job, by occupation]

For your further amusement, I recommend the full study, “The Future of Employment: How Susceptible are Jobs to Computerisation?” by C. Frey and M. Osborne (2013).

The lower the number, the less likely for computer replacement:

  • Logisticians – #55, more replaceable than Rehabilitation Counselors at #47.
  • Computer and Information Research Scientists – #69, more replaceable than Public Relations and Fundraising Managers at #67. (Sorry Don.)
  • Astronomers – #128, more replaceable than Credit Counselors at #126.
  • Dancers – #179? I’m not sure the authors have even seen Paula Abdul dance.
  • Computer Programmers – #293, more replaceable than Historians at #283.
  • Bartenders – #422. Have you ever told a sad story to a coin-operated vending machine?
  • Barbers – #439. Admittedly I only see barbers at a distance but if I wanted one, I would prefer a human one.
  • Technical Writers – #526. The #1 reason why technical documentation is so poor. Technical writers are underappreciated and treated like crap. Good technical writing should be less replaceable by computers than Lodging Managers at #12.
  • Tax Examiners and Collectors, and Revenue Agents – #586. Stop cheering so loudly. You are frightening other cube dwellers.
  • Umpires, Referees, and Other Sports Officials – #684. Now cheer loudly! 😉

If the results strike you as odd, consider this partial description of the approach taken to determine if a job could be taken over by a computer:

First, together with a group of ML researchers, we subjectively hand-labelled 70 occupations, assigning 1 if automatable, and 0 if not. For our subjective assessments, we draw upon a workshop held at the Oxford University Engineering Sciences Department, examining the automatability of a wide range of tasks. Our label assignments were based on eyeballing the O∗NET tasks and job description of each occupation. This information is particular to each occupation, as opposed to standardised across different jobs. The hand-labelling of the occupations was made by answering the question “Can the tasks of this job be sufficiently specified, conditional on the availability of big data, to be performed by state of the art computer-controlled equipment”. Thus, we only assigned a 1 to fully automatable occupations, where we considered all tasks to be automatable. To the best of our knowledge, we considered the possibility of task simplification, possibly allowing some currently non-automatable tasks to be automated. Labels were assigned only to the occupations about which we were most confident. (at page 30)

Not to mention that occupations were considered for automation on the basis of nine (9) variables.

Would you believe that semantics isn’t mentioned once in this paper? So now you know why I have issues with its methodology and conclusions. What do you think?

March 26, 2014

Big Data: Humans Required

Filed under: BigData,Semantics — Patrick Durusau @ 10:35 am

Big Data: Humans Required by Sherri Hammons.

From the post:

These simple examples outline the heart of the problem with data: interpretation. Data by itself is of little value. It is only when it is interpreted and understood that it begins to become information. GovTech recently wrote an article outlining why search engines will not likely replace actual people in the near future. If it were merely a question of pointing technology at the problem, we could all go home and wait for the Answer to Everything. But, data doesn’t happen that way. Data is very much like a computer: it will do just as it’s told. No more, no less. A human is required to really understand what data makes sense and what doesn’t. But, even then, there are many failed projects.

See Sherri’s post for a conversation overheard and a list of big data fallacies.

The same point has been made before but Sherri’s is a particularly good version of it.

Since it’s not news, at least to anyone who has been paying attention in the 20th – 21st century, the question becomes why do we keep making that same mistake over and over again?

That is, relying on computers for “the answer” rather than asking humans to set up the problem for a computer and to interpret the results.

Just guessing but I would say it has something to do with our wanting to avoid relying on other people. That in some manner, we are more independent, powerful, etc. if we can rely on machines instead of other people.

Here’s one example: Once upon a time if I wanted to hear Stabat Mater I would have to attend a church service and participate in its singing. In an age of iPods and similar devices, I can enjoy it in a cone of music that isolates me from my physical surrounding and people around me.

Nothing wrong with recorded music, but the transition from a communal, participatory setting to being in a passive, self-chosen sound cocoon seems lossy to me.

Can we say the current fascination with “big data” and the exclusion of people is also lossy?

Yes?

I first saw this in Nat Torkington’s Four short links: 18 March 2014.

March 13, 2014

Audit Trails Anyone?

Filed under: Auditing,BigData,Semantics — Patrick Durusau @ 2:44 pm

Instrumenting collaboration tools used in data projects: Built-in audit trails can be useful for reproducing and debugging complex data analysis projects by Ben Lorica.

From the post:

As I noted in a previous post, model building is just one component of the analytic lifecycle. Many analytic projects result in models that get deployed in production environments. Moreover, companies are beginning to treat analytics as mission-critical software and have real-time dashboards to track model performance.

Once a model is deemed to be underperforming or misbehaving, diagnostic tools are needed to help determine appropriate fixes. It could well be models need to be revisited and updated, but there are instances when underlying data sources and data pipelines are what need to be fixed. Beyond the formal systems put in place specifically for monitoring analytic products, tools for reproducing data science workflows could come in handy.

Ben goes on to suggest that an “activity log” is a great idea for capturing a work flow for later analysis/debugging. And so it is, but I would go one step further and capture some of the semantics of the work flow.
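
As a hedged sketch of what that “one step further” might look like, here is an activity log that records a little semantics alongside each step. The field names and the JSON-lines format are my own choices, not Ben’s.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("workflow")

def log_step(step, inputs, outputs, semantics):
    """Record not just what ran, but what the fields were understood to mean."""
    log.info(json.dumps({
        "time": time.time(),
        "step": step,
        "inputs": inputs,
        "outputs": outputs,
        "semantics": semantics,  # the part the "cheat sheet" never captured
    }))

log_step(
    step="monthly_report",
    inputs=["sales_2014_03.csv"],
    outputs=["report_2014_03.pdf"],
    semantics={"revenue": "gross revenue in USD, excluding returns",
               "region": "two-letter US state code"},
)
```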

I knew a manager who had a “cheat sheet” of report writer jobs to run every month. They would pull the cheat sheet, enter the commands and produce the report. They were a roadblock to ever changing the system because then the “cheat sheet” would not work.

I am sure none of you have ever encountered the same situation. But I have seen it in at least one case.

March 4, 2014

Semantics in Support of Biodiversity Knowledge Discovery:…

Filed under: Ontology,Semantics — Patrick Durusau @ 3:08 pm

Semantics in Support of Biodiversity Knowledge Discovery: An Introduction to the Biological Collections Ontology and Related Ontologies by Walls RL, Deck J, Guralnick R, Baskauf S, Beaman R, et al. PLoS ONE 9(3): e89606 (2014). doi:10.1371/journal.pone.0089606.

Abstract:

The study of biodiversity spans many disciplines and includes data pertaining to species distributions and abundances, genetic sequences, trait measurements, and ecological niches, complemented by information on collection and measurement protocols. A review of the current landscape of metadata standards and ontologies in biodiversity science suggests that existing standards such as the Darwin Core terminology are inadequate for describing biodiversity data in a semantically meaningful and computationally useful way. Existing ontologies, such as the Gene Ontology and others in the Open Biological and Biomedical Ontologies (OBO) Foundry library, provide a semantic structure but lack many of the necessary terms to describe biodiversity data in all its dimensions. In this paper, we describe the motivation for and ongoing development of a new Biological Collections Ontology, the Environment Ontology, and the Population and Community Ontology. These ontologies share the aim of improving data aggregation and integration across the biodiversity domain and can be used to describe physical samples and sampling processes (for example, collection, extraction, and preservation techniques), as well as biodiversity observations that involve no physical sampling. Together they encompass studies of: 1) individual organisms, including voucher specimens from ecological studies and museum specimens, 2) bulk or environmental samples (e.g., gut contents, soil, water) that include DNA, other molecules, and potentially many organisms, especially microbes, and 3) survey-based ecological observations. We discuss how these ontologies can be applied to biodiversity use cases that span genetic, organismal, and ecosystem levels of organization. We argue that if adopted as a standard and rigorously applied and enriched by the biodiversity community, these ontologies would significantly reduce barriers to data discovery, integration, and exchange among biodiversity resources and researchers.

I want to call to your attention a great description of the current state of biodiversity data:

Assembling the data sets needed for global biodiversity initiatives remains challenging. Biodiversity data are highly heterogeneous, including information about organisms, their morphology and genetics, life history and habitats, and geographical ranges. These data almost always either contain or are linked to spatial, temporal, and environmental data. Biodiversity science seeks to understand the origin, maintenance, and function of this variation and thus requires integrated data on the spatiotemporal dynamics of organisms, populations, and species, together with information on their ecological and environmental context. Biodiversity knowledge is generated across multiple disciplines, each with its own community practices. As a consequence, biodiversity data are stored in a fragmented network of resource silos, in formats that impede integration. The means to properly describe and interrelate these different data sources and types is essential if such resources are to fulfill their potential for flexible use and re-use in a wide variety of monitoring, scientific, and policy-oriented applications [5]. (From the introduction)

Contrast that with the final claim in the abstract:

We argue that if adopted as a standard and rigorously applied and enriched by the biodiversity community, these ontologies would significantly reduce barriers to data discovery, integration, and exchange among biodiversity resources and researchers. (emphasis added)

I am very confident that both of those statements, from the introduction and from the abstract, are as true as human speakers can achieve.

However, it seems highly unlikely that these ontologies will displace an unknown number of communities of practice, which vary even within disciplines, to say nothing of between disciplines. Not to mention the need to plan for the fate of data from soon-to-be-previous community practices.

Or perhaps I should observe that such a displacement has never happened. True, over time a community of practice may die, only to be replaced by another one but I take that as different in kind from an artificial construct that is made by one group and urged upon all others.

Think of it this way, what if the top 100 members of the biodiversity community kept their current community practices but used these ontologies as conversion targets? Followers of those various members could use their community leader’s practice as their conversion target. Reasoning it is easier to follow someone in your own community.

Rather than arguments about those ontologies that will outlast the ontologies themselves, treat them as convenient conversion targets: once a basis for mapping is declared, conversion to any other target becomes immeasurably easier.
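
A toy sketch of the conversion-target idea (the local field names and the ontology identifiers below are placeholders, not real BCO or OBI terms):

```python
# Each community keeps its own practice; it only declares a mapping to the target.
community_a = {"specimen_no": "BCO:0000001", "collected_on": "BCO:0000002"}  # placeholder IDs
community_b = {"voucher_id": "BCO:0000001", "date_collected": "BCO:0000002"}

def convert(record, mapping):
    """Re-key a record from local field names to the shared conversion target."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}

rec_a = {"specimen_no": "A-123", "collected_on": "2014-03-01"}
rec_b = {"voucher_id": "B-456", "date_collected": "2014-03-02"}

# Once both communities map to the same target, their records line up,
# without either community abandoning its own practice.
print(convert(rec_a, community_a))
print(convert(rec_b, community_b))
```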

Reducing the semantic friction inherent in conversion to an ontology or data format is an investment in the future.

Battling semantic friction for a conversion to an ontology or data format is an investment you will make over and over again.

February 26, 2014

RDF 1.1: On Semantics of RDF Datasets

Filed under: RDF,Semantics — Patrick Durusau @ 9:51 am

RDF 1.1: On Semantics of RDF Datasets

Abstract:

RDF defines the concept of RDF datasets, a structure composed of a distinguished RDF graph and zero or more named graphs, being pairs comprising an IRI or blank node and an RDF graph. While RDF graphs have a formal model-theoretic semantics that determines what arrangements of the world make an RDF graph true, no agreed formal semantics exists for RDF datasets. This document presents some issues to be addressed when defining a formal semantics for datasets, as they have been discussed in the RDF 1.1 Working Group, and specify several semantics in terms of model theory, each corresponding to a certain design choice for RDF datasets.

I can see how not knowing the semantics of a dataset could be problematic.
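
For readers who have not looked at one lately, here is a minimal sketch of the structure under discussion, built with rdflib (the graph name and triples are invented): one distinguished default graph plus named graphs, with nothing in the structure itself saying what a named graph asserts.

```python
from rdflib import Dataset, Literal, Namespace, URIRef

EX = Namespace("http://example.org/")
ds = Dataset()

# Triples added to the dataset itself land in the distinguished default graph.
ds.add((EX.alice, EX.says, Literal("Hello")))

# A named graph: the (IRI, graph) pair the abstract refers to.
claims = ds.graph(URIRef("http://example.org/graphs/claims"))
claims.add((EX.alice, EX.age, Literal(42)))

# The structure is explicit; whether the named graph is asserted, quoted, or
# merely a provenance container is exactly the question the document leaves open.
print(ds.serialize(format="trig"))
```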

What puzzles me about this particular effort is that it appears to be an attempt to define the semantics of RDF datasets for others. Yes?

That activity depends upon semantics being inherent in an RDF dataset so that everyone can “discover” the same semantics or that such uniform semantics can be conferred upon an RDF dataset by decree.

The first possibility, that RDF datasets have an inherent semantic, need not delay us, as this activity started because different people saw different semantics in RDF datasets. That alone is sufficient to defeat any proposal based on “inherent” semantics.

The second possibility, that of defining and conferring semantics, seems equally problematic to me.

In part because there is no enforcement mechanism that can prevent users of RDF datasets from assigning any semantic they like to a dataset.

This remains important work, but I would change the emphasis to defining what this group considers to be the semantics of RDF datasets and a mechanism to allow others to signal their agreement with it for a particular dataset.

That has the advantage of other users being able to adopt wholesale an entire set of semantics for an RDF dataset. Which hopefully reflects the semantics with which it should be processed.

Declaring semantics may help avoid users silently using inconsistent semantics for the same datasets.

February 19, 2014

SEMANTiCS Conference

Filed under: Conferences,Semantics — Patrick Durusau @ 2:13 pm

SEMANTiCS Conference, Leipzig, Germany.

Important Dates:

Papers Submissions: May 30, 2014

Notification: June 27, 2014

Camera-Ready: July 14, 2014

Conference 4th – 5th September 2014

From the webpage:

The annual SEMANTiCS conference (formerly known as I-Semantics) is the meeting place for professionals who make semantic computing work, and understand its benefits and know its limitations. Every year, SEMANTiCS attracts information managers, IT-architects, software engineers, and researchers, from organisations ranging from NPOs, public administrations to the largest companies in the world.

I don’t know this conference but its being held in Leipzig would tip the balance for me.

February 12, 2014

John von Neumann and the Barrier to Universal Semantics

Filed under: Semantics,Topic Maps — Patrick Durusau @ 8:39 pm

Chris Boshuizen posted an image of a letter by John von Neumann “…lamenting that people don’t read other’s code, in 1952!”

von Neumann writes:

The subject mentioned by Stone is not an easy one. Plans to standardize and publish code of various groups have been made in the past, and they have not been very successful so far. The difficulty is that most people who have been active in this field seem to believe that it is easier to write new code than to understand an old one. This is probably exaggerated, but it is certainly true that the process of understanding a code practically involves redoing it de novo. The situation is not very unlike the one that existed in formal logics over a long period, where every new author invented a new symbolism. It took several decades until a few of these found wider acceptance, at least within limited groups. In the case of computing machine codes, the situation is even more difficult, since all formal logics refer, at least ideally, to the same substratum, whereas the machine codes frequently refer to physically different machines. (emphasis added)

To reword von Neumann slightly: whereas semantics refer to the perceptions of physically different people.

Yes?

Non-adoption of RDF or OWL isn’t a reflection on their capabilities or syntax. Rather it reflects that the vast majority of users don’t see the world as presented by RDF or OWL.

Since it is more difficult to learn a way other than your own, inertia favors whatever system you presently follow.

None of that is to deny or minimize the benefits of integrating information from various viewpoints. But a starting premise that users need to change their world views to X is a non-starter if the goal is integration of information from different viewpoints.

My suggestion is that we start where users are today, with their languages, their means of identification, their subjects as it were. How to do that has as many answers as there are users with goals and priorities, which will make the journey all the more interesting and enjoyable.
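As a toy sketch of that starting point, here is one way (plain Python; the identifiers and labels are for illustration only) to keep each community’s own identifiers intact while recording which of them are judged to name the same subject:

```python
# A toy sketch: keep each community's own identifiers and merge records
# judged to name the same subject, instead of forcing one world view.
# The identifiers and labels below are for illustration only.
subject_proxies = [
    {"identifiers": {"mesh:D009203", "icd10:I21", "local:heart-attack"},
     "labels": {"en": "myocardial infarction", "lay": "heart attack"}},
]

def find_proxy(identifier):
    """Return the proxy, if any, that already carries this identifier."""
    for proxy in subject_proxies:
        if identifier in proxy["identifiers"]:
            return proxy
    return None

# A new source arrives with its own identifiers; merge rather than replace.
incoming = {"snomed:22298006", "icd10:I21"}
match = None
for ident in incoming:
    match = find_proxy(ident)
    if match:
        break

if match:
    match["identifiers"] |= incoming
else:
    subject_proxies.append({"identifiers": set(incoming), "labels": {}})

print(subject_proxies[0]["identifiers"])
```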

February 4, 2014

Semantics of Business Vocabulary and Business Rules

Filed under: Business Intelligence,Semantics,Vocabularies — Patrick Durusau @ 4:52 pm

Semantics of Business Vocabulary and Business Rules

From 1.2 Applicability:

The SBVR specification is applicable to the domain of business vocabularies and business rules of all kinds of business activities in all kinds of organizations. It provides an unambiguous, meaning-centric, multilingual, and semantically rich capability for defining meanings of the language used by people in an industry, profession, discipline, field of study, or organization.

This specification is conceptualized optimally for business people rather than automated processing. It is designed to be used for business purposes, independent of information systems designs to serve these business purposes:

  • Unambiguous definition of the meaning of business concepts and business rules, consistently across all the terms, names and other representations used to express them, and across the natural languages in which those representations are expressed, so that they are not easily misunderstood either by “ordinary business people” or by lawyers.
  • Expression of the meanings of concepts and business rules in the wordings used by business people, who may belong to different communities, so that each expression wording is uniquely associated with one meaning in a given context.
  • Transformation of the meanings of concepts and business rules as expressed by humans into forms that are suitable to be processed by tools, and vice versa.
  • Interpretation of the meanings of concepts and business rules in order to discover inconsistencies and gaps within an SBVR Content Model (see 2.4) using logic-based techniques.
  • Application of the meanings of concepts and business rules to real-world business situations in order to enable reproducible decisions and to identify conformant and non-conformant business behavior.
  • Exchange of the meanings of concepts and business rules between humans and tools as well as between tools without losing information about the essence of those meanings.

I do need to repeat their warning from 6.2 How to Read this Specification:

This specification describes a vocabulary, or actually a set of vocabularies, using terminological entries. Each entry includes a definition, along with other specifications such as notes and examples. Often, the entries include rules (necessities) about the particular item being defined.

The sequencing of the clauses in this specification reflects the inherent logical order of the subject matter itself. Later clauses build semantically on the earlier ones. The initial clauses are therefore rather ‘deep’ in terms of SBVR’s grounding in formal logics and linguistics. Only after these clauses are presented do clauses more relevant to day-to-day business communication and business rules emerge.

This overall form of presentation, essential for a vocabulary standard, unfortunately means the material is rather difficult to approach. A figure presented for each sub-vocabulary does help illustrate its structure; however, no continuous ‘narrative’ or explanation is appropriate.

😉

OK, so you aren’t going to read it for giggles. But you will be encountering it in the wild world of data, so at least mark the reference.
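Purely as a rough illustration (this is not SBVR notation and is not drawn from the specification), a terminological entry of the kind the specification describes carries at least a designation, a definition, and necessities about the concept. A minimal sketch:

```python
# A rough, non-normative sketch of the ingredients of a terminological
# entry: a designation, a definition, and rules (necessities) about it.
# The vocabulary content here is illustrative only.
entry = {
    "designation": "rental agreement",
    "definition": "contract under which a rental car is provided to a renter",
    "synonyms": ["rental contract"],
    "necessities": [
        "Each rental agreement is entered into by exactly one renter.",
    ],
    "context": {"community": "EU-Rent", "language": "en"},
}

def expressions_for(entry):
    """All wordings that, in this community and language, express the concept."""
    return [entry["designation"], *entry["synonyms"]]

print(expressions_for(entry))
```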

I first saw this in a tweet by Stian Danenbarger.

January 31, 2014

…only the information that they can ‘see’…

Filed under: Natural Language Processing,Semantics,Topic Maps — Patrick Durusau @ 1:31 pm

Jumping NLP Curves: A Review of Natural Language Processing Research by Erik Cambria and Bebo White.

From the post:

Natural language processing (NLP) is a theory-motivated range of computational techniques for the automatic analysis and representation of human language. NLP research has evolved from the era of punch cards and batch processing (in which the analysis of a sentence could take up to 7 minutes) to the era of Google and the likes of it (in which millions of webpages can be processed in less than a second). This review paper draws on recent developments in NLP research to look at the past, present, and future of NLP technology in a new light. Borrowing the paradigm of ‘jumping curves’ from the field of business management and marketing prediction, this survey article reinterprets the evolution of NLP research as the intersection of three overlapping curves – namely Syntactics, Semantics, and Pragmatics Curves – which will eventually lead NLP research to evolve into natural language understanding.

This is not your average review of the literature as the authors point out:

…this review paper focuses on the evolution of NLP research according to three different paradigms, namely: the bag-of-words, bag-of-concepts, and bag-of-narratives models.

But what caught my eye was:

All such capabilities are required to shift from mere NLP to what is usually referred to as natural language understanding (Allen, 1987). Today, most of the existing approaches are still based on the syntactic representation of text, a method which mainly relies on word co-occurrence frequencies. Such algorithms are limited by the fact that they can process only the information that they can ‘see’. As human text processors, we do not have such limitations as every word we see activates a cascade of semantically related concepts, relevant episodes, and sensory experiences, all of which enable the completion of complex NLP tasks – such as word-sense disambiguation, textual entailment, and semantic role labeling – in a quick and effortless way. (emphasis added)

The phrase, “only the information that they can `see’” captures the essence of the problem that topic maps address. A program can only see the surface of a text, nothing more.

The next phrase summarizes the promise of topic maps, to capture “…a cascade of semantically related concepts, relevant episodes, and sensory experiences…” related to a particular subject.

Not that any topic map could capture the full extent of information related to any subject, but it can capture information to the extent plausible and useful.
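To make the “surface only” point concrete, here is a tiny Python illustration: a co-occurrence count happily treats every occurrence of “bank” as the same token, whether it borders a river or holds deposits.

```python
# A tiny illustration of "seeing only the surface": a co-occurrence
# count treats every occurrence of "bank" as the same token, whether
# it sits by a river or approves loans.
from collections import Counter
from itertools import combinations

sentences = [
    "the fisherman sat on the river bank",
    "the bank approved the loan",
]

cooccur = Counter()
for sentence in sentences:
    words = sorted(set(sentence.split()))
    for a, b in combinations(words, 2):
        cooccur[(a, b)] += 1

# "bank" co-occurs with both "river" and "loan", but the counts carry
# no notion that these are different subjects.
print([(pair, n) for pair, n in cooccur.items() if "bank" in pair])
```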

I first saw this in a tweet by Marin Dimitrov.

December 29, 2013

How semantic search is killing the keyword

Filed under: Searching,Semantic Search,Semantics,Topic Maps — Patrick Durusau @ 2:23 pm

How semantic search is killing the keyword by Rich Benci.

From the post:

Keyword-driven results have dominated search engine results pages (SERPs) for years, and keyword-specific phrases have long been the standard used by marketers and SEO professionals alike to tailor their campaigns. However, Google’s major new algorithm update, affectionately known as Hummingbird because it is “precise and fast,” is quietly triggering a wholesale shift towards “semantic search,” which focuses on user intent (the purpose of a query) instead of individual search terms (the keywords in a query).

Attempts have been made (in the relatively short history of search engines) to explore the value of semantic results, which address the meaning of a query, rather than traditional results, which rely on strict keyword adherence. Most of these efforts have ended in failure. However, Google’s recent steps have had quite an impact in the internet marketing world. Google began emphasizing the importance of semantic search by showcasing its Knowledge Graph, a clear sign that search engines today (especially Google) care a lot more about displaying predictive, relevant, and more meaningful sites and web pages than ever before. This “graph” is a massive mapping system that connects real-world people, places, and things that are related to each other and that bring richer, more relevant results to users. The Knowledge Graph, like Hummingbird, is an example of how Google is increasingly focused on answering questions directly and producing results that match the meaning of the query, rather than matching just a few words.

“Hummingbird” takes flight

Google’s search chief, Amit Singhal, says that the Hummingbird update is “the first time since 2001 that a Google algorithm has been so dramatically rewritten.” This is how Danny Sullivan of Search Engine Land explains it: “Hummingbird pays more attention to each word in a query, ensuring that the whole query — the whole sentence or conversation or meaning — is taken into account, rather than particular words.”

The point of this new approach is to filter out less-relevant, less-desirable results, making for a more satisfying, more accurate answer that includes rich supporting information and easier navigation. Google’s Knowledge Graph, with its “connect the dots” type of approach, is important because users stick around longer as they discover more about related people, events, and topics. The results of a simple search for Hillary Clinton, for instance, include her birthday, her hometown, her family members, the books she’s written, a wide variety of images, and links to “similar” people, like Barack Obama, John McCain, and Joe Biden.

The key to making your website more amenable to “semantic search” is the use of the microformat you will find at Schema.org.

That is to say, Google has pre-fabricated information in its Knowledge Graph that it can match up with information specified using Schema.org markup.
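As a rough sketch of what that markup looks like, here is a minimal Schema.org description expressed as JSON-LD, built in Python for illustration. The type and property names (Person, name, jobTitle, sameAs) are real Schema.org terms; the person and URL are invented.

```python
# A minimal sketch of Schema.org markup as JSON-LD, built in Python for
# illustration; the values and URL below are invented.
import json

person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Jane Example",
    "jobTitle": "Senator",
    "sameAs": [
        "https://en.wikipedia.org/wiki/Jane_Example",  # hypothetical URL
    ],
}

# Embedded in a page as <script type="application/ld+json">...</script>,
# this is the structured description a knowledge graph can match against.
print(json.dumps(person, indent=2))
```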

Sounds remarkably like a topic map doesn’t it?

Useful if you are looking for “popular” people, places and things. Not so hot with intra-enterprise search results. Unless of course your enterprise is driven by “pop” culture.

Impressive if you want coarse semantic searching sufficient to sell advertising. (See the Type Hierarchy at Schema.org for all available types.)

I say coarse semantic searching because my count of the types at Schema.org, as of today, is seven hundred and nineteen (719). Is that what you get?

I ask because in scanning “InteractAction,” I don’t see SexAction or any of its sub-categories. Under “ConsumeAction” I don’t see SmokeAction or SmokeCrackAction or SmokeWeedAction or any of the other sub-categories of “ConsumeAction.” Under “LocalBusiness” I did not see WhoreHouse, DrugDealer, S/MShop, etc.

I felt like I had fallen into BradyBunchville. 😉

Seriously, if they left out those mainstream activities, what are the chances they included what you need for your enterprise?

Not so good. That’s what I thought.

A topic map when paired with a search engine and your annotated content can take your enterprise beyond keyword search.

December 27, 2013

The Lens

Filed under: Patents,Semantics,Topic Maps — Patrick Durusau @ 5:58 pm

The Lens

From the about page:

Welcome to The Lens, an open global cyberinfrastructure built to make the innovation system more efficient, fair, transparent and inclusive. The Lens is an extension of work started by Cambia in 1999 to render the global patent system more transparent, called the Patent Lens. The Lens is a greatly expanded and updated version of the Patent Lens with vastly more data and greater analytical capabilities. Our goal is to enable more people to make better decisions, informed by evidence and inspired by imagination.

The Lens already hosts a number of powerful tools for analysis and exploration of the patent literature, from integrated graphical representation of search results to advanced bioinformatics tools. But this is only just the beginning and we have lot more planned! See what we’ve done and what we plan to do soon on our timeline below:

The Lens currently covers 80 million patents in 100 different jurisdictions.

When you create an account, the following appears in your workspace:

Welcome to the Lens! The Lens is a tool for innovation cartography, currently featuring over 78 million patent documents – many of them full-text – from nearly 100 different jurisdictions. The Lens also features hyperlinks to the scientific literature cited in patent documents – over 5 million to date.

But more than a patent search tool, the Lens has been designed to make the patent system navigable, so that non-patent professionals can access the knowledge contained in the global patent literature. Properly mapped out, the global patent system has the potential to accelerate the pace of invention, to generate new partnerships, and to make a vast wealth of scientific and technical knowledge available for free.

The Lens is currently in beta version, with future versions featuring expanded access to both patent and scientific literature collections, as well as improved search and analytic capabilities.

As you already know, patents have extremely rich semantics and mapping of those semantics could be very profitable.

If you saw the post: Secure Cloud Computing – Very Secure, you will know that patent searches on “homomorphic encryption” are about to become very popular.

Are you ready to bundle and ship patent research?

December 26, 2013

Legivoc – connecting laws in a changing world

Filed under: EU,Law,Semantics — Patrick Durusau @ 8:02 pm

Legivoc – connecting laws in a changing world by Hughes-Jehan Vibert, Pierre Jouvelot, Benoît Pin.

Abstract:

On the Internet, legal information is a sum of national laws. Even in a changing world, law is culturally specific (nation-specific most of the time) and legal concepts only become meaningful when put in the context of a particular legal system. Legivoc aims to be a semantic interface between the subject of law of a State and the other spaces of legal information that it will be led to use. This project will consist of setting up a server of multilingual legal vocabularies from the European Union Member States legal systems, which will be freely available, for other uses via an application programming interface (API).

And I thought linking all legal data together was ambitious!

If the EU were composed solely of civil law jurisdictions, I would not have bet on the project’s complete success, but it could have some useful results.

Once you add in common law jurisdictions like the United Kingdom, the project may still have some useful results, but there isn’t going to be a mapping across all the languages.

Part of the difficulty will be language, but part of it lies in the most basic assumptions of the two systems.

In civil law, the drafters of legal codes attempt to systematically set out a set of principles that take each other into account and represent a blueprint for an ordered society.

Common law, on the other hand, has at its core court decisions that determine the outcome between two parties. And those decisions can be relied upon by other parties.

Between civil and common law jurisdictions, some laws and concepts may be more mappable than others. Modern labor law, for example, may be new enough that semantic accretions do not prevent a successful mapping.

Older laws (property and inheritance law, for example) are usually the most distinctive to a jurisdiction. Those are likely to prove impossible to map or reconcile.

Still, it will be an interesting project, particularly if they disclose the basis for any possible mapping, as opposed to simply declaring a mapping.

Both would be useful, but the former is robust in the face of changing law while the latter is brittle.
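As a sketch of what disclosing the basis could mean in practice, here is an invented mapping record (Python; the concepts and citations are illustrative only) that carries the evidence it rests on, so the mapping can be re-checked when either source changes:

```python
# A sketch of the difference between "declaring" a mapping and
# disclosing its basis: each mapping record carries the evidence it
# rests on, so it can be re-evaluated when the law changes.
# The concepts, jurisdictions and citations below are illustrative only.
mappings = [
    {
        "from": {"jurisdiction": "FR", "concept": "licenciement économique"},
        "to": {"jurisdiction": "UK", "concept": "redundancy dismissal"},
        "basis": {
            "kind": "statutory-definition-comparison",
            "sources": ["Code du travail L1233-3", "ERA 1996 s.139"],
            "note": "Both require a non-personal, economic ground for dismissal.",
        },
        "confidence": "partial",
    },
]

def still_valid(mapping, repealed_sources):
    """A mapping with a disclosed basis can be invalidated when one of
    its sources is repealed; a bare declaration could not."""
    return not any(src in repealed_sources for src in mapping["basis"]["sources"])

print(still_valid(mappings[0], repealed_sources={"ERA 1996 s.139"}))
```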

December 23, 2013

Where Does the Data Go?

Filed under: Data,Semantic Inconsistency,Semantics — Patrick Durusau @ 2:20 pm

Where Does the Data Go?

A brief editorial on The Availability of Research Data Declines Rapidly with Article Age by Timothy H. Vines, et al., which reads in part:

A group of researchers in Canada examined 516 articles published between 1991 and 2011, and “found that availability of the data was strongly affected by article age.” For instance, the team reports that the odds of finding a working email address associated with a paper decreased by 7 percent each year and that the odds of an extant dataset decreased by 17 percent each year since publication. Some data was technically available, the researchers note, but stored on floppy disk or on zip drives that many researchers no longer have the hardware to access.

One of the highlights of the article (which appears in Current Biology) reads:

Broken e-mails and obsolete storage devices were the main obstacles to data sharing

Curious because I would have ventured that semantic drift over twenty (20) years would have been a major factor as well.

Then I read the paper and discovered:

To avoid potential confounding effects of data type and different research community practices, we focused on recovering data from articles containing morphological data from plants or animals that made use of a discriminant function analysis (DFA). [Under Results, the online edition has no page numbers]

The authors appear to have dodged the semantic bullet by their selection of data and by not reporting difficulties, if any, in using the data (19.5%) that was shared by the original authors.

Preservation of data is a major concern for researchers but I would urge that the semantics of data be preserved as well.

Imagine that feeling when you “ls -l” a directory and recognize only some of the file names writ large. Writ very large.
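One low-tech way to preserve at least some of those semantics is a small codebook stored alongside the data. A minimal sketch, with file names and fields invented for illustration:

```python
# A minimal sketch: a "semantics sidecar" stored next to the data so
# that column meanings, units and coding survive along with the bytes.
# File names and fields are invented for illustration.
import json

codebook = {
    "dataset": "beak_measurements_1993.csv",
    "columns": {
        "len_mm": {"meaning": "culmen length", "unit": "millimetres"},
        "sex": {"meaning": "field-determined sex", "codes": {"1": "male", "2": "female"}},
        "site": {"meaning": "collection site code, see sites_1993.txt"},
    },
}

with open("beak_measurements_1993.codebook.json", "w") as f:
    json.dump(codebook, f, indent=2)
```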
