Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 9, 2013

Semantics Moving into the Clouds (and you?)

Filed under: Cloud Computing,Semantics — Patrick Durusau @ 12:55 pm

OpenNebula 4.0 Released – The Finest Open-source Enterprise Cloud Manager!

From the post:

The fourth generation of OpenNebula is the result of seven years of continuous innovation in close collaboration with its users.

The OpenNebula Project is proud to announce the fourth major release of its widely deployed OpenNebula cloud management platform, a fully open-source enterprise-grade solution to build and manage virtualized data centers and enterprise clouds. OpenNebula 4.0 (codename Eagle) brings valuable contributions from many of its thousands of users that include leading research and supercomputing centers like FermiLab, NASA, ESA and SARA; and industry leaders like Blackberry, China Mobile, Dell, Cisco, Akamai and Telefonica O2.

OpenNebula is used by many enterprises as an open, flexible alternative to vCloud on their VMware-based data center. OpenNebula is a drop-in replacement for VMware’s cloud stack that additionally brings support for multiple hypervisors and broad integration capabilities to leverage existing IT investments and keep existing operational processes. As an enterprise-class product, OpenNebula offers an upgrade path so all existing users can easily migrate their production and experimental environments to the new version.

OpenNebula 4.0 includes new features in most of its subsystems. It shows for the first time a completely redesigned Sunstone, with a fresh and modern look. A whole new set of operations are available for virtual machines like system and disk snapshotting, capacity re-sizing, programmable VM actions, NIC hotplugging and IPv6 among others. The OpenNebula backend has also been improved with the support of new datastores, like Ceph, and new features for the VMware, KVM and Xen hypervisors. The Project continues with its complete support for de-facto and open standards, like Amazon and Open Cloud Computing APIs.

Despite all the buzzwords about “big data” and “cloud computing,” no one has left semantics behind.

Semantics don’t get much press in “big data” or “cloud computing.”

You can take that to mean semantic issues, thousands of years old, have been silently solved, or current vendors lack a semantic solution to offer.

I think it is the latter.

How about you?

May 4, 2013

Emergent Semantics – Prof. Karl Aberer

Filed under: Emergent Semantics,Semantics — Patrick Durusau @ 4:26 pm

Emergent Semantics – Prof. Karl Aberer

From the webpage:

In this research we view the problem of establishing semantic interoperability as a self-organizing process in which agreements on the interpretation of data are established in a localized, and incremental manner [5, 7, 11]. Starting from a P2P architecture setting we developed an approach of how local mappings among schemas can be used to infer new mappings and check the consistency of mappings among schemas [1, 2, 3, 4, 10, 13, 14, 15]. In this way semantic interoperability and increased semantic consistency at a global level is achieved. We also developed an architecture and implemented the GridVine system for demonstrating the practicality of this approach [8]. Current work is directed towards evaluation of these principles in practical settings and towards further development of the underlying theory, in particular analyzing graph-theoretic properties of semantic mapping graphs [6, 9, 12]. More information can be found on the GridVine web site http://lsirwww.epfl.ch/GridVine/. [I corrected the link to the GridVine website.]

A bibliography of articles by Prof. Karl Aberer and collaborators on emergent semantics.

Last updated in 2007.
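The core move in the quoted approach, composing local mappings to infer new ones and then checking the composites for consistency, can be sketched in a few lines. This is a toy illustration of my own, not the GridVine implementation: mappings are plain attribute renamings, and a mapping cycle counts as consistent if composing it maps every attribute back to itself.

# Toy sketch of mapping composition and cycle-consistency checking.
# Each mapping is a dict renaming attributes of one peer's schema
# into another's; real systems carry far richer transformations.

def compose(m1, m2):
    """Infer a new mapping A->C from A->B (m1) and B->C (m2)."""
    return {a: m2[b] for a, b in m1.items() if b in m2}

def cycle_consistent(*mappings):
    """A cycle of mappings is consistent if composing them all
    maps every attribute back to itself."""
    composed = mappings[0]
    for m in mappings[1:]:
        composed = compose(composed, m)
    return all(a == b for a, b in composed.items())

a_to_b = {"author": "creator", "title": "name"}
b_to_c = {"creator": "writer", "name": "heading"}
c_to_a = {"writer": "author", "heading": "title"}

a_to_c = compose(a_to_b, b_to_c)                 # inferred: {'author': 'writer', ...}
print(a_to_c)
print(cycle_consistent(a_to_b, b_to_c, c_to_a))  # True -> semantic agreement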

Search: Emergent and Extrinsic Semantics

Filed under: Emergent Semantics,Extrinsic Semantics,Semantics — Patrick Durusau @ 4:17 pm

Search: Emergent and Extrinsic Semantics by John Tait.

From the post:

Semantics is a term often used in the search technology and information retrieval community these days. A distinction is drawn between semantic and traditional search, implying that somehow semantic search is a more advanced or sophisticated form.

My claim in this article is that there are actually two forms of semantic search: emergent and extrinsic. Further I want to claim that they are related, and that one of them (emergent) is not new but has been in widespread use since the 1980’s when “natural language querying” (as embodied in Google for example) started to supplant pure Boolean querying as the usual query form for search on unstructured data.

My dictionary defines semantics as the branch of the science of language related to meaning. In search technology and information retrieval it has come to be associated with two very distinct ideas and communities.

(…)

Now it is very common to see emergent and extrinsic as somehow contrasting and irreconcilable. Whereas I want to claim they are really two sides of the same coin, and further complementary and supporting.

It is common for those in the extrinsic (really semantic web community) to be somewhat dismissive towards the emergent community: seeing the basis of their work as lacking (real) semantics. This misses the point, which is that there must be some notion of semantics in emergent systems, because even simple word matching is dealing with semantic notions like synonymy: crudely the same words (space delimited strings of characters in English text) in similar contexts often mean the same. The problem is that emergent semantics are obscure, hidden, and difficult to access.

In my view the difficulty of making visible the knowledge hidden in the term weighting schemes and indexing systems has led people to make the mistaken jump to the conclusion that these systems contain no semantics. My claim is that they do have semantics: but emergent semantics are generally obscure.
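Tait’s claim that statistical systems carry semantics, however obscure, is easy to make concrete. A minimal distributional sketch, with toy data invented by me: words that share contexts acquire similar co-occurrence vectors, so something like synonymy emerges from nothing more than word counting.

from collections import Counter
from math import sqrt

docs = [
    "the boat sailed across the sea",
    "the ship sailed across the ocean",
    "the sea was calm for the ship",
    "parliament debated the tax bill",
]

def context_vector(word, window=2):
    """Count words co-occurring with `word` within a small window."""
    vec = Counter()
    for doc in docs:
        tokens = doc.split()
        for i, t in enumerate(tokens):
            if t == word:
                lo, hi = max(0, i - window), i + window + 1
                vec.update(w for w in tokens[lo:hi] if w != word)
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# "boat" and "ship" share contexts, so they emerge as near-synonyms;
# "bill" does not.
print(cosine(context_vector("boat"), context_vector("ship")))
print(cosine(context_vector("boat"), context_vector("bill")))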

My first difficulty with John’s position is his odd use of the term “emergent semantics.”

He appears to be defining the term as: “…term weighting schemes and indexing systems….” for example.

A more common definition is found in Emergent Semantics by Philippe Cudre-Mauroux

Emergent semantics applies the conception of a closed correspondence continuum to the analysis of semantics in distributed information systems, by promoting recursive analyses of syntactic constructs (such as schemas, ontologies or mappings) in order to capture semantics.

Nor is it useful to claim that Tait-Emergent-Semantics (to distinguish it from the more common usage) has semantics but they are obscure.

If the semantics of Tait-Emergent-Semantics cannot be seen, then other evidence should be offered.

Saying that evidence for a proposition is “obscure” isn’t very convincing.

May 2, 2013

SemStats 2013

Filed under: Conferences,Semantics,Statistics — Patrick Durusau @ 5:09 am

First International Workshop on Semantic Statistics (SemStats 2013)

Deadline for paper submission: Friday, 12 July 2013, 23:59 (Hawaii time)
Notification of acceptance/rejection: Friday, 9 August 2013
Deadline for camera-ready version: Friday, 30 August 2013

From the call for papers:

The goal of this workshop is to explore and strengthen the relationship between the Semantic Web and statistical communities, to provide better access to the data held by statistical offices. It will focus on ways in which statisticians can use Semantic Web technologies and standards in order to formalize, publish, document and link their data and metadata.

The statistical community has recently shown an interest in the Semantic Web. In particular, initiatives have been launched to develop semantic vocabularies representing statistical classifications and discovery metadata. Tools are also being created by statistical organizations to support the publication of dimensional data conforming to the Data Cube specification, now in Last Call at W3C. But statisticians see challenges in the Semantic Web: how can data and concepts be linked in a statistically rigorous fashion? How can we avoid fuzzy semantics leading to wrong analyses? How can we preserve data confidentiality?

The workshop will also cover the question of how to apply statistical methods or treatments to linked data, and how to develop new methods and tools for this purpose. Except for visualisation techniques and tools, this question is relatively unexplored, but the subject will obviously grow in importance in the near future.

An unfortunate emphasis on linked data before understanding the problem of imbuing statistical data with semantics.

Studying what semantics the statistical community needs, and to what degree, would be more likely to yield useful requirements.

And from requirements, then to proceed to find appropriate solutions.

As opposed to arriving, solution in hand, with saws, pry bars, shoe horns and similar tools for affixing the solution to any problem.

April 27, 2013

The Motherlode of Semantics, People

Filed under: Conferences,Crowd Sourcing,Semantic Web,Semantics — Patrick Durusau @ 8:08 am

1st International Workshop on “Crowdsourcing the Semantic Web” (CrowdSem2013)

Submission deadline: July 12, 2013 (23:59 Hawaii time)

From the post:

1st International Workshop on “Crowdsourcing the Semantic Web” in conjunction with the 12th International Semantic Web Conference (ISWC 2013), 21-25 October 2013, in Sydney, Australia. This interactive workshop takes stock of the emergent work and charts the research agenda with interactive sessions to brainstorm ideas and potential applications of collective intelligence to solving AI-hard semantic web problems.

The Global Brain Semantic Web—a Semantic Web interleaving a large number of human and machine computation—has great potential to overcome some of the issues of the current Semantic Web. In particular, semantic technologies have been deployed in the context of a wide range of information management tasks in scenarios that are increasingly significant in both technical (data size, variety and complexity of data sources) and economical terms (industries addressed and their market volume). For many of these tasks, machine-driven algorithmic techniques aiming at full automation do not reach a level of accuracy that many production environments require. Enhancing automatic techniques with human computation capabilities is becoming a viable solution in many cases. We believe that there is huge potential at the intersection of these disciplines – large scale, knowledge-driven, information management and crowdsourcing – to solve technically challenging problems purposefully and in a cost effective manner.

I’m encouraged.

The Semantic Web is going to start asking the entities (people) that originate semantics about semantics.

Going to the motherlode of semantics.

Now to see what they do with the answers.

April 15, 2013

2ND International Workshop on Mining Scientific Publications

Filed under: Conferences,Data Mining,Searching,Semantic Search,Semantics — Patrick Durusau @ 2:49 pm

2ND International Workshop on Mining Scientific Publications

May 26, 2013 – Submission deadline
June 23, 2013 – Notification of acceptance
July 7, 2013 – Camera-ready
July 26, 2013 – Workshop

From the CFP:

Digital libraries that store scientific publications are becoming increasingly important in research. They are used not only for traditional tasks such as finding and storing research outputs, but also as sources for mining this information, discovering new research trends and evaluating research excellence. The rapid growth in the number of scientific publications being deposited in digital libraries makes it no longer sufficient to provide access to content to human readers only. It is equally important to allow machines to analyse this information and by doing so facilitate the processes by which research is being accomplished. Recent developments in natural language processing, information retrieval, the semantic web and other disciplines make it possible to transform the way we work with scientific publications. However, in order to make this happen, researchers first need to be able to easily access and use large databases of scientific publications and research data, to carry out experiments.

This workshop aims to bring together people from different backgrounds who:
(a) are interested in analysing and mining databases of scientific publications,
(b) develop systems, infrastructures or datasets that enable such analysis and mining,
(c) design novel technologies that improve the way research is being accomplished or
(d) support the openness and free availability of publications and research data.

2. TOPICS

The topics of the workshop will be organised around the following three themes:

  1. Infrastructures, systems, open datasets or APIs that enable analysis of large volumes of scientific publications.
  2. Semantic enrichment of scientific publications by means of text-mining, crowdsourcing or other methods.
  3. Analysis of large databases of scientific publications to identify research trends, high impact, cross-fertilisation between disciplines, research excellence and to aid content exploration.

Of particular interest for topic mappers:

Topics of interest relevant to theme 2 include, but are not limited to:

  • Novel information extraction and text-mining approaches to semantic enrichment of publications. This might range from mining publication structure, such as title, abstract, authors, citation information etc. to more challenging tasks, such as extracting names of applied methods, research questions (or scientific gaps), identifying parts of the scholarly discourse structure etc.
  • Automatic categorization and clustering of scientific publications. Methods that can automatically categorize publications according to an established subject-based classification/taxonomy (such as Library of Congress classification, UNESCO thesaurus, DOAJ subject classification, Library of Congress Subject Headings) are of particular interest. Other approaches might involve automatic clustering or classification of research publications according to various criteria.
  • New methods and models for connecting and interlinking scientific publications. Scientific publications in digital libraries are not isolated islands. Connecting publications using explicitly defined citations is very restrictive and has many disadvantages. We are interested in innovative technologies that can automatically connect and interlink publications or parts of publications, according to various criteria, such as semantic similarity, contradiction, argument support or other relationship types.
  • Models for semantically representing and annotating publications. This topic is related to aspects of semantically modeling publications and scholarly discourse. Models that are practical with respect to the state-of-the-art in Natural Language Processing (NLP) technologies are of special interest.
  • Semantically enriching/annotating publications by crowdsourcing. Crowdsourcing can be used in innovative ways to annotate publications with richer metadata or to approve/disapprove annotations created using text-mining or other approaches. We welcome papers that address the following questions: (a) what incentives should be provided to motivate users in contributing metadata, (b) how to apply crowdsourcing in the specialized domains of scientific publications, (c) what tasks in the domain of organising scientific publications is crowdsourcing suitable for and where it might fail, (d) other relevant crowdsourcing topics relevant to the domain of scientific publications.

The other themes could be viewed through a topic map lens but semantic enrichment seems like a natural.
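For readers who want to experiment with the automatic categorization and clustering topic above, here is a minimal sketch. The use of scikit-learn and the toy abstracts are my assumptions, not anything named in the CFP.

# Minimal sketch: cluster publication abstracts by TF-IDF similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

abstracts = [
    "Deep learning methods for protein structure prediction",
    "Neural networks applied to protein folding",
    "Citation analysis of open access economics journals",
    "Bibliometric study of journal impact in economics",
]

# Vectorize, then group the abstracts into two clusters.
X = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for text, label in zip(abstracts, labels):
    print(label, text[:50])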

April 11, 2013

Cargo Cult Data Science [Cargo Cult Semantics?]

Filed under: Data Science,Semantics — Patrick Durusau @ 3:30 pm

Cargo Cult Data Science by Jim Harris.

From the post:

Last week, Phil Simon blogged about being wary of snake oil salesman who claim to be data scientists. In this post, I want to explore a related concept, namely being wary of thinking that you are performing data science by mimicking what data scientists do.

The American theoretical physicist Richard Feynman coined the term cargo cult science to refer to practices that have the semblance of being scientific, but do not in fact follow the scientific method.

As Feynman described his analogy, “in the South Seas there is a cult of people. During the war they saw airplanes land with lots of materials, and they want the same thing to happen now. So they’ve arranged to make things like runways, to put fires along the sides of the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas—he’s the controller—and they wait for the airplanes to land. They’re doing everything right. The form is perfect. But it doesn’t work. No airplanes land. So I call these things Cargo Cult Science, because they follow all the apparent precepts and forms of scientific investigation, but they’re missing something essential, because the planes don’t land.”

Feynman’s description of the runway and controller reminds me of attempts to create systems with semantic “understanding.”

We load them up with word lists, thesauri, networks of terms, the equivalent of runways.

We give them headphones (ontologies) with bars of bamboo (syntax) sticking out of them.

And after all that, semantic understanding continues to elude us.

Maybe those efforts are missing something essential? (Like us?)

GroningenMeaningBank (GMB)

Filed under: Corpora,Corpus Linguistics,Linguistics,Semantics — Patrick Durusau @ 2:19 pm

GroningenMeaningBank (GMB)

From the “about” page:

The Groningen Meaning Bank consists of public domain English texts with corresponding syntactic and semantic representations.

Key features

The GMB supports deep semantics, opening the way to theoretically grounded, data-driven approaches to computational semantics. It integrates phenomena instead of covering single phenomena in isolation. This provides a better handle on explaining dependencies between various ambiguous linguistic phenomena, including word senses, thematic roles, quantifier scope, tense and aspect, anaphora, presupposition, and rhetorical relations. In the GMB texts are annotated rather than isolated sentences, which provides a means to deal with ambiguities on the sentence level that require discourse context for resolving them.

Method

The GMB is being built using a bootstrapping approach. We employ state-of-the-art NLP tools (notably the C&C tools and Boxer) to produce a reasonable approximation to gold-standard annotations. From release to release, the annotations are corrected and refined using human annotations coming from two main sources: experts who directly edit the annotations in the GMB via the Explorer, and non-experts who play a game with a purpose called Wordrobe.

Theoretical background

The theoretical backbone for the semantic annotations in the GMB is established by Discourse Representation Theory (DRT), a formal theory of meaning developed by the philosopher of language Hans Kamp (Kamp, 1981; Kamp and Reyle, 1993). Extensions of the theory bridge the gap between theory and practice. In particular, we use VerbNet for thematic roles, a variation on ACE‘s named entity classification, WordNet for word senses and Segmented DRT for rhetorical relations (Asher and Lascarides, 2003). Thanks to the DRT backbone, all these linguistic phenomena can be expressed in a first-order language, enabling the practical use of first-order theorem provers and model builders.

Step back towards the source of semantics (that would be us).

One practical question is how to capture semantics for a particular domain or enterprise?

Another is what to capture to enable the mapping of those semantics to those of other domains or enterprises?

April 9, 2013

Improving Twitter search with real-time human computation [“semantics supplied”]

Filed under: Human Computation,Search Engines,Searching,Semantics,Tweets — Patrick Durusau @ 1:54 pm

Improving Twitter search with real-time human computation by Edwin Chen.

From the post:

Before we delve into the details, here’s an overview of how the system works.

(1) First, we monitor for which search queries are currently popular.

Behind the scenes: we run a Storm topology that tracks statistics on search queries.

For example: the query “Big Bird” may be averaging zero searches a day, but at 6pm on October 3, we suddenly see a spike in searches from the US.

(2) Next, as soon as we discover a new popular search query, we send it to our human evaluation systems, where judges are asked a variety of questions about the query.

Behind the scenes: when the Storm topology detects that a query has reached sufficient popularity, it connects to a Thrift API that dispatches the query to Amazon’s Mechanical Turk service, and then polls Mechanical Turk for a response.

For example: as soon as we notice “Big Bird” spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant tweets and ads.

Finally, after a response from a judge is received, we push the information to our backend systems, so that the next time a user searches for a query, our machine learning models will make use of the additional information. For example, suppose our human judges tell us that “Big Bird” is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer.

Let’s now explore the first two sections above in more detail.

….
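The first step, detecting that a query has suddenly become popular, amounts to comparing a query’s current rate against its own history. A hypothetical sketch of that idea follows; Twitter’s actual Storm topology is certainly more involved, and all thresholds here are invented for illustration.

from collections import defaultdict, deque

class SpikeDetector:
    """Flag a query when its count in the current time window greatly
    exceeds its trailing average. Thresholds are invented."""
    def __init__(self, history_windows=24, spike_factor=10, min_count=50):
        self.history = defaultdict(lambda: deque(maxlen=history_windows))
        self.current = defaultdict(int)
        self.spike_factor = spike_factor
        self.min_count = min_count

    def observe(self, query):
        self.current[query] += 1

    def close_window(self):
        """End the current time window; return queries that spiked."""
        spiking = []
        for query, count in self.current.items():
            past = self.history[query]
            avg = sum(past) / len(past) if past else 0
            if count >= self.min_count and count > self.spike_factor * max(avg, 1):
                spiking.append(query)   # -> would be sent to human judges
            past.append(count)
        self.current.clear()
        return spiking

detector = SpikeDetector(min_count=3)
for _ in range(12):
    detector.observe("big bird")
print(detector.close_window())  # ['big bird']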

The post is quite awesome and I suggest you read it in full.

This resonates with a recent comment about Lotus Agenda.

The short version is a user creates a thesaurus in Agenda that enables searches enriched by the thesaurus. The user supplied semantics to enhance the searches.

In the Twitter case, human reviewers supply semantics to enhance the searches.

In both cases, Agenda and Twitter, humans are supplying semantics to enhance the searches.

I emphasize “supplying semantics” as a contrast to mechanistic searches that rely on text.

Mechanistic searches can be quite valuable but they pale beside searches where semantics have been “supplied.”

The Twitter experience is an important clue.

The answer to semantics for searches lies somewhere between asking an expert (you get his/her semantics) and asking all of us (too many answers to be useful).

More to follow.

March 31, 2013

FrameNet

Filed under: Frames,Semantics — Patrick Durusau @ 2:08 pm

FrameNet

From the about page:

The FrameNet project is building a lexical database of English that is both human- and machine-readable, based on annotating examples of how words are used in actual texts. From the student’s point of view, it is a dictionary of more than 10,000 word senses, most of them with annotated examples that show the meaning and usage. For the researcher in Natural Language Processing, the more than 170,000 manually annotated sentences provide a unique training dataset for semantic role labeling, used in applications such as information extraction, machine translation, event recognition, sentiment analysis, etc. For students and teachers of linguistics it serves as a valence dictionary, with uniquely detailed evidence for the combinatorial properties of a core set of the English vocabulary. The project has been in operation at the International Computer Science Institute in Berkeley since 1997, supported primarily by the National Science Foundation, and the data is freely available for download; it has been downloaded and used by researchers around the world for a wide variety of purposes (See FrameNet users).

FrameNet is based on a theory of meaning called Frame Semantics, deriving from the work of Charles J. Fillmore and colleagues (Fillmore 1976, 1977, 1982, 1985, Fillmore and Baker 2001, 2010). The basic idea is straightforward: that the meanings of most words can best be understood on the basis of a semantic frame: a description of a type of event, relation, or entity and the participants in it. For example, the concept of cooking typically involves a person doing the cooking (Cook), the food that is to be cooked (Food), something to hold the food while cooking (Container) and a source of heat (Heating_instrument). In the FrameNet project, this is represented as a frame called Apply_heat, and the Cook, Food, Heating_instrument and Container are called frame elements (FEs) . Words that evoke this frame, such as fry, bake, boil, and broil, are called lexical units (LUs) of the Apply_heat frame. Other frames are more complex, such as Revenge, which involves more FEs (Offender, Injury, Injured_Party, Avenger, and Punishment) and others are simpler, such as Placing, with only an Agent (or Cause), a thing that is placed (called a Theme) and the location in which it is placed (Goal). The job of FrameNet is to define the frames and to annotate sentences to show how the FEs fit syntactically around the word that evokes the frame, as in the following examples of Apply_heat and Revenge:
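The frame/FE/LU structure in that description maps naturally onto a small data structure. A hypothetical sketch of how Apply_heat might be represented in code; FrameNet’s own data format differs.

from dataclasses import dataclass, field

@dataclass
class Frame:
    """A semantic frame: an event/relation type, its participant
    roles (frame elements), and the words that evoke it (lexical units)."""
    name: str
    frame_elements: list = field(default_factory=list)
    lexical_units: list = field(default_factory=list)

apply_heat = Frame(
    name="Apply_heat",
    frame_elements=["Cook", "Food", "Heating_instrument", "Container"],
    lexical_units=["fry", "bake", "boil", "broil"],
)

def evokes(word, frames):
    """Return the names of the frames a given word evokes."""
    return [f.name for f in frames if word in f.lexical_units]

print(evokes("bake", [apply_heat]))  # ['Apply_heat']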

At least for English-based topic maps, possibly a rich source for roles in associations and even templates for associations.

To say nothing of using associations (frames) as scopes.

Recalling that the frames themselves do not stand outside of semantics but have semantics of their own.

Suggestions of similar resources in other languages?

Semantics for Big Data [W3C late to semantic heterogeneity party]

Filed under: BigData,Conferences,Heterogeneous Data,Semantics — Patrick Durusau @ 12:50 pm

Semantics for Big Data

Dates:

Submission due: May 24, 2013

Acceptance Notification: June 21, 2013

Camera-ready Copies: June 28, 2013

Symposium: November 15-17, 2013

From the webpage:

AAAI 2013 Fall Symposium; Westin Arlington Gateway in Arlington, Virginia, November 15-17, 2013.

Workshop Description and Scope

One of the key challenges in making use of Big Data lies in finding ways of dealing with heterogeneity, diversity, and complexity of the data, while its volume and velocity forbid solutions available for smaller datasets as based, e.g., on manual curation or manual integration of data. Semantic Web Technologies are meant to deal with these issues, and indeed since the advent of Linked Data a few years ago, they have become central to mainstream Semantic Web research and development. We can easily understand Linked Data as being a part of the greater Big Data landscape, as many of the challenges are the same. The linking component of Linked Data, however, puts an additional focus on the integration and conflation of data across multiple sources.

Workshop Topics

In this symposium, we will explore the many opportunities and challenges arising from transferring and adapting Semantic Web Technologies to the Big Data quest. Topics of interest focus explicitly on the interplay of Semantics and Big Data, and include:

  • the use of semantic metadata and ontologies for Big Data,
  • the use of formal and informal semantics,
  • the integration and interplay of deductive (semantic) and statistical methods,
  • methods to establish semantic interoperability between data sources
  • ways of dealing with semantic heterogeneity,
  • scalability of Semantic Web methods and tools, and
  • semantic approaches to the explication of requirements from eScience applications.

The W3C is late to the party as evidenced by semantic heterogeneity becoming “…central to mainstream Semantic Web research and development” after the advent of Linked Data.

I suppose better late than never.

At least if they remember that:

Users experience semantic heterogeneity in data and in the means used to describe and store data.

Whatever solution is crafted, its starting premise must be to capture semantics as seen by some defined user.

Otherwise, it is capturing the semantics of designers, authors, etc., which may or may not be valuable to some particular user.

RDF is a good example of capturing someone else’s semantics.

Its uptake is evidence of the level of interest in someone else’s semantics. (Simple Web Semantics – The Semantic Web Is Failing — But Why?)

March 29, 2013

Learning Grounded Models of Meaning

Filed under: Linguistics,Meaning,Modeling,Semantics — Patrick Durusau @ 2:16 pm

Learning Grounded Models of Meaning

Schedule and readings for seminar by Katrin Erk and Jason Baldridge:

Natural language processing applications typically need large amounts of information at the lexical level: words that are similar in meaning, idioms and collocations, typical relations between entities, lexical patterns that can be used to draw inferences, and so on. Today such information is mostly collected automatically from large amounts of data, making use of regularities in the co-occurrence of words. But documents often contain more than just co-occurring words, for example illustrations, geographic tags, or a link to a date. Just like co-occurrences between words, these co-occurrences of words and extra-linguistic data can be used to automatically collect information about meaning. The resulting grounded models of meaning link words to visual, geographic, or temporal information. Such models can be used in many ways: to associate documents with geographic locations or points in time, or to automatically find an appropriate image for a given document, or to generate text to accompany a given image.

In this seminar, we discuss different types of extra-linguistic data, and their use for the induction of grounded models of meaning.

Very interesting reading that should keep you busy for a while! 😉

FLOPS Fall Flat for Intelligence Agency

Filed under: HPC,Intelligence,RFI-RFP,Semantics — Patrick Durusau @ 9:39 am

FLOPS Fall Flat for Intelligence Agency by Nicole Hemsoth.

From the post:

The Intelligence Advanced Research Projects Activity (IARPA) is putting out some RFI feelers in hopes of pushing new boundaries with an HPC program. However, at the core of their evaluation process is an overt dismissal of current popular benchmarks, including floating operations per second (FLOPS).

To uncover some missing pieces for their growing computational needs, IARPA is soliciting for “responses that illuminate the breadth of technologies” under the HPC umbrella, particularly the tech that “isn’t already well-represented in today’s HPC benchmarks.”

The RFI points to the general value of benchmarks (Linpack, for instance) as necessary metrics to push research and development, but argues that HPC benchmarks have “constrained the technology and architecture options for HPC system designers.” More specifically, in this case, floating point benchmarks are not quite as valuable to the agency as data-intensive system measurements, particularly as they relate to some of the graph and other so-called big data problems the agency is hoping to tackle using HPC systems.

Responses are due by Apr 05, 2013 4:00 pm Eastern.

Not that I expect most of you to respond to this RFI but I mention it as a step in the right direction for the processing of semantics.

Semantics are not native to vector fields and so every encoding of semantics in a vector field is a mapping.

And every extraction of semantics from a vector field is the reverse of that mapping process.

The impact of this mapping/unmapping of semantics to and from a vector field on interpretation is unclear.

As mapping and unmapping decisions are interpretative, it seems reasonable to conclude there is some impact. How much isn’t known.

Vector fields are easy for high FLOPS systems to process but do you want a fast inaccurate answer or one that bears some resemblance to reality as experienced by others?

Graph databases, to name one alternative, are the current rage, at least according to graph database vendors.

But saying “graph database,” isn’t the same as usefully capturing semantics with a graph database.

Or processing semantics once captured.

What we need is an alternative to FLOPS that represents effective processing of semantics.

Suggestions?

March 17, 2013

Semantic Queries by Example [Identity by Example (IBE)?]

Filed under: Query Language,Searching,Semantics — Patrick Durusau @ 9:47 am

Semantic Queries by Example by Lipyeow Lim, Haixun Wang, Min Wang.

Abstract:

With the ever increasing quantities of electronic data, there is a growing need to make sense out of the data. Many advanced database applications are beginning to support this need by integrating domain knowledge encoded as ontologies into queries over relational data. However, it is extremely difficult to express queries against graph structured ontology in the relational SQL query language or its extensions. Moreover, semantic queries are usually not precise, especially when data and its related ontology are complicated. Users often only have a vague notion of their information needs and are not able to specify queries precisely. In this paper, we address these challenges by introducing a novel method to support semantic queries in relational databases with ease. Instead of casting ontology into relational form and creating new language constructs to express such queries, we ask the user to provide a small number of examples that satisfy the query she has in mind. Using those examples as seeds, the system infers the exact query automatically, and the user is therefore shielded from the complexity of interfacing with the ontology. Our approach consists of three steps. In the first step, the user provides several examples that satisfy the query. In the second step, we use machine learning techniques to mine the semantics of the query from the given examples and related ontologies. Finally, we apply the query semantics on the data to generate the full query result. We also implement an optional active learning mechanism to find the query semantics accurately and quickly. Our experiments validate the effectiveness of our approach.

Potentially deeply important work for both a topic map query language and topic map authoring.

The authors conclude:

In this paper, we introduce a machine learning approach to support semantic queries in relational database. In semantic query processing, the biggest hurdle is to represent ontological data in relational form so that the relational database engine can manipulate the ontology in a way consistent with manipulating the data. Previous approaches include transforming the graph ontological data into tabular form, or representing ontological data in XML and leveraging database extenders on XML such as DB2’s Viper. These approaches, however, are either expensive (materializing a transitive relationship represented by a graph may increase the data size exponentially) or requiring changes in the database engine and new extensions to SQL. Our approach shields the user from the necessity of dealing with the ontology directly. Indeed, as our user study indicates, the difficulty of expressing ontology-based query semantics in a query language is the major hurdle of promoting semantic query processing. With our approach, the users do not even need to know ontology representation. All that is required is that the user gives some examples that satisfy the query he has in mind. The system then automatically finds the answer to the query. In this process, semantics, which is a concept usually hard to express, remains as a concept in the mind of user, without having to be expressed explicitly in a query language. Our experiments and user study results show that the approach is efficient, effective, and general in supporting semantic queries in terms of both accuracy and usability. (emphasis added)

I rather like: “In this process, semantics, which is a concept usually hard to express, remains as a concept in the mind of user, without having to be expressed explicitly in a query language.”

To take it a step further, it should apply to the authoring of topic maps as well.

A user selects from a set of examples the subjects they want to talk about. Quite different from any topic map authoring interface I have seen to date.
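To see the flavor of query-by-example without the paper’s machine learning machinery, consider a toy version: generalize from the user’s examples to the most specific ontology class that covers all of them, then return everything under that class. The ontology and instance data below are invented for illustration.

# Toy query-by-example: infer the most specific ontology class
# covering all user-supplied examples, then return its members.
ontology = {                      # child -> parent
    "cat": "mammal", "dog": "mammal",
    "mammal": "animal", "snake": "reptile", "reptile": "animal",
}

def ancestors(cls):
    """Walk the class and its parents up to the root, specific first."""
    chain = [cls]
    while cls in ontology:
        cls = ontology[cls]
        chain.append(cls)
    return chain

def infer_query_class(examples):
    """Most specific class shared by every example's ancestor chain."""
    chains = [ancestors(e) for e in examples]
    for cls in chains[0]:
        if all(cls in c for c in chains):
            return cls

def run_query(examples, instances):
    target = infer_query_class(examples)
    return [i for i, cls in instances.items() if target in ancestors(cls)]

instances = {"felix": "cat", "rex": "dog", "kaa": "snake"}
print(infer_query_class(["cat", "dog"]))     # mammal
print(run_query(["cat", "dog"], instances))  # ['felix', 'rex']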

The “details” of capturing and querying semantics have stymied RDF:

[Image: F-16 cockpit]

(From: The Semantic Web Is Failing — But Why? (Part 4))

And topic map authoring as well.

Is your next authoring/querying interface going to be by example?

I first saw this in a tweet by Stefano Bertolo.

March 15, 2013

CIA Prophet Pointed to Big Data Future

Filed under: BigData,Semantics — Patrick Durusau @ 7:10 pm

CIA Prophet Pointed to Big Data Future by Issac Lopez.

Issac writes:

“What does the size of the next coffee crop, bullfight attendance figures, local newspaper coverage of UN matters, the birth rate, the mean daily temperatures or refrigerator sales across the country have to do with who will next be elected president of Guatemala,” asks Orrin Clotworthy in the report, which he styled “a Jules Verne look at intelligence processes in a coming generation.”

“Perhaps nothing” he answers, but notes that there is a cause behind each vote cast in an election and many quantitative factors may exist to help shape that decision. “To learn just what the factors are, how to measure them, how to weight them, and how to keep them flowing into a computing center for continual analysis will some day be a matter of great concern to all of us in the intelligence community,” prophesied Clotworthy, describing the challenges that organizations around the globe face fifty years after the report was authored.

I’m not sure if Issac means big data is closer to measuring the factors that motivate people or if big data will seize upon what can be measured as motivators.

The issue of standardized tests is a current one in the United States, and it is far from settled whether the tests measure anything about the educational process, the ability to take standardized tests, or some other aspect of students.

You can read the report in full here.

Issac quotes another part of the report but only in part:

IBM has developed for public use a computer-based system called the ‘Selective Disseminator of Information.’ Intended for large organizations dealing with heterogeneous masses of information, it scans all incoming material and delivers those items that are of interest to specific offices in accordance with “profiles” of their needs which are continuously updated by a feed-back device.

But Clotworthy continues in the next sentence to say:

Any comment here on the potential of the SDI for an intelligence agency would be superfluous; Air Intelligence has in fact been experimenting with such a mechanized dissemination system for some years.

Fifty (50) years later and the device that needs no description continues to elude us.

Is there a semantic equivalent to NP-complete?

March 13, 2013

Aaron Swartz’s A Programmable Web: An Unfinished Work

Filed under: Semantic Web,Semantics,WWW — Patrick Durusau @ 3:04 pm

Aaron Swartz’s A Programmable Web: An Unfinished Work

Abstract:

This short work is the first draft of a book manuscript by Aaron Swartz written for the series “Synthesis Lectures on the Semantic Web” at the invitation of its editor, James Hendler. Unfortunately, the book wasn’t completed before Aaron’s death in January 2013. As a tribute, the editor and publisher are publishing the work digitally without cost.

From the author’s introduction:

” . . . we will begin by trying to understand the architecture of the Web — what it got right and, occasionally, what it got wrong, but most importantly why it is the way it is. We will learn how it allows both users and search engines to co-exist peacefully while supporting everything from photo-sharing to financial transactions.

We will continue by considering what it means to build a program on top of the Web — how to write software that both fairly serves its immediate users as well as the developers who want to build on top of it. Too often, an API is bolted on top of an existing application, as an afterthought or a completely separate piece. But, as we’ll see, when a web application is designed properly, APIs naturally grow out of it and require little effort to maintain.

Then we’ll look into what it means for your application to be not just another tool for people and software to use, but part of the ecology — a section of the programmable web. This means exposing your data to be queried and copied and integrated, even without explicit permission, into the larger software ecosystem, while protecting users’ freedom.

Finally, we’ll close with a discussion of that much-maligned phrase, ‘the Semantic Web,’ and try to understand what it would really mean.”

Table of Contents: Introduction: A Programmable Web / Building for Users: Designing URLs / Building for Search Engines: Following REST / Building for Choice: Allowing Import and Export / Building a Platform: Providing APIs / Building a Database: Queries and Dumps / Building for Freedom: Open Data, Open Source / Conclusion: A Semantic Web?

Even if you disagree with Aaron, on issues both large and small, as I do, it is a very worthwhile read.

But I will save my disagreements for another day. Enjoy the read!

February 26, 2013

Simple Web Semantics: Multiple Dictionaries

Filed under: Semantic Web,Semantics,Simple Web Semantics — Patrick Durusau @ 2:06 pm

When I last posted about Simple Web Semantics, my suggested syntax was:

Simple Web Semantics (SWS) – Syntax Refinement

While you can use any one of multiple dictionaries for the URI in an <a> element, that requires manual editing of the source HTML.

Here is an improvement on that idea:

The content of the content attribute on a meta element with a name attribute with the value “dictionary” is one or more “URLs” (in the HTML 5 sense), if more than one, the “URLs” are separated by whitespace.

The content of the dictionary attribute on an a element is one or more “URLs” (in the HTML 5 sense), if more than one, the “URLs” are separated by whitespace.

Thinking that enables authors of content to give users choices as to which dictionaries to use with particular “URLs.”

For example, a popular account of a science experiment could use the term, H2O and have a dictionary entry pointing to: http://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/SnowflakesWilsonBentley.jpg/220px-SnowflakesWilsonBentley.jpg, which produces this image:

[Image: snowflakes]

Which would be a great illustration for a primary school class about a form of H2O.

On the other hand, another dictionary entry for the same URL might point to: http://upload.wikimedia.org/wikipedia/commons/thumb/0/03/Liquid-water-and-ice.png/220px-Liquid-water-and-ice.png, which produces this image:

[Image: ice structure]

Which would be more appropriate for a secondary school class.

Writing this for an inline <a> element, I would write:

<a href="http://en.wikipedia.org/wiki/Water" dictionary="http://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/SnowflakesWilsonBentley.jpg/220px-SnowflakesWilsonBentley.jpg http://upload.wikimedia.org/wikipedia/commons/thumb/0/03/Liquid-water-and-ice.png/220px-Liquid-water-and-ice.png">H2O</a>

The use of a “URL” and images all from Wikipedia is just convenience for this example. Dictionary entries are not tied to the “URL” in the href attribute.

That presumes some ability on the part of the dictionary server to respond with meaningful information to display to a user who must choose between two dictionaries.

Enabling users to have multiple sources of additional information at their command versus the simplicity of a single dictionary, seems like a good choice.

Nothing prohibits a script writer from enabling users to insert their own dictionary preferences either for the document as a whole or for individual <a> elements.
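As a sketch of what such a script might do, here is the parsing step in Python, standing in for what would more likely be browser-side JavaScript. The dictionary URLs are hypothetical placeholders, not entries from the example above.

# Hypothetical sketch: extract SWS dictionary URLs from a page.
from html.parser import HTMLParser

class DictionaryExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.entries = []   # (href, [dictionary URLs])

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            if "dictionary" in attrs:
                # Multiple dictionaries are whitespace-separated.
                self.entries.append(
                    (attrs.get("href"), attrs["dictionary"].split())
                )

html = ('<a href="http://en.wikipedia.org/wiki/Water" '
        'dictionary="http://example.org/dict1 http://example.org/dict2">H2O</a>')

parser = DictionaryExtractor()
parser.feed(html)
for href, dictionaries in parser.entries:
    print(href, "->", dictionaries)   # a UI would offer this choice to the user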

If you missed my series on Simple Web Semantics, see: Simple Web Semantics — Index Post.


Apologies for quoting “URL/s” throughout the post but after reading:

Note: The term “URL” in this specification is used in a manner distinct from the precise technical meaning it is given in RFC 3986. Readers familiar with that RFC will find it easier to read this specification if they pretend the term “URL” as used herein is really called something else altogether. This is a willful violation of RFC 3986. [RFC3986]

in the latest HTML5 draft, it seemed like the right thing to do.

Would it have been too much trouble to invent “something else altogether” for this new meaning of “URL?”

February 21, 2013

Precursors to Simple Web Semantics

Filed under: RDF,Semantic Web,Semantics — Patrick Durusau @ 9:04 pm

A couple of precursors to Simple Web Semantics have been brought to my attention.

Wanted to alert you so you can consider these prior/current approaches while evaluating Simple Web Semantics.

The first one was from Rob Weir (IBM), who suggested I look at “smart tags” from Microsoft and sent the link to Smart tags (Wikipedia).

The second one was from Nick Howard (a math wizard I know) who pointed out the similarity to bookmarklets. On that see: Bookmarklet (Wikipedia).

I will be diving deeper into both of these technologies.

Not so much a historical study but what did/did not work, etc.

Other suggestions, directions, etc. are most welcome!

I have another refinement to the syntax that I will be posting tomorrow.

February 17, 2013

Interpreting scientific literature: A primer

Filed under: Humor,Semantics — Patrick Durusau @ 8:16 pm

Interpreting scientific literature: A primer by kshameer.

It’s visual so follow the link.

I shouldn’t re-post this sort of thing, being something of a professional academic, but it’s too funny to resist.

Would be interesting to create an auto-tagger that could be run against online text to supply markup with the “they mean” values to be displayed on command.

😉

I first saw this at Christophe Lalanne’s A bag of tweets / January 2013.

February 15, 2013

Capturing the “Semantic Differential”?

Filed under: Language,Semantics — Patrick Durusau @ 11:51 am

Reward Is Assessed in Three Dimensions That Correspond to the Semantic Differential by John G. Fennell and Roland J. Baddeley. (Fennell JG, Baddeley RJ (2013) Reward Is Assessed in Three Dimensions That Correspond to the Semantic Differential. PLoS ONE 8(2): e55588. doi:10.1371/journal.pone.0055588)

Abstract:

If choices are to be made between alternatives like should I go for a walk or grab a coffee, a ‘common currency’ is needed to compare them. This quantity, often known as reward in psychology and utility in economics, is usually conceptualised as a single dimension. Here we propose that to make a comparison between different options it is important to know not only the average reward, but also both the risk and level of certainty (or control) associated with an option. Almost all objects can be the subject of choice, so if these dimensions are required in order to make a decision, they should be part of the meaning of those objects. We propose that this ubiquity is unique, so if we take an average over many concepts and domains these three dimensions (reward, risk, and uncertainty) should emerge as the three most important dimensions in the “meaning” of objects. We investigated this possibility by relating the three dimensions of reward to an old, robust and extensively studied factor analytic instrument known as the semantic differential. Across a very wide range of situations, concepts and cultures, factor analysis shows that 50% of the variance in rating scales is accounted for by just three dimensions, with these dimensions being Evaluation, Potency, and Activity [1]. Using a statistical analysis of internet blog entries and a betting experiment, we show that these three factors of the semantic differential are strongly correlated with the reward history associated with a given concept: Evaluation measures relative reward; Potency measures absolute risk; and Activity measures the uncertainty or lack of control associated with a concept. We argue that the 50% of meaning captured by the semantic differential is simply a summary of the reward history that allows decisions to be made between widely different options.

“Semantic Differential” as defined by Wikipedia:

Semantic differential is a type of a rating scale designed to measure the connotative meaning of objects, events, and concepts. The connotations are used to derive the attitude towards the given object, event or concept.

Invented over 50 years ago, semantic differential scales, ranking a concept on a scale anchored by opposites, such as good-evil, have proven to be very useful.

What the scale was measuring, despite its success, was unknown. (It may still be, depending on how persuasive you find the authors’ proposal.)

The proposal merits serious discussion and additional research but I am leery about relying on blogs as representative of language usage.

Or rather I take blogs as representative of people who blog, which is a decided minority of all language users.

Just as I would take transcripts of “Sex and the City” as representing the fantasies of socially deprived writers. Interesting perhaps but not the same as the mores of New York City. (If that lowers your expectations about a trip to New York City, my apologies.)

How to set up Semantic Logging…

Filed under: .Net,Log Analysis,Semantics — Patrick Durusau @ 10:55 am

How to set up Semantic Logging: part one with Logstash, Kibana, ElasticSearch and Puppet, by Henrik Feldt.

While we are on the topic of semantic logging:

Logging today is mostly done too unstructured; each application developer has his own syntax for the logs, optimized for his personal requirements and when it is time to deploy, ops consider themselves lucky if there is even some logging in the application, and even luckier if that logging can be used to find problems as they occur by being able to adjust verbosity where needed.

I’ve come to the point where I want a really awesome piece of logging from the get-go – something I can pick up and install in a couple of minutes when I come to a new customer site without proper operations support.

I want to be able to search, drill down into, filter out patterns and have good tooling that allows me to let logging be an obvious support as the application is brought through its life cycle, from development to production. And I don’t want to write my own log parsers, thank you very much!

That’s where semantic logging comes in – my applications should be broadcasting log data in a manner that allows code to route, filter and index it. That’s why I’ve spent a lot of time researching how logging is done in a bloody good manner – this post and upcoming ones will teach you how to make your logs talk!

It’s worth noting that you can read this post no matter your programming language. In fact, the tooling that I’m about to discuss will span multiple operating systems; Linux, Windows, and multiple programming languages: Erlang, Java, Puppet, Ruby, PHP, JavaScript and C#. I will demo logging from C#/Win initially and continue with Python, Haskell and Scala in upcoming posts.

I didn’t see any posts following this one. But it is complete enough to get you started on semantic logging.

Embracing Semantic Logging

Filed under: .Net,Log Analysis,Semantics — Patrick Durusau @ 10:49 am

Embracing Semantic Logging by Grigori Melnik.

From the post:

In the world of software engineering, every system needs to log. Logging helps to diagnose and troubleshoot problems with your system both in development and in production. This requires proper, well-designed instrumentation. All too often, however, developers instrument their code to do logging without having a clear strategy and without thinking through the ways the logs are going to be consumed, parsed, and interpreted. Valuable contextual information about events frequently gets lost, or is buried inside the log messages. Furthermore, in some cases logging is done simply for the sake of logging, more like a checkmark on the list. This situation is analogous to people fallaciously believing their backup system is properly implemented by enabling the backup but never, actually, trying to restore from those backups.

This lack of a thought-through logging strategy results in systems producing huge amounts of log data which is less useful or entirely useless for problem resolution.

Many logging frameworks exist today (including our own Logging Application Block and log4net). In a nutshell, they provide high-level APIs to help with formatting log messages, grouping (by means of categories or hierarchies) and writing them to various destinations. They provide you with an entry point – some sort of a logger object through which you call log writing methods (conceptually, not very different from Console.WriteLine(message)). While supporting dynamic reconfiguration of certain knobs, they require the developer to decide upfront on the template of the logging message itself. Even when this can be changed, the message is usually intertwined with the application code, including metadata about the entry such as the severity and entry id.

As ever in all discussions, even those of semantics, there is some impedance:

Imagine another world, where the events get logged and their semantic meaning is preserved. You don’t lose any fidelity in your data. Welcome to the world of semantic logging. Note, some people refer to semantic logging as “structured logging”, “strongly-typed logging” or “schematized logging”.

Whatever you want to call it:

The technology to enable semantic logging in Windows has been around for a while (since Windows 2000). It’s called ETW – Event Tracing for Windows. It is a fast, scalable logging mechanism built into the Windows operating system itself. As Vance Morrison explains, “it is powerful because of three reasons:

  1. The operating system comes pre-wired with a bunch of useful events
  2. It can capture stack traces along with the event, which is INCREDIBLY USEFUL.
  3. It is extensible, which means that you can add your own information that is relevant to your code.”

ETW has been improved in .NET Framework 4.5 but I will leave you to Grigori’s post to ferret out those details.

Semantic logging is important for all the reasons mentioned in Grigori’s post and because captured semantics provide grist for semantic mapping mills.
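To make the contrast with string-based logging concrete, here is a minimal sketch of the idea in Python (not ETW itself): events are emitted as structured records with named fields, so downstream code can route, filter and index them without regex-parsing messages.

# Minimal sketch of semantic (structured) logging: the event keeps
# its fields, rather than burying them in a formatted message.
import json, logging, sys, time

class JsonFormatter(logging.Formatter):
    def format(self, record):
        event = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Fields stay queryable: no parsing of "Order 42 failed after 3 retries".
log.info("order_failed", extra={"fields": {"order_id": 42, "retries": 3}})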

February 13, 2013

Saving the “Semantic” Web (part 4)

Filed under: RDF,Semantic Diversity,Semantic Web,Semantics — Patrick Durusau @ 4:15 pm

Democracy vs. Aristocracy

Part of a recent comment on this series reads:

What should we have been doing instead of the semantic web? ISO Topic Maps? There is some great work in there, but has it been a better success?

That is an important question and I wanted to capture it outside of comments on a prior post.

Earlier in this series of posts I pointed out the success of HTML, especially when contrasted with Semantic Web proposals.

Let me hasten to add the same observation is true for ISO Topic Maps (HyTime or later versions).

The critical difference between HTML (the early and quite serviceable versions) and Semantic Web/Topic Maps is that the former democratizes communication and the latter fosters a technical aristocracy.

Every user who can type, and some who hunt-n-peck, can author HTML and publish their content for others around the world to read, discuss, etc.

That is a very powerful and democratizing notion about content creation.

The previous guardians, gate keepers, insiders, and their familiars, who didn’t add anything of value to prior publications processes, are still reeling from the blow.

Even as old aristocracies crumble, new ones evolve.

Technical aristocracies for example. A phrase relevant to both the Semantic Web and ISO Topic Maps.

Having tasted freedom, the crowds aren’t as accepting of the lash/leash as they once were. Nor of the aristocracies who would wield them. Nor should they be.

Which makes me wonder: Why the emphasis on creating dumbed down semantics for computers?

We already have billions of people who are far more competent semantically than computers.

Where are our efforts to enable them to traverse the different semantics of other users?

Such as the semantics of the aristocrats who have self-anointed themselves to labor on their behalf?

If you have guessed that I have little patience with aristocracies, you are right in one.

I came by that aversion honestly.

I practiced law in a civilian jurisdiction for a decade. A specialist language like law’s can be more precise, but it also excludes others from participation. The same was true when I studied theology and ANE languages. A bit later, in markup technologies (then SGML/HyTime), the same lesson was repeated. My work with ODF and topic maps involves two more specialized languages.

Yet a reasonably intelligent person can discuss issues in any of those fields, if they can get past the language barriers aristocrats take so much comfort in maintaining.

My answer to what we should be doing is:

Looking for ways to enable people to traverse and enjoy the semantic diversity that accounts for the richness of the human experience.

PS: Computers have a role to play in that quest, but a subordinate one.


February 10, 2013

Why Most BI Programs Under-Deliver Value

Filed under: Business Intelligence,Data Integration,Data Management,Integration,Semantics — Patrick Durusau @ 1:52 pm

Why Most BI Programs Under-Deliver Value by Steve Dine.

From the post:

Business intelligence initiatives have been undertaken by organizations across the globe for more than 25 years, yet according to industry experts between 60 and 65 percent of BI projects and programs fail to deliver on the requirements of their customers.

The impact of this failure reaches far beyond the project investment, from unrealized revenue to increased operating costs. While the exact reasons for failure are often debated, most agree that a lack of business involvement, long delivery cycles and poor data quality lead the list. After all this time, why do organizations continue to struggle with delivering successful BI? The answer lies in the fact that they do a poor job of defining value to the customer and how that value will be delivered given the resource constraints and political complexities in nearly all organizations.

BI is widely considered an umbrella term for data integration, data warehousing, performance management, reporting and analytics. For the vast majority of BI projects, the road to value definition starts with a program or project charter, which is a document that defines the high level requirements and capital justification for the endeavor. In most cases, the capital justification centers on cost savings rather than value generation. This is due to the level of effort required to gather and integrate data across disparate source systems and user developed data stores.

As organizations mature, the number of applications that collect and store data increase. These systems usually contain few common unique identifiers to help identify related records and are often referred to as data silos. They also can capture overlapping data attributes for common organizational entities, such as product and customer. In addition, the data models of these systems are usually highly normalized, which can make them challenging to understand and difficult for data extraction. These factors make cost savings, in the form of reduced labor for data collection, easy targets. Unfortunately, most organizations don’t eliminate employees when a BI solution is implemented; they simply work on different, hopefully more value added, activities. From the start, the road to value is based on a flawed assumption and is destined to under deliver on its proposition.

This post merits a close read, several times.

In particular I like the focus on delivery of value to the customer.

Err, that would be the person paying you to do the work.

Steve promises a follow-up on “lean BI” that focuses on delivering more value than it costs to deliver.

I am inherently suspicious of "lean" or "agile" approaches. I once sat on a committee that was assured by three programmers that they had improved upon IBM’s programming methodology, though they declined to share the details.

Their requirements document for a content management system, to be constructed on top of subversion, was a paragraph in an email.

Fortunately the committee prevailed upon management to tank the project. The programmers persist, management being unable or unwilling to correct past mistakes.

I am sure there are many agile/lean programming projects that deliver well documented, high quality results.

But I don’t start with the assumption that agile/lean or other methodology projects are well documented.

That is a question of fact. One that can be answered.

Refusal to answer due to time or resource constraints, is a very bad sign.

I first saw this in a top ten tweets list from KDNuggets.

February 8, 2013

Saving the “Semantic” Web (part 1)

Filed under: Semantic Web,Semantics — Patrick Durusau @ 5:17 pm

Semantics: Who You Gonna Call?

I put “semantic” in quotes in ‘Semantic Web’ to emphasize that the web had semantics long before the puff pieces in Scientific American.

As a matter of fact, people traffic in semantics every day, in a variety of mediums. The “Web,” for all of its navel gazing, is just one.

At your next business or technical meeting, if a colleague uses a term you don’t know, here are some options:

  1. Search Cyc.
  2. Query WordNet.
  3. Call Pat Hayes.
  4. Ask the speaker what they meant.

Take a minute to think about it and put your answer in a comment below.

Other than Tim Berners-Lee, I suspect the vast majority of us will pick #4.

Here’s another quiz.

If asked, will the speaker respond with:

  1. Repeating the term over again, perhaps more loudly? (An American habit: assuming that English spoken loudly is more understandable to non-English speakers. The same holds for technical terms.)
  2. Restating the term in Common Logic syntax?
  3. Singing a “cool” URI?
  4. Expanding the term by offering other properties that may be more familiar to you?

Again, other than Tim Berners-Lee, I suspect the vast majority of us will pick #4.

To summarize up to this point:

  1. We all have experience with semantics and encountering unknown semantics.
  2. We all (most of us) ask the speaker of unknown semantics to explain.
  3. We all (most of us) expect an explanation to offer additional information to clue us in to the unknown semantics.

My answer to the question of “Semantics: Who You Gonna Call?” is the author of the data/information.

Do you have a compelling reason for asking someone else?


February 5, 2013

Chaotic Nihilists and Semantic Idealists [And What of Users?]

Filed under: Algorithms,Ontology,Semantics,Taxonomy,Topic Maps — Patrick Durusau @ 5:54 pm

Chaotic Nihilists and Semantic Idealists by Alistair Croll.

From the post:

There are competing views of how we should tackle an abundance of data, which I’ve referred to as big data’s “odd couple”.

One camp—made up of semantic idealists who fetishize taxonomies—is to tag and organize it all. Once we’ve marked everything and how it relates to everything else, they hope, the world will be reasonable and understandable.

The poster child for the Semantic Idealists is Wolfram Alpha, a “reasoning engine” that understands, for example, a question like “how many blue whales does the earth weigh?”—even if that question has never been asked before. But it’s completely useless until someone’s told it the weight of a whale, or the earth, or, for that matter, what weight is.

They’re wrong.

Alistair continues with the other camp:

Wolfram Alpha’s counterpart for the Algorithmic Nihilists is IBM’s Watson, a search engine that guesses at answers based on probabilities (and famously won on Jeopardy.) Watson was never guaranteed to be right, but it was really, really likely to have a good answer. It also wasn’t easily controlled: when it crawled the Urban Dictionary website, it started swearing in its responses[1], and IBM’s programmers had to excise some of its more colorful vocabulary by hand.

She’s wrong too.

And projects the future as:

The future of data is a blend of both semantics and algorithms. That’s one reason Google recently introduced a second search engine, called the Knowledge Graph, that understands queries.[3] Knowledge Graph was based on technology from Metaweb, a company it acquired in 2010, and it augments “probabilistic” algorithmic search with a structured, tagged set of relationships.

Why is asking users what they meant missing as a third option?

Depends on who you want to be in charge:

Algorithms — Empower Computer Scientists.

Ontologies/taxonomies — Empower Ontologists.

Asking Users — Empowers Users.

Topic maps are a solution that can ask users.
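
To make that concrete, here is a toy sketch in Scala (hypothetical data and names, not a topic map API) of what “asking the user” looks like: a user-supplied assertion that two names identify the same subject merges descriptions that algorithms and ontologies would keep apart.

object AskTheUserSketch extends App {
  // Two sources describe the same subject under different names.
  val analystNotes = Map("POTUS" -> "head of state, US")
  val archiveNotes = Map("President of the United States" -> "executive, Article II")

  // An algorithm sees two unrelated strings; an ontology needs an
  // ontologist. A user, asked directly, can simply assert the identity:
  val sameSubject: Set[Set[String]] =
    Set(Set("POTUS", "President of the United States"))

  // Merging on the user's assertion pulls both descriptions together.
  def merged(name: String): List[String] = {
    val aliases = sameSubject.find(_.contains(name)).getOrElse(Set(name))
    aliases.toList.flatMap(a => List(analystNotes.get(a), archiveNotes.get(a)).flatten)
  }

  println(merged("POTUS")) // both descriptions, via the user-supplied mapping
}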

Any questions?

February 2, 2013

Semantic Search for Scala – Post 1

Filed under: Programming,Scala,Semantics — Patrick Durusau @ 3:08 pm

Semantic Search for Scala – Post 1 by Mads Hartmann Jensen.

From the post:

The goal of the project is to create a semantic search engine for Scala, in the form of a library, and integrate it with the Scala IDE plugin for Eclipse. Part of the solution will be to index all aspects of Scala code, that is:

  • Definitions of the usual Scala elements: classes, traits, objects, methods, fields, etc.
  • References to the above elements. Some more challenging cases to consider are self-types, type-aliases, code injected by the compiler, and implicits.

With this information the library should be able to

  • Find all occurrences of any type of Scala element
  • Create a call-hierarchy, that is, list all incoming and outgoing method invocations for any Scala method.
  • Create a type-hierarchy, i.e. list all super- and subclasses of a specific type (I won’t necessarily find time to implement this during my thesis, but nothing is stopping me from working on the project even after I hand in the report)

Mads is working on his master’s thesis and Typesafe has agreed to collaborate with him.

For a longer description of the project (or to comment), see: Features and Trees

If you have suggestions on semantic search for programming languages, please contact Mads on Twitter: @Mads_Hartmann.
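
For a rough picture of what such a library has to maintain, here is a toy sketch in Scala (a hypothetical structure, not Mads’s design) of an occurrence index:

object OccurrenceIndexSketch extends App {
  // Every definition or reference of a Scala element, keyed by a
  // stable symbol name (e.g. a fully qualified name).
  sealed trait Kind
  case object Definition extends Kind
  case object Reference  extends Kind

  final case class Occurrence(file: String, line: Int, kind: Kind)

  val index: Map[String, List[Occurrence]] = Map(
    "com.example.Parser.parse" -> List(
      Occurrence("Parser.scala", 42, Definition),
      Occurrence("Main.scala", 7, Reference),
      Occurrence("ParserSpec.scala", 19, Reference)
    )
  )

  // "Find all occurrences" is then a lookup; a call hierarchy is a
  // walk over Reference entries from one method's occurrences to another's.
  def occurrencesOf(symbol: String): List[Occurrence] =
    index.getOrElse(symbol, Nil)

  occurrencesOf("com.example.Parser.parse").foreach(println)
}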

January 30, 2013

Identity In A Context

Filed under: Search Engines,Searching,Semantics — Patrick Durusau @ 8:44 pm

Jasmine Ashton frames a quote about Julie Lynch, an archivist:

Due to the nature of her work, Lynch is the human equivalent of a search engine. However, she differs in one key aspect:

“Unlike Google, Lynch delivers more than search results, she provides context. That sepia-tinged photograph of the woman in funny-looking clothes on a funny-looking bicycle actually offers a window into the impact bicycles had on women’s independence. An advertisement touting “can build frame houses” demonstrates construction restrictions following the Great Chicago Fire. Surprisingly, high school yearbooks — the collection features past editions from Lane Tech, Amundsen and Lake View High Schools — serve as more than a cautionary tale in the evolution of hairstyles.”

Despite the increase in technology that makes searching for information as easy as tapping a touch screen, this article reiterates the importance of having real people to contextualize these documents. (How Librarians Play an Integral Role When Searching for Historical Documents)

Rather than say “contextualize,” I would prefer to say that librarians provide alternative “contexts” for historical documents.

Recognition of a document, or any other subject, takes place in a context. A librarian can offer the user different contexts in which to understand a document.

That doesn’t invalidate the initial context of understanding; it simply becomes an alternative one.

Quite different from our search engines, which see only “matches” and no context for those matches.

January 26, 2013

Is Google Hijacking Semantic Markup/Structured Data? [FALSE]

Filed under: Search Data,Search Engines,Searching,Semantics,Structured Data — Patrick Durusau @ 1:42 pm

Is Google Hijacking Semantic Markup/Structured Data? by Barbara Starr.

From the post:

On December 12, 2012, Google rolled out a new tool, called the Google Data Highlighter for event data. Upon a cursory read, it seems to be a tagging tool, where a human trains the Data Highlighter using a few pages on their website, until Google can pick up enough of a pattern to do the remainder of the site itself.

Better yet, you can see all of these results in the structured data dashboard. It appears as if event data is marked up and is compatible with schema.org. However, there is a caveat here that some folks may not notice.

No actual markup is placed on the page, meaning that none of the semantic markup using this Data Highlighter tool is consumable by Bing, Yahoo or any other crawler on the Web; only Google can use it!

Google is essentially hi-jacking semantic markup so only Google can take advantage of it. Google has the global touch and the ability to execute well-thought-out and brilliantly strategic plans.

Let’s do this by the numbers:

  1. Google develops a service for webmasters to add semantic annotations to their webpages.
  2. Google allows webmasters to use that service at no charge.
  3. Google uses those annotations to improve the search results it provides users (for free).

Google used its own resources to develop a valuable service for webmasters that enhances their websites and user experience with Google, for free.

Perhaps there is a new definition of hijacking?

Webster says the traditional definition includes “to steal or rob as if by hijacking.”

The Semantic Web:

Hijacking

(a) Failing to whitewash the Semantic Web’s picket fence while providing free services to webmasters and users to enhance searching of web content.

(b) Failing to give away data from free services to webmasters and users to those who did not plant, reap, spin, weave or sew.

I don’t find the Semantic Web’s definition of “hijacking” persuasive.

You?

I first saw this at: Google’s Structured Data Take Over by Angela Guess.
