Archive for the ‘Semantic Search’ Category

How semantic search is killing the keyword

Sunday, December 29th, 2013

How semantic search is killing the keyword by Rich Benci.

From the post:

Keyword-driven results have dominated search engine results pages (SERPs) for years, and keyword-specific phrases have long been the standard used by marketers and SEO professionals alike to tailor their campaigns. However, Google’s major new algorithm update, affectionately known as Hummingbird because it is “precise and fast,” is quietly triggering a wholesale shift towards “semantic search,” which focuses on user intent (the purpose of a query) instead of individual search terms (the keywords in a query).

Attempts have been made (in the relatively short history of search engines) to explore the value of semantic results, which address the meaning of a query, rather than traditional results, which rely on strict keyword adherence. Most of these efforts have ended in failure. However, Google’s recent steps have had quite an impact in the internet marketing world. Google began emphasizing the importance of semantic search by showcasing its Knowledge Graph, a clear sign that search engines today (especially Google) care a lot more about displaying predictive, relevant, and more meaningful sites and web pages than ever before. This “graph” is a massive mapping system that connects real-world people, places, and things that are related to each other and that bring richer, more relevant results to users. The Knowledge Graph, like Hummingbird, is an example of how Google is increasingly focused on answering questions directly and producing results that match the meaning of the query, rather than matching just a few words.

“Hummingbird” takes flight

Google’s search chief, Amit Singhal, says that the Hummingbird update is “the first time since 2001 that a Google algorithm has been so dramatically rewritten.” This is how Danny Sullivan of Search Engine Land explains it: “Hummingbird pays more attention to each word in a query, ensuring that the whole query — the whole sentence or conversation or meaning — is taken into account, rather than particular words.”

The point of this new approach is to filter out less-relevant, less-desirable results, making for a more satisfying, more accurate answer that includes rich supporting information and easier navigation. Google’s Knowledge Graph, with its “connect the dots” type of approach, is important because users stick around longer as they discover more about related people, events, and topics. The results of a simple search for Hillary Clinton, for instance, include her birthday, her hometown, her family members, the books she’s written, a wide variety of images, and links to “similar” people, like Barack Obama, John McCain, and Joe Biden.

The key to making your website more amenable to “semantic search” is the use of the markup vocabulary you will find at Schema.org.

That is to say, Google has pre-fabricated information in its Knowledge Graph that it can match up with information specified using Schema.org markup.
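To make “information specified using Schema.org markup” a little more concrete, here is a minimal sketch that builds a Schema.org Person description as JSON-LD, one of the syntaxes that can carry the vocabulary. The helper name person_jsonld and the sample values are my own assumptions for illustration. Only the "@context" and "@type" keywords and the Person properties name, birthDate and birthPlace come from the vocabulary itself.

```python
import json

def person_jsonld(name, birth_date, birth_place):
    """Build a Schema.org Person description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": name,
        "birthDate": birth_date,  # ISO 8601 date string
        "birthPlace": {"@type": "Place", "name": birth_place},
    }

# Hypothetical sample values, for illustration only.
doc = person_jsonld("Jane Example", "1970-01-01", "Springfield")
print(json.dumps(doc, indent=2))
```

A search engine that already holds facts about a person can line them up against properties like these, which is the matching described above.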

Sounds remarkably like a topic map, doesn’t it?

Useful if you are looking for “popular” people, places and things. Not so hot with intra-enterprise search results. Unless of course your enterprise is driven by “pop” culture.

Impressive if you want coarse semantic searching sufficient to sell advertising. (See Type Hierarchy at Schema.org for all available types.)

I say coarse semantic searching because my count of the types at Schema.org, as of today, is seven hundred and nineteen (719) types. Is that what you get?

I ask because in scanning “InterAction,” I don’t see SexAction or any of its sub-categories. Under “ConsumeAction” I don’t see SmokeAction or SmokeCrackAction or SmokeWeedAction or any of the other sub-categories of “ConsumeAction.” Under “LocalBusiness” I did not see WhoreHouse, DrugDealer, S/MShop, etc.

I felt like I had fallen into BradyBunchville. 😉

Seriously, if they left out those mainstream activities, what are the chances they included what you need for your enterprise?

Not so good. That’s what I thought.

A topic map when paired with a search engine and your annotated content can take your enterprise beyond keyword search.

The Gap Between Documents and Answers

Thursday, October 24th, 2013

I mentioned the webinar: Driving Knowledge-Worker Performance with Precision Search Results a few days ago in Findability As Value Proposition.

There was one nugget (among many) in the webinar that I want to mention before I lose sight of how important it is to topic maps and semantic technologies in general.

Dan Taylor (Earley and Associates) was presenting a maturation diagram for knowledge technologies.

See the presentation for the details, but what struck me was that on the left side (starting point) there were documents. On the right side (the goal) were answers.

Think about that for a moment.

When you search in Google or any other search engine, what do you get back? Pointers to documents, presentations, videos, etc.

What task remains? Digging out answers from those documents, presentations, videos.

A mature knowledge technology goes beyond what an average user is searching for (the Google model) and returns information based on a specific user for a particular domain, that is, an answer.

For the average user there may be no better option than to drop them off in the neighborhood of a correct answer. Or what may be a correct answer to the average user. No guarantees that you will find it.

The examples in the webinar are in specific domains where user queries can be modeled accurately enough to formulate answers (not documents) to answer queries.

Reminds me of TaxMap. You?

If you want to do a side-by-side comparison, try USC: Title 26 – Internal Revenue Code, from the Legal Information Institute (Cornell).

Don’t get me wrong, the Cornell materials are great but they reflect the U.S. Code, nothing more or less. That is to say the text you find there isn’t engineered to provide answers. 😉

I will update this post with the webinar address as soon as it appears.

Semantic Search and Linked Open Data Special Issue

Friday, September 27th, 2013

Semantic Search and Linked Open Data Special Issue

Paper submission: 15 December 2013
Notice of review results: 15 February 2014
Revisions due: 31 March 2014
Publication: Aslib Proceedings, issue 5, 2014.

From the call:

The opportunities and challenges of Semantic Search from theoretical and practical, conceptual and empirical perspectives. We are particularly interested in papers that place carefully conducted studies into the wider framework of current Semantic Search research in the broader context of Linked Open Data. Topics of interest include but are not restricted to:

  • The history of semantic search –  the latest techniques and technology developments in the last 1000 years
  • Technical approaches to semantic search: linguistic/NLP, probabilistic, artificial intelligence, conceptual/ontological
  • Current trends in Semantic Search, including best practice, early adopters, and cultural heritage
  • Usability and user experience; Visualisation; and techniques and technologies in the practice for Semantic Search
  • Quality criteria and Impact of norms and standardisation similar to ISO 25964 “Thesauri for information retrieval”
  • Cross-industry collaboration and standardisation
  • Practical problems in brokering consensus and agreement – defining concepts, terms and classes, etc
  • Curation and management of ontologies
  • Differences between web-scale, enterprise scale, and collection-specific scale techniques
  • Evaluation of Semantic Search solutions, including comparison of data collection approaches
  • User behaviour including evolution of norms and conventions; Information behaviour; and Information literacy
  • User surveys; usage scenarios and case studies

Papers should clearly connect their studies to the wider body of Semantic Search scholarship, and spell out the implications of their findings for future research. In general, only research-based submissions including case studies and best practice will be considered. Viewpoints, literature reviews or general reviews are generally not acceptable.

See the post for submission requirements, etc.

I am encouraged by the inclusion of:

The history of semantic search –  the latest techniques and technology developments in the last 1000 years

Wondering who will take up the gauntlet on that topic?

Broccoli: Semantic Full-Text Search at your Fingertips

Friday, April 19th, 2013

Broccoli: Semantic Full-Text Search at your Fingertips by Hannah Bast, Florian Bäurle, Björn Buchhold, Elmar Haussmann.

Abstract:

We present Broccoli, a fast and easy-to-use search engine for what we call semantic full-text search. Semantic full-text search combines the capabilities of standard full-text search and ontology search. The search operates on four kinds of objects: ordinary words (e.g., edible), classes (e.g., plants), instances (e.g., Broccoli), and relations (e.g., occurs-with or native-to). Queries are trees, where nodes are arbitrary bags of these objects, and arcs are relations. The user interface guides the user in incrementally constructing such trees by instant (search-as-you-type) suggestions of words, classes, instances, or relations that lead to good hits. Both standard full-text search and pure ontology search are included as special cases. In this paper, we describe the query language of Broccoli, a new kind of index that enables fast processing of queries from that language as well as fast query suggestion, the natural language processing required, and the user interface. We evaluated query times and result quality on the full version of the English Wikipedia (32 GB XML dump) combined with the YAGO ontology (26 million facts). We have implemented a fully functional prototype based on our ideas, see http://broccoli.informatik.uni-freiburg.de.

The most impressive part of an impressive paper was the new index, context lists.

The second idea, which is the main idea behind our new index, is to have what we call context lists instead of inverted lists. The context list for a prefix contains one index item per occurrence of a word starting with that prefix, just like the inverted list for that prefix would. But along with that it also contains one index item for each occurrence of an arbitrary entity in the same context as one of these words.
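To see how that differs from an ordinary inverted list, here is a toy sketch of my own (not the authors’ implementation). It assumes each context arrives already split into words and recognized entities, uses a fixed three-character prefix, and tags entity postings with an "entity:" marker. All three choices are assumptions made for illustration, as is the function name build_context_lists.

```python
from collections import defaultdict

def build_context_lists(contexts, prefix_len=3):
    """Map each word prefix to a list of (context_id, item) postings.

    contexts: list of (words, entities) pairs, one pair per context.
    For every context containing a word with a given prefix, the list
    holds that word plus every entity occurring in the same context.
    """
    lists = defaultdict(list)
    for cid, (words, entities) in enumerate(contexts):
        # Group the words of this context by their prefix.
        by_prefix = defaultdict(list)
        for w in words:
            by_prefix[w[:prefix_len].lower()].append(w)
        for prefix, matched in by_prefix.items():
            # The postings an inverted list would also contain.
            for w in matched:
                lists[prefix].append((cid, w))
            # The extra postings: entities sharing the context.
            for e in entities:
                lists[prefix].append((cid, "entity:" + e))
    return dict(lists)

# Hypothetical contexts, for illustration only.
contexts = [
    (["edible", "leaves"], ["Broccoli"]),
    (["edible", "stalks"], ["Rhubarb"]),
    (["toxic", "leaves"], ["Rhubarb"]),
]
index = build_context_lists(contexts)
entities_near_edi = sorted(
    {item for _, item in index["edi"] if item.startswith("entity:")}
)
print(entities_near_edi)  # ['entity:Broccoli', 'entity:Rhubarb']
```

The point of the extra postings is that a single scan of the list for “edi” already yields the entities that co-occur with matching words, with no second lookup or join.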

The performance numbers speak for themselves.

This should be a feature in the next release of Lucene/Solr. Or perhaps even configurable for the number of entities that can appear in a “context list.”

Was it happenstance or a desire for simplicity that caused the original indexing engines to parse text into single tokens?

Literature references on that point?

2ND International Workshop on Mining Scientific Publications

Monday, April 15th, 2013

2ND International Workshop on Mining Scientific Publications

May 26, 2013 – Submission deadline
June 23, 2013 – Notification of acceptance
July 7, 2013 – Camera-ready
July 26, 2013 – Workshop

From the CFP:

Digital libraries that store scientific publications are becoming increasingly important in research. They are used not only for traditional tasks such as finding and storing research outputs, but also as sources for mining this information, discovering new research trends and evaluating research excellence. The rapid growth in the number of scientific publications being deposited in digital libraries makes it no longer sufficient to provide access to content to human readers only. It is equally important to allow machines analyse this information and by doing so facilitate the processes by which research is being accomplished. Recent developments in natural language processing, information retrieval, the semantic web and other disciplines make it possible to transform the way we work with scientific publications. However, in order to make this happen, researchers first need to be able to easily access and use large databases of scientific publications and research data, to carry out experiments.

This workshop aims to bring together people from different backgrounds who:
(a) are interested in analysing and mining databases of scientific publications,
(b) develop systems, infrastructures or datasets that enable such analysis and mining,
(c) design novel technologies that improve the way research is being accomplished or
(d) support the openness and free availability of publications and research data.

2. TOPICS

The topics of the workshop will be organised around the following three themes:

  1. Infrastructures, systems, open datasets or APIs that enable analysis of large volumes of scientific publications.
  2. Semantic enrichment of scientific publications by means of text-mining, crowdsourcing or other methods.
  3. Analysis of large databases of scientific publications to identify research trends, high impact, cross-fertilisation between disciplines, research excellence and to aid content exploration.

Of particular interest for topic mappers:

Topics of interest relevant to theme 2 include, but are not limited to:

  • Novel information extraction and text-mining approaches to semantic enrichment of publications. This might range from mining publication structure, such as title, abstract, authors, citation information etc. to more challenging tasks, such as extracting names of applied methods, research questions (or scientific gaps), identifying parts of the scholarly discourse structure etc.
  • Automatic categorization and clustering of scientific publications. Methods that can automatically categorize publications according to an established subject-based classification/taxonomy (such as Library of Congress classification, UNESCO thesaurus, DOAJ subject classification, Library of Congress Subject Headings) are of particular interest. Other approaches might involve automatic clustering or classification of research publications according to various criteria.
  • New methods and models for connecting and interlinking scientific publications. Scientific publications in digital libraries are not isolated islands. Connecting publications using explicitly defined citations is very restrictive and has many disadvantages. We are interested in innovative technologies that can automatically connect and interlink publications or parts of publications, according to various criteria, such as semantic similarity, contradiction, argument support or other relationship types.
  • Models for semantically representing and annotating publications. This topic is related to aspects of semantically modeling publications and scholarly discourse. Models that are practical with respect to the state-of-the-art in Natural Language Processing (NLP) technologies are of special interest.
  • Semantically enriching/annotating publications by crowdsourcing. Crowdsourcing can be used in innovative ways to annotate publications with richer metadata or to approve/disapprove annotations created using text-mining or other approaches. We welcome papers that address the following questions: (a) what incentives should be provided to motivate users in contributing metadata, (b) how to apply crowdsourcing in the specialized domains of scientific publications, (c) what tasks in the domain of organising scientific publications is crowdsourcing suitable for and where it might fail, (d) other relevant crowdsourcing topics relevant to the domain of scientific publications.

The other themes could be viewed through a topic map lens, but semantic enrichment seems like a natural fit.

Semantic Search Over The Web (SSW 2013)

Monday, March 18th, 2013

3RD International Workshop on Semantic Search Over The Web (SSW 2013)

Dates:

Abstract Papers submission: May 31, 2013 – 15:00 (3:00 pm) EDT
(Short) Full Paper submission: June 7, 2013 – 15:00 (3:00 pm) EDT
Author notification: July 19, 2013
Camera-ready copy due: August 2, 2013
Workshop date: During VLDB (Aug 26 – Aug 30)

From the webpage:

We are witnessing a smooth evolution of the Web from a worldwide information space of linked documents to a global knowledge base, composed of semantically interconnected resources. To date, the correlated and semantically annotated data available on the web amounts to 25 billion RDF triples, interlinked by around 395 million RDF links. The continuous publishing and the integration of the plethora of semantic datasets from companies, government and public sector projects is leading to the creation of the so-called Web of Knowledge. Each semantic dataset contributes to extend the global knowledge and increases its reasoning capabilities. As a matter of facts, researchers are now looking with growing interest to semantic issues in this huge amount of correlated data available on the Web. Many progresses have been made in the field of semantic technologies, from formal models to repositories and reasoning engines. While the focus of many practitioners is on exploiting such semantic information to contribute to IR problems from a document centric point of view, we believe that such a vast, and constantly growing, amount of semantic data raises data management issues that must be faced in a dynamic, highly distributed and heterogeneous environment such as the Web.

The third edition of the International Workshop on Semantic Search over the Web (SSW) will discuss about data management issues related to the search over the web and the relationships with semantic web technologies, proposing new models, languages and applications.

The research issues can be summarized by the following problems:

  • How can we model and efficiently access large amounts of semantic web data?
  • How can we effectively retrieve information exploiting semantic web technologies?
  • How can we employ semantic search in real world scenarios?

The SSW Workshop invites researchers, engineers, service developers to present their research and works in the field of data management for semantic search. Papers may deal with methods, models, case studies, practical experiences and technologies.

Apologies for the uncertainty of the workshop date. (There is confusion about the date on the workshop site, one place says the 26th, the other the 30th. Check before you make reservation/travel arrangements.)

I differ with the organizers on some issues but on the presence of: “…data management issues that must be faced in a dynamic, highly distributed and heterogeneous environment such as the Web,” there is no disagreement.

That’s the trick, isn’t it? In any confined or small group setting, just about any consistent semantic solution will work.

The hurly-burly of a constant stream of half-heard, partially understood communications across distributed and heterogeneous systems tests the true mettle of semantic solutions.

Not a quest for perfect communication but “good enough.”

Go3R [Searching for Alternatives to Animal Testing]

Monday, December 17th, 2012

Go3R

A semantic search engine for finding alternatives to animal testing.

I mention it as an example of a search interface that assists the user in searching.

The help documentation is a bit sparse if you are looking for an opportunity to contribute to such a project.

I did locate some additional information on the project, all usefully with the same title to make locating it “easy.” 😉

[Introduction] Knowledge-based semantic search engine for alternative methods to animal experiments

[PubMed – entry] Go3R – semantic Internet search engine for alternative methods to animal testing by Sauer UG, Wächter T, Grune B, Doms A, Alvers MR, Spielmann H, Schroeder M. (ALTEX. 2009;26(1):17-31).

Abstract:

Consideration and incorporation of all available scientific information is an important part of the planning of any scientific project. As regards research with sentient animals, EU Directive 86/609/EEC for the protection of laboratory animals requires scientists to consider whether any planned animal experiment can be substituted by other scientifically satisfactory methods not entailing the use of animals or entailing less animals or less animal suffering, before performing the experiment. Thus, collection of relevant information is indispensable in order to meet this legal obligation. However, no standard procedures or services exist to provide convenient access to the information required to reliably determine whether it is possible to replace, reduce or refine a planned animal experiment in accordance with the 3Rs principle. The search engine Go3R, which is available free of charge under http://Go3R.org, runs up to become such a standard service. Go3R is the world-wide first search engine on alternative methods building on new semantic technologies that use an expert-knowledge based ontology to identify relevant documents. Due to Go3R’s concept and design, the search engine can be used without lengthy instructions. It enables all those involved in the planning, authorisation and performance of animal experiments to determine the availability of non-animal methodologies in a fast, comprehensive and transparent manner. Thereby, Go3R strives to significantly contribute to the avoidance and replacement of animal experiments.

[ALTEX entry – full text available] Go3R – Semantic Internet Search Engine for Alternative Methods to Animal Testing

Broccoli: Semantic Full-Text Search at your Fingertips

Friday, July 13th, 2012

Broccoli: Semantic Full-Text Search at your Fingertips by Hannah Bast, Florian Bäurle, Björn Buchhold, and Elmar Haussmann.

Abstract:

We present Broccoli, a fast and easy-to-use search engine for what we call semantic full-text search. Semantic full-text search combines the capabilities of standard full-text search and ontology search. The search operates on four kinds of objects: ordinary words (e.g. edible), classes (e.g. plants), instances (e.g. Broccoli), and relations (e.g. occurs-with or native-to). Queries are trees, where nodes are arbitrary bags of these objects, and arcs are relations. The user interface guides the user in incrementally constructing such trees by instant (search-as-you-type) suggestions of words, classes, instances, or relations that lead to good hits. Both standard full-text search and pure ontology search are included as special cases. In this paper, we describe the query language of Broccoli, a new kind of index that enables fast processing of queries from that language as well as fast query suggestion, the natural language processing required, and the user interface. We evaluated query times and result quality on the full version of the English Wikipedia (32 GB XML dump) combined with the YAGO ontology (26 million facts). We have implemented a fully-functional prototype based on our ideas, see http://broccoli.informatik.uni-freiburg.de.
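The “queries are trees” idea is easier to picture with a small data structure. What follows is my own sketch, not Broccoli’s query language: the class name QueryNode, the render helper, and the sample query (including the “Europe” instance) are all hypothetical. Nodes hold a bag of word, class, or instance items, and arcs carry a relation name.

```python
from dataclasses import dataclass, field

KINDS = {"word", "class", "instance"}

@dataclass
class QueryNode:
    """A node holding a bag of (kind, value) items, with relation-labelled arcs."""
    items: list
    arcs: list = field(default_factory=list)

    def relate(self, relation, child):
        """Attach a child node through a named relation and return self."""
        self.arcs.append((relation, child))
        return self

def render(node, depth=0):
    """Return an indented list of lines describing the query tree."""
    for kind, _ in node.items:
        if kind not in KINDS:
            raise ValueError(f"unknown item kind: {kind}")
    label = ", ".join(f"{kind}:{value}" for kind, value in node.items)
    lines = ["  " * depth + label]
    for relation, child in node.arcs:
        lines.append("  " * (depth + 1) + f"--{relation}-->")
        lines.extend(render(child, depth + 2))
    return lines

# Roughly: plants that occur with the word "edible" and are native to Europe.
root = QueryNode([("class", "plants")])
root.relate("occurs-with", QueryNode([("word", "edible")]))
root.relate("native-to", QueryNode([("instance", "Europe")]))
print("\n".join(render(root)))
```

A tree with a single node of plain words is an ordinary full-text query, and a tree with only classes, instances, and relations is a pure ontology query, which is how both fall out as special cases.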

It’s good to see CS projects work so hard to find unambiguous names that won’t be confused with far more common uses of the same names. 😉

For all that, on quick review it does look like a clever, if annoyingly named, project.

Hmmm, doesn’t like the “-” (hyphen) character. “graph-theoretical tree” returns 0 results, “graph theoretical tree” returns 1 (the expected one).
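I have no idea what Broccoli actually does with hyphens, but one common cause of this kind of mismatch is that the index and the query are tokenized differently. The sketch below is purely an assumption about how such a mismatch can arise. Both tokenizers and the sample sentence are mine.

```python
import re

def tokenize_split_hyphen(text):
    """Treat hyphens as separators: 'graph-theoretical' becomes two tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tokenize_keep_hyphen(text):
    """Keep hyphenated words whole: 'graph-theoretical' stays one token."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def matches(query_tokens, doc_tokens):
    """True when every query token occurs in the document."""
    return set(query_tokens) <= set(doc_tokens)

doc = "A graph-theoretical tree is a connected acyclic graph."
indexed = tokenize_split_hyphen(doc)  # assume the index splits on hyphens

# Query tokenized with hyphens kept: no token 'graph-theoretical' in the index.
print(matches(tokenize_keep_hyphen("graph-theoretical tree"), indexed))   # False
# Query typed without the hyphen: all three tokens are found.
print(matches(tokenize_split_hyphen("graph theoretical tree"), indexed))  # True
```

Running the same tokenizer over both documents and queries makes the two spellings behave identically.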

Definitely worth a close read.

One puzzle though. There are a number of projects that use Wikipedia data dumps. The problem is most of the documents I am interested in searching aren’t in Wikipedia data dumps. Like the Enron emails.

Techniques that work well with clean data may work less well with documents composed of the vagaries of human communication. Or attempts at communication.

Joint International Workshop on Entity-oriented and Semantic Search

Thursday, May 31st, 2012

1st Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012

Important Dates:

  • Submissions Due: July 2, 2012
  • Notification of Acceptance: July 23, 2012
  • Camera Ready: August 1, 2012
  • Workshop date: August 16th, 2012

Located at the 35th ACM SIGIR Conference, Portland, Oregon, USA, August 12–16, 2012.

From the homepage of the workshop:

About the Workshop:

The workshop encompasses various tasks and approaches that go beyond the traditional bag-of-words paradigm and incorporate an explicit representation of the semantics behind information needs and relevant content. This kind of semantic search, based on concepts, entities and relations between them, has attracted attention both from industry and from the research community. The workshop aims to bring people from different communities (IR, SW, DB, NLP, HCI, etc.) and backgrounds (both academics and industry practitioners) together, to identify and discuss emerging trends, tasks and challenges. This joint workshop is a sequel of the Entity-oriented Search and Semantic Search Workshop series held at different conferences in previous years.

Topics

The workshop aims to gather all works that discuss entities along three dimensions: tasks, data and interaction. Tasks include entity search (search for entities or documents representing entities), relation search (search entities related to an entity), as well as more complex tasks (involving multiple entities, spatio-temporal relations inclusive, involving multiple queries). In the data dimension, we consider (web/enterprise) documents (possibly annotated with entities/relations), Linked Open Data (LOD), as well as user generated content. The interaction dimension gives room for research into user interaction with entities, also considering how to display results, as well as whether to aggregate over multiple entities to construct entity profiles. The workshop especially encourages submissions on the interface of IR and other disciplines, such as the Semantic Web, Databases, Computational Linguistics, Data Mining, Machine Learning, or Human Computer Interaction. Examples of topic of interest include (but are not limited to):

  • Data acquisition and processing (crawling, storage, and indexing)
  • Dealing with noisy, vague and incomplete data
  • Integration of data from multiple sources
  • Identification, resolution, and representation of entities (in documents and in queries)
  • Retrieval and ranking
  • Semantic query modeling (detecting, modeling, and understanding search intents)
  • Novel entity-oriented information access tasks
  • Interaction paradigms (natural language, keyword-based, and hybrid interfaces) and result representation
  • Test collections and evaluation methodology
  • Case studies and applications

We particularly encourage formal evaluation of approaches using previously established evaluation benchmarks: Semantic Search Challenge 2010, Semantic Search Challenge 2011, TREC Entity Search Track.

All workshops are special to someone. This one sounds more special than most. Collocated with the ACM SIGIR 2012 meeting. Perhaps that’s the difference.