Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

April 15, 2012

Information Retrieval: Berkeley School of Information

Filed under: Information Retrieval — Patrick Durusau @ 7:12 pm

Information Retrieval: Berkeley School of Information

The PDFs linked from the course outline are password protected, but the course slides are available.

Good slides by the way. Particularly the illustrations.

The course used one of the mini-TREC data sets.

If you are not familiar with TREC, you should be.

March 8, 2012

History of Information Organization (Infographic)

Filed under: Information Overload,Information Retrieval,Information Science — Patrick Durusau @ 8:49 pm

From Cartography to Card Catalogs [Infographic]: History of Information Organization

Mindjet has posted an infographic and blog post about the history of information organization. I have embedded the graphic below.

Let me preface my remarks by saying I have known people at Mindjet and it is a fairly remarkable organization. And to be fair, the history of information organization is of interest to me, although I am far from being a specialist in the field.

However, when a graphic jumps from “850 CE The First Byzantine Encyclopedia” to “1276 CE Oldest Continuously Functioning Library” and informs the reader, on the edge in between, that this was “3,000 years ago,” it is lacking in precision or proofing, perhaps both.

Although information has to be summarized for such a presentation, I thought the rise of writing in Egypt/Sumeria would have merited a note, as would the library of Ashurbanipal (the first library of the ancient Middle East) or the Library of Alexandria, just to name two. Note that you would have to go back before Ashurbanipal to reach 3,000 years ago, and there were written texts, and collections of such texts, for anywhere from 2,000 to 3,000 years before that.

I do appreciate that Mindjet doesn’t think information issues arose with the digital computer. I am hopeful that they will encourage a re-examination of older methods and solutions in hopes of finding clues to new solutions.

February 25, 2012

A Survey of Automatic Query Expansion in Information Retrieval

Filed under: Information Retrieval,Query Expansion — Patrick Durusau @ 7:39 pm

A Survey of Automatic Query Expansion in Information Retrieval by Claudio Carpineto, Giovanni Romano.

Abstract:

The relative ineffectiveness of information retrieval systems is largely caused by the inaccuracy with which a query formed by a few keywords models the actual user information need. One well known method to overcome this limitation is automatic query expansion (AQE), whereby the user’s original query is augmented by new features with a similar meaning. AQE has a long history in the information retrieval community but it is only in the last years that it has reached a level of scientific and experimental maturity, especially in laboratory settings such as TREC. This survey presents a unified view of a large number of recent approaches to AQE that leverage various data sources and employ very different principles and techniques. The following questions are addressed. Why is query expansion so important to improve search effectiveness? What are the main steps involved in the design and implementation of an AQE component? What approaches to AQE are available and how do they compare? Which issues must still be resolved before AQE becomes a standard component of large operational information retrieval systems (e.g., search engines)?

Have you heard topic maps described as being the solution to the following problem?

The most critical language issue for retrieval effectiveness is the term mismatch problem: the indexers and the users do often not use the same words. This is known as the vocabulary problem Furnas et al. [1987], compounded by synonymy (different words with the same or similar meanings, such as “tv” and “television”) and polysemy (same word with different meanings, such as “java”). Synonymy, together with word inflections (such as with plural forms, “television” versus “televisions”), may result in a failure to retrieve relevant documents, with a decrease in recall (the ability of the system to retrieve all relevant documents). Polysemy may cause retrieval of erroneous or irrelevant documents, thus implying a decrease in precision (the ability of the system to retrieve only relevant documents).

That sounds like the XWindows index merging problem, doesn’t it? (Different terms being used by *nix vendors who wanted to use a common set of XWindows documentation.)

The authors describe the amount of data on the web being searched with queries of only one, two or three terms:

In this situation, the vocabulary problem has become even more serious because the paucity of query terms reduces the possibility of handling synonymy while the heterogeneity and size of data make the effects of polysemy more severe.

But the size of the data isn’t a given. What if a topic map with scoped names were used to delimit the sites searched using a particular identifier?

For example, a topic could have the name: “TRIM19” and a scope of: “http://www.ncbi.nlm.nih.gov/gene.” If you try a search with “TRIM19” at the scoping site, you get a very different result than if you use “TRIM19” with say “http://www.google.com.”

Try it, I’ll wait.

Now, imagine that your scoping topic on “TRIM19” isn’t just that one site but a topic that represents all the gene database sites known to you. I don’t know the number but it can’t be very large, at least when compared to the WWW.

That simple act of delimiting the range of your searches makes them far less subject to polysemy.

Not to mention that a topic map could be used to supply terms for use in automated query expansion.
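
As a sketch of what that could look like in code (the topic record shape is hypothetical, not any particular topic map API; “PML” is a synonym the map might supply for TRIM19):

```python
# Hypothetical topic record: each name carries a scope listing the sites
# where that name is a reliable identifier.
topic = {
    "names": [
        {"value": "TRIM19", "scope": ["www.ncbi.nlm.nih.gov/gene"]},
        # a synonym the topic map supplies, usable for query expansion
        {"value": "PML", "scope": ["www.ncbi.nlm.nih.gov/gene"]},
    ]
}

def scoped_queries(topic):
    """Expand each scoped name into site-delimited search queries."""
    for name in topic["names"]:
        for site in name["scope"]:
            yield f'"{name["value"]}" site:{site}'

for q in scoped_queries(topic):
    print(q)
# "TRIM19" site:www.ncbi.nlm.nih.gov/gene
# "PML" site:www.ncbi.nlm.nih.gov/gene
```

Polysemy shrinks because only the scoped sites are searched; the supplied synonyms do the query expansion.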

BTW, the survey is quite interesting and deserves a slow read with follow up on the cited references.

February 20, 2012

Attention-enhancing information retrieval

Filed under: Information Retrieval,Interface Research/Design,Users — Patrick Durusau @ 8:36 pm

Attention-enhancing information retrieval

William Webber writes:

Last week I was at SWIRL, the occasional talkshop on the future of information retrieval. To me the most important of the presentations was Diane Kelly’s “Rage against the Machine Learning”, in which she observed the way information retrieval currently works has changed the way people think. In particular, she proposed that the combination of short query with snippet response has reworked people’s plastic brains to focus on working memory, and forgo the processing of information required for it to lay its tracks down in our long term memory. In short, it makes us transactionally adept, but stops us from learning.

This is as important as Bret Victor’s presentation.

I particularly liked the line:

Various fanciful scenarios were given, but the ultimate end-point of such a research direction is that you walk into the shopping mall, and then your mobile phone leads you round telling you what to buy.

Reminds me of a line I remember imperfectly: judging from advertising, we are all “…insecure, sex-starved neurotics with 15-second attention spans.”

I always thought that was being generous on the attention span but opinions differ on that point. 😉

How do you envision your users? Serious question but not one you have to answer here. Ask yourself.

February 15, 2012

KDIR 2012 : International Conference on Knowledge Discovery and Information Retrieval

Filed under: Conferences,Information Retrieval,Knowledge Discovery — Patrick Durusau @ 8:31 pm

KDIR 2012 : International Conference on Knowledge Discovery and Information Retrieval

Regular Paper Submission: April 17, 2012
Authors Notification (regular papers): June 12, 2012
Final Regular Paper Submission and Registration: July 4, 2012

From the call for papers:

Knowledge Discovery is an interdisciplinary area focusing upon methodologies for identifying valid, novel, potentially useful and meaningful patterns from data, often based on underlying large data sets. A major aspect of Knowledge Discovery is data mining, i.e. applying data analysis and discovery algorithms that produce a particular enumeration of patterns (or models) over the data. Knowledge Discovery also includes the evaluation of patterns and identification of which add to knowledge. This has proven to be a promising approach for enhancing the intelligence of software systems and services. The ongoing rapid growth of online data due to the Internet and the widespread use of large databases have created an important need for knowledge discovery methodologies. The challenge of extracting knowledge from data draws upon research in a large number of disciplines including statistics, databases, pattern recognition, machine learning, data visualization, optimization, and high-performance computing, to deliver advanced business intelligence and web discovery solutions.

Information retrieval (IR) is concerned with gathering relevant information from unstructured and semantically fuzzy data in texts and other media, searching for information within documents and for metadata about documents, as well as searching relational databases and the Web. Automation of information retrieval enables the reduction of what has been called “information overload”.

Information retrieval can be combined with knowledge discovery to create software tools that empower users of decision support systems to better understand and use the knowledge underlying large data sets.

Part of IC3K 2012 – International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management.

February 14, 2012

Scienceography: the study of how science is written

Filed under: Data Mining,Information Retrieval — Patrick Durusau @ 5:05 pm

Scienceography: the study of how science is written by Graham Cormode, S. Muthukrishnan and Jinyun Yun.

Abstract:

Scientific literature has itself been the subject of much scientific study, for a variety of reasons: understanding how results are communicated, how ideas spread, and assessing the influence of areas or individuals. However, most prior work has focused on extracting and analyzing citation and stylistic patterns. In this work, we introduce the notion of ‘scienceography’, which focuses on the writing of science. We provide a first large scale study using data derived from the arXiv e-print repository. Crucially, our data includes the “source code” of scientific papers (the LaTeX source), which enables us to study features not present in the “final product”, such as the tools used and private comments between authors. Our study identifies broad patterns and trends in two example areas, computer science and mathematics, as well as highlighting key differences in the way that science is written in these fields. Finally, we outline future directions to extend the new topic of scienceography.

What content are you searching/indexing in a scientific context?

The authors discover what many of us have overlooked: the “source” of scientific papers. A source that can reflect a richer history than the final product.

Some questions:

Will searching the source give us finer grained access to the content? That is, can we separate portions of text that recite history, related research, and background from new insights/conclusions, accessing the other material only if needed? (Every graph paper starts off with nodes and edges, complete with citations. Anyone reading a graph paper is likely to know those terms.)

Other disciplines use LaTeX. Do those LaTeX files differ from the ones reported here? If so, in what way?
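
If you want to poke at LaTeX sources yourself, a minimal sketch along these lines (the regexes and file handling are my assumptions, not the authors’ code):

```python
import re
from pathlib import Path

COMMENT = re.compile(r"(?<!\\)%(.*)$", re.MULTILINE)   # % that is not escaped as \%
PACKAGE = re.compile(r"\\usepackage(?:\[[^\]]*\])?\{([^}]+)\}")

def scan_tex(path):
    """Pull author comments and package usage out of a .tex file."""
    src = Path(path).read_text(errors="ignore")
    comments = [m.group(1).strip() for m in COMMENT.finditer(src)]
    packages = [p.strip() for m in PACKAGE.finditer(src)
                for p in m.group(1).split(",")]
    return comments, packages

# comments, packages = scan_tex("paper.tex")
# `comments` may hold author-to-author notes absent from the final PDF;
# `packages` hints at the tools used, one of the paper's features.
```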

January 20, 2012

ISO 25964-1 Thesauri for information retrieval

Filed under: Cloud Computing,Information Retrieval,ISO/IEC,JTC1,Standards,Thesaurus — Patrick Durusau @ 9:18 pm

Information and documentation – Thesauri and interoperability with other vocabularies – Part 1: Thesauri for information retrieval

Actually that is the homepage for Networked Knowledge Organization Systems/Services (NKOS), but the lead announcement item is ISO 25964-1, etc.

From that webpage:

New international thesaurus standard published

ISO 25964-1 is the new international standard for thesauri, replacing ISO 2788 and ISO 5964. The full title is Information and documentation – Thesauri and interoperability with other vocabularies – Part 1: Thesauri for information retrieval. As well as covering monolingual and multilingual thesauri, it addresses 21st century needs for data sharing, networking and interoperability.

Content includes:

  • construction of mono- and multi-lingual thesauri;
  • clarification of the distinction between terms and concepts, and their inter-relationships;
  • guidance on facet analysis and layout;
  • guidance on the use of thesauri in computerized and networked systems;
  • best practice for the management and maintenance of thesaurus development;
  • guidelines for thesaurus management software;
  • a data model for monolingual and multilingual thesauri;
  • brief recommendations for exchange formats and protocols.

An XML schema for data exchange has been derived from the data model, and is available free of charge at http://www.niso.org/schemas/iso25964/. Coming next: ISO 25964-1 is the first of two publications. Part 2: Interoperability with other vocabularies is in the public review stage and will be available by the end of 2012.

Find out how you can obtain a copy from the news release.

Let me help you there, the correct number is: ISO 25964-1:2011 and the list price for a PDF copy is CHF 238,00, or in US currency (today), $257.66 (for 152 pages).

Shows what I know about semantic interoperability.

If you want semantic interoperability, you charge people $1.69 per page (152 pages) for access to the principles of thesauri to be used for information retrieval.

ISO/IEC and JTC 1 are all parts of a system of viable international (read non-vendor dominated) organizations for information/data standards. They are the natural homes for the management of data integration standards that transcend temporal, organizational, governmental and even national boundaries.

But those roles will not fall to them by default. They must seize the initiative and those roles. Clinging to old-style publishing models for support makes them appear timid in the face of current challenges.

Even vendors recognize their inability to create level playing fields for technology/information standards. And the benefits that come to vendors from de jure as well as non-de jure standards organizations.

ISO/IEC/JTC1, provided they take the initiative, can provide an international, de jure home for standards that form the basis for information retrieval and integration.

The first step to take is to make ISO/IEC/JTC1 information standards publicly available by default.

The second step is to call upon all members and beneficiaries, both direct and indirect, of ISO/IEC/JTC 1 work, to assist in the creation of mechanisms to support the vital roles played by ISO/IEC/JTC 1 as de jure standards bodies.

We can all learn something from ISO 25964-1, but how many of us will at that sticker price?
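
Even without paying, you can get a feel for the standard’s key move, distinguishing concepts from the terms that label them, from the free XML schema. A rough sketch of that distinction in code (my reading, not the normative data model):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Term:
    value: str
    lang: str        # language tag, needed for multilingual thesauri

@dataclass
class Concept:       # the unit of thought; terms merely label it
    pref_terms: List[Term]                                   # one preferred term per language
    alt_terms: List[Term] = field(default_factory=list)      # non-preferred ("use for") terms
    broader: List["Concept"] = field(default_factory=list)   # BT relations
    related: List["Concept"] = field(default_factory=list)   # RT relations

tv = Concept(
    pref_terms=[Term("television", "en"), Term("télévision", "fr")],
    alt_terms=[Term("tv", "en")],
)
```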

December 6, 2011

IR – Foundation?

Filed under: Information Retrieval,Semantics — Patrick Durusau @ 8:02 pm

I find the following statement troubling. See if you can spot what’s missing from:

In terms of research, the area may be studied from two rather distinct and complementary points of view: a computer-centered one and a human-centered one. In the computer-centered view, IR consists mainly of building up efficient indexes, processing user queries with high performance, and developing ranking algorithms to improve the results. In the human-centered view, IR consists mainly of studying the behavior of the user, understanding their main needs, and of determining how such understanding affects the organization and operation of the retrieval system. In this book, we focus mainly on the computer-centered view of IR, which is dominant in academia and in the market place. (page 1, Modern Information Retrieval, 2nd ed., Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Pearson 2011)

I am not challenging the accuracy of the statement. Although I might explain some of it differently from the authors.

The terminology by which computer-centered IR is described is one clue: “….efficient…, ….high performance, ….improve the results.” That is, computer-centered IR is mostly concerned with measurable results. Things to which we can put numbers, ranking one as higher than the others. Nothing wrong with that. Personally I have a great deal of interest in such approaches.

Human-centered IR is described by: “….behavior…, ….needs, ….understanding….organization and operation….” Human-centered IR is mostly concerned with how users perform IR. Not as measurable, but just as important as computer-centered IR. As the authors point out, computer-centered IR dominates in academia and in the market place. I suspect that is because what can be easily measured is more attractive.

Do you notice something missing yet?

I thought it was quite remarkable that semantics weren’t mentioned. That is, whatever computer- or human-centered approaches you take, their efficacy is going to vary with the semantics of the language on which IR is being performed. If that seems like an odd claim, consider the utility of an IR system that does not properly sort European, much less Asian, words, whether written in their scripts or in transliteration.

True enough, we can make an IR system that is very fast by simply ignoring the correct sort orders for such languages, and in the past we have taught readers of such languages to accept what the IR system was providing. So the behavior of the users was adapted to the systems. Human-centered, I suppose, but not in the way I usually think about it.
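
To see how cheap it is to get this wrong, compare a naive code-point sort with a locale-aware one (a minimal sketch; assumes the de_DE.UTF-8 locale is installed on your system):

```python
import locale

words = ["Zebra", "Äpfel", "Apfel", "Öl"]

# Naive code-point sort: the umlauted words land after "Zebra",
# which is wrong for German readers.
print(sorted(words))                       # ['Apfel', 'Zebra', 'Äpfel', 'Öl']

# Locale-aware collation puts them where a German reader expects.
locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
print(sorted(words, key=locale.strxfrm))   # ['Apfel', 'Äpfel', 'Öl', 'Zebra']
```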

And, after all, semantics are the reason we want to do IR in the first place. If the contents we were searching had no semantics, it is very unlikely we would want to search them at all. No matter how efficient or well organized a system might be.

My real concern is that semantics are being assumed as a matter of course. We all “know” the semantics. Hardly worth discussing. But that is why search results so seldom meet our expectations. We didn’t discuss the semantics up front. Everyone from system architect, programmer, UI designer, content author, all the way to and including the searcher, “knew” the semantics.

Trouble is, the semantics they “know,” are often different.

Of course the authors are free to include or exclude any content they wish, and to fully cover semantic issues in general would require a volume at least as long as this one. (A little over 900 pages with the index.)

I would start with something like:

to make the point that we always start with languages and semantics and that data/texts are recorded in systems using languages and semantics. Our data structures are not neutral bystanders. They determine as much of what we can find as they determine how we will interpret it.

Try running a modern genealogy for someone and, when you find an arrest record showing a close relative was a war criminal or child molester, see if the family wants that included. Suddenly that will be more important than other prizes or honors they have won. Still the same person, but the label on the data, arrest record, makes us suspect the worst. Had it read: “False Arrests, a record of false charges during the regime of XXX,” we are likely to react differently.

I am going to use Baeza-Yates and Ribeiro-Neto as one of the required texts in the next topic maps class. So we can cover some of the mining techniques that will help populate topic maps.

But I will also cover the issue of languages/semantics as well as data/texts (in how they are stored and the semantics of the same).

Does anyone have a favorite single volume on languages/semantics? I would lean towards Doing What Comes Naturally by Stanley Fish but I am sure there are other volumes equally good.

Data/text formats and their semantics are likely to be harder to come by. I don’t know of anything off hand that focuses on that in a monograph-length treatment. Suggestions?

PS: I know I got the image wrong but I am about to post. I will post a slightly amended image tomorrow when I have thought about it some more.

Don’t let that deter you from posting criticisms of the current image in the meantime.

November 28, 2011

Template-Based Information Extraction without the Templates

Template-Based Information Extraction without the Templates by Nathanael Chambers and Dan Jurafsky.

Abstract:

Standard algorithms for template-based information extraction (IE) require predefined template schemas, and often labeled data, to learn to extract their slot fillers (e.g., an embassy is the Target of a Bombing template). This paper describes an approach to template-based IE that removes this requirement and performs extraction without knowing the template structure in advance. Our algorithm instead learns the template structure automatically from raw text, inducing template schemas as sets of linked events (e.g., bombings include detonate, set off, and destroy events) associated with semantic roles. We also solve the standard IE task, using the induced syntactic patterns to extract role fillers from specific documents. We evaluate on the MUC-4 terrorism dataset and show that we induce template structure very similar to hand-created gold structure, and we extract role fillers with an F1 score of .40, approaching the performance of algorithms that require full knowledge of the templates.

Can you say association?

Definitely points towards a pipeline approach to topic map authoring. To abuse the term, perhaps a “dashboard” that allows selection of data sources, followed by the construction of workflows, with preliminary analysis displayed at “breakpoints” in the processing. No particular reason why stages have to be wired together other than tradition.

Just looking a little bit into the future, imagine that some entities aren’t being recognized at a high enough rate. So you shift that part of the data to several thousand human entity processors, take the consensus of their results (higher accuracy than you were getting), and feed that back into the system. You could have knowledge workers who work full time but shift from job to job, performing tasks too difficult to program effectively.
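
A minimal sketch of that feedback loop, using a majority vote over several human judgments (the labels and threshold are made up):

```python
from collections import Counter

def crowd_label(labels, min_agreement=0.6):
    """Majority vote over labels from many human entity processors.

    Returns the winning label if enough annotators agree; otherwise None,
    and the example goes back into the queue or to an expert.
    """
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label if votes / len(labels) >= min_agreement else None

# An entity the recognizer was unsure about, sent to five humans:
print(crowd_label(["ORG", "ORG", "PERSON", "ORG", "ORG"]))      # -> "ORG"
print(crowd_label(["ORG", "PERSON", "GPE", "ORG", "PERSON"]))   # -> None
```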

November 18, 2011

Information retrieval model based on graph comparison

Filed under: Graphs,Information Retrieval — Patrick Durusau @ 9:38 pm

Information retrieval model based on graph comparison (pdf) by Quoc-Dinh Truong, Taoufiq Dkaki, Josiane Mothe, Pierre-Jean Charrel.

We propose a new method for Information Retrieval (IR) based on graph vertices comparison. The main goal of this method is to enhance the core IR-process of finding relevant documents in a collection of documents according to a user’s needs. The method we propose is based on graph comparison and involves recursive computation of similarity. In the framework of this approach, documents, queries and indexing terms are viewed as vertices of a bipartite graph where edges go from a document or a query – first node type – to an indexing term – second node type. Edges reflect the link that exists between documents or queries on the one hand and indexing terms on the other hand. In our model, graph edge settings reflect the tf-idf paradigm. The proposed similarity measure instantiates and extends this principle, stipulating that the resemblance of two items or objects can be computed using the similarities of the items to which they are related. Our method also takes into account the concept of similarity propagation over graph edges.

Experiments conducted using four small sized IR test collections (TREC 2004 Novelty Track, CISI, Cranfield & Medline) demonstrate the effectiveness of our approach and its feasibility as long as the graph size does not exceed a few thousand nodes. The experiment’s results show that our method outperforms the vector-based model. Our method actually highly outperforms the vector-based cosine model, sometimes by more than doubling the precision, up to the top sixty returned documents. The computational complexity issue is resituated in the context of MAC-FAC approaches – many are called but few are chosen. More precisely, we suggest that our method can be successfully used as a FAC stage combined with a fast and computationally cheap method used as a MAC stage.
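
For intuition, here is a toy sketch (mine, not the authors’ implementation) of recursive similarity propagation over a bipartite document-term graph with tf-idf edge weights:

```python
import numpy as np

# Toy corpus: rows = documents, columns = indexing terms, entries = tf-idf weights.
W = np.array([[0.8, 0.0, 0.3],
              [0.7, 0.2, 0.0],
              [0.0, 0.9, 0.5]])

Wd = W / W.sum(axis=1, keepdims=True)        # doc -> term transitions
Wt = (W / W.sum(axis=0, keepdims=True)).T    # term -> doc transitions

S_doc = np.eye(3)    # doc-doc similarity, initialized to identity
S_term = np.eye(3)   # term-term similarity

# Propagate similarity back and forth across the bipartite graph.
for _ in range(10):
    S_doc = Wd @ S_term @ Wd.T
    S_term = Wt @ S_doc @ Wt.T
    np.fill_diagonal(S_doc, 1.0)    # an object is maximally similar to itself
    np.fill_diagonal(S_term, 1.0)

# A query is just another vertex: score documents through term-term similarity,
# so a document can match a query term it never literally contains.
query = np.array([1.0, 0.0, 0.0])   # query uses term 0 only
print(W @ S_term @ query)
```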

Very interesting article. Perhaps more so because searches of DBLP and CiteSeer show no other publications by the lead author. A singularity that appears in 2008. I haven’t taken the time to look more deeply but commend the paper to your attention.

If you have pointers to later (earlier?) work by the same author, email or comments would be appreciated.

November 14, 2011

Stephen Robertson on Why Recall Matters

Filed under: Information Retrieval,Precision,Recall — Patrick Durusau @ 7:14 pm

Stephen Robertson on Why Recall Matters November 14th, 2011 by Daniel Tunkelang.

Daniel has the slides and an extensive summary of the presentation. Just to give you a taste of what awaits at Daniel’s post:

Stephen started by reminding us of ancient times (i.e., before the web), when at least some IR researchers thought in terms of set retrieval rather than ranked retrieval. He reminded us of the precision and recall “devices” that he’d described in his Salton Award Lecture — an idea he attributed to the late Cranfield pioneer Cyril Cleverdon. He noted that, while set retrieval uses distinct precision and recall devices, ranking conflates both into a decision of where to truncate a ranked result list. He also pointed out an interesting asymmetry in the conventional notion of precision-recall tradeoff: while returning more results can only increase recall, there is no certainty that the additional results will decrease precision. Rather, this decrease is a hypothesis that we associate with systems designed to implement the probability ranking principle, returning results in decreasing order of probability of relevance.

Interested? There’s more where that came from; see the link to Daniel’s post above.
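
To see that asymmetry concretely, a small worked example (my numbers, not Stephen’s):

```python
def precision_recall_at_k(ranked, relevant, k):
    hits = sum(1 for d in ranked[:k] if d in relevant)
    return hits / k, hits / len(relevant)

ranked = ["d1", "d2", "d3", "d4", "d5", "d6"]   # system's ranking
relevant = {"d1", "d3", "d6"}                   # ground truth

for k in range(1, 7):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(f"k={k}  precision={p:.2f}  recall={r:.2f}")

# Recall never decreases as the cutoff k grows; precision dips at k=2,
# recovers at k=3, dips again -- the "tradeoff" is a tendency, not a law.
```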

November 12, 2011

HCIR 2011 keynote

Filed under: HCIR,Information Retrieval — Patrick Durusau @ 8:40 pm

HCIR 2011 keynote by Gene Golovchinsky

From the post:

HCIR 2011 took place almost three weeks ago, but I am just getting caught up after a week at CIKM 2011 and an actual almost-no-internet-access vacation. I wanted to start off my reflections on HCIR with a summary of Gary Marchionini‘s keynote, titled “HCIR: Now the Tricky Part.” Gary coined the term “HCIR” and has been a persuasive advocate of the concepts represented by the term. The talk used three case studies of HCIR projects as a lens to focus the audience’s attention on one of the main challenges of HCIR: how to evaluate the systems we build.

The projects reviewed are themselves worthy of separate treatments, at length.

Gene’s summary makes one wish for video of the keynote. Perhaps I have overlooked it? If so, please post the link.

November 4, 2011

A Taxonomy of Enterprise Search and Discovery

A Taxonomy of Enterprise Search and Discovery by Tony Russell-Rose.

Abstract:

Classic IR (information retrieval) is predicated on the notion of users searching for information in order to satisfy a particular “information need”. However, it is now accepted that much of what we recognize as search behaviour is often not informational per se. Broder (2002) has shown that the need underlying a given web search could in fact be navigational (e.g. to find a particular site) or transactional (e.g. through online shopping, social media, etc.). Similarly, Rose & Levinson (2004) have identified the consumption of online resources as a further common category of search behaviour.

In this paper, we extend this work to the enterprise context, examining the needs and behaviours of individuals across a range of search and discovery scenarios within various types of enterprise. We present an initial taxonomy of “discovery modes”, and discuss some initial implications for the design of more effective search and discovery platforms and tools.

If you are flogging software/interfaces for search/discovery in an enterprise context, you really need to read this paper. In part because of their initial findings, but in part to establish the legitimacy of evaluating how users search before designing an interface for them to search with. They may not be able to articulate all their search behaviors, which means you will have to do some observation to establish what may be the elements that make the difference between a successful interface and one that is less so. (No one wants to be the next Virtual Case File project at the FBI.)

Read the various types of searching as rough guides to what you may find true for your users. When in doubt, trust your observations of and feedback from your users. Otherwise you will have an interface that fits an abstract description in a paper but not your users. I leave it for you to judge which one results in repeat business.

Don’t take that as a criticism of the paper, I think it is one of the best I have read lately. My concern is that the evaluation of user needs/behaviour be an ongoing process and not prematurely fixed or obscured by categories or typologies of how users “ought” to act.

The paper is also available in PDF format.

Information Literacy 2.0

Filed under: Information Retrieval,Research Methods — Patrick Durusau @ 6:08 pm

Information Literacy 2.0 by Meredith Farkas.

From the post:

Critical inquiry in the age of social media

Ideas about information literacy have always adapted to changes in the information environment. The birth of the web made it necessary for librarians to shift more towards teaching search strategies and evaluation of sources. The tool-focused “bibliographic instruction” approach was later replaced by the skill-focused “information literacy” approach. Now, with the growth of Web 2.0 technologies, we need to start shifting towards providing instruction that will enable our patrons to be successful information seekers in the Web 2.0 environment, where the process of evaluation is quite a bit more nuanced.

Critical inquiry skills are among the most important in a world in which the half-life of information is rapidly shrinking. These days, what you know is almost less important than what you can find out. And finding out today requires a set of skills that are very different from what most libraries focus on. In addition to academic sources, a huge wealth of content is being produced by people every day in knowledgebases like Wikipedia, review sites like Trip Advisor, and in blogs. Some of this content is legitimate and valuable—but some of it isn’t.

While I agree with Meredith that evaluation of information is a critical skill, I am less convinced that it is a new one. Research, even pre-Internet, was never about simply finding resources for the purpose of citation. There always was an evaluative aspect with regard to sources.

I was able to take a doctoral seminar in research methods for Old Testament students that taught critical evaluation of resources. I don’t remember the text off hand but we were reading a transcription of a cuneiform text which had a suggested “emendation” (think added characters) for a broken place in the text. The professor asked whether we should accept the “emendation” or not and on what basis we would make that judgement. The article was by a known scholar so of course we argued about the “emendation” but never asked one critical question: What about the original text? The source the scholar was relying upon.

The theology library had a publication with an image of the text that we reviewed for the next class. Even though it was only a photograph, it was clear that you might get one, maybe two characters in the broken space of the text, but there was no way you would have the five or six required by the “emendation.”

We were told to never rely upon quotations, transcriptions of texts, etc., unless there was simply no way to verify the source. Not that many of us do that in practice but that is the ideal. There is even less excuse for relying on quotations and other secondary materials now that so many primary materials are easy to access online and more are coming online every day.

I think the lesson of information literacy 2.0 should be critical evaluation of information, but, as part of that evaluation, to seek out the sources of the information. You would be surprised how many times what an author said is not what they are quoted as saying, when read in the context of the original.

October 27, 2011

HCIR 2011

Filed under: Conferences,Information Retrieval — Patrick Durusau @ 4:46 pm

HCIR 2011 Papers

From the homepage:

The Fifth Workshop on Human-Computer Interaction and Information Retrieval took place all day on Thursday, October 20th, 2011, at Google’s main campus in Mountain View, California. There was a reception on Wednesday evening before the workshop, which attracted about a hundred participants.

By my count fourteen (14) papers and twenty-eight (28) posters.

Quite a gold mine of material and I look forward to a long weekend with them!

Enjoy!

PS: Interesting that papers from prior workshops are only available starting in 2010.

October 18, 2011

The Second International Workshop on Diversity in Document Retrieval (DDR-2012)

Filed under: Conferences,Information Retrieval,Semantic Diversity — Patrick Durusau @ 2:40 pm

The Second International Workshop on Diversity in Document Retrieval (DDR-2012)

Dates:

When Feb 12, 2012 – Feb 12, 2012
Where Seattle WA, USA
Submission Deadline Dec 5, 2011
Notification Due Jan 10, 2012
Final Version Due Jan 17, 2012

From the webpage:

In conjunction with WSDM 2012 – the 5th ACM International Conference on Web Search and Data Mining

Overview
=======
When an ambiguous query is received, a sensible approach is for the information retrieval (IR) system to diversify the results retrieved for this query, in the hope that at least one of the interpretations of the query intent will satisfy the user. Diversity is an increasingly important topic, of interest to both academic researchers (such as participants in the TREC Web and Blog track diversity tasks), as well as to search engines professionals. In this workshop, we solicit submissions both on approaches and models for diversity, the evaluation of diverse search results, and on applications and presentation of diverse search results.
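
For a flavor of what a diversification approach looks like in practice, here is a sketch of the well-known maximal marginal relevance (MMR) heuristic, which trades relevance against redundancy when picking each next result (the toy scores and lambda value are mine):

```python
def mmr(candidates, relevance, sim, k=5, lam=0.7):
    """Greedy maximal marginal relevance re-ranking."""
    selected, pool = [], set(candidates)
    while pool and len(selected) < k:
        def score(d):
            redundancy = max((sim(d, s) for s in selected), default=0.0)
            return lam * relevance[d] - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

# Toy results for the ambiguous query "java": two near-duplicates about
# the language, one about the island.
relevance = {"java-lang-1": 0.9, "java-lang-2": 0.85, "java-island": 0.6}
pairs = {frozenset(["java-lang-1", "java-lang-2"]): 0.95}

def sim(a, b):
    return pairs.get(frozenset([a, b]), 0.1)

print(mmr(list(relevance), relevance, sim, k=2))
# ['java-lang-1', 'java-island'] -- the near-duplicate is pushed down.
```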

Topics:

  • Modelling Diversity:
    • Implicit diversification approaches
    • Explicit diversification approaches
    • Query log mining for diversity
    • Learning-to-rank for diversification
    • Clustering of results for diversification
    • Query intent understanding
    • Query type classification
  • Modelling Risk:
    • Probability ranking principle
    • Risk Minimization frameworks and role diversity
  • Evaluation:
    • Test collections for diversity
    • Evaluating of diverse search results
    • Measuring the ambiguity of queries
    • Measuring query aspects importance
  • Applications:
    • Product & review diversification
    • Opinion and sentiment diversification
    • Diversifying Web crawling policy
    • Graph analysis for diversity
    • Summarisation
    • Legal precedents & patents
    • Diverse recommender systems
    • Diversifying in real-time & news search
    • Diversification in other verticals (image/video search etc.)
    • Presentation of diverse search results

While typing this up, I remembered the “little search engine that could” post (Going Head to Head with Google (and winning)). Are we really condemned to have to manage unforeseeable complexity or is that a poor design choice we made for search engines?

After all, I am not really interested in the entire WWW. At least for this blog I am interested in probably less than 1/10 of 1% of the web. So if I had a search engine for all the CS/Library/Informatics publications, blogs, and subject domains relevant to data/information, I would pretty much be set. A big semantic field, and one that is changing, but nothing like searching everything that is connected (or not, for the DeepWeb) to the WWW.

I don’t have an answer for that, but I think it is an issue that may enable management of semantic diversity. That is, we get to declare the edge of the map. Yes, there are other things beyond the edge, but we aren’t going to include them in this particular map.

October 17, 2011

CENDI: Federal STI Managers Group

Filed under: Government Data,Information Retrieval,Librarian/Expert Searchers,Library — Patrick Durusau @ 6:44 pm

CENDI: Federal STI Managers Group

From the webpage:

Welcome to the CENDI web site

CENDI’s vision is to provide its member federal STI agencies a cooperative enterprise where capabilities are shared and challenges are faced together so that the sum of accomplishments is greater than each individual agency can achieve on its own.

CENDI’s mission is to help improve the productivity of federal science- and technology-based programs through effective scientific, technical, and related information-support systems. In fulfilling its mission, CENDI agencies play an important role in addressing science- and technology-based national priorities and strengthening U.S. competitiveness.

CENDI is an interagency working group of senior scientific and technical information (STI) managers from 14 U.S. federal agencies:

  • Defense Technical Information Center (Department of Defense)
  • Office of Research and Development & Office of Environmental Information (Environmental Protection Agency)
  • Government Printing Office
  • Library of Congress
  • NASA Scientific and Technical Information Program
  • National Agricultural Library (Department of Agriculture)
  • National Archives and Records Administration
  • National Library of Education (Department of Education)
  • National Library of Medicine (Department of Health and Human Services)
  • National Science Foundation
  • National Technical Information Service (Department of Commerce)
  • National Transportation Library (Department of Transportation)
  • Office of Scientific and Technical Information (Department of Energy)
  • USGS/Core Science Systems (Department of Interior)

These programs represent over 97% of the federal research and development budget.

The CENDI web site is hosted by the Defense Technical Information Center (DTIC), and is maintained by the CENDI secretariat. (emphasis added)

Yeah, I thought the 97% figure would catch your attention. 😉 Not sure how it compares with spending on IT and information systems in law enforcement and the spook agencies.

Topic Maps Class Project: Select one of the fourteen members and prepare a report for the class on their primary web interface. What did you like/dislike about the interface? How would you integrate the information you found there with your “home” library site (for students already employed elsewhere) or with the GSLIS site?

BTW, I think you will find that these agencies and their personnel have been thinking deeply about information integration for decades. It is an extremely difficult problem that has no fixed or easy solution.

September 21, 2011

CITRIS – Center for Information Technology Research in the Interest of Society

Filed under: Biomedical,Environment,Funding,Health care,Information Retrieval — Patrick Durusau @ 7:08 pm

CITRIS – Center for Information Technology Research in the Interest of Society

The mission statement:

The Center for Information Technology Research in the Interest of Society (CITRIS) creates information technology solutions for many of our most pressing social, environmental, and health care problems.

CITRIS was created to “shorten the pipeline” between world-class laboratory research and the creation of start-ups, larger companies, and whole industries. CITRIS facilitates partnerships and collaborations among more than 300 faculty members and thousands of students from numerous departments at four University of California campuses (Berkeley, Davis, Merced, and Santa Cruz) with industrial researchers from over 60 corporations. Together the groups are thinking about information technology in ways it’s never been thought of before.

CITRIS works to find solutions to many of the concerns that face all of us today, from monitoring the environment and finding viable, sustainable energy alternatives to simplifying health care delivery and developing secure systems for electronic medical records and remote diagnosis, all of which will ultimately boost economic productivity. CITRIS represents a bold and exciting vision that leverages one of the top university systems in the world with highly successful corporate partners and government resources.

I mentioned CITRIS as an aside (News: Summarization and Visualization) yesterday but then decided it needed more attention.

Its grants are limited to the four University of California campuses mentioned above. Shades of EU funding restrictions. Location has a hand in the selection process.

Still, the projects funded by CITRIS could likely profit from the use of topic maps and, as they say, a rising tide lifts all boats.

September 16, 2011

Information Bridge

Filed under: Information Retrieval,Library — Patrick Durusau @ 6:41 pm

Information Bridge

From the webpage:

The Information Bridge: DOE Scientific and Technical Information provides free public access to over 282,000 full-text documents and bibliographic citations of Department of Energy (DOE) research report literature. Documents are primarily from 1991 forward and were produced by DOE, the DOE contractor community, and/or DOE grantees. Legacy documents are added as they become available in electronic format.

The Information Bridge contains documents and citations in physics, chemistry, materials, biology, environmental sciences, energy technologies, engineering, computer and information science, renewable energy, and other topics of interest related to DOE’s mission.

Another important source of US government funded research on information retrieval.

September 15, 2011

DTIC Online

Filed under: Information Retrieval,Library — Patrick Durusau @ 7:50 pm

DTIC Online

From the webpage:

The Defense Technical Information Center (DTIC®) serves the DoD community as the largest central resource for DoD and government-funded scientific, technical, engineering, and business related information available today.

For more than 65 years DTIC has provided the warfighter and researchers, scientists, engineers, laboratories, and universities timely access to over 2 million publications covering over 250 subject areas. Our mission supports the nation’s warfighter.
….

The United States government, and I suspect other national governments, has sponsored decades’ worth of research on text processing, mining and evaluation. This is one of the major interfaces to US based literature. The Literature-Related Discovery (LRD) material originated from this source.

You will find things such as: “Research in Information Retrieval – Final Report – An investigation of the techniques and concepts of information retrieval,” dated 31 July 1964 as well as current reports.

A real treasure trove of historical and current material on information retrieval. The historical material will help you recognize when you are re-solving a well known problem. And sometimes help you avoid repeating old mistakes.

September 14, 2011

Literature-Related Discovery (LRD)

Filed under: Information Retrieval,Literature-based Discovery — Patrick Durusau @ 7:02 pm

Literature-Related Discovery (LRD) by Ronald N. Kostoff, Joel A. Block, Jeffrey L. Solka, Michael B. Briggs, Robert L. Rushenberg, Jesse A. Stump, Dustin Johnson, Terence J. Lyons, and Jeffrey R. Wyatt.

Short Abstract:

Discovery in science is the generation of novel, interesting, plausible, and intelligible knowledge about the objects of study. Literature-related discovery (LRD) is the linking of two or more literature concepts that have heretofore not been linked (i.e., disjoint), in order to produce novel interesting, plausible, and intelligible knowledge (i.e., potential discovery).

From the longer abstract in the monograph:

LRD offers the promise of large amounts of potential discovery, for the following reasons:

  • the burgeoning technical literature contains a very large pool of technical concepts in myriad technical areas;
  • researchers spend full time trying to cover the literature in their own research fields and are relatively unfamiliar with research in other especially disparate fields of research;
  • the large number of technical concepts (and disparate technical concepts) means that many combinations of especially disparate technical concepts exist;
  • by the laws of probability, some of these combinations will produce novel, interesting, plausible, and intelligible knowledge about the objects of study.

This monograph presents the LRD methodology and voluminous discovery results from five problem areas: four medical (treatments for Parkinson’s Disease (PD), Multiple Sclerosis (MS), Raynaud’s Phenomenon (RP), and Cataracts) and one non-medical (Water Purification (WP)). In particular, the ODS aspect of LRD is addressed, rather than the CDS aspect. In the presentation of potential discovery, a ‘vetting’ process is used that insures both requirements for ODS LBD are met: concepts are linked that have not been linked previously, and novel, interesting, plausible, and intelligible knowledge is produced.

The potential discoveries for the PD, MS, Cataracts, and WP problems are the first we have seen reported by this ODS LBD approach, and the numbers of potential discoveries for the ODS LBD benchmark RP problem are almost two orders of magnitude greater than those reported in the open literature by any other ODS LBD researcher who has addressed this benchmark RP problem. The WP problem is the first non-medical technical topic to have been addressed successfully by ODS LBD.

(ODS = open discovery system)

If you are looking for validation with supporting data for the literature-related discovery method, seek no further. The text plus annexes runs 884 pages.

This is a technique that fits quite well with topic maps.
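
To make the “linking disjoint literatures” idea concrete, here is a minimal sketch of Swanson-style open discovery, using the classic fish oil/Raynaud’s benchmark mentioned above (the toy documents are mine):

```python
def cooccurring(docs, concept):
    """Concepts that co-occur with `concept` somewhere in `docs`."""
    out = set()
    for doc in docs:
        if concept in doc:
            out |= doc - {concept}
    return out

# Each "literature" is a set of documents; each document, a set of concepts.
lit_a = [{"fish oil", "blood viscosity"}, {"fish oil", "platelet aggregation"}]
lit_c = [{"Raynaud's disease", "blood viscosity"},
         {"Raynaud's disease", "vasoconstriction"}]

# A -> B: intermediate concepts linked to the start term in literature A.
b_terms = cooccurring(lit_a, "fish oil")

# B -> C: concepts linked to those intermediates in the disjoint literature C.
c_candidates = set()
for b in b_terms:
    c_candidates |= cooccurring(lit_c, b)

print(c_candidates)   # {"Raynaud's disease"} -- a candidate hidden connection
```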

PS: Yes, I know, this monograph says “literature-related discovery” (5.8 million “hits” in a popular search engine) versus “literature-based discovery” (6.3 million “hits” in the same search engine), another name for the same technique. Sigh, even semantic integration is afflicted with semantic integration woes.

September 11, 2011

New Challenges in Distributed Information Filtering and Retrieval

New Challenges in Distributed Information Filtering and Retrieval

Proceedings of the 5th International Workshop on New Challenges in Distributed Information Filtering and Retrieval
Palermo, Italy, September 17, 2011.

Edited by:

Cristian Lai – CRS4, Loc. Piscina Manna, Building 1 – 09010 Pula (CA), Italy

Giovanni Semeraro – Dept. of Computer Science, University of Bari, Aldo Moro, Via E. Orabona, 4, 70125 Bari, Italy

Eloisa Vargiu – Dept. of Electrical and Electronic Engineering, University of Cagliari, Piazza d’Armi, 09123 Cagliari, Italy

Table of Contents:

  1. Experimenting Text Summarization on Multimodal Aggregation
    Giuliano Armano, Alessandro Giuliani, Alberto Messina, Maurizio Montagnuolo, Eloisa Vargiu
  2. From Tags to Emotions: Ontology-driven Sentimental Analysis in the Social Semantic Web
    Matteo Baldoni, Cristina Baroglio, Viviana Patti, Paolo Rena
  3. A Multi-Agent Decision Support System for Dynamic Supply Chain Organization
    Luca Greco, Liliana Lo Presti, Agnese Augello, Giuseppe Lo Re, Marco La Cascia, Salvatore Gaglio
  4. A Formalism for Temporal Annotation and Reasoning of Complex Events in Natural Language
    Francesco Mele, Antonio Sorgente
  5. Interaction Mining: the new Frontier of Call Center Analytics
    Vincenzo Pallotta, Rodolfo Delmonte, Lammert Vrieling, David Walker
  6. Context-Aware Recommender Systems: A Comparison Of Three Approaches
    Umberto Panniello, Michele Gorgoglione
  7. A Multi-Agent System for Information Semantic Sharing
    Agostino Poggi, Michele Tomaiuolo
  8. Temporal characterization of the requests to Wikipedia
    Antonio J. Reinoso, Jesus M. Gonzalez-Barahona, Rocio Muñoz-Mansilla, Israel Herraiz
  9. From Logical Forms to SPARQL Query with GETARUN
    Rocco Tripodi, Rodolfo Delmonte
  10. ImageHunter: a Novel Tool for Relevance Feedback in Content Based Image Retrieval
    Roberto Tronci, Gabriele Murgia, Maurizio Pili, Luca Piras, Giorgio Giacinto

September 6, 2011

First Look – Oracle Data Mining Update

Filed under: Data Mining,Database,Information Retrieval,SQL — Patrick Durusau @ 7:18 pm

First Look – Oracle Data Mining Update by James Taylor.

From the post:

I got an update from Oracle on Oracle Data Mining (ODM) recently. ODM is an in-database data mining and predictive analytics engine that allows you to build and use advanced predictive analytic models on data that can be accessed through your Oracle data infrastructure. I blogged about ODM extensively last year in this First Look – Oracle Data Mining and since then they have released ODM 11.2.

The fundamental architecture has not changed, of course. ODM remains a “database-out” solution surfaced through SQL and PL-SQL APIs and executing in the database. It has the 12 algorithms and 50+ statistical functions I discussed before and model building and scoring are both done in-database. Oracle Text functions are integrated to allow text mining algorithms to take advantage of them. Additionally, because ODM mines star schema data it can handle an unlimited number of input attributes, transactional data and unstructured data such as CLOBs, tables or views.

This release takes the preview GUI I discussed last time and officially releases it. This new GUI is an extension to SQL Developer 3.0 (which is available for free and downloaded by millions of SQL/database people). The “Classic” interface (wizard-based access to the APIs) is still available but the new interface is much more in line with the state of the art as far as analytic tools go.

BTW, the correct link is: First Look – Oracle Data Mining. (Taylor’s post last year on Oracle Data Mining.)

For all the buzz about NoSQL, topic map mavens should be aware of the near universal footprint of SQL and prepare accordingly.

September 1, 2011

Spatio Temporal data Integration and Retrieval

Filed under: Conferences,Data Integration,Information Retrieval,Spatial Index — Patrick Durusau @ 6:06 pm

STIR 2012 : ICDE 2012 Workshop on Spatio Temporal data Integration and Retrieval

Dates:

When Apr 1, 2012 – Apr 1, 2012
Where Washington DC, USA
Submission Deadline Oct 21, 2011

From the notice:

International Workshop on Spatio Temporal data Integration and Retrieval (STIR2012) in conjunction with ICDE 2012

April 1, 2012, Washington DC, USA

http://research.ihost.com/stir12/index.html

As the world’s population increases and it puts increasing demands on the planet’s limited resources due to shifting life-styles, we not only need to monitor how we consume resources but also optimize resource usage. Some examples of the planet’s limited resources are water, energy, land, food and air. Today, significant challenges exist for reducing usage of these resources, while maintaining quality of life. The challenges range from understanding regionally varied impacts of global environmental change, through tracking diffusion of avian flu and responding to natural disasters, to adapting business practice to dynamically changing resources, markets and geopolitical situations. For these and many other challenges reference to location – and time – is the glue that connects disparate data sources. Furthermore, most of the systems and solutions that will be built to solve the above challenges are going to be heavily dependent on structured data (generated by sensors and sensor based applications) which will be streaming in real-time, come in large volumes and will have spatial and temporal aspects to them.

This workshop is focused on making the research in information integration and retrieval more relevant to the challenges in systems with significant spatial and temporal components.

Sounds like they are playing our song!

August 17, 2011

Recent Advances in Literature Based Discovery

Recent Advances in Literature Based Discovery

Abstract:

Literature Based Discovery (LBD) is a process that searches for hidden and important connections among information embedded in published literature. Employing techniques from Information Retrieval and Natural Language Processing, LBD has potential for widespread application yet is currently implemented primarily in the medical domain. This article examines several published LBD systems, comparing their descriptions of domain and input data, techniques to locate important concepts from text, models of discovery, experimental results, visualizations, and evaluation of the results. Since there is no comprehensive “gold standard,” or consistent formal evaluation methodology for LBD systems, the development and usage of effective metrics for such systems is also discussed, providing several options. Also, since LBD is currently often time-intensive, requiring human input at one or more points, a fully-automated system will enhance the efficiency of the process. Therefore, this article considers methods for automated systems based on data mining.

Not “recent” now, since the paper dates from 2006, but it is a good overview of Literature Based Discovery (LBD) at the time.

July 27, 2011

Open Source Search Engines (comparison)

Filed under: Information Retrieval,Search Algorithms,Search Engines,Searching — Patrick Durusau @ 7:02 pm

Open Source Search Engines (comparison)

A comparison of ten (10) open source search engines.

Appears as an appendix to Modern Information Retrieval, second edition.

I probably don’t need yet another IR book.

But the first edition was well written; the second edition website includes teaching slides for all chapters and a nice set of pointers to additional resources, with problems and solutions “under construction” as of 27 July 2011. All of which are things I like to encourage in authors.

OK, I talked myself into it, I am ordering a copy today. 😉

More comments to follow.

May 20, 2011

SIREn: Efficient semi-structured Information Retrieval for Lucene

Filed under: Information Retrieval,Lucene,RDF — Patrick Durusau @ 4:06 pm

SIREn: Efficient semi-structured Information Retrieval for Lucene

From the announcement:

Efficient, large scale handling of semi-structured data (including RDF) is increasingly an important issue to many web and enterprise information reuse scenarios.

Querying graph structured data (RDF) is commonly achieved using specific solutions, called triplestores, typically based on DBMS backends. In Sindice we however needed something much more scalable than DBMS and with the desirable features of the typical Web Search engines: top-k query processing, real time updates, full text search, distributed indexes over shards, etc.

While Lucene has long offered these capabilities, its native capabilities are not intended for large semi-structured document collections (or documents with very different schemas). For this reason we developed SIREn – Semantic Information Retrieval Engine – a Lucene plugin to overcome these shortcomings and efficiently index and query RDF, as well as any textual document with an arbitrary amount of metadata fields.

Given its general applicability, we are delighted to release SIREn under the Apache 2.0 open source license. We hope businesses will find SIREn useful in implementing solutions upon the Web of Data.

You can start by looking at the features, review the performance benchmarks, learn more by reading the short tutorial and then download and try SIREn by yourself.

This looks very cool!

Its tuple processing capabilities in particular!

May 12, 2011

KFTF – Keeping Found Things Found™

Filed under: Information Retrieval,Information Reuse — Patrick Durusau @ 7:58 am

KFTF – Keeping Found Things Found™

From the website:

Much of our lives is spent in the finding of things. Find a house or a car that’s just right for you. Find your dream job. Or your dream mate. But, once found, what then?

As with other things, so it is with our information. Finding is just the first step. How do we keep this information so that it’s there later when we need it? How do we organize it in ways that make sense for us in the lives we want to lead? Information found does us little good if we misplace it or forget to use it. And just as we must maintain a house or a car, we need to maintain our information – backing it up, archiving or deleting old information, updating information that is no longer accurate. In our digital world, advances in technologies of search and storage have far outpaced balancing advances in tools and techniques that help us to manage and make sense of our information. This project combines fieldwork with selective prototyping in an effort to understand what is needed for us to “keep found things found.”

There is also software, called Planz™, that has been open sourced by the project.

Take control of the information in your life through one consolidated interface. Plan by typing your thoughts freehand. Link your thoughts to files, Web pages, and email messages. Organize everything into a single, integrated document that helps you manage all the projects you want to get done. Planz™ is an overlay to your file system so your information stays under your control.

Is anyone familiar with this software? Thanks!

April 30, 2011

Bridging the Gulf:…

Filed under: Conferences,Digital Library,Information Retrieval — Patrick Durusau @ 10:16 am

Bridging the Gulf: Communication and Information in Society, Technology, and Work

October 9-13, 2011, New Orleans, Louisiana

From the website:

The ASIST Annual Meeting is the main venue for disseminating research centered on advances in the information sciences and related applications of information technology.

ASIST 2011 builds on the success of the 2010 conference structure and will have the integrated program that is an ASIST strength. This will be achieved using the six reviewing tracks pioneered in 2010, each with its own committee of respected reviewers to ensure that the conference meets your high expectations for standards and quality. These reviewers, experts in their fields, will assist with a rigorous peer-review process.

Important Dates:

  1. Papers, Panels, Workshops & Tutorials
    • Deadline for submissions: May 31
    • Notification to authors: June 28
    • Final copy: July 15
  2. Posters, Demos & Videos:
    • Deadline for submissions: July 1
    • Notification to authors: July 20
    • Final copy: July 27

One of the premier technical conferences for librarians and information professionals in the United States.

The track listings are:

  • Track 1 – Information Behaviour
  • Track 2 – Knowledge Organization
  • Track 3 – Interactive Information & Design
  • Track 4 – Information and Knowledge Management
  • Track 5 – Information Use
  • Track 6 – Economic, Social, and Political Issues

A number of opportunities for topic map based presentations.

The conference being located in New Orleans is yet another reason to attend! The food, music, and street life have to be experienced to be believed. No description would be adequate.

April 11, 2011

A Data Parallel toolkit for Information Retrieval

Filed under: Data Mining,Information Retrieval,Search Algorithms,Searching — Patrick Durusau @ 5:53 am

A Data Parallel toolkit for Information Retrieval

From the website:

Many modern information retrieval data analyses need to operate on web-scale data collections. These collections are sufficiently large as to make single-computer implementations impractical, apparently necessitating custom distributed implementations.

Instead, we have implemented a collection of Information Retrieval analyses atop DryadLINQ, a research LINQ provider layer over Dryad, a reliable and scalable computational middleware. Our implementations are relatively simple data parallel adaptations of traditional algorithms, and, due entirely to the scalability of Dryad and DryadLINQ, scale up to very large data sets. The current version of the toolkit, available for download below, has been successfully tested against the ClueWeb corpus.

Are you using large data sets in the construction of your topic maps?

Where large is taken to mean data sets in the range of one billion documents. (http://boston.lti.cs.cmu.edu/Data/clueweb09/)

The authors of this work are attempting to extend access to large data sets to a larger audience.

Did they succeed?

Is their work useful for smaller data sets?

What tools would you add to assist more specifically with topic map construction?
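
For readers who want the flavor without Dryad: the same map/merge shape in miniature, with Python’s multiprocessing standing in for the middleware (a sketch, nothing like web scale):

```python
from collections import Counter
from multiprocessing import Pool

def term_counts(doc):
    """Map step: per-document term frequencies."""
    return Counter(doc.lower().split())

def merge(partials):
    """Reduce step: merge the partial counts."""
    total = Counter()
    for c in partials:
        total.update(c)
    return total

if __name__ == "__main__":
    docs = ["information retrieval at web scale",
            "scale out with data parallel retrieval"]
    with Pool() as pool:                      # work spread across processes
        partials = pool.map(term_counts, docs)
    print(merge(partials))
```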

