Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 24, 2010

Indexing Nature: Carl Linnaeus (1707-1778) and His Fact-Gathering Strategies

Filed under: Indexing,Information Retrieval,Interface Research/Design,Ontology — Patrick Durusau @ 9:52 am

Indexing Nature: Carl Linnaeus (1707-1778) and His Fact-Gathering Strategies Authors: Staffan Müller-Wille & Sara Scharf (Working Papers on The Nature of Evidence: How Well Do ‘Facts’ Travel? No. 36/08)

Interesting article that traces the strategies used by Linnaeus when confronted with the “first bio-information crisis” as the authors term it.

Questions:

  1. In what ways do ontologies resemble the bound library catalogs of the early 18th century?
  2. Do computers make ontologies any less like those bound library catalogs?
  3. Short report (3-5 pages, with citations) on transition of libraries from bound catalogs to index cards.
  4. Linnaeus’s colleagues weren’t idle. What other strategies, successful or otherwise, were in use? (project)

October 23, 2010

CASPAR (Cultural, Artistic, and Scientific Knowledge for Preservation, Access and Retrieval)

Filed under: Cataloging,Digital Library,Information Retrieval,Preservation — Patrick Durusau @ 7:58 am

CASPAR (Cultural, Artistic, and Scientific Knowledge for Preservation, Access and Retrieval).

From the website:

CASPAR methodological and technological solution:

  • is compliant to the OAIS Reference Model – the main standard of reference in digital preservation
  • is technology-neutral: the preservation environment could be implemented using any kind of emerging technology
  • adopts a distributed, asynchronous, loosely coupled architecture and each key component is self-contained and portable: it may be deployed without dependencies on different platform and framework
  • is domain independent: it could be applied with low additional effort to multiple domains/contexts.
  • preserves knowledge and intelligibility, not just the “bits”
  • guarantees the integrity and identity of the information preserved as well as the protection of digital rights

FYI: OAIS Reference Model

As a librarian, you will be confronted with claims similar to these in vendor literature, grant applications and other marketing materials.

Questions:

  1. Pick one of these claims. What documentation/software produced by the project would you review to evaluate the claim you have chosen?
  2. What other materials do you think would be relevant to your review?
  3. Perform the actual review (10 – 15 pages, with citations, project)

October 21, 2010

Research: What is the Interaction Cost in Information Visualization?

Research: What is the Interaction Cost in Information Visualization? by Enrico Bertini came to us via Sam Hunting.

It is a summary of Heidi Lam’s A Framework of Interaction Costs in Information Visualization, but both will repay the time spent reading and studying them.

However intuitive it may seem to its designers, no “semantic” interface is any better than it is perceived to be by its users.

Questions:

  1. After reading Lam’s article, evaluate two interfaces, one familiar to you and one you encounter as a first-time user.
  2. Using Lam’s framework, how do you evaluate the interfaces?
  3. What aspects of those interfaces would you most like to test with users?
  4. Design a test for two aspects of one of your interfaces. (project*)
  5. Care to update Lam’s listing of papers on interactivity issues? (project)

* Warning: Test design is partially an art, partially a science and partially stumbling around in semantic darkness. Just so you are aware that done properly, this project will require extra work.

October 19, 2010

The effect of audience design on labeling, organizing, and finding shared files (unexpected result – see below)

The effect of audience design on labeling, organizing, and finding shared files Authors: Emilee Rader Keywords: audience design, common ground, file labeling and organizing, group information management

Abstract:

In an online experiment, I apply theory from psychology and communications to find out whether group information management tasks are governed by the same communication processes as conversation. This paper describes results that replicate previous research, and expand our knowledge about audience design and packaging for future reuse when communication is mediated by a co-constructed artifact like a file-and-folder hierarchy. Results indicate that it is easier for information consumers to search for files in hierarchies created by information producers who imagine their intended audience to be someone similar to them, independent of whether the producer and consumer actually share common ground. This research helps us better understand packaging choices made by information producers, and the direct implications of those choices for other users of group information systems.

Examples from the paper:

  • A scientist needs to locate procedures and results from an experiment conducted by another researcher in his lab.
  • A student learning the open-source, command-line statistical computing environment R needs to find out how to calculate the mode of her dataset.
  • A new member of a design team needs to review requirements analysis activities that took place before he joined the team.
  • An intelligence analyst needs to consult information collected by other agencies to assess a potential threat.

Do any of those sound familiar?

Unexpected result:

In general, Consumers performed best (fewest clicks to find the target file) when the Producer created a hierarchy for an Imagined Audience from the same community, regardless of the Consumer’s actual community. Consumers had the most difficulty when searching in hierarchies created by a Producer for a dissimilar Imagined Audience.

In other words, imagining an audience is a bad strategy. Create a hierarchy that works for you. (And with a topic map you could let others create hierarchies that work for them.)

(Apologies for the length of this post but unexpected interface results merit the space.)

October 18, 2010

The X Factor of Information Systems

Filed under: Information Retrieval,Natural Language Processing,Semantics — Patrick Durusau @ 5:02 am

David Segal’s “The X Factor of Economics,” NYT, Sunday, October 17, 2010, Week in Review, concludes that standard economic models don’t account for one critical factor.

Economics can be dressed up in mathematical garb, with after-the-fact precision, but the X factor deprives it of before-the-fact precision. Precision? That seems like an inadequate term for a profession that can’t agree on what has happened, or what is in fact happening, much less what is about to happen.

But in any event, the X factor? That would be us, people.

People who gleefully buy, save, work, rest and generally live our lives without any regard for theories of economic behavior.

The same people who live without any regard for theories of semantics.

People are the X factor in information systems.

Just a caution to take into account when evaluating information, metadata or semantic systems.

October 13, 2010

Exploiting knowledge-in-the-head and knowledge-in-the-social-web: effects of domain expertise on exploratory search in individual and social search environments

Exploiting knowledge-in-the-head and knowledge-in-the-social-web: effects of domain expertise on exploratory search in individual and social search environments Authors: Ruogu Kang, Wai-Tat Fu, Thomas George Kannampallil Keywords: domain expertise, exploratory search, search behavior

Abstract:

Our study compared how experts and novices performed exploratory search using a traditional search engine and a social tagging system. As expected, results showed that social tagging systems could facilitate exploratory search for both experts and novices. We, however, also found that experts were better at interpreting the social tags and generating search keywords, which made them better at finding information in both interfaces. Specifically, experts found more general information than novices by better interpretation of social tags in the tagging system; and experts also found more domain-specific information by generating more of their own keywords. We found a dynamic interaction between knowledge-in-the-head and knowledge-in-the-social-web that although information seekers are more and more reliant on information from the social Web, domain expertise is still important in guiding them to find and evaluate the information. Implications on the design of social search systems that facilitate exploratory search are also discussed.

Every librarian should post the first page of this article on their office door, and every library school should post it on the local bulletin board.

Think about it. Expert searchers (read librarians) find better information than novices and can serve as guides to better information.

More research is needed on how to bridge that gap in search interfaces.

In libraries I think it is now called a “reference interview.”

(Please email, tweet, etc. this post to your librarian friends.)

Effects of popularity and quality on the usage of query suggestions during information search

Filed under: Information Retrieval,Search Interface,Searching — Patrick Durusau @ 4:44 am

Effects of popularity and quality on the usage of query suggestions during information search Authors: Diane Kelly, Amber Cushing, Maureen Dostert, Xi Niu, Karl Gyllstrom Keywords: query popularity, query quality, query recommendation, query suggestion, search behavior, social search, usage

Abstract:

Many search systems provide users with recommended queries during online information seeking. Although usage statistics are often used to recommend queries, this information is usually not displayed to the user. In this study, we investigate how the presentation of this information impacts use of query suggestions. Twenty-three subjects used an experimental search system to find documents about four topics. Eight query suggestions were provided for each topic: four were high quality queries and four were low quality queries. Fake usage information indicating how many other people used the queries was also provided. For half the queries this information was high and for the other half this information was low. Results showed that subjects could distinguish between high and low quality queries and were not influenced by the usage information. Qualitative data revealed that subjects felt favorable about the suggestions, but the usage information was less important for the search task used in this study.

Another small-sample study, but it raises questions that successful interfaces will have to consider.

Where “successful” means used effectively and seen by users as effective, the latter being the more important measure.

October 11, 2010

Analyzing the Role of Dimension Arrangement for Data Visualization in Radviz

Filed under: High Dimensionality,Information Retrieval,Visualization — Patrick Durusau @ 6:25 am

Analyzing the Role of Dimension Arrangement for Data Visualization in Radviz Authors: Luigi Di Caro, Vanessa Frias-Martinez, Enrique Frias-Martinez Keywords: Radial Coordinate Visualization, Radviz, Dimension arrangement

Abstract:

The Radial Coordinate Visualization (Radviz) technique has been widely used to effectively evaluate the existence of patterns in highly dimensional data sets. A crucial aspect of this technique lies in the arrangement of the dimensions, which determines the quality of the posterior visualization. Dimension arrangement (DA) has been shown to be an NP-problem and different heuristics have been proposed to solve it using optimization techniques. However, very little work has focused on understanding the relation between the arrangement of the dimensions and the quality of the visualization. In this paper we first present two variations of the DA problem: (1) a Radviz independent approach and (2) a Radviz dependent approach. We then describe the use of the Davies-Bouldin index to automatically evaluate the quality of a visualization i.e., its visual usefulness. Our empirical evaluation is extensive and uses both real and synthetic data sets in order to evaluate our proposed methods and to fully understand the impact that parameters such as number of samples, dimensions, or cluster separability have in the relation between the optimization algorithm and the visualization tool.

Interesting both for exploration of data sets for constructing topic maps and quite possibly for finding the “best” visualizations for a topic map deliverable.
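For readers new to Radviz, the projection itself is simple: each dimension gets an anchor on the unit circle, and a record is drawn at the average of the anchors weighted by its feature values. A minimal sketch under those standard definitions (function name and evenly spaced anchors are my own illustrative choices):

```python
import math

def radviz_point(values):
    # anchors evenly spaced on the unit circle, one per dimension;
    # the record is drawn at the average of the anchors, weighted
    # by its (non-negative) feature values
    n = len(values)
    anchors = [2 * math.pi * i / n for i in range(n)]
    total = sum(values)
    if total == 0:
        return (0.0, 0.0)
    x = sum(v * math.cos(a) for v, a in zip(values, anchors)) / total
    y = sum(v * math.sin(a) for v, a in zip(values, anchors)) / total
    return (x, y)
```

A record with equal values in every dimension lands at the center, which is why the arrangement of the anchors, the DA problem the paper studies, decides whether clusters stay visually separated.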

October 1, 2010

Tell me more, not just “more of the same”

Tell me more, not just “more of the same” Authors: Francisco Iacobelli, Larry Birnbaum, Kristian J. Hammond Keywords: dimensions of similarity, information retrieval, new information detection

Abstract:

The Web makes it possible for news readers to learn more about virtually any story that interests them. Media outlets and search engines typically augment their information with links to similar stories. It is up to the user to determine what new information is added by them, if any. In this paper we present Tell Me More, a system that performs this task automatically: given a seed news story, it mines the web for similar stories reported by different sources and selects snippets of text from those stories which offer new information beyond the seed story. New content may be classified as supplying: additional quotes, additional actors, additional figures and additional information depending on the criteria used to select it. In this paper we describe how the system identifies new and informative content with respect to a news story. We also show that providing an explicit categorization of new information is more useful than a binary classification (new/not-new). Lastly, we show encouraging results from a preliminary evaluation of the system that validates our approach and encourages further study.

If you are interested in the automatic extraction, classification and delivery of information, this article is for you.

I think there are (at least) two interesting ways for “Tell Me More” to develop:

First, persisting entity recognition with other data (such as story, author, date, etc.) in the form of associations (with appropriate roles, etc.).

Second, and perhaps more importantly, to enable users to add/correct information presented as part of a mapping of information about particular entities.

September 28, 2010

International Workshop on Similarity Search and Applications (SISAP)

Filed under: Indexing,Information Retrieval,Search Engines,Searching,Software — Patrick Durusau @ 4:47 pm

International Workshop on Similarity Search and Applications (SISAP)

Website:

The International Workshop on Similarity Search and Applications (SISAP) is a conference devoted to similarity searching, with emphasis on metric space searching. It aims to fill in the gap left by the various scientific venues devoted to similarity searching in spaces with coordinates, by providing a common forum for theoreticians and practitioners around the problem of similarity searching in general spaces (metric and non-metric) or using distance-based (as opposed to coordinate-based) techniques in general.

SISAP aims to become an ideal forum to exchange real-world, challenging and exciting examples of applications, new indexing techniques, common testbeds and benchmarks, source code, and up-to-date literature through a Web page serving the similarity searching community. Authors are expected to use the testbeds and code from the SISAP Web site for comparing new applications, databases, indexes and algorithms.

Proceedings from prior years, source code, sample data, a real gem of a site.

September 20, 2010

Astroinformatics 2010

Filed under: Astroinformatics,Information Retrieval,Searching,Semantics — Patrick Durusau @ 6:17 pm

Astroinformatics 2010.

Conference on astronomical data, its processing and semantics.

The astronomical community has made data interchangeable, in terabyte and soon to be petabyte quantities.

Questions:

Have they solved the problem of interchangeable semantics?

Or have they reduced semantics to the point interchange becomes easier/possible?

Do semantic interchange problems/issues/opportunities reappear when more semantics are imposed?

What about preservation of semantics?

September 18, 2010

TREC 2010/2011

Filed under: Conferences,Heterogeneous Data,Information Retrieval,Searching,Software — Patrick Durusau @ 7:34 am

It’s too late to become a participant in TREC 2010 but everyone interested in building topic maps should be aware of this conference.

The seven tracks for this year are blog, chemical IR, entity, legal, relevance feedback, “session,” and web.

Prior TREC conferences are online, along with a host of other materials, at the Text REtrieval Conference (TREC) site.

The 2011 cycle isn’t that far away so consider being a participant next year.

September 13, 2010

A million answers to twenty questions: choosing by checklist

Filed under: Information Retrieval,Interface Research/Design,Topic Map Software — Patrick Durusau @ 6:09 pm

A million answers to twenty questions: choosing by checklist Authors: Michael Mandler, Paola Manzini, Marco Mariotti Keywords: Bounded rationality, utility maximization, choice function, lexicographic utility.

Mentions:

Checklist users can in effect perform a binary search, which makes the number of preference discriminations they make an exponential function of the number of properties that they use. As a result, an agent who makes a 1,000,000 preference discriminations needs a checklist that is just 20 properties long.
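The quoted arithmetic is easy to check: each yes/no property halves the candidate set, so a checklist of 20 properties distinguishes 2^20 = 1,048,576 items. A minimal sketch (function names are mine, purely illustrative):

```python
def properties_needed(discriminations):
    # smallest checklist length n such that 2**n >= discriminations
    n = 0
    while 2 ** n < discriminations:
        n += 1
    return n

def checklist_answers(item_index, n):
    # the n yes/no answers (bits) that single out one item among 2**n
    return [(item_index >> p) & 1 for p in range(n)]
```

So an agent making 1,000,000 preference discriminations needs `properties_needed(1_000_000) == 20` properties, exactly the figure in the paper.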

Substitute “identity” for “preference.”

  1. How many discriminations are necessary to identify a subject?
  2. Does the order of discrimination matter?
  3. What properties discriminate more than others?
  4. Do the answers to 1-3 vary by domain? If so, in what way?

These are empirical questions; unlike ontologies, classifications, and cataloging, the answers come from users.

September 5, 2010

“Linguistic terms do not hold exact meaning….”

Filed under: Data Integration,Fuzzy Sets,Information Retrieval,Subject Identity — Patrick Durusau @ 10:36 am

In some background research I ran across:

One of the most important applications of fuzzy set theory is the concept of linguistic variables. A linguistic variable is a variable whose values are not numbers, but words or sentences in a natural or artificial language. The value of a linguistic variable is defined as an element of its term set, a predefined set of appropriate linguistic terms. Linguistic terms are essentially subjective categories for a linguistic variable.

Linguistic terms do not hold exact meaning, however, and may be understood differently by different people. The boundaries of a given term are rather subjective, and may also depend on the situation. Linguistic terms therefore cannot be expressed by ordinary set theory; rather, each linguistic term is associated with a fuzzy set. (“Soft sets and soft groups,” by Hacı Aktaş and Naim Çağman, Information Sciences, Volume 177, Issue 13, 1 July 2007, Pages 2726-2735.)

Fuzzy sets are yet another useful approach that has recognized linguistic uncertainty as an issue and developed mechanisms to address it.
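As a concrete illustration of the quoted idea, a linguistic term like “warm” can be modeled as a fuzzy set with a triangular membership function. (The variable, term, and breakpoints below are my own illustrative choices, not from the paper.)

```python
def triangular(x, lo, peak, hi):
    # membership rises linearly from 0 at lo to 1.0 at peak,
    # then falls back to 0 at hi
    if x <= lo or x >= hi:
        return 0.0
    if x <= peak:
        return (x - lo) / (peak - lo)
    return (hi - x) / (hi - peak)

# "warm" as a fuzzy subset of the temperature axis (degrees C)
def warm(t):
    return triangular(t, 15.0, 22.0, 30.0)
```

Two people can share the word “warm” while drawing its boundaries differently; in fuzzy terms they attach different membership functions to the same linguistic term, which is the subject identity question in another notation.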

What is “linguistic uncertainty” if it isn’t a question of “subject identity?”

Fuzzy sets have developed another way to answer questions about subject identity.

As topic maps mature I want to see the development of equivalences between approaches to subject identity.

Imagine a topic map system consisting of a medical scanning system that is identifying “subjects” in cultures using rough sets, with equivalences to “subjects” identified in published literature using fuzzy sets, that is refined by “subjects” from user contributions and interactions using PSIs (or other mechanisms, past, present or future).

September 3, 2010

Making Wikileaks Effective

Filed under: Information Retrieval,Marketing,Subject Identity,Topic Maps — Patrick Durusau @ 7:57 pm

Wikileaks has captured the headlines with the release of Afghan War Diary, 2004-2010.

I haven’t looked at the documents but document collections present the same issues for effective use.

First, document semantics vary depending upon whether they are being read by their intended audience, another military command or other audience. For example, locations may be identified by unfamiliar terms.

Second, and nearly as important, what if one analyst bridges the different semantics and identifies a location? How do they map it to their semantic and communicate that fact to others?

You could pass around a sticky note. Put it on a blackboard. Write it up in a multi-page report.

Topic maps are an effective means to navigate data and multiple interpretations of it, not to mention integrating other data you may have on hand.

Topic maps don’t constrain in advance what subjects you can identify or the basis on which you identify them, and they let you quickly share discoveries with others.

Wikileaks can be annoying. Topic maps can make Wikileaks effective. There’s a difference.

August 25, 2010

Murray – Presentation History

Filed under: Graphs,Information Retrieval,Subject Identity — Patrick Durusau @ 3:36 pm

Ronald Murray forwarded a Presentation History that clarifies some of the issues raised in Ethnomathematics Doodles.

Please use “Presentation History” instead of “Ethnomathematics Doodles” on its own.

August 24, 2010

Ethnomathematics Doodles

Filed under: Graphs,Information Retrieval,Subject Identity — Patrick Durusau @ 7:29 pm

Ethnomathematics Doodles came by way of Ronald Murray, whose presentation, Moby-Dick to Mashups, was mentioned here not all that long ago.

BTW, Ron has placed the slides from that presentation up on Slideshare.net and is seeking comments on them.

August 13, 2010

Prescriptive vs. Adaptive Information Retrieval?

Filed under: Concept Hierarchies,Indexing,Information Retrieval,Thesaurus — Patrick Durusau @ 8:47 pm

Gary W. Strong and M. Carl Drott, contend in A Thesaurus for End-User Indexing and Retrieval, Information Processing & Management, Vol. 22, No. 6, pp. 487-492, 1986, that:

A low-cost, practical information retrieval system, if it were to be designed, would require a thesaurus, but one in which end-users would be able to browse research topics by means of an organization that is concept-based rather than term-based as is the typical thesaurus.

…. (while elsewhere)

It is our hypothesis that, when the thesaurus can be envisioned by users as a simple, yet meaningful, organization of concepts, the entire information system is much more likely to be useable in an efficient manner by novice users. (emphasis added)

It puzzles me that experts are building a system of concepts for novices to use. Do you suspect experts have different views of the domains in question than novices? And approach their search for information with different assumptions?

Any concept system designed by an expert is a prescriptive information retrieval system. It represents their view of the domain and not that of a novice. Or rather it represents how the expert thinks a novice should navigate the field.

While the expert’s view may be useful for some purposes, such as socializing a novice into a particular view of the domain, it may be more useful for novices to use a novice’s view of the domain. To build that we would need to turn to novices in a domain. Perhaps through the use of adaptive information retrieval, IR that adapts to its user, rather than the other way around.

Adaptive information retrieval systems, I like that, ones that grow to be more like their users and less like their builders with every use.

August 8, 2010

Gephi – The Open Graph Viz Platform

Filed under: Gephi,Graphs,Information Retrieval,Interface Research/Design,Maps,Software — Patrick Durusau @ 3:51 pm

Gephi is an “interactive visualization and exploration platform” for graphs.

From the site:

  • Exploratory Data Analysis: intuition-oriented analysis by networks manipulations in real time.
  • Link Analysis: revealing the underlying structures of associations between objects, in particular in scale-free networks.
  • Social Network Analysis: easy creation of social data connectors to map community organizations and small-world networks.
  • Biological Network analysis: representing patterns of biological data.
  • Poster creation: scientific work promotion with hi-quality printable maps.

I find the notion of interaction with a graph, or in our case a topic map represented as a graph quite fascinating.

Imagine selecting or even adding properties as the basis for merging and then examining those results in an interactive rather than batch process.
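The merging step in that paragraph can be sketched independently of any tool: pick a property, group topics sharing a value for it, and merge each group. (The data model below is a toy of my own, not Gephi’s API or any topic map engine’s.)

```python
def merge_on(topics, prop):
    # group topic dicts by the chosen property's value and union
    # their names, the basic move behind property-based merging
    merged = {}
    for t in topics:
        key = t.get(prop)
        slot = merged.setdefault(key, {prop: key, "names": set()})
        slot["names"] |= set(t.get("names", []))
    return list(merged.values())
```

Re-running `merge_on` with a different property interactively, and watching the graph contract or expand, is exactly the batch-versus-interactive contrast above.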

Can “drag-n-drop” topic map authoring be that far away?

July 25, 2010

Dependency and Subject Identity

Filed under: Information Retrieval,Subject Identity,Topic Maps — Patrick Durusau @ 5:44 am

Dependence language model for information retrieval by Jianfeng Gao , Jian-yun Nie , Guangyuan Wu , and Guihong Cao, is a good introduction to dependency analysis in information retrieval.

The theory is that terms (words) in a document depend upon other words and that those dependencies can be used to improve the results of information retrieval efforts.
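A crude stand-in for such dependencies is windowed co-occurrence: count term pairs that appear near each other. (The window size and names below are my own illustrative choices; the paper models dependencies statistically, not this simply.)

```python
from collections import Counter

def cooccurrences(tokens, window=3):
    # count unordered term pairs that appear within `window` tokens
    # of each other, a crude proxy for term dependencies
    counts = Counter()
    for i, t in enumerate(tokens):
        for u in tokens[i + 1 : i + window]:
            if t != u:
                counts[tuple(sorted((t, u)))] += 1
    return counts
```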

Beyond its own merits, I find the analogy of dependency analysis to subject identification interesting: any subject identification depends upon other subjects being identified, whether those identifications are explicit or not.

If not explicit, we have the traditional IR problem of trying to determine what subjects were meant. We can see the patterns of usage but the reasons for the patterns lie just beyond our reach.

Dependency analysis does not seek an explicit identification but identifies patterns that appear to be associated with a particular term. That improves our “guesses” to a degree.

Topic maps enable us to make explicit what subjects the identification of a particular subject depends upon. Or rather to make explicit our identifications of subjects upon which an identification depends.

Whether the same subject is being identified, even by use of the same dependent identifications, is a question best answered by a user.

July 13, 2010

The FLAMINGO Project on Data Cleaning – Site

The FLAMINGO Project on Data Cleaning is the other project that has influenced the self-similarity work with MapReduce.

From the project description:

Supporting fuzzy queries is becoming increasingly more important in applications that need to deal with a variety of data inconsistencies in structures, representations, or semantics. Many existing algorithms require an offline analysis of data sets to construct an efficient index structure to support online query processing. Fuzzy join queries of data sets are more time consuming due to the computational complexity. The PI is studying three research problems: (1) constructing high-quality inverted lists for fuzzy search queries using Hadoop; (2) supporting fuzzy joins of large data sets using Hadoop; and (3) using the developed techniques to improve data quality of large collections of documents.

See the project webpage to learn more about their work on “us[ing] limited programming primitives in the cloud to implement index structures and search algorithms.”
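The inverted-list idea behind fuzzy search can be sketched with q-grams: index every overlapping substring of length q, then fetch any word sharing a gram with a (possibly misspelled) query. (Names and parameters below are illustrative, not FLAMINGO’s API; real systems add count thresholds and verification.)

```python
def grams(s, q=2):
    # the set of overlapping q-grams of a string
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def build_index(words, q=2):
    # inverted lists: q-gram -> set of words containing it
    index = {}
    for w in words:
        for g in grams(w, q):
            index.setdefault(g, set()).add(w)
    return index

def fuzzy_candidates(query, index, q=2):
    # candidates: any indexed word sharing at least one q-gram
    # with the query
    cands = set()
    for g in grams(query, q):
        cands |= index.get(g, set())
    return cands
```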

The relationship between “dirty” data and the increase in data overall is at least linear, but probably worse. Far worse. Whether data is “dirty” depends on your perspective. The more data that appears in “***” format (fill in the one you like the least), the dirtier the universe of data has become. “Dirty” data will be with you always.

ASTERIX: A Highly Scalable Parallel Platform for Semistructured Data Management and Analysis – SITE

ASTERIX: A Highly Scalable Parallel Platform for Semistructured Data Management and Analysis is one of the projects behind the self-similarity and MapReduce posting.

From the project page:

The ASTERIX project is developing new technologies for ingesting, storing, managing, indexing, querying, analyzing, and subscribing to vast quantities of semi-structured information. The project is combining ideas from three distinct areas – semi-structured data, parallel databases, and data-intensive computing – to create a next-generation, open source software platform that scales by running on large, shared-nothing computing clusters.

Home of Hyrax: Demonstrating a New Foundation for Data-Parallel Computation, offering “out-of-the-box support for common distributed communication patterns and set-oriented data operators.” (Need I say more?)

July 11, 2010

Efficient Parallel Set-Similarity Joins Using MapReduce

Efficient Parallel Set-Similarity Joins Using MapReduce by Rares Vernica, Michael J. Carey, and Chen Li, Department of Computer Science, University of California, Irvine, used Citeseer (1.3M publications) and DBLP (1.2M publications) and “…increased their sizes as needed.”

The contributions of this paper are:

  • “We describe efficient ways to partition a large dataset across nodes in order to balance the workload and minimize the need for replication. Compared to the equi-join case, the set-similarity joins case requires “partitioning” the data based on set contents.
  • We describe efficient solutions that exploit the MapReduce framework. We show how to efficiently deal with problems such as partitioning, replication, and multiple inputs by manipulating the keys used to route the data in the framework.
  • We present methods for controlling the amount of data kept in memory during a join by exploiting the properties of the data that needs to be joined.
  • We provide algorithms for answering set-similarity self-join queries end-to-end, where we start from records containing more than just the join attribute and end with actual pairs of joined records.
  • We show how our set-similarity self-join algorithms can be extended to answer set-similarity R-S join queries.
  • We present strategies for exceptional situations where, even if we use the finest-granularity partitioning method, the data that needs to be held in the main memory of one node is too large to fit.”
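Stripped of the partitioning and replication machinery the paper contributes, the core operation is a thresholded set-similarity comparison over record pairs. A deliberately naive single-node sketch using Jaccard similarity (the paper’s point is making this scale, which this does not):

```python
def jaccard(a, b):
    # set-similarity of two token sets
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def naive_self_join(records, threshold):
    # records: (record_id, token_set) pairs; returns ids of all
    # pairs whose join-attribute similarity meets the threshold
    out = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(records[i][1], records[j][1]) >= threshold:
                out.append((records[i][0], records[j][0]))
    return out
```

The quadratic loop is exactly what the MapReduce partitioning replaces: routing records so only plausible pairs ever meet at the same node.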

A number of lessons and insights relevant to topic maps in this paper.

Makes me think of domain-specific (as well as possibly one or more “general”) set-similarity join interchange languages! What are you thinking of?

NTCIR (NII Test Collection for IR Systems) Project

Filed under: Conferences,Heterogeneous Data,Information Retrieval,Search Engines,Software — Patrick Durusau @ 7:47 am

NTCIR (NII Test Collection for IR Systems) Project focuses on information retrieval tasks in Japanese, Chinese, Korean, English and cross-lingual information retrieval.

From the project description:

For the laboratory-typed testing, we have placed emphasis on (1) information retrieval (IR) with Japanese or other Asian languages and (2) cross-lingual information retrieval. For the challenging issues, (3) shift from document retrieval to “information” retrieval and technologies to utilizing information in the documents, and (4) investigation for realistic evaluation, including evaluation methods for summarization, multigrade relevance judgments and single-numbered averageable measures for such judgments, evaluation methods suitable for retrieval and processing of particular document-genre and its usage of the user group of the genre and so on.

I know there are active topic map communities in both Japan and Korea. Perhaps this is a place to meet researchers working on issues closely similar to those in topic maps and to discuss the contribution that topic maps have to offer.

Forum for Information Retrieval Evaluation (FIRE)

Filed under: Conferences,Heterogeneous Data,Information Retrieval,Search Engines,Software — Patrick Durusau @ 6:44 am

Forum for Information Retrieval Evaluation (FIRE) aims:

  • to encourage research in South Asian language Information Access technologies by providing reusable large-scale test collections for ILIR experiments
  • to explore new Information Retrieval / Access tasks that arise as our information needs evolve, and new needs emerge
  • to provide a common evaluation infrastructure for comparing the performance of different IR systems
  • to investigate evaluation methods for Information Access techniques and methods for constructing a reusable large-scale data set for ILIR experiments.

I know there is a lot of topic map development in South Asia and this looks like a great place to meet current researchers and to interest others in topic maps.

INEX: Initiative for Evaluation of XML Retrieval

Filed under: Conferences,Heterogeneous Data,Information Retrieval,Search Engines,Software — Patrick Durusau @ 6:30 am

INEX: Initiative for Evaluation of XML Retrieval is another must-see for serious topic map researchers.

No surprise that my first stop was the INEX Publications page, with proceedings from 2002 to date.

However, INEX offers an opportunity to evaluate topic maps in the context of other solutions, provided that one or more of us participate in the initiative.

If you or your institution decided to participate, please let others in the community know. I for one would like to join such an effort.

UCI ISG Lecture Series on Scalable Data Management

Filed under: Information Retrieval,MapReduce,Searching,Semantics,SQL — Patrick Durusau @ 5:39 am

UCI ISG Lecture Series on Scalable Data Management is simply awesome! Among the slides and videos you will find:

  • Teradata Past, Present and Future Todd Walter, CTO, R&D, Teradata
  • Hadoop: Origins and Applications Chris Smith, Xavier Stevens and John Carnahan, FOX Audience Network
  • Pig: Building High-Level Dataflows over Map-Reduce Utkarsh Srivastava, Senior Research Scientist, Yahoo!
  • Database Scalability and Indexes Goetz Graefe, HP Fellow, Hewlett-Packard Laboratories
  • Cloud Data Serving: Key-Value Stores to DBMSs Raghu Ramakrishnan, Chief Scientist for Audience & Cloud Computing, Yahoo!
  • Scalable Data Management at Facebook Srinivas Narayanan, Software Engineer, Facebook
  • SCOPE: Parallel Data Processing of Massive Data Sets Jingren Zhou, Researcher, Microsoft
  • What We Got Right, What We Got Wrong: The Lessons I Learned Building a Large-Scale DBMS for XML. Mary Holstege, Principal Engineer, Mark Logic
  • Scalable Data Management with DB2 Matthias Nicola, DB2 pureXML Architect, IBM
  • SQL Server: A Data Platform for Large-Scale Applications José Blakeley, Partner Architect, Microsoft
  • Data in the Cloud: New Challenges or More of the Same? Divy Agrawal, Professor of Computer Science, UC Santa Barbara

Subject identity is as important in the realm of big data (BigTable, etc.) as it is anywhere else.

It is our choice whether topic maps step up to the challenge.

That is going to require reaching out and across communities and becoming pro-active with regard to new opportunities and possibilities.

This resource was brought to my notice by Jack Park. Jack delights in sending these highly relevant and often quite large resource listings my way (and to be honest, I return the favor).

July 10, 2010

Knowledge-Based Systems – Journal

Filed under: Information Retrieval,Software — Patrick Durusau @ 7:51 am

Knowledge-Based Systems is described on its homepage:

Knowledge-Based Systems is the international, interdisciplinary and applications-oriented journal on KBS.

Knowledge-Based Systems focuses on systems that use knowledge-based techniques to support human decision-making, learning and action. Such systems are capable of cooperating with human users and so the quality of support given and the manner of its presentation are important issues. The emphasis of the journal is on the practical significance of such systems in modern computer development and usage.

As well as being concerned with the implementation of knowledge-based systems, the journal covers the design process, the matching of requirements and needs to delivered systems and the organisational implications of introducing such technology into the workplace and public life, expert systems, application of knowledge-based methods, integration with conventional technologies, software tools for KBS construction, decision-support mechanisms, user interactions, organisational issues, knowledge acquisition, knowledge representation, languages and programming environments, knowledge-based implementation techniques and system architectures. Also included are publication reviews.

Forthcoming articles include:

  • Grammar-Based Geodesics in Semantic Networks
  • Hy-SN: Hyper-graph based Semantic Network
  • A Semantic Backend for Content Management Systems
  • Research on the Model of Rough Set over Dual-universes

Definitely should be on every topic map researcher’s current awareness list.

July 8, 2010

Taking Your Tool Kit to the Next Level

Filed under: Data Mining,Information Retrieval,Search Engines — Patrick Durusau @ 7:53 pm

Online Mathematics Textbooks is a good stop if you want to take your tool kit to the next level.

Plug-n-play indexing and search engines will do a lot out of the box but aren’t going to distinguish you from the competition.

Understanding the underlying algorithms will help make the data mining you do to populate your topic map qualitatively different.

Here’s your chance to brush up on your math skills without monetary investment.

***
PS: At some point, maybe at TMRA, a group of us need to draft an outline for a topic maps curriculum. It would have to include topic maps, obviously, but would also need courses in Information Retrieval, User Interfaces, Natural Language Processing, Classification, Math. What else? It would also need “minors” in particular subject areas.

June 19, 2010

Demonstrating The Need For Topic Maps

Individual Differences in the Interpretation of Text: Implications for Information Science by Jane Morris demonstrates that different readers have different perceptions of lexical cohesion in a text: a difference of about 40%. That is a difference in the meaning of the text.

Many tasks in library and information science (e.g., indexing, abstracting, classification, and text analysis techniques such as discourse and content analysis) require text meaning interpretation, and, therefore, any individual differences in interpretation are relevant and should be considered, especially for applications in which these tasks are done automatically. This article investigates individual differences in the interpretation of one aspect of text meaning that is commonly used in such automatic applications: lexical cohesion and lexical semantic relations. Experiments with 26 participants indicate an approximately 40% difference in interpretation. In total, 79, 83, and 89 lexical chains (groups of semantically related words) were analyzed in 3 texts, respectively. A major implication of this result is the possibility of modeling individual differences for individual users. Further research is suggested for different types of texts and readers than those used here, as well as similar research for different aspects of text meaning.

I won’t belabor what a 40% difference in interpretation implies for the one-interpretation-of-data crowd. At least not for those who prefer an evidence-based rather than an ideological approach to IR.

What is worth belaboring is how to use Morris’ technique to demonstrate such differences in interpretation to potential topic map customers. As a community we could develop texts for use with particular market segments: business, government, legal, finance, etc. We could build an interface to replace the colored pencils used to mark all words belonging to a particular group, and automate some of the calculations and other operations on the resulting data.
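One calculation such a tool might automate is a simple pairwise-agreement measure between two readers’ lexical chains: for every pair of words both readers annotated, do they agree on whether the two words belong to the same chain? This is a hypothetical sketch; the word groupings are invented and this is not Morris’ actual method or data.

```python
from itertools import combinations

def chain_of(chains):
    """Map each word to the index of the chain containing it."""
    return {w: i for i, chain in enumerate(chains) for w in chain}

def disagreement(chains_a, chains_b):
    """Fraction of shared word pairs on which two readers disagree
    about 'same chain' vs. 'different chain' membership."""
    a, b = chain_of(chains_a), chain_of(chains_b)
    shared = sorted(set(a) & set(b))
    total = disagree = 0
    for w1, w2 in combinations(shared, 2):
        total += 1
        if (a[w1] == a[w2]) != (b[w1] == b[w2]):
            disagree += 1
    return disagree / total if total else 0.0

# Two readers chain the same five words differently.
reader1 = [{"bank", "money", "loan"}, {"river", "water"}]
reader2 = [{"bank", "river", "water"}, {"money", "loan"}]
print(disagreement(reader1, reader2))  # 0.4
```

A number like 0.4 from a client’s own texts makes the abstract claim of interpretive variation concrete.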

Sensing that interpretations of texts vary is one thing. Having an actual demonstration, possibly using texts from a potential client, is quite another.

This is a tool we should build. I am willing to help. Who else is interested?
