TREC « Another Word For It

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 4, 2015

Google Freebase Annotations of TREC KBA 2014 Stream Corpus, v1 (FAKBA1)

Filed under: Entities,Freebase,TREC — Patrick Durusau @ 5:26 pm

Google Freebase Annotations of TREC KBA 2014 Stream Corpus, v1 (FAKBA1)

From the webpage:

Researchers at Google annotated the English-language pages from the TREC KBA Stream Corpus 2014 with links to Freebase. The annotation was performed automatically and is imperfect. For each entity recognized with high confidence an annotation with a link to Freebase is provided (see the details below).

For any questions, join this discussion forum: https://groups.google.com/group/streamcorpus.

Data Description

FAKBA1 Corpus Annotations

TREC KBA 2014 CCR entity query annotations

The entity annotations are for the TREC KBA Stream Corpus 2014. These annotations are freely available. The annotation data for the corpus is provided as a collection of 2000 files (the partitioning is somewhat arbitrary) that total 196 GB, compressed (gz). Each file contains annotations for a batch of pages and the entities identified on the page.

I first saw this in a tweet by Jeff Dalton.

Jeff has a blog post about this release at: Google Research Entity Annotations of the KBA Stream Corpus (FAKBA1). Jeff speculates on the application of this corpus to other TREC tasks.

Jeff suggests that you monitor Knowledge Data Releases for future data releases. I need to ping Jeff as the FAKBA1 release does not appear on the Knowledge Data Release page.

BTW, don’t be misled by the “9.4 billion entity annotations from over 496 million documents” statistic. Impressive but ask yourself, how many of your co-workers, their friends, families, relationships at work, projects where you work, etc. appear in Freebase? Sounds like there is a lot of work to be done with your documents and data that have little or nothing to do with Freebase. Yes?

Enjoy!


February 10, 2014

Text Retrieval Conference (TREC) 2014

Filed under: Conferences,TREC — Patrick Durusau @ 11:33 am

Text Retrieval Conference (TREC) 2014

Schedule: As soon as possible — submit your application to participate in TREC 2014 as described below.
Submitting an application will add you to the active participants’ mailing list. On Feb 26, NIST will announce a new password for the “active participants” portion of the TREC web site.

Beginning March 1
Document disks used in some existing TREC collections distributed to participants who have returned the required forms. Please note that no disks will be shipped before March 1.

July–August
Results submission deadline for most tracks. Specific deadlines for each track will be included in the track guidelines, which will be finalized in the spring.

September 30 (estimated)
relevance judgments and individual evaluation scores due back to participants.

Nov 18–21
TREC 2014 conference at NIST in Gaithersburg, Md. USA

From the webpage:

The Text Retrieval Conference (TREC) workshop series encourages research in information retrieval and related applications by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. Now in its 23rd year, the conference has become the major experimental effort in the field. Participants in the previous TREC conferences have examined a wide variety of retrieval techniques and retrieval environments, including cross-language retrieval, retrieval of web documents, multimedia retrieval, and question answering. Details about TREC can be found at the TREC web site, http://trec.nist.gov.

You are invited to participate in TREC 2014. TREC 2014 will consist of a set of tasks known as “tracks”. Each track focuses on a particular subproblem or variant of the retrieval task as described below. Organizations may choose to participate in any or all of the tracks. Training and test materials are available from NIST for some tracks; other tracks will use special collections that are available from other organizations for a fee.

Dissemination of TREC work and results other than in the (publicly available) conference proceedings is welcomed, but the conditions of participation specifically preclude any advertising claims based on TREC results. All retrieval results submitted to NIST are published in the Proceedings and are archived on the TREC web site. The workshop in November is open only to participating groups that submit retrieval results for at least one track and to selected government invitees.

The eight (8) tracks:

Clinical Decision Support Track: The clinical decision support track investigates techniques for linking medical cases to information relevant for patient care.

Contextual Suggestion Track: The Contextual Suggestion track investigates search techniques for complex information needs that are highly dependent on context and user interests.

Federated Web Search Track: The Federated Web Search track investigates techniques for the selection and combination of search results from a large number of real on-line web search services.

Knowledge Base Acceleration Track: This track looks to develop techniques to dramatically improve the efficiency of (human) knowledge base curators by having the system suggest modifications/extensions to the KB based on its monitoring of the data streams.

Microblog Track: The Microblog track examines the nature of real-time information needs and their satisfaction in the context of microblogging environments such as Twitter.

Session Track: The Session track aims to provide the necessary resources in the form of test collections to simulate user interaction and help evaluate the utility of an IR system over a sequence of queries and user interactions, rather than for a single “one-shot” query.

Temporal Summarization Track: The goal of the Temporal Summarization track is to develop systems that allow users to efficiently monitor the information associated with an event over time.

Web Track: The goal of the Web track is to explore and evaluate Web retrieval technologies that are both effective and reliable.

As of the date of this post, only the Clinical Decision Support Track webpage has been updated for the 2014 conference. The others will follow in due time.

Apologies for the late notice, but since the legal track doesn’t appear this year, it dropped off my radar.

Application Details

Organizations wishing to participate in TREC 2014 should respond to this call for participation by submitting an application. Participants in previous TRECs who wish to participate in TREC 2014 must submit a new application. To apply, submit the online application at: http://ir.nist.gov/trecsubmit.open/application.html


January 30, 2013

A Data Driven Approach to Query Expansion in Question Answering

Filed under: Lucene,Query Expansion,TREC — Patrick Durusau @ 8:44 pm

A Data Driven Approach to Query Expansion in Question Answering by Leon Derczynski, Jun Wang, Robert Gaizauskas, and Mark A. Greenwood.

Abstract:

Automated answering of natural language questions is an interesting and useful problem to solve. Question answering (QA) systems often perform information retrieval at an initial stage. Information retrieval (IR) performance, provided by engines such as Lucene, places a bound on overall system performance. For example, no answer bearing documents are retrieved at low ranks for almost 40% of questions.

In this paper, answer texts from previous QA evaluations held as part of the Text REtrieval Conferences (TREC) are paired with queries and analysed in an attempt to identify performance-enhancing words. These words are then used to evaluate the performance of a query expansion method.

Data driven extension words were found to help in over 70% of difficult questions. These words can be used to improve and evaluate query expansion methods. Simple blind relevance feedback (RF) was correctly predicted as unlikely to help overall performance, and a possible explanation is provided for its low value in IR for QA.

Work on query expansion in natural language answering systems. Closely related to synonymy.

Query expansion tools could be useful prompts for topic map authors seeking terms for mapping.
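To make the term "blind relevance feedback" concrete, here is a minimal sketch of the general idea, not the method evaluated in the paper. It assumes the top-ranked documents are relevant and appends their most frequent new terms to the query. The function names, the tiny stopword list, and the parameters `top_docs` and `extra_terms` are all hypothetical. Keep in mind that the abstract reports this simple form of feedback as unlikely to help overall QA performance.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "was", "and", "to", "who", "what"}
PUNCTUATION = ".,?!;:\"'()"


def tokenize(text):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def expand_query(query, ranked_docs, top_docs=2, extra_terms=3):
    """Blind relevance feedback: treat the top-ranked documents as relevant
    and append their most frequent non-query, non-stopword terms."""
    query_terms = tokenize(query)
    excluded = set(query_terms) | STOPWORDS
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(t for t in tokenize(doc) if t not in excluded)
    # Highest count first; alphabetical order breaks ties deterministically.
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:extra_terms]
    return query_terms + [term for term, _ in best]
```

A topic map author could use the added terms the same way: as candidate names to consider when mapping a subject, not as answers.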


May 14, 2012

TREC Document Review Project on Hiatus, Recommind Asked to Withdraw

Filed under: Data Mining,Data Source,Open Relevance Project,TREC — Patrick Durusau @ 12:47 pm

TREC Document Review Project on Hiatus, Recommind Asked to Withdraw

From the post:

TREC Legal Track — part of the U.S. government’s Text Retrieval Conference — announced last week that the 2012 edition of its annual document review project for testing new systems is canceled, while prominent e-discovery software company Recommind confirmed that it’s been asked to leave the project for prematurely sharing results.

These difficulties highlight the need for:

open data sets and
protocols for reporting of results as they occur.

That requires a data set with relevance judgments and other work.

Have you thought about the Open Relevance Project at the Apache Foundation?

Email archives from Apache projects, the backbone of the web as we know it, are ripe for your contributions.

Let me be the first to ask Recommind to join in building a public data set for everyone.


May 12, 2012

TREC 2012 Crowdsourcing Track

Filed under: Crowd Sourcing,TREC — Patrick Durusau @ 6:22 pm

TREC 2012 Crowdsourcing Track

Panos Ipeirotis writes:

TREC 2012 Crowdsourcing Track – Call for Participation
June 2012 – November 2012

https://sites.google.com/site/treccrowd/

Goals

As part of the National Institute of Standards and Technology (NIST)’s annual Text REtrieval Conference (TREC), the Crowdsourcing track investigates emerging crowd-based methods for search evaluation and/or developing hybrid automation and crowd search systems.

This year, our goal is to evaluate approaches to crowdsourcing high quality relevance judgments for two different types of media:

textual documents

images

For each of the two tasks, participants will be expected to crowdsource relevance labels for approximately 20k topic-document pairs (i.e., 40k labels when taking part in both tasks). In the first task, the documents will be from an English news text corpora, while in the second task the documents will be images from Flickr and from a European news agency.

Participants may use any crowdsourcing methods and platforms, including home-grown systems. Submissions will be evaluated against a gold standard set of labels and against consensus labels over all participating teams.
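The call does not say how consensus labels are computed or which measure is used for scoring. As a rough sketch only, the code below assumes a simple majority vote over all teams for the consensus and plain accuracy for the comparison. The function names and the data layout (a dict from team name to a dict of `(topic, document)` pairs and labels) are hypothetical.

```python
from collections import Counter


def consensus_labels(submissions):
    """Majority vote per (topic, document) pair across all teams.

    Ties go to the smaller label, which is an arbitrary choice made here.
    """
    votes = {}
    for labels in submissions.values():
        for pair, label in labels.items():
            votes.setdefault(pair, Counter())[label] += 1
    return {
        pair: min(counter.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        for pair, counter in votes.items()
    }


def accuracy(labels, reference):
    """Fraction of pairs present in both mappings whose labels agree."""
    shared = [pair for pair in labels if pair in reference]
    if not shared:
        return 0.0
    return sum(labels[p] == reference[p] for p in shared) / len(shared)
```

Scoring a team against both references is then two calls: `accuracy(team_labels, gold)` and `accuracy(team_labels, consensus_labels(all_submissions))`.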

Tentative Schedule

Jun 1: Document corpora, training topics (for image task) and task guidelines available

Jul 1: Training labels for the image task

Aug 1: Test data released

Sep 15: Submissions due

Oct 1: Preliminary results released

Oct 15: Conference notebook papers due

Nov 6-9: TREC 2012 conference at NIST, Gaithersburg, MD, USA

Nov 15: Final results released

Jan 15, 2013: Final papers due

As you know, I am interested in crowd sourcing of paths through data and assignment of semantics.

Although I am puzzled: why do we continue to put emphasis on post-creation assignment of semantics?

After data is created, we look around surprised the data has no explicit semantics.

Like realizing you are on Main Street without your pants.

Why don’t we look to the data creation process to assign explicit semantics?

Thoughts?


January 12, 2012

TEXT RETRIEVAL CONFERENCE (TREC) 2012

Filed under: Conferences,TREC — Patrick Durusau @ 7:33 pm

TEXT RETRIEVAL CONFERENCE (TREC) 2012 February 2012 – November 2012.

Schedule:

As soon as possible — submit your application to participate in TREC 2012 as described below.

Submitting an application will add you to the active participants’ mailing list. On Feb 23, NIST will announce a new password for the “active participants” portion of the TREC web site.

Beginning March 1

Document disks used in some existing TREC collections distributed to participants who have returned the required forms. Please note that no disks will be shipped before March 1.

July–August

Results submission deadline for most tracks. Specific deadlines for each track will be included in the track guidelines, which will be finalized in the spring.

September 30 (estimated)

relevance judgments and individual evaluation scores due back to participants.

Nov 6-9

TREC 2012 conference at NIST in Gaithersburg, Md. USA

Applications:

Organizations wishing to participate in TREC 2012 should respond to this call for participation by submitting an application. Participants in previous TRECs who wish to participate in TREC 2012 must submit a new application. To apply, follow the instructions at

http://ir.nist.gov/trecsubmit.open/application.html

to submit an online application. The application system will send an acknowledgement to the email address supplied in the form once it has processed the form.

Conference blurb:

The Text Retrieval Conference (TREC) workshop series encourages research in information retrieval and related applications by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. Now in its 21st year, the conference has become the major experimental effort in the field. Participants in the previous TREC conferences have examined a wide variety of retrieval techniques and retrieval environments, including cross-language retrieval, retrieval of web documents, multimedia retrieval, and question answering. Details about TREC can be found at the TREC web site, http://trec.nist.gov.

You are invited to participate in TREC 2012. TREC 2012 will consist of a set of tasks known as “tracks”. Each track focuses on a particular subproblem or variant of the retrieval task as described below. Organizations may choose to participate in any or all of the tracks. Training and test materials are available from NIST for some tracks; other tracks will use special collections that are available from other organizations for a fee.

Dissemination of TREC work and results other than in the (publicly available) conference proceedings is welcomed, but the conditions of participation specifically preclude any advertising claims based on TREC results. All retrieval results submitted to NIST are published in the Proceedings and are archived on the TREC web site. The workshop in November is open only to participating groups that submit retrieval results for at least one track and to selected government invitees.

Look at the data sets and tracks. This is not a venture for the faint of heart.


November 4, 2011

More Data: Tweets & News Articles

Filed under: Dataset,News,TREC,Tweets — Patrick Durusau @ 6:07 pm

From Max Lin’s blog, Ian Soboroff posted:

Two new collections being released from TREC today:

The first is the long-awaited Tweets2011 collection. This is 16 million tweets sampled by Twitter for use in the TREC 2011 microblog track. We distribute the tweet identifiers and a crawler, and you download the actual tweets using the crawler. http://trec.nist.gov/data/tweets/

The second is TRC2, a collection of 1.8 million news articles from Thomson Reuters used in the TREC 2010 blog track. http://trec.nist.gov/data/reuters/reuters.html

Both collections are available under extremely permissive usage agreements that limit their use to research and forbid redistribution, but otherwise are very open as data usage agreements go.

It may just be my memory but I don’t recall seeing topic map research with the older Reuters data set (the new one is too recent). Is that true?

Anyway, more large data sets for your research pleasure.


July 26, 2011

The Graph 500 List

Filed under: Graphs,HPCC,TREC — Patrick Durusau @ 6:24 pm

The Graph 500 List

From the website:

Data intensive supercomputer applications are increasingly important for HPC workloads, but are ill-suited for platforms designed for 3D physics simulations. Current benchmarks and performance metrics do not provide useful information on the suitability of supercomputing systems for data intensive applications. A new set of benchmarks is needed in order to guide the design of hardware architectures and software systems intended to support such applications and to help procurements. Graph algorithms are a core part of many analytics workloads.

Backed by a steering committee of over 30 international HPC experts from academia, industry, and national laboratories, Graph 500 will establish a set of large-scale benchmarks for these applications. The Graph 500 steering committee is in the process of developing comprehensive benchmarks to address three application kernels: concurrent search, optimization (single source shortest path), and edge-oriented (maximal independent set). Further, we are in the process of addressing five graph-related business areas: Cybersecurity, Medical Informatics, Data Enrichment, Social Networks, and Symbolic Networks.

This is the first serious approach to complement the Top 500 with data intensive applications. Additionally, we are working with the SPEC committee to include our benchmark in their CPU benchmark suite. We anticipate the list will rotate between ISC and SC in future years.
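For readers unfamiliar with graph search kernels, breadth-first search is one common form. The sketch below is a serial illustration only, assuming the graph is given as a dict of adjacency lists. It is not the benchmark's reference code and says nothing about the concurrency or scale the benchmark targets. The name `bfs_parents` is hypothetical.

```python
from collections import deque


def bfs_parents(adjacency, root):
    """Breadth-first search from root.

    Returns a dict mapping each reachable vertex to the vertex it was
    discovered from. The root is recorded as its own parent.
    """
    parents = {root: root}
    queue = deque([root])
    while queue:
        vertex = queue.popleft()
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in parents:
                parents[neighbor] = vertex
                queue.append(neighbor)
    return parents
```

Vertices missing from the result are unreachable from the root, which makes the parent map easy to check after a run.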


December 25, 2010

Spectral Based Information Retrieval

Filed under: Information Retrieval,Retrieval,TREC,Vectors,Wavelet Transforms — Patrick Durusau @ 6:10 am

Spectral Based Information Retrieval Author: Laurence A. F. Park (2003)

Every now and again I run into a dissertation that is an interesting and useful survey of a field and an original contribution to the literature.

Not often but it does happen.

It happened in this case with Park’s dissertation.

It marks the beginning of an interesting thread of research that treats terms in a document as a spectrum and then applies spectral transformations to the retrieval problem.

The technique has been developed and extended since the appearance of Park’s work.

Highly recommended, particularly if you are interested in tracing the development of this technique in information retrieval.

My interest is in the use of spectral representations of text in information retrieval as part of topic map authoring and its potential as a subject identity criterion.

Actually I should broaden that to include retrieval of images and other data as well.
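To give a feel for what "terms as a spectrum" can mean, here is a minimal sketch. It assumes a document is cut into equal-width bins, a term's occurrences per bin form a signal, and a plain discrete Fourier transform is applied to that signal. The dissertation defines its own transforms and scoring, which this does not reproduce. The names `term_signal` and `dft` and the default bin count are hypothetical.

```python
import cmath


def term_signal(tokens, term, bins=8):
    """Count occurrences of term in each of `bins` equal-width segments
    of the token sequence."""
    signal = [0] * bins
    total = len(tokens)
    for position, token in enumerate(tokens):
        if token == term:
            signal[position * bins // total] += 1
    return signal


def dft(signal):
    """Naive discrete Fourier transform, quadratic in the signal length."""
    size = len(signal)
    return [
        sum(
            value * cmath.exp(-2j * cmath.pi * k * t / size)
            for t, value in enumerate(signal)
        )
        for k in range(size)
    ]
```

A term spread evenly through a document puts all of its energy in the zero-frequency component, while a term confined to one spot spreads energy across every component. That difference in where terms occur is what a bag-of-words count throws away.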

Questions:

Prepare an annotated bibliography of ten (10) recent papers using spectral analysis for information retrieval.
Spectral analysis helps retrieve documents but what if you are searching for ideas? Does spectral analysis offer any help?
How would you extend the current state of spectral based information retrieval? (5-10 pages, project proposal, citations)


November 29, 2010

TREC Entity Track: Plans for Entity 2011

Filed under: Conferences,Entity Extraction,TREC — Patrick Durusau @ 8:50 am

TREC Entity Track: Plans for Entity 2011

Plans for Entity 2011.

Known datasets of interest: ClueWeb09, DBPedia Ontology, Billion Triple Dataset

It’s not too early to get started for next year!


TREC Entity Track: Report from TREC 2010

Filed under: Conferences,Entity Extraction,TREC — Patrick Durusau @ 8:17 am

TREC Entity Track: Report from TREC 2010

A summary of the results for the TREC Entity Track (related entity finding (REF) search task on the WWW) for 2010.
