Archive for the ‘Freebase’ Category

Google Freebase Annotations of TREC KBA 2014 Stream Corpus, v1 (FAKBA1)

Wednesday, February 4th, 2015

From the webpage:

Researchers at Google annotated the English-language pages from the TREC KBA Stream Corpus 2014 with links to Freebase. The annotations were generated automatically and are imperfect. For each entity recognized with high confidence, an annotation with a link to Freebase is provided (see the details below).

For any questions, join this discussion forum: https://groups.google.com/group/streamcorpus.

Data Description

The entity annotations are for the TREC KBA Stream Corpus 2014 and are freely available. The annotation data is provided as a collection of 2,000 files (the partitioning is somewhat arbitrary) that total 196 GB, compressed (gz). Each file contains annotations for a batch of pages and the entities identified on each page.

I first saw this in a tweet by Jeff Dalton.

Jeff has a blog post about this release at: Google Research Entity Annotations of the KBA Stream Corpus (FAKBA1). Jeff speculates on the application of this corpus to other TREC tasks.

Jeff suggests that you monitor Knowledge Data Releases for future data releases. I need to ping Jeff as the FAKBA1 release does not appear on the Knowledge Data Release page.

BTW, don’t be misled by the “9.4 billion entity annotations from over 496 million documents” statistic. Impressive, but ask yourself: how many of your co-workers, their friends, families, relationships at work, projects where you work, etc. appear in Freebase? Sounds like there is a lot of work to be done with your documents and data that has little or nothing to do with Freebase. Yes?

Enjoy!

11 Billion Clues in 800 Million Documents:…

Saturday, July 20th, 2013

11 Billion Clues in 800 Million Documents: A Web Research Corpus Annotated with Freebase Concepts by Dave Orr, Amar Subramanya, Evgeniy Gabrilovich, and Michael Ringgaard.

From the post:

When you type in a search query — perhaps Plato — are you interested in the string of letters you typed? Or the concept or entity represented by that string? But knowing that the string represents something real and meaningful only gets you so far in computational linguistics or information retrieval — you have to know what the string actually refers to. The Knowledge Graph and Freebase are databases of things, not strings, and references to them let you operate in the realm of concepts and entities rather than strings and n-grams.

We’ve previously released data to help with disambiguation and recently awarded $1.2M in research grants to work on related problems. Today we’re taking another step: releasing data consisting of nearly 800 million documents automatically annotated with over 11 billion references to Freebase entities.

These Freebase Annotations of the ClueWeb Corpora (FACC) consist of ClueWeb09 FACC and ClueWeb12 FACC. 11 billion phrases that refer to concepts and entities in Freebase were automatically labeled with their unique identifiers (Freebase MIDs). …

(…)

Based on review of a sample of documents, we believe the precision is about 80-85%, and recall, which is inherently difficult to measure in situations like this, is in the range of 70-85%….

(…)

Evaluate precision and recall by asking:

Your GPS gives you relevant directions an average of eight (8) times out of ten (10), and it finds relevant locations an average of seven (7) times out of ten (10). (Wikipedia on Precision and Recall)

Is that a good GPS?
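To make the analogy concrete, here is a minimal sketch of how those figures are computed. The counts are hypothetical, chosen only to reproduce the 80% precision and ~70% recall end of the ranges Google reports above:

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision: of the annotations emitted, how many were correct.
    Recall: of the entities actually present, how many were annotated."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Hypothetical counts for a sample of annotated documents:
# 80 correct annotations, 20 spurious ones, 34 entities missed entirely.
p, r = precision_recall(80, 20, 34)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.80 recall=0.70
```

Note that the false negatives (entities present but never annotated) are exactly the quantity the Google post calls “inherently difficult to measure,” which is why recall is quoted as a range.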

A useful data set but still a continuation of the approach of guessing what authors meant when they authored documents.

What if, by some as yet unknown technique, precision goes to nine (9) out of ten (10) and recall goes to nine (9) out of ten (10) as well?

The GPS question becomes:

Your GPS gives you relevant directions an average of nine (9) times out of ten (10), and it finds relevant locations an average of nine (9) times out of ten (10).

Is that a good GPS?

Not that any automated technique has shown that level of performance.

Rather than focusing on data post-authoring, why not enable authors to declare their semantics?

Author declared semantics would reduce the cost and uncertainty of post-authoring semantic solutions.
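What author-declared semantics might look like in practice can be sketched minimally. The entity identifiers below are placeholders, not real Freebase MIDs, and the annotation layout is illustrative, not any standard format:

```python
# A minimal sketch: the author attaches entity identifiers to mentions
# at writing time, instead of leaving them to be guessed afterwards.
# The identifiers below are placeholders, not real Freebase MIDs.
document = {
    "text": "Plato founded the Academy in Athens.",
    "annotations": [
        # (start offset, end offset, author-declared entity id)
        (0, 5, "/m/plato_placeholder"),
        (18, 25, "/m/academy_placeholder"),
    ],
}

for start, end, mid in document["annotations"]:
    print(document["text"][start:end], "->", mid)
```

Annotations like these carry no precision/recall uncertainty at all: the author has stated which entity was meant, rather than a pipeline guessing it after the fact.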

I first saw this in a tweet by Nicolas Torzec.

Freebase Data Dumps

Thursday, March 21st, 2013

Freebase Data Dumps

From the webpage:

Data Dumps are a downloadable version of the data in Freebase. They constitute a snapshot of the data stored in Freebase and the Schema that structures it, and are provided under the same CC-BY license.

Full data dumps of every fact and assertion in Freebase are available as RDF and are updated every week. Deltas are not available at this time.

Total triples: 585 million
Compressed size: 14 GB
Uncompressed size: 87 GB
Data Format: Turtle RDF
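At 14 GB compressed and 87 GB uncompressed, the dump is best processed as a stream rather than loaded whole. A minimal sketch, assuming the one-triple-per-line, tab-separated, “.”-terminated layout the Freebase RDF dumps use (verify against your own copy before relying on it):

```python
import gzip

def parse_triple(line):
    """Parse one dump line into (subject, predicate, object).

    Assumes a tab-separated, '.'-terminated, one-triple-per-line
    layout; check this against your copy of the dump."""
    line = line.rstrip()           # drop newline and trailing whitespace
    if line.endswith("."):
        line = line[:-1].rstrip()  # drop the triple terminator
    subject, predicate, obj = line.split("\t", 2)
    return subject, predicate, obj

def iter_triples(path):
    """Stream triples from the gzipped dump without ever holding
    the decompressed 87 GB in memory."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield parse_triple(line)
```

Streaming this way lets you filter for the predicates or MIDs you care about in a single pass over the compressed file.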

I first saw this in a tweet by Thomas Steiner.

New at Freebase

Monday, January 21st, 2013

I saw a note at SemanticWeb.com about Freebase offering a new interface. Went to see.

Looked under astronomy, which had far fewer sub-topics than I would have imagined and visited the entry for “star.”

“Star” reports:

A star is really meant to be a single stellar object, not just something that looks like a star from Earth. However, in many cases, other objects, such as multi-star systems, were originally thought to be stars. Because people have historically believed these to be stars, they are typed as such, but they are also typed as what we now know them to be.

I understand the need to preserve prior “types” but that is a question of scope, not simply adding more types.

Moreover, if “star” means a “single stellar object,” then where do I put different classes of stars? Do they have occurrences too? Does that mean their occurrences get listed under “star” as well?

Topic Maps, Google and the Billion Fact Parade

Thursday, February 10th, 2011

Andrew Hogue (Google) actually titled his presentation on Google’s plan for Freebase: The Structured Search Engine.

Several minutes into the presentation, Hogue points out that to answer the question “when was Martin Luther King, Jr. born?”, the attribute names date of birth, date born, appeared, and dob were all considered synonyms that expect the date type.

Hmmm, he must mean keys that represent the same subject, and so are subject to merging, and possibly, depending on their role in a subject representative, further merging of those subject representatives. Can you say Steve Newcomb and the TMRM?

Yes, attribute names represent subjects, just as collections of attributes are thought to represent subjects, and both benefit from rules specifying subject identity, other properties, and merging rules. (Some of those rules can be derived from mechanical analysis; others probably not.)
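The move Hogue describes can be sketched as mapping synonymous attribute names to a canonical key before merging records. The synonym table, key names, and records here are illustrative, not anything from Freebase:

```python
# Hypothetical attribute-name synonyms, all mapped to one canonical key.
SYNONYMS = {"date of birth": "dob", "date born": "dob"}

def canonicalize(record):
    """Rewrite each attribute name to its canonical form, so records
    that used different names for the same property line up."""
    return {SYNONYMS.get(k, k): v for k, v in record.items()}

def merge(a, b):
    """Merge two records assumed to describe the same subject.
    Real merging also needs conflict rules (some mechanically
    derivable, some not); here the first value simply wins."""
    merged = canonicalize(a)
    for key, value in canonicalize(b).items():
        merged.setdefault(key, value)
    return merged

r1 = {"name": "Martin Luther King, Jr.", "date of birth": "1929-01-15"}
r2 = {"name": "Martin Luther King, Jr.", "dob": "1929-01-15"}
print(merge(r1, r2))  # {'name': 'Martin Luther King, Jr.', 'dob': '1929-01-15'}
```

The hard part, as the rest of this post argues, is not the table lookup but deciding which names really do identify the same subject, and under what conditions.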

Second, Hogue points out that Freebase had 13 million entities when purchased by Google. He speculates on taking that to 1 billion entities.

Let’s cut to the chase, I will see Hogue’s 1 billion entities and raise him 9 billion entities for a total pot of 10 billion entities.

Now what?

Let’s take a simple question that Hogue’s 10 billion entity Google/Freebase cannot usefully answer.

What is democracy?

Seems simple enough. (Viewers at home can try this with their favorite search engine.)

1) United States State Department: Democracy means a state that supports Israel, keeps the Suez Canal open and opposes people we don’t like in the U.S. Oh, and that protects the rights and social status of the wealthy, almost forgot that one. Sorry.

2) Protesters in Egypt (my view): Democracy probably does not include some or all of the points I mention for #1.

3) Turn of the century U.S.: Effectively only the white male population participates.

4) Early U.S. history: Land ownership is a requirement.

I am sure examples can be supplied from other “democracies” and their histories around the world.

This is a very important term, and its differing use by different people in different contexts is going to make discussion and negotiation more difficult.

There are lots of terms for which no single “entity” or “fact” is going to work for everyone.

Subject identity is a tough question and the identification of a subject changes over time, social context, etc. Not to mention that the subjects identified by particular identifications change as well.

Consider that at one time cab was used to refer not to a method of transportation but to a brothel. You may object that was “slang” usage, but if I am searching an index of police reports from that time period for raids on brothels, your objection isn’t helpful. Whether the usage is “slang” or not, I need to obtain accurate results.

User expectations and needs cannot (or at least should not in my opinion) be adapted to the limitations of a particular approach or technology.

Particularly when we already know of strategies that can help with, not solve, the issues surrounding subject identity.

The first step that Hogue and Google have taken, recognizing that attribute names can have synonyms, is a good start. In topic map terms, it recognizes that information structures are composed of subjects as well, so that we can map between information structures rather than replacing one with another. (Or having religious discussions about which one is better, etc.)

Hogue and Google are already on the way to treating some subjects as worthy of more effort than others, but for those that merit the attention, reliable, repeatable subject identification is a non-trivial problem.

Topic maps can make a number of suggestions that can help with that task.