Archive for the ‘Dictionary’ Category

Panel on Digital Dictionaries (MLA/LSA/ADS)

Wednesday, September 26th, 2012

Panel on Digital Dictionaries (MLA/LSA/ADS) by Ben Zimmer.

From the post:

Eric Baković has noted the happy confluence of the annual meetings of the Linguistic Society of America and the Modern Language Association, both scheduled for January 3-6, 2013 at sites within reasonable walking distance of each other in Boston. (The LSA will be at the Boston Marriott Copley Place, and the MLA at the Hynes Convention Center and the Sheraton Boston.) Eric has plugged the joint organized session on open access for which he will be a panelist, so allow me to do the same for another panel with MLA/LSA crossover appeal. The MLA’s Discussion Group on Lexicography has held a special panel for several years now, but many lexicographers and fellow travelers in linguistics have been unable to attend because of the conflict with the LSA and the concurrent meeting of the American Dialect Society. This time around, with the selected topic of “Digital Dictionaries,” the whole MLA/LSA/ADS crowd can join in.

Interested to hear your thoughts if you are able to attend!

From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas

Friday, May 18th, 2012

From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas by Valentin Spitkovsky and Peter Norvig (Google Research Team).

From the post:

Human language is both rich and ambiguous. When we hear or read words, we resolve meanings to mental representations, for example recognizing and linking names to the intended persons, locations or organizations. Bridging words and meaning — from turning search queries into relevant results to suggesting targeted keywords for advertisers — is also Google’s core competency, and important for many other tasks in information retrieval and natural language processing. We are happy to release a resource, spanning 7,560,141 concepts and 175,100,788 unique text strings, that we hope will help everyone working in these areas.

How do we represent concepts? Our approach piggybacks on the unique titles of entries from an encyclopedia, which are mostly proper and common noun phrases. We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages. Our dictionaries are cross-lingual, and any concept deemed too fine can be broadened to a desired level of generality using Wikipedia’s groupings of articles into hierarchical categories.

(examples omitted)

The database that we are providing was designed for recall. It is large and noisy, incorporating 297,073,139 distinct string-concept pairs, aggregated over 3,152,091,432 individual links, many of them referencing non-existent articles. For technical details, see our paper (to be presented at LREC 2012) and the README file accompanying the data. (emphasis added)

Did you catch those numbers?

Now there is a truly remarkable resource.

What will you make out of it?
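
One way to start thinking about that question is a toy version of the lookup the post describes: strings pointing to concepts, with link counts used to rank candidates. This is only a minimal sketch. The triples in `TOY_LINKS` and the function names are hypothetical and are not taken from the released data or its README, so check the README for the real file layout before adapting it.

```python
from collections import defaultdict

# Hypothetical (anchor string, concept URL, link count) triples.
# Made up for illustration; not drawn from the released resource.
TOY_LINKS = [
    ("jaguar", "en.wikipedia.org/wiki/Jaguar", 60),
    ("jaguar", "en.wikipedia.org/wiki/Jaguar_Cars", 30),
    ("jaguar", "en.wikipedia.org/wiki/Jacksonville_Jaguars", 10),
    ("big cat", "en.wikipedia.org/wiki/Jaguar", 5),
]


def build_index(links):
    """Aggregate link counts into string -> concept -> count."""
    index = defaultdict(lambda: defaultdict(int))
    for string, concept, count in links:
        index[string.lower()][concept] += count
    return index


def concepts_for(index, string):
    """Words to concepts: rank concepts by their share of the string's links."""
    counts = index.get(string.lower(), {})
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(concept, n / total) for concept, n in ranked]


def strings_for(index, concept):
    """Concepts back to words: every string seen linking to the concept."""
    return sorted(s for s, counts in index.items() if concept in counts)


index = build_index(TOY_LINKS)
print(concepts_for(index, "Jaguar"))
print(strings_for(index, "en.wikipedia.org/wiki/Jaguar"))
```

Because the resource is built for recall, anything along these lines would probably need a threshold on the share or the raw count before the low-frequency pairs become useful.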

Are visual dictionaries generalizable?

Sunday, May 13th, 2012

Are visual dictionaries generalizable? by Otavio A. B. Penatti, Eduardo Valle, and Ricardo da S. Torres

Abstract:

Mid-level features based on visual dictionaries are today a cornerstone of systems for classification and retrieval of images. Those state-of-the-art representations depend crucially on the choice of a codebook (visual dictionary), which is usually derived from the dataset. In general-purpose, dynamic image collections (e.g., the Web), one cannot have the entire collection in order to extract a representative dictionary. However, based on the hypothesis that the dictionary reflects only the diversity of low-level appearances and does not capture semantics, we argue that a dictionary based on a small subset of the data, or even on an entirely different dataset, is able to produce a good representation, provided that the chosen images span a diverse enough portion of the low-level feature space. Our experiments confirm that hypothesis, opening the opportunity to greatly alleviate the burden in generating the codebook, and confirming the feasibility of employing visual dictionaries in large-scale dynamic environments.

The authors use the Caltech-101 image set because of its “diversity.” That is odd, because they also cite the Caltech-256 image set, which was created to answer concerns about the lack of diversity in Caltech-101.

Not sure this paper answers the issues it raises about visual dictionaries.

Wanted to bring it to your attention because representative dictionaries (as opposed to comprehensive ones) may be lurking just beyond the semantic horizon.
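
For readers who have not met visual dictionaries before, here is a rough sketch of the idea in the abstract: cluster low-level descriptors from one collection into a codebook, then describe an image from a different collection as a histogram over that codebook. It is not the authors' pipeline. Plain k-means on toy 2-D points stands in for real image descriptors, and every name and value below is an assumption made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(codebook, descriptor):
    """Index of the codeword closest to the descriptor."""
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], descriptor))


def kmeans(descriptors, k, iterations=20, seed=0):
    """Plain k-means; the resulting centers act as the visual dictionary."""
    rng = random.Random(seed)
    centers = rng.sample(descriptors, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest(centers, d)].append(d)
        # Keep the old center when a cluster ends up empty.
        centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers


def encode(codebook, image_descriptors):
    """Normalized histogram of codeword assignments for one image."""
    hist = [0] * len(codebook)
    for d in image_descriptors:
        hist[nearest(codebook, d)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Codebook built from one (toy) dataset ...
source_descriptors = [(0, 0), (0, 1), (10, 10), (10, 11)]
codebook = kmeans(source_descriptors, k=2)

# ... used to encode an image from a different (toy) dataset.
other_image = [(1, 1), (9, 9), (9, 10)]
print(encode(codebook, other_image))
```

The paper's claim, in these terms, is that the histogram stays useful so long as the source descriptors cover enough of the low-level feature space, whether or not they come from the collection being encoded.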

SoSlang Crowdsources a Dictionary

Wednesday, March 21st, 2012

Stephen E. Arnold writes:

Here’s a surprising and interesting approach to dictionaries: have users build their own. SoSlang allows anyone to add a slang term and its definition. Beware, though, this site is not for everyone. Entries can be salty. R-rated, even. You’ve been warned.

I would compare this approach:

speakers -> usages -> dictionary

to a formal dictionary:

speakers -> usages -> editors -> formal dictionary

That is to say, a formal dictionary reflects the editors’ sense of the language and not the raw input of the speakers of a language.

It would be a very interesting text mining task to eliminate duplicate usages of terms so that the changing uses of a term can be tracked.

After DuPont bans Teflon from WordNet, the world is their non-sticky oyster

Tuesday, February 21st, 2012

Toma Tasovac reports on DuPont banning the term Teflon from WordNet, but not before observing:

I lived in the United States for more than a decade — long enough to know that litigation is not just a judiciary battle about enforcing legal rights: it’s a way of life. I have also over the years watched with amusement how dictionaries get used in American courtrooms, from Martha Nussbaum’s unfortunate reading of the Liddell-Scott on τόλμημα in Romer vs. Evans in 1993 to a recent case in which Chief Justice John G. Roberts Jr. parsed the meaning of a federal law by consulting no less than five dictionaries: one of the words he focused on was the preposition of. While Martha Nussbaum’s court drama about moral philosophy, scholarly integrity, homosexual desire and the nature of shame would make a great movie (starring, inevitably, as pretty much every other movie out there – Meryl Streep), Chief Justice Roberts’ dreadful, ho-hum lexicographic exercise would barely pass the Judge Judy test of how-low-can-we-go: he discovered that the meaning of of had something to do with belonging or possession. Pass the remote, please!

Who rules/owns our vocabularies?

There are serious issues at stake but take a few minutes to enjoy this post.

ODLIS: Online Dictionary for Library and Information Science

Friday, February 10th, 2012

ODLIS: Online Dictionary for Library and Information Science by Joan M. Reitz.

ODLIS is known to all librarians and graduate school library students but perhaps not to those of us who abuse library terminology in CS and related pursuits. Can’t promise it will make our usage any better but certainly won’t make it any worse. ;-)

This would make a very interesting “term for a day” type resource.

Certainly one you should bookmark and browse at your leisure.

History of the Dictionary

ODLIS began at the Haas Library in 1994 as a four-page printed handout titled Library Lingo, intended for undergraduates not fluent in English and for English-speaking students unfamiliar with basic library terminology. In 1996, the text was expanded and converted to HTML format for installation on the WCSU Libraries Homepage under the title Hypertext Library Lingo: A Glossary of Library Terminology. In 1997, many more hypertext links were added and the format improved in response to suggestions from users. During the summer of 1999, several hundred terms and definitions were added, and a generic version was created that omitted all reference to specific conditions and practices at the Haas Library.

In the fall of 1999, the glossary was expanded to 1,800 terms, renamed to reflect its extended scope, and copyrighted. In February, 2000, ODLIS was indexed in Yahoo! under “Reference – Dictionaries – Subject.” It was also indexed in the WorldCat database, available via OCLC FirstSearch. During the year 2000, the dictionary was expanded to 2,600 terms and by 2002 an additional 800 terms had been added. From 2002 to 2004, the dictionary was expanded to 4,200 terms and cross-references were added, in preparation for the print edition. Since 2004, an additional 600 terms and definitions have been added.

Purpose of the Dictionary

ODLIS is designed as a hypertext reference resource for library and information science professionals, university students and faculty, and users of all types of libraries. The primary criterion for including a term is whether a librarian or other information professional might reasonably be expected to know its meaning in the context of his or her work. A newly coined term is added when, in the author’s judgment, it is likely to become a permanent addition to the lexicon of library and information science. The dictionary reflects North American practice; however, because ODLIS was first developed as an online resource available worldwide, with an e-mail contact address for feedback, users from many countries have contributed to its growth, often suggesting additional terms and commenting on existing definitions. Expansion of the dictionary is an ongoing process.

Broad in scope, ODLIS includes not only the terminology of the various specializations within library science and information studies but also the vocabulary of publishing, printing, binding, the book trade, graphic arts, book history, literature, bibliography, telecommunications, and computer science when, in the author’s judgment, a definition might prove useful to librarians and information specialists in their work. Entries are descriptive, with examples provided when appropriate. The definitions of terms used in the Anglo-American Cataloging Rules follow AACR2 closely and are therefore intended to be prescriptive. The dictionary includes some slang terms and idioms and a few obsolete terms, often as See references to the term in current use. When the meaning of a term varies according to the field in which it is used, priority is given to the definition that applies within the field with which it is most closely associated. Definitions unrelated to library and information science are generally omitted. As a rule, definition is given under an acronym only when it is generally used in preference to the full term. Alphabetization is letter-by-letter. The authority for spelling and hyphenation is Webster’s New World Dictionary of the American Language (College Edition). URLs, current as of date of publication, are updated annually.

Be careful with dictionary-based text analysis

Wednesday, October 12th, 2011

Brendan O’Connor writes:

OK, everyone loves to run dictionary methods for sentiment and other text analysis — counting words from a predefined lexicon in a big corpus, in order to explore or test hypotheses about the corpus. In particular, this is often done for sentiment analysis: count positive and negative words (according to a sentiment polarity lexicon, which was derived from human raters or previous researchers’ intuitions), and then proclaim the output yields sentiment levels of the documents. More and more papers come out every day that do this. I’ve done this myself. It’s interesting and fun, but it’s easy to get a bunch of meaningless numbers if you don’t carefully validate what’s going on. There are certainly good studies in this area that do further validation and analysis, but it’s hard to trust a study that just presents a graph with a few overly strong speculative claims as to its meaning. This happens more than it ought to.

How does “measurement” of sentiment in a document differ from “measurement” of the semantics of terms in that document?

Have we traded validated collections for “access” to large numbers of documents (think of the usual Internet search engine)? By validated collections I mean the discipline-based indexes where the user did not have to weed out completely irrelevant results.
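
The dictionary method O’Connor describes is simple enough to sketch in a few lines, which is part of why it is so easy to misuse. The tiny lexicon below is hypothetical and stands in for a real sentiment polarity lexicon; the function names are likewise made up for illustration.

```python
import re

# Hypothetical toy lexicon; a real study would use a published polarity lexicon.
POSITIVE = {"good", "happy", "love"}
NEGATIVE = {"bad", "sad", "hate"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexicon_score(text):
    """Count lexicon hits and report a net score normalized by length."""
    tokens = tokenize(text)
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    net = (pos - neg) / len(tokens) if tokens else 0.0
    return {"pos": pos, "neg": neg, "tokens": len(tokens), "net": net}


# Negation is invisible to word counting, so this sentence scores as positive.
print(lexicon_score("This is not good"))
```

The last line shows the kind of failure that validation is meant to catch: a plain count treats “not good” as a positive hit.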

Web Pages Clustering: A New Approach

Wednesday, September 7th, 2011

Web Pages Clustering: A New Approach by Jeevan H E, Prashanth P P, Punith Kumar S N, and Vinay Hegde.

Abstract:

The rapid growth of web has resulted in vast volume of information. Information availability at a rapid speed to the user is vital. English language (or any for that matter) has lot of ambiguity in the usage of words. So there is no guarantee that a keyword based search engine will provide the required results. This paper introduces the use of dictionary (standardised) to obtain the context with which a keyword is used and in turn cluster the results based on this context. These ideas can be merged with a metasearch engine to enhance the search efficiency.

The first part of this paper is concerned with the use of a dictionary to create separate queries for each “sense” of a term. I am not sure that is an innovation.

I don’t have the citation at hand but seem to recall that term rewriting for queries has used something very much like a dictionary. Perhaps not a “dictionary” in the conventional sense but I would not even bet on that. Does anyone have a better memory than mine, or work in query rewriting?
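
To make the “separate queries for each sense” idea concrete, here is a minimal sketch. It is not the paper's algorithm. The sense dictionary, the word-overlap rule for grouping results, and all names are assumptions made for illustration.

```python
import re

# Hypothetical sense dictionary: keyword -> sense label -> context words.
SENSE_DICTIONARY = {
    "jaguar": {
        "animal": ["cat", "wildlife", "rainforest"],
        "car": ["automobile", "vehicle", "engine"],
    }
}


def queries_per_sense(keyword, sense_dictionary):
    """Build one expanded query string per dictionary sense of the keyword."""
    senses = sense_dictionary.get(keyword.lower(), {})
    return {sense: " ".join([keyword] + words) for sense, words in senses.items()}


def cluster_by_sense(keyword, snippets, sense_dictionary):
    """Assign each result snippet to the sense whose context words it shares most."""
    senses = sense_dictionary.get(keyword.lower(), {})
    clusters = {sense: [] for sense in senses}
    clusters["unassigned"] = []
    for snippet in snippets:
        tokens = set(re.findall(r"[a-z]+", snippet.lower()))
        overlaps = {sense: len(tokens & set(words)) for sense, words in senses.items()}
        best = max(overlaps, key=overlaps.get, default=None)
        if best is None or overlaps[best] == 0:
            clusters["unassigned"].append(snippet)
        else:
            clusters[best].append(snippet)
    return clusters


print(queries_per_sense("jaguar", SENSE_DICTIONARY))
print(cluster_by_sense(
    "jaguar",
    ["The jaguar is a big cat of the rainforest", "Jaguar unveils a new engine"],
    SENSE_DICTIONARY,
))
```

Each expanded query could be sent to a metasearch engine separately, or the overlap rule could be applied to the results of the original query.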