Archive for the ‘Dictionary’ Category

Statistics vs. Machine Learning Dictionary (flat text vs. topic map)

Saturday, December 16th, 2017

Data science terminology (UBC Master of Data Science)

From the webpage:

About this document

This document is intended to help students navigate the large amount of jargon, terminology, and acronyms encountered in the MDS program and beyond. There is also an accompanying blog post.

Stat-ML dictionary

This section covers terms that have different meanings in different contexts, specifically statistics vs. machine learning (ML).
… (emphasis in original)

Gasp! You don’t mean that the same words have different meanings in machine learning and statistics!

Even more shocking, some words/acronyms have the same meaning!

Never fear, a human reader can use this document to distinguish the usages.

Automated processors, not so much.

If these terms were treated as occurrences of topics scoped by statistics and machine learning respectively, then any scoped document could be supplied with an enhanced view showing the correct definition for the unsteady reader.

Static markup of legacy documents is not required as annotations can be added as a document is streamed to a reader. Opening the potential, of course, for different annotations depending upon the skill and interest of the reader.

If each term/subject carried more properties than just a scope of statistics, machine learning, or both, users of the topic map could search on those properties to match terms not included here. For example, which type of bias (in statistics) does bias mean in your paper? A casually written Wikipedia article reports twelve, and with refinement the number could be higher.
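A minimal sketch of scoped definitions in Python (the entries below are invented for illustration, not taken from the UBC document): definitions are keyed by term and scope, and the annotation applied depends on the scope of the document being read.

```python
# Toy scoped glossary: term -> {scope: definition}. Entries are invented
# for illustration only.
GLOSSARY = {
    "regression": {
        "statistics": "Modeling the relationship between variables.",
        "machine-learning": "Supervised learning with a continuous target.",
    },
    "bias": {
        "statistics": "Systematic deviation of an estimator from the truth.",
        "machine-learning": "The intercept term of a model.",
    },
}

def annotate(text, scope):
    """Attach the scope-appropriate definition to each known term."""
    notes = {}
    for word in text.lower().split():
        term = word.strip(".,;:")
        if term in GLOSSARY and scope in GLOSSARY[term]:
            notes[term] = GLOSSARY[term][scope]
    return notes

print(annotate("We fit a regression and checked its bias.", "statistics"))
```

The same sentence annotated with scope "machine-learning" would pick up the other definitions, which is the point: one stream of text, different annotations per reader.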

Flat text is far easier to write than a topic map but tasks every reader with re-discovering the distinctions already known to the author of the document.

Imagine your office, department, agency’s vocabulary and its definitions captured and then used to annotate internal or external documentation for your staff.

Instead of every new staffer asking (hopefully), what do we mean by (your common term), the definition appears with a mouse-over in a document.

Are you capturing the soft knowledge of your staff?

Building a Telecom Dictionary scraping web using rvest in R [Tunable Transparency]

Tuesday, December 5th, 2017

Building a Telecom Dictionary scraping web using rvest in R by Abdul Majed Raja.

From the post:

One of the biggest problems in Business to carry out any analysis is the availability of Data. That is where in many cases, Web Scraping comes very handy in creating that data that’s required. Consider the following case: To perform text analysis on Textual Data collected in a Telecom Company as part of Customer Feedback or Reviews, primarily requires a dictionary of Telecom Keywords. But such a dictionary is hard to find out-of-box. Hence as an Analyst, the most obvious thing to do when such dictionary doesn’t exist is to build one. Hence this article aims to help beginners get started with web scraping with rvest in R and at the same time, building a Telecom Dictionary by the end of this exercise.
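The glossary-scraping step the post describes can be sketched in Python with only the standard library (the post itself uses rvest in R; the HTML snippet and the telecom terms below are invented for illustration):

```python
from html.parser import HTMLParser

class GlossaryParser(HTMLParser):
    """Collect <dt>term</dt><dd>definition</dd> pairs into a dict."""
    def __init__(self):
        super().__init__()
        self.entries = {}
        self._tag = None
        self._term = None

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._tag == "dt":
            self._term = text
        elif self._tag == "dd" and self._term:
            self.entries[self._term] = text

# Stand-in for a fetched glossary page
page = """
<dl>
  <dt>ARPU</dt><dd>Average revenue per user.</dd>
  <dt>Churn</dt><dd>Rate at which subscribers leave.</dd>
</dl>
"""
parser = GlossaryParser()
parser.feed(page)
print(parser.entries)
```

In practice you would fetch the page over HTTP first; the parsing step is the same either way.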

Great for scraping an existing glossary but as always, it isn’t possible to extract information that isn’t captured by the original glossary.

Things like the scope of applicability for the terms, language, author, organization, even characteristics of the subjects the terms represent.

Of course, if your department invested in collecting that information for every subject in the glossary, there is no external requirement that on export all that information be included.

That is, your “data silo” can have tunable transparency: you enable others to use your data with as much or as little semantic friction as the situation merits.

For some data borrowers, they get opaque spreadsheet field names, column1, column2, etc.

Other data borrowers, perhaps those willing to help defray the cost of semantic annotation, well, they get a more transparent view of the data.

One possible method of making semantic annotation and its maintenance a revenue center rather than a cost center.

A Dictionary of Victorian Slang (1909)

Tuesday, June 20th, 2017

Passing English of the Victorian era, a dictionary of heterodox English, slang and phrase (1909) by J. Redding Ware.

Quoted from the Preface:

HERE is a numerically weak collection of instances of ‘Passing English’. It may be hoped that there are errors on every page, and also that no entry is ‘quite too dull’. Thousands of words and phrases in existence in 1870 have drifted away, or changed their forms, or been absorbed, while as many have been added or are being added. ‘Passing English’ ripples from countless sources, forming a river of new language which has its tide and its ebb, while its current brings down new ideas and carries away those that have dribbled out of fashion. Not only is ‘Passing English’ general; it is local; often very seasonably local. Careless etymologists might hold that there are only four divisions of fugitive language in London: west, east, north and south. But the variations are countless. Holborn knows little of Petty Italia behind Hatton Garden, and both these ignore Clerkenwell, which is equally foreign to Islington proper; in the South, Lambeth generally ignores the New Cut, and both look upon Southwark as linguistically out of bounds; while in Central London, Clare Market (disappearing with the nineteenth century) had, if it no longer has, a distinct fashion in words from its great and partially surviving rival through the centuries the world of Seven Dials, which is in St Giles’s, St James’s being practically in the next parish. In the East the confusion of languages is a world of ‘variants’; there must be half-a-dozen of Anglo-Yiddish alone, all, however, outgrown from the Hebrew stem. ‘Passing English’ belongs to all the classes, from the peerage class who have always adopted an imperfection in speech or frequency of phrase associated with the court, to the court of the lowest costermonger, who gives the fashion to his immediate entourage.

A healthy reminder that language is no more fixed and unchanging than the people who use it.


Looking up words in the OED with XQuery [Details on OED API Key As Well]

Saturday, January 14th, 2017

Looking up words in the OED with XQuery by Clifford Anderson.

Clifford has posted a gist of work from the @VandyLibraries XQuery group, looking up words in the Oxford English Dictionary (OED) with XQuery.

To make full use of Clifford’s post, you will need a key for the Oxford Dictionaries API.

If you go straight to the regular Oxford English Dictionary (I’m omitting the URL so you don’t make the same mistake), there is nary a mention of the Oxford Dictionaries API.

The free plan allows 3K queries a month.

Not enough to shut out the outside world for the next four/eight years but enough to decide if it’s where you want to hide.

Application for the free API key was simple enough.

Save that the dumb password checker insisted on one or more special characters, plus one or more digits, plus upper and lowercase. When you get beyond 12 characters the insistence on a special character is just a little lame.

Email response with the key was fast, so I’m in!
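With a key in hand, a lookup request is straightforward to construct. A hedged sketch in Python: the endpoint path, version, and the `app_id`/`app_key` header names below are my assumptions, so verify them against the current Oxford Dictionaries API documentation before relying on them.

```python
import urllib.request

# Assumed base URL; check the current Oxford Dictionaries API docs
BASE = "https://od-api.oxforddictionaries.com/api/v2"

def build_lookup(word, app_id, app_key, lang="en-gb"):
    """Construct (but do not send) an authenticated entry lookup request."""
    url = f"{BASE}/entries/{lang}/{word.lower()}"
    return urllib.request.Request(
        url, headers={"app_id": app_id, "app_key": app_key}
    )

req = build_lookup("dictionary", "MY_APP_ID", "MY_APP_KEY")
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` returns JSON; with the free plan's 3K queries a month, cache responses locally rather than re-querying.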

What about you?

Green’s Dictionary of Slang [New Commercializing Information Model?]

Friday, October 14th, 2016

Green’s Dictionary of Slang

From the about page:

Green’s Dictionary of Slang is the largest historical dictionary of English slang. Written by Jonathon Green over 17 years from 1993, it reached the printed page in 2010 in a three-volume set containing nearly 100,000 entries supported by over 400,000 citations from c. ad 1000 to the present day. The main focus of the dictionary is the coverage of over 500 years of slang from c. 1500 onwards.

The printed version of the dictionary received the Dartmouth Medal for outstanding works of reference from the American Library Association in 2012; fellow recipients include the Dictionary of American Regional English, the Oxford Dictionary of National Biography, and the New Grove Dictionary of Music and Musicians. It has been hailed by the American New York Times as ‘the pièce de résistance of English slang studies’ and by the British Sunday Times as ‘a stupendous achievement, in range, meticulous scholarship, and not least entertainment value’.

On this website the dictionary is now available in updated online form for the first time, complete with advanced search tools enabling search by definition and history, and an expanded bibliography of slang sources from the early modern period to the present day. Since the print edition, nearly 60,000 quotations have been added, supporting 5,000 new senses in 2,500 new entries and sub-entries, of which around half are new slang terms from the last five years.

Green’s Dictionary of Slang has an interesting commercial model.

You can search for any word, freely, but “more search features” requires a subscription:

By subscribing to Green’s Dictionary of Slang Online, you gain access to advanced search tools (including the ability to search for words by meaning, history, and usage), full historical citations in each entry, and a bibliography of over 9,000 slang sources.

Current rate for individuals is £ 49 (or about $59.96).

In addition to being a fascinating collection of information, is the free/commercial split here of interest?

An alternative to:

The Teaser Model

Contrast the Oxford Music Online:

Grove Music Online is the eighth edition of Grove’s Dictionary of Music and Musicians, and contains articles commissioned specifically for the site as well as articles from New Grove 2001, Grove Opera, and Grove Jazz. The recently published second editions of The Grove Dictionary of American Music and The Grove Dictionary of Musical Instruments are still being put online, and new articles are added to GMO with each site update.

Oh, Oxford Music Online isn’t all pay-per-view.

It offers the following thirteen (13) articles for free viewing:

Sotiria Bellou, Greek singer of rebetiko song, famous for the special quality and register of her voice

Cell [Mobile] Phone Orchestra, ensemble of performers using programmable mobile (cellular) phones

Crete, largest and most populous of the Greek islands

Lyuba Encheva, Bulgarian pianist and teacher

Gaaw, generic term for drums, and specifically the frame drum, of the Tlingit and Haida peoples of Alaska

Johanna Kinkel, German composer, writer, pianist, music teacher, and conductor

Lady’s Glove Controller, modified glove that can control sound, mechanical devices, and lights

Outsider music, a loosely related set of recordings that do not fit well within any pre-existing generic framework

Peter (Joshua) Sculthorpe, Australian composer, seen by the Australian musical public as the most nationally representative.

Slovenia, country in southern Central Europe

Sound art, a term encompassing a variety of art forms that utilize sound, or comment on auditory cultures

Alice (Bigelow) Tully, American singer and music philanthropist

Wars in Iraq and Afghanistan, soldiers’ relationship with music is largely shaped by contemporary audio technology

Hmmm, 160,000 slang terms for free from Green’s Dictionary of Slang versus 13 free articles from Oxford Music Online.

Show of hands for the teaser model of Oxford Music Online?

The Consumer As Product

You are aware that casual web browsing and alleged “free” sites are not just supported by ads, but by the information they collect on you?

Consider this rather boastful touting of information collection capabilities:

To collect online data, we use our native tracking tags as experience has shown that other methods require a great deal of time, effort and cost on both ends and almost never yield satisfactory coverage or results since they depend on data provided by third parties or compiled by humans (!!), without being able to verify the quality of the information. We have a simple universal server-side tag that works with most tag managers. Collecting offline marketing data is a bit trickier. For TV and radio, we will work with your offline advertising agency to collect post-log reports on a weekly basis, transmitted to a secure FTP. Typical parameters include flight and cost, date/time stamp, network, program, creative length, time of spot, GRP, etc.

Convertro is also able to collect other type of offline data, such as in-store sales, phone orders or catalog feeds. Our most popular proprietary solution involves placing a view pixel within a confirmation email. This makes it possible for our customers to tie these users to prior online activity without sharing private user information with us. For some customers, we are able to match almost 100% of offline sales. Other customers that have different conversion data can feed them into our system and match it to online activity by partnering with LiveRamp. These matches usually have a success rate between 30%-50%. Phone orders are tracked by utilizing a smart combination of our in-house approach, the inputting of special codes, or by third party vendors such as Mongoose and ResponseTap.

You don’t have to be on the web, you can be tracked “in-store,” on the phone, etc.

Convertro doesn’t explicitly mention “supercookies,” for which Verizon just paid a $1.35 million fine. From the post:

“Supercookies,” known officially as unique identifier headers [UIDH], are short-term serial numbers used by corporations to track customer data for advertising purposes. According to Jacob Hoffman-Andrews, a technologist with the Electronic Frontier Foundation, these cookies can be read by any web server one visits and used to build individual profiles of internet habits. These cookies are hard to detect, and even harder to get rid of.

If any of that sounds objectionable to you, remember that to be valuable, user habits must be tracked.

That is if you find the idea of being a product acceptable.

The Green’s Dictionary of Slang offers an economic model that enables free access to casual users, kids writing book reports, journalists, etc., while at the same time creating a value-add that power users will pay for.

Other examples of value-add models with free access to the core information?

What would that look like for the Podesta emails?

Dictionary of Fantastic Vocabulary [Increasing the Need for Topic Maps]

Monday, April 18th, 2016

Dictionary of Fantastic Vocabulary by Greg Borenstein.

Alexis Lloyd tweeted this link along with:

This is utterly fantastic.

Well, it certainly increases the need for topic maps!

From the bot description on Twitter:

Generating new words with new meanings out of the atoms of English.

Ahem, are you sure about that?

Is a bot generating meaning?

Or are readers conferring meaning on the new words as they are read?

If, as I contend, readers confer meaning, the utterance of every “new” word opens up as many new meanings as there are readers of the “new” word.

Example of people conferring different meanings on a term?

Ask a dozen people what is meant by “shot” in:

It’s just a shot away

When Lisa Fischer breaks into her solo in “Gimme Shelter” (best played loud).

Differences in meanings make for funny moments, awkward pauses, blushes, in casual conversation.

What if the stakes are higher?

What if you need to produce (or destroy) all the emails by “bobby1.”

Is it enough to find some of them?

What have you looked for lately? Did you find all of it? Or only some of it?

New words appear every day.

You are already behind. You will get further behind using search.

Challenges of Electronic Dictionary Publication

Wednesday, February 17th, 2016

Challenges of Electronic Dictionary Publication

From the webpage:

April 8-9th, 2016

Venue: University of Leipzig, GWZ, Beethovenstr. 15; H1.5.16

This April we will be hosting our first Dictionary Journal workshop. At this workshop we will give an introduction to our vision of „Dictionaria“, introduce our data model and current workflow and will discuss (among others) the following topics:

  • Methodology and concept: How are dictionaries of „small“ languages different from those of „big“ languages and what does this mean for our endeavour? (documentary dictionaries vs. standard dictionaries)
  • Reviewing process and guidelines: How to review and evaluate a dictionary database of minor languages?
  • User-friendliness: What are the different audiences and their needs?
  • Submission process and guidelines: reports from us and our first authors on how to submit and what to expect
  • Citation: How to cite dictionaries?

If you are interested in attending this event, please send an e-mail to dictionary.journal[AT]

Workshop program

Our workshop program can now be downloaded here.

See the webpage for a list of confirmed participants, some with submitted abstracts.

Any number of topic map related questions arise in a discussion of dictionaries.

  • How to represent dictionary models?
  • What properties should be used to identify the subjects that represent dictionary models?
  • On what basis, if any, should dictionary models be considered the same or different? And for what purposes?
  • What data should be captured by dictionaries and how should it be identified?
  • etc.

Those are only a few of the questions that could be refined into dozens, if not hundreds more, when you reach the details of constructing a dictionary.

I won’t be attending but await with great anticipation the output from this workshop!

ROOT Files

Friday, March 21st, 2014

ROOT Files

From the webpage:

Today, a huge amount of data is stored into files present on our PC and on the Internet. To achieve the maximum compression, binary formats are used, hence they cannot simply be opened with a text editor to fetch their content. Rather, one needs to use a program to decode the binary files. Quite often, the very same program is used both to save and to fetch the data from those files, but it is also possible (and advisable) that other programs are able to do the same. This happens when the binary format is public and well documented, but may happen also with proprietary formats that became a standard de facto. One of the most important problems of the information era is that programs evolve very rapidly, and may also disappear, so that it is not always trivial to correctly decode a binary file. This is often the case for old files written in binary formats that are not publicly documented, and is a really serious risk for the formats implemented in custom applications.

As a solution to these issues ROOT provides a file format that is a machine-independent compressed binary format, including both the data and its description, and provides an open-source automated tool to generate the data description (or “dictionary“) when saving data, and to generate C++ classes corresponding to this description when reading back the data. The dictionary is used to build and load the C++ code to load the binary objects saved in the ROOT file and to store them into instances of the automatically generated C++ classes.

ROOT files can be structured into “directories“, exactly in the same way as your operative system organizes the files into folders. ROOT directories may contain other directories, so that a ROOT file is more similar to a file system than to an ordinary file.

Amit Kapadia mentions ROOT files in his presentation at CERN on citizen science.

I have only just begun to read the documentation but wanted to pass this starting place along to you.

I don’t find the “machine-independent compressed binary format” argument all that convincing but apparently it has in fact worked for quite some time.

Of particular interest will be the data dictionary aspects of ROOT.
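The self-describing idea, data shipped together with its own "dictionary", can be sketched without ROOT at all. A toy version in Python using JSON (ROOT itself uses a compressed binary format and generates C++ classes; this only illustrates the principle):

```python
import json
import os
import tempfile

def save_with_dictionary(path, records, description):
    """Write the data together with its own description ('dictionary')."""
    with open(path, "w") as f:
        json.dump({"dictionary": description, "data": records}, f)

def load_with_dictionary(path):
    """Decode rows using the embedded description; no external schema needed."""
    with open(path) as f:
        payload = json.load(f)
    fields = payload["dictionary"]["fields"]
    return [dict(zip(fields, row)) for row in payload["data"]]

path = os.path.join(tempfile.gettempdir(), "events.json")
save_with_dictionary(path, [[1, 13.6], [2, 7.2]], {"fields": ["event", "energy_gev"]})
print(load_with_dictionary(path)[0])  # {'event': 1, 'energy_gev': 13.6}
```

A reader written years later needs nothing but the file itself, which is exactly the longevity argument the ROOT documentation makes.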

Other data and description capturing file formats?

American Regional English dictionary going online (DARE)

Thursday, November 28th, 2013

American Regional English dictionary going online by Scott Bauer.

From the post:

University of Wisconsin students and researchers set out in “word wagons” nearly 50 years ago to record the ways Americans spoke in various parts of the country.

Now, they’re doing it again, only virtually.

This time they won’t be lugging reel-to-reel tape recorders or sleeping in vans specially equipped with beds, stoves and sinks. Instead, work to update the Dictionary of American Regional English is being done in front of computers, reading online survey results.

“Of course, language changes and a lot of people have the notion that American English is becoming homogenized,” said Joan Houston Hall, who has worked on the dictionary since 1975 and served as its editor since 2000.

The only way to determine if that is true, though, is to do more research, she said.

The dictionary, known as DARE, has more than 60,000 entries exposing variances in the words, phrases, pronunciations, and pieces of grammar and syntax used throughout the country. Linguists consider it a national treasure, and it has been used by everyone from a criminal investigator in the 1990s tracking down the Unabomber to Hollywood dialect coaches trying to be as authentic as possible.

A great resource if you are creating topic maps for American literature during the time period in question.

Be aware that field work stopped in 1970 and any supplements will be by online survey:

Even though no new research has been done for the dictionary since 1970, Hall said she hopes it can now be updated more frequently now that it is going online. The key will be gathering new data tracking how language has changed, or stayed the same, since the first round of field work ended 43 years ago.

But why not break out the 21st century version of the “word wagon” and head out in the field again?

“Because it would be way too expensive and time-consuming,” Hall said, laughing.

So, instead, Hall is loading up the virtual “word wagon” also known as the online survey.

For language usage, there is a forty-three (43) year gap in coverage. Use caution as the vocabulary you are researching moves away from 1970.

The continuation of the project by online surveys will only capture evidence from people who complete online surveys.

Keep that limitation in mind when using DARE after it resumes “online” field work.

Personally, I would prefer more complete field work over the noxious surveillance adventures by non-democratic elements of the U.S. government.

BTW, DARE Digital, from Harvard Press is reported to set you back $150/year.

The Historical Thesaurus of English

Thursday, November 14th, 2013

The Historical Thesaurus of English

From the webpage:

The Historical Thesaurus of English project was initiated by the late Professor Michael Samuels in 1965 and completed in 2008. It contains almost 800,000 word meanings from Old English onwards, arranged in detailed hierarchies within broad conceptual categories such as Thought or Music. It is based on the second edition of the Oxford English Dictionary and its Supplements, with additional materials from A Thesaurus of Old English, and was published in print as the Historical Thesaurus of the OED by Oxford University Press on 22 October 2009.

This electronic version enables users to pinpoint the range of meanings of a word throughout its history, their synonyms, and their relationship to words of more general or more specific meaning. In addition to providing hitherto unavailable information for linguistic and textual scholars, the Historical Thesaurus online is a rich resource for students of social and cultural history, showing how concepts developed through the words that refer to them. Links to Oxford English Dictionary headwords are provided for subscribers to the online OED, which also links the two projects on its own site.

Take particular note of:

This electronic version enables users to pinpoint the range of meanings of a word throughout its history, their synonyms, and their relationship to words of more general or more specific meaning.

Ooooh, that means that words don’t have fixed meanings. Or that not everyone reads them the same way.

Want to improve your enterprise search results? A maintained domain/enterprise specific thesaurus would be a step in that direction.
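One simple way a thesaurus improves search results is query expansion. A minimal sketch (the thesaurus entries here are invented stand-ins for a maintained domain vocabulary):

```python
# Hypothetical domain thesaurus: term -> synonyms/related terms
THESAURUS = {
    "churn": ["attrition", "customer loss"],
    "arpu": ["average revenue per user"],
}

def expand_query(query):
    """Expand each query token with its thesaurus entries before searching."""
    terms = []
    for token in query.lower().split():
        terms.append(token)
        terms.extend(THESAURUS.get(token, []))
    return terms

print(expand_query("churn report"))
# ['churn', 'attrition', 'customer loss', 'report']
```

A document that only ever says "attrition" now matches a query for "churn", which is precisely what a flat keyword index cannot do.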

Not to mention a thesaurus could reduce the 42% of people who use the wrong information to make decisions to a lesser number. (Findability As Value Proposition)

Unless you are happy with the 60/40 Rule, where 40% of your executives are making decisions based on incorrect information.

I wouldn’t be.

[Humans Only]

Monday, October 21st, 2013

From the About page:

Life is full of choices to make, so are the differences. Differentiation is the identity of a person or any item.

Throughout our life we have to make number of choices. To make the right choice we need to know what makes one different from the other.

We know that making the right choice is the hardest task we face in our life and we will never be satisfied with what we chose, we tend to think the other one would have been better. We spend a lot of time on making decision between A and B.

And the information that guide us to make the right choice should be unbiased, easily accessible, freely available, no hidden agendas and have to be simple and self explanatory, while adequately informative. Information is everything in decision making. That’s where comes in. We make your life easy by guiding you to distinguish the differences between anything and everything, so that you can make the right choices.

Whatever the differences you want to know, be it about two people, two places, two items, two concepts, two technologies or whatever it is, we have the answer. We have not confined ourselves in to limits. We have a very wide collection of information, that are diverse, unbiased and freely available. In our analysis we try to cover all the areas such as what is the difference, why the difference and how the difference affect.

What we do at, we team up with selected academics, subject matter experts and script writers across the world to give you the best possible information in differentiating any two items.

Easy Search: We have added search engine for viewers to go direct to the topic they are searching for, without browsing page by page.

Sam Hunting forwarded this to my attention.

I listed it under dictionary and disambiguation but I am not sure either of those is correct.


And my current favorite:

Difference Between Lucid Dreaming and Astral Projection

Never has occurred to me to confuse those two. 😉

There are over five hundred and twenty (520) pages and assuming an average of sixteen (16) entries per page, there are over eight thousand (8,000) entries today.

Unstructured prose is used to distinguish one subject from another, rather than formal properties.

Being human really helps with the distinctions given in the articles.

Google expands define but drops dictionary

Wednesday, September 11th, 2013

Google expands define but drops dictionary by Karen Blakeman.

From the post:

Google has added extra information to its web definitions. When using the ‘define’ command, an expandable box now appears containing additional synonyms, how the word is used in a sentence, the origins of the word, the use of the word over time and translations. At the moment it is only available in and you no longer need the colon immediately after define. So, for definitions of dialectic simply type in define dialectic.

Google Define

The box gives definitions and synonyms of the word and the ‘More’ link gives you an example of its use in a sentence.

Karen lays out how you can use “define” to your best advantage.

What has my curiosity up is the thought of using a keyword like “define” in a topic map interface.

Rather than giving a user all the information about a subject, it could create an on-the-fly thumbnail of the subject, which the user can then follow or not.
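A toy sketch of that interface idea (the topic entry and its properties are invented): a leading "define" keyword short-circuits full search and returns a thumbnail built from the topic's properties.

```python
# Hypothetical topic map entries: subject -> properties
TOPICS = {
    "dialectic": {
        "definition": "Inquiry through dialogue and opposing arguments.",
        "see_also": ["rhetoric", "logic"],
    },
}

def handle_query(query):
    """Treat a leading 'define' keyword as a request for a subject thumbnail."""
    if query.lower().startswith("define "):
        subject = query[7:].strip().lower()
        topic = TOPICS.get(subject)
        if topic:
            return (f"{subject}: {topic['definition']} "
                    f"(see also: {', '.join(topic['see_also'])})")
        return f"{subject}: no topic found"
    return None  # fall through to ordinary full search

print(handle_query("define dialectic"))
```

The "see also" links are where the user decides whether to follow the subject further, which is the thumbnail-first behavior described above.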

“tweet” enters the Oxford English Dictionary

Monday, June 17th, 2013

A heads up for the June 2013 OED release

From the post:

The shorter a word is, generally speaking, the more complex it is lexicographically. Short words are likely to be of Germanic origin, and so to derive from the earliest bedrock of English words; they have probably survived long enough in the language to spawn many new sub-senses; they are almost certain to have generated many fixed compounds and phrases often taking the word into undreamt-of semantic areas; and last but not least they have typically formed the basis of secondary derivative words which in turn develop a life of their own.

All of these conditions apply to the three central words in the current batch of revised entries: hand, head, and heart. Each one of these dates in English from the earliest times and forms part of a bridge back to the Germanic inheritance of English. The revised and updated range contains 2,875 defined items, supported by 22,116 illustrative quotations.


The noun and verb tweet (in the social-networking sense) has just been added to the OED. This breaks at least one OED rule, namely that a new word needs to be current for ten years before consideration for inclusion. But it seems to be catching on.

Dictionaries, particularly ones like the OED, should be all the evidence needed to prove semantic diversity is everywhere.

But I don’t think anyone really contests that point.

Disagreement arises when others refuse to abandon their incorrect understanding of terms and to adhere to the meanings intended by a speaker.

A speaker understands themselves perfectly and so expects their audience to put forth the effort to do the same.

No surprise that we have so many silos, since we have personal, family, group and enterprise silos.

What is surprising is that we communicate as well as we do, despite the many layers of silos.

Panel on Digital Dictionaries (MLA/LSA/ADS)

Wednesday, September 26th, 2012

Panel on Digital Dictionaries (MLA/LSA/ADS) by Ben Zimmer.

From the post:

Eric Baković has noted the happy confluence of the annual meetings of the Linguistic Society of America and the Modern Language Association, both scheduled for January 3-6, 2013 at sites within reasonable walking distance of each other in Boston. (The LSA will be at the Boston Marriott Copley Place, and the MLA at the Hynes Convention Center and the Sheraton Boston.) Eric has plugged the joint organized session on open access for which he will be a panelist, so allow me to do the same for another panel with MLA/LSA crossover appeal. The MLA’s Discussion Group on Lexicography has held a special panel for several years now, but many lexicographers and fellow travelers in linguistics have been unable to attend because of the conflict with the LSA and the concurrent meeting of the American Dialect Society. This time around, with the selected topic of “Digital Dictionaries,” the whole MLA/LSA/ADS crowd can join in.

Interested to hear your thoughts if you are able to attend!

From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas

Friday, May 18th, 2012

From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas by Valentin Spitkovsky and Peter Norvig (Google Research Team).

From the post:

Human language is both rich and ambiguous. When we hear or read words, we resolve meanings to mental representations, for example recognizing and linking names to the intended persons, locations or organizations. Bridging words and meaning — from turning search queries into relevant results to suggesting targeted keywords for advertisers — is also Google’s core competency, and important for many other tasks in information retrieval and natural language processing. We are happy to release a resource, spanning 7,560,141 concepts and 175,100,788 unique text strings, that we hope will help everyone working in these areas.

How do we represent concepts? Our approach piggybacks on the unique titles of entries from an encyclopedia, which are mostly proper and common noun phrases. We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages. Our dictionaries are cross-lingual, and any concept deemed too fine can be broadened to a desired level of generality using Wikipedia’s groupings of articles into hierarchical categories.

(examples omitted)

The database that we are providing was designed for recall. It is large and noisy, incorporating 297,073,139 distinct string-concept pairs, aggregated over 3,152,091,432 individual links, many of them referencing non-existent articles. For technical details, see our paper (to be presented at LREC 2012) and the README file accompanying the data. (emphasis added)

Did you catch those numbers?

Now there is a truly remarkable resource.

What will you make out of it?
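Out of curiosity, here is a rough sketch of how such a string-to-concept dictionary could be used once loaded, with a toy handful of rows standing in for the release's hundreds of millions of pairs. The (string, concept, count) layout is my assumption for illustration; the actual file format is documented in the README that accompanies the data.

```python
from collections import defaultdict

def build_dictionary(rows):
    """Build a string -> concept lookup from (anchor_string, concept, count)
    triples, aggregating counts for repeated pairs."""
    table = defaultdict(lambda: defaultdict(int))
    for string, concept, count in rows:
        table[string.lower()][concept] += count
    return table

def resolve(table, string):
    """Return candidate concepts for a string, most frequently linked first."""
    candidates = table.get(string.lower(), {})
    return sorted(candidates.items(), key=lambda kv: -kv[1])

# Toy rows standing in for the real 297,073,139 string-concept pairs.
rows = [
    ("jaguar", "en.wikipedia.org/wiki/Jaguar", 120),
    ("jaguar", "en.wikipedia.org/wiki/Jaguar_Cars", 310),
    ("big cat", "en.wikipedia.org/wiki/Jaguar", 15),
]
table = build_dictionary(rows)
print(resolve(table, "Jaguar"))  # most frequently linked sense first
```

The point of the recall-oriented design shows up immediately: an ambiguous string returns every concept it has ever linked to, ranked by link frequency, and it is up to the application to prune the noise.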

Are visual dictionaries generalizable?

Sunday, May 13th, 2012

Are visual dictionaries generalizable? by Otavio A. B. Penatti, Eduardo Valle, and Ricardo da S. Torres


Mid-level features based on visual dictionaries are today a cornerstone of systems for classification and retrieval of images. Those state-of-the-art representations depend crucially on the choice of a codebook (visual dictionary), which is usually derived from the dataset. In general-purpose, dynamic image collections (e.g., the Web), one cannot have the entire collection in order to extract a representative dictionary. However, based on the hypothesis that the dictionary reflects only the diversity of low-level appearances and does not capture semantics, we argue that a dictionary based on a small subset of the data, or even on an entirely different dataset, is able to produce a good representation, provided that the chosen images span a diverse enough portion of the low-level feature space. Our experiments confirm that hypothesis, opening the opportunity to greatly alleviate the burden in generating the codebook, and confirming the feasibility of employing visual dictionaries in large-scale dynamic environments.

The authors use the Caltech-101 image set because of its “diversity.” Odd because they cite the Caltech-256 image set, which was created to answer concerns about the lack of diversity in the Caltech-101 image set.

Not sure this paper answers the issues it raises about visual dictionaries.

Wanted to bring it to your attention because representative dictionaries (as opposed to comprehensive ones) may be lurking just beyond the semantic horizon.
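The authors' hypothesis is easy to sketch: learn the codebook from a small (or even unrelated) descriptor sample, then quantize any image against it. Here is a minimal k-means version, with random vectors standing in for real low-level descriptors; the sample sizes, k, and Euclidean distance are illustrative choices, not the paper's exact setup.

```python
import numpy as np

def build_codebook(descriptors, k, iters=20, seed=0):
    """Tiny k-means: learn k codewords from a (possibly small) descriptor
    sample. The paper's claim is that this sample need not come from the
    target dataset, only span enough of the low-level feature space."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest codeword, then recenter.
        d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers

def bag_of_words(descriptors, centers):
    """Quantize one image's descriptors against the codebook into a
    normalized histogram (the mid-level representation)."""
    d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Fake "low-level features": two well-separated blobs of 8-d vectors.
rng = np.random.default_rng(1)
sample = np.vstack([rng.normal(0, 1, (50, 8)), rng.normal(5, 1, (50, 8))])
codebook = build_codebook(sample, k=2)
image = rng.normal(5, 1, (30, 8))     # an "image" drawn from the second blob
print(bag_of_words(image, codebook))  # mass concentrates in one bin
```

The generalization question is then empirical: swap `sample` for descriptors from a different dataset and see whether the resulting histograms still discriminate, which is essentially the experiment the paper runs.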

SoSlang Crowdsources a Dictionary

Wednesday, March 21st, 2012

SoSlang Crowdsources a Dictionary

Stephen E. Arnold writes:

Here’s a surprising and interesting approach to dictionaries: have users build their own. SoSlang allows anyone to add a slang term and its definition. Beware, though, this site is not for everyone. Entries can be salty. R-rated, even. You’ve been warned.

I would compare this approach:

speakers -> usages -> dictionary

to a formal dictionary:

speakers -> usages -> editors -> formal dictionary

That is to say, a formal dictionary reflects its editors' sense of the language, not the raw input of the speakers of that language.

It would be a very interesting text mining task to eliminate duplicate usages of terms so that the changing uses of a term can be tracked.
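A first cut at that mining task might look like the sketch below: treat two definitions as the same usage when their token overlap is high, and keep the earliest date per distinct usage so sense changes can be tracked over time. The Jaccard similarity measure, the threshold, and the sample entries are all my illustrative choices, not anything SoSlang provides.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def distinct_usages(entries, threshold=0.5):
    """Collapse near-duplicate definitions; return (definition, first_seen)
    pairs, one per distinct usage, in order of first appearance."""
    usages = []  # list of (token_set, definition, first_seen)
    for date, definition in sorted(entries):
        toks = tokens(definition)
        if not any(jaccard(toks, seen) >= threshold for seen, _, _ in usages):
            usages.append((toks, definition, date))
    return [(d, first) for _, d, first in usages]

entries = [
    ("2009-03-01", "sick: excellent, very good"),
    ("2009-06-12", "sick: very good, excellent"),   # duplicate usage
    ("2011-01-20", "sick: tired of, fed up with"),  # distinct sense
]
print(distinct_usages(entries))  # two usages survive, each with its first date
```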

After DuPont bans Teflon from WordNet, the world is their non-sticky oyster

Tuesday, February 21st, 2012

After DuPont bans Teflon from WordNet, the world is their non-sticky oyster

Toma Tasovac reports on DuPont banning the term Teflon from WordNet, but not before observing:

I lived in the United States for more than a decade — long enough to know that litigation is not just a judiciary battle about enforcing legal rights: it’s a way of life. I have also over the years watched with amusement how dictionaries get used in American courtrooms, from Martha Nussbaum’s unfortunate reading of the Liddell-Scott on τόλμημα in Romer vs. Evans in 1993 to a recent case in which Chief Justice John G. Roberts Jr. parsed the meaning of a federal law by consulting no less than five dictionaries: one of the words he focused on was the preposition of. While Martha Nussbaum’s court drama about moral philosophy, scholarly integrity, homosexual desire and the nature of shame would make a great movie (starring, inevitably, as pretty much every other movie out there – Meryl Streep), Chief Justice Roberts’ dreadful, ho-hum lexicographic exercise would barely pass the Judge Judy test of how-low-can-we-go: he discovered that the meaning of of had something to do with belonging or possession. Pass the remote, please!

Who rules/owns our vocabularies?

There are serious issues at stake but take a few minutes to enjoy this post.

ODLIS: Online Dictionary for Library and Information Science

Friday, February 10th, 2012

ODLIS: Online Dictionary for Library and Information Science by Joan M. Reitz.

ODLIS is known to all librarians and graduate school library students but perhaps not to those of us who abuse library terminology in CS and related pursuits. Can’t promise it will make our usage any better but certainly won’t make it any worse. 😉

This would make a very interesting “term for a day” type resource.

Certainly one you should bookmark and browse at your leisure.

History of the Dictionary

ODLIS began at the Haas Library in 1994 as a four-page printed handout titled Library Lingo, intended for undergraduates not fluent in English and for English-speaking students unfamiliar with basic library terminology. In 1996, the text was expanded and converted to HTML format for installation on the WCSU Libraries Homepage under the title Hypertext Library Lingo: A Glossary of Library Terminology. In 1997, many more hypertext links were added and the format improved in response to suggestions from users. During the summer of 1999, several hundred terms and definitions were added, and a generic version was created that omitted all reference to specific conditions and practices at the Haas Library.

In the fall of 1999, the glossary was expanded to 1,800 terms, renamed to reflect its extended scope, and copyrighted. In February, 2000, ODLIS was indexed in Yahoo! under “Reference – Dictionaries – Subject.” It was also indexed in the WorldCat database, available via OCLC FirstSearch. During the year 2000, the dictionary was expanded to 2,600 terms and by 2002 an additional 800 terms had been added. From 2002 to 2004, the dictionary was expanded to 4,200 terms and cross-references were added, in preparation for the print edition. Since 2004, an additional 600 terms and definitions have been added.

Purpose of the Dictionary

ODLIS is designed as a hypertext reference resource for library and information science professionals, university students and faculty, and users of all types of libraries. The primary criterion for including a term is whether a librarian or other information professional might reasonably be expected to know its meaning in the context of his or her work. A newly coined term is added when, in the author’s judgment, it is likely to become a permanent addition to the lexicon of library and information science. The dictionary reflects North American practice; however, because ODLIS was first developed as an online resource available worldwide, with an e-mail contact address for feedback, users from many countries have contributed to its growth, often suggesting additional terms and commenting on existing definitions. Expansion of the dictionary is an ongoing process.

Broad in scope, ODLIS includes not only the terminology of the various specializations within library science and information studies but also the vocabulary of publishing, printing, binding, the book trade, graphic arts, book history, literature, bibliography, telecommunications, and computer science when, in the author’s judgment, a definition might prove useful to librarians and information specialists in their work. Entries are descriptive, with examples provided when appropriate. The definitions of terms used in the Anglo-American Cataloging Rules follow AACR2 closely and are therefore intended to be prescriptive. The dictionary includes some slang terms and idioms and a few obsolete terms, often as See references to the term in current use. When the meaning of a term varies according to the field in which it is used, priority is given to the definition that applies within the field with which it is most closely associated. Definitions unrelated to library and information science are generally omitted. As a rule, definition is given under an acronym only when it is generally used in preference to the full term. Alphabetization is letter-by-letter. The authority for spelling and hyphenation is Webster’s New World Dictionary of the American Language (College Edition). URLs, current as of date of publication, are updated annually.

Be careful with dictionary-based text analysis

Wednesday, October 12th, 2011

Be careful with dictionary-based text analysis

Brendan O’Connor writes:

OK, everyone loves to run dictionary methods for sentiment and other text analysis — counting words from a predefined lexicon in a big corpus, in order to explore or test hypotheses about the corpus. In particular, this is often done for sentiment analysis: count positive and negative words (according to a sentiment polarity lexicon, which was derived from human raters or previous researchers’ intuitions), and then proclaim the output yields sentiment levels of the documents. More and more papers come out every day that do this. I’ve done this myself. It’s interesting and fun, but it’s easy to get a bunch of meaningless numbers if you don’t carefully validate what’s going on. There are certainly good studies in this area that do further validation and analysis, but it’s hard to trust a study that just presents a graph with a few overly strong speculative claims as to its meaning. This happens more than it ought to.

How does “measurement” of sentiment in a document differ from “measurement” of the semantics of terms in that document?

Have we traded validated collections for “access” to large numbers of documents (think of the usual Internet search engine)? By validated collections I mean the discipline-based indexes where the user did not have to weed out completely irrelevant results.
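Part of the trouble O’Connor describes is how easy the counting step is. A few lines of Python reproduce the method, and its classic failure mode, with a toy lexicon standing in for a real polarity list:

```python
import re

# Tiny illustrative polarity lexicon (real ones have thousands of entries).
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "awful", "hate"}

def lexicon_score(text):
    """The naive dictionary method: count polar words, report the difference."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(lexicon_score("A great film, I love it."))    #  2
print(lexicon_score("Terrible. I hate it."))        # -2
# The failure mode: negation flips meaning but not counts.
print(lexicon_score("Not good. Not good at all."))  #  2 -- scored as positive
```

The last line is exactly why validation matters: the numbers come out either way, and only checking them against documents a human has read tells you whether they measure sentiment or something else.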

Web Pages Clustering: A New Approach

Wednesday, September 7th, 2011

Web Pages Clustering: A New Approach by Jeevan H E, Prashanth P P, Punith Kumar S N, and Vinay Hegde.


The rapid growth of the web has resulted in a vast volume of information. Information availability at a rapid speed to the user is vital. English (or any language, for that matter) has a lot of ambiguity in the usage of words, so there is no guarantee that a keyword-based search engine will provide the required results. This paper introduces the use of a (standardised) dictionary to obtain the context with which a keyword is used, and in turn to cluster the results based on this context. These ideas can be merged with a metasearch engine to enhance search efficiency.

The first part of this paper is concerned with the use of a dictionary to create separate queries for each “sense” of a term. I am not sure that is an innovation.

I don’t have the citation at hand, but I seem to recall that query rewriting has used something very much like a dictionary. Perhaps not a “dictionary” in the conventional sense, but I would not bet on that. Does anyone have a better memory than mine, or work in query rewriting?
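For what it’s worth, the paper’s first step can be sketched in a few lines: expand an ambiguous keyword into one query per dictionary sense, using each sense’s context words. The sense inventory below is invented for illustration; real entries would come from the standardised dictionary the authors have in mind.

```python
# A toy "standardised dictionary" mapping a keyword to its senses, each
# with a few context words (entries here are invented for illustration).
SENSES = {
    "bank": {
        "finance": ["money", "deposit", "loan", "account"],
        "river": ["shore", "water", "erosion"],
    },
}

def sense_queries(keyword):
    """Rewrite one ambiguous keyword into one query per dictionary sense;
    results retrieved per query could then be clustered by sense."""
    senses = SENSES.get(keyword.lower(), {})
    return {sense: f'{keyword} {" ".join(ctx)}' for sense, ctx in senses.items()}

for sense, q in sense_queries("bank").items():
    print(f"{sense}: {q}")
```

Which is roughly what query-expansion systems have long done with thesauri and WordNet-style sense inventories, hence my doubt that the dictionary step itself is the innovation.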