Archive for the ‘Collocation’ Category

1150 Free Online Courses from Top Universities (update) [Collating Content]

Sunday, September 27th, 2015

1150 Free Online Courses from Top Universities (update).

From the webpage:

Get 1150 free online courses from the world’s leading universities — Stanford, Yale, MIT, Harvard, Berkeley, Oxford and more. You can download these audio & video courses (often from iTunes, YouTube, or university web sites) straight to your computer or mp3 player. Over 30,000 hours of free audio & video lectures, await you now.

An ever improving resource!

As of last January (2015), it listed 1100 courses.

Another fifty courses have been added and I discovered a course in Hittite!

The same problem of collating content across resources that I mentioned for data science books obtains here, as you take courses in the same discipline or read primary/secondary literature.

What if I find references that are helpful in the Hittite course in the image PDFs of the Chicago Assyrian Dictionary? How do I combine that with the information from the Hittite course so if you take Hittite, you don’t have to duplicate my search?

That’s the ticket, isn’t it? Not having different users perform the same task over and over again? One user finds the answer and, for all other users, it is simply “there.”

Quite a different view of the world of information than the repetitive, non-productive, ad-laden and often irrelevant results from the typical search engine.

Using your Lucene index as input to your Mahout job – Part I

Tuesday, March 6th, 2012

Using your Lucene index as input to your Mahout job – Part I

From the post:

This blog shows you how to use an upcoming Mahout feature, the lucene2seq program. This program reads the contents of stored fields in your Lucene index and converts them into text sequence files, to be used by a Mahout text clustering job. The tool contains both a sequential and MapReduce implementation and can be run from the command line or from Java using a bean configuration object. In this blog I demonstrate how to use the sequential version on an index of Wikipedia.

Access to original text can help with improving clustering results. See the blog post for details.

collocations in wikipedia – parts 2 and 3

Friday, November 18th, 2011

Matthew Kelcey continues his series on collocations, although the title to part 3 doesn’t say as much.

collocations in wikipedia, part 2

In part 2 Matt discusses alternatives to “magic” frequency cut-offs for collocation analysis.

I rather like the idea of looking for alternatives to “it’s just that way” methodologies. Accepting traditional cut-offs, etc., may be the right thing to do in some cases, but only with experience and an understanding of the alternatives.

finding phrases with mutual information [collocations, part 3]

In part 3 Matt discusses taking collocations beyond just two terms that occur together and techniques for that analysis.
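To make the idea of scoring phrases longer than two terms concrete, here is a minimal sketch of one possible generalization of pointwise mutual information to n-grams. It is not taken from Matt's posts; the function name `ngram_pmi`, the toy sentence and the choice of formula (joint probability divided by the product of the unigram probabilities) are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_pmi(tokens, n):
    """Score each n-gram by log2( p(w1..wn) / (p(w1) * ... * p(wn)) ).

    Illustrative generalization of pointwise mutual information;
    other generalizations exist and may behave differently.
    """
    total = len(tokens)
    unigrams = Counter(tokens)
    # Slide a window of width n over the token list.
    ngrams = Counter(zip(*(tokens[i:] for i in range(n))))
    n_ngrams = sum(ngrams.values())
    scores = {}
    for gram, count in ngrams.items():
        p_joint = count / n_ngrams
        # Probability of the n-gram if its words were independent.
        p_indep = math.prod(unigrams[w] / total for w in gram)
        scores[gram] = math.log2(p_joint / p_indep)
    return scores


# Toy text, invented for this example.
tokens = ("new york city is big and new york city is old "
          "and the city is new").split()
trigram_scores = ngram_pmi(tokens, 3)
bigram_scores = ngram_pmi(tokens, 2)
```

On this toy text the repeated phrase "new york city" scores higher than an incidental trigram such as "city is new". With counts this small the scores are only suggestive; on real data, rare n-grams tend to get inflated scores.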

Matt is also posting todo thoughts for further investigation.

If you have the time and interest, drop by Matt’s blog to leave suggestions or comments.

(See collocations in wikipedia, part 1 for our coverage of the first post.)

collocations in wikipedia, part 1

Tuesday, October 25th, 2011

collocations in wikipedia, part 1

From the post:

collocations are combinations of terms that occur together more frequently than you’d expect by chance.

they can include

  • proper noun phrases like ‘Darth Vader’
  • stock/colloquial phrases like ‘flora and fauna’ or ‘old as the hills’
  • common adjectives/noun pairs (notice how ‘strong coffee’ sounds ok but ‘powerful coffee’ doesn’t?)

let’s go through a couple of techniques for finding collocations taken from the exceptional nlp text “foundations of statistical natural language processing” by manning and schutze.

Looks like the start of a very interesting series on (statistical) collocation in Wikipedia, which is a serious data set for training purposes.
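The definition quoted above, terms occurring together "more frequently than you'd expect by chance," can be sketched as a comparison of observed and expected bigram counts. This is only a rough illustration under an independence assumption, not the method from the post or the book; the function name `bigram_ratios` and the toy sentence are made up for the example.

```python
from collections import Counter


def bigram_ratios(tokens):
    """Return {bigram: observed_count / expected_count}.

    The expected count assumes the two words occur independently
    of each other (an illustrative simplification).
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_bigrams = n - 1
    ratios = {}
    for (w1, w2), observed in bigrams.items():
        # Expected number of times w1 is followed by w2 by chance alone.
        expected = (unigrams[w1] / n) * (unigrams[w2] / n) * n_bigrams
        ratios[(w1, w2)] = observed / expected
    return ratios


# Toy text, invented for this example.
tokens = "darth vader met luke and darth vader left and luke slept".split()
ratios = bigram_ratios(tokens)
```

A ratio well above 1 suggests a candidate collocation. Here "darth vader" gets a higher ratio than "and luke", but pairs seen only once can score just as high, which is one reason frequency cut-offs come up in this kind of analysis.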

BTW, don’t miss the homepage. Lots of interesting resources.

Update: 18 November 2011

See also:

collocations in wikipedia, part 2

finding phrases with mutual information [collocations, part 3]

I am making a separate blog post on parts 2 and 3 but just in case you come here first…. Enjoy!

Building Concept Structures/Concept Trails

Thursday, December 2nd, 2010

Automatically Building Concept Structures and Displaying Concept Trails for the Use in Brainstorming Sessions and Content Management Systems

Authors: Christian Biemann, Karsten Böhm, Gerhard Heyer and Ronny Melz


The automated creation and the visualization of concept structures become more important as the number of relevant information continues to grow dramatically. Especially information and knowledge intensive tasks are relying heavily on accessing the relevant information or knowledge at the right time. Moreover the capturing of relevant facts and good ideas should be focused on as early as possible in the knowledge creation process.

In this paper we introduce a technology to support knowledge structuring processes already at the time of their creation by building up concept structures in real time. Our focus was set on the design of a minimal invasive system, which ideally requires no human interaction and thus gives the maximum freedom to the participants of a knowledge creation or exchange processes. The initial prototype concentrates on the capturing of spoken language to support meetings of human experts, but can be easily adapted for the use in Internet communities that have to rely on knowledge exchange using electronic communication channel.

I don’t share the authors’ confidence that corpus linguistics is going to provide the level of accuracy expected.

But, I find the notion of a dynamic semantic map that grows, changes and evolves during a discussion to be intriguing.

This article was published in 2006 so I will follow up to see what later results have been reported.

Measuring the meaning of words in contexts:…

Sunday, November 21st, 2010

Measuring the meaning of words in contexts: An automated analysis of controversies about ‘Monarch butterflies,’ ‘Frankenfoods,’ and ‘stem cells’

Author(s): Loet Leydesdorff and Iina Hellsten

Keywords: co-words, metaphors, diaphors, context, meaning


Co-words have been considered as carriers of meaning across different domains in studies of science, technology, and society. Words and co-words, however, obtain meaning in sentences, and sentences obtain meaning in their contexts of use. At the science/society interface, words can be expected to have different meanings: the codes of communication that provide meaning to words differ on the varying sides of the interface. Furthermore, meanings and interfaces may change over time. Given this structuring of meaning across interfaces and over time, we distinguish between metaphors and diaphors as reflexive mechanisms that facilitate the translation between contexts. Our empirical focus is on three recent scientific controversies: Monarch butterflies, Frankenfoods, and stem-cell therapies. This study explores new avenues that relate the study of co-word analysis in context with the sociological quest for the analysis and processing of meaning.

Excellent article on shifts of word meaning over time. Reports sufficient detail on methodology that interested readers will be able to duplicate or extend the research reported here.


  1. Annotated bibliography of research citing this paper.
  2. Design a study of the shifting meaning of 2 or 3 terms. What texts would you select? (3-5 pages, with citations)
  3. Perform a study of shifting meaning of terms in library science. (Project)

The LibraryThing

Saturday, June 12th, 2010

The LibraryThing is the home of OverCat, a collection of 32 million library records.

It is a nifty illustration of re-using identifiers, not re-inventing them.

I put in an ISBN, for example, and the system searches for that work. It does not ask me to create a “cool” URI for it.

It also demonstrates some of the characteristics of a topic map in that it does not return multiple matches for all the libraries that hold a work, but only one. (You can still view the other records as well.)
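As a rough illustration of that behavior, records from several libraries can be collocated under a shared identifier so that a search returns one entry per work. This is a sketch only and says nothing about how LibraryThing or OverCat is actually implemented; the record layout, the function name `collocate_by_isbn` and the sample ISBNs and libraries are all invented.

```python
from collections import defaultdict


def collocate_by_isbn(records):
    """Group library records that share an ISBN into one entry per work."""
    works = defaultdict(lambda: {"titles": set(), "holdings": set()})
    for rec in records:
        # Normalize the identifier so hyphenated and plain forms match.
        key = rec["isbn"].replace("-", "").strip()
        works[key]["titles"].add(rec["title"])
        works[key]["holdings"].add(rec["library"])
    return dict(works)


# Made-up records; the ISBNs are placeholders, not real books.
records = [
    {"isbn": "978-1-23-456789-7", "title": "Example Work", "library": "Library A"},
    {"isbn": "9781234567897", "title": "Example work", "library": "Library B"},
    {"isbn": "978-0-00-000000-2", "title": "Another Work", "library": "Library A"},
]
works = collocate_by_isbn(records)
```

Three records collapse into two works, and each work keeps the variant titles and holding libraries from the individual records, so the other records remain viewable.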

I am not sure I have the time to enter, even by ISBN, all the books that line the walls of my office but maybe I will start with the new ones as they come in and the older ones as I use them. The result is a catalog of my books, but more importantly, additional information about those works entered by others.

Maybe that could be a marketing pitch for topic maps? That topic maps enable users to coordinate their information with others, without prior agreement. Sort of like asking for a ride to town and at the same time, someone in a particular area says they are going to town but need to share gas expenses. (Treating a circumference around a set of geographic coordinates as a subject. Users neither know nor care about the details, just expressing their needs.)

Is There a Haptic Topic Map in Your Future?

Friday, March 12th, 2010

I ran across a short article today on improving access to maps using non-visual channels, gestures, tactile/haptic interaction and sound: The HaptiMap project aims to make maps accessible through touch, hearing and vision.

The HaptiMap project is sponsored by the EU Commission. There is a collection of recent papers.

One obvious relevance to topic maps is that HaptiMap is collocating information about the same locations from/to different non-visual channels. Hmmm, I have heard that before, but where? Oh, yeah, topic maps are about collocating information about the same subject. That would include information in different channels about the same subject.

A less obvious relevance is for determining when there are multiple representatives of the same subject. Comparing strings, which may or may not be meaningful to a user, is only one test of subject identity. Ever tried to identify the subject spoiled milk by sniffing it? Or a favorite artist or style of music by listening to it? Or a particular style of weave or fabric by touch? All of those sound like identification of subjects to me.

Imagine a map that presents representatives of subjects for merging based on non-symbolic clues experienced by the user. Rather than a music library organized by artist/song title, etc., a continuum that is navigated and merged only on the basis of sound. Or representations of subjects in a haptic map found in a VR environment. Or an augmented environment that uses a variety of channels to communicate information about a single subject.

You will have to attend TMRA 2010 to see if any haptic topic maps show up this year.

An Early Example of Collocation

Friday, March 5th, 2010

An early example of collocation is the Rosetta Stone. It records a decree in 196 BCE by Ptolemy V granting a tax amnesty to temple priests.

The stele has the decree written in Egyptian (two scripts, hieroglyphic and Demotic) and Classical Greek.

The collocation of different translations of the same decree on the Rosetta Stone raises interesting questions about identification of subjects as well as how to process such identifications.

This decree of Ptolemy V could be identified as the decree on the Rosetta Stone. Or, it could be identified by reciting the entire text. There are multiple ways to identify any subject. That some means of identification are more common than others should not blind us to alternative methods for subject identification. Or to the differences that a means of identification may make for processing.

Since each text/identifier was independent of the others, each reader was free to identify the subject without reference to the other identifiers. (Shades of parallel processing?)

Another processing issue to notice is that by reciting the text of the decree on the Rosetta Stone, it was not necessary for readers to “dereference” an identifier in order to understand what subject was being identified.

Topic maps are a recent development in a long history of honoring semantic diversity.