Archive for the ‘N-Grams’ Category

Lgram

Sunday, September 7th, 2014

Lgram: A memory-efficient ngram builder

From the webpage:

Lgram is a cross-platform tool for calculating ngrams in a memory-efficient manner. The current crop of n-gram tools has non-constant memory usage, so ngrams cannot be computed for large input texts. Given the prevalence of large texts in computational and corpus linguistics, this deficit is problematic. Lgram has constant memory usage, so it can compute ngrams on arbitrarily sized input texts. Lgram achieves constant memory usage by periodically syncing the computed ngrams to an sqlite database stored on disk.

Lgram was written by Edward J. L. Bell at Lancaster University and funded by UCREL. The project was initiated by Dr Paul Rayson.
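
The "periodically syncing" idea can be illustrated with a short Python sketch. This is not Lgram's code; the function name `count_ngrams`, the table layout, and the `flush_every` threshold are all assumptions made up for illustration. The point is only that the in-memory buffer stays bounded while the totals accumulate in SQLite.

```python
import sqlite3
from collections import Counter, deque


def count_ngrams(tokens, n, db_path=":memory:", flush_every=10_000):
    """Count word n-grams from an iterable of tokens, spilling counts to SQLite.

    flush_every is a hypothetical tuning knob: the in-memory buffer never
    holds more than that many distinct n-grams before it is written out.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ngrams "
        "(gram TEXT PRIMARY KEY, count INTEGER NOT NULL)"
    )
    buffer = Counter()
    window = deque(maxlen=n)  # sliding window over the last n tokens

    def flush():
        with conn:  # one transaction per flush
            for gram, c in buffer.items():
                # Create the row if it is missing, then add the buffered count.
                conn.execute(
                    "INSERT OR IGNORE INTO ngrams (gram, count) VALUES (?, 0)",
                    (gram,),
                )
                conn.execute(
                    "UPDATE ngrams SET count = count + ? WHERE gram = ?",
                    (c, gram),
                )
        buffer.clear()

    for token in tokens:
        window.append(token)
        if len(window) == n:
            buffer[" ".join(window)] += 1
            if len(buffer) >= flush_every:
                flush()
    flush()  # write whatever is left
    return conn


conn = count_ngrams("to be or not to be".split(), n=2, flush_every=2)
print(conn.execute("SELECT gram, count FROM ngrams ORDER BY gram").fetchall())
```

Pass a file path as `db_path` to keep the counts on disk, and feed `tokens` from a generator that reads the input lazily so the text itself is never held in memory either.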

Not recent (2011) but new to me. Enjoy!

I first saw this in a tweet by Christopher Phipps.

Neo4j Aces State Competition “Jugend Forscht Hessen” and Best Project Award

Friday, March 16th, 2012

Paul Wagner and Till Speicher won State Competition “Jugend Forscht Hessen” and best Project award using neo4j, writes René Pickhardt.

From the post:

6 months of hard coding and supervision by me are over and end with a huge success! After analyzing 80 GB of Google ngrams data Paul and Till put them into a neo4j graph database in order to make predictions for fast sentence completion. Today was the award ceremony and the two students from Darmstadt and Saarbrücken (respectively) won first place. Additionally they received the “beste schöpferische Arbeit” award, which is the award for the best project in the entire competition (over all disciplines).

With their technology and the almost finished Android app typing will be revolutionized! While typing a sentence they are able to predict the next word with a recall of 67%, creating huge additional value for today’s smartphones.

So stay tuned for the upcoming news and the federal competition in May in Erfurt.

Not that you could tell that René is proud of the team! 😉

Curious: Can you use a Neo4j database to securely exchange messages? Display of messages triggered by series of tokens? Smart phone operator only knows their sequence and nothing more.

How to Store Google n-gram in Neo4j

Saturday, February 18th, 2012

How to Store Google n-gram in Neo4j by r.schiessler.

From the post:

At the end of September I discovered an amazing data set which is provided by Google! It is called the Google n gram data set. Even though the English Wikipedia article about ngrams needs some clean up it explains nicely what an ngram is. http://en.wikipedia.org/wiki/N-gram The data set is available in several languages and I am sure it is very useful for many tasks in web retrieval, data mining, information retrieval and natural language processing.

This data set is very well described on the official google n gram page which I also include as an iframe directly here on my blog.

Schiessler describes the project as follows:

The idea is that once a user has started to type a sentence, statistics about the n-grams can be used to predict, semantically and syntactically correctly, what the next word will be and in this way increase the speed of typing by making suggestions to the user. This will be particularly useful with all these mobile devices where typing is really annoying.
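
For readers who want to see the mechanics, here is a minimal Python sketch of next-word suggestion from n-gram counts. It uses a plain in-memory dictionary rather than Neo4j, and the names `build_model` and `suggest` are hypothetical, not taken from the project. It assumes n is at least 2.

```python
from collections import Counter, defaultdict


def build_model(tokens, n=3):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i : i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model


def suggest(model, typed, n=3, k=3):
    """Return up to k of the most frequent continuations of the last n-1 words."""
    context = tuple(typed[-(n - 1):])
    followers = model.get(context, Counter())
    return [word for word, _ in followers.most_common(k)]


corpus = "i like green tea . i like green apples . i like green tea .".split()
model = build_model(corpus, n=3)
print(suggest(model, ["i", "like", "green"]))
```

In a graph database the same structure would presumably become nodes for words or contexts and weighted edges for the counts, but that mapping is a guess and not something the post spells out.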

Another suggestion project!

Worth your time both for its substance and use of Neo4j.

Mongoid_fulltext

Sunday, December 4th, 2011

Mongoid_fulltext: full-text n-gram search for your MongoDB models by Daniel Doubrovkine.

From the post:

We’ve been using mongoid_search for some time now for auto-complete. It’s a fine component that splits sentences and uses MongoDB to index them. Unfortunately it doesn’t rank them, so results come in order of appearance. In contrast, mongoid_fulltext uses n-gram matching (with n=3 right now), so we index all of the substrings of length 3 from text that we want to search on. If you search for “damian hurst”, mongoid_fulltext does lookups for “dam”, “ami”, “mia”, “ian”, “an ”, “n h”, “ hu”, “hur”, “urs”, and “rst” and combines the results to get a most likely match. This also means users can make simple spelling mistakes and still find something relevant. In addition, you can index multiple collections in a single index, producing best matching results within several models. Finally, mongoid_fulltext leverages MongoDB native indexing and map-reduce.

And see: https://github.com/aaw/mongoid_fulltext.
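
The trigram lookup described in the quote is easy to try out. The Python sketch below is only an illustration of the general technique: the function names and the score (a count of shared trigrams) are assumptions, and mongoid_fulltext's actual ranking may well differ.

```python
from collections import Counter, defaultdict


def ngrams(text, n=3):
    """All substrings of length n, lowercased, in order of appearance."""
    text = text.lower()
    return [text[i : i + n] for i in range(len(text) - n + 1)]


def build_index(titles, n=3):
    """Inverted index: n-gram -> set of titles that contain it."""
    index = defaultdict(set)
    for title in titles:
        for gram in ngrams(title, n):
            index[gram].add(title)
    return index


def search(index, query, n=3):
    """Rank titles by how many of the query's n-grams they share."""
    scores = Counter()
    for gram in ngrams(query, n):
        for title in index.get(gram, ()):
            scores[title] += 1
    return [title for title, _ in scores.most_common()]


index = build_index(["Damien Hirst", "David Hockney", "Howard Hodgkin"])
print(ngrams("damian hurst"))
print(search(index, "damian hurst"))
```

The misspelled query still shares “dam”, “ami”, “n h” and “rst” with “Damien Hirst”, which is why simple spelling mistakes do not prevent a match.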

Worth considering for your next application that takes text input from users.

Download Google n gram data set and neo4j source code for storing it

Wednesday, November 30th, 2011

Download Google n gram data set and neo4j source code for storing it by René Pickhardt.

From the post:

At the end of September I discovered an amazing data set which is provided by Google! It is called the Google n gram data set. Even though the English Wikipedia article about ngrams needs some clean up it explains nicely what an ngram is. http://en.wikipedia.org/wiki/N-gram

The data set is available in several languages and I am sure it is very useful for many tasks in web retrieval, data mining, information retrieval and natural language processing.

I forwarded this data set to two high school students whom I was teaching last summer at the dsa. Now they are working on a project for a German student competition. They are using the n-grams and neo4j to predict sentences and help people to improve typing.

The idea is that once a user has started to type a sentence, statistics about the n-grams can be used to predict, semantically and syntactically correctly, what the next word will be and in this way increase the speed of typing by making suggestions to the user. This will be particularly useful with all these mobile devices where typing is really annoying.

Now, imagine having users mark subjects in texts (highlighting?) and using a sufficient number of such texts to automate the recognition of subjects and their relationships to document authors and other subjects. Does that sound like an easy way to author an ongoing topic map based on the output of an office? Or project?

Feel free to mention at least one prior project that used a very similar technique with texts. If no one does, I will post its name and links to the papers tomorrow.