Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

May 19, 2015

Civil War Navies Bookworm

Filed under: History,Humanities,Indexing,Ngram Viewer,Searching,Text Analytics — Patrick Durusau @ 6:39 pm

Civil War Navies Bookworm by Abby Mullen.

From the post:

If you read my last post, you know that this semester I engaged in building a Bookworm using a government document collection. My professor challenged me to try my system for parsing the documents on a different, larger collection of government documents. The collection I chose to work with is the Official Records of the Union and Confederate Navies. My Barbary Bookworm took me all semester to build; this Civil War navies Bookworm took me less than a day. I learned things from making the first one!

This collection is significantly larger than the Barbary Wars collection—26 volumes, as opposed to 6. It encompasses roughly the same time span, but 13 times as many words. Though it is still technically feasible to read through all 26 volumes, this collection is perhaps a better candidate for distant reading than my first corpus.

The document collection is broken into geographical sections, the Atlantic Squadron, the West Gulf Blockading Squadron, and so on. Using the Bookworm allows us to look at the words in these documents sequentially by date instead of having to go back and forth between different volumes to get a sense of what was going on in the whole navy at any given time.
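The idea of reading across volumes by date rather than volume by volume can be sketched in a few lines. This is a toy illustration, not Bookworm's own code: the records, the `documents` list and the `yearly_counts` function are all invented for the example, and it reports raw counts rather than whatever normalization Bookworm applies.

```python
from collections import defaultdict
from datetime import date

# Invented toy records: (volume label, document date, text).
# These are placeholders, not text from the Official Records.
documents = [
    ("Volume B", date(1862, 3, 9), "the squadron sailed at dawn"),
    ("Volume A", date(1861, 5, 1), "the blockade of the coast has begun"),
    ("Volume B", date(1862, 4, 24), "the squadron returned to port"),
    ("Volume A", date(1862, 1, 15), "the blockade remains in force"),
]


def yearly_counts(docs, word):
    """Count occurrences of `word` per year, ignoring volume boundaries."""
    counts = defaultdict(int)
    # Sort by date so documents from different volumes interleave.
    for _volume, when, text in sorted(docs, key=lambda d: d[1]):
        tokens = text.lower().split()
        counts[when.year] += tokens.count(word.lower())
    return dict(counts)


print(yearly_counts(documents, "blockade"))
print(yearly_counts(documents, "squadron"))
```

Sorting on the date field is the whole trick: once every document carries a date, the geographic volume it was printed in no longer matters for the time series.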

Before you ask:

The earlier post: Text Analysis on the Documents of the Barbary Wars

More details on Bookworm.

As with all ngram viewers, exercise caution in assuming a text string has uniform semantics across historical, ethnic, or cultural fault lines.

October 18, 2013

Enhancing Linguistic Search with…

Filed under: Linguistics,Ngram Viewer — Patrick Durusau @ 3:27 pm

Enhancing Linguistic Search with the Google Books Ngram Viewer by Slav Petrov and Dipanjan Das.

From the post:


With our interns Jason Mann, Lu Yang, and David Zhang, we’ve added three new features. The first is wildcards: by putting an asterisk as a placeholder in your query, you can retrieve the ten most popular replacements. For instance, what noun most often follows “Queen” in English fiction? The answer is “Elizabeth”:

Another feature we’ve added is the ability to search for inflections: different grammatical forms of the same word. (Inflections of the verb “eat” include “ate”, “eating”, “eats”, and “eaten”.) Here, we can see that the phrase “changing roles” has recently surged in popularity in English fiction, besting “change roles”, which earlier dethroned “changed roles”:

Finally, we’ve implemented the most common feature request from our users: the ability to search for multiple capitalization styles simultaneously. Until now, searching for common capitalizations of “Mother Earth” required using a plus sign to combine ngrams (e.g., “Mother Earth + mother Earth + mother earth”), but now the case-insensitive checkbox makes it easier:

The ngram data sets are available for download.

As of the date of this post, the data sets go up to 5-grams in multiple languages.
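For readers who download the data, the wildcard and case-insensitive features described above amount to simple aggregations over ngram counts. The sketch below assumes a tiny in-memory table of 2-gram counts; the numbers, the `bigram_counts` table and both function names are invented for illustration and say nothing about the real file format or how Google implements the viewer.

```python
from collections import Counter

# Invented 2-gram counts for illustration only; not from the Google data sets.
bigram_counts = Counter({
    ("Queen", "Elizabeth"): 120,
    ("Queen", "Victoria"): 95,
    ("Queen", "Anne"): 40,
    ("Mother", "Earth"): 30,
    ("mother", "Earth"): 12,
    ("mother", "earth"): 8,
})


def wildcard(first_word, counts, top=10):
    """Return the `top` most frequent words that follow `first_word`."""
    followers = Counter()
    for (w1, w2), n in counts.items():
        if w1 == first_word:
            followers[w2] += n
    return followers.most_common(top)


def case_insensitive(phrase, counts):
    """Sum counts over every capitalization variant of a two-word phrase."""
    target = tuple(word.lower() for word in phrase.split())
    return sum(
        n for gram, n in counts.items()
        if tuple(w.lower() for w in gram) == target
    )


print(wildcard("Queen", bigram_counts))
print(case_insensitive("Mother Earth", bigram_counts))
```

Note that both functions match strings only. Neither knows whether two occurrences of “Queen Elizabeth” refer to the same person, which is the gap between string usage and semantic usage.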

Be mindful of semantic drift, the change in the meaning of words over centuries or decades. Meanings also differ across social and economic strata and work domains within the same period.

October 19, 2012

Ngram Viewer 2.0 [String Usage != Semantic Usage]

Filed under: GoogleBooks,Natural Language Processing,Ngram Viewer — Patrick Durusau @ 3:32 pm

Ngram Viewer 2.0 by Jon Orwant.

From the post:

Since launching the Google Books Ngram Viewer, we’ve been overjoyed by the public reception. Co-creator Will Brockman and I hoped that the ability to track the usage of phrases across time would be of interest to professional linguists, historians, and bibliophiles. What we didn’t expect was its popularity among casual users. Since the launch in 2010, the Ngram Viewer has been used about 50 times every minute to explore how phrases have been used in books spanning the centuries. That’s over 45 million graphs created, each one a glimpse into the history of the written word. For instance, comparing flapper, hippie, and yuppie, you can see when each word peaked:

(graphic omitted)

Meanwhile, Google Books reached a milestone, having scanned 20 million books. That’s approximately one-seventh of all the books published since Gutenberg invented the printing press. We’ve updated the Ngram Viewer datasets to include a lot of those new books we’ve scanned, as well as improvements our engineers made in OCR and in hammering out inconsistencies between library and publisher metadata. (We’ve kept the old dataset around for scientists pursuing empirical, replicable language experiments such as the ones Jean-Baptiste Michel and Erez Lieberman Aiden conducted for our Science paper.)

Tracking the usage of phrases through time is no mean feat, but tracking their semantics would be far more useful.

For example, “freedom of speech” did not have the same semantics in the early history of the United States that it does today. Otherwise, how would you explain criminal statutes against blasphemy and their enforcement after the ratification of the US Constitution? (I have not verified this, but Wikipedia, Blasphemy Law in the United States, reports a person being jailed for blasphemy in the 1830s.)

Or consider the guarantee of “freedom of speech” in Article 125 of the 1936 Constitution of the USSR.

Those three usages, current United States, early United States, USSR 1936 (English translation), don’t have the same semantics to me.

You?
