Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 5, 2014

Q

Filed under: Data,Language — Patrick Durusau @ 8:27 pm

Q by Bernard Lambeau.

From the webpage:

Q is a data language. For now, it is limited to a data definition language (DDL). Think “JSON/XML schema”, but the correct way. Q comes with a dedicated type system for defining data and a theory, called information contracts, for interoperability with programming and data exchange languages.

I am sure this will be useful but limited since it doesn’t extend to disclosing the semantics of data or the structures that contain data.

Unfortunately, the semantics of data seem to be treated as: “…you know what the data means…,” which is rather far from the truth.

Sometimes some people may know what the data “means,” but that is hardly a sure thing.

My favorite example is the pyramids: built in front of hundreds of thousands of people over decades, and because everyone “…knew how it was done…,” no one bothered to write it down.

Now H2 can consult with “ancient astronaut theorists” (I’m not lying, that is what they called their experts) about the building of the pyramids.

Do you want your data to be interpreted by the data equivalent of an “ancient astronaut theorist?” If not, you had better give some consideration to documenting the semantics of your data.

I first saw this in a tweet by Carl Anderson.

March 1, 2014

How to learn Chinese and Japanese [and computing?]

Filed under: Language,Learning,Topic Maps — Patrick Durusau @ 6:33 pm

How to learn Chinese and Japanese by Victor Mair.

From the post:

Victor concludes after a discussion of various authorities and sources:

If you delay introducing the characters, students’ mastery of pronunciation, grammar, vocabulary, syntax, and so forth, are all faster and more secure. Surprisingly, when later on they do start to study the characters (ideally in combination with large amounts of reading interesting texts with phonetic annotation), students acquire mastery of written Chinese much more quickly and painlessly than if writing is introduced at the same time as the spoken language.

An interesting debate follows in the comments.

I wonder whether the current emphasis on “coding” would be better shifted to an emphasis on computing.

That is, teaching the fundamental concepts of computing, separate and apart from any particular coding language or practice.

Much as I have taught the principles of subject identification separate and apart from a particular model or syntax.

The nooks and crannies of particular models or syntaxes can wait until later.

February 8, 2014

Arabic Natural Language Processing

Filed under: Language,Natural Language Processing — Patrick Durusau @ 3:14 pm

Arabic Natural Language Processing

From the webpage:

Arabic is the largest member of the Semitic language family and is spoken by nearly 500 million people worldwide. It is one of the six official UN languages. Despite its cultural, religious, and political significance, Arabic has received comparatively little attention by modern computational linguistics. We are remedying this oversight by developing tools and techniques that deliver state-of-the-art performance in a variety of language processing tasks. Machine translation is our most active area of research, but we have also worked on statistical parsing and part-of-speech tagging. This page provides links to our freely available software along with a list of relevant publications.

Software and papers from the Stanford NLP group.

An important capability to add to your toolkit, especially if you are dealing with the U.S. security complex.

I first saw this at: Stanford NLP Group Tackles Arabic Machine Translation.

February 4, 2014

So You Want To Write Your Own Language?

Filed under: Language,Language Design,Programming — Patrick Durusau @ 8:54 pm

So You Want To Write Your Own Language? by Walter Bright.

From the post:

The naked truth about the joys, frustrations, and hard work of writing your own programming language

My career has been all about designing programming languages and writing compilers for them. This has been a great joy and source of satisfaction to me, and perhaps I can offer some observations about what you’re in for if you decide to design and implement a professional programming language. This is actually a book-length topic, so I’ll just hit on a few highlights here and avoid topics well covered elsewhere.

In case you are wondering if Walter is a good source for language writing advice, I pulled this bio from the Dr. Dobb’s site:

Walter Bright is a computer programmer known for being the designer of the D programming language. He was also the main developer of the first native C++ compiler, Zortech C++ (later to become Symantec C++, now Digital Mars C++). Before the C++ compiler he developed the Datalight C compiler, also sold as Zorland C and later Zortech C.

I am sure writing a language is an enormous amount of work but Walter makes it sound quite attractive.

January 8, 2014

Morpho project

Filed under: Language,Parsing — Patrick Durusau @ 2:02 pm

Morpho project

From the webpage:

The goal of the Morpho project is to develop unsupervised data-driven methods that discover the regularities behind word forming in natural languages. In particular, we are focusing on the discovery of morphemes, which are the primitive units of syntax, the smallest individually meaningful elements in the utterances of a language. Morphemes are important in automatic generation and recognition of a language, especially in languages in which words may have many different inflected forms.

This may not be of general interest but I mention it as one aspect of data-driven linguistics.

Long dead languages are often victims of well-meaning but highly imaginative work meant to explain those languages.

Grounding work in texts of a language introduces a much needed sanity check.

December 22, 2013

Twitter Weather Radar – Test Data for Language Analytics

Filed under: Analytics,Language,Tweets,Weather Data — Patrick Durusau @ 8:17 pm

Twitter Weather Radar – Test Data for Language Analytics by Nicholas Hartman.

From the post:

Today we’d like to share with you some fun charts that have come out of our internal linguistics research efforts. Specifically, studying weather events by analyzing social media traffic from Twitter.

We do not specialize in social media and most of our data analytics work focuses on the internal operations of leading organizations. Why then would we bother playing around with Twitter data? In short, because it’s good practice. Twitter data mimics a lot of the challenges we face when analyzing the free text streams generated by complex processes. Specifically:

  • High Volume: The analysis represented here is looking at around 1 million tweets a day. In the grand scheme of things, that’s not a lot but we’re intentionally running the analysis on a small server. That forces us to write code that rapidly assess what’s relevant to the question we’re trying to answer and what’s not. In this case the raw tweets were quickly tested live on receipt with about 90% of them discarded. The remaining 10% were passed onto the analytics code.
  • Messy Language: A lot of text analytics exercises I’ve seen published use books and news articles as their testing ground. That’s fine if you’re trying to write code to analyze books or news articles, but most of the world’s text is not written with such clean and polished prose. The types of text we encounter (e.g., worklogs from an IT incident management system) are full of slang, incomplete sentences and typos. Our language code needs to be good at determining the messages contained within this messy text.
  • Varying Signal to Noise: The incoming stream of tweets will always contain a certain percentage of data that isn’t relevant to the item we’re studying. For example, if a band member from One Direction tweets something even tangentially related to what some code is scanning for, the dataset can be suddenly overwhelmed with a lot of off-topic tweets. Real world data similarly has a lot of unexpected noise.

In the exercise below, tweets from Twitter’s streaming API JSON stream were scanned in near real-time for their ability to 1) be pinpointed to a specific location and 2) provide potential details on local weather conditions. The vast majority of tweets passing through our code failed to meet both of these conditions. The tweets that remained were analyzed to determine the type of precipitation being discussed.

An interesting reminder that data to test your data mining/analytics is never far away.

If not Twitter, pick one of the numerous email archives or open data datasets.

The post doesn’t offer any substantial technical details but then you need to work those out for yourself.
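To make the idea concrete, here is a minimal sketch (mine, not the author’s pipeline) of that kind of two-stage filter, assuming tweets arrive as parsed JSON lines from the streaming API and that the keyword list is only a stand-in:

import json

# Stand-in keyword list; a real analysis would use a much richer weather lexicon.
WEATHER_TERMS = {"snow", "snowing", "rain", "raining", "sleet", "hail", "blizzard"}

def keep_tweet(tweet):
    """Keep a tweet only if it can be pinpointed to a location
    and mentions a possible precipitation term."""
    coords = (tweet.get("coordinates") or {}).get("coordinates")
    if not coords:                      # condition 1: usable geo coordinates
        return False
    words = set(tweet.get("text", "").lower().split())
    return bool(words & WEATHER_TERMS)  # condition 2: weather-related token

def filter_stream(lines):
    """Yield the small fraction of tweets worth passing to the analytics code."""
    for line in lines:
        try:
            tweet = json.loads(line)
        except ValueError:
            continue                    # malformed or keep-alive line
        if keep_tweet(tweet):
            yield tweet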

December 5, 2013

U.S. Military Slang

Filed under: Language,Vocabularies — Patrick Durusau @ 2:35 pm

The definitive glossary of modern US military slang by Ben Brody.

From the post:

It’s painful for US soldiers to hear discussions and watch movies about modern wars when the dialogue is full of obsolete slang, like “chopper” and “GI.”

Slang changes with the times, and the military’s is no different. Soldiers fighting the wars in Iraq and Afghanistan have developed an expansive new military vocabulary, taking elements from popular culture as well as the doublespeak of the military industrial complex.

The US military drawdown in Afghanistan — which is underway but still awaiting the outcome of a proposed bilateral security agreement — is often referred to by soldiers as “the retrograde,” which is an old military euphemism for retreat. Of course the US military never “retreats” — rather it conducts a “tactical retrograde.”

This list is by no means exhaustive, and some of the terms originated prior to the wars in Afghanistan and Iraq. But these terms are critical to speaking the current language of soldiers, and understanding it when they speak to others. Please leave anything you think should be included in the comments.

Useful for documents that contain U.S. military slang, such as the Afghanistan War Diary.

As Ben notes at the outset, language changes over time so validate any vocabulary against your document/data set.

December 4, 2013

Free Language Lessons for Computers

Filed under: Data,Language,Natural Language Processing — Patrick Durusau @ 4:58 pm

Free Language Lessons for Computers by Dave Orr.

From the post:

50,000 relations from Wikipedia. 100,000 feature vectors from YouTube videos. 1.8 million historical infoboxes. 40 million entities derived from webpages. 11 billion Freebase entities in 800 million web documents. 350 billion words’ worth from books analyzed for syntax.

These are all datasets that we’ve shared with researchers around the world over the last year from Google Research.

A great summary of the major data drops by Google Research over the past year. In many cases it includes pointers to additional information on the datasets.

One that I have seen before and that strikes me as particularly relevant to topic maps is:

Dictionaries for linking Text, Entities, and Ideas

What is it: We created a large database of pairs of 175 million strings associated with 7.5 million concepts, annotated with counts, which were mined from Wikipedia. The concepts in this case are Wikipedia articles, and the strings are anchor text spans that link to the concepts in question.

Where can I find it: http://nlp.stanford.edu/pubs/crosswikis-data.tar.bz2

I want to know more: A description of the data, several examples, and ideas for uses for it can be found in a blog post or in the associated paper.

For most purposes, you would need far less than the full set of 7.5 million concepts. Imagine having the relevant concepts for a domain automatically “tagged” as you composed prose about it.

Certainly less error-prone than marking concepts by hand!
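As a sketch of how such a dictionary could drive that kind of “tagging,” here is a toy Python tagger. The entries below are invented for illustration; the real data maps 175 million anchor strings to Wikipedia concepts with counts:

# Toy subset of a string -> (concept, count) dictionary.
ANCHORS = {
    "topic maps": [("Topic_map", 420), ("Mind_map", 3)],
    "merging":    [("Merge_(topic_maps)", 60), ("Merge_(version_control)", 12)],
}

def best_concept(phrase):
    """Pick the most frequently linked concept for an anchor string."""
    candidates = ANCHORS.get(phrase.lower())
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])[0]

def tag(text, max_len=3):
    """Greedily tag phrases in the text with concepts, longest match first."""
    words = text.split()
    tags, i = [], 0
    while i < len(words):
        for n in range(max_len, 0, -1):
            concept = best_concept(" ".join(words[i:i + n]))
            if concept:
                tags.append((" ".join(words[i:i + n]), concept))
                i += n
                break
        else:
            i += 1
    return tags

print(tag("Merging topic maps by hand is error prone"))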

November 28, 2013

American Regional English dictionary going online (DARE)

Filed under: Dictionary,Language — Patrick Durusau @ 11:28 am

American Regional English dictionary going online by Scott Bauer.

From the post:

University of Wisconsin students and researchers set out in “word wagons” nearly 50 years ago to record the ways Americans spoke in various parts of the country.

Now, they’re doing it again, only virtually.

This time they won’t be lugging reel-to-reel tape recorders or sleeping in vans specially equipped with beds, stoves and sinks. Instead, work to update the Dictionary of American Regional English is being done in front of computers, reading online survey results.

“Of course, language changes and a lot of people have the notion that American English is becoming homogenized,” said Joan Houston Hall, who has worked on the dictionary since 1975 and served as its editor since 2000.

The only way to determine if that is true, though, is to do more research, she said.

The dictionary, known as DARE, has more than 60,000 entries exposing variances in the words, phrases, pronunciations, and pieces of grammar and syntax used throughout the country. Linguists consider it a national treasure, and it has been used by everyone from a criminal investigator in the 1990s tracking down the Unabomber to Hollywood dialect coaches trying to be as authentic as possible.

A great resource if you are creating topic maps for American literature during the time period in question.

Be aware that field work stopped in 1970 and any supplements will be by online survey:

Even though no new research has been done for the dictionary since 1970, Hall said she hopes it can be updated more frequently now that it is going online. The key will be gathering new data tracking how language has changed, or stayed the same, since the first round of field work ended 43 years ago.

But why not break out the 21st century version of the “word wagon” and head out in the field again?

“Because it would be way too expensive and time-consuming,” Hall said, laughing.

So, instead, Hall is loading up the virtual “word wagon” also known as the online survey.

For language usage, there is a forty-three (43) year gap in coverage. Use caution as the vocabulary you are researching moves away from 1970.

The continuation of the project by online surveys will only capture evidence from people who complete online surveys.

Keep that limitation in mind when using DARE after it resumes “online” field work.

Personally, I would prefer more complete field work over the noxious surveillance adventures by non-democratic elements of the U.S. government.

BTW, DARE Digital, from Harvard University Press, is reported to set you back $150/year.

November 19, 2013

Bridging Semantic Gaps

Filed under: Language,Lexicon,Linguistics,Sentiment Analysis — Patrick Durusau @ 4:50 pm

OK, the real title is: Cross-Language Opinion Lexicon Extraction Using Mutual-Reinforcement Label Propagation by Zheng Lin, Songbo Tan, Yue Liu, Xueqi Cheng, Xueke Xu. (Lin Z, Tan S, Liu Y, Cheng X, Xu X (2013) Cross-Language Opinion Lexicon Extraction Using Mutual-Reinforcement Label Propagation. PLoS ONE 8(11): e79294. doi:10.1371/journal.pone.0079294)

Abstract:

There is a growing interest in automatically building opinion lexicon from sources such as product reviews. Most of these methods depend on abundant external resources such as WordNet, which limits the applicability of these methods. Unsupervised or semi-supervised learning provides an optional solution to multilingual opinion lexicon extraction. However, the datasets are imbalanced in different languages. For some languages, the high-quality corpora are scarce or hard to obtain, which limits the research progress. To solve the above problems, we explore a mutual-reinforcement label propagation framework. First, for each language, a label propagation algorithm is applied to a word relation graph, and then a bilingual dictionary is used as a bridge to transfer information between two languages. A key advantage of this model is its ability to make two languages learn from each other and boost each other. The experimental results show that the proposed approach outperforms baseline significantly.

I have always wondered when someone would notice the WordNet database is limited to the English language. 😉

The authors are seeking to develop “…a language-independent approach for resource-poor language,” saying:

Our approach differs from existing approaches in the following three points: first, it does not depend on rich external resources and it is language-independent. Second, our method is domain-specific since the polarity of opinion word is domain-aware. We aim to extract the domain-dependent opinion lexicon (i.e. an opinion lexicon per domain) instead of a universal opinion lexicon. Third, the most importantly, our approach can mine opinion lexicon for a target language by leveraging data and knowledge available in another language…

Our approach propagates information back and forth between source language and target language, which is called mutual-reinforcement label propagation. The mutual-reinforcement label propagation model follows a two-stage framework. At the first stage, for each language, a label propagation algorithm is applied to a large word relation graph to produce a polarity estimate for any given word. This stage solves the problem of external resource dependency, and can be easily transferred to almost any language because all we need are unlabeled data and a couple of seed words. At the second stage, a bilingual dictionary is introduced as a bridge between source and target languages to start a bootstrapping process. Initially, information about the source language can be utilized to improve the polarity assignment in target language. In turn, the updated information of target language can be utilized to improve the polarity assignment in source language as well.

Two points of particular interest:

  1. The authors focus on creating domain specific lexicons and don’t attempt to boil the ocean. Useful semantic results will arrive sooner if you avoid attempts at universal solutions.
  2. English speakers are a large market, but the target of this exercise is the #1 language of the world, Mandarin Chinese.

    Taking the numbers for English speakers at face value, approximately 0.8 billion speakers, with a world population of 7.125 billion, that leaves 6.3 billion potential customers.

You’ve heard what they say: A billion potential customers here and a billion potential customers there, pretty soon you are talking about a real market opportunity. (The original quote is often misattributed to Sen. Everett Dirksen.)
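To make the first stage a bit more concrete, here is a toy label propagation sketch in Python with numpy. The word graph, seed words and weights are invented; the paper builds its graphs from large unlabeled corpora:

import numpy as np

# Toy word-relation graph: words that co-occur or are near-synonyms share an edge.
words = ["good", "excellent", "fine", "bad", "awful", "broken"]
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 5)]

n = len(words)
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

P = W / W.sum(axis=1, keepdims=True)     # each word averages over its neighbours

seeds = {"good": 1.0, "bad": -1.0}       # +1 positive, -1 negative
y = np.array([seeds.get(w, 0.0) for w in words])

scores = y.copy()
for _ in range(50):                      # propagate until roughly stable
    scores = 0.9 * P @ scores + 0.1 * y
    for w, label in seeds.items():       # clamp the seed words to their labels
        scores[words.index(w)] = label

for w, s in zip(words, scores):
    print(f"{w:10s} {s:+.2f}")

The unlabeled words pick up a polarity from their neighbours, which is all the first stage needs before the bilingual dictionary comes into play.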

November 3, 2013

A multi-Teraflop Constituency Parser using GPUs

Filed under: GPU,Grammar,Language,Parsers,Parsing — Patrick Durusau @ 4:45 pm

A multi-Teraflop Constituency Parser using GPUs by John Canny, David Hall and Dan Klein.

Abstract:

Constituency parsing with rich grammars remains a computational challenge. Graphics Processing Units (GPUs) have previously been used to accelerate CKY chart evaluation, but gains over CPU parsers were modest. In this paper, we describe a collection of new techniques that enable chart evaluation at close to the GPU’s practical maximum speed (a Teraflop), or around a half-trillion rule evaluations per second. Net parser performance on a 4-GPU system is over 1 thousand length-30 sentences/second (1 trillion rules/sec), and 400 general sentences/second for the Berkeley Parser Grammar. The techniques we introduce include grammar compilation, recursive symbol blocking, and cache-sharing.

Just in case you are interested in parsing “unstructured” data, which is mostly what also gets called “texts.”
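If you want to see what a CKY chart actually computes, minus the GPU and the rich grammar, here is a toy recognizer in Python for a tiny grammar in Chomsky normal form (both grammar and sentence invented):

from collections import defaultdict

# Tiny CNF grammar: binary rules (B, C) -> {A} and a word -> {tags} lexicon.
binary_rules = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"):  {"VP"},
}
lexicon = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"}}

def cky(words):
    n = len(words)
    chart = defaultdict(set)            # chart[(i, j)] = symbols spanning words[i:j]
    for i, w in enumerate(words):
        chart[(i, i + 1)] |= lexicon.get(w, set())
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):   # try every split point
                for b in chart[(i, k)]:
                    for c in chart[(k, j)]:
                        chart[(i, j)] |= binary_rules.get((b, c), set())
    return "S" in chart[(0, n)]

print(cky("the dog chased the cat".split()))   # True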

I first saw the link: BIDParse: GPU-accelerated natural language parser at hgup.org. Then I started looking for the paper. 😉

October 16, 2013

Exploiting Discourse Analysis…

Filed under: Discourse,Language,Rhetoric,Temporal Semantic Analysis — Patrick Durusau @ 6:49 pm

Exploiting Discourse Analysis for Article-Wide Temporal Classification by Jun-Ping Ng, Min-Yen Kan, Ziheng Lin, Wei Feng, Bin Chen, Jian Su, Chew-Lim Tan.

Abstract:

In this paper we classify the temporal relations between pairs of events on an article-wide basis. This is in contrast to much of the existing literature which focuses on just event pairs which are found within the same or adjacent sentences. To achieve this, we leverage on discourse analysis as we believe that it provides more useful semantic information than typical lexico-syntactic features. We propose the use of several discourse analysis frameworks, including 1) Rhetorical Structure Theory (RST), 2) PDTB-styled discourse relations, and 3) topical text segmentation. We explain how features derived from these frameworks can be effectively used with support vector machines (SVM) paired with convolution kernels. Experiments show that our proposal is effective in improving on the state-of-the-art significantly by as much as 16% in terms of F1, even if we only adopt less-than-perfect automatic discourse analyzers and parsers. Making use of more accurate discourse analysis can further boost gains to 35%

Cutting edge of discourse analysis, which should be interesting if you are automatically populating topic maps based upon textual analysis.

It won’t be perfect, but even human editors are not perfect. (Or so rumor has it.)

A robust topic map system should accept, track, and, if approved, apply user-submitted corrections and changes.

October 11, 2013

Quick Etymology

Filed under: Interface Research/Design,Language — Patrick Durusau @ 10:34 am

A tweet by Norm Walsh observes:

“Etymology of the word ___” in Google gives a railroad diagram answer on the search results page. Nice.

That, along with “define ____”, is suggestive of short-cuts for a topic map interface.

Yes?

Thinking of: “Relationships with _____”

Of course, Tiger Woods would be a supernode (“…a vertex with a disproportionately high number of incident edges.”). 😉

October 9, 2013

Logical and Computational Structures for Linguistic Modeling

Filed under: Language,Linguistics,Logic,Modeling — Patrick Durusau @ 7:25 pm

Logical and Computational Structures for Linguistic Modeling

From the webpage:

Computational linguistics employs mathematical models to represent morphological, syntactic, and semantic structures in natural languages. The course introduces several such models while insisting on their underlying logical structure and algorithmics. Quite often these models will be related to mathematical objects studied in other MPRI courses, for which this course provides an original set of applications and problems.

The course is not a substitute for a full cursus in computational linguistics; it rather aims at providing students with a rigorous formal background in the spirit of MPRI. Most of the emphasis is put on the symbolic treatment of words, sentences, and discourse. Several fields within computational linguistics are not covered, prominently speech processing and pragmatics. Machine learning techniques are only very sparsely treated; for instance we focus on the mathematical objects obtained through statistical and corpus-based methods (i.e. weighted automata and grammars) and the associated algorithms, rather than on automated learning techniques (which is the subject of course 1.30).

Abundant supplemental materials, slides, notes, further references.

In particular you may like Notes on Computational Aspects of Syntax by Sylvain Schmitz, that cover the first part of Logical and Computational Structures for Linguistic Modeling.

As with any model, there are trade-offs and assumptions built into nearly every choice.

Knowing where to look for those trade-offs and assumptions will give you a response to: “Well, but the model shows that….”

September 28, 2013

Language support and linguistics

Filed under: Language,Lucene,Solr — Patrick Durusau @ 7:31 pm

Language support and linguistics in Apache Lucene™ and Apache Solr™ and the eco-system by Gaute Lambertsen and Christian Moen.

Slides from Lucene Revolution May, 2013.

Good overview of language support and linguistics in both Lucene and Solr.

Fewer language examples at the beginning would shorten the slide deck from its current one hundred and fifty-one (151) count without impairing its message.

Still, if you are unfamiliar with language support in Lucene and Solr, the extra examples don’t hurt anything.

September 27, 2013

Quantifying the Language of British Politics, 1880-1914

Filed under: Corpora,History,Language,Politics — Patrick Durusau @ 1:17 pm

Quantifying the Language of British Politics, 1880-1914

Abstract:

This paper explores the power, potential, and challenges of studying historical political speeches using a specially constructed multi-million word corpus via quantitative computer software. The techniques used – inspired particularly by Corpus Linguists – are almost entirely novel in the field of political history, an area where research into language is conducted nearly exclusively qualitatively. The paper argues that a corpus gives us the crucial ability to investigate matters of historical interest (e.g. the political rhetoric of imperialism, Ireland, and class) in a more empirical and systematic manner, giving us the capacity to measure scope, typicality, and power in a massive text like a national general election campaign which it would be impossible to read in entirety.

The paper also discusses some of the main arguments against this approach which are commonly presented by critics, and reflects on the challenges faced by quantitative language analysis in gaining more widespread acceptance and recognition within the field.

Points to a podcast by Luke Blaxill presenting the results of his Ph.D. research.

Luke Blaxill’s dissertation: The Language of British Electoral Politics 1880-1910.

Important work that strikes a balance between a “close reading” of the relevant texts and using a one million word corpus (two corpora actually) to trace language usage.

Think of it as the opposite of tools that flatten the meaning of words across centuries.
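For a taste of the kind of measurement such a corpus makes possible, here is a minimal Python sketch that tracks a keyword’s relative frequency by election year. The corpus layout and sample texts are invented, not Blaxill’s:

from collections import Counter
import re

# Hypothetical corpus: election year -> list of speech texts.
corpus = {
    1880: ["the question of ireland ...", "home rule for ireland ..."],
    1900: ["the empire must be defended ...", "imperial preference ..."],
}

def relative_frequency(term, texts):
    """Occurrences of `term` per 10,000 words across the given texts."""
    counts = Counter()
    total = 0
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(tokens)
        total += len(tokens)
    return 10000.0 * counts[term] / total if total else 0.0

for year, speeches in sorted(corpus.items()):
    print(year, round(relative_frequency("ireland", speeches), 1))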

August 17, 2013

Got Genitalia?

Filed under: Humor,Language — Patrick Durusau @ 6:23 pm

Extensive timelines of slang for genitalia by Nathan Yau.

Nathan has discovered interactive time lines of slang for male and female genitalia. (Goes back to 1250-1300 CE.)

If you know Anthony Weiner, please forward these links to his attention.

If you don’t know Anthony Weiner, take this as an opportunity to expand your twitter repartee.

August 14, 2013

Glottolog

Filed under: Language — Patrick Durusau @ 6:46 pm

Glottolog

From the webpage:

Comprehensive reference information for the world’s languages, especially the lesser known languages.

Information about the different languages, dialects, and families of the world (‘languoids’) is available in the Languoid section. The Langdoc section contains bibliographical information. (…)

Languoid catalogue

Glottolog provides a comprehensive catalogue of the world’s languages, language families and dialects. It assigns a unique and stable identifier (the Glottocode) to (in principle) all languoids, i.e. all families, languages, and dialects. Any variety that a linguist works on should eventually get its own entry. The languoids are organized via a genealogical classification (the Glottolog tree) that is based on available historical-comparative research (see also the Languoids information section).

Langdoc

Langdoc is a comprehensive collection of bibliographical data for the world’s lesser known languages. It provides access to more than 180,000 references of descriptive works such as grammars, dictionaries, word lists, texts etc. Search criteria include author, year, title, country, and genealogical affiliation. References can be downloaded as txt, bib, html, or with the Zotero Firefox plugin.

Interesting language resource.

The authors are interested in additional bibliographies in any format: glottolog@eva.mpg.de

August 3, 2013

in the dark heart of a language model lies madness….

Filed under: Corpora,Language,Natural Language Processing — Patrick Durusau @ 4:32 pm

in the dark heart of a language model lies madness…. by Chris.

From the post:

This is the second in a series of posts detailing experiments with the Java Graphical Authorship Attribution Program. The first post is here.

[screenshot: JGAAP results window]

In my first run (seen above), I asked JGAAP to normalize for white space, strip punctuation, turn everything into lowercase. Then I had it run a Naive Bayes classifier on the top 50 tri-grams from the three known authors (Shakespeare, Marlowe, Bacon) and one unknown author (Shakespeare’s sonnets).

Based on that sample, JGAAP came to the conclusion that Francis Bacon wrote the sonnets. We know that because it lists its guesses in order from best to worst in the left window in the above image. Bacon is on top. This alone is cause to start tinkering with the model, but the results didn’t look as flat weird until I looked at the image again today. It lists the probability that the sonnets were written by Bacon as 1. A probability of 1 typically means absolute certainty. So this model, given the top 50 trigrams, is absolutely certain that Francis Bacon wrote those sonnets … Bullshit. A probabilistic model is never absolutely certain of anything. That’s what makes it probabilistic, right?

So where’s the bug? Turns out, it might have been poor data management on my part. I didn’t bother to sample in any kind of fair and reasonable way. Here are my corpora:

(…)

You may not be a stakeholder in the Shakespeare vs. Bacon debate, but you are likely to encounter questions about the authorship of data. Particularly text data.

The tool that Chris describes is a great introduction to that type of analysis.
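To see where a “probability of 1” can come from, here is a toy character trigram scorer in Python, in the spirit of the JGAAP run but not its code. The author snippets are stand-ins, and note that multiplying many small probabilities underflows or saturates, which is why the sketch works in log space:

import math
from collections import Counter

def trigrams(text):
    text = "".join(ch.lower() for ch in text if ch.isalpha() or ch == " ")
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def log_score(unknown, known, alpha=1.0):
    """Additively smoothed log-likelihood of the unknown text under an
    author's trigram distribution (a crude stand-in for Naive Bayes)."""
    profile = trigrams(known)
    total = sum(profile.values())
    vocab = len(profile) + 1
    score = 0.0
    for gram, count in trigrams(unknown).items():
        p = (profile.get(gram, 0) + alpha) / (total + alpha * vocab)
        score += count * math.log(p)
    return score

# Stand-in snippets; real experiments need balanced, sizeable corpora.
authors = {
    "Shakespeare": "shall i compare thee to a summers day thou art more lovely",
    "Bacon": "knowledge is power and the reading of good books is conversation",
    "Marlowe": "was this the face that launched a thousand ships",
}
unknown = "thou art more temperate rough winds do shake the darling buds"

scores = {name: log_score(unknown, text) for name, text in authors.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {s:8.1f}")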

August 2, 2013

A new version of the Compact Language Detector

Filed under: Language — Patrick Durusau @ 6:36 pm

A new version of the Compact Language Detector by Mike McCandless.

From the post:

It’s been almost two years since I originally factored out the fast and accurate Compact Language Detector from the Chromium project, and the effort was clearly worthwhile: the project is popular and others have created additional bindings for languages including at least Perl, Ruby, R, JavaScript, PHP and C#/.NET.

Eric Fischer used CLD to create the colorful Twitter language map, and since then further language maps have appeared, e.g. for New York and London. What a multi-lingual world we live in!

Suddenly, just a few weeks ago, I received an out-of-the-blue email from Dick Sites, creator of CLD, with great news: he was finishing up version 2.0 of CLD and had already posted the source code on a new project.

So I’ve now reworked the Python bindings and ported the unit tests to Python (they pass!) to take advantage of the new features. It was much easier this time around since the CLD2 sources were already pulled out into their own project (thank you Dick and Google!).

Great library if you need to detect languages.
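A minimal sketch of calling the Python bindings, assuming the pycld2 package (the return format here is from memory, so check the project README):

import pycld2 as cld2

text = "Ceci n'est pas une pipe, mais c'est bien du français."

# detect() returns (isReliable, bytesFound, details); details lists the
# top languages as (name, code, percent, score) tuples.
is_reliable, bytes_found, details = cld2.detect(text)

print("reliable:", is_reliable)
for name, code, percent, score in details:
    if code != "un":                      # skip the "unknown" filler entries
        print(f"{name:10s} {code}  {percent}%")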

I understand some of the security agencies think use of a non-English language is a dot to be connected.

July 16, 2013

Abbot MorphAdorner collaboration

Filed under: Data,Language — Patrick Durusau @ 9:12 am

Abbot MorphAdorner collaboration

From the webpage:

The Center for Digital Research in the Humanities at the University of Nebraska and Northwestern University’s Academic and Academic Research Technologies are pleased to announce the first fruits of a collaboration between the Abbot and EEBO-MorphAdorner projects: the release of some 2,000 18th century texts from the TCP-ECCO collections in a TEI-P5 format and with linguistic annotation. More texts will follow shortly, subject to the access restrictions that will govern the use of TCP texts for the remainder of this decade.

The Text Creation Partnership (TCP) collection currently consists of about 50,000 fully transcribed SGML texts from the first three centuries of English print culture. The collection will grow to approximately 75,000 volumes and will contain at least one copy of every book published before 1700 as well as substantial samples of 18th century texts published in the British Isles or North America. The ECCO-TCP texts are already in the public domain. The other texts will follow them between 2014 and 2015. The Evans texts will be released in June 2014, followed by a release of some 25,000 EEBO texts in 2015.

It is a major goal of the Abbot and EEBO MorphAdorner collaboration to turn the TCP texts into the foundation for a “Book of English,” defined as

  • a large, growing, collaboratively curated, and public domain corpus
  • of written English since its earliest modern form
  • with full bibliographical detail
  • and light but consistent structural and linguistic annotation

Texts in the annotated TCP corpus will exist in more than one format so as to facilitate different uses to which they are likely to be put. In a first step, Abbot transforms the SGML source text into a TEI P5 XML format. Abbot, a software program designed by Brian Pytlik Zillig and Stephen Ramsay, can read arbitrary XML files and convert them into other XML formats or a shared format. Abbot generates its own set of conversion routines at runtime by reading an XML schema file and programmatically effecting the desired transformations. It is an excellent tool for creating an environment in which texts originating in separate projects can acquire a higher degree of interoperability. A prototype of Abbot was used in the MONK project to harmonize texts from several collections, including the TCP, Chadwyck-Healey’s Nineteenth-Century Fiction, the Wright Archive of American novels 1851-1875, and Documenting the American South.

This first transformation maintains all the typographical data recorded in the original SGML transcription, including long ‘s’, printer’s abbreviations, superscripts etc. In a second step MorphAdorner tokenizes this file. MorphAdorner was developed by Philip R. Burns. It is a multi-purpose suite of NLP tools with special features for the tokenization, analysis, and annotation of historical corpora. The tokenization uses algorithms and heuristics specific to the practices of Early Modern print culture, wraps every word token in a <w> element with a unique ID, and explicitly marks sentence boundaries.

In the next step (conceptually different but merged in practice with the previous), some typographical features are removed from the tokenized text, but all such changes are recorded in a change log and may therefore be reversed. The changes aim at making it easier to manipulate the corpus with software tools that presuppose modern printing practices. They involve such things as replacing long ‘s’ with plain ‘s’, or resolving unambiguous printer’s abbreviations and superscripts.

Talk about researching across language as it changes!

This is way cool!

Lots of opportunities for topic map-based applications.
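As a toy illustration of the MorphAdorner-style step (my sketch, not their code), here is Python that wraps word tokens in <w> elements with unique IDs and marks sentence boundaries, using the standard library’s ElementTree:

import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"   # serializes as xml:id

def adorn(paragraph_text, para_id="p1"):
    """Wrap each word token in a <w> element with a unique ID and group
    tokens into <s> (sentence) elements. Toy tokenization only."""
    p = ET.Element("p")
    sentences = re.split(r"(?<=[.!?])\s+", paragraph_text.strip())
    token_no = 0
    for s_no, sentence in enumerate(sentences, start=1):
        s = ET.SubElement(p, "s", {"n": str(s_no)})
        for token in re.findall(r"\w+|[^\w\s]", sentence):
            token_no += 1
            w = ET.SubElement(s, "w", {XML_ID: f"{para_id}-w{token_no:04d}"})
            w.text = token
    return p

elem = adorn("Long s was printed as a distinct glyph. Printers also used abbreviations.")
print(ET.tostring(elem, encoding="unicode"))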

For more information:

Abbot Text Interoperability Tool

Download texts here

July 15, 2013

Corporate Culture Clash:…

Filed under: Communication,Diversity,Heterogeneous Data,Language,Marketing,Semantics — Patrick Durusau @ 3:05 pm

Corporate Culture Clash: Getting Data Analysts and Executives to Speak the Same Language by Drew Rockwell

From the post:

A colleague recently told me a story about the frustration of putting in long hours and hard work, only to be left feeling like nothing had been accomplished. Architecture students at the university he attended had scrawled their frustrations on the wall of a campus bathroom…“I wanted to be an architect, but all I do is create stupid models,” wrote students who yearned to see their ideas and visions realized as staples of metropolitan skylines. I’ve heard similar frustrations expressed by business analysts who constantly face the same uphill battle. In fact, in a recent survey we did of 600 analytic professionals, some of the biggest challenges they cited were “getting MBAs to accept advanced methods”, getting executives to buy into the potential of analytics, and communicating with “pointy-haired” bosses.

So clearly, building the model isn’t enough when it comes to analytics. You have to create an analytics-driven culture that actually gets everyone paying attention, participating and realizing what analytics has to offer. But how do you pull that off? Well, there are three things that are absolutely critical to building a successful, analytics-driven culture. Each one links to the next and bridges the gap that has long divided analytics professionals and business executives.

Some snippets to attract you to this “must read:”

(…)
In the culinary world, they say you eat with your eyes before your mouth. A good visual presentation can make your mouth water, while a bad one can kill your appetite. The same principle applies when presenting data analytics to corporate executives. You have to show them something that stands out, that they can understand and that lets them see with their own eyes where the value really lies.
(…)
One option for agile integration and analytics is data discovery – a type of analytic approach that allows business people to explore data freely so they can see things from different perspectives, asking new questions and exploring new hypotheses that could lead to untold benefits for the entire organization.
(…)
If executives are ever going to get on board with analytics, the cost of their buy-in has to be significantly lowered, and the ROI has to be clear and substantial.
(…)

I picked the quotes most “relevant” to topic maps, but they are just as valid for any other approach.

Seeing from different perspectives sounds like on-the-fly merging to me.

How about you?

June 10, 2013

When will my computer understand me?

Filed under: Language,Markov Decision Processes,Semantics,Translation — Patrick Durusau @ 2:57 pm

When will my computer understand me?

From the post:

It’s not hard to tell the difference between the “charge” of a battery and criminal “charges.” But for computers, distinguishing between the various meanings of a word is difficult.

For more than 50 years, linguists and computer scientists have tried to get computers to understand human language by programming semantics as software. Driven initially by efforts to translate Russian scientific texts during the Cold War (and more recently by the value of information retrieval and data analysis tools), these efforts have met with mixed success. IBM’s Jeopardy-winning Watson system and Google Translate are high profile, successful applications of language technologies, but the humorous answers and mistranslations they sometimes produce are evidence of the continuing difficulty of the problem.

Our ability to easily distinguish between multiple word meanings is rooted in a lifetime of experience. Using the context in which a word is used, an intrinsic understanding of syntax and logic, and a sense of the speaker’s intention, we intuit what another person is telling us.

“In the past, people have tried to hand-code all of this knowledge,” explained Katrin Erk, a professor of linguistics at The University of Texas at Austin focusing on lexical semantics. “I think it’s fair to say that this hasn’t been successful. There are just too many little things that humans know.”

Other efforts have tried to use dictionary meanings to train computers to better understand language, but these attempts have also faced obstacles. Dictionaries have their own sense distinctions, which are crystal clear to the dictionary-maker but murky to the dictionary reader. Moreover, no two dictionaries provide the same set of meanings — frustrating, right?

Watching annotators struggle to make sense of conflicting definitions led Erk to try a different tactic. Instead of hard-coding human logic or deciphering dictionaries, why not mine a vast body of texts (which are a reflection of human knowledge) and use the implicit connections between the words to create a weighted map of relationships — a dictionary without a dictionary?

“An intuition for me was that you could visualize the different meanings of a word as points in space,” she said. “You could think of them as sometimes far apart, like a battery charge and criminal charges, and sometimes close together, like criminal charges and accusations (“the newspaper published charges…”). The meaning of a word in a particular context is a point in this space. Then we don’t have to say how many senses a word has. Instead we say: ‘This use of the word is close to this usage in another sentence, but far away from the third use.'”

Before you jump to the post looking for code, note that Erk is working with a 10,000-dimension space to analyze her data.
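Here is a toy numpy sketch of the “meanings as points in space” idea, with tiny hand-made context vectors rather than Erk’s 10,000 dimensions:

import numpy as np

# Toy context vectors: co-occurrence counts with the context words
# (battery, police, court, accuse, volt, newspaper).
vectors = {
    "charge_battery":  np.array([9, 0, 0, 0, 7, 0], dtype=float),
    "charge_criminal": np.array([0, 8, 6, 5, 0, 2], dtype=float),
    "accusation":      np.array([0, 5, 4, 9, 0, 3], dtype=float),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pairs = [("charge_battery", "charge_criminal"),
         ("charge_criminal", "accusation"),
         ("charge_battery", "accusation")]
for x, y in pairs:
    print(f"{x:16s} ~ {y:16s} {cosine(vectors[x], vectors[y]):.2f}")

The criminal “charge” lands near “accusation” and far from the battery “charge,” which is exactly the intuition in the quote above.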

The most recent paper: Montague Meets Markov: Deep Semantics with Probabilistic Logical Form (2013)

Abstract:

We combine logical and distributional representations of natural language meaning by transforming distributional similarity judgments into weighted inference rules using Markov Logic Networks (MLNs). We show that this framework supports both judging sentence similarity and recognizing textual entailment by appropriately adapting the MLN implementation of logical connectives. We also show that distributional phrase similarity, used as textual inference rules created on the fly, improves its performance.

April 24, 2013

Weapons of Mass Destruction Were In Iraq

Filed under: Government,Language — Patrick Durusau @ 1:26 pm

It is commonly accepted that no weapons of mass destruction were found after the invasion of Iraq by Bush II.

But is that really true?

To credit that claim, you would have to be unable to find a common pressure cooker in Iraq.

The FBI apparently considers bombs made using pressure cookers to be “weapons of mass destruction.”

How remarkable. I have one of the big pressure canners. That must be the H-Bomb of pressure cookers. 😉

“Weapon of mass destruction” gets even vaguer when you get into the details.

18 USC § 2332a – Use of weapons of mass destruction refers you to another section, “any destructive device as defined in section 921 of this title,” to find the definition.

And, 18 USC § 921 – Definitions reads in relevant part:

(4) The term “destructive device” means—
(A) any explosive, incendiary, or poison gas—
(i) bomb,
(ii) grenade,
(iii) rocket having a propellant charge of more than four ounces,
(iv) missile having an explosive or incendiary charge of more than one-quarter ounce,
(v) mine, or
(vi) device similar to any of the devices described in the preceding clauses;

Maybe Bush II should have asked the FBI to hunt for “weapons of mass destruction” in Iraq.

They would not have come home empty handed.


If this seems insensitive, remember government debasement of language contributes to the lack of sane discussions about national security.

Discussions that could have led to better information sharing and possibly the stopping of some crimes.

Yes, crimes, not acts of terrorism. Crimes are solved by old fashioned police work.

Fear of acts of terrorism leads to widespread monitoring of electronic communications, loss of privacy, etc.

As shown in the Boston incident, national security monitoring played no role in stopping the attack or apprehending the suspects.

Traditional law enforcement did.

Why is the most effective tool against crime not a higher priority?

March 17, 2013

Twitter users forming tribes with own language…

Filed under: Language,Tribes,Usage — Patrick Durusau @ 1:28 pm

Twitter users forming tribes with own language, tweet analysis shows by Jason Rodrigues.

From the post:

Twitter users are forming ‘tribes’, each with their own language, according to a scientific analysis of millions of tweets.

The research on Twitter word usage throws up a pattern of behaviour that seems to contradict the commonly held belief that users simply want to share everything with everyone.

In fact, the findings point to a more precise use of social media where users frequently include keywords in their tweets so that they engage more effectively with other members of their community or tribe. Just like our ancestors we try to join communities based on our political interests, ethnicity, work and hobbies.

And just like our ancestors our communities have semantics unique to those communities.

Always a pleasure to see people replicating the semantic diversity that keeps data curation techniques relevant.

Like topic maps mapping between language tribes.

See Jason’s post for the details but of particular interest is the placing of people into Twitter tribes based on usage.
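A minimal sketch of the “tribes” idea, assuming the networkx package (the edges are invented): link users who share enough distinctive vocabulary and pull out communities by modularity.

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Invented edges: users joined when their tweets share enough distinctive words.
shared_vocabulary = [
    ("alice", "bob"), ("bob", "carol"), ("alice", "carol"),     # tribe 1
    ("dave", "erin"), ("erin", "frank"), ("dave", "frank"),     # tribe 2
    ("carol", "dave"),                                          # a weak bridge
]

G = nx.Graph()
G.add_edges_from(shared_vocabulary)

for i, community in enumerate(greedy_modularity_communities(G), start=1):
    print(f"tribe {i}: {sorted(community)}")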

February 15, 2013

Capturing the “Semantic Differential”?

Filed under: Language,Semantics — Patrick Durusau @ 11:51 am

Reward Is Assessed in Three Dimensions That Correspond to the Semantic Differential by John G. Fennell and Roland J. Baddeley. (Fennell JG, Baddeley RJ (2013) Reward Is Assessed in Three Dimensions That Correspond to the Semantic Differential. PLoS ONE 8(2): e55588. doi:10.1371/journal.pone.0055588)

Abstract:

If choices are to be made between alternatives like should I go for a walk or grab a coffee, a ‘common currency’ is needed to compare them. This quantity, often known as reward in psychology and utility in economics, is usually conceptualised as a single dimension. Here we propose that to make a comparison between different options it is important to know not only the average reward, but also both the risk and level of certainty (or control) associated with an option. Almost all objects can be the subject of choice, so if these dimensions are required in order to make a decision, they should be part of the meaning of those objects. We propose that this ubiquity is unique, so if we take an average over many concepts and domains these three dimensions (reward, risk, and uncertainty) should emerge as the three most important dimensions in the “meaning” of objects. We investigated this possibility by relating the three dimensions of reward to an old, robust and extensively studied factor analytic instrument known as the semantic differential. Across a very wide range of situations, concepts and cultures, factor analysis shows that 50% of the variance in rating scales is accounted for by just three dimensions, with these dimensions being Evaluation, Potency, and Activity [1]. Using a statistical analysis of internet blog entries and a betting experiment, we show that these three factors of the semantic differential are strongly correlated with the reward history associated with a given concept: Evaluation measures relative reward; Potency measures absolute risk; and Activity measures the uncertainty or lack of control associated with a concept. We argue that the 50% of meaning captured by the semantic differential is simply a summary of the reward history that allows decisions to be made between widely different options.

Semantic Differential” as defined by Wikipedia:

Semantic differential is a type of a rating scale designed to measure the connotative meaning of objects, events, and concepts. The connotations are used to derive the attitude towards the given object, event or concept.

Invented over 50 years ago, semantic differential scales, which rank a concept on a scale anchored by opposites such as good-evil, have proven to be very useful.

What the scale was measuring, despite its success, was unknown. (It may still be, depending on how persuasive you find the authors’ proposal.)

The proposal merits serious discussion and additional research but I am leery about relying on blogs as representative of language usage.

Or rather I take blogs as representative of people who blog, which is a decided minority of all language users.

Just as I would take transcripts of “Sex and the City” as representing the fantasies of socially deprived writers. Interesting perhaps but not the same as the mores of New York City. (If that lowers your expectations about a trip to New York City, my apologies.)

January 7, 2013

Machine Learning based Vocabulary Management Tool

Filed under: Language,Vocabularies,VocBench — Patrick Durusau @ 6:55 am

Machine Learning based Vocabulary Management Tool – Assessment for the Linked Open Data by Ahsan Morshed and Ritaban Dutta.

Abstract:

Reusing domain vocabularies in the context of developing the knowledge based Linked Open data system is the most important discipline on the web. Many editors are available for developing and managing the vocabularies or Ontologies. However, selecting the most relevant editor is very difficult since each vocabulary construction initiative requires its own budget, time, resources. In this paper a novel unsupervised machine learning based comparative assessment mechanism has been proposed for selecting the most relevant editor. Defined evaluation criterions were functionality, reusability, data storage, complexity, association, maintainability, resilience, reliability, robustness, learnability, availability, flexibility, and visibility. Principal component analysis (PCA) was applied on the feedback data set collected from a survey involving sixty users. Focus was to identify the least correlated features carrying the most independent information variance to optimize the tool selection process. An automatic evaluation method based on Bagging Decision Trees has been used to identify the most suitable editor. Three tools namely Vocbench, TopBraid EVN and Pool Party Thesaurus Manager have been evaluated. Decision tree based analysis recommended the Vocbench and the Pool Party Thesaurus Manager are the better performer than the TopBraid EVN tool with very similar recommendation scores.

With the caveat that sixty (60) users in your organization (the number surveyed in this study) might reach different results, this is a useful study of vocabulary software.

It is more useful for its evaluation criteria for vocabulary software than as an absolute guide to the appropriate software.

I first saw this at: New article on vocabulary management tools.

English Letter Frequency Counts: Mayzner Revisited…

Filed under: Language,Linguistics — Patrick Durusau @ 6:27 am

English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU by Peter Norvig.

From the post:

On December 17th 2012, I got a nice letter from Mark Mayzner, a retired 85-year-old researcher who studied the frequency of letter combinations in English words in the early 1960s. His 1965 publication has been cited in hundreds of articles. Mayzner describes his work:

I culled a corpus of 20,000 words from a variety of sources, e.g., newspapers, magazines, books, etc. For each source selected, a starting place was chosen at random. In proceeding forward from this point, all three, four, five, six, and seven-letter words were recorded until a total of 200 words had been selected. This procedure was duplicated 100 times, each time with a different source, thus yielding a grand total of 20,000 words. This sample broke down as follows: three-letter words, 6,807 tokens, 187 types; four-letter words, 5,456 tokens, 641 types; five-letter words, 3,422 tokens, 856 types; six-letter words, 2,264 tokens, 868 types; seven-letter words, 2,051 tokens, 924 types. I then proceeded to construct tables that showed the frequency counts for three, four, five, six, and seven-letter words, but most importantly, broken down by word length and letter position, which had never been done before to my knowledge.

and he wonders if:

perhaps your group at Google might be interested in using the computing power that is now available to significantly expand and produce such tables as I constructed some 50 years ago, but now using the Google Corpus Data, not the tiny 20,000 word sample that I used.

The answer is: yes indeed, I am interested! And it will be a lot easier for me than it was for Mayzner. Working 60s-style, Mayzner had to gather his collection of text sources, then go through them and select individual words, punch them on Hollerith cards, and use a card-sorting machine.

Peter rises to the occasion, using thirty-seven (37) times as much data as Mayzner. Not to mention detailing his analysis and posting the resulting data sets for more analysis.
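A small Python sketch of the kind of table Mayzner built, counting letters by word length and position, run here over a toy word list rather than the Google corpus:

from collections import Counter, defaultdict

# Toy corpus; Mayzner used 20,000 words, Norvig used Google's book n-grams.
words = "the quick brown fox jumps over the lazy dog and the small cat".split()

# counts[(word_length, position)] is a Counter of letters at that position.
counts = defaultdict(Counter)
for word in words:
    word = word.lower()
    for pos, letter in enumerate(word, start=1):
        counts[(len(word), pos)][letter] += 1

# Most common letter in each slot of three-letter words.
for pos in range(1, 4):
    letter, n = counts[(3, pos)].most_common(1)[0]
    print(f"3-letter words, position {pos}: '{letter}' ({n} times)")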

November 29, 2012

Notation as a Tool of Thought

Filed under: Language,Language Design,Programming — Patrick Durusau @ 1:33 pm

Notation as a Tool of Thought by Kenneth E. Iverson.

From the introduction:

Nevertheless, mathematical notation has serious deficiencies. In particular, it lacks universality, and must be interpreted differently according to the topic, according to the author, and even according to the immediate context. Programming languages, because they were designed for the purpose of directing computers, offer important advantages as tools of thought. Not only are they universal (general-purpose), but they are also executable and unambiguous. Executability makes it possible to use computers to perform extensive experiments on ideas expressed in a programming language, and the lack of ambiguity makes possible precise thought experiments. In other respects, however, most programming languages are decidedly inferior to mathematical notation and are little used as tools of thought in ways that would be considered significant by, say, an applied mathematician.

The thesis of the present paper is that the advantages of executability and universality found in programming languages can be effectively combined, in a single coherent language, with the advantages offered by mathematical notation.

Will expose you to APL but that’s not a bad thing. The history of reasoning about data structures can be interesting and useful.

Iverson’s response to critics of the algorithms in this work was in part as follows:

…overemphasis of efficiency leads to an unfortunate circularity in design: for reasons of efficiency early programming languages reflected the characteristics of the early computers, and each generation of computers reflects the needs of the programming languages of the preceding generation. (5.4 Mode of Presentation)

A good reason to understand the nature of a problem before reaching for the keyboard.

November 17, 2012

Mining a multilingual association dictionary from Wikipedia…

Filed under: Language,Multilingual,Topic Maps,Wikipedia — Patrick Durusau @ 5:53 pm

Mining a multilingual association dictionary from Wikipedia for cross-language information retrieval by Zheng Ye, Jimmy Xiangji Huang, Ben He, Hongfei Lin.

Abstract:

Wikipedia is characterized by its dense link structure and a large number of articles in different languages, which make it a notable Web corpus for knowledge extraction and mining, in particular for mining the multilingual associations. In this paper, motivated by a psychological theory of word meaning, we propose a graph-based approach to constructing a cross-language association dictionary (CLAD) from Wikipedia, which can be used in a variety of cross-language accessing and processing applications. In order to evaluate the quality of the mined CLAD, and to demonstrate how the mined CLAD can be used in practice, we explore two different applications of the mined CLAD to cross-language information retrieval (CLIR). First, we use the mined CLAD to conduct cross-language query expansion; and, second, we use it to filter out translation candidates with low translation probabilities. Experimental results on a variety of standard CLIR test collections show that the CLIR retrieval performance can be substantially improved with the above two applications of CLAD, which indicates that the mined CLAD is of sound quality.

Is there a lesson here about using Wikipedia as a starter set of topics across languages?

Not the final product but a starting place other than ground zero for creation of a multi-lingual topic map.
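As a sketch of bootstrapping that starter set, here is a Python example that pulls an article’s cross-language links from the public MediaWiki API with requests (parameters as I recall them; verify against the API documentation):

import requests

API = "https://en.wikipedia.org/w/api.php"

def language_links(title):
    """Return {language_code: title_in_that_language} for one article."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "500",
        "format": "json",
    }
    data = requests.get(API, params=params, timeout=10).json()
    page = next(iter(data["query"]["pages"].values()))
    return {link["lang"]: link["*"] for link in page.get("langlinks", [])}

links = language_links("Topic map")
for lang in ("de", "fr", "ja"):
    print(lang, links.get(lang))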
