Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 23, 2012

Linguistic Society of America (LSA)

Filed under: Language,Linguistics — Patrick Durusau @ 10:01 am

Linguistic Society of America (LSA)

The membership page says:

The Linguistic Society of America is the major professional society in the United States that is exclusively dedicated to the advancement of the scientific study of language. With nearly 4,000 members, the LSA speaks on behalf of the field of linguistics and also serves as an advocate for sound educational and political policies that affect not only professionals and students of language, but virtually all segments of society. Founded in 1924, the LSA has on many occasions made the case to governments, universities, foundations, and the public to support linguistic research and to see that our scientific discoveries are effectively applied.

Language and linguistics are important in the description of numeric data but even more so for non-numeric data.

Another avenue to sharpen your skills at both.

PS: I welcome your suggestions of other language/linguistic institutions and organizations. Even if our machines don’t understand natural language, our users do.

July 7, 2012

Natural Language Processing | Hub

Natural Language Processing | Hub

From the “about” page:

NLP|Hub is an aggregator of news about Natural Language Processing and other related topics, such as Text Mining, Information Retrieval, Linguistics or Machine Learning.

NLP|Hub finds, collects and arranges related news from different sites, from academic webs to company blogs.

NLP|Hub is a product of Cilenis, a company specialized in Natural Language Processing.

If you have interesting posts for NLP|Hub, or if you do not want NLP|Hub indexing your text, please contact us at info@cilenis.com

Definitely going on my short list of sites to check!

May 9, 2012

GATE Teamware: Collaborative Annotation Factories (HOT!)

GATE Teamware: Collaborative Annotation Factories

From the webpage:

Teamware is a web-based management platform for collaborative annotation & curation. It is a cost-effective environment for annotation and curation projects, enabling you to harness a broadly distributed workforce and monitor progress & results remotely in real time.

It’s also very easy to use. A new project can be up and running in less than five minutes. (As far as we know, there is nothing else like it in this field.)

GATE Teamware delivers a multi-function user interface over the Internet for viewing, adding and editing text annotations. The web-based management interface allows for project set-up, tracking, and management:

  • Loading document collections (a “corpus” or “corpora”)
  • Creating re-usable project templates
  • Initiating projects based on templates
  • Assigning project roles to specific users
  • Monitoring progress and various project statistics in real time
  • Reporting of project status, annotator activity and statistics
  • Applying GATE-based processing routines (automatic annotations or post-annotation processing)

I have known about the GATE project in general for years and came to this site after reading: Crowdsourced Legal Case Annotation.

Could be the basis for annotations that are converted into a topic map, but… I have been a sysadmin before. Maintaining servers, websites, software, etc., is great work, interesting work, but not what I want to be doing now.

Then I read:

Where to get it? The easiest way to get started is to buy a ready-to-run Teamware virtual server from GATECloud.net.

Not saying it will or won’t meet your particular needs, but it certainly is worth a “look see.”

Let me know if you take the plunge!

April 27, 2012

Parallel Language Corpus Hunting?

Filed under: Corpora,EU,Language,Linguistics — Patrick Durusau @ 6:11 pm

Parallel language corpus hunters, particularly in legal informatics, can rejoice!

[A] parallel corpus of all European Union legislation, called the Acquis Communautaire, translated into all 22 languages of the EU nations — has been expanded to include EU legislation from 2004-2010…

If you think semantic impedance in one language is tough, step up and try that across twenty-two (22) languages.

Of course, these countries share something of a common historical context. Imagine the gulf when you move up to languages from other historical contexts.

See: DGT-TM-2011, Parallel Corpus of All EU Legislation in Translation, Expanded to Include Data from 2004-2010 for links and other details.

March 17, 2012

The Anachronism Machine: The Language of Downton Abbey

Filed under: Language,Linguistics — Patrick Durusau @ 8:19 pm

The Anachronism Machine: The Language of Downton Abbey

David Smith writes:

I’ve recently become hooked on the TV series Downton Abbey. I’m not usually one for costume dramas, but the mix of fine acting, the intriguing social relationships, and the larger WW1-era story make for compelling viewing. (Also: Maggie Smith is a treasure.)

Despite the widespread critical acclaim, Downton has met with criticism for some period-inappropriate uses of language. For example, at one point Lady Mary laments “losing the high ground”, a phrase that didn’t come into use until the 1960s. But is this just a random slip, or are such anachronistic phrases par for the course on Downton? And how does it compare to other period productions in its use of language?

To answer these questions, Ben Schmidt (a graduate student in history at Princeton University and Visiting Graduate Fellow at the Cultural Observatory at Harvard) created an “Anachronism Machine“. Using the R statistical programming language and Google n-grams, it analyzes all of the two-word phrases in a Downton Abbey script, and compares their frequency of use with that in books written around the WW1 era (when Downton is set). For example, Schmidt finds that Downton characters, if they were to follow societal norms of the 1910’s (as reflected in books from that period), would rarely use the phrase “guest bedroom”, but in fact it’s regularly uttered during the series. Schmidt charts the frequency these phrases appear in the show versus the frequency they appear in contemporaneous books below:

Good post on the use of R for linguistic analysis!

As a topic map person, I am more curious about what should be substituted for “guest bedroom” in a 1910s series. It would be interesting to have a mapping between the “normal” speech patterns of various time periods.
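For readers who want to try something similar at home, the core of the approach fits in a few lines: pull all two-word phrases out of a script and compare their rates against a period reference. Below is a minimal Python sketch under the assumption that you have the script and a 1910s reference corpus as plain-text files (the file names are placeholders; Schmidt's actual Anachronism Machine uses R and the Google n-grams).

```python
from collections import Counter
import re

def bigrams(path):
    """Count adjacent two-word phrases in a plain-text file."""
    words = re.findall(r"[a-z']+", open(path, encoding="utf-8").read().lower())
    return Counter(zip(words, words[1:]))

script = bigrams("downton_script.txt")      # placeholder: episode transcript
reference = bigrams("reference_1910s.txt")  # placeholder: period reference corpus

script_total = sum(script.values())
ref_total = sum(reference.values())

def overuse(bg):
    """How over-represented a bigram is in the script relative to the reference
    (add-one smoothing keeps unseen reference bigrams from dividing by zero)."""
    return (script[bg] / script_total) / ((reference[bg] + 1) / ref_total)

for bg in sorted(script, key=overuse, reverse=True)[:20]:
    print(" ".join(bg), script[bg], reference[bg])
```

Bigrams that score highest are the “guest bedroom” candidates: phrases common in the script but rare in period prose.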

March 14, 2012

Segmenting Words and Sentences

Filed under: Linguistics,Segmentation — Patrick Durusau @ 7:36 pm

Segmenting Words and Sentences by Richard Marsden.

From the post:

Even simple NLP tasks such as tokenizing words and segmenting sentences can have their complexities. Punctuation characters could be used to segment sentences, but this requires the punctuation marks to be treated as separate tokens. This would result in abbreviations being split into separate words and sentences.

This post uses a classification approach to create a parser that returns lists of sentences of tokenized words and punctuation.

Splitting text into words and sentences seems like it should be the simplest NLP task. It probably is, but there are still a number of potential problems. For example, a simple approach could use space characters to divide words. Punctuation (full stop, question mark, exclamation mark) could be used to divide sentences. This quickly runs into problems when an abbreviation is processed. “etc.” would be interpreted as a sentence terminator, and “U.N.E.S.C.O.” would be interpreted as six individual sentences, when both should be treated as single word tokens. How should hyphens be interpreted? What about speech marks and apostrophes?

A good introduction to segmentation but I would test the segmentation with a sample text before trusting it too far. Writing habits vary even within languages.
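To make the abbreviation problem concrete, here is a minimal sketch in plain Python (no libraries) contrasting naive punctuation splitting with a version that consults a small abbreviation list. Marsden's post trains a classifier rather than relying on a fixed list, but the failure it guards against is the same.

```python
import re

TEXT = "Dr. Smith visited the U.N. headquarters. He arrived at 10 a.m. and left early."
ABBREVIATIONS = {"dr.", "u.n.", "a.m.", "p.m.", "etc.", "e.g.", "i.e."}

# Naive approach: split after any sentence-final punctuation.
naive = re.split(r"(?<=[.!?])\s+", TEXT)
print(naive)  # over-segments: "Dr." and "a.m." are treated as sentence ends

# Slightly better: only end a sentence when the token is not a known abbreviation.
def segment(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(segment(TEXT))  # two sentences, abbreviations kept intact
```

Even the list-based version misses abbreviations it has never seen, which is exactly why a trained classifier (and testing against your own sample text) is worth the trouble.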

March 12, 2012

Cross Validation vs. Inter-Annotator Agreement

Filed under: Annotation,LingPipe,Linguistics — Patrick Durusau @ 8:05 pm

Cross Validation vs. Inter-Annotator Agreement by Bob Carpenter.

From the post:

Time, Negation, and Clinical Events

Mitzi’s been annotating clinical notes for time expressions, negations, and a couple other classes of clinically relevant phrases like diagnoses and treatments (I just can’t remember exactly which!). This is part of the project she’s working on with Noemie Elhadad, a professor in the Department of Biomedical Informatics at Columbia.

LingPipe Chunk Annotation GUI

Mitzi’s doing the phrase annotation with a LingPipe tool which can be found in

She even brought it up to date with the current release of LingPipe and generalized the layout for documents with subsections.

Lessons in the use of LingPipe tools!

If you are annotating texts or anticipate annotating texts, read this post.

March 11, 2012

Corpus-Wide Association Studies

Filed under: Corpora,Data Mining,Linguistics — Patrick Durusau @ 8:10 pm

Corpus-Wide Association Studies by Mark Liberman.

From the post:

I’ve spent the past couple of days at GURT 2012, and one of the interesting talks that I’ve heard was Julian Brooke and Sali Tagliamonte, “Hunting the linguistic variable: using computational techniques for data exploration and analysis”. Their abstract (all that’s available of the work so far) explains that:

The selection of an appropriate linguistic variable is typically the first step of a variationist analysis whose ultimate goal is to identify and explain social patterns. In this work, we invert the usual approach, starting with the sociolinguistic metadata associated with a large scale socially stratified corpus, and then testing the utility of computational tools for finding good variables to study. In particular, we use the ‘information gain’ metric included in data mining software to automatically filter a huge set of potential variables, and then apply our own corpus reader software to facilitate further human inspection. Finally, we subject a small set of particularly interesting features to a more traditional variationist analysis.

This type of data-mining for interesting patterns is likely to become a trend in sociolinguistics, as it is in other areas of the social and behavioral sciences, and so it’s worth giving some thought to potential problems as well as opportunities.

If you think about it, the social/behavioral sciences are being applied to the results of data mining of user behavior now. Perhaps you can “catch the wave” early on this cycle of research.

January 23, 2012

50 years of linguistics at MIT

Filed under: Linguistics — Patrick Durusau @ 7:44 pm

50 years of linguistics at MIT by Arnold Zwicky.

From the post:

The videos are now out — from the 50th-anniversary celebrations (“a scientific reunion”) of the linguistics program at MIT, December 9-11, 2011. The schedule of the talks (with links to slides for them) is available here, with links to other material: a list of attendees, a list of the many poster presentations, videos of the main presentations, personal essays by MIT alumni, photographs from the event, a list of MIT dissertations from 1965 to the present, and a 1974 history of linguistics at MIT (particularly interesting for the years before the first officially registered graduate students entered the program, in 1961).

The eleven YouTube videos (of the introduction and the main presentations) can be accessed directly here.

See Arnold’s post for the links.

January 20, 2012

The communicative function of ambiguity in language

Filed under: Ambiguity,Context,Corpus Linguistics,Linguistics — Patrick Durusau @ 9:20 pm

The communicative function of ambiguity in language by Steven T. Piantadosi, Harry Tily and Edward Gibson. (Cognition, 2011) (PDF file)

Abstract:

We present a general information-theoretic argument that all efficient communication systems will be ambiguous, assuming that context is informative about meaning. We also argue that ambiguity allows for greater ease of processing by permitting efficient linguistic units to be re-used. We test predictions of this theory in English, German, and Dutch. Our results and theoretical analysis suggest that ambiguity is a functional property of language that allows for greater communicative efficiency. This provides theoretical and empirical arguments against recent suggestions that core features of linguistic systems are not designed for communication.

This is a must-read paper if you are interested in ambiguity and similar issues.

At page 289, the authors report:

These findings suggest that ambiguity is not enough of a problem to real-world communication that speakers would make much effort to avoid it. This may well be because actual language in context provides other information that resolves the ambiguities most of the time.

I don’t know if our communication systems are efficient or not but I think the phrase “in context” is covering up a very important point.

Our communication systems came about in very high-bandwidth circumstances. We were in the immediate presence of a person speaking. With all the context that provides.

Even if we accept an origin of language of say 200,000 years ago, written language, which provides the basis for communication without the presence of another person, emerges only in the last five or six thousand years. Just to keep it simple, 5 thousand years would be 2.5% of the entire history of language.

So for 97.5% of the history of language, it has been used in a high bandwidth situation. No wonder it has yet to adapt to narrow bandwidth situations.

If writing puts us into a narrow bandwidth situation, ambiguity and all, where does that leave our computers?

January 4, 2012

Algorithm estimates who’s in control

Filed under: Data Analysis,Discourse,Linguistics,Social Graphs,Social Networks — Patrick Durusau @ 10:43 am

Algorithm estimates who’s in control

Jon Kleinberg, whose work influenced Google’s PageRank, is working on ranking something else. Kleinberg et al. developed an algorithm that ranks people based on how they speak to each other.

This, coming on the heels of Big Brother’s Name is…, has to have you wondering if you even want Internet access at all. 😉

Just imagine, power (who has, who doesn’t) analysis of email discussion lists, wiki edits, email archives, transcripts.

This has the potential (along with other clever analysis) to identify and populate topic maps with some very interesting subjects.

I first saw this at FlowingData.

December 31, 2011

News Cracking 1 : Meet the Editors

Filed under: Linguistics,News — Patrick Durusau @ 7:26 pm

News Cracking 1 : Meet the Editors

Matthew Hurst writes:

I posted recently about visualizing the relationships between editors and the countries appearing in articles they edit for news articles published by Reuters. I’ve since updated my experimental news aggregation site (now it is intended to eventually be more of a meta-news analysis site) to display only Reuters articles and to extract the names of contributors, including the editors. The overall list of editors is maintained (in the right column) and each editor is displayed with the number of articles observed for which they have attribution. Currently Cynthia Johnston and Tim Dobbyn are at the top of the list.

What do you think about Matthew’s plans for future tracking? Thoughts on how subject identities might/might not be helpful? Comment at Matthew’s blog.

I don’t know if CNN is still this way, because it has been a long time since I have watched it, but it used to repeat the same news over and over again every 24-hour cycle. It might be amusing to see how short a summary could be created for some declared 24-hour news cycle. I suppose the only problem would be that if CNN “knew” it was being watched, it would introduce artificial diversity into the newscast.

Still, I suppose one could capture the audio track and using voice recognition software collapse all the repetitive statements, excluding the commercials (or including commercials as well). Maybe I do need a cable TV connection in my home office. 😉

BUCC 2012: The Fifth Workshop on Building and Using Comparable Corpora

Filed under: Corpora,Linguistics — Patrick Durusau @ 7:23 pm

BUCC 2012: The Fifth Workshop on Building and Using Comparable Corpora (Special topic: Language Resources for Machine Translation in Less-Resourced Languages and Domains)

Dates:

DEADLINE FOR PAPERS: 15 February 2012
Workshop Saturday, 26 May 2012
Lütfi Kirdar Istanbul Exhibition and Congress Centre
Istanbul, Turkey

Some of the information is from the Call for papers. The main conference site does not (yet) have the call for papers posted. Suggest that you verify dates with the conference organizers before making travel arrangements.

From the call for papers:

In the language engineering and the linguistics communities, research in comparable corpora has been motivated by two main reasons. In language engineering, it is chiefly motivated by the need to use comparable corpora as training data for statistical NLP applications such as statistical machine translation or cross-lingual retrieval. In linguistics, on the other hand, comparable corpora are of interest in themselves by making possible inter-linguistic discoveries and comparisons. It is generally accepted in both communities that comparable corpora are documents in one or several languages that are comparable in content and form in various degrees and dimensions. We believe that the linguistic definitions and observations related to comparable corpora can improve methods to mine such corpora for applications of statistical NLP. As such, it is of great interest to bring together builders and users of such corpora.

The scarcity of parallel corpora has motivated research concerning the use of comparable corpora: pairs of monolingual corpora selected according to the same set of criteria, but in different languages or language varieties. Non-parallel yet comparable corpora overcome the two limitations of parallel corpora, since sources for original, monolingual texts are much more abundant than translated texts. However, because of their nature, mining translations in comparable corpora is much more challenging than in parallel corpora. What constitutes a good comparable corpus, for a given task or per se, also requires specific attention: while the definition of a parallel corpus is fairly straightforward, building a non-parallel corpus requires control over the selection of source texts in both languages.

Parallel corpora are a key resource as training data for statistical machine translation, and for building or extending bilingual lexicons and terminologies. However, beyond a few language pairs such as English-French or English-Chinese and a few contexts such as parliamentary debates or legal texts, they remain a scarce resource, despite the creation of automated methods to collect parallel corpora from the Web. To exemplify such issues in a practical setting, this year’s special focus will be on

Language Resources for Machine Translation in Less-Resourced Languages and Domains

with the aim of overcoming the shortage of parallel resources when building MT systems for less-resourced languages and domains, particularly by usage of comparable corpora for finding parallel data within and by reaching out for “hidden” parallel data. Lack of sufficient language resources for many language pairs and domains is currently one of the major obstacles in further advancement of machine translation.

Curious about the use of topic maps in the creation of comparable corpora? Seems like the use of language/domain scopes on linguistic data could result in easier construction of comparable corpora.

December 30, 2011

Sexual Accommodation

Filed under: Linguistics,Semantics — Patrick Durusau @ 6:12 pm

Sexual Accommodation by Mark Liberman.

From the post:

You’ve probably noticed that how people talk depends on who they’re talking with. And for 40 years or so, linguists and psychologists and sociologists have referred to this process as “speech accommodation” or “communication accommodation” — or, for short, just plain “accommodation”. This morning’s Breakfast Experiment™ explores a version of the speech accommodation effect as applied to groups rather than individuals — some ways that men and women talk differently in same-sex vs. mixed-sex conversations.

I got the idea of doing this a couple of days ago, as I was indexing some conversational transcripts in order to find material for an experiment on a completely different topic. The transcripts in question come from a large collection of telephone conversations known as the “Fisher English” corpus, collected at the LDC in 2003 and published in 2004 and 2005. These two publications together comprise 11,699 two-person conversations, involving a diverse collection of speakers. While the sample is not demographically balanced in a strict sense, there is a good representation of speakers from all over the United States, across a wide range of ages, educational levels, occupations, and so forth.

I mention this because if usage varies by gender, doesn’t it also stand to reason that usage (read identification of subjects) varies by position in an organization?

Anyone who has been in an IT position can attest that conversations inside the IT department use a completely different vocabulary than when addressing people outside the department. For one, the term “idiot” is probably not used with reference to the CEO outside of the IT department. 😉

Capturing the differences in vocabularies could be as useful as any result for an actual topic map, in terms of communication across levels of an organization.

Suggestions for text archives where that sort of difference could be investigated?
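While waiting on suggestions, here is one way the comparison could be run on any such archive: a smoothed log-odds ranking of which words are over-represented in one group of messages relative to another. A minimal sketch, with toy data standing in for real IT-internal versus company-wide email:

```python
from collections import Counter
import math, re

def counts(docs):
    """Word counts over a list of raw message strings."""
    c = Counter()
    for doc in docs:
        c.update(re.findall(r"[a-z']+", doc.lower()))
    return c

def distinctive(group_a, group_b, top=20):
    """Words most characteristic of group_a relative to group_b (smoothed log-odds)."""
    a, b = counts(group_a), counts(group_b)
    total_a, total_b = sum(a.values()), sum(b.values())
    score = {w: math.log((a[w] + 1) / total_a) - math.log((b[w] + 1) / total_b)
             for w in set(a) | set(b)}
    return sorted(score, key=score.get, reverse=True)[:top]

# Toy stand-ins for two subcorpora drawn from the same organization.
it_mail = ["the build is broken again", "another idiot ticket from upstairs"]
company_mail = ["quarterly results exceeded expectations", "synergy going forward"]
print(distinctive(it_mail, company_mail))
```

The ranked word lists are themselves candidate subjects for a topic map that mediates between the two vocabularies.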

December 19, 2011

Journal of Computing Science and Engineering

Filed under: Bioinformatics,Computer Science,Linguistics,Machine Learning,Record Linkage — Patrick Durusau @ 8:09 pm

Journal of Computing Science and Engineering

From the webpage:

Journal of Computing Science and Engineering (JCSE) is a peer-reviewed quarterly journal that publishes high-quality papers on all aspects of computing science and engineering. The primary objective of JCSE is to be an authoritative international forum for delivering both theoretical and innovative applied researches in the field. JCSE publishes original research contributions, surveys, and experimental studies with scientific advances.

The scope of JCSE covers all topics related to computing science and engineering, with a special emphasis on the following areas: embedded computing, ubiquitous computing, convergence computing, green computing, smart and intelligent computing, and human computing.

I got here from following a sponsor link at a bioinformatics conference.

Then just picking at random from the current issue I see:

A Fast Algorithm for Korean Text Extraction and Segmentation from Subway Signboard Images Utilizing Smartphone Sensors by Igor Milevskiy, Jin-Young Ha.

Abstract:

We present a fast algorithm for Korean text extraction and segmentation from subway signboards using smart phone sensors in order to minimize computational time and memory usage. The algorithm can be used as preprocessing steps for optical character recognition (OCR): binarization, text location, and segmentation. An image of a signboard captured by smart phone camera while holding smart phone by an arbitrary angle is rotated by the detected angle, as if the image was taken by holding a smart phone horizontally. Binarization is only performed once on the subset of connected components instead of the whole image area, resulting in a large reduction in computational time. Text location is guided by user’s marker-line placed over the region of interest in binarized image via smart phone touch screen. Then, text segmentation utilizes the data of connected components received in the binarization step, and cuts the string into individual images for designated characters. The resulting data could be used as OCR input, hence solving the most difficult part of OCR on text area included in natural scene images. The experimental results showed that the binarization algorithm of our method is 3.5 and 3.7 times faster than Niblack and Sauvola adaptive-thresholding algorithms, respectively. In addition, our method achieved better quality than other methods.

Secure Blocking + Secure Matching = Secure Record Linkage by Alexandros Karakasidis, Vassilios S. Verykios.

Abstract:

Performing approximate data matching has always been an intriguing problem for both industry and academia. This task becomes even more challenging when the requirement of data privacy rises. In this paper, we propose a novel technique to address the problem of efficient privacy-preserving approximate record linkage. The secure framework we propose consists of two basic components. First, we utilize a secure blocking component based on phonetic algorithms statistically enhanced to improve security. Second, we use a secure matching component where actual approximate matching is performed using a novel private approach of the Levenshtein Distance algorithm. Our goal is to combine the speed of private blocking with the increased accuracy of approximate secure matching.
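Setting the cryptographic machinery aside, the underlying blocking-plus-matching pattern is easy to sketch: group records by a cheap phonetic key and run an edit-distance comparison only within each block. A minimal, decidedly non-secure illustration of that idea (simplified Soundex-style key, toy records):

```python
from itertools import combinations

def phonetic_key(name):
    """Simplified Soundex-style code, used only as a cheap blocking key."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    out, prev = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

records = ["Catherine Jones", "Katherine Jones", "Kathy Johns", "Catherine Smith"]
blocks = {}
for r in records:
    blocks.setdefault(phonetic_key(r.split()[-1]), []).append(r)

# The expensive comparison runs only within a block, never across blocks.
for members in blocks.values():
    for a, b in combinations(members, 2):
        if levenshtein(a, b) <= 2:
            print("candidate match:", a, "|", b)
```

Karakasidis and Verykios's contribution is doing both steps without either party revealing its raw records, which the sketch above makes no attempt at.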

A Survey of Transfer and Multitask Learning in Bioinformatics by Qian Xu, Qiang Yang.

Abstract:

Machine learning and data mining have found many applications in biological domains, where we look to build predictive models based on labeled training data. However, in practice, high quality labeled data is scarce, and to label new data incurs high costs. Transfer and multitask learning offer an attractive alternative, by allowing useful knowledge to be extracted and transferred from data in auxiliary domains helps counter the lack of data problem in the target domain. In this article, we survey recent advances in transfer and multitask learning for bioinformatics applications. In particular, we survey several key bioinformatics application areas, including sequence classification, gene expression data analysis, biological network reconstruction and biomedical applications.

And the ones I didn’t list from the current issue are just as interesting and relevant to identity/mapping issues.

This journal is a good example of people who have deliberately reached further across disciplinary boundaries than most.

About the only excuse left for not doing so is the discomfort of being the newbie in a field not your own.

Is that a good enough reason to miss possible opportunities to make critical advances in your home field? (Only you can answer that for yourself. No one can answer it for you.)

November 28, 2011

Surrogate Learning

Filed under: Entity Resolution,Linguistics,Record Linkage — Patrick Durusau @ 7:04 pm

Surrogate Learning – From Feature Independence to Semi-Supervised Classification by Sriharsha Veeramachaneni and Ravi Kumar Kondadadi.

Abstract:

We consider the task of learning a classifier from the feature space X to the set of classes Y = {0, 1}, when the features can be partitioned into class-conditionally independent feature sets X1 and X2. We show that the class-conditional independence can be used to represent the original learning task in terms of 1) learning a classifier from X2 to X1 (in the sense of estimating the probability P(x1|x2)) and 2) learning the class-conditional distribution of the feature set X1. This fact can be exploited for semi-supervised learning because the former task can be accomplished purely from unlabeled samples. We present experimental evaluation of the idea in two real world applications.

The two “real world” applications are ones you are likely to encounter:

First:

Our problem consisted of merging each of ≈ 20,000 physician records, which we call the update database, to the record of the same physician in a master database of ≈ 10^6 records.

Our old friends record linkage and entity resolution. The solution depends upon a clever choice of features for application of the technique. (The thought occurs to me that a repository of data analysis snippets for particular techniques would be as valuable, if not more so, than the techniques themselves. Techniques come and go. Data analysis and the skills it requires goes on and on.)

Second:

Sentence classification is often a preprocessing step for event or relation extraction from text. One of the challenges posed by sentence classification is the diversity in the language for expressing the same event or relationship. We present a surrogate learning approach to generating paraphrases for expressing the merger-acquisition (MA) event between two organizations in financial news. Our goal is to find paraphrase sentences for the MA event from an unlabeled corpus of news articles, that might eventually be used to train a sentence classifier that discriminates between MA and non-MA sentences. (Emphasis added. This is one of the issues in the legal track at TREC.)

This test was against 700000 financial news records.

Both tests were quite successful.

Surrogate learning looks interesting for a range of NLP applications.

Template-Based Information Extraction without the Templates

Template-Based Information Extraction without the Templates by Nathanael Chambers and Dan Jurafsky.

Abstract:

Standard algorithms for template-based information extraction (IE) require predefined template schemas, and often labeled data, to learn to extract their slot fillers (e.g., an embassy is the Target of a Bombing template). This paper describes an approach to template-based IE that removes this requirement and performs extraction without knowing the template structure in advance. Our algorithm instead learns the template structure automatically from raw text, inducing template schemas as sets of linked events (e.g., bombings include detonate, set off, and destroy events) associated with semantic roles. We also solve the standard IE task, using the induced syntactic patterns to extract role fillers from specific documents. We evaluate on the MUC-4 terrorism dataset and show that we induce template structure very similar to hand-created gold structure, and we extract role fillers with an F1 score of .40, approaching the performance of algorithms that require full knowledge of the templates.

Can you say association?

Definitely points towards a pipeline approach to topic map authoring. To abuse the term, perhaps a “dashboard” that allows selection of data sources followed by the construction of workflows with preliminary analysis being displayed at “breakpoints” in the processing. No particular reason why stages have to be wired together other than tradition.

Just looking a little bit into the future, imagine that some entities weren’t being recognized at a high enough rate. So you shift that part of the data to several thousand human entity processors and take the average of their results, which is higher than what you were getting, and feed that back into the system. Could have knowledge workers who work full time but shift from job to job performing tasks too difficult to program effectively.

November 18, 2011

collocations in wikipedia – parts 2 and 3

Filed under: Collocation,Linguistics — Patrick Durusau @ 9:37 pm

Matthew Kelcey continues his series on collocations, although the title of part 3 doesn’t say as much.

collocations in wikipedia, part 2

In part 2 Matt discusses alternatives to “magic” frequency cut-offs for collocation analysis.

I rather like the idea of looking for alternatives to “it’s just that way” methodologies. Accepting traditional cut-offs, etc., may be the right thing to do in some cases, but only with experience and an understanding of the alternatives.

finding phrases with mutual information [collocations, part 3]

In part 3 Matt discusses taking collocations beyond just two terms that occur together and techniques for that analysis.

Matt is also posting todo thoughts for further investigation.

If you have the time and interest, drop by Matt’s blog to leave suggestions or comments.

(See collocations in wikipedia, part 1 for our coverage of the first post.)
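For anyone coming to the series cold, the part 3 measure is easy to try on a small corpus of your own. A minimal pointwise mutual information sketch in Python (the corpus file name is a placeholder, and the count floor is exactly the kind of "magic" cut-off part 2 questions):

```python
from collections import Counter
import math, re

text = open("corpus.txt", encoding="utf-8").read().lower()  # placeholder corpus file
words = re.findall(r"[a-z']+", text)

unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))
N = len(words)

def pmi(w1, w2):
    """Pointwise mutual information of an adjacent word pair."""
    return math.log2((bigrams[(w1, w2)] / N) /
                     ((unigrams[w1] / N) * (unigrams[w2] / N)))

# Rank candidate collocations; the count floor is a crude frequency cut-off.
candidates = [(pmi(w1, w2), w1, w2)
              for (w1, w2), count in bigrams.items() if count >= 5]
for score, w1, w2 in sorted(candidates, reverse=True)[:20]:
    print(f"{score:5.2f}  {w1} {w2}")
```

High-PMI pairs are word pairs that co-occur far more often than their individual frequencies would predict, which is the working definition of a collocation in these posts.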

November 14, 2011

Twitter POS Tagging with LingPipe and ARK Tweet Data

Filed under: LingPipe,Linguistics,POS,Tweets — Patrick Durusau @ 7:15 pm

Twitter POS Tagging with LingPipe and ARK Tweet Data by Bob Carpenter.

From the post:

We will train and test on anything that’s easy to parse. Up today is a basic English part-of-speech tagging for Twitter developed by Kevin Gimpel et al. (and when I say “et al.”, there are ten co-authors!) in Noah Smith’s group at Carnegie Mellon.

We will train and test on anything that’s easy to parse.

How’s that for a motto! 😉

Social media may be more important than I thought it was several years ago. It may just be the serialization in digital form of all the banter in bars, at block parties and around the water cooler. If that is true, then governments would be well advised to encourage and assist with access to social media, to give them an even chance of leaving ahead of the widow maker.

Think of mining Twitter data like the NSA and phone traffic, but you aren’t doing anything illegal.
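Bob's post does the training with LingPipe's taggers. As a rough, library-swapped sketch of the same exercise, here is what a simple NLTK backoff tagger over the ARK annotations might look like; the local file name and the token-TAB-tag, blank-line-between-tweets format are assumptions about the downloaded data, so adjust the reader to whatever you actually get.

```python
import nltk

def read_tagged(path):
    """Read tweets as lists of (token, tag) pairs; blank lines separate tweets."""
    sents, current = [], []
    for line in open(path, encoding="utf-8"):
        line = line.rstrip("\n")
        if not line:
            if current:
                sents.append(current)
            current = []
        else:
            token, tag = line.split("\t")[:2]
            current.append((token, tag))
    if current:
        sents.append(current)
    return sents

tweets = read_tagged("ark_tweets.conll")   # hypothetical local copy of the ARK data
cut = int(0.9 * len(tweets))
train, test = tweets[:cut], tweets[cut:]

unigram = nltk.UnigramTagger(train)
tagger = nltk.BigramTagger(train, backoff=unigram)
print("accuracy:", tagger.evaluate(test))
print(tagger.tag("ikr smh he asked fir yo last name".split()))
```

A lookup tagger like this will trail LingPipe's HMM on out-of-vocabulary tweet-speak, but it makes the train/test split and the evaluation loop visible in a dozen lines.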

October 12, 2011

Active learning: far from solved

Filed under: Active Learning,Linguistics — Patrick Durusau @ 4:38 pm

Active learning: far from solved

From the post:

As Daniel Hsu and John Langford pointed out recently, there has been a lot of recent progress in active learning. This is to the point where I might actually be tempted to suggest some of these algorithms to people to use in practice, for instance the one John has that learns faster than supervised learning because it’s very careful about what work it performs. That is, in particular, I might suggest that people try it out instead of the usual query-by-uncertainty (QBU) or query-by-committee (QBC). This post is a brief overview of what I understand of the state of the art in active learning (paragraphs 2 and 3) and then a discussion of why I think (a) researchers don’t tend to make much use of active learning and (b) why the problem is far from solved. (a will lead to b.)

This is a deeply interesting article that could give rise to mini and major projects. I particularly like his point about not throwing away training data. No, you have to read the post for yourself. It’s not that long.
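For comparison with the fancier methods, the query-by-uncertainty baseline the post mentions takes only a few lines. A minimal sketch with scikit-learn on synthetic data (labels are revealed from y to play the role of the human annotator):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy pool: two Gaussian blobs, ten labeled seed points, the rest "unlabeled".
X = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
labeled = list(rng.choice(len(X), 10, replace=False))
unlabeled = [i for i in range(len(X)) if i not in labeled]

for _ in range(20):
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    # Query-by-uncertainty: ask for the label of the pool item whose
    # predicted probability is closest to 0.5.
    probs = clf.predict_proba(X[unlabeled])[:, 1]
    query = unlabeled[int(np.argmin(np.abs(probs - 0.5)))]
    labeled.append(query)      # the "oracle" simply reveals y[query]
    unlabeled.remove(query)

print("accuracy on the full pool:", clf.score(X, y))
```

The post's point about not throwing away training data bites exactly here: the queried items are a biased sample, and naively reusing them for a different model or task is where the trouble starts.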

October 3, 2011

Scala Tutorial – Tuples, Lists, methods on Lists and Strings

Filed under: Computational Linguistics,Linguistics,Scala — Patrick Durusau @ 7:04 pm

Scala Tutorial – Tuples, Lists, methods on Lists and Strings

I mention this not only because it looks like a good Scala tutorial series but also because it is being developed in connection with a course on computational linguistics at UT Austin (sorry, University of Texas at Austin, USA).

The cross-over between computer programming and computational linguistics illustrates the artificial nature of the divisions we make between disciplines and professions.

September 25, 2011

Modeling Item Difficulty for Annotations of Multinomial Classifications

Filed under: Annotation,Classification,LingPipe,Linguistics — Patrick Durusau @ 7:49 pm

Modeling Item Difficulty for Annotations of Multinomial Classifications by Bob Carpenter

From the post:

We all know from annotating data that some items are harder to annotate than others. We know from the epidemiology literature that the same holds true for medical tests applied to subjects, e.g., some cancers are easier to find than others.

But how do we model item difficulty? I’ll review how I’ve done this before using an IRT-like regression, then move on to Paul Mineiro’s suggestion for flattening multinomials, then consider a generalization of both these approaches.

For your convenience, links for the “…tutorial for LREC with Massimo Poesio” can be found at: LREC 2010 Tutorial: Modeling Data Annotation.

September 19, 2011

Philosophy of Language, Logic, and Linguistics – The Very Basics

Filed under: Language,Linguistics,Logic — Patrick Durusau @ 7:53 pm

Philosophy of Language, Logic, and Linguistics – The Very Basics

Any starting reading list in this area is going to have its proponents and its detractors.

I would add to this list John Sowa’s “Knowledge representation : logical, philosophical, and computational foundations“. There is a website as well. Just be aware that John’s reading of Charles Peirce represents a unique view of the development of logic in the 20th century. Still an excellent bibliography of materials for reading. And, as always, you should read the original texts for yourself. You may reach different conclusions from those reported by others.

August 20, 2011

WordNet Data > 10.3 Billion Unique Values

Filed under: Dataset,Linguistics,WordNet — Patrick Durusau @ 8:08 pm

WordNet Data > 10.3 Billion Unique Values

Wanted to draw your attention to some WordNet data files.

From the readme.TXT file in the directory:

As of August 19, 2011 pairwise measures for all nouns using the path measure are available. This file is named WordNet-noun-noun-path-pairs.tar. It is approximately 120 GB compressed. In this file you will find 146,312 files, one for each noun sense. Each file consists of 146,313 lines, where each line (except the first) contains a WordNet noun sense and the similarity to the sense featured in that particular file. Doing the math here, you find that each .tar file contains about 21,000,000,000 pairwise similarity values. Note that these are symmetric (sim (A,B) = sim (B,A)) so you have around 10 billion unique values.

We are currently running wup, res, and lesk, but do not have an estimated date of availability yet.

BTW, on verb data:

These files were created with WordNet::Similarity version 2.05 using WordNet 3.0. They show all the pairwise verb-verb similarities found in WordNet according to the path, wup, lch, lin, res, and jcn measures. The path, wup, and lch are path-based, while res, lin, and jcn are based on information content.

As of March 15, 2011 pairwise measures for all verbs using the six measures above are available, each in their own .tar file. Each *.tar file is named as WordNet-verb-verb-MEASURE-pairs.tar, and is approx 2.0 – 2.4 GB compressed. In each of these .tar files you will find 25,047 files, one for each verb sense. Each file consists of 25,048 lines, where each line (except the first) contains a WordNet verb sense and the similarity to the sense featured in that particular file. Doing the math here, you find that each .tar file contains about 625,000,000 pairwise similarity values. Note that these are symmetric (sim (A,B) = sim (B,A)) so you have a bit more than 300 million unique values.
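If you only need a handful of these values rather than the full multi-billion-pair dump, the same path-based measures can be computed on demand. A minimal sketch using NLTK's WordNet interface (the files above come from the Perl WordNet::Similarity package; NLTK implements the same path and wup measures):

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

dog = wn.synset("dog.n.01")
cat = wn.synset("cat.n.01")
car = wn.synset("car.n.01")

# Path similarity: based on the shortest path between senses in the hypernym graph.
print("dog-cat path:", dog.path_similarity(cat))
print("dog-car path:", dog.path_similarity(car))

# Wu-Palmer (wup), one of the other measures mentioned in the readme.
print("dog-cat wup:", dog.wup_similarity(cat))
```

The precomputed files earn their keep when you need all pairs at once; for spot checks, on-demand computation is far cheaper than downloading 120 GB.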

July 3, 2011

SwiftRiver/Ushahidi

Filed under: Filters,Linguistics,Natural Language Processing,NoSQL,Python — Patrick Durusau @ 7:34 pm

SwiftRiver

From the Get Started page:

The mission of the SwiftRiver initiative is to democratize access to the tools used to make sense of data.

To achieve this goal we’ve taken two approaches, apps and APIs. Apps are user facing and should be tools that are easy to understand, deploy and use. APIs are machine facing and extract meta-context that other machines (apps) use to convey information to the end user.

SwiftRiver is an opensource platform that aims to allow users to do three things well: 1) structure unstructured data feeds, 2) filter and prioritize information conditionally and 3) add context to content. Doing these things well allows users to pull in real-time content from Twitter, SMS, Email or the Web and to make sense of data on the fly.

The Ushahidi logo at the top will take you to a common wiki for Ushahidi and SwiftRiver.

And the Ushahidi link in the text takes you to Ushahidi:

We are a non-profit tech company that develops free and open source software for information collection, visualization and interactive mapping.

Home of:

  • Ushahidi Platform: We built the Ushahidi platform as a tool to easily crowdsource information using multiple channels, including SMS, email, Twitter and the web.
  • SwiftRiver: SwiftRiver is an open source platform that aims to democratize access to tools for filtering & making sense of real-time information.
  • Crowdmap: When you need to get the Ushahidi platform up in 2 minutes to crowdsource information, Crowdmap will do it for you. It’s our hosted version of the Ushahidi platform.
It occurs to me that mapping email feeds would fit right into my example in Marketing What Users Want…And An Example.

June 29, 2011

Topic Modeling Sarah Palin’s Emails

Filed under: Latent Dirichlet Allocation (LDA),Linguistics — Patrick Durusau @ 9:05 am

Topic Modeling Sarah Palin’s Emails from Edwin Chen.

From the post:

LDA-based Email Browser

Earlier this month, several thousand emails from Sarah Palin’s time as governor of Alaska were released. The emails weren’t organized in any fashion, though, so to make them easier to browse, I did some topic modeling (in particular, using latent Dirichlet allocation) to separate the documents into different groups.

Interesting analysis and promise of more to follow.

With a US presidential election next year, there is little doubt there will be friendly as well as hostile floods of documents.

Time to sharpen your data extraction tools.
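If you want to run the same kind of analysis yourself, a minimal gensim sketch looks something like the following; load_palin_emails() is a hypothetical loader standing in for however you get the raw email text.

```python
from gensim import corpora, models
from gensim.utils import simple_preprocess

emails = load_palin_emails()   # hypothetical loader returning a list of raw email strings
docs = [simple_preprocess(e) for e in emails]

dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)   # drop very rare and very common terms
bow = [dictionary.doc2bow(d) for d in docs]

lda = models.LdaModel(bow, num_topics=20, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics(num_topics=5, num_words=8):
    print(topic_id, words)

# Group each email under its dominant topic, roughly what Chen's browser does.
for i, doc_bow in enumerate(bow[:10]):
    topic, weight = max(lda[doc_bow], key=lambda pair: pair[1], default=(None, 0.0))
    print(i, "-> topic", topic)
```

The number of topics and the vocabulary pruning thresholds are the knobs worth experimenting with; Chen's post discusses how he settled on his groupings.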

June 23, 2011

Linked Data in Linguistics March 7 – 9, 2012, Frankfurt/Main, Germany

Filed under: Conferences,Linguistics,Linked Data — Patrick Durusau @ 1:53 pm

Linked Data in Linguistics March 7 – 9, 2012, Frankfurt/Main, Germany

Important Dates:

August 7, 2011: Deadline for extended abstracts (four pages plus references)
September 9, 2011: Notification of acceptance
October 23, 2011: One-page abstract for DGfS conference proceedings
December 1, 2011: Camera-ready papers for workshop proceedings (eight pages plus references)
March 7-9, 2012: Workshop
March 6-9, 2012: Conference

From the website:

The explosion of information technology has led to a substantial growth in quantity, diversity and complexity of web-accessible linguistic data. These resources become even more useful when linked. This workshop will present principles, use cases, and best practices for using the linked data paradigm to represent, exploit, store, and connect different types of linguistic data collections.

Recent relevant developments include: (1) Language archives for language documentation, with audio, video, and text transcripts from hundreds of (endangered) languages (e.g. Dobes). (2) Typological databases with typological and geographical data about languages from all parts of the globe (e.g. WALS). (3) Development, distribution and application of lexical-semantic resources (LSRs) in NLP (e.g. WordNet). (4) Multi-layer annotations (e.g. ISO TC37/SC4) and semantic annotation of corpora (e.g. PropBank) by corpus linguists and computational linguists, often accompanied by the interlinking of corpora with LSRs (e.g. OntoNotes).

The general trend of providing data online is accompanied by newly developing possibilities to link linguistic data and metadata. This may include general data sources (e.g. DBpedia.org), but also repositories with specific linguistic information about languages (Multitree.org, LL-MAP, ISO 639-3), as well as about linguistic categories and phenomena (GOLD, ISOcat).

Originally noticed this from a tweet by Lutz Maicher.

June 12, 2011

A Few Subjects Go A Long Way

Filed under: Data Analysis,Language,Linguistics,Text Analytics — Patrick Durusau @ 4:11 pm

A post by Rich Cooper (Rich AT EnglishLogicKernel DOT com), Analyzing Patent Claims, demonstrates the power of small vocabularies (sets of subjects) for the analysis of patent claims.

It is a reminder that a topic map author need not identify every possible subject, but only so many of those as necessary. Other subjects abound and await other authors who wish to formally recognize them.

It is also a reminder that a topic map need only be as complex or as complete as necessary for a particular task. My topic map may not be useful for Mongolian herdsmen or even the local bank. But the test isn’t an abstract one but a practical one: does it meet the needs of its intended audience?
