Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 17, 2012

Most developers don’t really know any computer language

Filed under: Language,Programming,Semantic Diversity — Patrick Durusau @ 1:39 pm

Most developers don’t really know any computer language by Derek Jones.

From the post:

What does it mean to know a language? I can count to ten in half a dozen human languages, say please and thank you, tell people I’m English and a few other phrases that will probably help me get by; I don’t think anybody would claim that I knew any of these languages.

It is my experience that most developers’ knowledge of the programming languages they use is essentially template based; they know how to write basic instances of the various language constructs such as loops, if-statements, assignments, etc., and how to define identifiers to have a small handful of properties, and they know a bit about how to glue these together.


[the topic map part]

discussions with developers: individuals and development groups invariably have their own terminology for programming language constructs (my use of terminology appearing in the language definition usually draws blank stares and I have to make a stab at guessing what the local terms mean and using them if I want to be listened to); asking about identifier scoping or type compatibility rules (assuming that either of the terms ‘scope’ or ‘type compatibility’ is understood) usually results in a vague description of specific instances (invariably the commonly encountered situations),

What?! Semantic diversity in computer languages? Or at least as they are understood by programmers?

😉

I don’t see the problem with appreciating semantic diversity for the richness it offers.

There are use cases where semantic diversity interferes with some other requirement. Such as in accounting systems that depend upon normalized data for auditing purposes.

There are other use cases, such as the history of ideas, that depend upon preservation of the trail of semantic diversity as part of the narrative of such histories.

And there are cases that fall in between, where the benefits of diverse points of view must be weighed against the cost of creating and maintaining a mapping between diverse viewpoints.

All of those use cases recognize that semantic diversity is the starting point. That is, semantic diversity is always with us and the real question is the cost of its control for some particular use case.

I don’t view “my software works if all users abandon semantic diversity” as a use case. It is a confession of defective software.

I first saw this in a tweet from Computer Science Fact.

October 25, 2012

Exploiting Parallelism and Scalability (XPS) (NSF)

Filed under: Language,Language Design,Parallelism,Scalability — Patrick Durusau @ 4:53 am

Exploiting Parallelism and Scalability (XPS) (NSF)

From the announcement:

Synopsis of Program:

Computing systems have undergone a fundamental transformation from the single-processor devices of the turn of the century to today’s ubiquitous and networked devices and warehouse-scale computing via the cloud. Parallelism has become ubiquitous at many levels. The proliferation of multi- and many-core processors, ever-increasing numbers of interconnected high performance and data intensive edge devices, and the data centers servicing them, is enabling a new set of global applications with large economic and social impact. At the same time, semiconductor technology is facing fundamental physical limits and single processor performance has plateaued. This means that the ability to achieve predictable performance improvements through improved processor technologies has ended.

The Exploiting Parallelism and Scalability (XPS) program aims to support groundbreaking research leading to a new era of parallel computing. XPS seeks research re-evaluating, and possibly re-designing, the traditional computer hardware and software stack for today’s heterogeneous parallel and distributed systems and exploring new holistic approaches to parallelism and scalability. Achieving the needed breakthroughs will require a collaborative effort among researchers representing all areas– from the application layer down to the micro-architecture– and will be built on new concepts and new foundational principles. New approaches to achieve scalable performance and usability need new abstract models and algorithms, programming models and languages, hardware architectures, compilers, operating systems and run-time systems, and exploit domain and application-specific knowledge. Research should also focus on energy- and communication-efficiency and on enabling the division of effort between edge devices and clouds.

Full proposals due: February 20, 2013 (by 5 p.m. proposer’s local time).

I see the next wave of parallelism and scalability being based on language and semantics, less so on more cores and better designs in silicon.

Not surprising since I work in languages and semantics every day.

Even so, consider that a go-cart that exceeds 160 miles per hour (260 km/h) remains a go-cart.

Go beyond building a faster go-cart.

Consider language and semantics when writing your proposal for this program.

October 23, 2012

Wyner et al.: An Empirical Approach to the Semantic Representation of Laws

Filed under: Language,Law,Legal Informatics,Machine Learning,Semantics — Patrick Durusau @ 10:37 am

Wyner et al.: An Empirical Approach to the Semantic Representation of Laws

Legal Informatics brings news of Dr. Adam Wyner’s paper, An Empirical Approach to the Semantic Representation of Laws, and quotes the abstract as:

To make legal texts machine processable, the texts may be represented as linked documents, semantically tagged text, or translated to formal representations that can be automatically reasoned with. The paper considers the latter, which is key to testing consistency of laws, drawing inferences, and providing explanations relative to input. To translate laws to a form that can be reasoned with by a computer, sentences must be parsed and formally represented. The paper presents the state-of-the-art in automatic translation of law to a machine readable formal representation, provides corpora, outlines some key problems, and proposes tasks to address the problems.

The paper originated at Project IMPACT.

If you haven’t looked at semantics and the law recently, this is a good opportunity to catch up.

I have only skimmed the paper and its references but am already looking for online access to early issues of Jurimetrics (a journal by the American Bar Association) that addressed such issues many years ago.

It should be fun to see what has changed and by how much, what issues remain, and how they are viewed today.

Linguistic Society of America (LSA)

Filed under: Language,Linguistics — Patrick Durusau @ 10:01 am

Linguistic Society of America (LSA)

The membership page says:

The Linguistic Society of America is the major professional society in the United States that is exclusively dedicated to the advancement of the scientific study of language. With nearly 4,000 members, the LSA speaks on behalf of the field of linguistics and also serves as an advocate for sound educational and political policies that affect not only professionals and students of language, but virtually all segments of society. Founded in 1924, the LSA has on many occasions made the case to governments, universities, foundations, and the public to support linguistic research and to see that our scientific discoveries are effectively applied.

Language and linguistics are important in the description of numeric data but even more so for non-numeric data.

Another avenue to sharpen your skills at both.

PS: I welcome your suggestions of other language/linguistic institutions and organizations. Even if our machines don’t understand natural language, our users do.

October 21, 2012

DEITY Launches Indian Search

Filed under: Language,Topic Maps,Use Cases — Patrick Durusau @ 3:42 pm

DEITY Launches Indian Search by Angela Guess.

From the post:

Tech2 reports, “The Department of Electronics and Information Technology (DEITY) unveiled Internet search engine, Sandhan, yesterday to assist users searching for tourism-related information across websites. Sandhan will provide search results to user queries in five Indian languages – Bengali, Hindi, Marathi, Tamil and Telugu.

[Which further quotes Tech2:] With this service, the government aims to plug the wide gap that exists ‘in fulfilling the information needs of Indians not conversant with English – estimated at 90 percent of the population’.

Let’s see: 1,220,200,000 (Wikipedia, Demographics of India, 2012 estimated population) x 90% (not conversant with English) = Potential consumer population of 1,098,180,000.

In case you are curious:

China, with 1,344,130,000 people (Demographics of China, 2012 estimated population), is reported to have two hundred and ninety-two (292) living languages.

The EU, with 503,500,000 people (Demographics of EU, 2012 estimated population), has twenty-three (23) official languages.

Wikipedia has two hundred and eighty-five (285) different language editions.

No shortage of need; the question is who has enough to gain to pay the cost of mapping?

October 5, 2012

Journal of Experimental Psychology: Applied

Filed under: Interface Research/Design,Language,Psychology — Patrick Durusau @ 2:16 pm

Journal of Experimental Psychology: Applied

From the website:

The mission of the Journal of Experimental Psychology: Applied® is to publish original empirical investigations in experimental psychology that bridge practically oriented problems and psychological theory.

The journal also publishes research aimed at developing and testing of models of cognitive processing or behavior in applied situations, including laboratory and field settings. Occasionally, review articles are considered for publication if they contribute significantly to important topics within applied experimental psychology.

Areas of interest include applications of perception, attention, memory, decision making, reasoning, information processing, problem solving, learning, and skill acquisition. Settings may be industrial (such as human–computer interface design), academic (such as intelligent computer-aided instruction), forensic (such as eyewitness memory), or consumer oriented (such as product instructions).

I browsed several recent issues of the Journal of Experimental Psychology: Applied while researching the Todd Rogers post. Fascinating stuff and some of it will find its way into interfaces or other more “practical” aspects of computer science.

Something to temper the focus on computer-facing work.

No computer has ever originated a purchase order or contract. Might not hurt to know something about the entities that do.

Dodging and Topic Maps: Can Run but Can’t Hide

Filed under: Debate,Language — Patrick Durusau @ 2:05 pm

We have all been angry during televised debates when the “other” candidate slips by difficult questions.

To the partisan viewer it looks like they are lying and the moderator is in cahoots with them. They never get called down for failing to answer the question.

How come?

Alix Spiegel had a great piece on NPR called: How Politicians Get Away With Dodging The Question that may point in the right direction.

Research by Todd Rogers (homepage) of the Harvard Kennedy School of Government demonstrates what is called a “pivot,” a point in an answer where the candidate starts to answer the question asked and then switches to something he or she wanted to say.

It is reported that pivots were used about 70% of the time in one set of presidential debates.

In a similar vein, see: The Art of the Dodge by Peter Saalfield in Harvard Magazine, March-April 2012. (Watch for the bad link to the Journal of Experimental Psychology. It should be the Journal of Experimental Psychology: Applied.)

Or the original work:

The artful dodger: Answering the wrong question the right way. Rogers, Todd; Norton, Michael I. Journal of Experimental Psychology: Applied, Vol 17(2), Jun 2011, 139-147. doi: 10.1037/a0023439

Abstract:

What happens when speakers try to “dodge” a question they would rather not answer by answering a different question? In 4 studies, we show that listeners can fail to detect dodges when speakers answer similar—but objectively incorrect—questions (the “artful dodge”), a detection failure that goes hand-in-hand with a failure to rate dodgers more negatively. We propose that dodges go undetected because listeners’ attention is not usually directed toward a goal of dodge detection (i.e., Is this person answering the question?) but rather toward a goal of social evaluation (i.e., Do I like this person?). Listeners were not blind to all dodge attempts, however. Dodge detection increased when listeners’ attention was diverted from social goals toward determining the relevance of the speaker’s answers (Study 1), when speakers answered a question egregiously dissimilar to the one asked (Study 2), and when listeners’ attention was directed to the question asked by keeping it visible during speakers’ answers (Study 4). We also examined the interpersonal consequences of dodge attempts: When listeners were guided to detect dodges, they rated speakers more negatively (Study 2), and listeners rated speakers who answered a similar question in a fluent manner more positively than speakers who answered the actual question but disfluently (Study 3). These results add to the literatures on both Gricean conversational norms and goal-directed attention. We discuss the practical implications of our findings in the contexts of interpersonal communication and public debates. (PsycINFO Database Record (c) 2012 APA, all rights reserved)

Imagine an instant replay system for debates where pivot points are identified and additional data is mapped in.

Candidates would still try to dodge, but perhaps less successfully.

August 11, 2012

OPLSS 2012

Filed under: Language,Language Design,Programming,Types — Patrick Durusau @ 3:42 pm

OPLSS 2012 by Robert Harper.

From the post:

The 2012 edition of the Oregon Programming Languages Summer School was another huge success, drawing a capacity crowd of 100 eager students anxious to learn the latest ideas from an impressive slate of speakers. This year, as last year, the focus was on languages, logic, and verification, from both a theoretical and a practical perspective. The students have a wide range of backgrounds, some already experts in many of the topics of the school, others with little or no prior experience with logic or semantics. Surprisingly, a large percentage (well more than half, perhaps as many as three quarters) had some experience using Coq, a large uptick from previous years. This seems to represent a generational shift—whereas such topics were even relatively recently seen as the province of a few zealots out in left field, nowadays students seem to accept the basic principles of functional programming, type theory, and verification as a given. It’s a victory for the field, and extremely gratifying for those of us who have been pressing forward with these ideas for decades despite resistance from the great unwashed. But it’s also bittersweet, because in some ways it’s more fun to be among the very few who have created the future long before it comes to pass. But such are the proceeds of success.

As if a post meriting your attention wasn’t enough, it concludes with:

Videos of the lectures, and course notes provided by the speakers, are all available at the OPLSS 12 web site.

Just a summary of what you will find:

  • Logical relations — Amal Ahmed
  • Category theory foundations — Steve Awodey
  • Proofs as Processes — Robert Constable
  • Polarization and focalization — Pierre-Louis Curien
  • Type theory foundations — Robert Harper
  • Monads and all that — John Hughes
  • Compiler verification — Xavier Leroy
  • Language-based security — Andrew Myers
  • Proof theory foundations — Frank Pfenning
  • Software foundations in Coq — Benjamin Pierce

Enjoy!

July 17, 2012

Scalia and Garner on legal interpretation

Filed under: Language,Law — Patrick Durusau @ 4:50 pm

Scalia and Garner on legal interpretation by Mark Liberman.

Mark writes:

Antonin Scalia and Bryan Garner have recently (June 19) published Reading Law: The Interpretation of Legal Texts, a 608-page work in which, according to the publisher’s blurb, “all the most important principles of constitutional, statutory, and contractual interpretation are systematically explained”.

The post is full of pointers to additional materials both on this publication and notions of legal interpretation more generally.

A glimpse of why I think texts are so complex.

BTW, for the record, I disagree with both Scalia and the post-9/11 Stanley Fish on discovering the “meaning” of texts or authors, respectively. We can report our interpretation of a text, but that isn’t the same thing.

An interpretation is a report that we may persuade others is useful for some purpose, agreeable with their prior beliefs, or even consistent with their world view. But for all of that, it remains always our report, nothing more.

The claim of “plain meaning” of words or the “intention” of an author (Scalia, Fish respectively) is an attempt to either avoid moral responsibility for a report or to privilege a report as being more than simply another report. Neither one is particularly honest or useful.

In a marketplace of reports, acknowledged to be reports, we can evaluate, investigate, debate and even choose from among reports.

Scalia and Fish would both advantage some reports over others, probably for different reasons. But whatever their reasons, fair or foul, I prefer to meet all reports on even ground.

July 12, 2012

grammar why ! matters

Filed under: Grammar,Language — Patrick Durusau @ 6:41 pm

grammar why ! matters

Bob Carpenter has a good rant on grammar.

The test I would urge everyone to use before buying software or even software services is to ask to see their documentation.

Give it to one of your technical experts and ask them to turn to any page and start reading.

If at any point your expert asks what was meant, thank the vendor for their time and show them the door.

It will save you time and expense in the long run to use only software with good documentation. (It would be nice to have software that doesn’t crash often too but I would not ask for the impossible.)

June 3, 2012

FreeLing 3.0 – An Open Source Suite of Language Analyzers

FreeLing 3.0 – An Open Source Suite of Language Analyzers

Features:

Main services offered by FreeLing library:

  • Text tokenization
  • Sentence splitting
  • Morphological analysis
  • Suffix treatment, retokenization of clitic pronouns
  • Flexible multiword recognition
  • Contraction splitting
  • Probabilistic prediction of unknown word categories
  • Named entity detection
  • Recognition of dates, numbers, ratios, currency, and physical magnitudes (speed, weight, temperature, density, etc.)
  • PoS tagging
  • Chart-based shallow parsing
  • Named entity classification
  • WordNet based sense annotation and disambiguation
  • Rule-based dependency parsing
  • Nominal coreference resolution

[Not all features are supported for all languages, see Supported Languages.]

TOC for the user manual.

Something for your topic map authoring toolkit!
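
If you want a quick feel for what a pipeline of this kind produces, here is a minimal sketch of the same generic stages (tokenization, sentence splitting, PoS tagging, named entity detection) in Python using NLTK. It illustrates the stages, not FreeLing’s own API, and the resource names below may vary by NLTK version.

# Sketch of the generic pipeline stages FreeLing offers, using NLTK instead.
import nltk

# One-time model downloads for the calls below (names vary by NLTK version).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("maxent_ne_chunker")
nltk.download("words")

text = "Topic maps merge subjects across vocabularies. They were standardized by ISO."

for sentence in nltk.sent_tokenize(text):      # sentence splitting
    tokens = nltk.word_tokenize(sentence)      # tokenization
    tagged = nltk.pos_tag(tokens)              # PoS tagging
    entities = nltk.ne_chunk(tagged)           # named entity detection
    print(tagged)
    print(entities)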

(Source: Jack Park)

May 11, 2012

Evaluating the Design of the R Language

Filed under: Language,Language Design,R — Patrick Durusau @ 3:33 pm

Evaluating the Design of the R Language

Sean McDirmid writes:

From our recent discussion on R, I thought this paper deserved its own post (ECOOP final version) by Floreal Morandat, Brandon Hill, Leo Osvald, and Jan Vitek; abstract:

R is a dynamic language for statistical computing that combines lazy functional features and object-oriented programming. This rather unlikely linguistic cocktail would probably never have been prepared by computer scientists, yet the language has become surprisingly popular. With millions of lines of R code available in repositories, we have an opportunity to evaluate the fundamental choices underlying the R language design. Using a combination of static and dynamic program analysis we assess the success of different language features.

Excerpts from the paper:

R comes equipped with a rather unlikely mix of features. In a nutshell, R is a dynamic language in the spirit of Scheme or JavaScript, but where the basic data type is the vector. It is functional in that functions are first-class values and arguments are passed by deep copy. Moreover, R uses lazy evaluation by default for all arguments, thus it has a pure functional core. Yet R does not optimize recursion, and instead encourages vectorized operations. Functions are lexically scoped and their local variables can be updated, allowing for an imperative programming style. R targets statistical computing, thus missing value support permeates all operations.

One of our discoveries while working out the semantics was how eager evaluation of promises turns out to be. The semantics captures this with C[]; the only cases where promises are not evaluated is in the arguments of a function call and when promises occur in a nested function body, all other references to promises are evaluated. In particular, it was surprising and unnecessary to force assignments as this hampers building infinite structures. Many basic functions that are lazy in Haskell, for example, are strict in R, including data type constructors. As for sharing, the semantics clearly demonstrates that R prevents sharing by performing copies at assignments.

The R implementation uses copy-on-write to reduce the number of copies. With superassignment, environments can be used as shared mutable data structures. The way assignment into vectors preserves the pass-by-value semantics is rather unusual and, from personal experience, it is unclear if programmers understand the feature. … It is noteworthy that objects are mutable within a function (since fields are attributes), but are copied when passed as an argument.
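
The pass-by-value point is easier to see by contrast with a language that shares its arguments. A minimal Python sketch (my illustration, not code from the paper) shows the sharing behavior that R’s copy-at-assignment semantics is designed to prevent:

# Python passes references to the same object, so mutation inside a function
# is visible to the caller. R, as described above, conceptually deep-copies
# arguments, so the caller's vector would be left unchanged.
import copy

def bump(v):
    v[0] = 99              # mutates the caller's list in place
    return v

x = [1, 2, 3]
bump(x)
print(x)                   # [99, 2, 3]: the caller sees the mutation

def bump_copy(v):
    v = copy.deepcopy(v)   # simulate R's copy semantics explicitly
    v[0] = 99
    return v

y = [1, 2, 3]
bump_copy(y)
print(y)                   # [1, 2, 3]: unchanged, as an R vector would be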

Perhaps not immediately applicable to a topic map task today but I would argue very relevant for topic maps in general.

In part because it is a reminder that when writing topic maps, topic map interfaces, or languages to be used with topic maps, we are fashioning languages. Languages that will, or perhaps will not, fit how our users view the world and how they tend to formulate queries or statements.

The test for an artificial language should be whether users have to stop to consider the correctness of their writing. Every pause is a sign that an error may be about to occur. Will they remember that this is an SVO language? Or is the terminology a familiar one?

Correcting the errors of others may “validate” your self-worth but is that what you want as the purpose of your language?

(I saw this at Christophe Lalanne’s blog.)

May 9, 2012

Ask a baboon

Filed under: Humor,Language — Patrick Durusau @ 4:04 pm

Ask a baboon

A post by Mark Liberman that begins with this quote:

Sindya N. Bhanoo, “Real Words or Gibberish? Just Ask a Baboon“, NYT 4/16/2012:

While baboons can’t read, they can tell the difference between real English words and nonsensical ones, a new study reports.

“They are using information about letters and the relation between letters to perform the task without any kind of linguistic training,” said Jonathan Grainger, a psychologist at the French Center for National Research and at Aix-Marseille University in France who was the study’s first author.

Mark finds a number of sad facts, some in the coverage of the story and the others in the story itself.

His analysis of the coverage and of the story proper is quite delightful.

Enjoy.

May 3, 2012

Hyperbolic lots

Filed under: Humor,Language,Machine Learning — Patrick Durusau @ 6:23 pm

Hyperbolic lots by Ben Zimmer.

From the post:

For the past couple of years, Google has provided automatic captioning for all YouTube videos, using a speech-recognition system similar to the one that creates transcriptions for Google Voice messages. It’s certainly a boon to the deaf and hearing-impaired. But as with Google’s other ventures in natural language processing (notably Google Translate), this is imperfect technology that is gradually becoming less imperfect over time. In the meantime, however, the imperfections can be quite entertaining.

I gave the auto-captioning an admittedly unfair challenge: the multilingual trailer that Michael Erard put together for his latest book, Babel No More: The Search for the World’s Most Extraordinary Language Learners. The trailer features a story from the book told by speakers of a variety of languages (including me), and Erard originally set it up as a contest to see who could identify the most languages. If you go to the original video on YouTube, you can enable the auto-captioning by clicking on the “CC” and selecting “Transcribe Audio” from the menu.

The transcription does a decent job with Erard’s English introduction, though I enjoyed the interpretation of “hyperpolyglots” — the subject of the book — as “hyperbolic lots.” Hyperpolyglot (evidently coined by Dick Hudson) isn’t a word you’ll find in any dictionary, and it’s not that frequent online, so it’s highly unlikely the speech-to-text system could have figured it out. But the real fun begins with the speakers of other languages.

You will find this amusing.

Ben notes the imperfections are becoming fewer.

Curious, since languages are living, social constructs, at what point do we measure the number of “imperfections”?

Or should I say: from whose perspective do we measure the number of “imperfections”?

Or should we use both of those measures and others?

April 27, 2012

Parallel Language Corpus Hunting?

Filed under: Corpora,EU,Language,Linguistics — Patrick Durusau @ 6:11 pm

Parallel language corpus hunters, particularly in legal informatics, can rejoice!

[A] parallel corpus of all European Union legislation, called the Acquis Communautaire, translated into all 22 languages of the EU nations — has been expanded to include EU legislation from 2004-2010…

If you think semantic impedance in one language is tough, step up and try that across twenty-two (22) languages.

Of course, these countries share something of a common historical context. Imagine the gulf when you move up to languages from other historical contexts.

See: DGT-TM-2011, Parallel Corpus of All EU Legislation in Translation, Expanded to Include Data from 2004-2010 for links and other details.

March 24, 2012

The Heterogeneous Programming Jungle

The Heterogeneous Programming Jungle by Michael Wolfe.

Michael starts off with one definition of “heterogeneous:”

The heterogeneous systems of interest to HPC use an attached coprocessor or accelerator that is optimized for certain types of computation. These devices typically exhibit internal parallelism, and execute asynchronously and concurrently with the host processor. Programming a heterogeneous system is then even more complex than “traditional” parallel programming (if any parallel programming can be called traditional), because in addition to the complexity of parallel programming on the attached device, the program must manage the concurrent activities between the host and device, and manage data locality between the host and device.

And while he returns to that definition in the end, another form of heterogeneity is lurking not far behind:

Given the similarities among system designs, one might think it should be obvious how to come up with a programming strategy that would preserve portability and performance across all these devices. What we want is a method that allows the application writer to write a program once, and let the compiler or runtime optimize for each target. Is that too much to ask?

Let me reflect momentarily on the two gold standards in this arena. The first is high level programming languages in general. After 50 years of programming using Algol, Pascal, Fortran, C, C++, Java, and many, many other languages, we tend to forget how wonderful and important it is that we can write a single program, compile it, run it, and get the same results on any number of different processors and operating systems.

So there is the heterogeneity of an attached coprocessor within a single system and, just as importantly, the heterogeneity across the many different processors and coprocessors a portable program may need to target.

His post concludes with:

Grab your Machete and Pith Helmet

If parallel programming is hard, heterogeneous programming is that hard, squared. Defining and building a productive, performance-portable heterogeneous programming system is hard. There are several current programming strategies that attempt to solve this problem, including OpenCL, Microsoft C++AMP, Google Renderscript, Intel’s proposed offload directives (see slide 24), and the recent OpenACC specification. We might also learn something from embedded system programming, which has had to deal with heterogeneous systems for many years. My next article will whack through the underbrush to expose each of these programming strategies in turn, presenting advantages and disadvantages relative to the goal.

These are languages that share common subjects (think of their target architectures) and so are ripe for a topic map that co-locates their approaches to a particular architecture. Being able to incorporate official and non-official documentation, tests, sample code, etc., might enable faster progress in this area.

The future of HPC processors is almost upon us. It will not do to be tardy.

March 17, 2012

The Anachronism Machine: The Language of Downton Abbey

Filed under: Language,Linguistics — Patrick Durusau @ 8:19 pm

The Anachronism Machine: The Language of Downton Abbey

David Smith writes:

I’ve recently become hooked on the TV series Downton Abbey. I’m not usually one for costume dramas, but the mix of fine acting, the intriguing social relationships, and the larger WW1-era story make for compelling viewing. (Also: Maggie Smith is a treasure.)

Despite the widespread critical acclaim, Downton has met with criticism for some period-inappropriate uses of language. For example, at one point Lady Mary laments “losing the high ground”, a phrase that didn’t come into use until the 1960s. But is this just a random slip, or are such anachronistic phrases par for the course on Downton? And how does it compare to other period productions in its use of language?

To answer these questions, Ben Schmidt (a graduate student in history at Princeton University and Visiting Graduate Fellow at the Cultural Observatory at Harvard) created an “Anachronism Machine“. Using the R statistical programming language and Google n-grams, it analyzes all of the two-word phrases in a Downton Abbey script, and compares their frequency of use with that in books written around the WW1 era (when Downton is set). For example, Schmidt finds that Downton characters, if they were to follow societal norms of the 1910’s (as reflected in books from that period), would rarely use the phrase “guest bedroom”, but in fact it’s regularly uttered during the series. Schmidt charts the frequency these phrases appear in the show versus the frequency they appear in contemporaneous books below:

Good post on the use of R for linguistic analysis!

As a topic map person, I am more curious about what should be substituted for “guest bedroom” in a 1910s series. It would be interesting to have a mapping between the “normal” speech patterns of various time periods.
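
Such a mapping is mostly bookkeeping over n-gram counts. Here is a rough Python sketch of the comparison Schmidt automates in R; the period frequencies are placeholders, not real Google n-gram data:

# Rough sketch of the Anachronism Machine idea: compare bigram frequencies in
# a script against bigram frequencies in period books. The reference numbers
# below are placeholders, not real Google n-gram data.
from collections import Counter
import re

def bigrams(text):
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))

script = "The guest bedroom is ready. I fear we have lost the high ground."
script_counts = bigrams(script)
script_total = sum(script_counts.values())

# Hypothetical per-million frequencies from a 1910s reference corpus.
period_freq_per_million = {
    ("guest", "bedroom"): 0.1,
    ("high", "ground"): 45.0,
}

for bigram, count in script_counts.items():
    script_per_million = count / script_total * 1_000_000
    period = period_freq_per_million.get(bigram, 0.0)
    if period == 0.0 or script_per_million / period > 100:
        print("possible anachronism:", " ".join(bigram))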

January 22, 2012

Draft (polysemy and ambiguity)

Filed under: Ambiguity,Language,Polysemy — Patrick Durusau @ 7:40 pm

Draft by Mark Liberman

From the post:

In a series of Language Log posts, Geoff Pullum has called attention to the prevalence of polysemy and ambiguity:

The people who think clarity involves lack of ambiguity, so we have to strive to eliminate all multiple meanings and should never let a word develop a new sense… they simply don’t get it about how language works, do they?

Languages love multiple meanings. They lust after them. They roll around in them like a dog in fresh grass.

The other day, as I was reading a discussion in our comments about whether English draftable does or doesn’t refer to the same concept as Finnish asevelvollisuus (“obligation to serve in the military”), I happened to be sitting in a current of uncomfortably cold air. So of course I wondered how the English word draft came to refer to military conscription as well as air flow. And a few seconds of thought brought to mind several other senses of the noun draft and its associated verb. I figured that this must represent a confusion of several originally separate words. But then I looked it up.

If you like language and have an appreciation for polysemy and ambiguity, you will enjoy this post a lot.

January 14, 2012

A Taxonomy of Ideas?

Filed under: Language,Taxonomy — Patrick Durusau @ 7:35 pm

A Taxonomy of Ideas?

David McCandless writes:

Recently, when throwing ideas around with people, I’ve noticed something. There seems to be a hidden language we use when evaluating ideas.

Neat idea. Brilliant idea. Dumb idea. Bad idea. Strange idea. Cool idea.

There’s something going on here. Each one of these ideas is subtly different in character. Each adjective somehow conveys the quality of the concept in a way we instantly and unconsciously understand.

Good point. There is always a “hidden language” that will be understood by members of a social group. But that also means that the “hidden language,” and its implications, will not be understood, at least not in the same way, by people in other groups.

That same “hidden language” shapes our choices of subjects out of a grab bag of subjects (a particular data set if not the world).

We can name some things that influence our choices, but it is always far from being all of them. Which means that no set of rules will always lead to the choices we would make. We are incapable of specifying the rules in the required degree of detail.

November 21, 2011

TextMinr

Filed under: Data Mining,Language,Text Analytics — Patrick Durusau @ 7:31 pm

TextMinr

In pre-beta (you can signal interest now), but:

Text Mining As A Service – Coming Soon!

What if you could incorporate state-of-the-art text mining, language processing & analytics into your apps and systems without having to learn the science or pay an arm and a leg for the software?

Soon you will be able to!

We aim to provide our text mining technology as a simple, affordable pay-as-you-go service, available through a web dashboard and a set of REST API’s.

If you are already familiar with these tools and your data sets, this could be a useful convenience.

If you aren’t familiar with these tools and your data sets, this could be a recipe for disaster.

Like SurveyMonkey.

In the hands of a survey construction expert, with testing of the questions, etc., I am sure SurveyMonkey can be a very useful tool.

In the hands of management, who want to justify decisions where surveys can be used, SurveyMonkey is positively dangerous.

Ask yourself this: Why in an age of SurveyMonkey, do politicians pay pollsters big bucks?

Do you suspect there is a difference between a professional pollster and SurveyMonkey?

The same distance lies between TextMinr and professional text analysis.

Or perhaps better, you get what you pay for.

October 23, 2011

Notation as a Tool of Thought – Iverson – Turing Lecture

Filed under: CS Lectures,Language,Language Design — Patrick Durusau @ 7:22 pm

Notation as a Tool of Thought by Kenneth E. Iverson – 1979 Turing Award Lecture

I saw this lecture tweeted with a link to a poor photocopy of a double column printing of the lecture.

I think you will find the single column version from the ACM awards site much easier to read.

Not to mention that the ACM awards site has all of the Turing lectures, as well as other award lectures, available for viewing.

I suspect that a CS class could be taught using only ACM award lectures as the primary material. Perhaps someone already has; I would appreciate a pointer if true.

October 22, 2011

Introducing fise, the Open Source RESTful Semantic Engine

Filed under: Entity Extraction,Entity Resolution,Language,Semantics,Taxonomy — Patrick Durusau @ 3:16 pm

Introducing fise, the Open Source RESTful Semantic Engine

From the post:

fise is now known as the Stanbol Enhancer component of the Apache Stanbol incubating project.

As a member of the IKS European project, Nuxeo contributes to the development of an Open Source software project named fise whose goal is to help bring new and trendy semantic features to CMS by giving developers a stack of reusable HTTP semantic services to build upon.

Presenting the software in Q/A form:

What is a Semantic Engine?

A semantic engine is a software component that extracts the meaning of an electronic document to organize it as partially structured knowledge and not just as a piece of unstructured text content.

Current semantic engines can typically:

  • categorize documents (is this document written in English, Spanish, Chinese? is this an article that should be filed under the  Business, Lifestyle, Technology categories? …);
  • suggest meaningful tags from a controlled taxonomy and assert their relative importance with respect to the text content of the document;
  • find related documents in the local database or on the web;
  • extract and recognize mentions of known entities such as famous people, organizations, places, books, movies, genes, … and link the document to their knowledge base entries (like a biography for a famous person);
  • detect yet unknown entities of the same aforementioned types to enrich the knowledge base;
  • extract knowledge assertions that are present in the text to fill up a knowledge base along with a reference to trace the origin of the assertion. Examples of such assertions could be the fact that a company is buying another along with the amount of the transaction, the release date of a movie, the new club of a football player…

During the last couple of years, many such engines have been made available through web-based API such as Open Calais, Zemanta and Evri just to name a few. However to our knowledge there aren't many such engines distributed under an Open Source license to be used offline, on your private IT infrastructure with your sensitive data.

Impressive work that I found through a later post on using this software on Wikipedia. See Mining Wikipedia with Hadoop and Pig for Natural Language Processing.
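
If you want to try the enhancer yourself, it exposes a plain HTTP endpoint. A minimal sketch in Python, assuming a local Stanbol instance on its default port (adjust the URL and the requested RDF format for your install):

# Minimal sketch: send plain text to a locally running Stanbol Enhancer and
# get back RDF enhancements (detected entities, topics, language, ...).
# Assumes Stanbol is running at its default local address; adjust as needed.
import requests

STANBOL_ENHANCER = "http://localhost:8080/enhancer"

text = "Paris is the capital of France."

response = requests.post(
    STANBOL_ENHANCER,
    data=text.encode("utf-8"),
    headers={
        "Content-Type": "text/plain",
        "Accept": "application/rdf+xml",  # other RDF serializations also work
    },
)
response.raise_for_status()
print(response.text)  # RDF describing the detected entities and annotations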

October 20, 2011

Tech Survivors: Geek Technologies…

Filed under: Language,Pattern Matching,Software — Patrick Durusau @ 6:33 pm

Tech Survivors: Geek Technologies That Still Survive 25 to 50 Years Later

Simply awesome!

Useful review for a couple of reasons:

First: New languages, formats, etc., will emerge but legacy systems “….will be with you always.” (Or at least it will feel that way, so being able to interface with legacy systems (understand their semantics) is going to be important for a very long time.)

Second: What was it about these technologies that made them succeed? (I don’t have the answer or I would be at the USPTO filing every patent and variant of patent that I could think of. 😉 It is clearly non-obvious because no one else is there either.) Equally long-lived technologies are with us today; we just don’t know which ones.

Would not hurt to put this on your calendar to review every year or so. The more you know about new technologies, the more likely you are to spot a resemblance to, or a pattern matching, one of these technologies. Maybe.

October 11, 2011

Free Programming Books

Filed under: Language,Programming,Recognition,Subject Identity — Patrick Durusau @ 5:54 pm

Free Programming Books

Despite the title (the author’s update only went so far), there are more than 50 books listed here.

I won’t tweet this because, like Lucene turning ten, everyone in the world has already tweeted or retweeted the news of these books.

I seem to be on a run of mostly programming resources today and I thought you might find the list interesting, possibly useful.

Especially those of you interested in pattern matching.

It occurs to me that programming languages and books about programming languages are fit fodder for the same tools used on other texts.

I am sure there probably exists an index with all the “hello, world” examples from various computer languages, but are there more/deeper similarities than the universal initial example?

There was a universe of programming languages prior to “hello, world” and there is certainly a very large one beyond those listed here but one has to start somewhere. So why not with this set?

I think the first question I would ask is the obvious one: Are there groupings of these works, other than the ones noted? What measures would you use and why? What results do you get?

I suppose before that you need to gather up the texts and do whatever cleanup/conversion is required; perhaps a short note on what you did there would be useful.
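
For the grouping question itself, a few lines of Python with scikit-learn will take you from cleaned-up text files to candidate clusters. A sketch only, assuming one plain-text file per book in a books/ directory; the interesting work is deciding whether the clusters mean anything:

# Sketch: group programming-language books by their vocabulary.
# Assumes the books have already been converted to plain-text files.
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

paths = sorted(Path("books").glob("*.txt"))            # one file per book
texts = [p.read_text(encoding="utf-8", errors="ignore") for p in paths]

vectorizer = TfidfVectorizer(stop_words="english", max_features=20000)
X = vectorizer.fit_transform(texts)                    # documents x terms

kmeans = KMeans(n_clusters=5, random_state=0, n_init=10)
labels = kmeans.fit_predict(X)

for path, label in zip(paths, labels):
    print(label, path.name)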

What parts of that preparation were done in anticipation of your methods for grouping the texts?

Patience, topic map fans: we are getting to the questions of subject recognition.

So, what subjects should we recognize across programming languages? Not tokens or even signatures but subjects. Signatures may be a way of identifying subjects, but can the same subjects have different signatures in distinct languages?

Would recognition of subjects across programming languages assist in learning languages?, in developing new languages (what is commonly needed)?, in studying the evolution of languages (where did we go right/wrong)?, in usefully indexing CS literature?, etc.

And you thought this was a post about “free” programming books. 😉

October 9, 2011

Execution in the Kingdom of Nouns

Filed under: Java,Language,Language Design — Patrick Durusau @ 6:43 pm

Execution in the Kingdom of Nouns

From the post:

They’ve a temper, some of them—particularly verbs: they’re the proudest—adjectives you can do anything with, but not verbs—however, I can manage the whole lot of them! Impenetrability! That’s what I say!
— Humpty Dumpty

Hello, world! Today we’re going to hear the story of Evil King Java and his quest for worldwide verb stamp-outage.1

Caution: This story does not have a happy ending. It is neither a story for the faint of heart nor for the critical of mouth. If you’re easily offended, or prone to being a disagreeable knave in blog comments, please stop reading now.

Before we begin the story, let’s get some conceptual gunk out of the way.

What I find compelling is the notion that a programming language should follow how we think, that is, how most of us think.

If you want a successful topic map, should it follow/mimic the thinking of:

  1. the author
  2. the client
  3. intended user base?

#1 is easy: that’s the default and requires the least work.

#2 is instinctive, but you will need to educate the client to #3.

#3 is golden if you can hit that mark.

September 25, 2011

Automatic transcription of 17th century English text in Contemporary English with NooJ: Method and Evaluation

Filed under: Language,Semantic Diversity,Vocabularies — Patrick Durusau @ 7:48 pm

Automatic transcription of 17th century English text in Contemporary English with NooJ: Method and Evaluation by Odile Piton (SAMM), Slim Mesfar (RIADI), and Hélène Pignot (SAMM).

Abstract:

Since 2006 we have undertaken to describe the differences between 17th century English and contemporary English thanks to NLP software. Studying a corpus spanning the whole century (tales of English travellers in the Ottoman Empire in the 17th century, Mary Astell’s essay A Serious Proposal to the Ladies and other literary texts) has enabled us to highlight various lexical, morphological or grammatical singularities. Thanks to the NooJ linguistic platform, we created dictionaries indexing the lexical variants and their transcription in CE. The latter is often the result of the validation of forms recognized dynamically by morphological graphs. We also built syntactical graphs aimed at transcribing certain archaic forms in contemporary English. Our previous research implied a succession of elementary steps alternating textual analysis and result validation. We managed to provide examples of transcriptions, but we have not created a global tool for automatic transcription. Therefore we need to focus on the results we have obtained so far, study the conditions for creating such a tool, and analyze possible difficulties. In this paper, we will be discussing the technical and linguistic aspects we have not yet covered in our previous work. We are using the results of previous research and proposing a transcription method for words or sequences identified as archaic.

Everyone working on search engines needs to print a copy of this article and read it at least once a month.

Seriously, the senses of both words and grammar evolve over centuries and even more quickly. What seem like correct search results from as recently as the 1950’s may be quite incorrect.

For example (I don’t have the episode reference, perhaps someone can supply it) there was an “I Love Lucy” episode where Lucy says on the phone to Ricky that some visitor (at home) is “making love to her,” which meant nothing more than sweet talk. Not sexual intercourse.

I leave it for your imagination how large the semantic gap may be between English texts and originals composed in another language, culture, and historical context, between 2,000 and 6,000 years ago. Flattening the complexities of ancient texts to bumper sticker snippets does a disservice to them and to ourselves.
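
The dictionary-of-variants idea in the paper is easy to mock up, even if doing it well (as the authors show) is not. A toy Python sketch, with made-up entries for illustration:

# Toy sketch of a variant dictionary mapping 17th-century spellings to
# contemporary English. The entries are illustrative, not from the paper.
import re

variants = {
    "hath": "has",
    "doth": "does",
    "thee": "you",
    "thou": "you",
    "sayeth": "says",
}

def transcribe(text):
    def replace(match):
        word = match.group(0)
        modern = variants.get(word.lower(), word)
        # Preserve capitalization of the original token.
        return modern.capitalize() if word[0].isupper() else modern
    return re.sub(r"[A-Za-z]+", replace, text)

print(transcribe("He hath said that thou art welcome."))
# -> "He has said that you art welcome." (a real system also needs the
#    morphological and syntactic graphs the paper describes)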

September 19, 2011

Philosophy of Language, Logic, and Linguistics – The Very Basics

Filed under: Language,Linguistics,Logic — Patrick Durusau @ 7:53 pm

Philosophy of Language, Logic, and Linguistics – The Very Basics

Any starting reading list in this area is going to have its proponents and its detractors.

I would add to this list John Sowa’s “Knowledge representation: logical, philosophical, and computational foundations”. There is a website as well. Just be aware that John’s views on Charles Peirce represent a unique view of the development of logic in the 20th century. Still an excellent bibliography of materials for reading. And, as always, you should read the original texts for yourself. You may reach different conclusions from those reported by others.

TCP Text Creation Partnership

Filed under: Concept Drift,Dataset,Language — Patrick Durusau @ 7:51 pm

TCP Text Creation Partnership

From the “mission” page:

The Text Creation Partnership’s primary objective is to produce standardized, digitally-encoded editions of early print books. This process involves a labor-intensive combination of manual keyboard entry (from digital images of the books’ original pages), the addition of digital markup (conforming to guidelines set by a text encoding standard-setting body known as the TEI), and editorial review.

The chief sources of the TCP’s digital images are database products marketed by commercial publishers. These include Proquest’s Early English Books Online (EEBO), Gale’s Eighteenth Century Collections Online (ECCO), and Readex’s Evans Early American Imprints. Idiosyncrasies in early modern typography make these collections very difficult to convert into searchable, machine-readable text using common scanning techniques (i.e., Optical Character Recognition). Through the TCP, commercial publishers and over 150 different libraries have come together to fund the conversion of these cultural heritage materials into enduring, digitally dynamic editions.

To date, the EEBO-TCP project has converted over 25,000 books. ECCO- and EVANS-TCP have converted another 7,000+ books. A second phase of EEBO-TCP production aims to convert the remaining 44,000 unique monograph titles in the EEBO corpus by 2015, and all of the TCP texts are scheduled to enter the public domain by 2020.

Several thousand titles from the 18th century collection are already available to the general public.

I mention this as a source of texts for testing search software against semantic drift. The sort of drift that occurs in any living language. To say nothing of the changing mores of our interpretation of languages with no native speakers remaining to defend them.

September 14, 2011

Don’t trust your instincts

Filed under: Data Analysis,Language,Recognition,Research Methods — Patrick Durusau @ 7:04 pm

I stumbled upon a review of: “The Secret Life of Pronouns: What Our Words Say About Us” by James W. Pennebaker in the New York Times Book Review, 28 August 2011.

Pennebaker is a word counter whose first rule is: “Don’t trust your instincts.”

Why? In part because our expectations shape our view of the data. (sound familiar?)

The review quotes the Drudge Report as posting a headline about President Obama that reads: “I ME MINE: Obama praises C.I.A. for bin Laden raid – while saying ‘I’ 35 Times.”

If the listener thinks President Obama is self-centered, the “I’s” have it as it were.

But, Pennebaker has used his programs to mindlessly count usage of words in press conferences since Truman. Obama is the lowest I-word user of modern presidents.

That is only one illustration of how badly we can “look” at text or data and get it seriously wrong.
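
The counting itself is mechanical. A quick Python sketch of the kind of tally Pennebaker’s programs perform (the transcript here is a placeholder):

# Sketch of a Pennebaker-style count: first-person-singular pronouns per
# 1,000 words of a transcript. The transcript below is a placeholder.
import re

FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

def i_word_rate(text):
    words = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for w in words if w in FIRST_PERSON)
    return 1000 * hits / len(words) if words else 0.0

transcript = "I want to thank the team. They did the work, and my role was small."
print(round(i_word_rate(transcript), 1))  # I-words per 1,000 words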

The Secret Life of Pronouns website has exercises to demonstrate how badly we get things wrong. (The videos are very entertaining.)

What does that mean for topic maps and authoring topic maps?

  1. Don’t trust your instincts. (courtesy of Pennebaker)
  2. View your data in different ways, ask unexpected questions.
  3. Ask people unfamiliar with your data how they view it.
  4. Read books on subjects you know nothing about. (Just general good advice.)
  5. Ask known unconventional people to question your data/subjects. (Like me! Sorry, consulting plug.)
