Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 27, 2011

Lucene’s FuzzyQuery is 100 times faster in 4.0 (and a topic map tale)

Filed under: Authoring Topic Maps,Lucene,Topic Maps — Patrick Durusau @ 3:16 pm

Lucene’s FuzzyQuery is 100 times faster in 4.0

I first saw this post mentioned in a tweet by Lars Marius Garshol.

From the post:

There are many exciting improvements in Lucene’s eventual 4.0 (trunk) release, but the awesome speedup to FuzzyQuery really stands out, not only from its incredible gains but also because of the amazing behind-the-scenes story of how it all came to be.

FuzzyQuery matches terms “close” to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.

The QueryParser syntax is term~ or term~N, where N is the maximum allowed number of edits (for older releases N was a confusing float between 0.0 and 1.0, which translates to an equivalent max edit distance through a tricky formula).

FuzzyQuery is great for matching proper names: I can search for mcandless~1 and it will match mccandless (insert c), mcandles (remove s), mkandless (replace c with k) and a great many other “close” terms. With max edit distance 2 you can have up to 2 insertions, deletions or substitutions. The score for each match is based on the edit distance of that term; so an exact match is scored highest; edit distance 1, lower; etc.

Prior to 4.0, FuzzyQuery took the simple yet horribly costly brute force approach: it visits every single unique term in the index, computes the edit distance for it, and accepts the term (and its documents) if the edit distance is low enough.
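
Here is a minimal sketch of the 4.0-style API (my illustration, not from the post), assuming an existing index with a name field:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;

    public class FuzzySketch {
        // In 4.0 the second argument is the maximum edit distance (here 1),
        // replacing the older 0.0-1.0 similarity float.
        static TopDocs fuzzyByName(IndexSearcher searcher) throws Exception {
            FuzzyQuery query = new FuzzyQuery(new Term("name", "mcandless"), 1);
            return searcher.search(query, 10); // mccandless, mcandles, mkandless, ...
        }
    }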

The story is a good one and demonstrates the need for topic maps in computer science.

The authors used “Googling” to find an implementation by Jean-Philippe Barrette-LaPierre of an algorithm in a paper by Klaus Schulz and Stoyan Mihov that enabled this increase in performance.

That’s one way to do it, but it leaves it hit or miss whether other researchers will find the same implementation.

Moreover, once that connection has been made, associating the implementation with the algorithm/paper, it should be preserved for subsequent searchers.

The record should also point to the implementation of this algorithm in Lucene, to other implementations, or even to other accounts by the same authors, such as the 2004 publication of Fast Approximate Search in Large Dictionaries in Computational Linguistics.

Sounds like a topic map to me. The question is how to make ad hoc authoring of a topic map practical?

Suggestions?

March 8, 2011

Topic Maps: Less Garbage In, Less Garbage Out

Filed under: Authoring Topic Maps,Marketing,Topic Maps — Patrick Durusau @ 10:03 am

The latest hue and cry over changes to the Google search algorithm (search for “Google farmer update,” I don’t want to dignify any of it with a link) seems like a golden advertising opportunity for topic maps.

The slogan?

Topic Maps: Less Garbage In, Less Garbage Out

That is one of the value-adds of any curated data source, isn’t it?

Instead of, say, 200,000 “hits” post-Farmer update on some subject, what if a topic map offered 20?

Or 0.01% of the 200,000?

Of course, there are those who would rush forward to say that I might miss an important email or blog posting on subject X.

True, but if it were truly an important email or blog posting then a curator is likely to have picked it up. Yes?

The point of curation is to save users the time and effort of winnowing (wading?) through information garbage.

Here’s a topic map construction idea:

  1. Capture all the out-going search requests from your location.
  2. Throw away all the porn searches.
  3. Create a topic map of the useful answers to the remaining searches.
  4. Use filtering software to block access to search engines and/or redirect to the topic map.

Your staff is looking for answers to work-related questions, yes?

A curated resource, like a topic map, would save them time and effort in finding answers to those questions.

March 3, 2011

Baking a topic map (err, I mean bread)

Filed under: Authoring Topic Maps,Examples,Topic Maps — Patrick Durusau @ 1:49 pm

Benjamin Bock asked last week about how to topic map ingredients (and their measures) as well as the order of steps in a recipe.

I can’t give you a complete answer in one post (or even in several) but I can highlight some of the issues and possible solutions.

First, we need a recipe. I will be using the basic bread recipe, from the Artisan Bread in 5 Minutes a Day site, which lists the following ingredients:

  • 3 1/2 cups lukewarm water
  • 4 teaspoons active dry yeast
  • 4 teaspoons coarse salt
  • 7 1/4 cups (2 lb. 4 oz.; 1027.67 grams) unbleached all-purpose flour (measure using scoop and sweep method)

That’s right. Carol has been teaching me to cook and I really enjoy baking bread.

If it is a good day, call ahead and I am likely to have fresh bread out of the oven within minutes of your arrival.

Anyway, at first blush, this looks easy; after all, people have been passing recipes along for thousands of years.

Second look, not so easy.

First try at baking the topic map

The recipe itself has a name, Master Artisan Bread Recipe.

That looks like a promising place to start: we have a recipe, it has a name and, from what we read above, some ingredients.

We could simply create a topic for the recipe, record its name and include the ingredients as occurrences, of type ingredient.

After all, since we can search for strings across the topic map, it won’t be hard to find recipes with yeast, flour, etc., whatever ingredient we want.

And that would be a perfectly valid topic map.

Well, except that you or I may want to say something about the yeast, as a subject. Could be which brand to use, etc.

We could simply stuff that information into the occurrence, but topic maps have a better solution.

Second try at baking the topic map

Isn’t there a hint in the way we have passed recipes down for years about how we should represent them in a topic map?

That is, each ingredient, more or less, stands on its own. We can talk about each one and often measure them all out before starting.

What if we represented each ingredient as a subject, that is with a topic?

And we represent their relationships to the recipe, remember Master Artisan Bread Recipe?, with an ingredient_of association. (Stolen shamelessly from Sam Hunting’s chapter, How to Start Topic Mapping Right Away with the XTM Specification, in XML Topic Maps, ed. by Jack Park and Sam Hunting.)

Oh, err, one thing: how do I get from 3 1/2 cups lukewarm water to water as a subject in an ingredient_of association?

That wasn’t explained very well. 😉

Third try at baking the topic map

Err, hmmm, yes (stalling for time),

Well, let’s break the water subject out and see if we can establish some principles for a solution that works for the other ingredients.

The measurement, 3 1/2 cups, and the temperature, lukewarm, do not affect the subject identity of the water. The first establishes a particular, set-aside amount of water; the second defines a temperature for that set-aside portion.

At its core the problem is that we would prefer to talk about water as an ingredient and to not have to use 3 1/2 cups as part of its identity.

That is, how would your topic map look with an ingredient_of association between a recipe and 3 1/2 cups of water?

Would your 3 1/2 cups of water only merge with other 3 1/2 cups of water topics in other recipes?

That sounds like a bad plan.

Fourth try at baking the topic map

Let’s think about this for a moment.

We want ingredient as subject so we can say things about them. We also want to record the amount or some condition of an ingredient as part of the recipe.

One workaround, not necessarily a good one (discussion please!), would be to model the recipe-ingredient association as a three-role relationship:

  • recipe
  • ingredient
  • measure_condition

That breaks out the measurement or condition of the ingredient as a separate subject. It also dodges some fairly complicated issues with regard to measurement but those are probably not critical to a bread recipe anyway.
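
To make that concrete, here is a toy sketch in Java, using hypothetical stand-in classes of my own rather than a real topic map API:

    import java.util.List;

    // Toy stand-ins for topics, roles and associations, just to show the shape.
    record Topic(String id) {}
    record Role(Topic type, Topic player) {}
    record Association(Topic type, List<Role> roles) {}

    class RecipeSketch {
        public static void main(String[] args) {
            Topic recipe  = new Topic("master-artisan-bread-recipe");
            Topic water   = new Topic("water");               // the ingredient as its own subject
            Topic measure = new Topic("3-1/2-cups-lukewarm"); // amount/condition as a subject

            Association ingredientOf = new Association(new Topic("ingredient_of"),
                List.of(new Role(new Topic("recipe"), recipe),
                        new Role(new Topic("ingredient"), water),
                        new Role(new Topic("measure_condition"), measure)));
            System.out.println(ingredientOf);
        }
    }

The water topic is now free to merge with water in any other recipe, while the measurement stays a separate subject.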

Oh, sorry, did not mean to avoid answering Benjamin’s question about ordering steps in the recipe.

Did you know that when practicing my typing in grade school I duplicated my mother’s recipes and then discarded the originals?

I also left off the steps then. Had the amounts and ingredients, but no steps. 😉

She took it good-naturedly enough but declined my further help where the recipe box was concerned.

I promise I won’t repeat that error but I won’t reach the step question today.

Besides, I am interested to hear what you think about the recipe illustration so far.

I know I need to include syntax, but thought I would do that in the next post, before I get to the steps question.

March 1, 2011

cablegate.core 0.2.0-20110224

Filed under: Authoring Topic Maps,Examples,Topic Map Software,Topic Maps — Patrick Durusau @ 10:48 am

Did not mean to miss the updated release of cablegate.core yesterday.

Download a copy and post your comments/suggestions.

Better yet, contribute your analysis via topic maps that can be merged with other topic maps.

Use topic maps to make cablegate more than a titillating annoyance.

The thought occurs to me that with all the unrest in Libya, there could be a fresh crop of diplomatic cables about to become available.

And why not? It would be a nice window into the recent history in the region.

Would that endanger some actors?

Well, you know what they say about playing in the street.

And, they weren’t acting in anyone’s interest but their own, so I would not lose any sleep over it.

February 28, 2011

YouTube Topic Map?

Filed under: Authoring Topic Maps,Data Mining,Topic Maps — Patrick Durusau @ 10:55 am

Is anyone working or thinking about working on a topic map for YouTube?

I ask because while I can eventually find search terms that will narrow the videos down to a set of lectures, they are disorderly and have duplicates.

If someone is working on a project that would include CS lectures and similar offerings, I would be willing to contribute some editing/sorting of data.

Probably not the most popular subject for a community based topic map. 😉

I might be willing to contribute some editing/sorting of data for more popular topic maps as well. Depends on the topic. (sorry!)

Suggestions (with a link to a representative YouTube video) welcome!

You can even conceal your identity! I won’t out you for liking the sneezing panda video.

R Fundamentals and Programming Techniques

Filed under: Authoring Topic Maps,Data Mining,R — Patrick Durusau @ 8:33 am

R Fundamentals and Programming Techniques

Thomas Lumley on R.

One of the strengths and weaknesses of the topic map standardization effort was that it presumed you already had a topic map.

A strength because the methods for arriving at a topic map remain unbounded and unsullied by choices (and limitations) of languages, approaches, etc.

A weakness because the topic map novice is left in the position of a tourist who marvels at a medieval cathedral but has no idea how to build one themselves. (Well, ok, perhaps that is a bit of a stretch. 😉 )

The fact remains that there are ever increasing amounts of data becoming available, much of it just crying out for topic maps to be built for its navigation.

R is one of the currently popular data mining languages that can be pressed into service for the exploration of data and construction of topic maps.

Definitely a resource to explore and exploit before you invest in any of the printed R reference materials.

February 21, 2011

Soylent: A Word Processor with a Crowd Inside

Filed under: Authoring Topic Maps,Crowd Sourcing,Interface Research/Design — Patrick Durusau @ 4:31 pm

Soylent: A Word Processor with a Crowd Inside

I know, I know, won’t even go there. As the librarians say: “Look it up!”

From the abstract:

This paper introduces architectural and interaction patterns for integrating crowdsourced human contributions directly into user interfaces. We focus on writing and editing, complex endeavors that span many levels of conceptual and pragmatic activity. Authoring tools offer help with pragmatics, but for higher-level help, writers commonly turn to other people. We thus present Soylent, a word processing interface that enables writers to call on Mechanical Turk workers to shorten, proofread, and otherwise edit parts of their documents on demand. To improve worker quality, we introduce the Find-Fix-Verify crowd programming pattern, which splits tasks into a series of generation and review stages. Evaluation studies demonstrate the feasibility of crowdsourced editing and investigate questions of reliability, cost, wait time, and work time for edits.

When I first started reading the article, it seemed obvious to me that the Human Macro option could be useful for topic map authoring. At least if the tasks were sufficiently constrained.

I was startled to see that a 30% error rate for the “corrections” was considered a baseline, hence the necessity for correction/control mechanisms.

The authors acknowledge that the bottom line cost of out-sourcing may weigh against its use in commercial contexts.

Perhaps so, but I would run the same tests against published papers and books, to determine the error rate without an out-sourced correction loop.

I think the idea is basically sound, although for some topic maps it might be better to place qualification requirements on the outsourcing.

TF-IDF Weight Vectors With Lucene And Mahout

Filed under: Authoring Topic Maps,Lucene,Mahout — Patrick Durusau @ 6:43 am

How To Easily Build And Observe TF-IDF Weight Vectors With Lucene And Mahout

From the website:

You have a collection of text documents, and you want to build their TF-IDF weight vectors, probably before doing some clustering on the collection or other related tasks.

You would like to be able for instance to see what are the tokens with the biggest TF-IDF weights in any given document of the collection.

Lucene and Mahout can help you to do that almost in a snap.

Why is this important for topic maps?

Wikipedia reports:

The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document’s relevance given a user query. (http://en.wikipedia.org/wiki/Tf-idf, cited in this posting)

Knowing the important terms in a document collection is one step towards a useful topic map. May not be definitive but it is a step in the right direction.
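
For the curious, the weight itself is easy to compute by hand. A minimal sketch of one common variant (raw term frequency times log inverse document frequency):

    import java.util.*;

    public class TfIdfSketch {
        // tf-idf(t, d) = tf(t, d) * log(N / df(t)), one common formulation.
        static Map<String, Double> tfIdf(List<List<String>> docs, int docIndex) {
            int n = docs.size();
            Map<String, Integer> df = new HashMap<>();
            for (List<String> doc : docs)
                for (String t : new HashSet<>(doc))   // count each document once per term
                    df.merge(t, 1, Integer::sum);

            Map<String, Double> weights = new HashMap<>();
            for (String t : docs.get(docIndex))
                weights.merge(t, 1.0, Double::sum);   // raw term frequency
            weights.replaceAll((t, tf) -> tf * Math.log((double) n / df.get(t)));
            return weights;
        }
    }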

February 15, 2011

Auto Completion

Filed under: Authoring Topic Maps,Redis — Patrick Durusau @ 1:54 pm

Auto-completion is a feature that I find useful in a number of applications.

I suspect users would find that to be the case for topic map authoring and navigation software.

One article to look at is: Auto Complete with Redis.

Which was cited by: Announcing Soulmate, A Redis-Backed Service For Fast Autocompleting

The second item being an application complete with an interface.

From the Soulmate announcement:

Inspired by Auto Complete with Redis, Soulmate uses sorted sets to build an index of partially completed words and the corresponding top matching items, and provides a simple sinatra app to query them.

Here’s a quick overview of what the initial version of Soulmate supports:

  • Provide suggestions for multiple types of items in a single query (at SeatGeek we’re autocompleting for performers, events, and venues)
  • Results are ordered by a user-specified score
  • Arbitrary metadata for each item (at SeatGeek we’re storing both a url and a subtitle)

I rather like the idea of arbitrary metadata.

Could be a utility that presents snippets to paste into a topic map?
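
The sorted-set trick from the Redis article is simple enough to show in miniature. Here is an in-memory sketch, with a TreeSet standing in for the Redis sorted set:

    import java.util.*;

    public class AutocompleteSketch {
        private final TreeSet<String> index = new TreeSet<>();

        public void add(String word) {
            for (int i = 1; i < word.length(); i++)
                index.add(word.substring(0, i));   // every proper prefix
            index.add(word + "*");                 // '*' marks a complete word
        }

        public List<String> complete(String prefix, int max) {
            List<String> out = new ArrayList<>();
            for (String entry : index.tailSet(prefix)) {
                if (!entry.startsWith(prefix)) break;   // past the prefix range
                if (entry.endsWith("*")) {
                    out.add(entry.substring(0, entry.length() - 1));
                    if (out.size() == max) break;
                }
            }
            return out;
        }
    }

Redis does the same range scan server-side, over a sorted set shared by every client.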

February 13, 2011

Apache Lucene 3.0 Tutorial

Filed under: Authoring Topic Maps,Indexing,Lucene — Patrick Durusau @ 1:34 pm

Apache Lucene 3.0 Tutorial by Bob Carpenter.

At 20 pages it isn’t your typical “Hello World” introduction. 😉

It should be the first document you hand a semi-technical person about Lucene.

Discovering the vocabulary of the documents/domain for which you are building a topic map is a critical first step.

Indexing documents gives you an important control over the accuracy and completeness of information you are given by domain “experts” and users.

There will be terms that are transparent to them and can only be clarified if you ask.
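
Here is a small sketch of that discovery step (mine, not from the tutorial): dump the vocabulary of an existing 3.x index, with document frequencies, to see what terms the domain actually uses:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.store.FSDirectory;

    public class VocabularySketch {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
            TermEnum terms = reader.terms();
            while (terms.next()) {
                Term t = terms.term();
                System.out.println(t.field() + ":" + t.text() + "\t" + terms.docFreq());
            }
            terms.close();
            reader.close();
        }
    }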

Text Analysis with LingPipe, Version 0.3

Filed under: Authoring Topic Maps,Indexing,LingPipe — Patrick Durusau @ 1:30 pm

Text Analysis with LingPipe 4 (Version 0.3) By Bob Carpenter.

On the importance of this book see: LingPipe Home.

February 10, 2011

The unreasonable effectiveness of simplicity

Filed under: Authoring Topic Maps,Crowd Sourcing,Data Analysis,Subject Identity — Patrick Durusau @ 1:50 pm

The unreasonable effectiveness of simplicity from Panos Ipeirotis suggests that simplicity should be considered in the construction of information resources.

The simplest aggregation technique: Use the majority vote as the correct answer.

I am mindful of the discussion several years ago about visual topic maps, which was a proposal to use images as identifiers. Certainly doable now, but the simplicity angle suggests an interesting possibility.

Would not work for highly abstract subjects, but what if users were presented with images when called upon to make identification choices for a topic map?

For example, when marking entities in a newspaper account, the user is presented with images near each marked entity and chooses yes/no.

Or in legal discovery or research, a similar mechanism, along with the ability to annotate any string with an image/marker and that image/marker appears with that string in the rest of the corpus.

Unknown to the user is further information about the subject they have identified that forms the basis for merging identifications, linking into associations, etc.

A must read!

February 9, 2011

First, you need to Get the Data – Post

Filed under: Authoring Topic Maps,Data Source — Patrick Durusau @ 5:01 am

First, you need to Get the Data is a post by Matthew Hurst about a site for asking questions about data sets (and getting answers).

A couple of the questions just to give you an idea about the site:

  • How can I compile a log of Wikipedia articles by date of creation?
  • Are there any indexes of available data sets?

There are useful answers to both of those questions.

Before starting off to build a data set, this is one site to check first.

A listing of sites to check for existing data sets would make a useful chapter in a book on topic maps.

February 4, 2011

TweetDeck and Topic Maps

Filed under: Authoring Topic Maps,Marketing — Patrick Durusau @ 4:51 am

If you don’t know TweetDeck.com you need to slide by to take a look.

As an admittedly slow and still uncertain adopter of all this social software, I would appreciate any feedback you have on this or other alternatives.

But, onto the topic map relevant part of this post!

I noticed that TweetDeck 0.37 has a feature: Hide repeated retweets.

I think they should go one better than that: scan tweets for the same shortened URL and offer an option to display what we would call a topic with multiple occurrences.

That is, there would be the one shortened URL, which you could follow if you like, with occurrences under that one tweet that list all the various tweets that contain it.

Would certainly shorten up my tweet windows in TweetDeck a good bit. Most of the repeats aren’t marked as retweets so the software isn’t catching them.

Now, if TweetDeck or equivalent software wanted to be really clever, they could make associations with the senders of those tweets so I could see a list of all the users who sent that resource.
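
Behind the interface the grouping itself is straightforward. A rough sketch, with a hypothetical Tweet type of my own invention:

    import java.util.*;
    import java.util.regex.*;

    public class TweetGroupSketch {
        record Tweet(String author, String text) {}

        private static final Pattern URL = Pattern.compile("https?://\\S+");

        // One "topic" per shortened URL; each matching tweet becomes an
        // "occurrence" and its author a candidate role player in an association.
        static Map<String, List<Tweet>> byUrl(List<Tweet> tweets) {
            Map<String, List<Tweet>> groups = new LinkedHashMap<>();
            for (Tweet t : tweets) {
                Matcher m = URL.matcher(t.text());
                while (m.find())
                    groups.computeIfAbsent(m.group(), k -> new ArrayList<>()).add(t);
            }
            return groups;
        }
    }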

*****
PS: This would be a case where TweetDeck need not offer the generic, in-your-face topic map interface but could offer some of the advantages of topic maps (de-duping content and gathering up all the authors of the same content).

Topic Map Competition

Filed under: Authoring Topic Maps,Interface Research/Design,Marketing,News,Topic Maps — Patrick Durusau @ 4:34 am

The idea of a topic map competition seems like a good one to me.

We need to demonstrate that topic map development isn’t like a trip to the ontological dentist or proctologist.

Just some random thoughts that hopefully can firm up in the near future.

Suggest starting off with two contests, with two different data sets.

24-Hour Topic Map

A 24-hour contest, with points, in part, for inclusion of participants in different time zones, to encourage the spread of topic maps around the globe.

Each team would be encouraged (required?) to keep a blog while developing the topic map so that the progress of the map, interaction with others, etc., could be documented.

Points to be awarded for participants in different time zones (up to 24 points), up to 25 points for extraction of subjects/creation of topic map structures, up to 25 points for the interface/delivery, and up to 26 points for generality of the scripts/software used in generating the map.

The greatest number of points being for generality of scripts/software so we can encourage others to try these techniques on their own data sets.

7-Day Topic Map

Not unlike the 24-Hour Topic Map (24HTM) contest except that with a much longer time period, the expectations for the results are much higher.

Points should still be awarded for participants in different time zones but should drop to 12 points, extraction/subject map structures should remain at up to 25 points, interfaces/delivery should go up to 31 points and scripts/software, up to 32 points.

Since the teams will be composed of multiple individuals, I suspect prizes are going to be limited to award certificates, listing on public websites as the winners, etc.

Any number of governments are mandating a transition to digital records (including XML) as though that will solve their access problems. For those seeking contracts, being recognized for work with a data set from a particular government could not hurt.

I suppose that may depend on whether the government views you as having permission to work with the data set. 😉

This is a very rough draft and needs a lot more details before being something practical.

PS: Should either one or both or some other variation of this suggestion prove popular, contests could be run on a monthly basis.

February 3, 2011

PyBrain: The Python Machine Learning Library

PyBrain: The Python Machine Learning Library

From the website:

PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms.

PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. In fact, we came up with the name first and later reverse-engineered this quite descriptive “Backronym”.

How is PyBrain different?

While there are a few machine learning libraries out there, PyBrain aims to be a very easy-to-use modular library that can be used by entry-level students but still offers the flexibility and algorithms for state-of-the-art research. We are constantly working on more and faster algorithms, developing new environments and improving usability.

What PyBrain can do

PyBrain, as its written-out name already suggests, contains algorithms for neural networks, for reinforcement learning (and the combination of the two), for unsupervised learning, and evolution. Since most of the current problems deal with continuous state and action spaces, function approximators (like neural networks) must be used to cope with the large dimensionality. Our library is built around neural networks in the kernel and all of the training methods accept a neural network as the to-be-trained instance. This makes PyBrain a powerful tool for real-life tasks.

Another tool kit to assist in the construction of topic maps.

And another likely contender for the Topic Map Competition!

MALLET: MAchine Learning for LanguagE Toolkit
Topic Map Competition (TMC) Contender?

MALLET: MAchine Learning for LanguagE Toolkit

From the website:

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

MALLET includes sophisticated tools for document classification: efficient routines for converting text to “features”, a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.

In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text. Algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields. These methods are implemented in an extensible system for finite state transducers.

Topic models are useful for analyzing large collections of unlabeled text. The MALLET topic modeling toolkit contains efficient, sampling-based implementations of Latent Dirichlet Allocation, Pachinko Allocation, and Hierarchical LDA.

Many of the algorithms in MALLET depend on numerical optimization. MALLET includes an efficient implementation of Limited Memory BFGS, among many other optimization methods.

In addition to sophisticated Machine Learning applications, MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system of “pipes”, which handle distinct tasks such as tokenizing strings, removing stopwords, and converting sequences into count vectors.

An add-on package to MALLET, called GRMM, contains support for inference in general graphical models, and training of CRFs with arbitrary graphical structure.

Another tool to assist in the authoring of a topic map from a large data set.

It would be interesting, but beyond the scope of the topic maps class, to organize a competition around several of the natural language processing packages.

To have a common data set, to be released on X date, with topic maps due, say, within 24 hours (there is a TV show with that in the title, or so I am told).

Will have to give that some thought.

Could be both interesting and entertaining.

February 2, 2011

CrowdFlower

Filed under: Authoring Topic Maps,Crowd Sourcing,Interface Research/Design — Patrick Durusau @ 9:16 am

CrowdFlower

From the website:

Like Cloud computing with People.

Computers can’t do every task. Luckily, we have people to help.

We provide instant access to an elastic labor force. And our statistical quality control technology yields results you can trust.

From CrowdFlower Gets Gamers to Do Real Work for Virtual Pay

Here’s how it works. CrowdFlower embeds tasks in online games like FarmVille, Restaurant City, It Girl, Happy Aquarium, Happy Pets, Happy Island and Pop Boom. This means that the estimated 80 million gamers — from teens to homemakers — who are hooked on FarmVille, Zynga’s popular virtual farming game on Facebook, can be transformed into a virtual workforce.

To get to the next level in FarmVille, for example, the gamer might need 600 XP (XP means “experience” in Farmville parlance). So the gamer might buy a bed and breakfast building for $60 in FarmVille cash, which would earn him 600 XP. But for many gamers, revenue — and XP — from crop harvesting comes too slowly.

To earn game money quickly, the gamer can click a tab on the FarmVille page that links to real-world tasks to be performed by crowdsourced workers. Once the task is successfully completed, the gamer gets his FarmVille cash and CrowdFlower is paid by the client. The latter pays in real money, usually with a 10 percent markup.

Like any number of crowd sourcing services, but I was struck by the notion of embedding tasks inside games for virtual payment.

Not the answer to all topic map authoring tasks but certainly worth thinking about.

Question: Does anyone have experience with creating topic maps by embedding tasks in online games?

January 28, 2011

NLP (Natural Language Processing) tools

Filed under: Authoring Topic Maps,Natural Language Processing,Topic Models — Patrick Durusau @ 7:50 am

Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources

From Stanford University.

It may not be every NLP resource, but it is the place to start if you are looking for a new tool.

This should give you an idea of the range of tools that could be applied to the Afghan War Diary, for example.

January 26, 2011

Palestine Papers

Filed under: Authoring Topic Maps,Examples,Topic Maps — Patrick Durusau @ 1:24 pm

Palestine Papers

Quite helpfully, Aljazeera has published a glossary for the Palestine Papers.

The Palestine Papers were intended as internal notes, and so they make heavy use of jargon, acronyms and abbreviations. We’ve compiled a list of the most frequently-used terms.

Acronym Definition
AMA Agreement on Movement and Access
API Arab Peace Initiative
BATNA Best alternative to a negotiated agreement
CBM Confidence-building measure
CEC Central Elections Committee
GOI Government of Israel
KSCP Kerem Shalom crossing point
LO Liaison office
MB Muslim Brotherhood
MF Multi-national force
MFA Israeli ministry of foreign affairs
NAD Negotiations affairs department
NSU Negotiation support unit
NUG National unity government
PA Palestinian Authority
PG Presidential Guard
PLC Palestinian Leadership Council
PS Permanent status
PSN Permanent status negotiations
RCP Rafah crossing point
RM Road Map
SPB State with provisional borders
SSR Security sector reform
SWG Security working group
TOR Terms of reference
WG Working group

People

Different documents use different abbreviations for key negotiators: Tzipi Livni, for example, is referred to as both TL and TZ. This list covers the most commonly-used abbreviations.

Acronym Person
AA Abu Ala’ (Ahmed Qureia)
AB Azem Bishara
AG Amos Gilad
AM Abu Mazen (Mahmoud Abbas)
ARY Gen. Abdel Razzaq Yahia
BM Ban Ki-moon
BO Barack Obama
CR Condoleezza Rice
DW David Welch
ES Ephraim Sneh
GS Gilad Shed
JS Javier Solana
KD Lt. Gen. Keith Dayton
KE Khaled el-Gindy
MD Mohammad Dahlan
MO Marc Otte
PP Lt. Gen. Pietro Pistolese
PR Col. Paul Rupp
PS Pablo Serrano
RD Rami Dajani
RN Gen. Raji Najami
SA Samih al-Abed
SE Saeb Erekat
SF Salam Fayyad
ST Shalom Tourgeman
TB Tal Becker
TL Tzipi Livni
UD Udi Dekel
YAR Yasser Abed Rabbo
YG Yossi Gal

I say helpfully, but a printed glossary isn’t as helpful as it could be.

For example, what if instead of a static glossary, additional information could be added for each person or organization?

Information that was mappable to either additional public or private data.

Watch this space for first steps on making the glossary more than just a glossary.

Afghan War Diary – 2004 – Maiana – Puzzlers

Filed under: Authoring Topic Maps,Examples,Topic Maps — Patrick Durusau @ 10:35 am

I was looking at the Afghan War Diary – 2004 at Maiana yesterday.

A couple of things puzzled me so I though I would mention them here.

Take a short look at the ontology for the diary.

I’ll wait.

OK, now follow the link for Index of Individuals.

Wait! Err, there wasn’t any category that I saw in the ontology for individuals.

Did you see one?

BTW, scroll down, way down, the listing of individuals. I am assuming that cities and diary entries are both individuals?

I suppose but it looks like an odd modeling choice.

When I think of individuals I think of, you know, people.

I haven’t looked closely but do the reports include the names of persons? That is what I would consider an individual.

Ah, you know what? Individuals = Topics. Someone renamed it.

But how useful is that?

Having every subject represented by a topic in a single index?

That is as unhelpful as a Google search result.

Particularly if your topic map is of any size.

Better to have indexes of commonly looked-for things, like geographic locations by name, organizations, etc.

BTW, I don’t think that USMC is of type Host Nation.

If USMC expands to United States Marine Corps then I suspect a type of military organization is probably more accurate.

I stopped looking at this point.

Please forward suggestions/corrections to the project.

January 24, 2011

Gprof2Dot

Filed under: Authoring Topic Maps,Examples,Graphs — Patrick Durusau @ 5:34 pm

Gprof2Dot

Convert profiling output to a dot graph.

This is very cool.

The resulting graph would make an excellent interface into further documentation or analysis powered by a topic map.

Such as other implementations of the same routine? (or improvements thereof?)

Sounds like same subject talk to me.

Ambiguity and Charity

Filed under: Authoring Topic Maps,Subject Identity,Topic Maps — Patrick Durusau @ 9:06 am

John McCarthy’s Notes on Formalizing Context says in Entering and Leaving Contexts:

Human natural language risks ambiguity by not always specifying such assumptions, relying on the hearer or reader to guess what contexts makes sense. The hearer employs a principle of charity and chooses an interpretation that assumes the speaker is making sense. In AI usage we probably don’t usually want computers to make assertions that depend on principles of charity for their interpretation.

Natural language statements, outside formal contexts, almost never specify their assumptions. And even when they attempt to specify assumptions, such as in formal contexts, it is always a partial specification.

Complete specification of context or assumptions isn’t possible. That would require recursive enumeration of all the information that forms a context and the context of that information and so on.

It really is a question of the degree of charity that is being practiced to resolve any potential ambiguity.

If AI chooses to avoid charity altogether, I think that says a lot about its chances for success.

Topic maps, on the other hand, can specify both the result of the charitable assumption, the subject recognized, as well as the charitable assumption itself, which could be (but will not necessarily be) expressed as scope.

For example, if I see the token who and I specify the scope as being rock-n-roll-bands, that avoids any potential ambiguity, at least from my perspective. I could be wrong, or it could have some other scope, but at least you know my charitable assumption.
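
In toy form (hypothetical classes of my own, not a real topic map engine), the assumption travels with the name:

    import java.util.*;

    // A name plus the scope in which it is a valid label for the subject:
    // the charitable assumption, made explicit.
    record ScopedName(String value, Set<String> scope) {}
    record Topic(String id, List<ScopedName> names) {}

    class ScopeSketch {
        public static void main(String[] args) {
            Topic band = new Topic("the-who",
                List.of(new ScopedName("who", Set.of("rock-n-roll-bands"))));
            System.out.println(band);
        }
    }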

What is particularly clever about topic maps is that other users can combine my charitable assumptions with their own as they merge topic maps together.

Think of it as stitching together a fabric of interpretation with a thread of charitable assumptions. A fabric that AI applications will never know.

January 17, 2011

Scraping for Journalism: A Guide for Collecting Data

Filed under: Authoring Topic Maps,Data Source — Patrick Durusau @ 3:24 pm

Scraping for Journalism: A Guide for Collecting Data by Dan Nguyen at ProPublica.

I know, it says Journalism in the title. So just substitute topic map wherever you see journalism. 😉

Scraping is a good way to collect data for topic maps or that other activity.

I saw the reference on FlowingData.com and thought I should pass it on.

January 16, 2011

Ontopia

Filed under: Authoring Topic Maps,Ontopia,Topic Map Software — Patrick Durusau @ 6:55 pm

I saw a tweet dated 2011-01-15 saying that Ontopia was alive.

Since Ontopia is a name known to anyone interested in topic maps for more than 30 minutes, I decided to take a look.

It is indeed the Ontopia software for topic maps.

It was disappointing that the homepage, despite being alive!, needs updating; for example, it still refers to last year’s TMRA conference.

All the additional resources listed are good ones, but the selection is somewhat limited.

One of my goals for 2011 is to develop a bibliography of topic map papers, presentations, etc.

Will have to see how the year goes.

Informer

Filed under: Authoring Topic Maps,Information Retrieval,Searching — Patrick Durusau @ 2:29 pm

The Informer is the newsletter of the BCS Information Retrieval Specialist Group (IRSG).

There is a single issue in 1994, although that is volume 3, which implies there were earlier issues.

A useful source of information on IR.

It would be more useful if there were an index.

Let’s turn that lack of an index into a topic map exercise:

  1. Select one issue of the Informer.
  2. Create a traditional index for that issue.
  3. Using one or more search engines, create a machine index for that issue.
  4. Create a topic map for that issue.

One purpose of the exercise is to give you a feel for the labor/benefit/delivery characteristics of each method.

January 15, 2011

How To Model Search Term Data To Classify User Intent & Match Query Expectations – Post

Filed under: Authoring Topic Maps,Data Mining,Interface Research/Design,Search Data — Patrick Durusau @ 5:49 pm

How To Model Search Term Data To Classify User Intent & Match Query Expectations by Mark Sprague, courtesy of Searchengineland.com, is an interesting piece on analysis of search data to extract user intent.

As interesting as that is, I think it could be used by topic map authors for a slightly different purpose.

What if we were to use search data to classify how users were seeking particular subjects?

That is, to mine search data for patterns of subject identification, which really isn’t all that different from deciding what product or what service to market to a user.

As a matter of fact, I suspect that many of the tools used by marketeers could be dual-purposed to develop subject identifications for non-marketing information systems.

Such as library catalogs or professional literature searches.

The latter are often pay-per-view, where maintaining high customer satisfaction means repeat business and word of mouth advertising.

I am sure there is already literature on this sort of mining of search data for subject identifications. If you have a pointer or two, please send them my way.

Regret The Error

Filed under: Authoring Topic Maps,Examples — Patrick Durusau @ 7:18 am

Regret the Error is both a website and book by Craig Silverman.

From the website:

Regret the Error reports on media corrections, retractions, apologies, clarifications and trends regarding accuracy and honesty in the press. It was launched in October 2004 by Craig Silverman, a freelance journalist and author based in Montreal.

Silverman’s free accuracy checklist is one that reporters (dare I say bloggers?) would do well to follow.

Silverman recommends printing and laminating the checklist so you can use it with a dry erase pen to check items off.

Better than not having a checklist at all, but that seems suboptimal to me.

For example, in a news operation with multiple reporters:

  • How would an editor discover that multiple reporters were relying on the same sources?
  • Or the same sources across multiple stories?
  • How would reporters avoid having to duplicate the effort of other reporters in verifying basic facts such as names, titles, urls, etc?
  • How would reporters build on the experts, resources, sources already located by other reporters?

Questions:

How would you:

  1. Convert Silverman’s checklist into a topic map?
  2. Associate a particular set of items with a story and record their being checked off by a reporter?
  3. What extensions or specifics would you add to the checklist?
  4. What other mechanisms would you want in place for such a topic map? (Anonymity for sources comes to mind.)

January 11, 2011

Every Subject A Topic?

Filed under: Authoring Topic Maps,Graphs,Networks — Patrick Durusau @ 10:16 am

The obvious answer to the question Every Subject A Topic? is no, but I wanted to write up a specific use case I saw discussed today.

I was watching Understanding Graph Databases with Darren Wood, part of the NoSQL Tapes earlier today.

Wood mentioned that in intelligence work a node that has a lot of connections to other nodes, really isn’t that interesting.

For example, modeling telephone calls, that everyone calls the local pizza place isn’t all that interesting.

On the other hand, a node with few connections, especially a connection that bridges subgraphs, could be very interesting.

I thought about that in terms of modeling say campaign finances with a topic map.

I could have a topic that represents Democrats, one that represents Republicans and one for each of the other parties.

Plus create an association with each of those topics for each donation.

But that is noisy when you think about it from the perspective of the resulting graph.

Some options come to mind:

  1. Preserve the information but as part of each donation represented as a topic.
  2. Create a topic that is just the number of donations and the sum donated.
  3. A variant on #2 except by zip code, to enable a map coloring of donations by zip code.

Will have to think about different ways to create a topic map on the same data.

To establish a baseline for comparing modeling choices.
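
As a first cut at options #2 and #3, here is a toy sketch (hypothetical types of my own) that rolls donations up by party and zip code:

    import java.util.*;

    public class DonationSketch {
        record Donation(String party, String zip, double amount) {}

        // One aggregate per (party, zip): [0] = number of donations, [1] = sum donated.
        static Map<String, double[]> rollup(List<Donation> donations) {
            Map<String, double[]> byPartyZip = new HashMap<>();
            for (Donation d : donations) {
                double[] acc = byPartyZip.computeIfAbsent(
                    d.party() + "/" + d.zip(), k -> new double[2]);
                acc[0] += 1;
                acc[1] += d.amount();
            }
            return byPartyZip;
        }
    }

Each aggregate could then become a single topic, keeping the donation-by-donation detail out of the graph.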

Finishing up ODF edits this month but perhaps something in the February time frame.

January 9, 2011

Apache UIMA

Apache UIMA

From the website:

Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.

UIMA enables applications to be decomposed into components, for example “language identification” => “language specific segmentation” => “sentence boundary detection” => “entity detection (person/place names etc.)”. Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.

UIMA additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.

The UIMA project offers a number of annotators that produce structured information from unstructured texts.
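
For a feel of the framework, here is a minimal bootstrap sketch; annotator.xml is a hypothetical descriptor file and the calls are the standard UIMA bootstrap sequence:

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.util.XMLInputSource;

    public class UimaSketch {
        public static void main(String[] args) throws Exception {
            AnalysisEngineDescription desc = UIMAFramework.getXMLParser()
                .parseAnalysisEngineDescription(new XMLInputSource("annotator.xml"));
            AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(desc);

            CAS cas = ae.newCAS();
            cas.setDocumentText("Sample text for the annotators to process.");
            ae.process(cas);   // annotators add their annotations to the CAS
            ae.destroy();
        }
    }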

If you are using UIMA as a framework for development of topic maps, please post concerning your experiences with UIMA. What works, what doesn’t, etc.
