Archive for the ‘Authoring Topic Maps’ Category

Crowdsourcing Chemistry for the Community…

Friday, April 5th, 2013

Crowdsourcing Chemistry for the Community — 5 Years of Experiences by Antony Williams.

From the description:

ChemSpider is one of the internet’s primary resources for chemists. ChemSpider is a structure-centric platform and hosts over 26 million unique chemical entities sourced from over 400 different data sources and delivers information including commercial availability, associated publications, patents, analytical data, experimental and predicted properties. ChemSpider serves a rather unique role to the community in that any chemist has the ability to deposit, curate and annotate data. In this manner they can contribute their skills, and data, to any chemist using the system. A number of parallel projects have been developed from the initial platform including ChemSpider SyntheticPages, a community generated database of reaction syntheses, and the Learn Chemistry wiki, an educational wiki for secondary school students.

This presentation will provide an overview of the project in terms of our success in engaging scientists to contribute to crowdsourcing chemistry. We will also discuss some of our plans to encourage future participation and engagement in this and related projects.

Perhaps not encouraging in terms of the rate of participation but certainly encouraging in terms of the impact of those who do participate.

I suspect the ratio of contributors to users isn’t that far off from those observed in open source projects.

On the whole, I take this as a plus sign for crowd-sourced curation projects, including topic maps.

I first saw this in a tweet by ChemConnector.

A Newspaper Clipping Service with Cascading

Friday, April 5th, 2013

A Newspaper Clipping Service with Cascading by Sujit Pal.

From the post:

This post describes a possible implementation for an automated Newspaper Clipping Service. The end-user is a researcher (or team of researchers) in a particular discipline who registers an interest in a set of topics (or web-pages). An assistant (or team of assistants) then scour information sources to find more documents of interest to the researcher based on these topics identified. In this particular case, the information sources were limited to a set of “approved” newspapers, hence the name “Newspaper Clipping Service”. The goal is to replace the assistants with an automated system.

The solution I came up with was to analyze the original web pages and treat keywords extracted out of these pages as topics, then for each keyword, query a popular search engine and gather the top 10 results from each query. The search engine can be customized so the sites it looks at is restricted by the list of approved newspapers. Finally the URLs of the results are aggregated together, and only URLs which were returned by more than 1 keyword topic are given back to the user.

The entire flow can be thought of as a series of Hadoop Map-Reduce jobs, to first download, extract and count keywords from (web pages corresponding to) URLs, and then to extract and count search result URLs from the keywords. I’ve been wanting to play with Cascading for a while, and this seemed like a good candidate, so the solution is implemented with Cascading.
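The final aggregation step in that description is easy to picture in a few lines. What follows is a minimal sketch in plain Python, not the Cascading flow from the post: the function name, the `min_keywords` parameter and the example.com URLs are all made up for illustration.

```python
from collections import defaultdict


def aggregate_results(results_by_keyword, min_keywords=2):
    """Keep URLs that were returned for at least `min_keywords` distinct keywords."""
    keywords_by_url = defaultdict(set)
    for keyword, urls in results_by_keyword.items():
        for url in urls:
            keywords_by_url[url].add(keyword)
    kept = {url: kws for url, kws in keywords_by_url.items() if len(kws) >= min_keywords}
    # Most widely matched URLs first; ties broken alphabetically for stable output.
    return sorted(kept, key=lambda url: (-len(kept[url]), url))


# Hypothetical search results: keyword -> result URLs (illustrative data only).
sample = {
    "topic maps": ["http://example.com/a", "http://example.com/b"],
    "curation": ["http://example.com/b", "http://example.com/c"],
    "merging": ["http://example.com/b", "http://example.com/a"],
}

print(aggregate_results(sample))
```

Only the URLs matched by more than one keyword survive, which is the filter the post describes; everything else about the real pipeline (downloading, keyword extraction, the search engine) is left out.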

Hmmm, but an “automated system” leaves the user to sort, create associations, etc., for themselves.

Assistants with such a “clipping service” could curate the clippings by creating associations with other materials and adding non-obvious but useful connections.

Think of the front page of the New York Times as an interface to curated content behind the stories that appear on it.

Where “home” is the article on the front page.

Not only more prose but a web of connections to material you might not even know existed.

For example, in Beijing Flaunts Cross-Border Clout in Search for Drug Lord by Jane Perlez and Bree Feng (NYT) we learn that:

Under Lao norms, law enforcement activity is not done after dark, (Liu Yuejin, leader of the antinarcotics bureau of the Ministry of Public Security)

Could be important information, depending upon your reasons for being in Laos.

Directed Graph Editor

Thursday, April 4th, 2013

Directed Graph Editor

This is a live directed graph editor so you will need to follow the link.

The instructions:

Click in the open space to add a node, drag from one node to another to add an edge.
Ctrl-drag a node to move the graph layout.
Click a node or an edge to select it.

When a node is selected: R toggles reflexivity, Delete removes the node.
When an edge is selected: L(eft), R(ight), B(oth) change direction, Delete removes the edge.

To see this example as part of a larger project, check out Modal Logic Playground!

Just an example of what is possible with current web technology.

Add the ability to record properties, well, could be interesting.

One of the display issues with a graph representation of a topic map is the proliferation of links, which can make the display too “busy.”

What if edges only appeared when mousing over a node? Or you had the ability to toggle some class of edges on/off? Or types of nodes on/off?
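As a rough illustration of those two ideas (edges shown only for the node under the mouse, and classes of edges toggled off), here is a small sketch. It is plain Python rather than anything from the editor itself, and the `(source, target, type)` edge layout and the sample data are assumptions of mine.

```python
def visible_edges(edges, hidden_types=frozenset(), focus=None):
    """Return the edges to draw.

    edges: iterable of (source, target, edge_type) tuples (assumed layout).
    hidden_types: edge types the user has toggled off.
    focus: the node currently under the mouse, or None to ignore hovering.
    """
    shown = []
    for source, target, edge_type in edges:
        if edge_type in hidden_types:
            continue
        if focus is not None and focus not in (source, target):
            continue
        shown.append((source, target, edge_type))
    return shown


# Hypothetical associations rendered as typed edges.
edges = [
    ("puccini", "tosca", "composed"),
    ("puccini", "lucca", "born-in"),
    ("tosca", "rome", "set-in"),
]

print(visible_edges(edges, focus="tosca"))
print(visible_edges(edges, hidden_types={"born-in"}))
```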

Something to keep in mind.

I first saw this in a tweet by Carter Cole.

Requirements for an Authoring Tool for Topic Maps

Wednesday, April 3rd, 2013

I appreciated the recent comment that made it clear I was conflating several things under “authoring.”

One of those things was the conceptual design of a topic map, another was the transformation or importing of data into a topic map.

A third one was the authoring of a topic map in the sense of using an editor, much like a writer using a typewriter.

Not to denigrate the other two aspects of authoring, but I haven’t thought about them for as long as I have about authoring in the sense of writing a topic map.

Today I wanted to raise the issue of requirements for an authoring/writing tool for topic maps.

I appreciate the mention of Wandora, which is a very powerful topic map tool.

But Wandora has more features than a beginning topic map author will need.

An author could graduate to Wandora, but it makes a difficult starting place.

Here is my sketch of requirements for a topic map authoring/writing tool:

  • Text entry (Unicode)
  • Prompts/Guides for required/optional properties (subject identifier, subject locator or item identifier)
  • Prompts/Guides for required/optional components (think roles in an association)
  • Types (nice to have: constrained to existing topics)
  • Scope (nice to have: constrained to existing topics)
  • Separation of topics, associations, occurrences (TMDM as legend)
  • As little topic map lingo as possible
  • Pre-defined topics
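
To make the “prompts/guides” items a bit more concrete, here is a minimal sketch of the check an editor might run as you type. The dictionary layout, field names and prompt wording are hypothetical, not taken from any existing tool; it only assumes that a topic needs at least one of the three identity properties listed above and that types should point at existing topics.

```python
# Assumed field names for the three identity properties mentioned in the list.
IDENTITY_FIELDS = ("subject_identifiers", "subject_locators", "item_identifiers")


def prompts_for_topic(topic, known_topic_ids):
    """Return plain-language prompts for what a draft topic is still missing."""
    prompts = []
    if not any(topic.get(field) for field in IDENTITY_FIELDS):
        prompts.append("Add a subject identifier, subject locator or item identifier.")
    for type_id in topic.get("types", []):
        if type_id not in known_topic_ids:
            prompts.append(f"Type '{type_id}' is not an existing topic. Create it?")
    return prompts


# Hypothetical draft topic entered by an author.
draft = {"name": "Giacomo Puccini", "types": ["composer"]}
for prompt in prompts_for_topic(draft, known_topic_ids={"person", "opera"}):
    print(prompt)
```

Note that the prompts avoid topic map lingo where they can, in keeping with the “as little lingo as possible” requirement.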

What requirements am I missing for a topic map authoring tool that is more helpful than a text editor but less complicated than TeX?

BTW, as I wrote this, it occurred to me to ask: How did you learn to write HTML?

Topic Map Tool Chain

Tuesday, April 2nd, 2013

Belaboring the state of topic map tools won’t change this fact: It could use improvement.

Leaving the current state of topic map tools to one side, I have a suggestion about going forward.

What if we conceptualize topic map production as a tool chain?

A chain that can exist as separate components or with combinations of components.

Thinking like *nix tools, each one could be designed to do one task well.

The stages I see:

  1. Authoring
  2. Merging
  3. Conversion
  4. Query
  5. Display

The only odd looking stage is “conversion.”

By that I mean conversion from being held in a topic map data store or format to some other format for integration, query or display.

TaxMap, the oldest topic map on the WWW, is a conversion to HTML for delivery.

Converting a topic map into graph format enables the use of graph display or query mechanisms.
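
A toy version of two of those stages, merging and conversion, might look like the following. This is a sketch under heavy assumptions: topics are plain dictionaries, merging is reduced to “share a subject identifier,” associations are binary, and all of the names are invented for illustration.

```python
def merge_topics(topics):
    """Merge topics that share any subject identifier (a simplified merging rule)."""
    merged = []
    for topic in topics:
        ids = set(topic["subject_identifiers"])
        names = set(topic["names"])
        remaining = []
        for other in merged:
            if other["subject_identifiers"] & ids:
                # Absorb every already-merged topic that overlaps with this one.
                ids |= other["subject_identifiers"]
                names |= other["names"]
            else:
                remaining.append(other)
        remaining.append({"subject_identifiers": ids, "names": names})
        merged = remaining
    return merged


def to_edges(associations):
    """Convert binary associations into (source, target, label) graph edges."""
    return [(a["players"][0], a["players"][1], a["type"]) for a in associations]


# Hypothetical input: the third topic links the first two together.
topics = [
    {"subject_identifiers": {"http://example.com/a"}, "names": {"A"}},
    {"subject_identifiers": {"http://example.com/b"}, "names": {"B"}},
    {"subject_identifiers": {"http://example.com/a", "http://example.com/b"}, "names": {"AB"}},
]
associations = [{"type": "composed", "players": ("puccini", "tosca")}]

print(merge_topics(topics))
print(to_edges(associations))
```

Each function does one task and hands plain data to the next, which is the *nix-style separation suggested above.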

End-to-end solutions are possible but a tool chain perspective enables smaller projects with quicker returns.

Comments/Suggestions?

Drake [Data Processing Workflow]

Wednesday, March 27th, 2013

Drake

From the webpage:

Drake is a simple-to-use, extensible, text-based data workflow tool that organizes command execution around data and its dependencies. Data processing steps are defined along with their inputs and outputs and Drake automatically resolves their dependencies and calculates:

  • which commands to execute (based on file timestamps)
  • in what order to execute the commands (based on dependencies)

Drake is similar to GNU Make, but designed especially for data workflow management. It has HDFS support, allows multiple inputs and outputs, and includes a host of features designed to help you bring sanity to your otherwise chaotic data processing workflows.
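
The two calculations in that description (what to run, and in what order) can be sketched without Drake at all. The following is not Drake's syntax or implementation, just a make-style illustration in Python 3.9+; the `plan` function, the file names and the numeric timestamps are all made up.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan(steps, mtimes):
    """Decide which outputs to rebuild, in dependency order.

    steps:  {output_file: [input_files]} (assumed layout)
    mtimes: {file: modification_time}; a missing entry means "does not exist".
    """
    stale = set()
    to_run = []
    # static_order() yields every file after all of the files it depends on.
    for target in TopologicalSorter(steps).static_order():
        inputs = steps.get(target)
        if not inputs:  # a source file: nothing to execute
            continue
        out_time = mtimes.get(target)
        needs_run = (
            out_time is None
            or any(i in stale for i in inputs)
            or any(mtimes.get(i, float("inf")) > out_time for i in inputs)
        )
        if needs_run:
            stale.add(target)
            to_run.append(target)
    return to_run


# Hypothetical workflow: raw data -> cleaned data -> topic map.
steps = {"clean.csv": ["raw.csv"], "topics.ltm": ["clean.csv", "names.csv"]}
mtimes = {"raw.csv": 200, "clean.csv": 100, "names.csv": 50, "topics.ltm": 150}

print(plan(steps, mtimes))
```

Here `raw.csv` is newer than `clean.csv`, so the cleaning step reruns, and the topic map step follows because one of its inputs is about to change.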

The video demonstrating Drake is quite good.

Granting my opinion may be influenced by the use of awk in the early examples. 😉

Definitely a tool for scripted production of topic maps.

I first saw this in a tweet by Chris Diehl.

Implementing the RAKE Algorithm with NLTK

Monday, March 25th, 2013

Implementing the RAKE Algorithm with NLTK by Sujit Pal.

From the post:

The Rapid Automatic Keyword Extraction (RAKE) algorithm extracts keywords from text, by identifying runs of non-stopwords and then scoring these phrases across the document. It requires no training, the only input is a list of stop words for a given language, and a tokenizer that splits the text into sentences and sentences into words.

The RAKE algorithm is described in the book Text Mining Applications and Theory by Michael W Berry (free PDF). There is a (relatively) well-known Python implementation and somewhat less well-known Java implementation.

I started looking for something along these lines because I needed to parse a block of text before vectorizing it and using the resulting features as input to a predictive model. Vectorizing text is quite easy with Scikit-Learn as shown in its Text Processing Tutorial. What I was trying to do was to cut down the noise by extracting keywords from the input text and passing a concatenation of the keywords into the vectorizer. It didn’t improve results by much in my cross-validation tests, however, so I ended up not using it. But keyword extraction can have other uses, so I decided to explore it a bit more.

I had started off using the Python implementation directly from my application code (by importing it as a module). I soon noticed that it was doing a lot of extra work because it was implemented in pure Python. I was using NLTK anyway for other stuff in this application, so it made sense to convert it to also use NLTK so I could hand off some of the work to NLTK’s built-in functions. So here is another RAKE implementation, this time using Python and NLTK.

Reminds me of the “statistically insignificant phrases” at Amazon. Or was that “statistically improbable phrases?”

If you search on “statistically improbable phrases,” you get twenty (20) “hits” under books at Amazon.com.

Could be a handy tool to quickly extract candidates for topics in a topic map.
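
For a feel of how RAKE arrives at its candidates, here is a stripped-down sketch. It is plain Python with a tiny hand-picked stop word list and regular expressions standing in for a real tokenizer, so it is neither the NLTK version from the post nor the reference implementation. Word scores use the usual degree/frequency ratio, and a phrase scores the sum of its words.

```python
import re
from collections import defaultdict

# A deliberately tiny stop word list, for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "is", "to", "with", "by", "from"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into runs of non-stop words, breaking at punctuation."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake(text, stopwords=STOPWORDS):
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    word_score = {word: degree[word] / freq[word] for word in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


text = "Topic maps support merging of subject identity. Merging is the core of topic maps."
for phrase, score in rake(text):
    print(f"{score:5.1f}  {phrase}")
```

The top-scoring phrases are the ones you would review as candidate topics; long runs of non-stop words score highest, which is both the strength and the weakness of the method.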

Collaborating, Online with LaTeX?

Sunday, December 16th, 2012

I saw a tweet tonight that mentioned two online collaborative editors based on LaTeX:

writeLaTeX

and,

ShareLaTeX

I don’t have the time to look closely at them tonight but thought you would find them interesting.

If collaborative editing is possible for LaTeX, shouldn’t that also be possible for a topic map?

I saw this mentioned in a tweet by Jan-Piet Mens

Autocomplete Search with Redis

Sunday, December 9th, 2012

Autocomplete Search with Redis

From the post:

When we launched GetGlue HD, we built a faster and more powerful search to help users find the titles they were looking for when they want to check-in to their favorite shows and movies as they typed into the search box. To accomplish that, we used the in-memory data structures of the Redis data store to build an autocomplete search index.

Search Goals

The results we wanted to autocomplete for are a little different than the usual result types. The Auto complete with Redis writeup by antirez explores using the lexicographical ordering behavior of sorted sets to autocomplete for names. This is a great approach for things like usernames, where the prefix typed by the user is also the prefix of the returned results: typing mar could return Mara, Marabel, and Marceline. The deal-breaking limitation is that it will not return Teenagers From Mars, which is what we want our autocomplete to be able to do when searching for things like show and movie titles. To do that, we decided to roll our own autocomplete engine to fit our requirements. (Updated the link to the “Auto complete with Redis” post.)

Rather like the idea of autocomplete being more than just string completion.
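
The difference between completing on the start of the whole string and completing on the start of any word is easy to show. This sketch uses an in-memory Python dictionary in place of Redis and is not GetGlue's engine; the function names are mine, and the sample titles are the ones from the quoted post.

```python
from collections import defaultdict


def build_index(titles):
    """Map every prefix of every word in a title to the titles containing it."""
    index = defaultdict(set)
    for title in titles:
        for word in title.lower().split():
            for end in range(1, len(word) + 1):
                index[word[:end]].add(title)
    return index


def complete(index, typed):
    """Titles in which every typed word is the prefix of some word."""
    words = typed.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


index = build_index(["Mara", "Marceline", "Teenagers From Mars"])
print(complete(index, "mar"))
print(complete(index, "teen mar"))
```

Typing mar now reaches Teenagers From Mars as well as Mara and Marceline. The price is a much larger index, one entry per word prefix, which is the sort of trade-off an in-memory store makes affordable.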

What if while typing a name, “autocompletion” returns one or more choices for what it thinks you may be talking about? With additional properties/characteristics, you can disambiguate your usage by allowing your editor to tag the term.

Perhaps another way to ease the burden of authoring a topic map.

Collaborative biocuration… [Pre-Topic Map Tasks]

Monday, November 26th, 2012

Collaborative biocuration—text-mining development task for document prioritization for curation by Thomas C. Wiegers, Allan Peter Davis and Carolyn J. Mattingly. (Database (2012) 2012 : bas037 doi: 10.1093/database/bas037)

Abstract:

The Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems for the biological domain. The ‘BioCreative Workshop 2012’ subcommittee identified three areas, or tracks, that comprised independent, but complementary aspects of data curation in which they sought community input: literature triage (Track I); curation workflow (Track II) and text mining/natural language processing (NLP) systems (Track III). Track I participants were invited to develop tools or systems that would effectively triage and prioritize articles for curation and present results in a prototype web interface. Training and test datasets were derived from the Comparative Toxicogenomics Database (CTD; http://ctdbase.org) and consisted of manuscripts from which chemical–gene–disease data were manually curated. A total of seven groups participated in Track I. For the triage component, the effectiveness of participant systems was measured by aggregate gene, disease and chemical ‘named-entity recognition’ (NER) across articles; the effectiveness of ‘information retrieval’ (IR) was also measured based on ‘mean average precision’ (MAP). Top recall scores for gene, disease and chemical NER were 49, 65 and 82%, respectively; the top MAP score was 80%. Each participating group also developed a prototype web interface; these interfaces were evaluated based on functionality and ease-of-use by CTD’s biocuration project manager. In this article, we present a detailed description of the challenge and a summary of the results.

The results:

“Top recall scores for gene, disease and chemical NER were 49, 65 and 82%, respectively; the top MAP score was 80%.”

indicate there is plenty of room for improvement. Perhaps even commercially viable improvement.
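
If recall and MAP are unfamiliar, a small worked example helps in reading those numbers. The sketch below uses the standard textbook definitions with made-up document ids; it is not the BioCreative evaluation code.

```python
def recall(found, relevant):
    """Fraction of the relevant items that were actually found."""
    relevant = set(relevant)
    return len(relevant & set(found)) / len(relevant)


def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant item appears."""
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


# Hypothetical triage results for two queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                    # AP = (1/2) / 1
]
print(round(mean_average_precision(runs), 3))
print(recall(["gene1", "gene2"], ["gene1", "gene3", "gene4", "gene5"]))
```

A recall of 49% for genes means roughly half of the manually curated gene mentions were never found, which is where the room for improvement lies.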

In hindsight, not talking about how to make a topic map along with ISO 13250 may have been a mistake. Even admitting there are multiple ways to get there, a technical report outlining one or two ways would have made the process more transparent.

Answering the question: “What can you say with a topic map?” with “Anything you want.” was a truthful answer but not a helpful one.

I should try to crib something from one of those “how to write a research paper” guides. I haven’t looked at one in years but the process is remarkably similar to what would result in a topic map.

Some of the mechanics are different but the underlying intellectual process is quite similar. Everyone who has been to college (at least of my age), had a course that talked about writing research papers. So it should be familiar terminology.

Thoughts/suggestions?

AgroTagger [Auto-Topic Map Authoring?]

Wednesday, November 7th, 2012

AgroTagger

From the webpage:

Used for indexing information resources, Agrotagger is a keyword extractor that uses the AGROVOC thesaurus as its set of allowable keywords. It can extract from Microsoft Office documents, PDF files and web pages.

There are currently several available services that can be accessed either as web interfaces for manual document upload or as REST web services that can be programmatically invoked:

I was following up on the AGROVOC thesaurus, FAO thesaurus links with reegle, and found this interesting resource.

Doesn’t seem like a big jump to have a set of keywords create topics, associations and occurrences, along with document author(s), journal, place of employment, etc.

Would need proofing but on the other hand could produce a topic map for proofing tout de suite. (No Michel, I had to look it up. 😉 )
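
Here is roughly what I have in mind, as a sketch only. It does not call AgroTagger or any of its services; the keywords are assumed to have been extracted already, and the dictionary layout, identifiers and type names are all hypothetical.

```python
def draft_topic_map(document, keywords):
    """Turn document metadata plus extracted keywords into draft constructs for proofing."""
    topics = {document["id"]: {"type": "document", "name": document["title"]}}
    associations = []
    occurrences = []
    for author in document["authors"]:
        author_id = "person-" + author.lower().replace(" ", "-")
        topics[author_id] = {"type": "person", "name": author}
        associations.append(
            {"type": "authorship", "roles": {"author": author_id, "work": document["id"]}}
        )
    for keyword in keywords:
        keyword_id = "kw-" + keyword.lower().replace(" ", "-")
        topics[keyword_id] = {"type": "keyword", "name": keyword}
        occurrences.append(
            {"topic": keyword_id, "type": "mentioned-in", "locator": document["url"]}
        )
    return {"topics": topics, "associations": associations, "occurrences": occurrences}


# Hypothetical document record and keyword list.
doc = {
    "id": "doc-1",
    "title": "Soil Salinity and Rice Yields",
    "authors": ["Jane Doe"],
    "url": "http://example.com/doc-1.pdf",
}
draft = draft_topic_map(doc, ["rice", "soil salinity"])
print(len(draft["topics"]), len(draft["associations"]), len(draft["occurrences"]))
```

The output is deliberately a draft: every generated topic and association is a candidate for a human proofreader to accept, merge or discard.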

LTM — Cheat-Sheet Update (One update begats another)

Saturday, November 3rd, 2012

LTM — Cheat-Sheet 0.3

Post-publication proofing is more accurate than pre-publication proofing.

Thoughts on why that is the case? 😉

I forgot to update the revision number in 0.2 and minor though it may be, wanted to correct that.

So, LTM Cheat-Sheet 0.3 is now available.

I will go back to the earlier posts so they point to the latest version.


Update: 15 November 2012. Latest version is LTM — Cheat-Sheet 0.4.

LTM — Cheat-Sheet Update

Friday, November 2nd, 2012

LTM — Cheat-Sheet 0.2

I caught a couple of typos in version 0.1 and have posted version 0.2 of the LTM — Cheat-Sheet.

Changes as follows:

to signal its missing -> to signal it’s missing

followed (bold)by(/bold) -> followed by

[and] -> and

[optional] -> [opt] 2X


Update: 15 November 2012. Latest version is LTM — Cheat-Sheet 0.4.

Update: 3 November 2012. Latest version is LTM — Cheat-Sheet 0.3. Post announcing it: LTM — Cheat-Sheet Update (One update begats another).

LTM – Cheat-Sheet

Sunday, October 28th, 2012

LTM – Cheat-Sheet

I had someone ask for Linear Topic Map (LTM) syntax instead of XTM.

My marketing staff advised: “The customer is always right.” 😉

I created this LTM cheat-sheet, based on “The Linear Topic Map Notation,” version 1.3, by Lars Marius Garshol.

Thought it might be of interest.

Comments, suggestions, corrections welcome!


Update: 15 November 2012. Latest version is LTM — Cheat-Sheet 0.4.

Update: 3 November 2012. Latest version is LTM — Cheat-Sheet 0.3. Post announcing it: LTM — Cheat-Sheet Update (One update begats another).

Do Presidential Debates Approach Semantic Zero?

Thursday, October 18th, 2012

ReConstitution recreates debates through transcripts and language processing by Nathan Yau.

From Nathan’s post:

Part data visualization, part experimental typography, ReConstitution 2012 is a live web app linked to the US Presidential Debates. During and after the three debates, language used by the candidates generates a live graphical map of the events. Algorithms track the psychological states of Romney and Obama and compare them to past candidates. The app allows the user to get beyond the punditry and discover the hidden meaning in the words chosen by the candidates.

The visualization does not answer the thorny experimental question: Do presidential debates approach semantic zero?

Well, maybe the technique will improve by the next presidential election.

In the meantime, it was an impressive display of real-time processing and analysis of text.

Imagine such an interface that was streaming text for you to choose subjects, associations between subjects, and the like.

Not trying to perfectly code any particular stretch of text but interacting with the flow of the text.

There are goals other than approaching semantic zero.

Calligra 2.6 Alpha Released [Entity/Association Recognition Writ Small?]

Wednesday, October 17th, 2012

Calligra 2.6 Alpha Released

The final version of Calligra 2.6 is due out in December of 2012. Too late to think about topic map features for that release.

But what about the release after that?

In 2.6 we will see:

Calligra Author is a new member of the growing Calligra application family. The application was announced just after the release of Calligra 2.5 with the following description:

The application will support a writer in the process of creating an eBook from concept to publication. We have two user categories in particular in mind:

  • Novelists who produce long texts with complicated plots involving many characters and scenes but with limited formatting.
  • Textbook authors who want to take advantage of the added possibilities in eBooks compared to paper-based textbooks.

Novelists and text book authors are prime candidates for topic maps, especially if integrated into a word processor.

Novelists track many relationships between people, places, things. What if entities were recognized and associations suggested, much like spell checking?

Not solving entity/association recognition writ large, but entity/association recognition writ small. Entity/association recognition for a single author.

Text book authors as well because they are creating instructional maps of a field of study. Instructional maps that have to be updated with new information and references.

Separate indexes could be merged, to create meaningful indexes to entire series of works.

PS: In the interest of full disclosure, I am the editor of ODF, the default format for Calligra.

Five User Experience Lessons from Johnny Depp

Saturday, October 13th, 2012

Five User Experience Lessons from Johnny Depp by Steve Tengler.

Print this post out and pencil in your guesses for the Johnny Depp movies that illustrate these lessons:

Lesson #1: It’s Not About the Ship You Rode In On

Lesson #2: Good UXers Plan Ahead to Assimilate External Content

Lesson #3: Flexibility on Size Helps Win the Battle

Lesson #4: Design for What Your Customer Wants … Not for What You Want

Lesson #5: Tremendous Flexibility Can Lead to User Satisfaction

Then pass a clean copy to the next cubicle and see how they do.

Funny how Lesson #4 keeps coming up.

I had an Old Testament professor who said laws against idol worship were evidence people were engaged in idol worship. Rarely prohibit what isn’t a problem.

I wonder if #4 keeps coming up because designers keep designing for themselves?

What do you think?

If that is true, then it must be true that authors write for themselves. (Ouch!)

So how do authors discover (or do they) how to write for others?

We know the ones that succeed in the commercial trade by their sales. But that is after the fact and not explanatory.

Important question if you are authoring curated content with a topic map for sale.

Verification: In God We Trust, All Others Pay Cash

Thursday, October 11th, 2012

Crowdsourcing is a valuable technique, at least if accurate information is the result. Incorrect information or noise is still incorrect information or noise, crowdsourced or not.

From PLOS ONE (not Nature or Science) comes news of progress on verification of crowdsourced information. (Naroditskiy V, Rahwan I, Cebrian M, Jennings NR (2012) Verification in Referral-Based Crowdsourcing. PLoS ONE 7(10): e45924. doi:10.1371/journal.pone.0045924)

Abstract:

Online social networks offer unprecedented potential for rallying a large number of people to accomplish a given task. Here we focus on information gathering tasks where rare information is sought through “referral-based crowdsourcing”: the information request is propagated recursively through invitations among members of a social network. Whereas previous work analyzed incentives for the referral process in a setting with only correct reports, misreporting is known to be both pervasive in crowdsourcing applications, and difficult/costly to filter out. A motivating example for our work is the DARPA Red Balloon Challenge where the level of misreporting was very high. In order to undertake a formal study of verification, we introduce a model where agents can exert costly effort to perform verification and false reports can be penalized. This is the first model of verification and it provides many directions for future research, which we point out. Our main theoretical result is the compensation scheme that minimizes the cost of retrieving the correct answer. Notably, this optimal compensation scheme coincides with the winning strategy of the Red Balloon Challenge.

UCSD Jacobs School of Engineering, in Making Crowdsourcing More Reliable, reported the following experience with this technique:

The research team has successfully tested this approach in the field. Their group accomplished a seemingly impossible task by relying on crowdsourcing: tracking down “suspects” in a jewel heist on two continents in five different cities, within just 12 hours. The goal was to find five suspects. Researchers found three. That was far better than their nearest competitor, which located just one “suspect” at a much later time.

It was all part of the “Tag Challenge,” an event sponsored by the U.S. Department of State and the U.S. Embassy in Prague that took place March 31. Cebrian’s team promised $500 to those who took winning pictures of the suspects. If these people had been recruited to be part of “CrowdScanner” by someone else, that person would get $100. To help spread the word about the group, people who recruited others received $1 per person for the first 2,000 people to join the group.
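
The incentive scheme in that last paragraph is simple enough to work through by hand, or with a short sketch. The dollar amounts come from the quoted description; the function, the data layout and the three-person example are my own invention.

```python
from collections import Counter


def payouts(finders, recruited_by, join_order, signup_cap=2000):
    """Amounts owed under the scheme described in the quoted report.

    finders:      people who took a winning picture ($500 each)
    recruited_by: {person: recruiter}; the recruiter of a finder gets $100
    join_order:   people in the order they joined; recruiters get $1 per
                  recruit among the first `signup_cap` people to join
    """
    owed = Counter()
    for finder in finders:
        owed[finder] += 500
        recruiter = recruited_by.get(finder)
        if recruiter is not None:
            owed[recruiter] += 100
    for member in join_order[:signup_cap]:
        recruiter = recruited_by.get(member)
        if recruiter is not None:
            owed[recruiter] += 1
    return dict(owed)


# Hypothetical chain: ann recruits bob, bob recruits cy, cy takes a winning picture.
print(payouts(
    finders=["cy"],
    recruited_by={"bob": "ann", "cy": "bob"},
    join_order=["ann", "bob", "cy"],
))
```

In this example cy is owed $500, bob $101 ($100 for recruiting a finder plus $1 for the sign-up) and ann $1.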

This has real potential!

Could use money, but what of other inducements?

What if department professors agree to substitute participation in a verified crowdsourced bibliography in place of the usual 10% class participation?

Motivation and structuring the task are both open areas for experimentation and research.

Suggestions on areas for topic maps using this methodology?

Some other resources you may find of interest:

Tag Challenge website

Tag Challenge – Wikipedia (Has links to team pages, etc.)

Topic Based Authoring (Webinar)

Wednesday, September 19th, 2012

Topic Based Authoring

Date: Thursday, October 4, 2012
Time: 11:00 AM PDT | 2:00 PM EDT

From the description:

Using a topic-based approach can improve consistency and usability of information and make it easier to reuse topics in different contexts. It can also simplify maintenance, speed up the review process, and facilitate shared authoring.

All of those benefits sound great. But which ones really matter to you, your business, and your customers? It’s important to know why you want to change your content strategy, and how you’ll evaluate whether you’ve been successful.

Topic-based authoring implementations often focus on learning writing patterns, techniques, and technologies like DITA and CCMS. Those are important and useful, but topic-based authoring doesn’t exist in a vacuum. Decisions you make about your content need to be tied to business goals and user needs. Too often, the activity of thinking through the business goals and user needs gets neglected.

This 45-minute webinar will define topic-based authoring and help you understand not only the benefits of this approach but also walk you through the critical steps to defining and implementing a successful program.

Topic Map Cheat Sheets?

Tuesday, September 18th, 2012

I have run across several collections of cheat sheets recently.

Would it be helpful to have “cheat sheets” for topic maps?

And if so, would it be more helpful to have “cheat sheets” that were subject specific?

Thinking of subjects I helped identify for a map with chemicals in it. Used a standard set of identifiers plus alternate identifiers as well. Some might be commonly known, others possibly not.

Thoughts? Suggestions? Volunteers?

Hands-on with Google Docs’s new research tool [UI Idea?]

Friday, June 15th, 2012

Hands-on with Google Docs’s new research tool by Joel Mathis, Macworld.com.

From the post:

Google Docs has unveiled a new research tool meant to help writers streamline their browser-based research, making it easier for them to find and cite the information they need while composing text.

The feature, announced Tuesday, appears as an in-page vertical pane on the right side of your Google Doc. (You can see an example of the pane at left.) It can be accessed either through the page’s Tools menu, or with a Command-Option-R keyboard shortcut on your Mac.

The tool offers three types of searches: A basic “everything” search, another just for images, and a third featuring quotes about—or by—the subject of your search.

In “everything” mode, a search for GOP presidential candidate Mitt Romney brought up a column of images and information. At the top of the column, a scrollable set of thumbnail pictures of the man, followed by some basic dossier information—birthday, hometown, and religion—followed by a quote from Romney, taken from an ABC News story that had appeared within the last hour.

The top Web links for a topic are displayed underneath that roster of information. You’re given three options with the links: First, you can “preview” the linked page within the Google Docs page—though you’ll have to open a new tab if you want to conduct a more thorough perusal of the pertinent info. The second option is to create a link to that page directly from the text you’re writing. The third is to create a footnote in the text that cites the link.

Interfaces are forced to make assumptions about the “average” user and their needs. This one sounds like it is hitting around or even close to needs that are fairly common.

Makes me wonder if topic map authoring interfaces should place more emphasis on incorporation of content and authoring, with correspondingly less emphasis on the topic mappishness of the result.

Perhaps cleaning up a map is something that should be a separate task anyway.

Authors write and editors edit.

Is there some reason to combine those two tasks?

(I first saw this at Research Made Easy With Google Docs by Stephen Arnold.)

Web sequence diagrams

Thursday, May 24th, 2012

Web sequence diagrams

I ran across this while looking for information on Lucene indexing.

It may be that I am confusing the skill of the author with the utility of the interface (which may be commonly available via other sources) but I was impressed enough that I wanted to point it out.

It does seem a bit pricey ($99 for two users) but on the other hand, developing good documentation is (should be) a team-based task. This would be a good way to ensure a common understanding of sequences of operations.

Are there similar tools you would recommend for team based activities?

Thinking that authoring a topic map is very much a team activity. From domain experts who vet content to UI experts who create and test interfaces to experts who load and maintain content servers and others.

Keeping a common sense of purpose and interdependence (team effort) goes a long way to a successful project conclusion.

Auto Tagging Articles using Semantic Analysis and Machine Learning

Wednesday, May 2nd, 2012

Auto Tagging Articles using Semantic Analysis and Machine Learning

Description:

The idea is to implement an auto tagging feature that provides tags automatically to the user depending upon the content of the post. The tags will get populated as soon as the user leaves the focus on the content text area or via ajax on the press of a button. I’ll be using semantic analysis and topic modeling techniques to judge the topic of the article and extract keywords also from it. Based on an algorithm and a ranking mechanism the user will be provided with a list of tags from which he can select those that best describe the article and also train a user-content specific semi-supervised machine learning model in the background.

A Drupal sandbox for work on auto tagging posts.

Or, topic map authoring without being “in your face.”

Depends on how you read “tags.”

Experiments in genetic programming

Monday, March 19th, 2012

Experiments in genetic programming

Lars Marius Garshol writes:

I made an engine called Duke that can automatically match records to see if they represent the same thing. For more background, see a previous post about it. The biggest problem people seem to have with using it is coming up with a sensible configuration. I stumbled across a paper that described using so-called genetic programming to configure a record linkage engine, and decided to basically steal the idea.

You need to read about the experiments in the post but I can almost hear Lars saying the conclusion:

The result is pretty clear: the genetic configurations are much the best. The computer can configure Duke better than I can. That’s almost shocking, but there you are. I guess I need to turn the script into an official feature.

😉

Excellent post and approach by the way!

Lars also posted a link to Reddit about his experiments. Several links appear in comments that I have turned into short posts to draw more attention to them.

Another tool for your topic mapping toolbox.
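
To make the idea concrete, here is a minimal sketch of evolving a matching configuration against labeled examples. It is not Duke's code or Lars's script, and strictly speaking it is a simple genetic algorithm over two numeric parameters rather than genetic programming. The toy records, the `score` and `evolve` functions, and the two-parameter configuration (a name weight and a match threshold) are all assumptions made up for illustration.

```python
import random
from difflib import SequenceMatcher

# Invented labeled data: (record_a, record_b, same_entity). Records are (name, city).
PAIRS = [
    (("Lars Garshol", "Oslo"), ("Lars M. Garshol", "Oslo"), True),
    (("Acme Corp", "Boston"), ("ACME Corporation", "Boston"), True),
    (("Jon Smith", "Leeds"), ("John Smith", "Leeds"), True),
    (("Jon Smith", "Leeds"), ("Joan Smythe", "Perth"), False),
    (("Acme Corp", "Boston"), ("Apex Corp", "Austin"), False),
    (("Lars Garshol", "Oslo"), ("Lara Marshall", "Oslo"), False),
]

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score(config):
    """Fraction of labeled pairs classified correctly by config = (name_weight, threshold)."""
    name_weight, threshold = config
    correct = 0
    for (name_a, city_a), (name_b, city_b), same in PAIRS:
        combined = (name_weight * similarity(name_a, name_b)
                    + (1 - name_weight) * similarity(city_a, city_b))
        correct += (combined >= threshold) == same
    return correct / len(PAIRS)

def evolve(generations=20, population_size=12, seed=0):
    """Keep the better half each generation and refill with mutated crossovers."""
    rng = random.Random(seed)
    population = [(rng.random(), rng.random()) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        survivors = population[: population_size // 2]
        children = []
        while len(survivors) + len(children) < population_size:
            mother, father = rng.sample(survivors, 2)
            # Crossover: one gene from each parent, then a small Gaussian mutation,
            # clamped so both genes stay within [0, 1].
            child = (mother[0], father[1])
            child = tuple(min(1.0, max(0.0, g + rng.gauss(0, 0.05))) for g in child)
            children.append(child)
        population = survivors + children
    return max(population, key=score)

if __name__ == "__main__":
    best = evolve()
    print(best, score(best))
```

Because the survivors are carried over unchanged, the best score never gets worse from one generation to the next, so every generation's best configuration is usable as an intermediate result.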

Question: I wonder what it would look like to have the intermediate results used for mapping, only to be replaced as “better” mappings become available? The process has a terminating condition, but new content can trigger additional cycles, though only as relevant to that content.

Or would queries count as new content? If they expressed synonymy or other relations?

Crowdsourcing and the end of job interviews

Thursday, March 1st, 2012

Crowdsourcing and the end of job interviews by Panos Ipeirotis.

From the post:

When you discuss crowdsourcing solutions with people that have not heard the concept before, they tend to ask the question: “Why is crowdsourcing so much cheaper than existing solutions that depend on ‘classic’ outsourcing?”

Interestingly enough, this is not a phenomenon that appears only in crowdsourcing. The Sunday edition of the New York Times has an article titled Why Are Harvard Graduates in the Mailroom?. The article discusses the job searching strategy in some fields (e.g., Hollywood, academic, etc), where talented young applicants are willing to start with jobs that are paying well below what their skills deserve, in exchange for having the ability to make it big later in the future:

[This is] the model lottery industry. For most companies in the business, it doesn’t make economic sense to, as Google does, put promising young applicants through a series of tests and then hire only the small number who pass. Instead, it’s cheaper for talent agencies and studios to hire a lot of young workers and run them through a few years of low-paying drudgery…. This occupational centrifuge allows workers to effectively sort themselves out based on skill and drive. Over time, some will lose their commitment; others will realize that they don’t have the right talent set; others will find that they’re better at something else.

Interestingly enough, this occupational centrifuge is very close to the model of employment in crowdsourcing.

The author’s take is that esoteric interview questions aren’t as effective as using a crowdsourcing model. I suspect he may be right.

If that is true, how would you go about structuring a topic map authoring project for crowdsourcing? What framework would you erect going into the project? What sort of quality checks would you implement? Would you “prime the pump” with already public data to be refined?

Are we on the verge of a meritocracy of performance?

As opposed to fields that were once meritocracies of performance and are now the lands of clannish and odd interview questions?

Inventing on Principle

Monday, February 20th, 2012

Inventing on Principle by Bret Victor.

Nathan Yau at Flowing Data writes:

This talk by Bret Victor caught fire a few days ago, but I just got a chance to watch it in its entirety. It’s worth the one hour. Victor demos some great looking software that connects code to the visual, making the creation process more visceral, and he finishes up with worthwhile thoughts on the invention process.

Think about authoring a graph or topic map with the sort of immediate feedback that Bret demonstrates.

Construction of Learning Path Using Ant Colony Optimization from a Frequent Pattern Graph

Sunday, January 22nd, 2012

Construction of Learning Path Using Ant Colony Optimization from a Frequent Pattern Graph by Souvik Sengupta, Sandipan Sahu and Ranjan Dasgupta.

Abstract:

In an e-Learning system a learner may come across multiple unknown terms, which are generally hyperlinked, while reading a text definition or theory on any topic. It becomes even harder when one tries to understand those unknown terms through further such links and they again find some new terms that have new links. As a consequence they get confused where to initiate from and what are the prerequisites. So it is very obvious for the learner to make a choice of what should be learnt before what. In this paper we have taken the data mining based frequent pattern graph model to define the association and sequencing between the words and then adopted the Ant Colony Optimization, an artificial intelligence approach, to derive a searching technique to obtain an efficient and optimized learning path to reach to a unknown term.

The phrase “multiple unknown terms, which are generally hyperlinked” is a good description of any location in a topic map for anyone other than its author and other experts in the field it describes.

Although couched in terms of a classroom educational setting, I suspect techniques very similar to these could be used with any topic map interface with users.

This post on Google+ statistics is a billion* times better than any other post

Saturday, January 21st, 2012

This post on Google+ statistics is a billion* times better than any other post by Rocky Agrawal.

From the post:

In Thursday’s Google earnings call, CEO Larry Page told the world that the company’s fledgling social network, Google+ has reached 90 million registered users. He went on to say that, “Over 60 percent of Google+ users use Google products on a daily basis. Over 80 percent of Google+ users use Google products every week.”

I’m not impressed by the numbers, and I’m not impressed by what Page was trying to do with them.

Counting registered users instead of daily active users tells us nothing about the popularity of the service. Think of the millions of people who’ve registered for Google+ but never use it. Second, given the huge popularity of Google search, Gmail, and YouTube, it’s actually surprising that so few people who have registered for Google+ are using those more popular services on a daily basis — only 60 percent. After all, remember that a lot of Google+ users accidentally became Google+ users only because they were already attached to another Google service.

But what concerns me most is that Google is touting these meaningless statistics in the hopes that journalists will misunderstand them and report that Google+ is seeing rapid growth. The bottom line is, those 60 percents, 80 percents and 90 million registered users are just there to mask the fact that Google doesn’t want to tell us how many people are actually using Google+.

It’s intellectually dishonest. And as a public company, it raises questions of Google’s intent — the market is watching Google’s moves in social and needs to see traction. I expect better from Google.

Some journalists, to say nothing of the great mass of the unwashed, will be misled by Larry Page’s statements, whether that was his intent or not. But many of them would have misunderstood his comments had they been delivered with the aid of anatomically correct dolls.

To illustrate another reporting approach to statements about usage by Larry Page, consider the Wall Street Journal coverage on the same topic:

The company has pushed into social media, launching Google+ as an alternative to Facebook’s popular website. (From: Google Shares Plunge As Earnings Report Raises Growth Concerns. Viewed January 21, 2012, 4:46 PM East Coast time.)

That’s it, one sentence. And apparently investors were among those not misled by Page’s comments, if they cared at all about Page’s comments on usage of Google+.

Topic map authoring tip:

For business topic maps, remember your readers are interested in facts or statements of facts that can form the basis for investment decisions or fraud lawsuits following investments. Lies about extraneous or irrelevant matters need not be included.

Oil Drop Semantics?

Sunday, January 15th, 2012

Interconnection of Communities of Practice: A Web Platform for Knowledge Management and some related material made me think of the French “oil drop” counter-insurgency strategy.

With one important difference.

In a counter-insurgency context, the oil drop strategy is being used to further the goals of the counter-insurgency force. Whatever you think of those goals or the alleged benefits for the places covered by the oil drops, the fundamental benefit is to the counter-insurgency force.

In a semantic context, one that seeks to elicit the local semantics of a group, the goal is not the furtherance of an outside semantic, but the exposition of a local semantic with the goal of benefiting the group covered by the oil spot. As the oil drop spreads, those semantics may be combined with other oil drop semantics, but that is a cost and effort borne by the larger community seeking that benefit.

There are several immediate advantages to this approach with semantics.

First, the discussion of semantics at every level is taking place with the users of those semantics. You can hardly get closer to a useful answer than being able to ask the users of a semantic what was meant or for examples of usage. I don’t have a formalism for it but I would postulate that as the distance from users increases, the usefulness of what you learn about the semantics of those users decreases.

Ask the FBI about the Virtual Case File project. They didn’t ask users, or at least not enough of them, and flushed lots of cash. Lesson: Asking management, IT, etc., about the semantics of users is an utter waste of time. Really.

If you want to know the semantics of user group X, then ask group X. If you ask Y about X, you will get Y’s semantics about X. If that is what you want, fine, but if you want the semantics of group X, you have wasted your time and resources.

Second, asking the appropriate group of users for their semantics means that you can make explicit the ROI from making their semantics explicit. That is to say, if asked, the group will ask about semantics that are meaningful to them, ones that solve some task or issue that they encounter. Those may or may not be the semantics that interest you, but recall the issue is the group’s semantics, not yours.

The reason for the ROI question at the appropriate group level is so that the project is justified both to the group being asked to make the effort and to those who must approve the resources for such a project. Answering that question up front helps get buy-in from group members and makes them realize this isn’t busy work but will have a positive benefit for them.

Third, such a bottom-up approach, whether you are using topic maps, RDF, etc., will mean that only the semantics that are important to users and justified by some positive benefit are being captured. Your semantics may not have the rigor of SUMO, for example, but they are a benefit to you. What other test would you apply?

How accurate can manual review be?

Friday, December 23rd, 2011

How accurate can manual review be?

From the post:

One of the chief pleasures for me of this year’s SIGIR in Beijing was attending the SIGIR 2011 Information Retrieval for E-Discovery Workshop (SIRE 2011). The smaller and more selective the workshop, it often seems, the more focused and interesting the discussion.

My own contribution was “Re-examining the Effectiveness of Manual Review”. The paper was inspired by an article from Maura Grossman and Gord Cormack, whose message is neatly summed up in its title: “Technology-assisted review in e-discovery can be more effective and more efficient than exhaustive manual review”.

Fascinating work!

Does this give you pause about automated topic map authoring? Why/why not?