Archive for the ‘Authoring Topic Maps’ Category

@niccdias and @cward1e on Mis- and Dis-information [Additional Questions]

Friday, September 29th, 2017

10 questions to ask before covering mis- and dis-information by Nic Dias and Claire Wardle.

From the post:

Can silence be the best response to mis- and dis-information?

First Draft has been asking ourselves this question since the French election, when we had to make difficult decisions about what information to publicly debunk for CrossCheck. We became worried that – in cases where rumours, misleading articles or fabricated visuals were confined to niche communities – addressing the content might actually help to spread it farther.

As Alice Marwick and Rebecca Lewis noted in their 2017 report, Media Manipulation and Disinformation Online, “[F]or manipulators, it doesn’t matter if the media is reporting on a story in order to debunk or dismiss it; the important thing is getting it covered in the first place.” Buzzfeed’s Ryan Broderick seemed to confirm our concerns when, on the weekend of the #MacronLeaks trend, he tweeted that 4channers were celebrating news stories about the leaks as a “form of engagement.”

We have since faced the same challenges in the UK and German elections. Our work convinced us that journalists, fact-checkers and civil society urgently need to discuss when, how and why we report on examples of mis- and dis-information and the automated campaigns often used to promote them. Of particular importance is defining a “tipping point” at which mis- and dis-information becomes beneficial to address. We offer 10 questions below to spark such a discussion.

Before that, though, it’s worth briefly mentioning the other ways that coverage can go wrong. Many research studies examine how corrections can be counterproductive by ingraining falsehoods in memory or making them more familiar. Ultimately, the impact of a correction depends on complex interactions between factors like subject, format and audience ideology.

Reports of disinformation campaigns, amplified through the use of bots and cyborgs, can also be problematic. Experiments suggest that conspiracy-like stories can inspire feelings of powerlessness and lead people to report lower likelihoods to engage politically. Moreover, descriptions of how bots and cyborgs were found give their operators the opportunity to change strategies and better evade detection. In a month awash with revelations about Russia’s involvement in the US election, it’s more important than ever to discuss the implications of reporting on these kinds of activities.

Following the French election, First Draft has switched from the public-facing model of CrossCheck to a model where we primarily distribute our findings via email to newsroom subscribers. Our election teams now focus on stories that are predicted (by NewsWhip’s “Predicted Interactions” algorithm) to be shared widely. We also commissioned research on the effectiveness of the CrossCheck debunks and are awaiting its results to evaluate our methods.

The ten questions (see the post) should provoke useful discussions in newsrooms around the world.

I have three additional questions that round Nic Dias and Claire Wardle’s list to a baker’s dozen:

  1. How do you define mis- or dis-information?
  2. How do you evaluate information to classify it as mis- or dis-information?
  3. Are your evaluations of specific information as mis- or dis-information public?

Defining dis- or mis-information

The standard definitions (Merriam Webster) for:

disinformation: false information deliberately and often covertly spread (as by the planting of rumors) in order to influence public opinion or obscure the truth

misinformation: incorrect or misleading information

would find nodding agreement from Al Jazeera and the CIA to the European Union and Recep Tayyip Erdoğan.

However, what is or is not disinformation or misinformation would vary from one of those parties to another.

Before reaching the ten questions of Nic Dias and Claire Wardle, define what you mean by disinformation or misinformation. Hopefully with numerous examples, especially ones that are close to the boundaries of your definitions.

Otherwise, all your readers know is that on the basis of some definition of disinformation/misinformation known only to you, information has been determined to be untrustworthy.

Documenting your process to classify as dis- or mis-information

Assuming you do arrive at a common definition of misinformation or disinformation, what process do you use to classify information according to those definitions? Ask your editor? That seems like a poor choice but no doubt it happens.

Do you consult and abide by an opinion found on Snopes? Or Politifact? Or do all of the sources you consult have to agree for a judgement of misinformation or disinformation? What about other sources?

What sources do you consider definitive on the question of mis- or disinformation? Do you keep that list updated? How did you choose those sources over others?

Documenting your evaluation of information as dis- or mis-information

Having a process for evaluating information is great.

But have you followed that process? If challenged, how would you establish the process was followed for a particular piece of information?

Is your documentation office “lore,” or something more substantial?

An online form that captures the information, its source, the fact-check source consulted (with date), the decision, and the person making the decision would take only seconds to populate. In addition to documenting the decision, you can build up a record of a source’s reliability.
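A minimal sketch of what such a form might store, in Python. The field names, the decision labels and the reliability measure (share of a source's checked claims judged accurate) are my assumptions for illustration, not an established schema.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FactCheckRecord:
    """One documented decision. All field names are hypothetical."""
    claim: str          # the information being evaluated
    claim_source: str   # where the information came from
    check_source: str   # fact-check source consulted
    check_date: date    # date the fact-check source was consulted
    decision: str       # e.g. "accurate", "misinformation", "disinformation"
    decided_by: str     # person making the decision


def source_reliability(records):
    """Return, per claim source, the share of its checked claims judged accurate."""
    totals = Counter()
    accurate = Counter()
    for record in records:
        totals[record.claim_source] += 1
        if record.decision == "accurate":
            accurate[record.claim_source] += 1
    return {source: accurate[source] / totals[source] for source in totals}
```

Because every record names the person and the fact-check source, the same log answers both "was the process followed?" and "how often has this source been right?"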


Vagueness makes discussion and condemnation of mis- or dis-information easy, but it makes it difficult to have a process for evaluating information or a common ground for classifying that information, to say nothing of documenting your decision on specific information.

Don’t be the black box of whim and caprice users experience at Twitter, Facebook and Google. You can do better than that.

Deep Learning: Image Similarity and Beyond (Webinar, May 10, 2016)

Friday, May 6th, 2016

Deep Learning: Image Similarity and Beyond (Webinar, May 10, 2016)

From the registration page:

Deep Learning is a powerful machine learning method for image tagging, object recognition, speech recognition, and text analysis. In this demo, we’ll cover the basic concept of deep learning and walk you through the steps to build an application that finds similar images using an already-trained deep learning model.

Recommended for:

  • Data scientists and engineers
  • Developers and technical team managers
  • Technical product managers

What you’ll learn:

  • How to leverage existing deep learning models
  • How to extract deep features and use them using GraphLab Create
  • How to build and deploy an image similarity service using Dato Predictive Services

What we’ll cover:

  • Using an already-trained deep learning model
  • Extracting deep features
  • Building and deploying an image similarity service for pictures 

Deep learning has difficulty justifying its choices, just like human judges of similarity, but could it play a role in assisting topic map authors in constructing explicit decisions for merging?

Once trained, could deep learning suggest properties and/or values to consider for merging it has not yet experienced?
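To make the question concrete, here is a rough sketch of similarity-based merge suggestions, assuming feature vectors have already been extracted by some trained model. The vectors, the identifiers and the 0.9 threshold are invented for illustration; this is plain cosine similarity, not GraphLab Create or Dato code.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest_merges(features, threshold=0.9):
    """Return (id_a, id_b, score) for every pair at or above the threshold.

    `features` maps an item id to its feature vector. The output is only a
    list of candidates; a human author still decides whether to merge.
    """
    suggestions = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(features.items()), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            suggestions.append((id_a, id_b, score))
    return sorted(suggestions, key=lambda s: s[2], reverse=True)
```

The score says two items look alike; it does not say why, which is exactly the gap an explicit merging rule would have to fill.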

I haven’t seen any webinars recently so I am ready to gamble on this being an interesting one.


Web Page Structure, Without The Semantic Web

Saturday, May 30th, 2015

Could a Little Startup Called Diffbot Be the Next Google?

From the post:

Diffbot founder and CEO Mike Tung started the company in 2009 to fix a problem: there was no easy, automated way for computers to understand the structure of a Web page. A human looking at a product page on an e-commerce site, or at the front page of a newspaper site, knows right away which part is the headline or the product name, which part is the body text, which parts are comments or reviews, and so forth.

But a Web-crawler program looking at the same page doesn’t know any of those things, since these elements aren’t described as such in the actual HTML code. Making human-readable Web pages more accessible to software would require, as a first step, a consistent labeling system. But the only such system to be seriously proposed, Tim Berners-Lee’s Semantic Web, has long floundered for lack of manpower and industry cooperation. It would take a lot of people to do all the needed markup, and developers around the world would have to adhere to the Resource Description Framework prescribed by the World Wide Web Consortium.

Tung’s big conceptual leap was to dispense with all that and attack the labeling problem using computer vision and machine learning algorithms—techniques originally developed to help computers make sense of edges, shapes, colors, and spatial relationships in the real world. Diffbot runs virtual browsers in the cloud that can go to a given URL; suck in the page’s HTML, scripts, and style sheets; and render it just as it would be shown on a desktop monitor or a smartphone screen. Then edge-detection algorithms and computer-vision routines go to work, outlining and measuring each element on the page.

Using machine-learning techniques, this geometric data can then be compared to frameworks or “ontologies”—patterns distilled from training data, usually by humans who have spent time drawing rectangles on Web pages, painstakingly teaching the software what a headline looks like, what an image looks like, what a price looks like, and so on. The end result is a marked-up summary of a page’s important parts, built without recourse to any Semantic Web standards.

The irony here, of course, is that much of the information destined for publication on the Web starts out quite structured. The WordPress content-management system behind Xconomy’s site, for example, is built around a database that knows exactly which parts of this article should be presented as the headline, which parts should look like body text, and (crucially, to me) which part is my byline. But these elements get slotted into a layout designed for human readability—not for parsing by machines. Given that every content management system is different and that every site has its own distinctive tags and styles, it’s hard for software to reconstruct content types consistently based on the HTML alone.

There are several themes here that are relevant to topic maps.

First, it is true that most data starts with some structure, styles if you will, before it is presented for user consumption. Imagine an authoring application that automatically, and unknown to its user, captures metadata that can then provide semantics for its data.

Second, the recognition of structure approach being used by Diffbot is promising in the large but should be promising in the small as well. Local documents of a particular type are unlikely to have the variance of documents across the web, meaning that with far less effort you can build recognition systems that enable more powerful searching of local document repositories.

Third, and perhaps most importantly, while the results may not be 100% accurate, the question for any such project should be: how much accuracy is required? If I am mining social commentary blogs, a 5% error rate on recognition of speakers might be acceptable, because for popular threads or speakers, those errors are going to be quickly corrected. As for unpopular threads or authors never followed, does that come under no harm/no foul?

Highly recommended for reading/emulation.

Are Government Agencies Trustworthy? FBI? No!

Thursday, April 23rd, 2015

Pseudoscience in the Witness Box: The FBI faked an entire field of forensic science by Dahlia Lithwick.

From the post:

The Washington Post published a story so horrifying this weekend that it would stop your breath: “The Justice Department and FBI have formally acknowledged that nearly every examiner in an elite FBI forensic unit gave flawed testimony in almost all trials in which they offered evidence against criminal defendants over more than a two-decade period before 2000.”

What went wrong? The Post continues: “Of 28 examiners with the FBI Laboratory’s microscopic hair comparison unit, 26 overstated forensic matches in ways that favored prosecutors in more than 95 percent of the 268 trials reviewed so far.” The shameful, horrifying errors were uncovered in a massive, three-year review by the National Association of Criminal Defense Lawyers and the Innocence Project. Following revelations published in recent years, the two groups are helping the government with the country’s largest ever post-conviction review of questioned forensic evidence.

Chillingly, as the Post continues, “the cases include those of 32 defendants sentenced to death.” Of these defendants, 14 have already been executed or died in prison.

You should read Dahlia’s post carefully and then write “untrustworthy” next to any reference to or material from the FBI.

This particular issue involved identifying hair samples to be the same, which went beyond any known science.

But if 26 out of 28 experts were willing to go there, how far do you think the average agent on the street goes towards favoring the prosecution?

True, the FBI is working to find all the cases where this has happened, but questions about this type of evidence were raised long before now. But questioning the prosecution’s evidence doesn’t work in favor of the FBI.

Defense teams need to start requesting judicial notice of the propensity of executive branch department employees to give false testimony and a cautionary instruction to jurors in cases where they appear in trials.

Unstructured Topic Map-Like Data Powering AI

Monday, March 23rd, 2015

Artificial Intelligence Is Almost Ready for Business by Brad Power.

From the post:

Such mining of digitized information has become more effective and powerful as more info is “tagged” and as analytics engines have gotten smarter. As Dario Gil, Director of Symbiotic Cognitive Systems at IBM Research, told me:

“Data is increasingly tagged and categorized on the Web – as people upload and use data they are also contributing to annotation through their comments and digital footprints. This annotated data is greatly facilitating the training of machine learning algorithms without demanding that the machine-learning experts manually catalogue and index the world. Thanks to computers with massive parallelism, we can use the equivalent of crowdsourcing to learn which algorithms create better answers. For example, when IBM’s Watson computer played ‘Jeopardy!,’ the system used hundreds of scoring engines, and all the hypotheses were fed through the different engines and scored in parallel. It then weighted the algorithms that did a better job to provide a final answer with precision and confidence.”

Granted, the tagging and annotation is unstructured, unlike a topic map, but it is also unconstrained by first order logic and other crippling features of RDF and OWL. Out of that mass of annotations, algorithms can construct useful answers.

Imagine what non-experts (Stanford logic refugees need not apply) could author about your domain, to be fed into an AI algorithm. That would take more effort than relying upon users chancing upon subjects of interest but it would also give you greater precision in the results.

Perhaps, just perhaps, one of the errors in the early topic maps days was the insistence on high editorial quality at the outset, as opposed to allowing editorial quality to emerge out of data.

As an editor I’m far more in favor of the former than the latter, but seeing the latter work makes me doubt that stringent editorial control is the only path to an acceptable degree of editorial quality.

What would a rough-cut topic map authoring interface look like?


Flock: Hybrid Crowd-Machine Learning Classifiers

Monday, March 16th, 2015

Flock: Hybrid Crowd-Machine Learning Classifiers by Justin Cheng and Michael S. Bernstein.

Abstract:


We present hybrid crowd-machine learning classifiers: classification models that start with a written description of a learning goal, use the crowd to suggest predictive features and label data, and then weigh these features using machine learning to produce models that are accurate and use human-understandable features. These hybrid classifiers enable fast prototyping of machine learning models that can improve on both algorithm performance and human judgment, and accomplish tasks where automated feature extraction is not yet feasible. Flock, an interactive machine learning platform, instantiates this approach. To generate informative features, Flock asks the crowd to compare paired examples, an approach inspired by analogical encoding. The crowd’s efforts can be focused on specific subsets of the input space where machine-extracted features are not predictive, or instead used to partition the input space and improve algorithm performance in subregions of the space. An evaluation on six prediction tasks, ranging from detecting deception to differentiating impressionist artists, demonstrated that aggregating crowd features improves upon both asking the crowd for a direct prediction and off-the-shelf machine learning features by over 10%. Further, hybrid systems that use both crowd-nominated and machine-extracted features can outperform those that use either in isolation.

Let’s see, suggest predictive features (subject identifiers in the non-topic map technical sense) and label data (identify instances of a subject), sounds a lot easier than some of the tedium I have seen for authoring a topic map.

I particularly like the “inducing” of features versus relying on a crowd to suggest identifying features. I suspect that would work well in a topic map authoring context, sans the machine learning aspects.

This paper is being presented this week at CSCW 2015, so you aren’t too far behind. 😉

How would you structure an inducement mechanism for authoring a topic map?

TM-Gen: A Topic Map Generator from Text Documents

Wednesday, January 21st, 2015

TM-Gen: A Topic Map Generator from Text Documents by Angel L. Garrido, et al.

From the post:

The vast amount of text documents stored in digital format is growing at a frantic rhythm each day. Therefore, tools able to find accurate information by searching in natural language information repositories are gaining great interest in recent years. In this context, there are especially interesting tools capable of dealing with large amounts of text information and deriving human-readable summaries. However, one step further is to be able not only to summarize, but to extract the knowledge stored in those texts, and even represent it graphically.

In this paper we present an architecture to generate automatically a conceptual representation of knowledge stored in a set of text-based documents. For this purpose we have used the topic maps standard and we have developed a method that combines text mining, statistics, linguistic tools, and semantics to obtain a graphical representation of the information contained therein, which can be coded using a knowledge representation language such as RDF or OWL. The procedure is language-independent, fully automatic, self-adjusting, and it does not need manual configuration by the user. Although the validation of a graphic knowledge representation system is very subjective, we have been able to take advantage of an intermediate product of the process to make an experimental validation of our proposal.

Of particular note on the automatic construction of topic maps:

Addition of associations:

TM-Gen adds to the topic map the associations between topics found in each sentence. These associations are given by the verbs present in the sentence. TM-Gen performs this task by searching the subject included as topic, and then it adds the verb as its association. Finally, it links its verb complement with the topic and with the association as a new topic.

Depending on the archive one would expect associations between the authors and articles but also topics within articles, to say nothing of date, the publication, etc. Once established, a user can request a view that consists of more or less detail. If not captured, however, more detail will not be available.

There is only a general description of TM-Gen but enough to put you on the way to assembling something quite similar.
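As a starting point, here is a minimal sketch of the association step as the quoted passage describes it: the subject becomes a topic, the verb becomes the association, and the verb complement becomes a new topic. It assumes some linguistic tool has already reduced each sentence to a (subject, verb, complement) triple; that parsing step, the function name and the data layout are my assumptions, not TM-Gen's actual code.

```python
from collections import defaultdict


def build_topic_map(triples):
    """Build topics and verb-labelled associations from pre-parsed triples.

    `triples` is an iterable of (subject, verb, complement) tuples, assumed
    to come from an upstream parser that this sketch does not implement.
    Returns (topics, associations), where associations maps each verb to the
    set of (subject, complement) pairs it links.
    """
    topics = set()
    associations = defaultdict(set)
    for subject, verb, complement in triples:
        topics.add(subject)      # the subject is included as a topic
        topics.add(complement)   # the verb complement becomes a new topic
        associations[verb].add((subject, complement))
    return topics, dict(associations)
```

Anything not captured at this stage (author, date, publication) cannot be shown in a later view, so the triple format is the place to add those details.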

You Say “Concepts” I Say “Subjects”

Wednesday, August 27th, 2014

Researchers are cracking text analysis one dataset at a time by Derrick Harris.

From the post:

Google on Monday released the latest in a string of text datasets designed to make it easier for people outside its hallowed walls to build applications that can make sense of all the words surrounding them.

As explained in a blog post, the company analyzed the New York Times Annotated Corpus — a collection of millions of articles spanning 20 years, tagged for properties such as people, places and things mentioned — and created a dataset that ranks the salience (or relative importance) of every name mentioned in each one of those articles.

Essentially, the goal with the dataset is to give researchers a base understanding of which entities are important within particular pieces of content, an understanding that should then be complemented with background data sources that will provide even more information. So while the number of times a person or company is mentioned in an article can be a very strong sign of which words are important — especially when compared to the usual mention count for that word, one of the early methods for ranking search results — a more telling method of ranking importance would also leverage existing knowledge of broader concepts to capture important words that don’t stand out from a volume perspective.

A summary of some of the recent work on recognizing concepts in text and not just key words.

As topic mappers know, there is no universal one to one correspondence between words and subjects (“concepts” in this article). Finding “concepts” means that whatever words triggered that recognition, we can supply other information that is known about the same concept.

Certainly will make topic map authoring easier when text analytics can generate occurrence data and decorate existing topic maps with their findings.

MeSH on Demand Tool:…

Saturday, August 23rd, 2014

MeSH on Demand Tool: An Easy Way to Identify Relevant MeSH Terms by Dan Cho.

From the post:

Currently, the MeSH Browser allows for searches of MeSH terms, text-word searches of the Annotation and Scope Note, and searches of various fields for chemicals. These searches assume that users are familiar with MeSH terms and using the MeSH Browser.

Wouldn’t it be great if you could find MeSH terms directly from your text such as an abstract or grant summary? MeSH on Demand has been developed in close collaboration among MeSH Section, NLM Index Section, and the Lister Hill National Center for Biomedical Communications to address this need.

Using MeSH on Demand

Use MeSH on Demand to find MeSH terms relevant to your text up to 10,000 characters. One of the strengths of MeSH on Demand is its ease of use without any prior knowledge of the MeSH vocabulary and without any downloads.

Now there’s a clever idea!

Imagine extending it just a bit so that it produces topics for subjects it detects in your text and associations with the text and author of the text. I would call that assisted topic map authoring. You?

I followed a tweet by Michael Hoffman, which led to: MeSH on Demand Update: How to Find Citations Related to Your Text, which describes an enhancement to MeSH on Demand that finds relevant citations (10) based on your text.

The enhanced version mimics the traditional method of writing court opinions. A judge writes his decision and then a law clerk finds cases that support the positions taken in the opinion. You really thought it worked some other way? 😉

Quote for a Terrorism Topic Map?

Wednesday, July 9th, 2014


British Airways has warned that passengers travelling to the US will be banned from their flight if they are unable to turn on their electronic devices when asked.

The airline said passengers will still be banned from travelling and need to reschedule even if they offer to abandon the item. British Airways says US-bound passengers will be BANNED if they can’t turn on mobile phone.

reminded me of another quotation about someone slavishly following the lead of another.

But I need your help finding it.

It was forty-odd years ago and I was reading a young adult account of Benito Mussolini when I ran across an alleged direct quote from Mussolini in the early 1930s:

If I started hopping on one leg, that idiot in Munich [Hitler] would start bouncing on his head…

Not really my time period so I am unfamiliar with possible sources to track the alleged quote down. The usual suspects on the WWW have provided no answer.


PS: After no one followed their sycophantic excesses, British Airways backed off banning flyers who abandon non-working devices. UK follows US in banning uncharged devices from flights — and BA floated even tougher rules.

Property Suggester

Wednesday, July 2nd, 2014

Wikidata just got 10 times easier to use by Lydia Pintscher.

From an email post:

We have just deployed the entity suggester. This helps you with suggesting properties. So when you now add a new statement to an item it will suggest what should most likely be added to that item. One example: You are on an item about a person but it doesn’t have a date of birth yet. Since a lot of other items about persons have a date of birth it will suggest you also add one to this item. This will make it a lot easier for you to figure out what the hell is missing on an item and which property to use.

Thank you so much to the student team who worked on this as part of their bachelor thesis over the last months as well as everyone who gave feedback and helped them along the way.

I’m really happy to see this huge improvement towards making Wikidata easier to use. I hope so are you.

I suspect such a suggester for topic map authoring would need to be domain specific but it would certainly be a useful feature.

At least so long as I can say: No more suggestions of X property. 😉

An added wrinkle could be suggested properties and why, from a design standpoint, they could be useful to include.
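A rough sketch of how such a suggester might work for a single domain: count how often each property appears on comparable items and suggest the common ones the current item lacks. The frequency rule, the `min_share` cutoff and the `blocked` set (my stand-in for "no more suggestions of X property") are assumptions for illustration, not how the Wikidata entity suggester is actually implemented.

```python
from collections import Counter


def suggest_properties(item, similar_items, min_share=0.5, blocked=frozenset()):
    """Suggest properties that comparable items have but `item` lacks.

    `item` and each entry of `similar_items` are dicts of property -> value.
    A property is suggested when at least `min_share` of the similar items
    carry it, unless the author has blocked it. Returns (property, share)
    pairs, most common first.
    """
    if not similar_items:
        return []
    counts = Counter()
    for other in similar_items:
        counts.update(set(other))  # count each property once per item
    total = len(similar_items)
    suggestions = [
        (prop, count / total)
        for prop, count in counts.items()
        if prop not in item and prop not in blocked and count / total >= min_share
    ]
    return sorted(suggestions, key=lambda s: (-s[1], s[0]))
```

Returning the share alongside each property gives the author a first answer to "why": most comparable items have it.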

Annotating the news

Monday, June 16th, 2014

Annotating the news: Can online annotation tools help us become better news consumers? by Jihii Jolly.

From the post:

Last fall, Thomas Rochowicz, an economics teacher at Washington Heights Expeditionary Learning School in New York, asked his seniors to research news stories about steroids, drone strikes, and healthcare that could be applied to their class reading of Michael Sandel’s Justice. The students were to annotate their articles using Ponder, a tool that teachers can use to track what their students read and how they react to it. Ponder works as a browser extension that tracks how long a reader spends on a page, and it allows them to make inline annotations, which include highlights, text, and reaction buttons. These allow students to mark points in the article that relate to what they are learning in class—in this case, about economic theories. Responses are aggregated and sent back to the class feed, which the teacher controls.

Interesting piece on the use of annotation software with news stories.

I don’t know how configurable Ponder is in terms of annotation and reporting but being able to annotate web and pdf documents would be a long step towards lay authoring of topic maps.

For example, the “type” of a subject could be selected from a pre-composed list and associations created to map this occurrence of the subject in a particular document, by a particular author, etc. I can’t think of any practical reason to bother the average author with such details. Can you?

Certainly an expert author should have the ability to be less productive and more precise than the average reader but then we are talking about news stories. 😉 How precise does it need to be?

The post also mentions News Genius, which was pointed out to me by Sam Hunting some time ago. It is probably better known for its annotation of rap music at Rap Genius. The only downside I see to Rap/News Genius is that the text to be annotated is loaded onto the site.

That is a disadvantage because if I wanted to create a topic map from annotations of archive files from the New York Times, that would not be possible. Remote annotation and then re-display of those annotations when a text is viewed (by an authorized user) is the sine qua non of topic maps for data resources.
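A minimal sketch of what remote (stand-off) annotation could look like: annotations are stored apart from the text, keyed by URL and character offsets, and are laid over the text only when it is displayed. The record fields, the bracketed display format and the assumption that annotations do not overlap are mine, not features of Ponder or News Genius.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation. All field names are hypothetical."""
    url: str           # where the annotated text lives (it is never copied)
    start: int         # character offset where the annotated span begins
    end: int           # character offset just past the annotated span
    subject_type: str  # chosen from a pre-composed list, e.g. "person"
    author: str        # who made the annotation


def annotations_for(url, store):
    """Select the stored annotations that apply to one document."""
    return [a for a in store if a.url == url]


def render(text, annotations):
    """Re-display text with a [type] marker after each annotated span.

    Markers are inserted from the end of the text backwards so that earlier
    offsets stay valid. Assumes the annotations do not overlap.
    """
    out = text
    for a in sorted(annotations, key=lambda a: a.end, reverse=True):
        out = out[:a.end] + "[" + a.subject_type + "]" + out[a.end:]
    return out
```

Because only the URL and offsets are stored, the archive text stays where it is; the weak point is that the offsets break if the remote text changes.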

Expert vs. Volunteer Semantics

Thursday, April 17th, 2014

The variability of crater identification among expert and community crater analysts by Stuart J. Robbins, et al.

Abstract:


The identification of impact craters on planetary surfaces provides important information about their geological history. Most studies have relied on individual analysts who map and identify craters and interpret crater statistics. However, little work has been done to determine how the counts vary as a function of technique, terrain, or between researchers. Furthermore, several novel internet-based projects ask volunteers with little to no training to identify craters, and it was unclear how their results compare against the typical professional researcher. To better understand the variation among experts and to compare with volunteers, eight professional researchers have identified impact features in two separate regions of the moon. Small craters (diameters ranging from 10 m to 500 m) were measured on a lunar mare region and larger craters (100s m to a few km in diameter) were measured on both lunar highlands and maria. Volunteer data were collected for the small craters on the mare. Our comparison shows that the level of agreement among experts depends on crater diameter, number of craters per diameter bin, and terrain type, with differences of up to ∼±45%. We also found artifacts near the minimum crater diameter that was studied. These results indicate that caution must be used in most cases when interpreting small variations in crater size-frequency distributions and for craters ≤10 pixels across. Because of the natural variability found, projects that emphasize many people identifying craters on the same area and using a consensus result are likely to yield the most consistent and robust information.

The identification of craters on the Moon may seem far removed from your topic map authoring concerns but I would suggest otherwise.

True, the paper is domain specific in some of its concerns (crater age, degradation, etc.) but the most important question was whether volunteers in aggregate could be as useful as experts in the identification of craters.

The authors conclude:

Except near the minimum diameter, volunteers are able to identify craters just as well as the experts (on average) when using the same interface (the Moon Mappers interface), resulting in not only a similar number of craters, but also a similar size distribution. (page 34)

I find that suggestive for mapping semantics because unlike moon craters, what words mean (and implicitly why) is a daily concern for users, including ones in your enterprise.

You can, of course, employ experts to re-interpret what they have been told by some of your users into the expert’s language and produce semantic integration based on the expert’s understanding or mis-understanding of your domain.

Or, you can use your own staff, with experts to facilitate encoding their understanding of your enterprise semantics, as in a topic map.

Recall that the semantics for your enterprise aren’t “out there” in the ether but reside within the staff that make up your enterprise.

I still see an important role for experts but it isn’t as the source of your semantics, rather as the hunters who assist in capturing your semantics.

I first saw this in a tweet by astrobites that led me to: Crowd-Sourcing Crater Identification by Brett Deaton.

Hemingway App

Friday, April 11th, 2014

Hemingway App

We are a long way from something equivalent to Hemingway App for topic maps or other semantic technologies but it struck me that may not always be true.

Take it for a spin and see what you think.

What modifications would be necessary to make this concept work for a semantic technology?

Making Data Classification Work

Friday, April 4th, 2014

Making Data Classification Work by James H. Sawyer.

From the post:

The topic of data classification is one that can quickly polarize a crowd. The one side believes there is absolutely no way to make the classification of data and the requisite protection work — probably the same group that doesn’t believe in security awareness and training for employees. The other side believes in data classification as they are making it work within their environments, primarily because their businesses require it. The difficulty in choosing a side lies in the fact that both are correct.

Apologies, my quoting of James is misleading.

James is addressing the issue of “classification” of data in the sense of keeping information secret.

What is amazing is that the solution James proposes for “classification” in terms of what is kept secret, has a lot of resonance for “classification” in the sense of getting users to manage categories of data or documents.

One hint:

Remember how poorly even librarians use the Library of Congress subject listings? Contrast that with nearly everyone using aisle categories at the local grocery store.

You can design a topic map that experts use poorly, or one that nearly everyone is able to use.

Your call.

How to Quickly Add Nodes and Edges…

Sunday, March 23rd, 2014

How to Quickly Add Nodes and Edges to Graphs

From the webpage:

The existing interfaces for graph manipulation all suffer from the same problem: it’s very difficult to quickly enter the nodes and edges. One has to create a node, then another node, then make an edge between them. This takes a long time and is cumbersome. Besides, such approach is not really as fast as our thinking is.

We, at Nodus Labs, decided to tackle this problem using what we already do well: #hashtagging the @mentions. The basic idea is that you create the nodes and edges in something that we call a “statement”. Within this #statement you can mark the #concepts with #hashtags, which will become nodes and then mark the @contexts or @lists where you want them to appear with @mentions. This way you can create huge graphs in a matter of seconds and if you do not believe us, watch this screencast of our application below.

You can also try it online on or even install it on your local machine using our free open-source repository on

+1! for using “…what we already do well….” for an authoring interface.

Getting any ideas for a topic map authoring interface?
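To make the quoted idea concrete, here is a minimal Python sketch of turning a “statement” into nodes and edges. It is not the Nodus Labs code. The function name `parse_statement` and the rule that every pair of #hashtags in one statement gets an edge are my assumptions for illustration.

```python
import re
from itertools import combinations

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")

def parse_statement(statement):
    """Turn one statement into (nodes, edges, contexts).

    Assumption: every pair of #hashtags that co-occur in a statement
    is joined by an edge. The real application may link them differently.
    """
    # dict.fromkeys drops duplicates while keeping first-seen order
    nodes = list(dict.fromkeys(t.lower() for t in HASHTAG.findall(statement)))
    contexts = list(dict.fromkeys(m.lower() for m in MENTION.findall(statement)))
    edges = list(combinations(nodes, 2))
    return nodes, edges, contexts

nodes, edges, contexts = parse_statement(
    "A #topic has #names and #occurrences @topicmaps"
)
print(nodes)     # ['topic', 'names', 'occurrences']
print(edges)     # [('topic', 'names'), ('topic', 'occurrences'), ('names', 'occurrences')]
print(contexts)  # ['topicmaps']
```

The point for authoring interfaces is that the user types one line in a notation they already know and the structure falls out of it.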

Office Lens Is a Snap (Point and Map?)

Monday, March 17th, 2014

Office Lens Is a Snap

From the post:

The moment mobile-phone manufacturers added cameras to their devices, they stopped being just mobile phones. Not only have lightweight phone cameras made casual photography easy and spontaneous, they also have changed the way we record our lives. Now, with help from Microsoft Research, the Office team is out to change how we document our lives in another way—with the Office Lens app for Windows Phone 8.

Office Lens, now available in the Windows Phone Store, is one of the first apps to use the new OneNote Service API. The app is simple to use: Snap a photo of a document or a whiteboard, and upload it to OneNote, which stores the image in the cloud. If there is text in the uploaded image, OneNote’s cloud-based optical character-recognition (OCR) software turns it into editable, searchable text. Office Lens is like having a scanner in your back pocket. You can take photos of recipes, business cards, or even a whiteboard, and Office Lens will enhance the image and put it into your OneNote Quick Notes for reference or collaboration. OneNote can be downloaded for free.

Less than five (5) years ago, every automated process in Office Lens would have been a configurable setting.

Today, it’s just point and shoot.

There is an interface lesson for topic maps in the Office Lens interface.

Some people will need the Office Lens API. But the rest of us just want to take a picture of the whiteboard (or some other display). Automatic storage and OCR are welcome added benefits.

What about a topic map authoring interface that looks a lot like MS Word™ or Open Office? A topic map is loaded much like a spelling dictionary. When the user selects “map-it,” links are inserted that point into the topic map.

Hover over such a link and data from the topic map is displayed. Can be printed, annotated, etc.

One possible feature would be “subject check,” which displays the subjects “recognized” in the document so that the author can correct any recognition errors.
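A rough Python sketch of how “map-it” and “subject check” might behave, treating the topic map as a dictionary lookup much like a spelling dictionary. Everything here is hypothetical: the `TOPIC_MAP` contents, the example.org identifiers, and the function names are invented for illustration, and real subject recognition would need far more than name matching.

```python
import re

# Hypothetical miniature "topic map": lowercase names -> subject identifier.
TOPIC_MAP = {
    "office lens": "http://example.org/subject/office-lens",
    "onenote": "http://example.org/subject/onenote",
    "ocr": "http://example.org/subject/optical-character-recognition",
}

def _name_pattern(topic_map):
    # Longest names first so a longer name wins over a shorter overlap.
    names = sorted(topic_map, key=len, reverse=True)
    alternatives = "|".join(re.escape(n) for n in names)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

def subject_check(text, topic_map):
    """List (name as written, subject identifier) for each recognized subject."""
    return [(m.group(0), topic_map[m.group(0).lower()])
            for m in _name_pattern(topic_map).finditer(text)]

def map_it(text, topic_map):
    """Insert a link into the topic map after each recognized name."""
    return _name_pattern(topic_map).sub(
        lambda m: f"{m.group(0)} <{topic_map[m.group(0).lower()]}>", text)

text = "Snap the whiteboard with Office Lens and let OCR do the rest."
for name, identifier in subject_check(text, TOPIC_MAP):
    print(name, "->", identifier)
print(map_it(text, TOPIC_MAP))
```

The author never sees the topic map itself, only the list of recognized subjects and a chance to say “no, not that one.”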

In case you are interested, I can point you to some open source projects that have general authoring interfaces. 😉

PS: If you have a Windows phone, can you check out Office Lens for me? I am still sans a cellphone of any type. Since I don’t get out of the yard a cellphone doesn’t make much sense. But I do miss out on the latest cellphone technology. Thanks!

Quizz: Targeted Crowdsourcing…

Friday, March 7th, 2014

Quizz: Targeted Crowdsourcing with a Billion (Potential) Users by Panagiotis G. Ipeirotis and Evgeniy Gabrilovich.


We describe Quizz, a gamified crowdsourcing system that simultaneously assesses the knowledge of users and acquires new knowledge from them. Quizz operates by asking users to complete short quizzes on specific topics; as a user answers the quiz questions, Quizz estimates the user’s competence. To acquire new knowledge, Quizz also incorporates questions for which we do not have a known answer; the answers given by competent users provide useful signals for selecting the correct answers for these questions. Quizz actively tries to identify knowledgeable users on the Internet by running advertising campaigns, effectively leveraging the targeting capabilities of existing, publicly available, ad placement services. Quizz quantifies the contributions of the users using information theory and sends feedback to the advertising system about each user. The feedback allows the ad targeting mechanism to further optimize ad placement.

Our experiments, which involve over ten thousand users, confirm that we can crowdsource knowledge curation for niche and specialized topics, as the advertising network can automatically identify users with the desired expertise and interest in the given topic. We present controlled experiments that examine the effect of various incentive mechanisms, highlighting the need for having short-term rewards as goals, which incentivize the users to contribute. Finally, our cost-quality analysis indicates that the cost of our approach is below that of hiring workers through paid-crowdsourcing platforms, while offering the additional advantage of giving access to billions of potential users all over the planet, and being able to reach users with specialized expertise that is not typically available through existing labor marketplaces.

Crowd sourcing isn’t an automatic slam-dunk but with research like this, it will start moving towards being a repeatable experience.

What do you want to author using a crowd?

I first saw this at Greg Linden’s More quick links.

Crisis News on Twitter

Thursday, March 6th, 2014

Who to Follow on Twitter for Crisis News, Part 2: Venezuela by David Godsall.

From the post:

With political strife dominating so much of our news cycle these past months, and events from Ukraine to Venezuela rapidly unfolding, Twitter is one of the best ways to stay informed in real time. But when social media turns everyone into an information source, it can be a challenge to sort the signal from the noise and figure out who to trust.

To help you find reliable sources for some of the most timely geopolitical news stories, we’ve created a series of Twitter lists compiling trusted journalists, activists and citizens on the ground in the conflict regions. These are the people sharing the most up-to-date information, often from their own first hand experiences. In Part 1 of this series, we talked about sources of news from Ukraine.

Our second list in the series focuses on the events currently taking place in Venezuela:

If you are building a topic map for current events, you need information feeds. Twitter has some suggestions if you want to follow events in the Ukraine or Venezuela.

As with any information feed, use even the best feeds with caution. I saw Henry Kissinger on Charlie Rose. Kissinger was very even handed while Rose was an “America lectures the world” advocate. If you haven’t read The Ugly American by William J. Lederer and Eugene Burdick, you should.

It is a very crowded field for who would qualify as the “ugliest” American these days.

Anonymous Authoring of Topic Maps?

Friday, January 24th, 2014

Arthur D. Santana documents in Virtuous or Vitriolic: The effect of anonymity on civility in online newspaper reader comment boards that anonymity has given online discussion boards their chief characteristic, “rampant incivility.”


In an effort to encourage community dialogue while also building reader loyalty, online newspapers have offered a way for readers to become engaged in the news process, most popularly with online reader comment boards. It is here that readers post their opinion following an online news story, and however much community interaction taking place therein, one thing appears evident: sometimes the comments are civil; sometimes they are not. Indeed, one of the chief defining characteristics of these boards has become the rampant incivility—a dilemma many newspapers have struggled with as they seek to strengthen the value of the online dialogue. Many journalists and industry observers have pointed to a seemingly straightforward reason for the offensive comments: anonymity. Despite the claim, however, there is a striking dearth of empirical evidence in the academic literature of the effect that anonymity has on commenters’ behavior. This research offers an examination of user comments of newspapers that allow anonymity (N=450) and the user comments of newspapers that do not (N=450) and compares the level of civility in both. In each group, comments follow news stories on immigration, a topic prevalent in the news in recent years and which is especially controversial and prone to debate. Results of this quantitative content analysis, useful for journalism practitioners and scholars, provide empirical evidence of the effect that anonymity has on the civility of user comments.

I haven’t surveyed the academic literature specific to online newspaper forums but it is a common experience that shouting from a crowd is one thing. Standing separate and apart as an individual is quite another.

There is a long history of semi-anonymous flame wars conducted in forums and email lists, so the author’s conclusions come as no surprise.

Despite being “old news,” I do think the article raises the question of whether you want to allow anonymous authoring in a shared topic map environment.

Assuming that authors cannot specify merges that damage the ability of the map to function, would you allow anonymous authoring in a shared topic map?

I say “shared” topic map rather than “online” because topic map environments can exist separate from any public-facing network, or even any network at all. The critical question is: with multiple authors, should any of them be able to be anonymous?

I have heard it argued that some analysts, I won’t say what discipline, want to be able to float their ideas anonymously but then also get credit should they be proven to be correct. Anonymity but also tracking at the author’s behest.
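For the curious, one well-known way to get “anonymous now, credit later” is a hash commitment. The Python sketch below is only an illustration under my own assumptions (the function names, the JSON encoding, and the example author label are all invented). The author publishes the statement with the commitment, keeps the nonce private, and reveals the nonce only if they later want credit.

```python
import hashlib
import json
import secrets

def _digest(nonce, author, statement):
    # JSON encoding keeps the three fields unambiguous before hashing.
    payload = json.dumps([nonce, author, statement]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def commit(author, statement):
    """Return (public commitment, private nonce) for an anonymous statement."""
    nonce = secrets.token_hex(16)  # random, so the author cannot be guessed
    return _digest(nonce, author, statement), nonce

def verify(commitment, nonce, author, statement):
    """Check a later claim of authorship against the published commitment."""
    return secrets.compare_digest(_digest(nonce, author, statement), commitment)

commitment, nonce = commit("analyst-7", "Subjects X and Y should merge")
# The statement and commitment are published; the nonce stays with the author.
print(verify(commitment, nonce, "analyst-7", "Subjects X and Y should merge"))
print(verify(commitment, nonce, "someone-else", "Subjects X and Y should merge"))
```

Note that the tracking is entirely at the author’s behest: an author who is proven wrong simply never reveals the nonce.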

If required to build such a system I would, but I would not encourage it.

In part because of the civility issue but also because people should own their ideas, suggestions, statements, etc., and take responsibility for them.

Think of it this way, segregation wasn’t ended by people posting anonymous comments to newspaper forums. Segregation was ended by people of different races and religions owning their words in opposition to segregation, to the point of discrimination, harassment, physical injury and even death.

If you are not that brave, why would anyone want to listen to you?

DARPA’s online games crowdsource software security

Friday, December 6th, 2013

DARPA’s online games crowdsource software security by Kevin McCaney.

From the post:

Flaws in commercial software can cause serious problems if cyberattackers take advantage of them with their increasingly sophisticated bag of tricks. The Defense Advanced Research Projects Agency wants to see if it can speed up discovery of those flaws by making a game of it. Several games, in fact.

DARPA’s Crowd Sourced Formal Verification (CSFV) program has just launched its Verigames portal, which hosts five free online games designed to mimic the formal software verification process traditionally used to look for software bugs.

Verification, both dynamic and static, has proved to be the best way to determine if software is free of flaws, but it requires software engineers to perform “mathematical theorem-proving techniques” that can be time-consuming, costly and unable to scale to the size of some of today’s commercial software, according to DARPA. With Verigames, the agency is testing whether untrained (and unpaid) users can verify the integrity of software more quickly and less expensively.

“We’re seeing if we can take really hard math problems and map them onto interesting, attractive puzzle games that online players will solve for fun,” Drew Dean, DARPA program manager, said in announcing the portal launch. “By leveraging players’ intelligence and ingenuity on a broad scale, we hope to reduce security analysts’ workloads and fundamentally improve the availability of formal verification.”

If program verification is possible with online games, I don’t know of any principled reason why topic map authoring should not be possible.

Maybe fill-in-the-blank topic map authoring is just a poor authoring technique for topic maps.

Imagine gamifying data streams to be like Missile Command. 😉

Can you even count the number of hours that you played Missile Command?

Now consider the impact of a topic map authoring interface that addictive.

Particularly if the user didn’t know they were doing useful work.

Google’s R Style Guide [TM Guides?]

Monday, December 2nd, 2013

Google’s R Style Guide

From the webpage:

R is a high-level programming language used primarily for statistical computing and graphics. The goal of the R Programming Style Guide is to make our R code easier to read, share, and verify. The rules below were designed in collaboration with the entire R user community at Google.

Useful if you are trying to develop good R coding habits from the start.

Makes me wonder about a similar need for topic maps authors? At least on a project by project basis.

If I am always representing marital status as an occurrence on a topic, that isn’t going to fit well with another author who always uses associations to represent marriages.

There could be compelling reasons in a project for choosing one or the other.

Similar questions will come up with other subjects and relationships as well.

It won’t be 100% but best to try to get everyone off on the same foot and to validate output against your local authoring guidelines.
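Validating output against local guidelines could be as simple as the Python sketch below. The guideline table, the subject type names, and the `(author, subject type, construct)` tuple layout are all made up for illustration. A real project would check its own serialization format.

```python
# Hypothetical project guideline: which construct represents which subject type.
GUIDELINES = {
    "marital-status": "association",
    "date-of-birth": "occurrence",
}

def check_guidelines(statements, guidelines):
    """Return statements whose construct differs from the project guideline.

    Each statement is an (author, subject_type, construct) tuple.
    Subject types the guidelines do not mention are passed over silently.
    """
    violations = []
    for author, subject_type, construct in statements:
        expected = guidelines.get(subject_type)
        if expected is not None and construct != expected:
            violations.append((author, subject_type, construct, expected))
    return violations

authored = [
    ("pat", "marital-status", "occurrence"),   # conflicts with the guideline
    ("sam", "marital-status", "association"),
    ("sam", "shoe-size", "occurrence"),        # not covered by the guideline
]
for author, subject_type, used, expected in check_guidelines(authored, GUIDELINES):
    print(f"{author}: {subject_type} authored as {used}, guideline says {expected}")
```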

Fair Use Prevails!

Thursday, November 14th, 2013

Google wins book-scanning case: judge finds “fair use,” cites many benefits by Jeff John Roberts.

From the post:

Google has won a resounding victory in its eight-year copyright battle with the Authors Guild over the search giant’s controversial decision to scan more than 20 million library books and make them available on the internet.

In a ruling issued Thursday morning in New York, US Circuit Judge Denny Chin said the book scanning amounted to fair use because it was “highly transformative” and because it didn’t harm the market for the original work.

“Google Books provides significant public benefits,” writes Chin, describing it as “an essential research tool” and noting that the scanning service has expanded literary access for the blind and helped preserve the text of old books from physical decay.

Chin also rejected the theory that Google was depriving authors of income, noting that the company does not sell the scans or make whole copies of books available. He concluded, instead, that Google Books served to help readers discover new books and amounted to “new income from authors.”


In case you are interested in “why” Google prevailed: The Authors Guild, Inc., et al. vs. Google, Inc.

Sets an important precedent for topic maps that extract small portions of print or electronic works for presentation to users.

Especially works that sit on library shelves, waiting for their copyright imprisonment to end.

On-demand Synonym Extraction Using Suffix Arrays

Saturday, August 24th, 2013

On-demand Synonym Extraction Using Suffix Arrays by Minoru Yoshida, Hiroshi Nakagawa, and Akira Terada. (Yoshida, M., Nakagawa, H. & Terada, A. (2013). On-demand Synonym Extraction Using Suffix Arrays. Information Extraction from the Internet. ISBN: 978-1463743994. iConcept Press. Retrieved from

From the introduction:

The amount of electronic documents available on the World Wide Web (WWW) is continuously growing. The situation is the same in a limited part of the WWW, e.g., Web documents from specific web sites such as ones of some specific companies or universities, or some special-purpose web sites such as, etc. This chapter mainly focuses on such a limited-size corpus. Automatic analysis of this large amount of data by text-mining techniques can produce useful knowledge that is not found by human efforts only.

We can use the power of on-memory text mining for such a limited-size corpus. Fast search for required strings or words available by putting whole documents on memory contributes to not only speeding up of basic search operations like word counting, but also making possible more complicated tasks that require a number of search operations. For such advanced text-mining tasks, this chapter considers the problem of extracting synonymous strings for a query given by users. Synonyms, or paraphrases, are words or phrases that have the same meaning but different surface strings. “HDD” and “hard drive” in documents related to computers and “BBS” and “message boards” in Web pages are examples of synonyms. They appear ubiquitously in different types of documents because the same concept can often be described by two or more expressions, and different writers may select different words or phrases to describe the same concept. In such cases, the documents that include the string “hard drive” might not be found by if the query “HDD” is used, which results in a drop in the coverage of the search system. This could become a serious problem, especially for searches of limited-size corpora. Therefore, being able to find such synonyms significantly improves the usability of various systems. Our goal is to develop an algorithm that can find strings synonymous with the user input. The applications of such an algorithm include augmenting queries with synonyms in information retrieval or text-mining systems, and assisting input systems by suggesting expressions similar to the user input.
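If suffix arrays are new to you, the Python sketch below shows only the basic operation the introduction leans on: fast lookup of every occurrence of a string in an in-memory text. It is not the authors’ synonym extraction method, and the construction used here (sorting whole suffixes) is the simplest one, not an efficient one.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, sorted lexicographically.

    Simple but slow construction; fine for a small illustration.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, query):
    """Binary-search the suffix array for every position where query occurs."""
    k = len(query)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + k]

    # Lower bound: first suffix whose k-character prefix is >= query.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose k-character prefix is > query.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "hdd and hard drive; the hdd failed"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "hdd"))         # [0, 24]
print(find_occurrences(text, sa, "hard drive"))  # [8]
```

Once every occurrence of a query is that cheap to find, comparing the contexts around different strings becomes practical, which is the sort of task the chapter goes on to describe.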

The authors concede the results of their method are inferior to the best results of other synonym extraction methods but go on to say:

However, note that the main advantage of our method is not its accuracy, but its ability to extract synonyms of any query without a priori construction of thesauri or preprocessing using other linguistic tools like POS taggers or dependency parsers, which are indispensable for previous methods.

An important point to remember about all semantic technologies. How appropriate a technique is for your project depends on your requirements, not qualities of a technique in the abstract.

Technique N may not support machine reasoning but sending coupons to mobile phones “near” a restaurant doesn’t require that overhead. (Neither does standing outside the restaurant with flyers.)

Choose semantic techniques based on their suitability for your purposes.

Topic Map Patterns?

Wednesday, August 7th, 2013

A comment yesterday:

However, the first step would be to create a catalog of common topic map structures or patterns. It seems like such a catalog could eventually enable automated or computer assisted construction of topic maps to supplement hand-editing of topic maps. Hand-editing is a necessary first step but it does not scale well. Imagine how few applications there would be now if everything had to be coded in assembler. Or how few databases would there be now if everyone had to build their own out of B-Trees. (Carl)

resonated when I was writing an entry about computational linguistics today.

I think Carl is right, people don’t create their own databases out of B-Trees.

But by the same token, they don’t forge completely new patterns of speaking either.

I don’t know what the numbers are, but how many original constructions in your native language do you use every day? Particularly in a professional setting?

Rather than looking for “topic map” patterns, shouldn’t we be looking for speech patterns in particular user communities?

Such that our interfaces, when set to a particular community, can automatically parse input into a topic map.

Not unconstrained subject recognition, but using language patterns to capture some percentage of subjects automatically rather than relying on the user to identify them.
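A toy Python sketch of what I have in mind. The two patterns stand in for the habitual phrasing of one imagined community, and the association type names and output layout are invented for illustration. Sentences that match no pattern are simply left for a human.

```python
import re

# Hypothetical speech patterns for one user community, each paired with
# the association type it signals.
PATTERNS = [
    (re.compile(r"^(?P<a>[\w .]+?) reports to (?P<b>[\w .]+?)\.?$", re.IGNORECASE),
     "reports-to"),
    (re.compile(r"^(?P<a>[\w .]+?) is married to (?P<b>[\w .]+?)\.?$", re.IGNORECASE),
     "married-to"),
]

def parse_sentence(sentence):
    """Return an association for a sentence matching a known pattern, else None."""
    for pattern, association_type in PATTERNS:
        match = pattern.match(sentence.strip())
        if match:
            return {
                "type": association_type,
                "roles": (match.group("a").strip(), match.group("b").strip()),
            }
    return None  # unrecognized: leave it for the author

for line in ["Alice Smith reports to Bob Jones.",
             "Pat Lee is married to Sam Lee",
             "The weather was fine."]:
    print(parse_sentence(line))
```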

Interactive Entity Resolution in Relational Data… [NG Topic Map Authoring]

Wednesday, June 5th, 2013

Interactive Entity Resolution in Relational Data: A Visual Analytic Tool and Its Evaluation by Hyunmo Kang, Lise Getoor, Ben Shneiderman, Mustafa Bilgic, Louis Licamele.


Databases often contain uncertain and imprecise references to real-world entities. Entity resolution, the process of reconciling multiple references to underlying real-world entities, is an important data cleaning process required before accurate visualization or analysis of the data is possible. In many cases, in addition to noisy data describing entities, there is data describing the relationships among the entities. This relational data is important during the entity resolution process; it is useful both for the algorithms which determine likely database references to be resolved and for visual analytic tools which support the entity resolution process. In this paper, we introduce a novel user interface, D-Dupe, for interactive entity resolution in relational data. D-Dupe effectively combines relational entity resolution algorithms with a novel network visualization that enables users to make use of an entity’s relational context for making resolution decisions. Since resolution decisions often are interdependent, D-Dupe facilitates understanding this complex process through animations which highlight combined inferences and a history mechanism which allows users to inspect chains of resolution decisions. An empirical study with 12 users confirmed the benefits of the relational context visualization on the performance of entity resolution tasks in relational data in terms of time as well as users’ confidence and satisfaction.

Talk about a topic map authoring tool!

Even chains entity resolution decisions together!

Not to be greedy, but interactive data deduplication and integration in Hadoop would be a nice touch. 😉

Software: D-Dupe: A Novel Tool for Interactive Data Deduplication and Integration.
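To give a feel for why relational context helps, here is a small Python sketch that blends name similarity with the overlap of related entities. It is not D-Dupe’s algorithm. The record layout, the `coauthors` field, the invented names, and the equal weighting are all assumptions for illustration.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Attribute evidence: string similarity of two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def context_similarity(neighbors_a, neighbors_b):
    """Relational evidence: Jaccard overlap of the entities each reference touches."""
    a, b = set(neighbors_a), set(neighbors_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def match_score(ref_a, ref_b, weight=0.5):
    """Blend both kinds of evidence; the weight is illustrative only."""
    return (weight * name_similarity(ref_a["name"], ref_b["name"])
            + (1 - weight) * context_similarity(ref_a["coauthors"], ref_b["coauthors"]))

full = {"name": "John Smith", "coauthors": {"Ito", "Okafor", "Varga"}}
same = {"name": "J. Smith", "coauthors": {"Ito", "Okafor"}}
other = {"name": "J. Smith", "coauthors": {"Petrov"}}

# Identical names, but only one shares collaborators with "John Smith".
print(match_score(full, same))
print(match_score(full, other))
```

The two “J. Smith” references are indistinguishable by name alone. The shared collaborators are what separate them, and that is the sort of evidence a person can judge at a glance in a network view.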

An introduction to Emacs Lisp

Saturday, June 1st, 2013

An introduction to Emacs Lisp by Christian Johansen.

From the webpage:

As a long-time passionate Emacs user, I’ve been curious about Lisp in general and Emacs Lisp in particular for quite some time. Until recently I had not written any Lisp apart from my .emacs.d setup, despite having read both An introduction to programming in Emacs Lisp and The Little Schemer last summer. A year later, I have finally written some Lisp, and I thought I’d share the code as an introduction to others out there curious about Lisp and extending Emacs.


The Task

The task I set out to solve was to make Emacs slightly more intelligent when working with tests written in Buster.JS, which is a test framework for JavaScript I’m working on with August Lilleaas. In particular I wanted Emacs to help me with Buster’s concept of deferred tests.

Yesterday a graph programmer suggested to me that some people program in Lisp but the whole world uses Java.

Of course, most of the world is functionally illiterate too but I don’t take that as an argument for illiteracy.

Not to cast aspersions on Java, a great deal of excellent work is done in Java. (See the many Apache projects that use Java.)

But counting noses is a lemming measure, which is not related to the pros or cons of any particular language.

What topic map authoring tasks would you extend Emacs to facilitate?

I first saw this in Christophe Lalanne’s A bag of tweets / May 2013.

Topic Maps in Lake Wobegon

Wednesday, May 15th, 2013

Jim Harris writes in The Decision Wobegon Effect:

In his book The Most Human Human, Brian Christian discussed what Baba Shiv of the Stanford Graduate School of Business called the decision dilemma, “where there is no objectively best choice, where there are simply a number of subjective variables with trade-offs between them. The nature of the situation is such that additional information probably won’t even help. In these cases – consider the parable of the donkey that, halfway between two bales of hay and unable to decide which way to walk, starves to death – what we want, more than to be correct, is to be satisfied with our choice (and out of the dilemma).”


Jim describes the Wobegon effect, an effect that blinds decision makers to alternative bales of hay.

Topic maps are composed of a mass of decisions, both large and small.

Is the Wobegon effect affecting your topic map authoring?

Check Jim’s post and think about your topic map authoring practices.

The Amateur Data Scientist and Her Projects

Saturday, April 20th, 2013

The Amateur Data Scientist and Her Projects by Vincent Granville.

From the post:

With so much data available for free everywhere, and so many open tools, I would expect to see the emergence of a new kind of analytic practitioner: the amateur data scientist.

Just like the amateur astronomer, the amateur data scientist will significantly contribute to the art and science, and will eventually solve mysteries. Could the Boston bomber be found thanks to thousands of amateurs analyzing publicly available data (images, videos, tweets, etc.) with open source tools? After all, amateur astronomers have been able to detect exoplanets and much more.

Also, just like the amateur astronomer only needs one expensive tool (a good telescope with data recording capabilities), the amateur data scientist only needs one expensive tool (a good laptop and possibly subscription to some cloud storage/computing services).

Amateur data scientists might earn money from winning Kaggle contests, working on problems such as identifying a Botnet, explaining the stock market flash crash, defeating Google page-ranking algorithms, helping find new complex molecules to fight cancer (analytical chemistry), predicting solar flares and their intensity. Interested in becoming an amateur data scientist? Here’s a first project for you, to get started:

Amateur data scientist, I rather like the sound of that.

And it would be an intersection of interests and talents, just like professional data scientists.

Vincent’s example of posing entry level problems is a model I need to follow for topic maps.

Amateur topic map authors?


Saturday, April 20th, 2013

PhenoMiner: quantitative phenotype curation at the rat genome database by Stanley J. F. Laulederkind, (Database (2013) 2013 : bat015 doi: 10.1093/database/bat015)


The Rat Genome Database (RGD) is the premier repository of rat genomic and genetic data and currently houses >40 000 rat gene records as well as human and mouse orthologs, >2000 rat and 1900 human quantitative trait loci (QTLs) records and >2900 rat strain records. Biological information curated for these data objects includes disease associations, phenotypes, pathways, molecular functions, biological processes and cellular components. Recently, a project was initiated at RGD to incorporate quantitative phenotype data for rat strains, in addition to the currently existing qualitative phenotype data for rat strains, QTLs and genes. A specialized curation tool was designed to generate manual annotations with up to six different ontologies/vocabularies used simultaneously to describe a single experimental value from the literature. Concurrently, three of those ontologies needed extensive addition of new terms to move the curation forward. The curation interface development, as well as ontology development, was an ongoing process during the early stages of the PhenoMiner curation project.

Database URL:

The line:

A specialized curation tool was designed to generate manual annotations with up to six different ontologies/vocabularies used simultaneously to describe a single experimental value from the literature.

sounded relevant to topic maps.

Turns out to be five ontologies and the article reports:

The ‘Create Record’ page (Figure 4) is where the rest of the data for a single record is entered. It consists of a series of autocomplete text boxes, drop-down text boxes and editable plain text boxes. All of the data entered are associated with terms from five ontologies/vocabularies: RS, CMO, MMO, XCO and the optional MA (Mouse Adult Gross Anatomy Dictionary) (13)

Important to note that authoring does not require the user to make explicit the properties underlying any of the terms from the different ontologies.

Some users probably know that level of detail but what is important is the capturing of their knowledge of subject sameness.

A topic map extension/add-on to such a system could flesh out those bare terms to provide a basis for treating terms from different ontologies as terms for the same subjects.

That merging/mapping detail need not bother an author or casual user.

But it increases the odds that future data sets can be reliably integrated with this one.

And issues with the correctness of a mapping can be meaningfully investigated.

If it helps, think of correctness of mapping as accountability, for someone else.
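A bare-bones Python sketch of such an add-on: terms from different vocabularies carry subject identifiers, terms that share an identifier are merged, and each mapping records who asserted it. The identifiers, curator names, and class layout are hypothetical and are not taken from RGD or any of the ontologies named above.

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    names: set = field(default_factory=set)
    identifiers: set = field(default_factory=set)
    asserted_by: set = field(default_factory=set)  # who vouched for the mapping

def merge_topics(topics):
    """Merge topics that share at least one subject identifier."""
    merged = []  # invariant: topics in this list never share an identifier
    for topic in topics:
        current = Topic(set(topic.names), set(topic.identifiers), set(topic.asserted_by))
        remaining = []
        for other in merged:
            if current.identifiers & other.identifiers:
                current.names |= other.names
                current.identifiers |= other.identifiers
                current.asserted_by |= other.asserted_by
            else:
                remaining.append(other)
        remaining.append(current)
        merged = remaining
    return merged

terms = [
    Topic({"heart rate"}, {"vocab-a:0001", "http://example.org/subject/heart-rate"}, {"curator-1"}),
    Topic({"pulse rate"}, {"vocab-b:0042", "http://example.org/subject/heart-rate"}, {"curator-2"}),
    Topic({"body weight"}, {"vocab-a:0002"}, {"curator-1"}),
]
for topic in merge_topics(terms):
    print(sorted(topic.names), "asserted by", sorted(topic.asserted_by))
```

If the “heart rate” and “pulse rate” mapping turns out to be wrong, the `asserted_by` field says who to ask about it.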