Archive for the ‘Annotation’ Category

Merge 5 Proxies, Take Away 1 Proxy = ? [Data Provenance]

Monday, September 5th, 2016

Provenance for Database Transformations by Val Tannen. (video)


Database transformations (queries, views, mappings) take apart, filter, and recombine source data in order to populate warehouses, materialize views, and provide inputs to analysis tools. As they do so, applications often need to track the relationship between parts and pieces of the sources and parts and pieces of the transformations’ output. This relationship is what we call database provenance.

This talk presents an approach to database provenance that relies on two observations. First, provenance is a kind of annotation, and we can develop a general approach to annotation propagation that also covers other applications, for example to uncertainty and access control. In fact, provenance turns out to be the most general kind of such annotation, in a precise and practically useful sense. Second, the propagation of annotation through a broad class of transformations relies on just two operations: one when annotations are jointly used and one when they are used alternatively. This leads to annotations forming a specific algebraic structure, a commutative semiring.

The semiring approach works for annotating tuples, field values and attributes in standard relations, in nested relations (complex values), and for annotating nodes in (unordered) XML. It works for transformations expressed in the positive fragment of relational algebra, nested relational calculus, unordered XQuery, as well as for Datalog, GLAV schema mappings, and tgd constraints. Finally, when properly extended to semimodules it works for queries with aggregates. Specific semirings correspond to earlier approaches to provenance, while others correspond to forms of uncertainty, trust, cost, and access control.
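To make the two operations in the abstract concrete, here is a minimal sketch in Python. It assumes two annotated relations R(a, b) and S(b, c) and the query Q(a, c) that joins them on b and projects b away. The names Semiring, join_project, COUNTING, BOOLEAN and COST are my own illustrations, not anything taken from the talk, and the sample annotations are made up. Joint use (the join) goes through times, alternative use (two derivations of the same output tuple) goes through plus.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class Semiring:
    zero: Any                         # annotation of a tuple that is absent
    one: Any                          # neutral annotation for joint use
    plus: Callable[[Any, Any], Any]   # alternative use (union, projection)
    times: Callable[[Any, Any], Any]  # joint use (join)


# Three illustrative commutative semirings (hypothetical names).
COUNTING = Semiring(0, 1, lambda x, y: x + y, lambda x, y: x * y)
BOOLEAN = Semiring(False, True, lambda x, y: x or y, lambda x, y: x and y)
COST = Semiring(float("inf"), 0, min, lambda x, y: x + y)

Relation = Dict[Tuple[str, str], Any]


def join_project(k: Semiring, r: Relation, s: Relation) -> Relation:
    """Q(a, c): join R(a, b) with S(b, c) on b, then project b away."""
    out: Relation = {}
    for (a, b), x in r.items():
        for (b2, c), y in s.items():
            if b == b2:
                joint = k.times(x, y)  # both source tuples are needed
                # another derivation of (a, c) is an alternative
                out[(a, c)] = k.plus(out.get((a, c), k.zero), joint)
    return out


R_KEYS = [("a", "b1"), ("a", "b2")]
S_KEYS = [("b1", "c"), ("b2", "c")]


def annotate(keys, values):
    """Attach one annotation to each tuple."""
    return dict(zip(keys, values))


counts = join_project(COUNTING, annotate(R_KEYS, [2, 1]), annotate(S_KEYS, [3, 1]))
exists = join_project(BOOLEAN, annotate(R_KEYS, [True, True]), annotate(S_KEYS, [True, False]))
cheapest = join_project(COST, annotate(R_KEYS, [2, 1]), annotate(S_KEYS, [3, 5]))

print(counts)    # {('a', 'c'): 7}
print(exists)    # {('a', 'c'): True}
print(cheapest)  # {('a', 'c'): 5}
```

The query code never changes; only the semiring does. Note that nothing here subtracts, which is exactly why "take away 1 proxy" is the harder question.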

What happens when you subtract from a merge? (Referenced here as an “aggregation.”)

Although it is possible to paw through logs to puzzle out a result, Val suggests there are more robust methods at our disposal.

I watched this over the weekend and be forewarned, heavy sledding ahead!

This is an active area of research and I have only begun to scratch the surface for references.

I may discover differently, but the “aggregation” I have seen thus far relies on opaque strings.

Not that all uses of opaque strings are inappropriate, but imagine the power of treating a token as an opaque string for one use case and exploding that same token into key/value pairs for another.


Collaborative Annotation for Scientific Data Discovery and Reuse [+ A Stumbling Block]

Thursday, July 2nd, 2015

Collaborative Annotation for Scientific Data Discovery and Reuse by Kirk Borne.

From the post:

The enormous growth in scientific data repositories requires more meaningful indexing, classification and descriptive metadata in order to facilitate data discovery, reuse and understanding. Meaningful classification labels and metadata can be derived autonomously through machine intelligence or manually through human computation. Human computation is the application of human intelligence to solving problems that are either too complex or impossible for computers. For enormous data collections, a combination of machine and human computation approaches is required. Specifically, the assignment of meaningful tags (annotations) to each unique data granule is best achieved through collaborative participation of data providers, curators and end users to augment and validate the results derived from machine learning (data mining) classification algorithms. We see very successful implementations of this joint machine-human collaborative approach in citizen science projects such as Galaxy Zoo and the Zooniverse.

In the current era of scientific information explosion, the big data avalanche is creating enormous challenges for the long-term curation of scientific data. In particular, the classic librarian activities of classification and indexing become insurmountable. Automated machine-based approaches (such as data mining) can help, but these methods only work well when the classification and indexing algorithms have good training sets. What happens when the data includes anomalous patterns or features that are not represented in the training collection? In such cases, human-supported classification and labeling become essential – humans are very good at pattern discovery, detection and recognition. When the data volumes reach astronomical levels, it becomes particularly useful, productive and educational to crowdsource the labeling (annotation) effort. The new data objects (and their associated tags) then become new training examples, added to the data mining training sets, thereby improving the accuracy and completeness of the machine-based algorithms.

Kirk goes on to say:

…it is incumbent upon science disciplines and research communities to develop common data models, taxonomies and ontologies.

Sigh, but we know from experience that has never worked. True, we can develop more common data models, taxonomies and ontologies, but they will be in addition to the present common data models, taxonomies and ontologies. Not to mention that developing knowledge is going to lead to future common data models, taxonomies and ontologies.

If you don’t believe me, take a look at: Library of Congress Subject Headings Tentative Monthly List 07 (July 17, 2015). These subject headings have not yet been approved but they are in addition to existing subject headings.

The most recent approved list: Library of Congress Subject Headings Monthly List 05 (May 18, 2015). For approved lists going back to 1997, see: Library of Congress Subject Headings (LCSH) Approved Lists.

Unless you are working in some incredibly static and sterile field, the basic terms that are found in “common data models, taxonomies and ontologies” are going to change over time.

The only sure bet in the area of knowledge and its classification is that change is coming.

But, Kirk is right, common data models, taxonomies and ontologies are useful. So how do we make them more useful in the face of constant change?

Why not use topics to model elements/terms of common data models, taxonomies and ontologies? That would enable users to search across such elements/terms by the properties of those topics, possibly discovering topics that represent the same subject under a different term or element.

Imagine working on an update of a common data model, taxonomy or ontology and not having to guess at the meaning of bare elements or terms? A wealth of information, including previous elements/terms for the same subject, would be present at each topic.
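A rough sketch of what that could look like, in Python. It assumes each term is a topic carrying a set of names and a set of identifying properties, and that two topics sharing any identifier represent the same subject. The functions merge_topics and find_by_name, the "ex:" identifiers and the sample headings are all made up for illustration; none of this is a real vocabulary or a topic map API.

```python
from typing import Dict, List, Optional, Set

Topic = Dict[str, Set[str]]


def merge_topics(topics: List[Topic]) -> List[Topic]:
    """Merge topics that share at least one identifier."""
    merged: List[Topic] = []
    for topic in topics:
        names = set(topic["names"])
        ids = set(topic["identifiers"])
        untouched: List[Topic] = []
        for other in merged:
            if other["identifiers"] & ids:
                # same subject: absorb its names and identifiers
                names |= other["names"]
                ids |= other["identifiers"]
            else:
                untouched.append(other)
        untouched.append({"names": names, "identifiers": ids})
        merged = untouched
    return merged


def find_by_name(topics: List[Topic], name: str) -> Optional[Topic]:
    """Look a subject up by any of its current or previous terms."""
    for topic in topics:
        if name in topic["names"]:
            return topic
    return None


# Hypothetical data: an older term, an unrelated term, a newer term.
terms = [
    {"names": {"Cookery"}, "identifiers": {"ex:subject/1"}},
    {"names": {"Astronomy"}, "identifiers": {"ex:subject/2"}},
    {"names": {"Cooking"}, "identifiers": {"ex:subject/1", "ex:subject/1-rev"}},
]

merged = merge_topics(terms)
print(sorted(find_by_name(merged, "Cookery")["names"]))  # ['Cookery', 'Cooking']
```

A user who only knows the older term lands on the same topic as a user who only knows the newer one.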

All of the benefits that Kirk claims would accrue, plus empowering users who only know previous common data models, taxonomies and ontologies, to say nothing of easing the transition to future common data models, taxonomies and ontologies.

Knowledge isn’t static. Our methodologies for knowledge classification should be as dynamic as the knowledge we seek to classify.

Unstructured Topic Map-Like Data Powering AI

Monday, March 23rd, 2015

Artificial Intelligence Is Almost Ready for Business by Brad Power.

From the post:

Such mining of digitized information has become more effective and powerful as more info is “tagged” and as analytics engines have gotten smarter. As Dario Gil, Director of Symbiotic Cognitive Systems at IBM Research, told me:

“Data is increasingly tagged and categorized on the Web – as people upload and use data they are also contributing to annotation through their comments and digital footprints. This annotated data is greatly facilitating the training of machine learning algorithms without demanding that the machine-learning experts manually catalogue and index the world. Thanks to computers with massive parallelism, we can use the equivalent of crowdsourcing to learn which algorithms create better answers. For example, when IBM’s Watson computer played ‘Jeopardy!,’ the system used hundreds of scoring engines, and all the hypotheses were fed through the different engines and scored in parallel. It then weighted the algorithms that did a better job to provide a final answer with precision and confidence.”

Granted, the tagging and annotation are unstructured, unlike a topic map, but they are also unconstrained by first order logic and other crippling features of RDF and OWL. Out of that mass of annotations, algorithms can construct useful answers.

Imagine what non-experts (Stanford logic refugees need not apply) could author about your domain, to be fed into an AI algorithm. That would take more effort than relying upon users chancing upon subjects of interest but it would also give you greater precision in the results.

Perhaps, just perhaps, one of the errors in the early topic maps days was the insistence on high editorial quality at the outset, as opposed to allowing editorial quality to emerge out of data.

As an editor I’m far more in favor of the former than the latter, but seeing the latter work makes me doubt that stringent editorial control is the only path to an acceptable degree of editorial quality.

What would a rough-cut topic map authoring interface look like?


Data as “First Class Citizens”

Tuesday, February 10th, 2015

Data as “First Class Citizens” by Łukasz Bolikowski, Nikos Houssos, Paolo Manghi, Jochen Schirrwagen.

The guest editorial to D-Lib Magazine, January/February 2015, Volume 21, Number 1/2, introduces a collection of articles focusing on data as “first class citizens,” saying:

Data are an essential element of the research process. Therefore, for the sake of transparency, verifiability and reproducibility of research, data need to become “first-class citizens” in scholarly communication. Researchers have to be able to publish, share, index, find, cite, and reuse research data sets.

The 2nd International Workshop on Linking and Contextualizing Publications and Datasets (LCPD 2014), held in conjunction with the Digital Libraries 2014 conference (DL 2014), took place in London on September 12th, 2014 and gathered a group of stakeholders interested in growing a global data publishing culture. The workshop started with invited talks from Prof. Andreas Rauber (Linking to and Citing Data in Non-Trivial Settings), Stefan Kramer (Linking research data and publications: a survey of the landscape in the social sciences), and Dr. Iain Hrynaszkiewicz (Data papers and their applications: Data Descriptors in Scientific Data). The discussion was then organized into four full-paper sessions, exploring orthogonal but still interwoven facets of current and future challenges for data publishing: “contextualizing and linking” (Semantic Enrichment and Search: A Case Study on Environmental Science Literature and A-posteriori Provenance-enabled Linking of Publications and Datasets via Crowdsourcing), “new forms of publishing” (A Framework Supporting the Shift From Traditional Digital Publications to Enhanced Publications and Science 2.0 Repositories: Time for a Change in Scholarly Communication), “dataset citation” (Data Citation Practices in the CRAWDAD Wireless Network Data Archive, A Methodology for Citing Linked Open Data Subsets, and Challenges in Matching Dataset Citation Strings to Datasets in Social Science) and “dataset peer-review” (Enabling Living Systematic Reviews and Clinical Guidelines through Semantic Technologies and Data without Peer: Examples of Data Peer Review in the Earth Sciences).

We believe these investigations provide a rich overview of current issues in the field, by proposing open problems, solutions, and future challenges. In short they confirm the urgent and fascinating demands of research data publishing.

The only stumbling point in this collection of essays is the notion of data as “First Class Citizens.” Not that I object to a catchy title but not all data is going to be equal when it comes to first class citizenship.

Take Semantic Enrichment and Search: A Case Study on Environmental Science Literature, for example. Great essay on using multiple sources to annotate entities once disambiguation had occurred. But some entities are going to have more annotations than others and some entities may not be recognized at all. Not to mention it is rather doubtful that the markup containing those entities was annotated at all.

Are we sure we want to exclude from data the formats that contain the data? Isn’t a format a form of data? As well as the instructions for processing that data? Perhaps not in every case but shouldn’t data and the formats that hold data be equally treated as first class citizens? I am mindful that hundreds of thousands of people saw the pyramids being built but we have not one accurate report on the process.

Will the lack of that one accurate report deny us access to data quite skillfully preserved in a format that is no longer accessible to us?

While I support the cry for all data to be “first class citizens,” I also support a very broad notion of data to avoid overlooking data that may be critical in the future.

Web Annotation Data Model [First Public Working Draft]

Friday, December 12th, 2014

Web Annotation Data Model [First Public Working Draft]

Web Annotation Principles

The Web Annotation Data Model is defined using the following basic principles:

  • An Annotation is a resource that represents the link between resources, or a selection within a resource.
  • There are two types of participating resources, Bodies and Targets.
  • Annotations have 0..n Bodies.
  • Annotations have 1..n Targets.
  • The content of the Body resources is related to, and typically “about”, the content of the Target resources.
  • Annotations, Bodies and Targets may have their own properties and relationships, typically including provenance information and descriptive metadata.
  • The intent behind the creation of an Annotation is an important property, and is identified by a Motivation resource.

The following principles describe additional distinctions needed regarding the exact nature of Target and Body:

  • The Target or Body resource may be more specific than the entity identified by the resource’s URI alone.
  • In particular,
    • The Target or Body resource may be a specific segment of a resource.
    • The Target or Body resource may be a resource with a specific style.
    • The Target or Body resource may be a resource in a specific context or container.
    • The Target or Body resource may be any combination of the above.
  • The identity of the specific resource is separate from the description of how to obtain the specific resource.
  • The specific resource is derived from a resource identified by a URI.

The following principles describe additional semantics regarding multiple resources:

  • A resource may be a choice between multiple resources.
  • A resource may be a unique, unordered set of resources.
  • A resource may be an ordered list of resources.
  • These resources may be used anywhere a resource may be used.
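The principles above can be sketched as a small data structure. This is a minimal sketch in Python that assumes plain dataclasses rather than the draft's own vocabulary or serialization. The class and field names (Annotation, SpecificResource, targets, bodies, motivation, segment, style, context) and the example URIs are my own stand-ins for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class SpecificResource:
    """A resource more specific than its URI alone."""
    source: str                    # URI the specific resource is derived from
    segment: Optional[str] = None  # a specific segment of the resource
    style: Optional[str] = None    # a specific style
    context: Optional[str] = None  # a specific context or container


# A participating resource is either a bare URI or a SpecificResource.
Resource = Union[str, SpecificResource]


@dataclass
class Annotation:
    targets: List[Resource]                               # 1..n Targets
    bodies: List[Resource] = field(default_factory=list)  # 0..n Bodies
    motivation: Optional[str] = None                      # intent behind the annotation

    def __post_init__(self) -> None:
        if len(self.targets) < 1:
            raise ValueError("an Annotation needs at least one Target")


# A highlight needs no Body: one Target, zero Bodies.
highlight = Annotation(
    targets=[SpecificResource(source="http://example.org/report", segment="page=33")],
    motivation="highlighting",
)

# A comment: one Body that is "about" the Target.
comment = Annotation(
    targets=["http://example.org/report"],
    bodies=["http://example.org/notes/1"],
    motivation="commenting",
)

print(len(highlight.bodies), len(comment.bodies))  # 0 1
```

The cardinalities are the only constraint enforced here. The multiple-resource principles (choice, set, ordered list) are left out of the sketch.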

Take the time to read and comment!

If you wish to make comments regarding this document, please send them to (subscribe, archives), or use the specification annotation tool by selecting some text and activating the sidebar. (emphasis added)

That’s right! You can annotate the annotation draft itself. Unfortunate that more standards organizations don’t offer that type of commenting facility by default.

Although transclusion would be a better solution, annotations may offer a way to finally break through the document wall that conceals information. For example, making a reference to the Senate Report on CIA torture, page 33, means that I have to look up that document, locate page 33 and then pair your comment to that page. (Easier than manual retrieval but still less than optimal.)

Say you wanted to comment on:

After the July 2002 meetings, the CIA’s (deleted) CTC Legal, (deleted) , drafted a letter to Attorney General John Ashcroft asking the Department of Justice for “a formal declination of prosecution, in advance, for any employees of the United States, as well as any other personnel acting on behalf of the United States, who may employ methods in the interrogation of Abu Zubaydah that otherwise might subject those individuals to prosecution.”*

To supply the information that has been deleted in that sentence and then to share that annotation with others, such that when they view that document, the information appears as annotations on the deleted portions of text.
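As a sketch of what pointing below the document level involves, here is one way to pin an annotation to a particular “(deleted)” in that sentence, in Python. It assumes the annotation records the exact text plus a little surrounding context, so that repeated strings can be told apart. The function locate and its prefix/suffix parameters are hypothetical names of mine, not the draft's selector mechanism, and the note text is a placeholder.

```python
from typing import Optional, Tuple


def locate(text: str, exact: str, prefix: str = "", suffix: str = "") -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first occurrence of `exact` that is
    immediately preceded by `prefix` and followed by `suffix`."""
    start = 0
    while True:
        i = text.find(exact, start)
        if i == -1:
            return None  # no occurrence matches the context
        j = i + len(exact)
        if text[:i].endswith(prefix) and text[j:].startswith(suffix):
            return (i, j)
        start = i + 1


# Shortened excerpt of the sentence quoted above.
excerpt = "the CIA's (deleted) CTC Legal, (deleted) , drafted a letter"

first = locate(excerpt, "(deleted)")
second = locate(excerpt, "(deleted)", prefix="CTC Legal, ")

# An annotation is then a span plus whatever the annotator supplies.
note = {"span": second, "body": "placeholder for the supplied text"}
print(first, second)
```

Anyone holding the same text and the same note can recompute the span, so the note can be shared and displayed over the deleted portion.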

Or to annotate various lies being told by former Vice President Cheney and others with pointers to the Senate CIA torture report.

The power of annotation breaks the barrier that documents pose to the composition of a sub-document portion of information with other information.

If you thought we needed librarians when organizing information at the document level, just imagine how badly we will need them when organization of information is at the sub-document level.

* So much for torture being the solution when a bomb is ticking in a school. Anyone would use whatever means necessary to stop such a bomb and accept the consequences of their actions. Requests for legal immunity demonstrate those involved in U.S. sponsored torture were not only ineffectual but cowards as well.

Video: Experts Share Perspectives on Web Annotation

Thursday, November 20th, 2014

Video: Experts Share Perspectives on Web Annotation by Gary Price.

From the post:

The topic of web annotation continues to grow in interest and importance.

Here’s how the World Wide Web Consortium (W3C) describes the topic:

Web annotations are an attempt to recreate and extend that functionality as a new layer of interactivity and linking on top of the Web. It will allow anyone to annotate anything anywhere, be it a web page, an ebook, a video, an image, an audio stream, or data in raw or visualized form. Web annotations can be linked, shared between services, tracked back to their origins, searched and discovered, and stored wherever the author wishes; the vision is for a decentralized and open annotation infrastructure.

A Few Examples

In recent weeks and months a W3C Web Annotation working group got underway,, a company that has been working in this area for several years (and one we’ve mentioned several times on infoDOCKET) formally launched a web annotation extension for Chrome, the Mellon Foundation awarded $750,000 in research funding, and The Journal of Electronic Publishing began offering annotation for each article in the publication.

New Video

Today, posted a 15 minute video (embedded below) where several experts share some of their perspectives (Why the interest in the topic? Biggest Challenges, Future Plans, etc.) on the topic of web annotation.

The video was recorded at the recent W3C TPAC 2014 Conference in Santa Clara, CA.

I am puzzled by more than one speaker on the video referring to the lack of robust addressing as a reason why annotation has not succeeded in the past. Perhaps they are unaware of the XLink and XPointer work at the W3C? Or HyTime for that matter?

True, none of those efforts were widely supported but that doesn’t mean that robust addressing wasn’t available.

I for one will be interested in comparing the capabilities of prior efforts against what is marketed as “new web annotation” capabilities.

Annotation, particularly what was known as “extended XLinks,” is very important for the annotation of materials to which you don’t have read/write access. Think about annotating texts distributed by a vendor on DVD. Or annotating texts that are delivered over a web stream. A separate third-party value-add product. Like a topic map for instance.

See videos from I Annotate 2014

Supporting Open Annotation

Friday, October 10th, 2014

Supporting Open Annotation by Gerben.

From the post:

In its mission to connect the world’s knowledge and thoughts, the solution pursues is a web-wide mechanism to create, share and discover annotations. One of our principal steps towards this end is providing a browser add-on that works with our annotation server, enabling people to read others’ annotations on any web page they visit, and to publish their own annotations for others to see.

I spent my summer interning at to work towards a longer term goal, taking annotation sharing to the next level: an open, decentralised approach. In this post I will describe how such an approach could work, how standardisation efforts are taking off to get us there, and how we are involved in making this happen – the first step being support for the preliminary Open Annotation data model.

An annotation ecosystem

While we are glad to provide a service enabling people to create and share annotations on the web, we by no means want to become the sole annotation service provider, as this would imply creating a monopoly position and a single point of failure. Rather, we encourage anyone to build annotation tools and services, possibly using the same code we use. Of course, a problematic consequence of having multiple organisations each running separate systems is that even more information silos emerge on the web, and quite likely the most popular service would obtain a monopoly position after all.

To prevent either fragmentation or monopolisation of the world’s knowledge, we would like an ecosystem to evolve, comprising interoperable annotation services and client implementations. Such an ecosystem would promote freedom of innovation, prevent dependence on a single party, and provide scalability and robustness. It would be like the architecture of the web itself.

Not a bad read if you accept the notion that interoperable annotation servers are an acceptable architecture for annotation of web resources.

Me? I would just as soon put:


on my annotation and mail the URL to the CIA, FBI, NSA and any foreign intelligence agencies that I can think of with a copy of my annotation.

You can believe that government agencies will follow the directives of Congress with regard to spying on United States citizens, but then that was already against the law. Remember the old saying, “Fool me once, shame on you. Fool me twice, shame on me.”? That is applicable to government surveillance.

We need robust annotation mechanisms but not ones that make sitting targets out of our annotations. Local, encrypted annotation mechanisms that can cooperate with other local, encrypted annotation mechanisms would be much more attractive to me.

I first saw this in a tweet by Ivan Herman.

Who Dat?

Sunday, August 24th, 2014


From the about page:

Dat is a grant funded, open source project housed in the US Open Data Institute. While dat is a general purpose tool, we have a focus on open science use cases.

The high level goal of the dat project is to build a streaming interface between every database and file storage backend in the world. By building tools to build and share data pipelines we aim to bring to data a style of collaboration similar to what git brings to source code.

The first alpha release is now out!

More on this project later this coming week.

I first saw this in Nat Torkington’s Four short links: 21 August 2014.

Web Annotation Working Group (Preventing Semantic Rot)

Wednesday, August 20th, 2014

Web Annotation Working Group

From the post:

The W3C Web Annotation Working Group is chartered to develop a set of specifications for an interoperable, sharable, distributed Web annotation architecture. The chartered specs consist of:

  1. Abstract Annotation Data Model
  2. Data Model Vocabulary
  3. Data Model Serializations
  5. Client-side API

The working group intends to use the Open Annotation Data Model and Open Annotation Extension specifications, from the W3C Open Annotation Community Group, as a starting point for development of the data model specification.

The Robust Link Anchoring specification will be jointly developed with the WebApps WG, where many client-side experts and browser implementers participate.

Some good news for the middle of a week!

Shortcomings to watch for:

Can annotations be annotated?

Can non-Web addressing schemes be used by annotators?

Can the structure of files (visible or not) in addition to content be annotated?

If we don’t have all three of those capabilities, then the semantics of annotations will rot, just as semantics of earlier times have rotted away. The main distinction is that most of our ancestors didn’t choose to allow the rot to happen.

I first saw this in a tweet by Rob Sanderson.

Web Annotation Working Group Charter

Wednesday, July 23rd, 2014

Web Annotation Working Group Charter

From the webpage:

Annotating, which is the act of creating associations between distinct pieces of information, is a widespread activity online in many guises but currently lacks a structured approach. Web citizens make comments about online resources using either tools built into the hosting web site, external web services, or the functionality of an annotation client. Readers of ebooks make use of the tools provided by reading systems to add and share their thoughts or highlight portions of texts. Comments about photos on Flickr, videos on YouTube, audio tracks on SoundCloud, people’s posts on Facebook, or mentions of resources on Twitter could all be considered to be annotations associated with the resource being discussed.

The possibility of annotation is essential for many application areas. For example, it is standard practice for students to mark up their printed textbooks when familiarizing themselves with new materials; the ability to do the same with electronic materials (e.g., books, journal articles, or infographics) is crucial for the advancement of e-learning. Submissions of manuscripts for publication by trade publishers or scientific journals undergo review cycles involving authors and editors or peer reviewers; although the end result of this publishing process usually involves Web formats (HTML, XML, etc.), the lack of proper annotation facilities for the Web platform makes this process unnecessarily complex and time consuming. Communities developing specifications jointly, and published, eventually, on the Web, need to annotate the documents they produce to improve the efficiency of their communication.

There is a large number of closed and proprietary web-based “sticky note” and annotation systems offering annotation facilities on the Web or as part of ebook reading systems. A common complaint about these is that the user-created annotations cannot be shared, reused in another environment, archived, and so on, due to a proprietary nature of the environments where they were created. Security and privacy are also issues where annotation systems should meet user expectations.

Additionally, there are the related topics of comments and footnotes, which do not yet have standardized solutions, and which might benefit from some of the groundwork on annotations.

The goal of this Working Group is to provide an open approach for annotation, making it possible for browsers, reading systems, JavaScript libraries, and other tools, to develop an annotation ecosystem where users have access to their annotations from various environments, can share those annotations, can archive them, and use them how they wish.

Depending on how fine grained you want your semantics, annotation is one way to convey them to others.

Unfortunately, looking at the starting point for this working group, “open” means RDF, OWL and other non-commercially adopted technologies from the W3C.

Defining the ability to point, using XQuery perhaps, and reserving to users the ability to create standards for annotation payloads would be a much more “open” approach. That is an approach you are unlikely to see from the W3C.

I would be more than happy to be proven wrong on that point.

Which gene did you mean?

Wednesday, July 16th, 2014

Which gene did you mean? by Barend Mons.


Computational Biology needs computer-readable information records. Increasingly, meta-analysed and pre-digested information is being used in the follow up of high throughput experiments and other investigations that yield massive data sets. Semantic enrichment of plain text is crucial for computer aided analysis. In general people will think about semantic tagging as just another form of text mining, and that term has quite a negative connotation in the minds of some biologists who have been disappointed by classical approaches of text mining. Efforts so far have tried to develop tools and technologies that retrospectively extract the correct information from text, which is usually full of ambiguities. Although remarkable results have been obtained in experimental circumstances, the widespread use of information mining tools is lagging behind earlier expectations. This commentary proposes to make semantic tagging an integral process to electronic publishing.

From within the post:

If all words had only one possible meaning, computers would be perfectly able to analyse texts. In reality however, words, terms and phrases in text are highly ambiguous. Knowledgeable people have few problems with these ambiguities when they read, because they use context to disambiguate ‘on the fly’. Even when fed a lot of semantically sound background information, however, computers currently lag far behind humans in their ability to interpret natural language. Therefore, proper semantic tagging of concepts in texts is crucial to make Computational Biology truly viable. Electronic Publishing has so far only scratched the surface of what is needed.

Open Access publication shows great potential, and is essential for effective information mining, but it will not achieve its full potential if information continues to be buried in plain text. Having true semantic mark up combined with open access for mining is an urgent need to make possible a computational approach to life sciences.

Creating semantically enriched content as part and parcel of the publication process should be a winning strategy.

First, for current data, estimates of what others will be searching for should not be hard to find out. That will help focus tagging on the material users are seeking. Second, a current and growing base of enriched material will help answer questions about the return on enriching material.

Other suggestions for BMC Bioinformatics?

Domeo and Utopia for PDF…

Tuesday, July 1st, 2014

Domeo and Utopia for PDF, Achieving annotation interoperability by Paolo Ciccarese.

From the description:

The Annotopia Open Annotation universal Hub allows to achieve annotation interoperability between different annotation clients. This is a first small demo where the annotations created with the Domeo Web Annotation Tool can be seen by the users of the Utopia for PDF application.

The demonstration shows highlighting of text and attachment of a note to an HTML page in a web browser and then the same document is loaded as PDF and the highlighting and note appear as specified in the HTML page.

The Domeo Web Annotation Tool appears to have the capacity to be a topic map authoring tool against full text.

Definite progress on the annotation front!

Next question is how do we find all the relevant annotations despite differences in user terminology? Same problem that we have with searching but in annotations instead of the main text.

You could start from some location in the text but I’m not sure all users will annotate the same material. Some may comment on the article in general; others will annotate very specific text.

Definitely a topic map issue both in terms of subjects in the text as well as in the annotations.

Project

Monday, June 30th, 2014

Project

From the post:

The Project is pleased to announce an award for $752,000 from the Andrew W. Mellon Foundation to investigate the use of annotation in humanities and social science scholarship over a two year period. Our partners in this grant include Michigan Publishing at the University of Michigan; Project MUSE at the Johns Hopkins University; Project Scalar at USC; Stanford University’s Shared Canvas; the Modern Language Association, and the Open Knowledge Foundation. In addition, we will be working with the World Wide Web Consortium (W3C) and edX/HarvardX to explore integration into other environments with high user interaction.

This grant was established to address potential impediments in the arts and humanities which could retard the adoption of open standards. These barriers range from the prevalence of more tradition-bound forms of communication and publishing; the absence of pervasive experimentation with network-based models of sharing and knowledge extraction; the difficulties of automating description for arts and disciplines of practice; and the reliance on information dense media such as images, audio, and video. Nonetheless, we believe that with concerted work among our partners, alongside groups making steady progress in the annotation community, we can unite useful threads, bringing the arts and humanities to a point where self-sustaining interest in annotation can be reached.

The project is also seeking donations of time and expertise. Subject identity is always in play with annotation projects.

Are you familiar with this project?

Annotating the news

Monday, June 16th, 2014

Annotating the news: Can online annotation tools help us become better news consumers? by Jihii Jolly.

From the post:

Last fall, Thomas Rochowicz, an economics teacher at Washington Heights Expeditionary Learning School in New York, asked his seniors to research news stories about steroids, drone strikes, and healthcare that could be applied to their class reading of Michael Sandel’s Justice. The students were to annotate their articles using Ponder, a tool that teachers can use to track what their students read and how they react to it. Ponder works as a browser extension that tracks how long a reader spends on a page, and it allows them to make inline annotations, which include highlights, text, and reaction buttons. These allow students to mark points in the article that relate to what they are learning in class—in this case, about economic theories. Responses are aggregated and sent back to the class feed, which the teacher controls.

Interesting piece on the use of annotation software with news stories.

I don’t know how configurable Ponder is in terms of annotation and reporting, but being able to annotate web and PDF documents would be a long step towards lay authoring of topic maps.

For example, the “type” of a subject could be selected from a pre-composed list and associations created to map this occurrence of the subject in a particular document, by a particular author, etc. I can’t think of any practical reason to bother the average author with such details. Can you?
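To make that concrete, here is a minimal sketch of what could sit behind such an interface. Everything in it is hypothetical: the type list, the `Occurrence` record, and the `annotate` helper are illustrative names, not anything Ponder or a topic map engine provides. The reader picks a type from the list; the document and author are recorded without asking.

```python
from dataclasses import dataclass

# Hypothetical pre-composed list of subject types offered to the reader.
SUBJECT_TYPES = ("person", "organization", "place", "event")


@dataclass(frozen=True)
class Occurrence:
    """One occurrence of a subject in a particular document by a particular author."""
    subject: str
    subject_type: str
    document: str
    author: str


def annotate(subject: str, subject_type: str, document: str, author: str) -> Occurrence:
    """Record an occurrence; the type must come from the pre-composed list."""
    if subject_type not in SUBJECT_TYPES:
        raise ValueError(f"unknown subject type: {subject_type!r}")
    return Occurrence(subject, subject_type, document, author)


# The reader only chooses "person"; document and author are filled in for them.
occurrence = annotate(
    "Michael Sandel", "person", "https://example.org/news/story-1", "A. Reporter"
)
```

The associations (subject to document, document to author) fall out of the record itself, so the average reader never has to see them.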

Certainly an expert author should have the ability to be less productive and more precise than the average reader but then we are talking about news stories. 😉 How precise does it need to be?

The post also mentions News Genius, which was pointed out to me by Sam Hunting some time ago. Probably better known for its annotation of rap music at Rap Genius. The only downside I see to Rap/News Genius is that the text to be annotated is loaded onto the site.

That is a disadvantage because if I wanted to create a topic map from annotations of archive files from the New York Times, that would not be possible. Remote annotation and then re-display of those annotations when a text is viewed (by an authorized user) is the sine qua non of topic maps for data resources.
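A minimal sketch of what I mean by remote annotation, assuming notes are stored apart from the text and keyed by the document's address. The store, the function names, and the example URL are all hypothetical; anchoring by quoted text is just one simple choice among several.

```python
# Notes live in a separate store, keyed by document URL; the text itself is never copied.
store: dict[str, list[dict]] = {}


def add_note(url: str, quote: str, note: str) -> None:
    """Attach a note to a quoted passage of a remote document."""
    store.setdefault(url, []).append({"quote": quote, "note": note})


def notes_for(url: str, text: str, user: str, authorized: set[str]) -> list[tuple[int, int, str]]:
    """Re-display: locate each stored quote in the text as it is being viewed."""
    if user not in authorized:
        return []
    found = []
    for entry in store.get(url, []):
        start = text.find(entry["quote"])
        if start != -1:  # skip notes whose quote no longer appears in the text
            found.append((start, start + len(entry["quote"]), entry["note"]))
    return found


url = "https://example.org/archive/1999/story.html"  # hypothetical address
add_note(url, "Drone strikes", "Compare with the class reading.")
viewed_text = "Drone strikes were debated in Congress."
spans = notes_for(url, viewed_text, "pat", authorized={"pat"})
```

The archive keeps its files; the annotations travel separately and are matched up only when an authorized user views the text.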

GATE 8.0

Monday, May 12th, 2014

GATE (general architecture for text engineering) 8.0

From the download page:

Release 8.0 (May 11th 2014)

Most users should download the installer package (~450MB):

If the installer does not work for you, you can download one of the following packages instead. See the user guide for installation instructions:

The BIN, SRC and ALL packages all include the full set of GATE plugins and all the libraries GATE requires to run, including sample trained models for the LingPipe and OpenNLP plugins.

Version 8.0 requires Java 7 or 8, and Mac users must install the full JDK, not just the JRE.

Four major changes in this release:

  1. Requires Java 7 or later to run
  2. Tools for Twitter.
  3. ANNIE (named entity annotation pipeline) Refreshed.
  4. Tools for Crowd Sourcing.

Not bad for a project that will turn twenty (20) next year!

More resources:


Nightly Snapshots

Mastering a substantial portion of GATE should keep you in nearly constant demand.

brat rapid annotation tool

Sunday, May 11th, 2014

brat rapid annotation tool

From the introduction:

brat is a web-based tool for text annotation; that is, for adding notes to existing text documents.

brat is designed in particular for structured annotation, where the notes are not free form text but have a fixed form that can be automatically processed and interpreted by a computer.
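As a rough illustration of what “fixed form” means, here is a small sketch that reads notes written one per line as an identifier, a type with character offsets, and the covered text, separated by tabs. The layout is loosely modeled on brat's stand-off files, but treat the details as my assumption and check the brat documentation before relying on them; the `TextBound` class and `parse_text_bound` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBound:
    """One fixed-form note: an id, a type, and a character span in the document."""
    ann_id: str
    ann_type: str
    start: int
    end: int
    text: str


def parse_text_bound(line: str) -> TextBound:
    """Parse one tab-separated line: id, 'Type start end', covered text."""
    ann_id, type_and_span, text = line.rstrip("\n").split("\t")
    ann_type, start, end = type_and_span.split(" ")
    return TextBound(ann_id, ann_type, int(start), int(end), text)


document = "Alice works for Acme."
lines = ["T1\tPerson 0 5\tAlice", "T2\tOrganization 16 20\tAcme"]
notes = [parse_text_bound(line) for line in lines]

# Because the form is fixed, a program can check each note against the text.
for note in notes:
    assert document[note.start:note.end] == note.text
```

Free-form comments could not be checked this way, which is the point of structured annotation.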

The examples page has examples of:

  • Entity mention detection
  • Event extraction
  • Coreference resolution
  • Normalization
  • Chunking
  • Dependency syntax
  • Meta-knowledge
  • Information extraction
  • Bottom-up Metaphor annotation
  • Visualization
  • Information Extraction system evaluation

I haven’t installed the local version but it is on my to-do list.

I first saw this in a tweet by Steven Bird.

Annotating, Extracting, and Linking Legal Information

Sunday, April 20th, 2014

Annotating, Extracting, and Linking Legal Information by Adam Wyner. (slides)

Great slides, provided you have enough background in the area to fill in the gaps.

I first saw this at: Wyner: Annotating, Extracting, and Linking Legal Information, which has collected up the links/resources mentioned in the slides.

Despite decades of electronic efforts and several centuries of manual effort before that, legal information retrieval remains an open challenge.

tagtog: interactive and text-mining-assisted annotation…

Monday, April 14th, 2014

tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles by Juan Miguel Cejuela, et al.


The breadth and depth of biomedical literature are increasing year upon year. To keep abreast of these increases, FlyBase, a database for Drosophila genomic and genetic information, is constantly exploring new ways to mine the published literature to increase the efficiency and accuracy of manual curation and to automate some aspects, such as triaging and entity extraction. Toward this end, we present the ‘tagtog’ system, a web-based annotation framework that can be used to mark up biological entities (such as genes) and concepts (such as Gene Ontology terms) in full-text articles. tagtog leverages manual user annotation in combination with automatic machine-learned annotation to provide accurate identification of gene symbols and gene names. As part of the BioCreative IV Interactive Annotation Task, FlyBase has used tagtog to identify and extract mentions of Drosophila melanogaster gene symbols and names in full-text biomedical articles from the PLOS stable of journals. We show here the results of three experiments with different sized corpora and assess gene recognition performance and curation speed. We conclude that tagtog-named entity recognition improves with a larger corpus and that tagtog-assisted curation is quicker than manual curation.


Encouraging because the “tagging” is not wholly automated nor is it wholly hand-authored. Rather the goal is to create an interface that draws on the strengths of automated processing as moderated by human expertise.

Annotation remains at a document level, which consigns subsequent users to mining full text but this is definitely a step in the right direction.

The GATE Crowdsourcing Plugin:…

Monday, March 24th, 2014

The GATE Crowdsourcing Plugin: Crowdsourcing Annotated Corpora Made Easy by Kalina Bontcheva, Ian Roberts, Leon Derczynski, and Dominic Rout.


Crowdsourcing is an increasingly popular, collaborative approach for acquiring annotated corpora. Despite this, reuse of corpus conversion tools and user interfaces between projects is still problematic, since these are not generally made available. This demonstration will introduce the new, open-source GATE Crowd-sourcing plugin, which offers infrastructural support for mapping documents to crowdsourcing units and back, as well as automatically generating reusable crowd-sourcing interfaces for NLP classification and selection tasks. The entire work-flow will be demonstrated on: annotating named entities; disambiguating words and named entities with respect to DBpedia URIs; annotation of opinion holders and targets; and sentiment.

From the introduction:

A big outstanding challenge for crowdsourcing projects is that the cost to define a single annotation task remains quite substantial. This demonstration will introduce the new, open-source GATE Crowdsourcing plugin, which offers infrastructural support for mapping documents to crowdsourcing units, as well as automatically generated, reusable user interfaces [1] for NLP classification and selection tasks. Their use will be demonstrated on annotating named entities (selection task), disambiguating words and named entities with respect to DBpedia URIs (classification task), annotation of opinion holders and targets (selection task), as well as sentiment (classification task).
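To picture “mapping documents to crowdsourcing units” and back, here is a minimal sketch of my own. It is not the GATE plugin's API; `to_units`, `to_document_span`, and the naive sentence splitting are assumptions made for illustration. Each unit remembers its offset in the source document so a worker's selection can be translated back.

```python
import re


def to_units(doc_id: str, text: str) -> list[dict]:
    """Split a document into sentence-sized units, remembering each unit's offset."""
    units = []
    for match in re.finditer(r"[^.!?\s][^.!?]*[.!?]?", text):
        units.append({"doc": doc_id, "offset": match.start(), "text": match.group()})
    return units


def to_document_span(unit: dict, start: int, end: int) -> tuple[int, int]:
    """Map a worker's selection inside a unit back to document offsets."""
    return unit["offset"] + start, unit["offset"] + end


text = "GATE is from Sheffield. brat is web-based."
units = to_units("doc-1", text)

# A worker marks the first four characters of the second unit as a named entity.
doc_start, doc_end = to_document_span(units[1], 0, 4)
selected = text[doc_start:doc_end]
```

A real pipeline would use a proper sentence splitter, but the bookkeeping (unit offset plus selection offset) is the part that makes the round trip possible.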


Are the difficulties associated with annotation UIs a matter of creating the UI or the choices that underlie the UI?

This plugin may shed light on possible answers to that question.

Annotation Use Cases

Friday, March 14th, 2014

Annotation Use Cases

From the Introduction:

Annotation is a pervasive activity when reading or otherwise engaging with publications. In the physical world, highlighting and sticky notes are common paradigms for marking up and associating one’s own content with the work being read, and many digital solutions exist in the same space. These digital solutions are, however, not interoperable between systems, even when there is only one user with multiple devices.

This document lays out the use cases for annotations on digital publications, as envisioned by the W3C Digital Publishing Interest Group, the W3C Open Annotation Community Group and the International Digital Publishing Forum. The use cases are provided as a means to drive forwards the conversation about standards in this arena.

Just for the record, all of these use cases and more were doable with HyTime more than twenty (20) years ago.

The syntax was ugly but the underlying concepts are as valid now as they were then. Something to keep in mind while watching this activity.


Saturday, March 8th, 2014


From the “Features” page:

Performance analysis made easy

LongoMatch has been designed to be very easy to use, exposing the basic functionalities of video analysis in an intuitive interface. Tagging, playback and edition of stored events can be easily done from the main window, while more specific features can be accessed through menus when needed.

Flexible and customizable for all sports

LongoMatch can be used for any kind of sports, allowing to create custom templates with an unlimited number of tagging categories. It also supports defining custom subcategories and creating templates for your teams with detailed information of each player which is the perfect combination for a fine-grained performance analysis.

Post-match and real time analysis

LongoMatch can be used for post-match analysis supporting the most common video formats as well as for live analysis, capturing from Firewire, USB video capturers, IP cameras or without any capture device at all, decoupling the capture process from the analysis, but having it ready as soon as the recording is done. With live replay, without stopping the capture, you can review tagged events and export them while still analyzing the game live.

Although pitched as software for analyzing sports events, it occurs to me this could be useful in a number of contexts.

Such as analyzing news footage of police encounters with members of the public.

Or video footage of particular locations. Foot or vehicle traffic.

The possibilities are endless.

Then it’s just a question of tying that information together with data from other information feeds. 😉

News Genius

Saturday, February 8th, 2014

News Genius (about page)

From the webpage:

What is News Genius?

News Genius helps you make sense of the news by putting stories in context, breaking down subtext and bias, and crowdsourcing knowledge from around the world!

You can find speeches, interviews, articles, recipes, and even sports news, from yesterday and today, all annotated by the community and verified experts. With everything from Eisenhower speeches to reports on marijuana arrest horrors, you can learn about politics, current events, the world stage, and even meatballs!

Who writes the annotations?

Anyone can! Just create an account and start annotating. You can highlight any line to annotate it yourself, suggest changes to existing annotations, and even put up your favorite texts. Getting started is very easy. If you make good contributions, you’ll earn News IQ™, and if you share true knowledge, eventually you’ll be able to edit and annotate anything on the site.

How do I make verified annotations on my own work?

Verified users are experts in the news community. This includes journalists, like Spencer Ackerman, groups like the ACLU and Smart Chicago Collaborative, and even U.S. Geological Survey. Interested in getting you or your group verified? Sign up and request your verified account!

Sam Hunting forwarded this to my attention.

Interesting interface.

Assuming that you created associations between the text and annotator without bothering the author, this would work well for some aspects of a topic map interface.

I did run into the problem that whose annotation becomes “the annotation” depends on who gets there first. If you pick text that has already been annotated, at most you can post a suggestion or vote it up or down.

BTW, this started as a music site so when you search for topics, there are a lot of rap, rock and poetry hits. Not so many news “hits.”

You can imagine my experience when I searched for “markup” and “semantics.”

I probably need to use more common words. 😉

I don’t know the history of the site, but other than the no-more-than-one-annotation rule, you can certainly get started quickly creating and annotating content.

That is a real plus over many of the interfaces I have seen.


PS: The only-one-annotation rule is all the more annoying when you find that very few Jimi Hendrix songs have any parts that are not annotated. 🙁

A Case Study on Legal Case Annotation

Friday, October 18th, 2013

A Case Study on Legal Case Annotation by Adam Wyner, Wim Peters, and Daniel Katz.


The paper reports the outcomes of a study with law school students to annotate a corpus of legal cases for a variety of annotation types, e.g. citation indices, legal facts, rationale, judgement, cause of action, and others. An online tool is used by a group of annotators that results in an annotated corpus. Differences amongst the annotations are curated, producing a gold standard corpus of annotated texts. The annotations can be extracted with semantic searches of complex queries. There would be many such uses for the development and analysis of such a corpus for both legal education and legal research.

author = {Adam Wyner and Wim Peters and Daniel Katz},
title = {A Case Study on Legal Case Annotation},
booktitle = {Proceedings of 26th International Conference on Legal Knowledge and Information Systems (JURIX 2013)},
year = {2013},
pages = {??-??},
address = {Amsterdam},
publisher = {IOS Press}

The methodology and results of this study will be released as open source resources.

A gold standard for annotation of legal texts will create the potential for automated tools to assist lawyers, judges and possibly even lay people.

Deeply interested to see where this project goes next.

Docear 1.0 (stable),…

Thursday, October 17th, 2013

Docear 1.0 (stable), a new video, new manual, new homepage, new details page, … by Joeran Beel.

From the post:

It’s been almost two years since we released the first private Alpha of Docear and today, October 17 2013, Docear 1.0 (stable) is finally available for Windows, Mac, and Linux to download. We are really proud of what we accomplished in the past years and we think that Docear is better than ever. In addition to all the enhancements we made during the past years, we completely rewrote the manual with step-by-step instructions including an overview of supported PDF viewers, we changed the homepage, we created a new video, and we made the features & details page much more comprehensive. For those who already use Docear 1.0 RC4, there are not many changes (just a few bug fixes). For new users, we would like to explain what Docear is and what makes it so special.

Docear is a unique solution to academic literature management that helps you to organize, create, and discover academic literature. The three most distinct features of Docear are:

  1. A single-section user-interface that differs significantly from the interfaces you know from Zotero, JabRef, Mendeley, Endnote, … and that allows a more comprehensive organization of your electronic literature (PDFs) and the annotations you created (i.e highlighted text, comments, and bookmarks).
  2. A ‘literature suite concept’ that allows you to draft and write your own assignments, papers, theses, books, etc. based on the annotations you previously created.
  3. A research paper recommender system that allows you to discover new academic literature.

Aside from Docear’s unique approach, Docear offers many features more. In particular, we would like to point out that Docear is free, open source, not evil, and Docear gives you full control over your data. Docear works with standard PDF annotations, so you can use your favorite PDF viewer. Your reference data is directly stored as BibTeX (a text-based format that can be read by almost any other reference manager). Your drafts and folders are stored in Freeplane’s XML format, again a text-based format that is easy to process and understood by several other applications. And although we offer several online services such as PDF metadata retrieval, backup space, and online viewer, we do not force you to register. You can just install Docear on your computer, without any registration, and use 99% of Docear’s functionality.

But let’s get back to Docear’s unique approach for literature management…

Impressive “academic literature management” package!

I have done a lot of research over the years but unaided in large part by citation management software. Perhaps it is time to try a new approach.

Just scanning the documentation it does not appear that I can share my Docear annotations with another user.

Unless we were fortunate enough to have used the same terminology the same way while doing our research.

That is to say any research project I undertake will result in the building of a silo that is useful to me, but that others will have to duplicate.

If true (I have only scanned the documentation), that is an observation and not a criticism.

I will keep track of my experience with a view towards suggesting changes that could make Docear more transparent.

W3C Open Annotation: Status and Use Cases

Tuesday, June 25th, 2013

W3C Open Annotation: Status and Use Cases by Robert Sanderson and Paolo Ciccarese.

Presentation slides from OAI8: Innovations in Scholarly Communication, June 19-21 2013, Geneva, Switzerland.

For more details about the OpenAnnotation group:

Annotation, particularly if data storage becomes immutable, will become increasingly important.

Perhaps a revival of HyTime-based addressing or a robust version of XLink is in the offing.

As we have recently learned from the NSA, “web scale” data isn’t very much data at all.

Our addressing protocols should not be limited to any particular data subset.

Smart Visualization Annotation

Wednesday, June 19th, 2013

Smart Visualization Annotation by Enrico Bertini.

From the post:

There are three research papers which have drawn my attention lately. They all deal with automatic annotation of data visualizations, that is, adding labels to the visualization automatically.

It seems to me that annotations, as an integral part of a visualization design, have received somewhat little attention in comparison to other components of a visual representation (shapes, layouts, colors, etc.). A quick check in the books I have in my bookshelf kind of support my hypothesis. The only exception I found is Colin Ware’s Information Visualization book, which has a whole section on “Linking Text with Graphical Elements“. This is weird because, think about it, text is the most powerful means we have to bridge the semantic gap between the visual representation and its interpretation. With text we can clarify, explain, give meaning, etc.

Smart annotations is an interesting area of research because, not only it can reduce the burden of manually annotating a visualization but it can also reveal interesting patterns and trends we might not know about or, worse, may get unnoticed. Here are the three papers (click on the images to see a higher resolution version).


What do you make of: “…bridge the semantic gap between the visual representation and its interpretation.”?

Is there a gap between the “visual representation and its interpretation,” or is there a semantic gap between multiple observers of a visual representation?

I ask because I am not sure annotations (text) limit the range of interpretation unless the observers are already very close in world views.

That is, text cannot command us to accept interpretations unless we are already disposed to accept them.

I commend all three papers to you for a close reading.

Textual Processing of Legal Cases

Tuesday, June 4th, 2013

Textual Processing of Legal Cases by Adam Wyner.

A presentation on Adam’s Crowdsourced Legal Case Annotation project.

Very useful if you are interested in guidance on legal case annotation.

Of course I see the UI as using topics behind the UI’s identifications and associations between those topics.

But none of that has to be exposed to the user.

Contextifier: Automatic Generation of Annotated Stock Visualizations

Sunday, May 12th, 2013

Contextifier: Automatic Generation of Annotated Stock Visualizations by Jessica Hullman, Nicholas Diakopoulos and Eytan Adar.


Online news tools—for aggregation, summarization and automatic generation—are an area of fruitful development as reading news online becomes increasingly commonplace. While textual tools have dominated these developments, annotated information visualizations are a promising way to complement articles based on their ability to add context. But the manual effort required for professional designers to create thoughtful annotations for contextualizing news visualizations is difficult to scale. We describe the design of Contextifier, a novel system that automatically produces custom, annotated visualizations of stock behavior given a news article about a company. Contextifier’s algorithms for choosing annotations is informed by a study of professionally created visualizations and takes into account visual salience, contextual relevance, and a detection of key events in the company’s history. In evaluating our system we find that Contextifier better balances graphical salience and relevance than the baseline.

The authors use a stock graph as the primary context in which to link in other news about a publicly traded company.

Other aspects of Contextifier were focused on enhancement of that primary context.

The lesson here is that a tool with a purpose is easier to hone than a tool that could be anything for just about anybody.

I first saw this at Visualization Papers at CHI 2013 by Enrico Bertini.

Collaborative annotation… [Human + Machine != Semantic Monotony]

Sunday, April 21st, 2013

Collaborative annotation for scientific data discovery and reuse by Kirk Borne. (Borne, K. (2013), Collaborative annotation for scientific data discovery and reuse. Bul. Am. Soc. Info. Sci. Tech., 39: 44–45. doi: 10.1002/bult.2013.1720390414)


Human classification alone, unable to handle the enormous quantity of project data, requires the support of automated machine-based strategies. In collaborative annotation, humans and machines work together, merging editorial strengths in semantics and pattern recognition with the machine strengths of scale and algorithmic power. Discovery informatics can be used to generate common data models, taxonomies and ontologies. A proposed project of massive scale, the Large Synoptic Survey Telescope (LSST) project, will systematically observe the southern sky over 10 years, collecting petabytes of data for analysis. The combined work of professional and citizen scientists will be needed to tag the discovered astronomical objects. The tag set will be generated through informatics and the collaborative annotation efforts of humans and machines. The LSST project will demonstrate the development and application of a classification scheme that supports search, curation and reuse of a digital repository.

A persuasive call to arms to develop “collaborative annotation:”

Humans and machines working together to produce the best possible classification label(s) is collaborative annotation. Collaborative annotation is a form of human computation [1]. Humans can see patterns and semantics (context, content and relationships) more quickly, accurately and meaningfully than machines. Human computation therefore applies to the problem of annotating, labeling and classifying voluminous data streams.

And more specifically for the Large Synoptic Survey Telescope (LSST):

The discovery potential of this data collection would be enormous, and its long-term value (through careful data management and curation) would thus require (for maximum scientific return) the participation of scientists and citizen scientists as well as science educators and their students in a collaborative knowledge mark-up (annotation and tagging) data environment. To meet this need, we envision a collaborative tagging system called AstroDAS (Astronomy Distributed Annotation System). AstroDAS is similar to existing science knowledge bases, such as BioDAS (Biology Distributed Annotation System,

As you might expect, semantic diversity is going to be present with “collaborative annotation.”

Semantic Monotony (aka Semantic Web) has failed for machines alone.

No question it will fail for humans + machines.

Are you ready to step up to the semantic diversity of collaborative annotation (humans + machines)?

Open Annotation Data Model

Tuesday, March 19th, 2013

Open Annotation Data Model


The Open Annotation Core Data Model specifies an interoperable framework for creating associations between related resources, annotations, using a methodology that conforms to the Architecture of the World Wide Web. Open Annotations can easily be shared between platforms, with sufficient richness of expression to satisfy complex requirements while remaining simple enough to also allow for the most common use cases, such as attaching a piece of text to a single web resource.

An Annotation is considered to be a set of connected resources, typically including a body and target, where the body is somehow about the target. The full model supports additional functionality, enabling semantic annotations, embedding content, selecting segments of resources, choosing the appropriate representation of a resource and providing styling hints for consuming clients.
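The simplest case described above, a piece of text attached to a single web resource, can be sketched in a few lines. The key names (`type`, `body`, `target`) and the example URL are placeholders of my own, not the vocabulary the specification defines; consult the data model itself for the real terms.

```python
import json


def make_annotation(body_text: str, target_uri: str) -> dict:
    """Build a minimal annotation: one body that is 'somehow about' one target."""
    return {
        "type": "Annotation",   # placeholder key and value, not spec vocabulary
        "body": body_text,      # the comment itself
        "target": target_uri,   # the web resource being annotated
    }


annotation = make_annotation(
    "Compare this with the HyTime approach.",
    "https://example.org/articles/annotation.html",  # hypothetical resource
)

# Plain data like this can be serialized and shared between platforms.
serialized = json.dumps(annotation, sort_keys=True)
```

The fuller model layers segment selection, styling hints, and richer bodies on top of this same body-and-target core.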

This is my first encounter with this proposal, so I need to compare it to my Simple Web Semantics.

At first blush, the Open Annotation Core Model looks a lot heavier than Simple Web Semantics.

I need to reform my blog posts into a formal document and perhaps attach a comparison as an annex.