Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 4, 2013

A.nnotate

Filed under: Annotation,Indexing — Patrick Durusau @ 3:29 pm

A.nnotate

From the homepage:

A.nnotate is an online annotation, collaboration and indexing system for documents and images, supporting PDF, Word and other document formats. Instead of emailing different versions of a document back and forth you can now all comment on a single read-only copy online. Documents are displayed in high quality with fonts and layout just like the printed version. It is easy to use and runs in all common web browsers, with no software or plugins to install.

Hosted solutions are available for individuals and workgroups. For enterprise users the full system is available for local installation. Special discounts apply for educational use. A.nnotate technology can also be used to enhance existing document and content management systems with high quality online document viewing, annotation and collaboration facilities.

I suppose that is one way to solve the “index merging” problem.

Everyone use a common document.

Doesn’t help if a group starts with different copies of the same document.

Or if other indexes from other documents need to be merged with the present document.

Not to mention merging indexes/annotations separate from any particular document instance.
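Just to make the musing concrete, here is a minimal sketch (the data layout is made up) of one way to merge annotations held separately from any particular document instance: key them by a hash of the normalized content and anchor them to quoted spans rather than to one copy’s byte offsets.

    import hashlib
    from collections import defaultdict

    def normalize(text):
        # Collapse whitespace so trivially different copies hash to the same key.
        return " ".join(text.split())

    def doc_key(text):
        return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

    def merge_annotations(copies):
        """copies: list of (document_text, [(quoted_span, note), ...]) pairs."""
        merged = defaultdict(list)
        for text, annotations in copies:
            key = doc_key(text)
            for quote, note in annotations:
                # Re-anchor each annotation by locating its quote in the normalized text.
                position = normalize(text).find(normalize(quote))
                merged[key].append({"quote": quote, "note": note, "position": position})
        return merged

    copy_a = ("A.nnotate is an online  annotation system.", [("online annotation", "check claim")])
    copy_b = ("A.nnotate is an online annotation system.", [("annotation system", "cf. Teamware")])
    print(merge_annotations([copy_a, copy_b]))  # one document key, two annotations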

Still, a step away from the notion of a document as a static object.

Which is a good thing.

I first saw this in a tweet by Stian Danenbarger.

February 18, 2013

Open Annotation Collaboration

Filed under: AnnotateIt,Annotation,Annotator — Patrick Durusau @ 5:41 am

Open Annotation Collaboration

From the webpage:

We are pleased to announce the publication of the 1.0 release of the Open Annotation Data Model & Ontology. This work is the product of the W3C Open Annotation Community Group jointly founded by the Annotation Ontology and the Open Annotation Collaboration. The OA Community Group will be hosting three public rollout events, U.S. West Coast, U.S. East Coast, and in the U.K. this Spring and early Summer. Implementers, developers, and information managers who attend one of these meetings will learn about the OA Data Model & Ontology firsthand from OA Community implementers and see existing annotation services that have been built using the OA model.

Open Annotation Specification Rollout Dates

U.S. West Coast Rollout – 09 April 2013 at Stanford University

U.S. East Coast Rollout – 06 May 2013 at the University of Maryland

U.K. Rollout – 24 June 2013 at the University of Manchester

No registration fee but RSVP required.

Materials on the Open Annotation Data Model & Ontology (W3C) and other annotation resources.

The collection of Known Annotation Clients is my favorite.
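For a feel of what the model looks like in practice, here is a rough sketch of a single OA-style annotation built as a plain Python dict. The property names (hasBody, hasTarget, TextQuoteSelector) follow my reading of the 1.0 model; treat the details as assumptions and check the specification before relying on them.

    import json

    # Sketch of an Open Annotation-style record: one body (the comment) and one
    # target (the annotated document plus a selector for the exact passage).
    # Property names and types are my paraphrase of the OA model, not normative.
    annotation = {
        "@type": "oa:Annotation",
        "hasBody": {
            "@type": "cnt:ContentAsText",
            "chars": "This passage needs a citation.",
        },
        "hasTarget": {
            "hasSource": "http://example.org/article.pdf",
            "hasSelector": {
                "@type": "oa:TextQuoteSelector",
                "exact": "three public rollout events",
            },
        },
    }

    print(json.dumps(annotation, indent=2))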

January 20, 2013

Interactive Text Mining

Filed under: Annotation,Bioinformatics,Curation,Text Mining — Patrick Durusau @ 8:03 pm

An overview of the BioCreative 2012 Workshop Track III: interactive text mining task by Cecilia N. Arighi, et al. (Database (2013) 2013: bas056, doi: 10.1093/database/bas056)

Abstract:

In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. The analysis of this survey highlights how important task completion is to the biocurators’ overall experience of a system, regardless of the system’s high score on design, learnability and usability. In addition, strategies to refine the annotation guidelines and systems documentation, to adapt the tools to the needs and query types the end user might have and to evaluate performance in terms of efficiency, user interface, result export and traditional evaluation metrics have been analyzed during this task. This analysis will help to plan for a more intense study in BioCreative IV.

Curation is an aspect of topic map authoring, albeit with the latter capturing information for later merging with other sources of information.

Definitely an article you will want to read if you are designing text mining as part of a topic map solution.

December 27, 2012

Utopia Documents

Filed under: Annotation,Bioinformatics,Biomedical,Medical Informatics,PDF — Patrick Durusau @ 3:58 pm

Checking the “sponsored by” link for pdfx v1.0 and discovered: Utopia Documents.

From the homepage:

Reading, redefined.

Utopia Documents brings a fresh new perspective to reading the scientific literature, combining the convenience and reliability of the PDF with the flexibility and power of the web. Free for Linux, Mac and Windows.

Building Bridges

The scientific article has been described as a Story That Persuades With Data, but all too often the link between data and narrative is lost somewhere in the modern publishing process. Utopia Documents helps to rebuild these connections, linking articles to underlying datasets, and making it easy to access online resources relating to an article’s content.

A Living Resource

Published articles form the ‘minutes of science’, creating a stable record of ideas and discoveries. But no idea exists in isolation, and just because something has been published doesn’t mean that the story is over. Utopia Documents reconnects PDFs with the ongoing discussion, keeping you up-to-date with the latest knowledge and metrics.

Comment

Make private notes for yourself, annotate a document for others to see or take part in an online discussion.

Explore article content

Looking for clarification of given terms? Or more information about them? Do just that, with integrated semantic search.

Interact with live data

Interact directly with curated database entries: play with molecular structures; edit sequence and alignment data; even plot and export tabular data.

A finger on the pulse

Stay up to date with the latest news. Utopia connects what you read with live data from Altmetric, Mendeley, CrossRef, Scibite and others.

A user can register for an account (enabling comments on documents) or use the application anonymously.

Presently focused on the life sciences, but there is no impediment to expansion into computer science, for example.

It doesn’t solve semantic diversity issues, so there is an opportunity for topic maps there.

Doesn’t address the issue of documents being good at information delivery but not so good for information storage.

But issues of semantic diversity and information storage are growth areas for Utopia Documents, not reservations about its use.

Suggest you start using and exploring Utopia Documents sooner rather than later!

December 8, 2012

Piccolo: Distributed Computing via Shared Tables

Filed under: Annotation,Distributed Systems,Piccolo — Patrick Durusau @ 7:41 pm

Piccolo: Distributed Computing via Shared Tables

From the homepage:

Piccolo is a framework designed to make it easy to develop efficient distributed applications.

In contrast to traditional data-centric models (such as Hadoop) which present the user a single object at a time to operate on, Piccolo exposes a global table interface which is available to all parts of the computation simultaneously. This allows users to specify programs in an intuitive manner very similar to that of writing programs for a single machine.

Piccolo includes a number of optimizations to ensure that using this table interface is not just easy, but also fast:

Locality
To ensure locality of execution, tables are explicitly partitioned across machines. User code that interacts with the tables can specify a locality preference: this ensures that the code is executed locally with the data it is accessing.
Load-balancing
Not all load is created equal – often some partition of a computation will take much longer than others. Waiting idly for this task to finish wastes valuable time and resources. To address this Piccolo can migrate tasks away from busy machines to take advantage of otherwise idle workers, all while preserving the locality preferences and the correctness of the program.
Failure Handling
Machine failures are inevitable, and generally occur at the most critical time in your computation. Piccolo makes checkpointing and restoration easy and fast, allowing for quick recovery in case of failures.
Synchronization
Managing the correct synchronization and update across a distributed system can be complicated and slow. Piccolo addresses this by allowing users to defer synchronization logic to the system. Instead of explicitly locking tables in order to perform updates, users can attach accumulation functions to a table: these are used automatically by the framework to correctly combine concurrent updates to a table entry.
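The accumulation-function idea is easy to mimic in miniature. A toy sketch in Python (nothing to do with Piccolo’s actual API): a shared table that resolves concurrent updates by applying a user-supplied accumulator instead of requiring explicit locks.

    class AccumulatorTable:
        """Toy version of a shared table with a per-table accumulation function."""

        def __init__(self, accumulate):
            self.accumulate = accumulate  # e.g. addition, max, set union
            self.data = {}

        def update(self, key, value):
            # Concurrent updates to the same key are combined by the accumulator,
            # so callers never need to lock, read, modify and write themselves.
            if key in self.data:
                self.data[key] = self.accumulate(self.data[key], value)
            else:
                self.data[key] = value

    # Word-count style usage: many workers could call update() with partial counts.
    counts = AccumulatorTable(accumulate=lambda old, new: old + new)
    for word in ["table", "stream", "table"]:
        counts.update(word, 1)
    print(counts.data)  # {'table': 2, 'stream': 1}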

The closer you are to the metal, the more aware you will be of the distributed nature of processing and data.

Will the success of distributed processing/storage be when all but systems architects are unaware of its nature?

November 16, 2012

Phrase Detectives

Filed under: Annotation,Games,Interface Research/Design,Linguistics — Patrick Durusau @ 5:21 am

Phrase Detectives

This annotation game was also mentioned in Bob Carpenter’s Another Linguistic Corpus Collection Game, but it merits separate mention.

From the description:

Welcome to Phrase Detectives

Lovers of literature, grammar and language, this is the place where you can work together to improve future generations of technology. By indicating relationships between words and phrases you will help to create a resource that is rich in linguistic information.

It is easy to see how this could be adapted to identification of subjects, roles and associations in texts.

And in a particular context, the interest would be in capturing usage in that context, not the wider world.

Definitely has potential as a topic map authoring interface.

Another Linguistic Corpus Collection Game

Filed under: Annotation,Games,Interface Research/Design — Patrick Durusau @ 5:11 am

Another Linguistic Corpus Collection Game by Bob Carpenter

From the post:

Johan Bos and his crew at University of Groningen have a new suite of games aimed at linguistic data collection. You can find them at:

http://www.wordrobe.org/

Wordrobe is currently hosting four games. Twins is aimed at part-of-speech tagging, Senses is for word sense annotation, Pointers for coref data, and Names for proper name classification.

One of the neat things about Wordrobe is that they try to elicit some notion of confidence by allowing users to “bet” on their answers.

Used here with a linguistic data collection but there is no reason why such games would not work in other contexts.

For instance, in an enterprise environment seeking to collect information for construction of a topic map. An alternative to awkward interviews where you try to elicit intuitive knowledge from users.

Create a game using their documents and give meaningful awards, extra vacation time for instance.
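The betting mechanism also suggests a simple aggregation rule for such a game. A toy sketch (my own scoring, not Wordrobe’s): weight each player’s answer by the amount bet on it and accept the answer with the highest total.

    from collections import defaultdict

    def aggregate(votes):
        """votes: list of (answer, bet) pairs from different players."""
        totals = defaultdict(float)
        for answer, bet in votes:
            totals[answer] += bet
        # The answer with the highest combined bet wins; ties break arbitrarily.
        return max(totals, key=totals.get), dict(totals)

    votes = [("noun", 50), ("verb", 10), ("noun", 20), ("verb", 80)]
    print(aggregate(votes))  # ('verb', {'noun': 70.0, 'verb': 90.0})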

November 4, 2012

Manual Gene Ontology annotation workflow

Filed under: Annotation,Bioinformatics,Curation,Ontology — Patrick Durusau @ 9:00 pm

Manual Gene Ontology annotation workflow at the Mouse Genome Informatics Database by Harold J. Drabkin and Judith A. Blake, for the Mouse Genome Informatics Database. Database (2012) 2012: bas045, doi: 10.1093/database/bas045.

Abstract:

The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource (http://www.informatics.jax.org). The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented particularly in regards to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as ‘GO’ or ‘homology’ or ‘phenotype’. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as ‘papers selected for GO that refer to genes with NO GO annotation’. Once indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. All data associations are supported with statements of evidence as well as access to source publications.
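To make the workflow a little more concrete, here is a toy sketch of its main steps; the identifiers are placeholders, not MGI data. A paper is indexed to an area of interest and to genes, GO annotations are recorded with evidence codes, and a curator report falls out of the indexing.

    # Placeholder identifiers throughout; this only mirrors the shape of the workflow.
    bibliography = []

    def index_article(pmid, areas, genes):
        record = {"pmid": pmid, "areas": areas, "genes": genes, "annotations": []}
        bibliography.append(record)
        return record

    def annotate(record, gene, go_id, evidence):
        # Controlled vocabulary term (GO ID) plus a statement of evidence.
        record["annotations"].append({"gene": gene, "go_id": go_id, "evidence": evidence})

    paper = index_article("PMID:0000000", areas=["GO"], genes=["Pax6"])
    annotate(paper, gene="Pax6", go_id="GO:0007275", evidence="IMP")

    # Curator report: papers selected for GO whose genes still lack GO annotation.
    todo = [r["pmid"] for r in bibliography if "GO" in r["areas"] and not r["annotations"]]
    print(todo)  # [] once the paper above has been annotated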

Semantic uniformity is achievable, in a limited enough sphere, provided you are willing to pay the price for it.

It has a high rate of return over less carefully curated content.

The project is producing high quality results, although hampered by a lack of resources.

My question is whether a similar high quality of results could be achieved with less semantically consistent curation by distributed contributors?

Harnessing the community of those interested in such a resource. And refining those less semantically consistent entries into higher quality annotations.

Pointers to examples of such projects?

August 30, 2012

MyMiner: a web application for computer-assisted biocuration and text annotation

Filed under: Annotation,Bioinformatics,Biomedical,Classification — Patrick Durusau @ 10:35 am

MyMiner: a web application for computer-assisted biocuration and text annotation by David Salgado, Martin Krallinger, Marc Depaule, Elodie Drula, Ashish V. Tendulkar, Florian Leitner, Alfonso Valencia and Christophe Marcelle. (Bioinformatics (2012) 28(17): 2285-2287, doi: 10.1093/bioinformatics/bts435)

Abstract:

Motivation: The exponential growth of scientific literature has resulted in a massive amount of unstructured natural language data that cannot be directly handled by means of bioinformatics tools. Such tools generally require structured data, often generated through a cumbersome process of manual literature curation. Herein, we present MyMiner, a free and user-friendly text annotation tool aimed to assist in carrying out the main biocuration tasks and to provide labelled data for the development of text mining systems. MyMiner allows easy classification and labelling of textual data according to user-specified classes as well as predefined biological entities. The usefulness and efficiency of this application have been tested for a range of real-life annotation scenarios of various research topics.

Availability: http://myminer.armi.monash.edu.au.

Contacts: david.salgado@monash.edu and christophe.marcelle@monash.edu

Supplementary Information: Supplementary data are available at Bioinformatics online.

A useful tool and good tutorial materials.

I could easily see something similar for CS research (unless such already exists).

August 2, 2012

Community Based Annotation (mapping?)

Filed under: Annotation,Bioinformatics,Biomedical,Interface Research/Design,Ontology — Patrick Durusau @ 1:51 pm

Enabling authors to annotate their articles is examined in: Assessment of community-submitted ontology annotations from a novel database-journal partnership by Tanya Z. Berardini, Donghui Li, Robert Muller, Raymond Chetty, Larry Ploetz, Shanker Singh, April Wensel and Eva Huala.

Abstract:

As the scientific literature grows, leading to an increasing volume of published experimental data, so does the need to access and analyze this data using computational tools. The most commonly used method to convert published experimental data on gene function into controlled vocabulary annotations relies on a professional curator, employed by a model organism database or a more general resource such as UniProt, to read published articles and compose annotation statements based on the articles’ contents. A more cost-effective and scalable approach capable of capturing gene function data across the whole range of biological research organisms in computable form is urgently needed.

We have analyzed a set of ontology annotations generated through collaborations between the Arabidopsis Information Resource and several plant science journals. Analysis of the submissions entered using the online submission tool shows that most community annotations were well supported and the ontology terms chosen were at an appropriate level of specificity. Of the 503 individual annotations that were submitted, 97% were approved and community submissions captured 72% of all possible annotations. This new method for capturing experimental results in a computable form provides a cost-effective way to greatly increase the available body of annotations without sacrificing annotation quality.

It is encouraging that this annotation effort started with the persons most likely to know the correct answers: the authors of the papers in question.

The low initial participation rate (16%), and the improved rate after an email reminder (53%), were less encouraging.

I suspect that unless and until prior annotation practices (by researchers) become a line item on current funding requests (how many annotations were accepted by publishers of your prior research?), we will continue to see annotation treated as a low-priority item.

Perhaps I should suggest that as a study area for the NIH?

Publishers, researchers who build annotation software, annotated data sources and their maintainers, are all likely to be interested.

Would you be interested as well?

July 12, 2012

OpenCalais

Filed under: Annotation,OpenCalais — Patrick Durusau @ 6:21 pm

OpenCalais

From the introduction page:

The free OpenCalais service and open API is the fastest way to tag the people, places, facts and events in your content.  It can help you improve your SEO, increase your reader engagement, create search-engine-friendly ‘topic hubs’ and streamline content operations – saving you time and money.

OpenCalais is free to use in both commercial and non-commercial settings, but can only be used on public content (don’t run your confidential or competitive company information through it!). OpenCalais does not keep a copy of your content, but it does keep a copy of the metadata it extracts therefrom.

To repeat, OpenCalais is not a private service, and there is no secure, enterprise version that you can buy to operate behind a firewall. It is your responsibility to police the content that you submit, so make sure you are comfortable with our Terms of Service (TOS) before you jump in.

You can process up to 50,000 documents per day (blog posts, news stories, Web pages, etc.) free of charge.  If you need to process more than that – say you are an aggregator or a media monitoring service – then see this page to learn about Calais Professional. We offer a very affordable license.

OpenCalais’ early adopters include CBS Interactive / CNET, Huffington Post, Slate, Al Jazeera, The New Republic, The White House and more. Already more than 30,000 developers have signed up, and more than 50 publishers and 75 entrepreneurs are using the free service to help build their businesses.

You can read about the pioneering work of these publishers, entrepreneurs and developers here.

To get started, scroll to the bottom section of this page. To build OpenCalais into an existing site or publishing platform (CMS), you will need to work with your developers. 

I thought I had written about OpenCalais but it turns out it was just in quotes in other posts. Should know better than to rely on my memory. 😉

The 50,000 document per day limit sounds reasonable to me and should be enough for some interesting experiments. Perhaps even comparisons of the results from different tagging projects.

Not to say one is better than another but to identify spots on semantic margins where ambiguity may be found.
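A rough sketch of such a comparison, assuming each tagging service’s output has already been reduced to a set of (entity text, entity type) pairs per document; the disagreements mark the semantic margins worth a closer look.

    def compare_taggers(tags_a, tags_b):
        """tags_a, tags_b: sets of (entity_text, entity_type) from two services."""
        agreed = tags_a & tags_b
        only_a = tags_a - tags_b
        only_b = tags_b - tags_a
        # Same surface text, different type: the clearest sign of ambiguity.
        disputed = {ta[0] for ta in only_a for tb in only_b if ta[0] == tb[0]}
        return {"agreed": agreed, "only_a": only_a, "only_b": only_b, "disputed": disputed}

    a = {("Washington", "Person"), ("Potomac", "River")}
    b = {("Washington", "City"), ("Potomac", "River")}
    print(compare_taggers(a, b)["disputed"])  # {'Washington'}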

Historical documents should make interesting test subjects.

We should be cautious though: the further back in history we reach, the less meaningful it is to say a word has a “correct” meaning. An author used it with a particular meaning but that passed from our ken with the passing of the author and their linguistic community. We can guess what may have been meant, but nothing more.

Semantator: annotating clinical narratives with semantic web ontologies

Filed under: Annotation,Ontology,Protégé,RDF,Semantator,Semantic Web — Patrick Durusau @ 2:40 pm

Semantator: annotating clinical narratives with semantic web ontologies by Dezhao Song, Christopher G. Chute, and Cui Tao. (AMIA Summits Transl Sci Proc. 2012;2012:20-9. Epub 2012 Mar 19.)

Abstract:

To facilitate clinical research, clinical data needs to be stored in a machine processable and understandable way. Manually annotating clinical data is time consuming. Automatic approaches (e.g., Natural Language Processing systems) have been adopted to convert such data into structured formats; however, the quality of such automatically extracted data may not always be satisfying. In this paper, we propose Semantator, a semi-automatic tool for document annotation with Semantic Web ontologies. With a loaded free text document and an ontology, Semantator supports the creation/deletion of ontology instances for any document fragment, linking/disconnecting instances with the properties in the ontology, and also enables automatic annotation by connecting to the NCBO annotator and cTAKES. By representing annotations in Semantic Web standards, Semantator supports reasoning based upon the underlying semantics of the owl:disjointWith and owl:equivalentClass predicates. We present discussions based on user experiences of using Semantator.

If you are an AMIA member, see above for the paper. If not, see: Semantator: annotating clinical narratives with semantic web ontologies (PDF file). And the software/webpage: Semantator.

The software is a plugin for Protégé 4.1 or higher.

Looking at the extensive screen shots at the website, which has good documentation, the first question I would ask a potential user is: “Are you comfortable with Protégé?” If they aren’t, I suspect you are going to invest a lot of time teaching them ontologies and Protégé. Just an FYI.

Complex authoring tools, particularly for newbies, seem like a non-starter to me. For example, why not have a standalone entity extractor (but don’t call it that, call it “I See You (ISY)”) that uses a preloaded entity file to recognize entities in a text? Where there is uncertainty, the candidates are displayed in a different color, with drop-down options for the possible alternatives. Users get to pick one from the list (no write-in ballots). That is one step towards clean data for a second round with another one-trick-pony tool. Users contribute, we all benefit.
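A toy sketch of the “I See You” idea (the entity file and identifiers are made up): a preloaded dictionary maps surface strings to candidate entities; unambiguous matches pass through, ambiguous ones are flagged for the user to resolve from a fixed list.

    import re

    # Hypothetical preloaded entity file: surface form -> candidate entity IDs.
    ENTITIES = {
        "aspirin": ["CHEBI:15365"],
        "cold": ["common_cold", "low_temperature"],  # ambiguous on purpose
    }

    def extract(text):
        results = []
        for surface, candidates in ENTITIES.items():
            for match in re.finditer(r"\b%s\b" % re.escape(surface), text, re.IGNORECASE):
                results.append({
                    "surface": match.group(0),
                    "offset": match.start(),
                    "candidates": candidates,
                    "needs_review": len(candidates) > 1,  # show in a different color
                })
        return results

    print(extract("Aspirin is not indicated for a cold."))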

Which brings me to the common shortfall of annotation solutions: the requirement that the text to be annotated be in plain text.

There are lots of “text” documents, but what of those in Word, PDF, PostScript, PPT, Excel, to say nothing of other formats?

The past will not disappear for want of a robust annotation solution.

Nor should it.

June 13, 2012

Social Annotations in Web Search

Social Annotations in Web Search by Aditi Muralidharan, Zoltan Gyongyi, and Ed H. Chi. (CHI 2012, May 5–10, 2012, Austin, Texas, USA)

Abstract:

We ask how to best present social annotations on search results, and attempt to find an answer through mixed-method eye-tracking and interview experiments. Current practice is anchored on the assumption that faces and names draw attention; the same presentation format is used independently of the social connection strength and the search query topic. The key findings of our experiments indicate room for improvement. First, only certain social contacts are useful sources of information, depending on the search topic. Second, faces lose their well-documented power to draw attention when rendered small as part of a social search result annotation. Third, and perhaps most surprisingly, social annotations go largely unnoticed by users in general due to selective, structured visual parsing behaviors specific to search result pages. We conclude by recommending improvements to the design and content of social annotations to make them more noticeable and useful.

The entire paper is worth your attention but the first paragraph of the conclusion gives much food for thought:

For content, three things are clear: not all friends are equal, not all topics benefit from the inclusion of social annotation, and users prefer different types of information from different people. For presentation, it seems that learned result-reading habits may cause blindness to social annotations. The obvious implication is that we need to adapt the content and presentation of social annotations to the specialized environment of web search.

The complexity and subtlety of semantics on the human side keeps bumping into the search/annotate-with-a-hammer approach on the computer side.

Or as the authors say: “…users prefer different types of information from different people.”

Search engineers/designers who use their preferences/intuitions as the designs to push out to the larger user universe are always going to fall short.

Because all users have their own preferences and intuitions about searching and parsing search results. What is so surprising about that?

I have had discussions with programmers who would say: “But it will be better for users to do X (as opposed to Y) in the interface.”

Know what? Users are the only measure of the fitness of an interface or success of a search result.

A “pull” model (user preferences) based search engine will gut all existing (“push” model, engineer/programmer preference) search engines.


PS: You won’t discover the range of user preferences with study groups of 11 participants. Ask one of the national survey companies and have them select several thousand participants. Then refine which preferences get used the most. It won’t happen overnight, but every percentage gain will be one the existing search engines won’t regain.

PPS: Speaking of interfaces, I would pay for a web browser that put webpages back under my control (the early WWW model).

Enabling me to defeat those awful “page is loading” ads from major IT vendors who should know better. As well as strip other crap out. It is a data stream that is being parsed. I should be able to clean it up before viewing. That could be a real “hit” and make page load times faster.

I first saw this article in a list of links from Greg Linden.

May 20, 2012

…Commenting on Legislation and Court Decisions

Filed under: Annotation,Law,Legal Informatics — Patrick Durusau @ 6:16 pm

Anderson Releases Prototype System Enabling Citizens to Comment on Legislation and Court Decisions

Legalinformatics brings news that:

Kerry Anderson of the African Legal Information Institute (AfricanLII) has released a prototype of a new software system enabling citizens to comment on legislation, regulations, and court decisions.

There are several initiatives like this one, which is encouraging from the perspective of crowd-sourcing data for annotation.

May 18, 2012

Interannotator Agreement for Chunking Tasks like Named Entities and Phrases

Filed under: Annotation,LingPipe,Natural Language Processing — Patrick Durusau @ 2:40 pm

Interannotator Agreement for Chunking Tasks like Named Entities and Phrases

Bob Carpenter writes:

Krishna writes,

I have a question about using the chunking evaluation class for inter-annotator agreement: how can you use it when the annotators might have missing chunks, i.e., if one of the files contains more chunks than the other?

The answer’s not immediately obvious because the usual application of interannotator agreement statistics is to classification tasks (including things like part-of-speech tagging) that have a fixed number of items being annotated.

An issue that is likely to come up in crowd sourcing analysis/annotation of text as well.
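One common workaround for the missing-chunks case is to treat one annotator as the reference and score the other with chunk-level precision, recall and F-measure. A minimal sketch of the idea (not LingPipe’s evaluation class):

    def chunk_agreement(reference, response):
        """reference, response: sets of chunks as (start, end, label) tuples."""
        true_positives = len(reference & response)
        precision = true_positives / len(response) if response else 0.0
        recall = true_positives / len(reference) if reference else 0.0
        f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
        return precision, recall, f1

    ann_1 = {(0, 5, "PER"), (10, 17, "ORG"), (20, 25, "LOC")}
    ann_2 = {(0, 5, "PER"), (20, 25, "LOC")}  # one chunk missing entirely
    print(chunk_agreement(ann_1, ann_2))  # (1.0, 0.666..., 0.8)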

May 9, 2012

GATE Teamware: Collaborative Annotation Factories (HOT!)

GATE Teamware: Collaborative Annotation Factories

From the webpage:

Teamware is a web-based management platform for collaborative annotation & curation. It is a cost-effective environment for annotation and curation projects, enabling you to harness a broadly distributed workforce and monitor progress & results remotely in real time.

It’s also very easy to use. A new project can be up and running in less than five minutes. (As far as we know, there is nothing else like it in this field.)

GATE Teamware delivers a multi-function user interface over the Internet for viewing, adding and editing text annotations. The web-based management interface allows for project set-up, tracking, and management:

  • Loading document collections (a “corpus” or “corpora”)
  • Creating re-usable project templates
  • Initiating projects based on templates
  • Assigning project roles to specific users
  • Monitoring progress and various project statistics in real time
  • Reporting of project status, annotator activity and statistics
  • Applying GATE-based processing routines (automatic annotations or post-annotation processing)

I have known about the GATE project in general for years and came to this site after reading: Crowdsourced Legal Case Annotation.

Could be the basis for annotations that are converted into a topic map, but… I have been a sysadmin before, maintaining servers, websites, software, etc. Great work, interesting work, but not what I want to be doing now.

Then I read:

Where to get it? The easiest way to get started is to buy a ready-to-run Teamware virtual server from GATECloud.net.

Not saying it will or won’t meet your particular needs, but it certainly is worth a “look see.”

Let me know if you take the plunge!

Crowdsourced Legal Case Annotation

Filed under: Annotation,Law,Law - Sources,Legal Informatics — Patrick Durusau @ 12:38 pm

Crowdsourced Legal Case Annotation

From the post:

This is an academic research study on legal informatics (information processing of the law). The study uses an online, collaborative tool to crowdsource the annotation of legal cases. The task is similar to legal professionals’ annotation of cases. The result will be a public corpus of searchable, richly annotated legal cases that can be further processed, analysed, or queried for conceptual annotations.

Adam and Wim are computer scientists who are interested in language, law, and the Internet.

We are inviting people to participate in this collaborative task. This is a beta version of the exercise, and we welcome comments on how to improve it. Please read through this blog post, look at the video, and get in contact.

Non-trivial annotation of complex source documents.

What you do with the annotations, such as create topic maps, etc. would be a separate step.

The early evidence for the enhancement of our own work, based on the work of others, Picking the Brains of Strangers…, should make this approach even more exciting.

PS: I saw this at Legal Informatics but wanted to point you directly to the source article.

Just musing for a moment but what if the conclusion on collaboration and access is that by restricting access we impoverish not only others, but ourselves as well?

March 18, 2012

Annotations in Data Streams

Filed under: Annotation,Data Streams — Patrick Durusau @ 8:52 pm

Annotations in Data Streams by Amit Chakrabarti, Graham Cormode, Andrew McGregor, and Justin Thaler.

Abstract:

The central goal of data stream algorithms is to process massive streams of data using sublinear storage space. Motivated by work in the database community on outsourcing database and data stream processing, we ask whether the space usage of such algorithms can be further reduced by enlisting a more powerful “helper” who can annotate the stream as it is read. We do not wish to blindly trust the helper, so we require that the algorithm be convinced of having computed a correct answer. We show upper bounds that achieve a non-trivial tradeoff between the amount of annotation used and the space required to verify it. We also prove lower bounds on such tradeoffs, often nearly matching the upper bounds, via notions related to Merlin-Arthur communication complexity. Our results cover the classic data stream problems of selection, frequency moments, and fundamental graph problems such as triangle-freeness and connectivity. Our work is also part of a growing trend — including recent studies of multi-pass streaming, read/write streams and randomly ordered streams — of asking more complexity-theoretic questions about data stream processing. It is a recognition that, in addition to practical relevance, the data stream model raises many interesting theoretical questions in its own right.

I have a fairly simple question as I start to read this paper: When is digital data not a stream?

When it is read from a memory device, it is a stream.

When it is read into a memory device, it is a stream.

When it is read into a cache on a CPU, it is a stream.

When it is read from the cache by a CPU, it is a stream.

When it is placed back in a cache by a CPU, it is a stream.

What would you call digital data on a storage device? May not be a stream but you can’t look at it without it becoming a stream. Yes?

March 12, 2012

Cross Validation vs. Inter-Annotator Agreement

Filed under: Annotation,LingPipe,Linguistics — Patrick Durusau @ 8:05 pm

Cross Validation vs. Inter-Annotator Agreement by Bob Carpenter.

From the post:

Time, Negation, and Clinical Events

Mitzi’s been annotating clinical notes for time expressions, negations, and a couple other classes of clinically relevant phrases like diagnoses and treatments (I just can’t remember exactly which!). This is part of the project she’s working on with Noemie Elhadad, a professor in the Department of Biomedical Informatics at Columbia.

LingPipe Chunk Annotation GUI

Mitzi’s doing the phrase annotation with a LingPipe tool which can be found in

She even brought it up to date with the current release of LingPipe and generalized the layout for documents with subsections.

Lessons in the use of LingPipe tools!

If you are annotating texts or anticipate annotating texts, read this post.

February 29, 2012

Will the Circle Be Unbroken? Interactive Annotation!

I have to agree with Bob Carpenter, the title is a bit much:

Closing the Loop: Fast, Interactive Semi-Supervised Annotation with Queries on Features and Instances

From the post:

Whew, that was a long title. Luckily, the paper’s worth it:

Settles, Burr. 2011. Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances. EMNLP.

It’s a paper that shows you how to use active learning to build a reasonably high-performance classifier with only minutes of user effort. Very cool and right up our alley here at LingPipe.

Both the paper and Bob’s review merit close reading.

February 21, 2012

Making sense of Wikipedia categories

Filed under: Annotation,Classification,Wikipedia — Patrick Durusau @ 8:00 pm

Making sense of Wikipedia categories

Hal Daume III writes:

Wikipedia’s category hierarchy forms a graph. It’s definitely cyclic (Category:Ethology belongs to Category:Behavior, which in turn belongs to Category:Ethology).

At any rate, did you know that “Chicago Stags coaches” are a subcategory of “Natural sciences”? If you don’t believe me, go to the Wikipedia entry for the Natural sciences category, and expand the following list of subcategories:

(subcategories omitted)

I guess it kind of makes sense. There are some other fun ones, like “Rhaeto-Romance languages”, “American World War I flying aces” and “1911 films”. Of course, these are all quite deep in the “hierarchy” (all of those are at depth 15 or higher).

Hal examines several strategies and concludes asking:

Has anyone else tried and succeed at using the Wikipedia category structure?

Some other questions:

Is Hal right that hand annotation doesn’t “scale”?

I have heard that more times than I can count but never seen any studies cited to support it.

After all, Wikipedia was manually edited and produced. Yes? No automated process created its content. So, what is the barrier to hand annotation?

If you think about it, the same could be said about email but most email (yes?) is written by hand. Not produced by automated processes (well, except for spam), so why can’t it be hand annotated? Or at least why can’t we capture semantics of email at the point of composition and annotate it there by automated means?

Hand annotation may not scale for sensor data or financial data streams but is hand annotation needed for such sources?

Hand annotation may not scale for say twitter posts by non-English speakers. But only for agencies with very short-sighted if not actively bigoted hiring/contracting practices.

Has anyone loaded the Wikipedia categories into a graph database? What sort of interface would you suggest for trial arrangement of the categories?
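As a partial answer to my own question, a small in-memory sketch with networkx; the edge list is made up, the real one would come from a category dump.

    import networkx as nx

    # Hypothetical slice of the category graph: edges point from subcategory to parent.
    edges = [
        ("Ethology", "Behavior"),
        ("Behavior", "Ethology"),  # the cycle Hal mentions
        ("Chicago Stags coaches", "Chicago Stags"),
        ("Chicago Stags", "Basketball teams"),
    ]

    graph = nx.DiGraph(edges)
    print(list(nx.simple_cycles(graph)))  # one two-node cycle: Ethology <-> Behavior
    print(nx.has_path(graph, "Chicago Stags coaches", "Basketball teams"))  # True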

PS: If you are interested in discussing how-to establish assisted annotation for twitter, email or other data streams, with or without user awareness, send me a post.

February 6, 2012

Wikimeta Project’s Evolution…

Filed under: Annotation,Data Mining,Semantic Annotation,Semantic Web — Patrick Durusau @ 6:58 pm

Wikimeta Project’s Evolution Includes Commercial Ambitions and Focus On Text-Mining, Semantic Annotation Robustness by Jennifer Zaino.

From the post:

Wikimeta, the semantic tagging and annotation architecture for incorporating semantic knowledge within documents, websites, content management systems, blogs and applications, this month is incorporating itself as a company called Wikimeta Technologies. Wikimeta, which has a heritage linked with the NLGbAse project, last year was provided as its own web service.

The Semantic Web Blog interviews Dr. Eric Charton about Wikimeta and its future plans.

More interesting than the average interview piece. I have a weakness for academic projects and Wikimeta certainly has the credentials in that regard.

On the other hand, when I read statements like:

So when we said Wikimeta makes over 94 percent of good semantic annotation in the three first ranked suggested annotations, this is tested, evaluated, published, peer-reviewed and reproducible by third parties.

I have to wonder what standard for “…good semantic annotation…” was in play, and for what application 94 percent would be acceptable.

Annotation of nuclear power plant documentation? Drug interaction documentation? Jet engine repair manual? Chemical reaction warning on product? None of those sound like 94% right situations.

That isn’t a criticism of this project but of the notion that “correctness” of semantic annotation can be measured separate and apart from some particular use case.

It could be the case that 94% correct is more than you need if we are talking about the content of Access Hollywood.

And your particular use case may lie somewhere in between those two extremes.

Do read the interview as this sound like it will be an interesting project, whatever your thoughts on “correctness.”

December 3, 2011

AO: Annotation Ontology

Filed under: Annotation,Ontology — Patrick Durusau @ 8:22 pm

AO: Annotation Ontology

From the description:

The Annotation Ontology is a vocabulary for performing several types of annotation – comment, entities annotation (or semantic tags), textual annotation (classic tags), notes, examples, erratum… – on any kind of electronic document (text, images, audio, tables…) and document parts. AO is not providing any domain ontology but it is fostering the reuse of the existing ones for not breaking the principle of scalability of the Semantic Web.

Anita de Waard mentioned this in her How to Execute the Research Paper.

Interesting work but you have to realize that all ontologies evolve (except for those that aren’t used) and that not everyone uses the same one.

Still, it is the sort of thing you will encounter in topic maps work so you need to be aware of it.

How to Execute the Research Paper

Filed under: Annotation,Biomedical,Dynamic Updating,Linked Data,RDF — Patrick Durusau @ 8:21 pm

How to Execute the Research Paper by Anita de Waard.

I had to create the category, “dynamic updating,” to at least partially capture what Anita describes in this presentation. I would have loved to be present to see it in person!

The gist of the presentation is that we need to create mechanisms to support research papers being dynamically linked to the literature and other resources. One example that Anita uses is linking a patient’s medical records to reports in literature with professional tools for the diagnostician.

It isn’t clear how Linked Data (no matter how generously described by Jeni Tennison) could be the second technology for making research papers linked to other data. In part because as Jeni points out, URIs are simply more names for some subject. We don’t know if that name is for the resource or something the resource represents. Makes reliable linking rather difficult.

BTW, the web lost its ability to grow in a “gradual and sustainable way” when RDF/Linked Data introduced the notion that URIs cannot be allowed to fail. If you try to reason based on something that fails, the reasoner falls on its side. Not nearly as robust as allowing semantic 404’s.

Anita’s third step, an integrated workflow, is certainly the goal toward which we should be striving. I am less convinced that the mechanisms, such as generating linked data stores in addition to the documents we already have, are the way forward. For documents, for instance, why do we need to repeat data they already possess? Why can’t documents represent their contents themselves? Oh, because that isn’t how Linked Data/RDF stores work.

Still, I would highly recommend this slide deck and that you catch any presentation by Anita that you can.

November 16, 2011

“VCF annotation” with the NHLBI GO Exome Sequencing Project (JAX-WS)

Filed under: Annotation,Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 8:17 pm

“VCF annotation” with the NHLBI GO Exome Sequencing Project (JAX-WS) by Pierre Lindenbaum.

From the post:

The NHLBI Exome Sequencing Project (ESP) has released a web service to query their data. “The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders.”

In the current post, I’ll show how I’ve used this web service to annotate a VCF file with this information.

The web service provided by the ESP is based on the SOAP protocol.

Important news/post for several reasons:

First and foremost, “for the potential to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders.”

Second, thanks to Pierre, we have a fully worked example of how to perform the annotation.

Last but not least, the NHLBI Exome Sequencing Project (ESP) did not try to go it alone for the annotations. It did what it does well and then offered the data up for others to use and extend.
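Pierre’s post has the real, SOAP-based client. As a rough sketch of the client side only, the VCF parsing needed to build the queries looks something like this; the service call is stubbed out, since the real request belongs in Pierre’s write-up.

    def parse_vcf_records(path):
        """Yield (chrom, pos, ref, alt) tuples from a VCF file, skipping headers."""
        with open(path) as handle:
            for line in handle:
                if line.startswith("#"):
                    continue
                fields = line.rstrip("\n").split("\t")
                chrom, pos, _identifier, ref, alt = fields[:5]
                yield chrom, int(pos), ref, alt

    def annotate(chrom, pos, ref, alt):
        # Placeholder for the SOAP call to the ESP web service.
        return {"chrom": chrom, "pos": pos, "ref": ref, "alt": alt, "annotation": None}

    for record in parse_vcf_records("variants.vcf"):
        print(annotate(*record))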

I can’t count the number of projects of varying sorts that I have seen try to do every feature, every annotation, every imaging, every transcription, on their own. All of them ended up being less than they could have been with greater openness.

I am not suggesting that vendors need to give away data. Vendors for the most part support all of us. It is disingenuous to pretend otherwise. So vendors making money means we get to pay our bills, buy books and computers, etc.

What I am suggesting is that vendors, researchers and users need to work (yelling at each other doesn’t count) towards commercially viable solutions that enable greater collaboration on research and data.

Otherwise we will have impoverished data sets that are never quite what they could be, and vendors will pay many times over the real cost of developing data. Those two conditions don’t benefit anyone. “You, me, them.” (Blues Brothers) 😉

September 25, 2011

Modeling Item Difficulty for Annotations of Multinomial Classifications

Filed under: Annotation,Classification,LingPipe,Linguistics — Patrick Durusau @ 7:49 pm

Modeling Item Difficulty for Annotations of Multinomial Classifications by Bob Carpenter

From the post:

We all know from annotating data that some items are harder to annotate than others. We know from the epidemiology literature that the same holds true for medical tests applied to subjects, e.g., some cancers are easier to find than others.

But how do we model item difficulty? I’ll review how I’ve done this before using an IRT-like regression, then move on to Paul Mineiro’s suggestion for flattening multinomials, then consider a generalization of both these approaches.
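The IRT-style idea boils down to a very small formula: the probability that an annotator labels an item correctly is a logistic function of annotator ability minus item difficulty. A minimal sketch, nowhere near the full model in the post:

    import math

    def p_correct(ability, difficulty):
        """Rasch-style probability that an annotator labels an item correctly."""
        return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

    # The same annotator on an easy item versus a hard one.
    print(round(p_correct(ability=1.5, difficulty=-0.5), 3))  # ~0.881
    print(round(p_correct(ability=1.5, difficulty=2.5), 3))   # ~0.269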

For your convenience, links for the “…tutorial for LREC with Massimo Poesio” can be found at: LREC 2010 Tutorial: Modeling Data Annotation.

July 5, 2011

Additive Semantic Apps.

Filed under: Annotation,Authoring Semantics,Conferences,Semantics — Patrick Durusau @ 1:40 pm

10 Ways to make your Semantic App. addictive – Revisited

People after my own heart! Let’s drop all the pretense! We want people to use our apps to the exclusion of other apps. We want people to give up sleep to use our apps! We want people to call in sick, forget to eat, forget to put out the cat…, sorry, got carried away. 😉

Seriously, creating apps that people “buy into” is critical for the success of any app and no less so for semantic apps.

The less colorful summary about the workshop says:

In many application scenarios useful semantic content can hardly be created (fully) automatically, but motivating people to become an active part of this endeavor is still an art more than a science. In this tutorial we will look into fundamental design issues of semantic-content authoring technology – and of the applications deploying such technology – in order to find out which incentives speak to people to become engaged with the Semantic Web, and to determine the ways these incentives can be transferred into technology design. We will present how methods and techniques from areas as diverse as participation management, usability engineering, mechanism design, social computing, and game mechanics can be jointly applied to analyze semantically enabled applications, and subsequently design incentives-compatible variants thereof. The discussion will be framed by three case studies on the topics of enterprise knowledge management, media and entertainment, and IT ecosystems, in which combinations of these methods and techniques has led to increased user participation in creating useful semantic descriptions of various types of digital resources – text documents, images, videos and Web services and APIs. Furthermore, we will revisit the best practices and guidelines that have been at the core of an earlier version of this tutorial at the previous edition of the ISWC in 2010, following the empirical findings and insights gained during the operation of the three case studies just mentioned. These guidelines provide IT developers with a baseline to create technology and end-user applications that are not just functional, but facilitate and encourage user participation that supports the further development of the Semantic Web.

Well, they can say “…facilitate and encourage user participation…” but I’m in favor of addiction. 😉

BTW, notice the Revisited in the title?

You can see the slides from last year, 10 Ways to make your Semantic App. addictive, while you are waiting for this year’s workshop. (I am searching for videos but so far have come up empty. Maybe the organizers can film the presentations this year?)

Date: October 23 or 24, half day
Place: Bonn, Germany, Maritim Bonn

INSEMTIVES

Filed under: Annotation,Authoring Semantics,Games,Semantics — Patrick Durusau @ 1:40 pm

INSEMTIVES: Incentives for Semantics

From the about:

The objective of INSEMTIVES is to bridge the gap between human and computational intelligence in the current semantic content authoring R&D landscape. The project aims at producing methodologies, methods and tools that enable the massive creation and feasible management of semantic content in order to facilitate the world-wide uptake of semantic technologies.

You have to hunt for it (better navigation needed?) but there is a gaming kit for INSEMTIVES at SourceForge.

A mother lode of resources on methods for the creation of semantic content that aren’t boring. 😉

OntoGame

Filed under: Annotation,Games,Semantics — Patrick Durusau @ 1:39 pm

OntoGame: Games for Creation of Semantic Content

From the about page:

OntoGame’s goal is to build games that can be used to create semantic content, a process that often can not be solved automatically but requires the help of humans. Games are a good way to wrap up and hide this complex process of semantic content creation and can attract a great number of people.

If you are tired of over-engineered interfaces for semantic annotation of content, with menus, context-sensitive help that you can’t turn off, and dizzying choices for every step, you will find OntoGame a refreshing step in another direction.

Current games include:

  • Tubelink
  • Seafish
  • SpotTheLink
  • OntoPronto
  • OntoTube

Tubelink and Seafish are single player games so I tried both of those.

Tubelink is an interesting idea: a video plays and you select tags for items that appear in the video and place them in a crystal ball. Placing the tags in the crystal ball at the time the item appears in the video results in more points. Allegedly (I never got this far), if you put enough tags in the crystal ball it bursts and you go to the next level. All I ever got was a video of a car stereo with a very distracting soundtrack. A button to skip the current video would be a useful addition.

Seafish has floating images, some of which are similar to a photo you are shown on the lower right. You have to separate the ones that are similar from those that are not by “catching” them and placing them in separate baskets. The thumbnail images enlarge when you hover over them.

Neither of them is “Tetris” or “Grand Theft Auto IV”, but gaming for the sake of gaming has had decades to develop. Gaming for a useful purpose should be encouraged. It will catch up soon enough.

