Archive for the ‘Curation’ Category

Awesome Deep Learning – Value-Add Curation?

Monday, December 28th, 2015

Awesome Deep Learning by Christos Christofidis.

Tweeted by Gregory Piatetsky as:

Awesome Curated #DeepLearning resources on #GitHub: books, courses, lectures, researchers…

What will you find there? (As of 28 December 2015):

  • Courses – 15
  • Datasets – 114
  • Free Online Books – 8
  • Frameworks – 35
  • Miscellaneous – 26
  • Papers – 32
  • Researchers – 96
  • Tutorials – 13
  • Videos and Lectures – 16
  • Websites – 24

By my count, that’s 359 resources.

We know from detailed analysis of PubMed search logs that 80% of searchers choose a link from the first twenty “hits” returned for a search.

You could assume that out of “23 million user sessions and more than 58 million user queries” PubMed searchers and/or PubMed itself or both transcend the accuracy of searching observed in other contexts. That seems rather unlikely.

The authors note:

Two interesting phenomena are observed: first, the number of clicks for the documents in the later pages degrades exponentially (Figure 8). Second, PubMed users are more likely to click the first and last returned citation of each result page (Figure 9). This suggests that rather than simply following the retrieval order of PubMed, users are influenced by the results page format when selecting returned citations.

Result page format seems like a poor basis for choosing search results, in addition to being in the top twenty (20) results.

Eliminating all the cruft from search results to give you 359 resources is a value-add, but what value-add should be added to this list of resources?

What are the top five (5) value-adds on your list?

Serious question, because we have tools far beyond what were available to curators in the 1960’s, but there is little (if any) curation to match that of the Reader’s Guide to Periodical Literature.

There are sample pages from the 2014 Reader’s Guide to Periodical Literature online.

Here is a screen-shot of some of its contents:


If you can, tell me what search you would use to return that sort of result for “abortion” as a subject.

Nothing comes to mind?

Just to get you started, would pointing to algorithms across these 359 resources be helpful? Would you want to know more than algorithm N occurs in resource Y? Some of the more popular ones may occur in every resource. How helpful is that?
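To make the "algorithm N occurs in resource Y" question concrete, here is a minimal sketch of an inverted index over resources. The resource names and algorithm lists are hypothetical placeholders, not taken from the Awesome Deep Learning list, and the weighting is plain inverse document frequency (IDF), which I am swapping in only to show why an algorithm that occurs everywhere tells you nothing.

```python
import math
from collections import defaultdict

# Hypothetical sample: which algorithms each resource mentions.
resources = {
    "course-A": {"backpropagation", "convolution", "dropout"},
    "paper-B": {"backpropagation", "lstm"},
    "tutorial-C": {"backpropagation", "convolution"},
}


def build_index(resources):
    """Map each algorithm to the set of resources that mention it."""
    index = defaultdict(set)
    for name, algorithms in resources.items():
        for algorithm in algorithms:
            index[algorithm].add(name)
    return index


def idf(index, total):
    """Inverse document frequency: 0.0 for an algorithm found in every resource."""
    return {alg: math.log(total / len(found)) for alg, found in index.items()}


index = build_index(resources)
weights = idf(index, len(resources))

# Rarest (most discriminating) algorithms first.
for algorithm in sorted(weights, key=weights.get, reverse=True):
    print(f"{algorithm}: {weights[algorithm]:.2f} {sorted(index[algorithm])}")
```

An algorithm that appears in all three sample resources gets a weight of zero, which is the point of the question above: a bare occurrence index is only as useful as the terms that separate one resource from another.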

So I repeat my earlier question:

What are the top five (5) value-adds on your list?

Please forward, repost, reblog, tweet. Thanks!

Requirements For A Twitter Client

Sunday, October 18th, 2015

Kurt Cagle writes of needed improvements to Twitter’s “Moments,” in Project Voyager and Moments: Close, but not quite there yet saying:

This week has seen a pair of announcements that are likely to significantly shake up social media as its currently known. Earlier this week, Twitter debuted its Moments, a news service where the highlights of the week are brought together into a curated news aggregator.

However, this is 2015. What is of interest to me – topics such as Data Science, Semantics, Astronomy, Climate Change and so forth, are likely not going to be of interest to others. Similarly, I really have no time for cute pictures of dogs (cats, maybe), the state of the World Series race, the latest political races or other “general” interest topics. In other words, I want to be able to curate content my way, even if the quality is not necessarily the highest, than I do have other people who I do not know decide to curate to the lowest possible denominator.

A very small change, on the other hand, could make a huge difference for Moments for myself and many others. Allow users to aggregate a set of hash tags under a single “Paper section banner” – #datascience, #data, #science, #visualization, #analytics, #stochastics, etc. – could all go under the Data Science banner. Even better yet, throw in a bit of semantics to find every topic within two hops topically to the central terms and use these (with some kind of weighting factor) as well. Rank these tweets according to fitness, then when I come to Twitter I can “read” my twitter paper just by typing in the appropriate headers (or have them auto-populate a list).

My exclusion list would include cats, shootings, bombings, natural disasters, general news and other ephemera that will be replaced by another screaming headline next week, if not tomorrow.

Starting with Kurt’s suggested improvements, a Twitter client should offer:

  • User-based aggregation based upon # tags
  • Learning semantics (Kurt’s two-hop for example)
  • Deduping tweets over a user-set period (day, week, month, other)
  • User-determined sorting of tweets by time/date, author, retweets, favorites
  • Exclusion of tweets without URLs
  • Filtering of tweets based on sender (included by # tags), etc. and perhaps regex
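
Most of the list above can be sketched in a few lines. This is only a rough outline under my own assumptions: the `Tweet` fields, the `BANNERS` table and the `paper_section` function are hypothetical names, nothing here calls the Twitter API, and Kurt's two-hop semantics is left out entirely.

```python
import re
from dataclasses import dataclass

# Hypothetical banner: a set of hashtags gathered under one heading.
BANNERS = {"Data Science": {"#datascience", "#data", "#analytics", "#visualization"}}

HASHTAG = re.compile(r"#\w+")
URL = re.compile(r"https?://\S+")


@dataclass(frozen=True)
class Tweet:
    author: str
    text: str
    timestamp: int  # seconds since the epoch
    retweets: int = 0
    favorites: int = 0


def tags(tweet):
    """Lower-cased hashtags found in the tweet text."""
    return {tag.lower() for tag in HASHTAG.findall(tweet.text)}


def normalize(text):
    """Dedupe key: drop URLs and a leading 'RT @user:' prefix, collapse whitespace."""
    text = URL.sub("", text)
    text = re.sub(r"^RT @\w+:\s*", "", text)
    return " ".join(text.lower().split())


def paper_section(tweets, banner, blocked_authors=(), exclude=None, sort_key="timestamp"):
    """Select, dedupe and sort the tweets that belong under one banner."""
    wanted = BANNERS[banner]
    seen, kept = set(), []
    for tweet in sorted(tweets, key=lambda t: t.timestamp):  # earliest copy wins
        if not tags(tweet) & wanted:
            continue  # not under this banner
        if not URL.search(tweet.text):
            continue  # exclusion of tweets without URLs
        if tweet.author in blocked_authors:
            continue  # filtering by sender
        if exclude and re.search(exclude, tweet.text, re.IGNORECASE):
            continue  # filtering by regex
        key = normalize(tweet.text)
        if key in seen:
            continue  # duplicate or retweet of something already kept
        seen.add(key)
        kept.append(tweet)
    return sorted(kept, key=lambda t: getattr(t, sort_key), reverse=True)
```

Calling `paper_section(tweets, "Data Science", sort_key="retweets")` would give the same section ordered by retweets instead of time. Deduping over a user-set period would only need a timestamp cutoff before the loop.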

I have looked but not found any Twitter client that comes even close.

Other requirements?

Posts from 140 #DataScience Blogs

Sunday, September 13th, 2015

Kirk Borne posted a link, referring to it as:

Recent posts from 150+ #DataScience Blogs worldwide, curated by @dsguidebiz #BigData #Analytics

By my count of the sources listed, the number of sources is 140, as of September 13, 2015.

A wealth of posts and videos!

Everyone who takes advantage of this listing, however, will have to go through the same lists of posts by category.

That repetition, even with searching, seems like a giant time sink to me.


Looking after Datasets

Tuesday, September 1st, 2015

Looking after Datasets by Antony Unwin.

Some examples that Antony uses to illustrate the problems with datasets in R:

You might think that supplying a dataset in an R package would be a simple matter: You include the file, you write a short general description mentioning the background and giving the source, you define the variables. Perhaps you provide some sample analyses and discuss the results briefly. Kevin Wright's agridat package is exemplary in these respects.

As it happens, there are a couple of other issues that turn out to be important. Is the dataset or a version of it already in R and is the name you want to use for the dataset already taken? At this point the experienced R user will correctly guess that some datasets have the same name but are quite different (e.g., movies, melanoma) and that some datasets appear in many different versions under many different names. The best example I know is the Titanic dataset, which is available in the datasets package. You will also find titanic (COUNT, prLogistic, msme), titanic.dat (exactLoglinTest), titan.Dat (elrm), titgrp (COUNT), etitanic (earth), ptitanic (rpart.plot), Lifeboats (vcd), TitanicMat (RelativeRisk), Titanicp (vcdExtra), TitanicSurvival (effects), Whitestar (alr4), and one package, plotrix, includes a manually entered version of the dataset in one of its help examples. The datasets differ on whether the crew is included or not, on the number of cases, on information provided, on formatting, and on discussion, if any, of analyses. Versions with the same names in different packages are not identical. There may be others I have missed.

The issue came up because I was looking for a dataset of the month for the website of my book "Graphical Data Analysis with R". The plan is to choose a dataset from one of the recently released or revised R packages and publish a brief graphical analysis to illustrate and reinforce the ideas presented in the book while showing some interesting information about the data. The dataset finch in dynRB looked rather nice: five species of finch with nine continuous variables and just under 150 cases. It looked promising and what’s more it is related to Darwin’s work and there was what looked like an original reference from 1904.
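
The name clashes Antony lists can be spotted mechanically. The sketch below uses only the (dataset, package) pairs from the passage quoted above, typed in by hand; it does not query an R installation, and the function names are my own. R names are case-sensitive, so the case-insensitive pass is just a way to surface near-clashes such as Titanic versus titanic.

```python
from collections import defaultdict

# (dataset name, package) pairs taken from the list quoted above.
listed = [
    ("Titanic", "datasets"),
    ("titanic", "COUNT"), ("titanic", "prLogistic"), ("titanic", "msme"),
    ("titanic.dat", "exactLoglinTest"),
    ("titan.Dat", "elrm"),
    ("titgrp", "COUNT"),
    ("etitanic", "earth"),
    ("ptitanic", "rpart.plot"),
    ("Lifeboats", "vcd"),
    ("TitanicMat", "RelativeRisk"),
    ("Titanicp", "vcdExtra"),
    ("TitanicSurvival", "effects"),
    ("Whitestar", "alr4"),
]


def packages_by_name(pairs, ignore_case=False):
    """Group package names under each dataset name."""
    groups = defaultdict(list)
    for name, package in pairs:
        key = name.lower() if ignore_case else name
        groups[key].append(package)
    return groups


def clashes(pairs, ignore_case=False):
    """Dataset names that are claimed by more than one package."""
    groups = packages_by_name(pairs, ignore_case)
    return {name: pkgs for name, pkgs in groups.items() if len(pkgs) > 1}


exact = clashes(listed)                    # same spelling, different packages
loose = clashes(listed, ignore_case=True)  # also catches Titanic vs. titanic
print(exact)
print(loose)
```

A name match only tells you where to look. As Antony notes, versions with the same name are not identical, so comparing contents is still a manual job.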

As if Antony’s list of issues wasn’t enough, how do you capture your understanding of a problem with a dataset?

That is, you have discovered the meaning of a variable that isn’t recorded with the dataset. Where are you going to put that information?

You could modify the original dataset to capture that new information but then people will have to discover your version of the original dataset. Not to mention you need to avoid stepping on something else in the original dataset.

Antony concludes:

…returning to Moore’s definition of data, wouldn’t it be a help to distinguish proper datasets from mere sets of numbers in R?

Most people have an intersecting idea of a “proper dataset” but I would spend less time trying to define that and more on capturing the context of whatever appears to me to be a “proper dataset.”

More data is never a bad thing.

Gathering, Extracting, Analyzing Chemistry Datasets

Wednesday, April 22nd, 2015

Activities at the Royal Society of Chemistry to gather, extract and analyze big datasets in chemistry by Antony Williams.

If you are looking for a quick summary of efforts to combine existing knowledge resources in chemistry, you can do far worse than Antony’s 118 slides on the subject (2015).

I want to call special attention to Slide 107 in his slide deck:


True enough, extraction is problematic, expensive, inaccurate, etc., all the things Antony describes. And I would strongly second all of what he implies is the better practice.

However, extraction isn’t just a necessity for today or for a few years, extraction is going to be necessary so long as we keep records about chemistry or any other subject.

Think about all the legacy materials on chemistry that exist in hard copy format just for the past two centuries. To say nothing of all the still older materials. It is more than unfortunate to abandon all that information simply because “modern” digital formats are easier to manipulate.

That wasn’t what Antony meant to imply, but even after all materials have been extracted and exist in some form of digital format, that doesn’t mean the era of “extraction” will have ended.

You may not remember when atomic chemistry used “punch cards” to record isotopes:


An isotope file on punched cards. George M. Murphy J. Chem. Educ., 1947, 24 (11), p 556 DOI: 10.1021/ed024p556 Publication Date: November 1947.

Today we would represent that record in…NoSQL?

Are you confident that in another sixty-eight (68) years we will still be using NoSQL?

We have to choose from the choices available to us today, but we should not deceive ourselves into thinking our solution will be seen as the “best” solution in the future. New data will be discovered, new processes invented, new requirements will emerge, all of which will be clamoring for a “new” solution.

Extraction will persist as long as we keep recording information in the face of changing formats and requirements. We can improve that process but I don’t think we will ever completely avoid it.

A Beginners Guide to Content Creation

Sunday, December 28th, 2014

A Beginners Guide to Content Creation by Kristina Cisnero.

From the post:

From Songza to reddit, content curation is a huge part of the social web as we know it. We’re all on the same mission to find the absolute best material to enjoy and to share with our followers. This is especially true for businesses, whose customers and broader online audience follow them based on an expectation of quality content in return.

What is content curation?

In simple terms, the process of content curation is the act of sorting through large amounts of content on the web and presenting the best posts in a meaningful and organized way. The process can include sifting, sorting, arranging, and placing found content into specific themes, and then publishing that information.

In other words, content curation is very different from content marketing. Content curation doesn’t include creating new content; it’s the act of discovering, compiling, and sharing existing content with your online followers. Content curation is becoming an important tactic for any marketing department to maintain a successful online presence. Not only that, but content curation allows you to provide extra value to your brand’s audience and customers, which is key to building those lasting relationships with loyal fans.

It had not occurred to me that “content curation” might need definition. Kristina not only defines “content curation” but also illustrates why it is a value-add.

Because the post was written in a web context, curation is defined relative to web content, but curation can include (particularly with a topic map) any content of any form at any location. Some content may be more accessible than other content but web accessibility isn’t a requirement for curation. (Unless that is one of your requirements.)

Curated content can save your staff time and provide accurate results. Not to mention enabling informal knowledge to persist despite personnel changes. (Corporate memory)

Big Data – A curated list of big data frameworks, resources and tools

Sunday, September 28th, 2014

Big Data – A curated list of big data frameworks, resources and tools by Andrea Mostosi.

From the post:

“Big-data” is one of the most inflated buzzword of the last years. Technologies born to handle huge datasets and overcome limits of previous products are gaining popularity outside the research environment. The following list would be a reference of this world. It’s still incomplete and always will be.

Four hundred and eighty-four (484) resources by my count.

An impressive collection but HyperGraphDB is missing from this list.

Others that you can name off hand?

I don’t think the solution to the many partial “Big Data” lists of software, techniques and other resources is to create yet another list of the same. That would be a duplicated (and doomed) effort.



Tools and Resources Development Fund [bioscience ODF UK]

Friday, July 25th, 2014

Tools and Resources Development Fund

Application deadline: 17 September 2014, 4pm

From the summary:

Our Tools and Resources Development Fund (TRDF) aims to pump prime the next generation of tools, technologies and resources that will be required by bioscience researchers in scientific areas within our remit. It is anticipated that successful grants will not exceed £150k (£187k FEC) (ref 1) and a fast-track, light touch peer review process will operate to enable researchers to respond rapidly to emerging challenges and opportunities.

Projects are expected to have a maximum value of £150k (ref 1). The duration of projects should be between 6 and 18 months, although community networks to develop standards could be supported for up to 3 years.

A number of different types of proposal are eligible for consideration.

  • New approaches to the analysis, modelling and interpretation of research data in the biological sciences, including development of software tools and algorithms. Of particular interest will be proposals that address challenges arising from emerging new types of data and proposals that address known problems associated with data handling (e.g. next generation sequencing, high-throughput phenotyping, the extraction of data from challenging biological images, metagenomics).
  • New frameworks for the curation, sharing, and re-use/re-purposing of research data in the biological sciences, including embedding data citation mechanisms (e.g. persistent identifiers for datasets within research workflows) and novel data management planning (DMP) implementations (e.g. integration of DMP tools within research workflows)
  • Community approaches to the sharing of research data including the development of standards (this could include coordinating UK input into international standards development activities).
  • Approaches designed to exploit the latest computational technology to further biological research; for example, to facilitate the use of cloud computing approaches or high performance computing architectures.

Projects may extend existing software resources; however, the call is designed to support novel tools and methods. Incremental improvement and maintenance of existing software that does not provide new functionality or significant performance improvements (e.g. by migration to an advanced computing environment) does not fall within the scope of the call.

Very timely, given the UK announcement that OpenDocument Format (ODF) is among the open standards:

The standards set out the document file formats that are expected to be used across all government bodies. Government will begin using open formats that will ensure that citizens and people working in government can use the applications that best meet their needs when they are viewing or working on documents together. (Open document formats selected to meet user needs)

ODF as a format supports RDFa as metadata but lacks an implementation that makes full use of that capability.

Imagine biocuration that:

  • Starts with authors writing a text and is delivered to
  • Publishers, who can proof or augment the author’s biocuration
  • Results are curated on publication (not months or years later)
  • Results are immediately available for collation with other results.
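
To give a feel for what "immediately available for collation" could mean, here is a small sketch that pulls RDFa-style statements out of an ODF-like XML fragment. The fragment, the `example.org` identifiers and the function name are hypothetical, and the attribute names (`xhtml:about`, `xhtml:property`, `xhtml:content` on `text:meta`) reflect my reading of the ODF 1.2 in-content metadata, so check them against the specification before relying on them.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical fragment in the style of a paragraph from an ODF content.xml.
SAMPLE = f"""
<text:p xmlns:text="{TEXT_NS}" xmlns:xhtml="{XHTML_NS}">
  Expression of <text:meta xhtml:about="http://example.org/gene/1"
      xhtml:property="http://example.org/vocab#expressedIn">kidney</text:meta>
  was observed.
</text:p>
"""


def extract_statements(xml_text):
    """Return (subject, property, value) tuples for annotated elements."""
    root = ET.fromstring(xml_text.strip())
    statements = []
    for element in root.iter():
        prop = element.get(f"{{{XHTML_NS}}}property")
        if prop is None:
            continue  # element carries no annotation
        subject = element.get(f"{{{XHTML_NS}}}about")
        # Assumption: xhtml:content, when present, replaces the visible text.
        value = element.get(f"{{{XHTML_NS}}}content")
        if value is None:
            value = "".join(element.itertext()).strip()
        statements.append((subject, prop, value))
    return statements


print(extract_statements(SAMPLE))
```

A real document would be read out of the ODF zip package first. The point is only that annotations made while writing can be harvested without a separate extraction pass.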

The only way to match the explosive growth of bioscience publications with equally explosive growth of bioscience curation is to use tools the user already knows. Like word processing software.

Please pass this along and let me know of other grants or funding opportunities where adaptation of office standards or software could change the fundamentals of workflow.

BioCreative Resources (and proceedings)

Wednesday, November 13th, 2013

BioCreative Resources (and proceedings)

From the Overview page:

The growing interest in information retrieval (IR), information extraction (IE) and text mining applied to the biological literature is related to the increasing accumulation of scientific literature (PubMed has currently (2005) over 15,000,000 entries) as well as the accelerated discovery of biological information obtained through characterization of biological entities (such as genes and proteins) using high-through put and large scale experimental techniques [1].

Computational techniques which process the biomedical literature are useful to enhance the efficient access to relevant textual information for biologists, bioinformaticians as well as for database curators. Many systems have been implemented which address the identification of gene/protein mentions in text or the extraction of text-based protein-protein interactions and of functional annotations using information extraction and text mining approaches [2].

To be able to evaluate performance of existing tools, as well as to allow comparison between different strategies, common evaluation standards as well as data sets are crucial. In the past, most of the implementations have focused on different problems, often using private data sets. As a result, it has been difficult to determine how good the existing systems were or to reproduce the results. It is thus cumbersome to determine whether the systems would scale to real applications, and what performance could be expected using a different evaluation data set [3-4].

The importance of assessing and comparing different computational methods have been realized previously by both, the bioinformatics and the NLP communities. Researchers in natural language processing (NLP) and information extraction (IE) have, for many years now, used common evaluations to accelerate their research progress, e.g., via the Message Understanding Conferences (MUCs) [5] and the Text Retrieval Conferences (TREC) [6]. This not only resulted in the formulation of common goals but also made it possible to compare different systems and gave a certain transparency to the field. With the introduction of a common evaluation and standardized evaluation metrics, it has become possible to compare approaches, to assess what techniques did and did not work, and to make progress. This progress has resulted in the creation of standard tools available to the general research community.

The field of bioinformatics also has a tradition of competitions, for example, in protein structure prediction (CASP [7]) or gene predictions in entire genomes (at the “Genome Based Gene Structure Determination” symposium held on the Wellcome Trust Genome Campus).

There has been a lot of activity in the field of text mining in biology, including sessions at the Pacific Symposium of Biocomputing (PSB [8]), the Intelligent Systems for Molecular Biology (ISMB) and European Conference on Computational Biology (ECCB) conferences [9] as well workshops and sessions on language and biology in computational linguistics (the Association of Computational Linguistics BioNLP SIGs).

A small number of complementary evaluations of text mining systems in biology have been recently carried out, starting with the KDD cup [10] and the genomics track at the TREC conference [11]. Therefore we decided to set up the first BioCreAtIvE challenge which was concerned with the identification of gene mentions in text [12], to link texts to actual gene entries, as provided by existing biological databases, [13] as well as extraction of human gene product (Gene Ontology) annotations from full text articles [14]. The success of this first challenge evaluation as well as the lessons learned from it motivated us to carry out the second BioCreAtIvE, which should allow us to monitor improvements and build on the experience and data derived from the first BioCreAtIvE challenge. As in the previous BioCreAtIvE, the main focus is on biologically relevant tasks, which should result in benefits for the biomedical text mining community, the biology and biological database community, as well as the bioinformatics community.

A gold mine of resources if you are interested in bioinformatics, curation or IR in general.

Including the BioCreative Proceedings for 2013:

BioCreative IV Proceedings vol. 1

BioCreative IV Proceedings vol. 2

CoIN: a network analysis for document triage

Wednesday, November 13th, 2013

CoIN: a network analysis for document triage by Yi-Yu Hsu and Hung-Yu Kao. (Database (2013) 2013 : bat076 doi: 10.1093/database/bat076)


In recent years, there was a rapid increase in the number of medical articles. The number of articles in PubMed has increased exponentially. Thus, the workload for biocurators has also increased exponentially. Under these circumstances, a system that can automatically determine in advance which article has a higher priority for curation can effectively reduce the workload of biocurators. Determining how to effectively find the articles required by biocurators has become an important task. In the triage task of BioCreative 2012, we proposed the Co-occurrence Interaction Nexus (CoIN) for learning and exploring relations in articles. We constructed a co-occurrence analysis system, which is applicable to PubMed articles and suitable for gene, chemical and disease queries. CoIN uses co-occurrence features and their network centralities to assess the influence of curatable articles from the Comparative Toxicogenomics Database. The experimental results show that our network-based approach combined with co-occurrence features can effectively classify curatable and non-curatable articles. CoIN also allows biocurators to survey the ranking lists for specific queries without reviewing meaningless information. At BioCreative 2012, CoIN achieved a 0.778 mean average precision in the triage task, thus finishing in second place out of all participants.

Database URL:

From the introduction:

Network analysis concerns the relationships between processing entities. For example, the nodes in a social network are people, and the links are the friendships between the nodes. If we apply these concepts to the ACT, PubMed articles are the nodes, while the co-occurrences of gene–disease, gene–chemical and chemical–disease relationships are the links. Network analysis provides a visual map and a graph-based technique for determining co-occurrence relationships. These graphical properties, such as size, degree, centralities and similar features, are important. By examining the graphical properties, we can gain a global understanding of the likely behavior of the network. For this purpose, this work focuses on two themes concerning the applications of biocuration: using the co-occurrence–based approach to obtain a normalized co-occurrence score and using the network-based approach to measure network properties, e.g. betweenness and PageRank. CoIN integrates co-occurrence features and network centralities when curating articles. The proposed method combines the co-occurrence frequency with the network construction from text. The co-occurrence networks are further analyzed to obtain the linking and shortest path features of the network centralities.

The authors ultimately conclude that the network-based approaches perform better than collocation-based approaches.
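
For readers who want to see the network idea in miniature, here is a rough sketch, not the authors' implementation. Articles are nodes, two articles are linked when they share a co-occurring entity pair, and PageRank is computed by plain power iteration. The article identifiers and entity pairs are made up, and the normalized co-occurrence score and betweenness from the paper are not reproduced.

```python
from itertools import combinations

# Hypothetical articles, each with the entity pairs that co-occur in it.
articles = {
    "pmid-1": {("geneA", "disease1"), ("geneA", "chem1")},
    "pmid-2": {("geneA", "disease1")},
    "pmid-3": {("geneA", "chem1"), ("chem2", "disease2")},
    "pmid-4": {("chem2", "disease2")},
    "pmid-5": {("geneZ", "disease9")},
}


def build_graph(articles):
    """Undirected graph: link two articles that share a co-occurring pair."""
    graph = {article: set() for article in articles}
    for a, b in combinations(articles, 2):
        if articles[a] & articles[b]:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """PageRank by power iteration; isolated nodes spread their rank evenly."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in graph if not graph[node])
        updated = {}
        for node in graph:
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            updated[node] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = updated
    return rank


graph = build_graph(articles)
ranks = pagerank(graph)
for article in sorted(ranks, key=ranks.get, reverse=True):
    print(article, round(ranks[article], 3))
```

In this toy network the two articles that bridge shared relationships score highest and the article that shares nothing scores lowest, which is the intuition behind using centrality to prioritize articles for curation.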

If this post sounds hauntingly familiar, you may be thinking about Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information, which was the first place finisher at BioCreative 2012 with a mean average precision (MAP) score of 0.8030.
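
If mean average precision is unfamiliar, the usual textbook definition is easy to state in code. This is a sketch of that definition with made-up rankings. The exact evaluation protocol used at BioCreative 2012 may differ in details, so don't expect it to reproduce the 0.778 or 0.8030 figures.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    hits = 0
    total = 0.0
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average the per-query average precision over all queries."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Hypothetical ranked lists and their sets of curatable documents.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y"], {"y"}),                 # AP = (1/2) / 1
]
print(round(mean_average_precision(runs), 4))
```

Ranking a curatable article lower in the list costs more than simply missing it in a flat accuracy count, which is why MAP suits a triage task.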

9th International Digital Curation Conference

Tuesday, October 15th, 2013

Commodity, catalyst or change-agent? Data-driven transformations in research, education, business & society.

From the post:

24 – 27 February 2014
Omni San Francisco Hotel, San Francisco



The 9th International Digital Curation Conference (IDCC) will be held from Monday 24 February to Thursday 27 February 2014 at the Omni San Francisco Hotel (at Montgomery).

The Omni hotel is in the heart of downtown San Francisco. It is located right on the cable car line and is only a short walk to Union Square, the San Francisco neighborhood that has become a mecca for high-end shopping and art galleries.

This year the IDCC will focus on how data-driven developments are changing the world around us, recognising that the growing volume and complexity of data provides institutions, researchers, businesses and communities with a range of exciting opportunities and challenges. The Conference will explore the expanding portfolio of tools and data services, as well as the diverse skills that are essential to explore, manage, use and benefit from valuable data assets. The programme will reflect cultural, technical and economic perspectives and will illustrate the progress made in this arena in recent months

There will be a programme of workshops on Monday 24 and Thursday 27 February. The main conference programme will run from Tuesday 25 – Wednesday 26 February.

Registration will open in October (but it doesn’t say when in October).

While you are waiting:

Our last IDCC took place in Amsterdam, 14-17 January 2013. If you were not able to attend you can now access all the presentations, videos and photos online, and much more!


Preliminary evaluation of the CellFinder literature…

Friday, April 19th, 2013

Preliminary evaluation of the CellFinder literature curation pipeline for gene expression in kidney cells and anatomical parts by Mariana Neves, Alexander Damaschun, Nancy Mah, Fritz Lekschas, Stefanie Seltmann, Harald Stachelscheid, Jean-Fred Fontaine, Andreas Kurtz, and Ulf Leser. (Database (2013) 2013 : bat020 doi: 10.1093/database/bat020)


Biomedical literature curation is the process of automatically and/or manually deriving knowledge from scientific publications and recording it into specialized databases for structured delivery to users. It is a slow, error-prone, complex, costly and, yet, highly important task. Previous experiences have proven that text mining can assist in its many phases, especially, in triage of relevant documents and extraction of named entities and biological events. Here, we present the curation pipeline of the CellFinder database, a repository of cell research, which includes data derived from literature curation and microarrays to identify cell types, cell lines, organs and so forth, and especially patterns in gene expression. The curation pipeline is based on freely available tools in all text mining steps, as well as the manual validation of extracted data. Preliminary results are presented for a data set of 2376 full texts from which >4500 gene expression events in cell or anatomical part have been extracted. Validation of half of this data resulted in a precision of ∼50% of the extracted data, which indicates that we are on the right track with our pipeline for the proposed task. However, evaluation of the methods shows that there is still room for improvement in the named-entity recognition and that a larger and more robust corpus is needed to achieve a better performance for event extraction.

Database URL:

Another extremely useful data curation project.

Do you get the impression that curation projects will continue to be outrun by data production?

And that will be the case, even with machine assistance?

Is there an alternative to falling further and further behind?

Such as leaving some content (CNN?) to go forever uncurated? Or doing the same for government documents/reports?

I am sure we all have different suggestions for what data to dump alongside the road to make room for the “important” stuff.

Suggestions on solutions other than simply dumping data?

… Preservation and Stewardship of Scholarly Works, 2012 Supplement

Tuesday, March 19th, 2013

Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, 2012 Supplement by Charles W. Bailey, Jr.

From the webpage:

In a rapidly changing technological environment, the difficult task of ensuring long-term access to digital information is increasingly important. The Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, 2012 Supplement presents over 130 English-language articles, books, and technical reports published in 2012 that are useful in understanding digital curation and preservation. This selective bibliography covers digital curation and preservation copyright issues, digital formats (e.g., media, e-journals, research data), metadata, models and policies, national and international efforts, projects and institutional implementations, research studies, services, strategies, and digital repository concerns.

It is a supplement to the Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, which covers over 650 works published from 2000 through 2011. All included works are in English. The bibliography does not cover conference papers, digital media works (such as MP3 files), editorials, e-mail messages, letters to the editor, presentation slides or transcripts, unpublished e-prints, or weblog postings.

The bibliography includes links to freely available versions of included works. If such versions are unavailable, italicized links to the publishers' descriptions are provided.

Links, even to publisher versions and versions in disciplinary archives and institutional repositories, are subject to change. URLs may alter without warning (or automatic forwarding) or they may disappear altogether. Inclusion of links to works on authors' personal websites is highly selective. Note that e-prints and published articles may not be identical.

The bibliography is available under a Creative Commons Attribution-NonCommercial 3.0 Unported License.

Supplement to “the” starting point for research on digital curation.

Curating Inorganics? No. (ChEMBL)

Monday, March 18th, 2013

The results are in – inorganics are out!

From the ChEMBL-og blog which “covers the activities of the Computational Chemical Biology Group at the EMBL-EBI in Hinxton.”

From the post:

A few weeks ago we ran a small poll on how we should deal with inorganic molecules – not just simple sodium salts, but things like organoplatinums, and other compounds with dative bonds, unusual electronic states, etc. The results from you were clear, there was little interest in having a lot of our curation time spent on these. We will continue to collect structures from the source journals, and they will be in the full database, but we won’t try and curate the structures, or display them in the interface. They will be appropriately flagged, and nothing will get lost. So there it is, democracy in action.

So for ChEMBL 16 expect fewer issues when you try and load our structures in your own pipelines and systems.

Just an FYI that inorganic compounds are not being curated at ChEMBL.

If you decide to undertake such work, contacting ChEMBL to coordinate collection, etc., would be a good first step.

Crowdsourced Chemistry… [Documents vs. Data]

Monday, March 18th, 2013

Crowdsourced Chemistry: Why Online Chemistry Data Needs Your Help by Antony Williams. (video)

From the description:

This is the Ignite talk that I gave at ScienceOnline2010 #sci010 in the Research Triangle Park in North Carolina on January 16th 2010. This was supposed to be a 5 minute talk highlighting the quality of chemistry data on the internet. Ok, it was a little tongue in cheek because it was an after dinner talk and late at night but the data are real, the problem is real and the need for data curation of chemistry data online is real. On ChemSpider we have provided a platform to deposit and curate data. Other videos will show that in the future.

Great demonstration of the need for curation in chemistry.

And of the impact that re-usable information can have on the quality of information.

The errors in chemical descriptions you see in this video could be corrected in:

  • An article.
  • A monograph.
  • A webpage.
  • An online resource that can be incorporated by reference.

Which one do you think would propagate the corrected information more quickly?

Documents are a great way to convey information to a reader.

They are an incredibly poor way to store/transmit information.

Every reader has to extract the information in a document for themselves.

Not to mention that the data in a document is fixed, unless the document incorporates information by reference.

Funny isn’t it? We are still storing data as we did when clay tablets were the medium of choice.

Isn’t it time we separated presentation (documents) from storage/transmission (data)?

Studying PubMed usages in the field…

Monday, March 11th, 2013

Studying PubMed usages in the field for complex problem solving: Implications for tool design by Barbara Mirel, Jennifer Steiner Tonks, Jean Song, Fan Meng, Weijian Xuan, Rafiqa Ameziane. (Mirel, B., Tonks, J. S., Song, J., Meng, F., Xuan, W. and Ameziane, R. (2013), Studying PubMed usages in the field for complex problem solving: Implications for tool design. J. Am. Soc. Inf. Sci. doi: 10.1002/asi.22796)


Many recent studies on MEDLINE-based information seeking have shed light on scientists’ behaviors and associated tool innovations that may improve efficiency and effectiveness. Few, if any, studies, however, examine scientists’ problem-solving uses of PubMed in actual contexts of work and corresponding needs for better tool support. Addressing this gap, we conducted a field study of novice scientists (14 upper-level undergraduate majors in molecular biology) as they engaged in a problem-solving activity with PubMed in a laboratory setting. Findings reveal many common stages and patterns of information seeking across users as well as variations, especially variations in cognitive search styles. Based on these findings, we suggest tool improvements that both confirm and qualify many results found in other recent studies. Our findings highlight the need to use results from context-rich studies to inform decisions in tool design about when to offer improved features to users.

From the introduction:

For example, our findings confirm that additional conceptual information integrated into retrieved results could expedite getting to relevance. Yet—as a qualification—evidence from our field cases suggests that presentations of this information need to be strategically apportioned and staged or they may inadvertently become counterproductive due to cognitive overload.

Curated data raises its ugly head, again.

Topic maps curate data and search results.

Search engines don’t curate data or search results.

How important is it for your doctor to find the right answers? In a timely manner?

Research Data Symposium – Columbia

Saturday, March 9th, 2013

Research Data Symposium – Columbia.

Posters from the Research Data Symposium, held at Columbia University, February 27, 2013.

Subject to the limitations of the poster genre but useful as a quick overview of current projects and directions.

Shedding Light on the Dark Data in the Long Tail of Science

Friday, March 1st, 2013

Shedding Light on the Dark Data in the Long Tail of Science by P. Bryan Heidorn. (P. Bryan Heidorn. “Shedding Light on the Dark Data in the Long Tail of Science.” Library Trends 57.2 (2008): 280-299. Project MUSE. Web. 28 Feb. 2013.)


One of the primary outputs of the scientific enterprise is data, but many institutions such as libraries that are charged with preserving and disseminating scholarly output have largely ignored this form of documentation of scholarly activity. This paper focuses on a particularly troublesome class of data, termed dark data. “Dark data” is not carefully indexed and stored so it becomes nearly invisible to scientists and other potential users and therefore is more likely to remain underutilized and eventually lost. The article discusses how the concepts from long-tail economics can be used to understand potential solutions for better curation of this data. The paper describes why this data is critical to scientific progress, some of the properties of this data, as well as some social and technical barriers to proper management of this class of data. Many potentially useful institutional, social, and technical solutions are under development and are introduced in the last sections of the paper, but these solutions are largely unproven and require additional research and development.

From the article:

In this paper we will use the term dark data to refer to any data that is not easily found by potential users. Dark data may be positive or negative research findings or from either “large” or “small” science. Like dark matter, this dark data on the basis of volume may be more important than that which can be easily seen. The challenge for science policy is to develop institutions and practices such as institutional repositories, which make this data useful for society.

Dark Data = Any data that is not easily found by potential users.

A number of causes are discussed, not the least of which is our old friend, the Tower of Babel.

A final barrier that cannot be overlooked is the Digital Tower of Babel that we have created with seemingly countless proprietary as well as open data formats. This can include versions of the same software products that are incompatible. Some of these formats are very efficient for the individual applications for which they were designed including word processing, databases, spreadsheets, and others, but they are ineffective to support interoperability and preservation.

As you know already, I don’t think the answer to data curation, long term, lies in uniform formats.

Uniform formats are very useful but are domain, project and time bound.

The questions always are:

“What do we do when we change data formats?”

“Do we dump data in old formats that we spent $$$ developing?”

“Do we migrate data in old formats, assuming anyone remembers the old format?”

“Do we document and map across old and new formats, preparing for the next ‘new’ format?”

None of the answers are automatic or free.

But it is better to make an informed choice than a default one of letting potentially valuable data rot.
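
The fourth question above, documenting and mapping across old and new formats, can be made a little more concrete. What follows is only a minimal sketch in Python: the field names (`temp_f`, `temperature_c`, `site`, `site_id`) and the `migrate` helper are hypothetical, chosen for illustration. The point is that the mapping itself is recorded as data, so it survives as documentation when the next "new" format arrives.

```python
# Hypothetical mapping from an old record layout to a new one.
# Each entry documents: old field -> (new field, conversion function).
FIELD_MAP = {
    "temp_f": ("temperature_c", lambda f: round((float(f) - 32) * 5 / 9, 2)),
    "site": ("site_id", str.strip),
}


def migrate(record, field_map=FIELD_MAP):
    """Translate one old-format record into the new format.

    Fields without a documented mapping are not dropped; they are
    carried along under the key 'unmapped' so nothing is silently lost.
    """
    new, unmapped = {}, {}
    for key, value in record.items():
        if key in field_map:
            new_key, convert = field_map[key]
            new[new_key] = convert(value)
        else:
            unmapped[key] = value
    if unmapped:
        new["unmapped"] = unmapped
    return new


old_record = {"temp_f": "212", "site": " A1 ", "note": "calibrated"}
new_record = migrate(old_record)
```

Keeping unmapped fields rather than discarding them is a design choice in this sketch: it defers the decision about old data instead of making the default one.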

Looking out for the little guy: Small data curation

Friday, March 1st, 2013

Looking out for the little guy: Small data curation by Katherine Goold Akers. (Akers, K. G. (2013), Looking out for the little guy: Small data curation. Bul. Am. Soc. Info. Sci. Tech., 39: 58–59. doi: 10.1002/bult.2013.1720390317)


While big data and its management are in the spotlight, a vast number of important research projects generate relatively small amounts of data that are nonetheless valuable yet rarely preserved. Such studies are often focused precursors to follow-up work and generate less noisy data than grand scale projects. Yet smaller quantity does not equate to simpler management. Data from smaller studies may be captured in a variety of file formats with no standard approach to documentation, metadata or preparation for archiving or reuse, making its curation even more challenging than for big data. As the information managers most likely to encounter small datasets, academic librarians should cooperate to develop workable strategies to document, organize, preserve and disseminate local small datasets so that valuable scholarly information can be discovered and shared.

A reminder that for every “big data” project in need of curation, there are many more smaller, less well known projects that need the same services.

Since topic maps don’t require global or even regional agreement on ontology or methodological issues, it should be easier for academic librarians to create topic maps to curate small datasets.

When it is necessary or desired to merge small datasets that were curated with different topic map assumptions, new topics can be created that merge the data that existed in separate topic maps.

But only when necessary and at the point of merging.
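
Merging at the point of need might look something like the following minimal sketch in Python. It assumes a deliberately simplified model in which a topic is just a set of subject identifiers plus a set of names, and two topics are merged whenever they share an identifier. The `merge_topics` function and the `ex:` identifiers are hypothetical and do not come from any topic map engine.

```python
def merge_topics(map_a, map_b):
    """Merge two lists of topics.

    Topics that share at least one subject identifier are combined
    into a single topic holding the union of identifiers and names.
    """
    merged = []
    for topic in [*map_a, *map_b]:
        ids = set(topic["identifiers"])
        names = set(topic["names"])
        rest = []
        for other in merged:
            if other["identifiers"] & ids:
                # Same subject: absorb the earlier topic into this one.
                ids |= other["identifiers"]
                names |= other["names"]
            else:
                rest.append(other)
        rest.append({"identifiers": ids, "names": names})
        merged = rest
    return merged


# Two small datasets curated independently (hypothetical identifiers).
lab_a = [{"identifiers": {"ex:rainfall"}, "names": {"precip_mm"}}]
lab_b = [
    {"identifiers": {"ex:rainfall", "ex:precipitation"}, "names": {"Rainfall (mm)"}},
    {"identifiers": {"ex:site"}, "names": {"Plot"}},
]
combined = merge_topics(lab_a, lab_b)
```

Neither lab had to agree on names in advance; the only work done is at the moment the two maps are brought together.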

To say it another way, topic maps need not anticipate or fear the future. Tomorrow will take care of itself.

Unlike “now I am awake” approaches, which must fear that the next moment of consciousness will bring change.

Interactive Text Mining

Sunday, January 20th, 2013

An overview of the BioCreative 2012 Workshop Track III: interactive text mining task by Cecilia N. Arighi, et al. (Database (2013) 2013 : bas056 doi: 10.1093/database/bas056)


In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. 
The analysis of this survey highlights how important task completion is to the biocurators’ overall experience of a system, regardless of the system’s high score on design, learnability and usability. In addition, strategies to refine the annotation guidelines and systems documentation, to adapt the tools to the needs and query types the end user might have and to evaluate performance in terms of efficiency, user interface, result export and traditional evaluation metrics have been analyzed during this task. This analysis will help to plan for a more intense study in BioCreative IV.

Curation is an aspect of topic map authoring, albeit with the latter capturing information for later merging with other sources of information.

Definitely an article you will want to read if you are designing text mining as part of a topic map solution.

Research Data Curation Bibliography

Wednesday, January 16th, 2013

Research Data Curation Bibliography (version 2) by Charles W. Bailey.

From the introduction:

The Research Data Curation Bibliography includes selected English-language articles, books, and technical reports that are useful in understanding the curation of digital research data in academic and other research institutions. For broader coverage of the digital curation literature, see the author's Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, which presents over 650 English-language articles, books, and technical reports.

The "digital curation" concept is still evolving. In "Digital Curation and Trusted Repositories: Steps toward Success," Christopher A. Lee and Helen R. Tibbo define digital curation as follows:

Digital curation involves selection and appraisal by creators and archivists; evolving provision of intellectual access; redundant storage; data transformations; and, for some materials, a commitment to long-term preservation. Digital curation is stewardship that provides for the reproducibility and re-use of authentic digital data and other digital assets. Development of trustworthy and durable digital repositories; principles of sound metadata creation and capture; use of open standards for file formats and data encoding; and the promotion of information management literacy are all essential to the longevity of digital resources and the success of curation efforts.

This bibliography does not cover conference papers, digital media works (such as MP3 files), editorials, e-mail messages, interviews, letters to the editor, presentation slides or transcripts, unpublished e-prints, or weblog postings. Coverage of technical reports is very selective.

Most sources have been published from 2000 through 2012; however, a limited number of earlier key sources are also included. The bibliography includes links to freely available versions of included works. If such versions are unavailable, italicized links to the publishers' descriptions are provided.

Such links, even to publisher versions and versions in disciplinary archives and institutional repositories, are subject to change. URLs may alter without warning (or automatic forwarding) or they may disappear altogether. Inclusion of links to works on authors' personal websites is highly selective. Note that e-prints and published articles may not be identical.

An archive of prior versions of the bibliography is available.

If you are a beginning library student, take the time to know the work of Charles Bailey. He has consistently made a positive contribution for researchers from very early in the so-called digital revolution.

To the extent that you want to design topic maps for data curation, long or short term, the 200+ items in this bibliography will introduce you to some of the issues you will be facing.

The Xenbase literature curation process

Saturday, January 12th, 2013

The Xenbase literature curation process by Jeff B. Bowes, Kevin A. Snyder, Christina James-Zorn, Virgilio G. Ponferrada, Chris J. Jarabek, Kevin A. Burns, Bishnu Bhattacharyya, Aaron M. Zorn and Peter D. Vize.


Xenbase is the model organism database for Xenopus tropicalis and Xenopus laevis, two frog species used as model systems for developmental and cell biology. Xenbase curation processes centre on associating papers with genes and extracting gene expression patterns. Papers from PubMed with the keyword ‘Xenopus’ are imported into Xenbase and split into two curation tracks. In the first track, papers are automatically associated with genes and anatomy terms, images and captions are semi-automatically imported and gene expression patterns found in those images are manually annotated using controlled vocabularies. In the second track, full text of the same papers are downloaded and indexed by a number of controlled vocabularies and made available to users via the Textpresso search engine and text mining tool.

Which curation workflow will work best for your topic map activities will depend upon a number of factors.

What would you adopt, adapt or alter from the curation workflow in this article?

How would you evaluate the effectiveness of any of your changes?

Give me human editors and the New York Times

Friday, November 30th, 2012

Techmeme founder: Give me human editors and the New York Times by Jeff John Roberts.

From the post:

At the event in New York, which was hosted by media company Outbrain, Rivera explained to Business Insider’s Steve Kovach why algorithms will never be able to curate as effectively as humans.

“A lot of people who think they can go all the way with the automated approach fail to realize a news story has become obsolete,” said Rivera, explaining that an article can be quickly superseded even if it receives a million links or tweets.

This is why Rivera now relies on human editors to shepherd the headlines that bubble up and swat down the inappropriate ones. He argues any serious tech or political news provider will always have to do the same.

Rivera is also not enthused about social-based news platforms — sites like LinkedIn Today or Flipboard that assemble news stories based on what your friends are sharing on social media. Asked if Techmeme will offer a social-based news feed, Rivera said don’t count on it.

“People like to go to the New York Times and look at what’s on the front page because they have a lot of trust in what editors decide and they know other people read it. We want to do the same thing,” he said. “There’s value in being divorced from your friends … I’d rather see what’s on the front of the New York Times.”

Are you trapped in a social media echo chamber?

Escape with the New York Times.

I first saw this in a tweet by Peter Cooper.

Collaborative biocuration… [Pre-Topic Map Tasks]

Monday, November 26th, 2012

Collaborative biocuration—text-mining development task for document prioritization for curation by Thomas C. Wiegers, Allan Peter Davis and Carolyn J. Mattingly. (Database (2012) 2012 : bas037 doi: 10.1093/database/bas037)


The Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems for the biological domain. The ‘BioCreative Workshop 2012’ subcommittee identified three areas, or tracks, that comprised independent, but complementary aspects of data curation in which they sought community input: literature triage (Track I); curation workflow (Track II) and text mining/natural language processing (NLP) systems (Track III). Track I participants were invited to develop tools or systems that would effectively triage and prioritize articles for curation and present results in a prototype web interface. Training and test datasets were derived from the Comparative Toxicogenomics Database (CTD) and consisted of manuscripts from which chemical–gene–disease data were manually curated. A total of seven groups participated in Track I. For the triage component, the effectiveness of participant systems was measured by aggregate gene, disease and chemical ‘named-entity recognition’ (NER) across articles; the effectiveness of ‘information retrieval’ (IR) was also measured based on ‘mean average precision’ (MAP). Top recall scores for gene, disease and chemical NER were 49, 65 and 82%, respectively; the top MAP score was 80%. Each participating group also developed a prototype web interface; these interfaces were evaluated based on functionality and ease-of-use by CTD’s biocuration project manager. In this article, we present a detailed description of the challenge and a summary of the results.

The results:

“Top recall scores for gene, disease and chemical NER were 49, 65 and 82%, respectively; the top MAP score was 80%.”

indicate there is plenty of room for improvement. Perhaps even commercially viable improvement.
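
For readers who have not met the two metrics quoted above, here is a minimal sketch in Python of how recall and mean average precision are commonly computed. The function names and the toy data are my own assumptions for illustration; this is not the BioCreative evaluation code, and it assumes a ranked list contains no duplicates.

```python
def recall(predicted, gold):
    """Fraction of gold-standard items that the system found."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(gold & set(predicted)) / len(gold)


def average_precision(ranked, relevant):
    """Average of precision@k, taken at each rank k holding a relevant item."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy example: relevant articles at ranks 1 and 3 of a four-article ranking.
ap = average_precision(["a", "b", "c", "d"], {"a", "c"})
```

Recall rewards finding the entities at all; MAP rewards putting the relevant articles near the top of the list, which is what matters to a curator working from the top down.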

In hindsight, not talking about how to make a topic map alongside ISO 13250 may have been a mistake. Even admitting there are multiple ways to get there, a technical report outlining one or two ways would have made the process more transparent.

Answering the question “What can you say with a topic map?” with “Anything you want.” was a truthful answer but not a helpful one.

I should try to crib something from one of those “how to write a research paper” guides. I haven’t looked at one in years but the process is remarkably similar to what would result in a topic map.

Some of the mechanics are different but the underlying intellectual process is quite similar. Everyone who has been to college (at least those of my age) had a course that talked about writing research papers. So the terminology should be familiar.


eGIFT: Mining Gene Information from the Literature

Thursday, November 22nd, 2012

eGIFT: Mining Gene Information from the Literature by Catalina O Tudor, Carl J Schmidt and K Vijay-Shanker.



With the biomedical literature continually expanding, searching PubMed for information about specific genes becomes increasingly difficult. Not only can thousands of results be returned, but gene name ambiguity leads to many irrelevant hits. As a result, it is difficult for life scientists and gene curators to rapidly get an overall picture about a specific gene from documents that mention its names and synonyms.


In this paper, we present eGIFT, a web-based tool that associates informative terms, called iTerms, and sentences containing them, with genes. To associate iTerms with a gene, eGIFT ranks iTerms about the gene, based on a score which compares the frequency of occurrence of a term in the gene’s literature to its frequency of occurrence in documents about genes in general. To retrieve a gene’s documents (Medline abstracts), eGIFT considers all gene names, aliases, and synonyms. Since many of the gene names can be ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to this gene. Another additional filtering process is applied to retain those abstracts that focus on the gene rather than mention it in passing. eGIFT’s information for a gene is pre-computed and users of eGIFT can search for genes by using a name or an EntrezGene identifier. iTerms are grouped into different categories to facilitate a quick inspection. eGIFT also links an iTerm to sentences mentioning the term to allow users to see the relation between the iTerm and the gene. We evaluated the precision and recall of eGIFT’s iTerms for 40 genes; between 88% and 94% of the iTerms were marked as salient by our evaluators, and 94% of the UniProtKB keywords for these genes were also identified by eGIFT as iTerms.


Our evaluations suggest that iTerms capture highly-relevant aspects of genes. Furthermore, by showing sentences containing these terms, eGIFT can provide a quick description of a specific gene. eGIFT helps not only life scientists survey results of high-throughput experiments, but also annotators to find articles describing gene aspects and functions.
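
The abstract says only that the score compares a term's frequency in a gene's literature with its frequency in documents about genes in general; it does not give the formula. As a rough illustration of that idea, and not eGIFT's actual scoring, here is a minimal Python sketch that ranks terms by the difference between two document frequencies. The function names, the whitespace tokenizer and the toy abstracts are all assumptions.

```python
from collections import Counter


def document_frequency(docs):
    """Map each term to the fraction of documents that contain it."""
    counts = Counter()
    for doc in docs:
        # Count a term once per document, however often it repeats.
        counts.update(set(doc.lower().split()))
    n = len(docs)
    return {term: c / n for term, c in counts.items()} if n else {}


def rank_terms(gene_docs, background_docs):
    """Rank terms by how much more often they appear in gene_docs
    than in background_docs (difference of document frequencies)."""
    fg = document_frequency(gene_docs)
    bg = document_frequency(background_docs)
    scores = {term: freq - bg.get(term, 0.0) for term, freq in fg.items()}
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


gene_docs = ["kinase binds receptor", "kinase phosphorylates substrate"]
background_docs = ["gene encodes protein", "protein binds receptor", "gene expression"]
ranking = rank_terms(gene_docs, background_docs)
```

In the toy data "kinase" ranks first because it appears in every abstract for the gene and in none of the background, while "binds" and "receptor" are pushed down because they are common everywhere.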


Another lesson for topic map authoring interfaces: Offer domain-specific search capabilities.

Using a ****** search appliance is little better than a poke with a sharp stick in most domains. The user is left to their own devices to sort out ambiguities and discover synonyms, again and again.

Your search interface may report > 900,000 “hits,” but anything beyond the first 20 or so is wasted.

(If you get sick, get something that comes up in the first 20 “hits” in PubMed. Where most researchers stop.)

Developing a biocuration workflow for AgBase… [Authoring Interfaces]

Thursday, November 22nd, 2012

Developing a biocuration workflow for AgBase, a non-model organism database by Lakshmi Pillai, Philippe Chouvarine, Catalina O. Tudor, Carl J. Schmidt, K. Vijay-Shanker and Fiona M. McCarthy.


AgBase provides annotation for agricultural gene products using the Gene Ontology (GO) and Plant Ontology, as appropriate. Unlike model organism species, agricultural species have a body of literature that does not just focus on gene function; to improve efficiency, we use text mining to identify literature for curation. The first component of our annotation interface is the gene prioritization interface that ranks gene products for annotation. Biocurators select the top-ranked gene and mark annotation for these genes as ‘in progress’ or ‘completed’; links enable biocurators to move directly to our biocuration interface (BI). Our BI includes all current GO annotation for gene products and is the main interface to add/modify AgBase curation data. The BI also displays Extracting Genic Information from Text (eGIFT) results for each gene product. eGIFT is a web-based, text-mining tool that associates ranked, informative terms (iTerms) and the articles and sentences containing them, with genes. Moreover, iTerms are linked to GO terms, where they match either a GO term name or a synonym. This enables AgBase biocurators to rapidly identify literature for further curation based on possible GO terms. Because most agricultural species do not have standardized literature, eGIFT searches all gene names and synonyms to associate articles with genes. As many of the gene names can be ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to this gene, and filtering is applied to remove abstracts that mention a gene in passing. The BI is linked to our Journal Database (JDB) where corresponding journal citations are stored. Just as importantly, biocurators also add to the JDB citations that have no GO annotation. The AgBase BI also supports bulk annotation upload to facilitate our Inferred from electronic annotation of agricultural gene products. All annotations must pass standard GO Consortium quality checking before release in AgBase.

Database URL:

Another approach to biocuration. I will be posting on eGIFT separately but do note this is a domain-specific tool.

The authors did not set out to create the universal curation tool but one suited to their specific data and requirements.

I think there is an important lesson here for semantic authoring interfaces. Word processors offer very generic interfaces but consequently little in the way of structure. Authoring annotated information requires more structure and that requires domain specifics.

Now there is an idea: create topic map authoring interfaces on top of a common skeleton, instead of hard-coding interfaces for how users “should” use the tool.

Prioritizing PubMed articles…

Wednesday, November 21st, 2012

Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information by Sun Kim, Won Kim, Chih-Hsuan Wei, Zhiyong Lu and W. John Wilbur.


The Comparative Toxicogenomics Database (CTD) contains manually curated literature that describes chemical–gene interactions, chemical–disease relationships and gene–disease relationships. Finding articles containing this information is the first and an important step to assist manual curation efficiency. However, the complex nature of named entities and their relationships make it challenging to choose relevant articles. In this article, we introduce a machine learning framework for prioritizing CTD-relevant articles based on our prior system for the protein–protein interaction article classification task in BioCreative III. To address new challenges in the CTD task, we explore a new entity identification method for genes, chemicals and diseases. In addition, latent topics are analyzed and used as a feature type to overcome the small size of the training set. Applied to the BioCreative 2012 Triage dataset, our method achieved 0.8030 mean average precision (MAP) in the official runs, resulting in the top MAP system among participants. Integrated with PubTator, a Web interface for annotating biomedical literature, the proposed system also received a positive review from the CTD curation team.

An interesting summary of entity recognition issues in bioinformatics occurs in this article:

The second problem is that chemical and disease mentions should be identified along with gene mentions. Named entity recognition (NER) has been a main research topic for a long time in the biomedical text-mining community. The common strategy for NER is either to apply certain rules based on dictionaries and natural language processing techniques (5–7) or to apply machine learning approaches such as support vector machines (SVMs) and conditional random fields (8–10). However, most NER systems are class specific, i.e. they are designed to find only objects of one particular class or set of classes (11). This is natural because chemical, gene and disease names have specialized terminologies and complex naming conventions. In particular, gene names are difficult to detect because of synonyms, homonyms, abbreviations and ambiguities (12,13). Moreover, there are no specific rules of how to name a gene that are actually followed in practice (14). Chemicals have systematic naming conventions, but finding chemical names from text is still not easy because there are various ways to express chemicals (15,16). For example, they can be mentioned as IUPAC names, brand names, generic names or even molecular formulas. However, disease names in literature are more standardized (17) compared with gene and chemical names. Hence, using terminological resources such as Medical Subject Headings (MeSH) and Unified Medical Language System (UMLS) Metathesaurus help boost the identification performance (17,18). But, a major drawback of identifying disease names from text is that they often use general English terms.

Having a common representative for a group of identifiers for a single entity should simplify the creation of mappings between entities.
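
A minimal sketch of the "common representative" idea in Python, assuming a hand-built table of names: every identifier for an entity (generic name, IUPAC name and so on) resolves to one representative, and a mapping between entities then only has to be stated once, against the representatives. The `build_index` and `resolve` helpers and the tiny table are hypothetical. Case-folding is an assumption that suits these chemical names; it would be wrong for gene symbols, where case can distinguish genes.

```python
def build_index(entities):
    """Map every known identifier (case-folded) to its representative.

    Raises ValueError if one identifier would point at two entities,
    since that ambiguity needs a human decision.
    """
    index = {}
    for representative, identifiers in entities.items():
        for name in (representative, *identifiers):
            key = name.casefold()
            if key in index and index[key] != representative:
                raise ValueError(
                    f"{name!r} names both {index[key]!r} and {representative!r}"
                )
            index[key] = representative
    return index


def resolve(index, mention):
    """Return the representative for a mention, or None if unknown."""
    return index.get(mention.casefold())


# Illustrative table: representative -> other identifiers for the same entity.
ENTITIES = {
    "aspirin": ["acetylsalicylic acid", "2-acetoxybenzoic acid"],
    "paracetamol": ["acetaminophen", "N-(4-hydroxyphenyl)acetamide"],
}
INDEX = build_index(ENTITIES)
```

With the index in place, `resolve(INDEX, "Acetaminophen")` and `resolve(INDEX, "paracetamol")` land on the same representative, so anything said about one is said about the other.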


Accelerating literature curation with text-mining tools:…

Monday, November 19th, 2012

Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts by Chih-Hsuan Wei, Bethany R. Harris, Donghui Li, Tanya Z. Berardini, Eva Huala, Hung-Yu Kao and Zhiyong Lu.


Today’s biomedical research has become heavily dependent on access to the biological knowledge encoded in expert curated biological databases. As the volume of biological literature grows rapidly, it becomes increasingly difficult for biocurators to keep up with the literature because manual curation is an expensive and time-consuming endeavour. Past research has suggested that computer-assisted curation can improve efficiency, but few text-mining systems have been formally evaluated in this regard. Through participation in the interactive text-mining track of the BioCreative 2012 workshop, we developed PubTator, a PubMed-like system that assists with two specific human curation tasks: document triage and bioconcept annotation. On the basis of evaluation results from two external user groups, we find that the accuracy of PubTator-assisted curation is comparable with that of manual curation and that PubTator can significantly increase human curatorial speed. These encouraging findings warrant further investigation with a larger number of publications to be annotated.

Database URL:

Presentation on PubTator (slides, PDF).

Hmmm, curating abstracts. That sounds like annotating subjects in documents, doesn’t it? Or something very close. 😉

If we start off with a set of subjects, that eases topic map authoring because users are assisted by automatic creation of topic map machinery. Creation triggered by identification of subjects and associations.

Users don’t have to start with bare ground to build a topic map.

Clever users build (and sell) forms, frames, components and modules that serve as the scaffolding for other topic maps.

… text mining in the FlyBase genetic literature curation workflow

Sunday, November 18th, 2012

Opportunities for text mining in the FlyBase genetic literature curation workflow by Peter McQuilton. (Database (2012) 2012 : bas039 doi: 10.1093/database/bas039)


FlyBase is the model organism database for Drosophila genetic and genomic information. Over the last 20 years, FlyBase has had to adapt and change to keep abreast of advances in biology and database design. We are continually looking for ways to improve curation efficiency and efficacy. Genetic literature curation focuses on the extraction of genetic entities (e.g. genes, mutant alleles, transgenic constructs) and their associated phenotypes and Gene Ontology terms from the published literature. Over 2000 Drosophila research articles are now published every year. These articles are becoming ever more data-rich and there is a growing need for text mining to shoulder some of the burden of paper triage and data extraction. In this article, we describe our curation workflow, along with some of the problems and bottlenecks therein, and highlight the opportunities for text mining. We do so in the hope of encouraging the BioCreative community to help us to develop effective methods to mine this torrent of information.

Database URL:

Would you believe that ambiguity is problem #1 and describing relationships is another one?

The most common problem encountered during curation is an ambiguous genetic entity (gene, mutant allele, transgene, etc.). This situation can arise when no unique identifier (such as a FlyBase gene identifier (FBgn) or a computed gene (CG) number for genes), or an accurate and explicit reference for a mutant or transgenic line is given. Ambiguity is a particular problem when a generic symbol/ name is used (e.g. ‘Actin’ or UAS-Notch), or when a symbol/ name is used that is a synonym for a different entity (e.g. ‘ras’ is the current FlyBase symbol for the ‘raspberry’ gene, FBgn0003204, but is often used in the literature to refer to the ‘Ras85D’ gene, FBgn0003205). A further issue is that some symbols only differ in case-sensitivity for the first character, for example, the genes symbols ‘dl’ (dorsal) and ‘Dl’ (Delta). These ambiguities can usually be resolved by searching for associated details about the entity in the article (e.g. the use of a specific mutant allele can identify the gene being discussed) or by consulting the supplemental information for additional details. Sometimes we have to do some analysis ourselves, such as performing a BLAST search using any sequence data present in the article or supplementary files or executing an in-house script to report those entities used by a specified author in previously curated articles. As a final step, if we cannot resolve a problem, we email the corresponding author for clarification. If the ambiguity still cannot be resolved, then a curator will either associate a generic/unspecified entry for that entity with the article, or else omit the entity and add a (non-public) note to the curation record explaining the situation, with the hope that future publications will resolve the issue.
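The ambiguity described in the passage above can be sketched as a lookup that returns every gene a mention might denote. This is only an illustration built from the examples in the quoted text: the table layout and the functions `build_lookup`, `candidates` and `needs_curator` are my assumptions, not FlyBase tooling, and identifiers the quote does not give are left as `None`.

```python
from collections import defaultdict

# (symbol as written, gene it may denote, FlyBase identifier if the quote gives one)
KNOWN_USAGES = [
    ("ras", "raspberry", "FBgn0003204"),  # current FlyBase symbol
    ("ras", "Ras85D", "FBgn0003205"),     # frequent usage in the literature
    ("dl", "dorsal", None),
    ("Dl", "Delta", None),
]


def build_lookup(usages, ignore_case=False):
    """Group genes by the symbol used for them, optionally folding case."""
    lookup = defaultdict(set)
    for symbol, gene, _identifier in usages:
        key = symbol.lower() if ignore_case else symbol
        lookup[key].add(gene)
    return lookup


def candidates(lookup, mention, ignore_case=False):
    """Return the sorted list of genes a mention could denote."""
    key = mention.lower() if ignore_case else mention
    return sorted(lookup.get(key, set()))


def needs_curator(lookup, mention):
    """A mention is unresolved when it has zero or several candidate genes."""
    return len(candidates(lookup, mention)) != 1


EXACT = build_lookup(KNOWN_USAGES)
FOLDED = build_lookup(KNOWN_USAGES, ignore_case=True)

print(candidates(EXACT, "ras"))                     # ['Ras85D', 'raspberry']
print(candidates(EXACT, "dl"))                      # ['dorsal']
print(candidates(FOLDED, "Dl", ignore_case=True))   # ['Delta', 'dorsal']
print(needs_curator(EXACT, "Actin"))                # True
```

Case folding, a routine normalization step elsewhere, creates an ambiguity here that the exact-match table does not have. Mentions that end up with several candidates, or none, are the ones that fall back to the curator steps the authors list.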

One of the more esoteric problems found in curation is the fact that multiple relationships exist between the curated data types. For example, the ‘dppEP2232 allele’ is caused by the ‘P{EP}dppEP2232 insertion’ and disrupts the ‘dpp gene’. This can cause problems for text-mining assisted curation, as the data can be attributed to the wrong object due to sentence structure or the requirement of background or contextual knowledge found in other parts of the article. In cases like this, detailed knowledge of the FlyBase proforma and curation rules, as well as a good knowledge of Drosophila biology, is necessary to ensure the correct proforma field is filled in. This is one of the reasons why we believe text-mining methods will assist manual curation rather than replace it in the near term.
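The dpp example above can be written down as typed associations, which is roughly how a topic map would hold it. A sketch under assumptions: the relation labels `caused_by` and `disrupts` and the `related` helper are hypothetical names of mine, not FlyBase proforma fields.

```python
from typing import NamedTuple


class Association(NamedTuple):
    subject: str
    relation: str
    obj: str


# The two relationships stated in the quoted example.
ASSOCIATIONS = [
    Association("dppEP2232 allele", "caused_by", "P{EP}dppEP2232 insertion"),
    Association("dppEP2232 allele", "disrupts", "dpp gene"),
]


def related(associations, entity):
    """Return every entity reachable from `entity`, following associations in either direction."""
    seen = {entity}
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for assoc in associations:
            if assoc.subject == current:
                neighbour = assoc.obj
            elif assoc.obj == current:
                neighbour = assoc.subject
            else:
                continue
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append(neighbour)
    seen.discard(entity)
    return sorted(seen)


print(related(ASSOCIATIONS, "dpp gene"))
# ['P{EP}dppEP2232 insertion', 'dppEP2232 allele']
```

A text-mining tool that finds a statement near the gene name could use such a list to show the curator the other objects the statement might really belong to. Choosing among them is still the curator's call.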

I like the “manual curation” line. Curation is a task best performed by a sentient being.

Manual Gene Ontology annotation workflow

Sunday, November 4th, 2012

Manual Gene Ontology annotation workflow at the Mouse Genome Informatics Database by Harold J. Drabkin and Judith A. Blake, for the Mouse Genome Informatics Database. Database (2012) 2012 : bas045 doi: 10.1093/database/bas045.


The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource. The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented particularly in regards to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as ‘GO’ or ‘homology’ or ‘phenotype’. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as ‘papers selected for GO that refer to genes with NO GO annotation’. Once indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. All data associations are supported with statements of evidence as well as access to source publications.
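The curator report named in the abstract, ‘papers selected for GO that refer to genes with NO GO annotation’, can be sketched in a few lines. Everything here is hypothetical: the record layout, the paper and gene names, and the function `go_papers_with_unannotated_genes` are my assumptions for illustration, not MGI's schema or code.

```python
# Hypothetical master bibliography: each paper is indexed to areas of interest
# and to the genes it refers to.
BIBLIOGRAPHY = [
    {"paper": "paper-1", "areas": {"GO", "phenotype"}, "genes": {"gene-A", "gene-B"}},
    {"paper": "paper-2", "areas": {"homology"}, "genes": {"gene-C"}},
    {"paper": "paper-3", "areas": {"GO"}, "genes": {"gene-B"}},
]

# Hypothetical set of genes that already carry at least one GO annotation.
GO_ANNOTATED_GENES = {"gene-B"}


def go_papers_with_unannotated_genes(bibliography, annotated):
    """List (paper, genes lacking GO annotation) for papers indexed to 'GO'."""
    report = []
    for entry in bibliography:
        if "GO" not in entry["areas"]:
            continue
        missing = sorted(entry["genes"] - annotated)
        if missing:
            report.append((entry["paper"], missing))
    return report


print(go_papers_with_unannotated_genes(BIBLIOGRAPHY, GO_ANNOTATED_GENES))
# [('paper-1', ['gene-A'])]
```

The report is only as good as the indexing behind it, and consistent indexing is exactly what the controlled vocabularies are paying for.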

Semantic uniformity is achievable, in a limited enough sphere, provided you are willing to pay the price for it.

That uniformity has a high rate of return compared with less carefully curated content.

The project is producing high quality results, although hampered by a lack of resources.

My question is whether a similar high quality of results could be achieved with less semantically consistent curation by distributed contributors?

That is, harnessing the community of those interested in such a resource, and then refining their less semantically consistent entries into higher quality annotations.

Pointers to examples of such projects?