Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 10, 2014

multiMiR R package and database:…

Filed under: Bioinformatics,Biomedical,MySQL,R — Patrick Durusau @ 7:37 pm

The multiMiR R package and database: integration of microRNA–target interactions along with their disease and drug associations by Yuanbin Ru, et al. (Nucl. Acids Res. (2014) doi: 10.1093/nar/gku631)

Abstract:

microRNAs (miRNAs) regulate expression by promoting degradation or repressing translation of target transcripts. miRNA target sites have been catalogued in databases based on experimental validation and computational prediction using various algorithms. Several online resources provide collections of multiple databases but need to be imported into other software, such as R, for processing, tabulation, graphing and computation. Currently available miRNA target site packages in R are limited in the number of databases, types of databases and flexibility. We present multiMiR, a new miRNA–target interaction R package and database, which includes several novel features not available in existing R packages: (i) compilation of nearly 50 million records in human and mouse from 14 different databases, more than any other collection; (ii) expansion of databases to those based on disease annotation and drug microRNA response, in addition to many experimental and computational databases; and (iii) user-defined cutoffs for predicted binding strength to provide the most confident selection. Case studies are reported on various biomedical applications including mouse models of alcohol consumption, studies of chronic obstructive pulmonary disease in human subjects, and human cell line models of bladder cancer metastasis. We also demonstrate how multiMiR was used to generate testable hypotheses that were pursued experimentally.

Amazing what you can do with R and a MySQL database!
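
If you have not paired R with a MySQL database before, the sketch below shows the general shape of such a query from R. It uses the generic DBI/RMySQL interface rather than the multiMiR package’s own accessor functions, and the host, credentials, table and column names are placeholders for illustration, not the project’s actual connection details or schema.

    # Minimal sketch: querying a MySQL-backed miRNA-target store from R.
    # Connection details and the table/column names are hypothetical;
    # see the multiMiR documentation for the real schema and entry points.
    library(DBI)
    library(RMySQL)

    con <- dbConnect(
      RMySQL::MySQL(),
      host     = "db.example.org",   # placeholder host
      dbname   = "multimir",         # placeholder database name
      user     = "reader",
      password = "reader"
    )

    # Pull validated target records for one miRNA (hypothetical table/columns).
    hits <- dbGetQuery(con, "
      SELECT mature_mirna_id, target_symbol, experiment, pubmed_id
      FROM   validated_interactions
      WHERE  mature_mirna_id = 'hsa-miR-18a-5p'
    ")

    head(hits)
    dbDisconnect(con)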

The authors briefly describe their “cleaning” process for the consolidation of these databases on page 2 but then note on page 4:

For many of the databases, the links are available. However, in Supplementary Table S2 we have listed the databases where links may be broken due to outdated identifiers in those databases. We also listed the databases that do not have the option to search by miRNA-gene pairs.

Perhaps due to my editing standards (I am available for freelance work), I have an allergy to terms like “many,” especially when it is possible to enumerate the “many.”

In this particular case, you have to download and consult Supplementary Table S2, which reads:

[Supplementary Table S2]

The explanation for this table reads:

For each database, the columns indicate whether external links are available to include as part of multiMiR, whether those databases use identifiers that are updated and whether the links are based on miRNA-gene pairs. For those database that do not have updated identifiers, some links may be broken. For the other databases, where you can only search by miRNA or gene but not pairs, the links are provided by gene, except for ElMMo which is by miRNA because of its database structure.

Counting, I see ten (10) databases with a blank under “Updated Identifiers,” “Search by miRNA-gene,” or both.

I guess ten (10) out of fourteen (14) qualifies as “many,” but saying seventy-one percent (71%) of the databases in this study lack either “Updated Identifiers,” “Search by miRNA-gene,” or both, would have been more informative.

Potential records with these issues? ElMMo, version 4, has human (50M) and mouse (15M) records, MicroCosm / miRBase human has 879,054, and miRanda (assuming human, Good mirSVR score, Conserved miRNA) has 1,097,069. For the rest you can consult Supplemental Table 1, which lists URLs for the databases and dates of access, but where multiple human options are available, it does not indicate which one(s) were selected.

The number of records for each database that may have these problems also merits mention in the description of the data.

I can’t comment on the usefulness of this R package for exploring the data but the condition of the data it explores needs more prominent mention.

August 8, 2014

Genomics Standards Consortium

Filed under: Bioinformatics,Genomics — Patrick Durusau @ 4:13 pm

Genomics Standards Consortium

From the homepage:

The Genomic Standards Consortium (GSC) is an open-membership working body formed in September 2005. The goal of this International community is to promote mechanisms that standardize the description of genomes and the exchange and integration of genomic data.

This was cited in Genomic Encyclopedia of Bacteria….

If you are interested in the “exchange and integration of genomic data,” you will find a number of projects of interest to you.

Naming issues are everywhere but they get more attention, at least for the moment, in science and related areas.

I would not push topic map syntax, but I would suggest that capturing what a reasonable person thinks when identifying a subject, their inner checklist of properties as it were, will assist others in comparing it against their own internal checklists.

If that “inner” list isn’t written down, there is nothing on which to make a comparison.

Genomic Encyclopedia of Bacteria…

Filed under: Bioinformatics,Biology,Genomics — Patrick Durusau @ 4:03 pm

Genomic Encyclopedia of Bacteria and Archaea: Sequencing a Myriad of Type Strains by Nikos C. Kyrpides, et al. (Kyrpides NC, Hugenholtz P, Eisen JA, Woyke T, Göker M, et al. (2014) Genomic Encyclopedia of Bacteria and Archaea: Sequencing a Myriad of Type Strains. PLoS Biol 12(8): e1001920. doi:10.1371/journal.pbio.1001920)

Abstract:

Microbes hold the key to life. They hold the secrets to our past (as the descendants of the earliest forms of life) and the prospects for our future (as we mine their genes for solutions to some of the planet’s most pressing problems, from global warming to antibiotic resistance). However, the piecemeal approach that has defined efforts to study microbial genetic diversity for over 20 years and in over 30,000 genome projects risks squandering that promise. These efforts have covered less than 20% of the diversity of the cultured archaeal and bacterial species, which represent just 15% of the overall known prokaryotic diversity. Here we call for the funding of a systematic effort to produce a comprehensive genomic catalog of all cultured Bacteria and Archaea by sequencing, where available, the type strain of each species with a validly published name (currently ~11,000). This effort will provide an unprecedented level of coverage of our planet’s genetic diversity, allow for the large-scale discovery of novel genes and functions, and lead to an improved understanding of microbial evolution and function in the environment.

While I am a standards advocate, I have to disagree with some of the claims for standards:

Accurate estimates of diversity will require not only standards for data but also standard operating procedures for all phases of data generation and collection [33],[34]. Indeed, sequencing all archaeal and bacterial type strains as a unified international effort will provide an ideal opportunity to implement international standards in sequencing, assembly, finishing, annotation, and metadata collection, as well as achieve consistent annotation of the environmental sources of these type strains using a standard such as minimum information about any (X) sequence (MixS) [27],[29]. Methods need to be rigorously challenged and validated to ensure that the results generated are accurate and likely reproducible, without having to reproduce each point. With only a few exceptions [27],[29], such standards do not yet exist, but they are in development under the auspices of the Genomics Standards Consortium (e.g., the M5 initiative) (http://gensc.org/gc_wiki/index.php/M5) [35]. Without the vehicle of a grand-challenge project such as this one, adoption of international standards will be much less likely.

Some standardization will no doubt be beneficial but for the data that is collected, a topic map informed approach, where critical subjects are identified not by surface tokens but by key/value pairs, would be much better.

In part because there is always legacy data and too little time and funding to backfit every change in present terminology onto past names. Or should I say it hasn’t happened outside of one specialized chemical index that comes to mind.

August 5, 2014

Bioinformatics Data and Microsoft Word

Filed under: Bioinformatics,Microsoft — Patrick Durusau @ 4:25 pm

Is there ever a valid reason for storing bioinformatics data in a Microsoft Word document? by Keith Bradnam.

You already know the answer from the title so I will skip to the conclusion:

This is not an acceptable practice! Use of Microsoft Word to store bioinformatics data will only ever result in unhappiness, frustration, and anger.

I think Keith, myself and many others who make the same or similar points are missing one critical issue:

Why is MS Word (or Excel) so much easier to use than other applications for bioinformatics?

Or perhaps even more to the point:

Why hasn’t bioinformatics lobbied for extensions to MS Word or Excel to work with their workflow?

For the most part, users aren’t really interested in a personal relationship with their computer or a religious experience with their software. They want to get some non-hardware/non-software task done. (full stop)

Rather than trying to fix users, why don’t we try to fix their tools?

Shouldn’t I be able to create a new MS Word or OpenOffice document, indicate that it contains gene names and simply type them in? And have them intelligently extracted for use with genome databases?

“Fixing” users isn’t a winning strategy. Let’s try fixing their tools. No promises but we know the other approach fails.
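
As a rough illustration of how little machinery the extraction half would need, here is a sketch in R that reads a .docx file with the officer package and pulls out candidate gene symbols. The file name, the crude symbol regex and the idea of checking candidates against an HGNC download are assumptions made for the example, not features of any existing product.

    # Sketch: pull candidate gene symbols out of a Word document.
    # The regex for "looks like a gene symbol" is a crude stand-in for a
    # real lookup against HGNC or a genome database.
    library(officer)

    doc  <- read_docx("draft_manuscript.docx")   # hypothetical file name
    text <- docx_summary(doc)$text               # one row per document element

    tokens     <- unlist(strsplit(text, "[^A-Za-z0-9-]+"))
    candidates <- unique(grep("^[A-Z][A-Z0-9-]{1,9}$", tokens, value = TRUE))

    # 'candidates' would then be checked against a gene-name authority
    # before being exported alongside the document.
    head(candidates)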

August 1, 2014

COSMOS: Python library for massively parallel workflows

Filed under: Bioinformatics,Parallel Programming,Python,Workflow — Patrick Durusau @ 10:11 am

COSMOS: Python library for massively parallel workflows by Erik Gafni, et al. (Bioinformatics (2014) doi: 10.1093/bioinformatics/btu385)

Abstract:

Summary: Efficient workflows to shepherd clinically generated genomic data through the multiple stages of a next-generation sequencing pipeline are of critical importance in translational biomedical science. Here we present COSMOS, a Python library for workflow management that allows formal description of pipelines and partitioning of jobs. In addition, it includes a user interface for tracking the progress of jobs, abstraction of the queuing system and fine-grained control over the workflow. Workflows can be created on traditional computing clusters as well as cloud-based services.

Availability and implementation: Source code is available for academic non-commercial research purposes. Links to code and documentation are provided at http://lpm.hms.harvard.edu and http://wall-lab.stanford.edu.

Contact: dpwall@stanford.edu or peter_tonellato@hms.harvard.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

A very good abstract but for pitching purposes, I would have chosen the first paragraph of the introduction:

The growing deluge of data from next-generation sequencers leads to analyses lasting hundreds or thousands of compute hours per specimen, requiring massive computing clusters or cloud infrastructure. Existing computational tools like Pegasus (Deelman et al., 2005) and more recent efforts like Galaxy (Goecks et al., 2010) and Bpipe (Sadedin et al., 2012) allow the creation and execution of complex workflows. However, few projects have succeeded in describing complicated workflows in a simple, but powerful, language that generalizes to thousands of input files; fewer still are able to deploy workflows onto distributed resource management systems (DRMs) such as Platform Load Sharing Facility (LSF) or Sun Grid Engine that stitch together clusters of thousands of compute cores. Here we describe COSMOS, a Python library developed to address these and other needs.

That paragraph highlights the bioinformatics aspects of COSMOS but also hints at a language that might be adapted to other “massively parallel workflows.” Workflows may differ in details, but the need to efficiently and effectively define them is a common problem.

July 31, 2014

Bio-Linux 8 – Released July 2014

Filed under: Bio-Linux,Bioinformatics,Linux OS — Patrick Durusau @ 7:29 am

Bio-Linux 8 – Released July 2014

About Bio-Linux:

Bio-Linux 8 is a powerful, free bioinformatics workstation platform that can be installed on anything from a laptop to a large server, or run as a virtual machine. Bio-Linux 8 adds more than 250 bioinformatics packages to an Ubuntu Linux 14.04 LTS base, providing around 50 graphical applications and several hundred command line tools. The Galaxy environment for browser-based data analysis and workflow construction is also incorporated in Bio-Linux 8.

Bio-Linux 8 represents the continued commitment of NERC to maintain the platform, and comes with many updated and additional tools and libraries. With this release we support pre-prepared VM images for use with VirtualBox, VMWare or Parallels. Virtualised Bio-Linux will power the EOS Cloud, which is in development for launch in 2015.

You can install Bio-Linux on your machine, either as the only operating system, or as part of a dual-boot set-up which allows you to use your current system and Bio-Linux on the same hardware.

Bio-Linux can also run Live from a DVD or a USB stick. This runs in the memory of your machine and does not involve installing anything. This is a great, no-hassle way to try out Bio-Linux, demonstrate or teach with it, or to work with it when you are on the move.

Bio-Linux is built on open source systems and software, and so is free to install and use. See What’s new on Bio-Linux 8. Also, check out the 2006 paper on Bio-Linux and open source systems for biologists.

Great news if you are handling biological data!

Not to mention that it is a good example of multiple delivery methods: you can use Bio-Linux 8 as your OS, or run it from a VM, DVD or USB stick.

How is your software delivered?

July 30, 2014

Expanded 19th-century Medical Collection

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 9:06 am

Wellcome Library and Jisc announce partners in 19th-century medical books digitisation project

From the post:

The libraries of six universities have joined the partnership – UCL (University College London), the University of Leeds, the University of Glasgow, the London School of Hygiene & Tropical Medicine, King’s College London and the University of Bristol – along with the libraries of the Royal College of Physicians of London, the Royal College of Physicians of Edinburgh and the Royal College of Surgeons of England.

Approximately 15 million pages of printed books and pamphlets from all ten partners will be digitised over a period of two years and will be made freely available to researchers and the public under an open licence. By pooling their collections the partners will create a comprehensive online library. The content will be available on multiple platforms to broaden access, including the Internet Archive, the Wellcome Library and Jisc Historic Books.

The project’s focus is on books and pamphlets from the 19th century that are on the subject of medicine or its related disciplines. This will include works relating to the medical sciences, consumer health, sport and fitness, as well as different kinds of medical practice, from phrenology to hydrotherapy. Works on food and nutrition will also feature: around 1400 cookery books from the University of Leeds are among those lined up for digitisation. They, along with works from the other partner institutions, will be transported to the Wellcome Library in London where a team from the Internet Archive will undertake the digitisation work. The project will build on the success of the US-based Medical Heritage Library consortium, of which the Wellcome Library is a part, which has already digitised over 50 000 books and pamphlets.

Digital coverage of the 19th century is taking another leap forward!

Given the changes in medical terminology (and practices!) since the 19th century, this should be a gold mine for topic map applications.

July 27, 2014

Ten habits of highly effective data:…

Filed under: Bioinformatics,Funding — Patrick Durusau @ 2:45 pm

Ten habits of highly effective data: Helping your dataset achieve its full potential by Anita de Waard.

Anita gives all the high-minded and very legitimate reasons for creating highly effective data, with examples.

Read her slides to pick up the rhetoric you need and leads on how to create highly effective data.

Let me add one concern to drive your interest in creating highly effective data:

Funders want researchers to create highly effective data.

Enough said?

Answers to creating highly effective data continue to evolve but not attempting to create highly effective data is a losing proposal.

July 25, 2014

Tools and Resources Development Fund [bioscience ODF UK]

Filed under: Bioinformatics,Biomedical,Curation,Funding — Patrick Durusau @ 2:05 pm

Tools and Resources Development Fund

Application deadline: 17 September 2014, 4pm

From the summary:

Our Tools and Resources Development Fund (TRDF) aims to pump prime the next generation of tools, technologies and resources that will be required by bioscience researchers in scientific areas within our remit. It is anticipated that successful grants will not exceed £150k (£187k FEC) (ref 1) and a fast-track, light touch peer review process will operate to enable researchers to respond rapidly to emerging challenges and opportunities.

Projects are expected to have a maximum value of £150k (ref 1). The duration of projects should be between 6 and 18 months, although community networks to develop standards could be supported for up to 3 years.

A number of different types of proposal are eligible for consideration.

  • New approaches to the analysis, modelling and interpretation of research data in the biological sciences, including development of software tools and algorithms. Of particular interest will be proposals that address challenges arising from emerging new types of data and proposals that address known problems associated with data handling (e.g. next generation sequencing, high-throughput phenotyping, the extraction of data from challenging biological images, metagenomics).
  • New frameworks for the curation, sharing, and re-use/re-purposing of research data in the biological sciences, including embedding data citation mechanisms (e.g. persistent identifiers for datasets within research workflows) and novel data management planning (DMP) implementations (e.g. integration of DMP tools within research workflows)
  • Community approaches to the sharing of research data including the development of standards (this could include coordinating UK input into international standards development activities).
  • Approaches designed to exploit the latest computational technology to further biological research; for example, to facilitate the use of cloud computing approaches or high performance computing architectures.

Projects may extend existing software resources; however, the call is designed to support novel tools and methods. Incremental improvement and maintenance of existing software that does not provide new functionality or significant performance improvements (e.g. by migration to an advanced computing environment) does not fall within the scope of the call.

Very timely since the UK announcement that OpenDocument Format (ODF) is among the open standards:

The standards set out the document file formats that are expected to be used across all government bodies. Government will begin using open formats that will ensure that citizens and people working in government can use the applications that best meet their needs when they are viewing or working on documents together. (Open document formats selected to meet user needs)

ODF as a format supports RDFa as metadata but lacks an implementation that makes full use of that capability.

Imagine biocuration that:

  • Starts with authors writing a text and is delivered to
  • Publishers, who can proof or augment the author’s biocuration
  • Results are curated on publication (not months or years later)
  • Results are immediately available for collation with other results.

The only way to match the explosive growth of bioscience publications with equally explosive growth of bioscience curation is to use tools the user already knows. Like word processing software.

Please pass this along and let me know of other grants or funding opportunities where adaptation of office standards or software could change the fundamentals of workflow.

July 21, 2014

You’re not allowed bioinformatics anymore

Filed under: Bioinformatics,Collaboration — Patrick Durusau @ 5:48 pm

You’re not allowed bioinformatics anymore by Mick Watson.

Bump this to the head of your polemic reading list! Excellent writing.

To be fair, collaboration with others is a two-way street.

That is, both communities in this tale needed to be reaching out to the other on a continuous basis. It isn’t enough that you offered once or twice and were rebuffed so now you will wait them out.

Successful collaborations don’t start with grudges and bad attitudes about prior failures to collaborate.

I know of two organizations that share common members, operate in the same area and despite both being more than a century old, have had only one, brief, collaborative project.

The collaboration fell apart because leadership in both was waiting for the other to call.

It is hard to sustain a collaboration when both parties consider themselves to be the center of the universe. (I have it on good authority that neither one of them is the center of the universe.)

I can’t promise fame, professional success, etc., but reaching out and genuinely collaborating with others will advance your field of endeavor. Promise.

Enjoy the story.

I first saw this in a tweet by Neil Saunders.

Christmas in July?

Filed under: Bioinformatics,Genome,Genomics — Patrick Durusau @ 4:18 pm

It won’t be Christmas in July but bioinformatics folks will feel like it with the release of the full annotation of the human genome assembly (GRCh38) due to drop at the end of July 2014.

Dan Murphy covers progress on the annotation and information about the upcoming release in: The new human annotation is almost here!

This is an important big data set.

How would you integrate it with other data sets?

I first saw this in a tweet by Neil Saunders.

July 19, 2014

Ad-hoc Biocuration Workflows?

Filed under: Bioinformatics,Text Mining — Patrick Durusau @ 6:54 pm

Text-mining-assisted biocuration workflows in Argo by Rafal Rak, et al. (Database (2014) 2014: bau070, doi: 10.1093/database/bau070)

Abstract:

Biocuration activities have been broadly categorized into the selection of relevant documents, the annotation of biological concepts of interest and identification of interactions between the concepts. Text mining has been shown to have a potential to significantly reduce the effort of biocurators in all the three activities, and various semi-automatic methodologies have been integrated into curation pipelines to support them. We investigate the suitability of Argo, a workbench for building text-mining solutions with the use of a rich graphical user interface, for the process of biocuration. Central to Argo are customizable workflows that users compose by arranging available elementary analytics to form task-specific processing units. A built-in manual annotation editor is the single most used biocuration tool of the workbench, as it allows users to create annotations directly in text, as well as modify or delete annotations created by automatic processing components. Apart from syntactic and semantic analytics, the ever-growing library of components includes several data readers and consumers that support well-established as well as emerging data interchange formats such as XMI, RDF and BioC, which facilitate the interoperability of Argo with other platforms or resources. To validate the suitability of Argo for curation activities, we participated in the BioCreative IV challenge whose purpose was to evaluate Web-based systems addressing user-defined biocuration tasks. Argo proved to have the edge over other systems in terms of flexibility of defining biocuration tasks. As expected, the versatility of the workbench inevitably lengthened the time the curators spent on learning the system before taking on the task, which may have affected the usability of Argo. The participation in the challenge gave us an opportunity to gather valuable feedback and identify areas of improvement, some of which have already been introduced.

Database URL: http://argo.nactem.ac.uk

From the introduction:

Data curation from biomedical literature had been traditionally carried out as an entirely manual effort, in which a curator handpicks relevant documents and creates annotations for elements of interest from scratch. To increase the efficiency of this task, text-mining methodologies have been integrated into curation pipelines. In curating the Biomolecular Interaction Network Database (1), a protein–protein interaction extraction system was used and was shown to be effective in reducing the curation work-load by 70% (2). Similarly, a usability study revealed that the time needed to curate FlyBase records (3) was reduced by 20% with the use of a gene mention recognizer (4). Textpresso (5), a text-mining tool that marks up biomedical entities of interest, was used to semi-automatically curate mentions of Caenorhabditis elegans proteins from the literature and brought about an 8-fold increase in curation efficiency (6). More recently, the series of BioCreative workshops (http://www.biocreative.org) have fostered the synergy between biocuration efforts and text-mining solutions. The user-interactive track of the latest workshop saw nine Web-based systems featuring rich graphical user interfaces designed to perform text-mining-assisted biocuration tasks. The tasks can be broadly categorized into the selection of documents for curation, the annotation of mentions of relevant biological entities in text and the annotation of interactions between biological entities (7).

Argo is a truly impressive text-mining-assisted biocuration application but the first line of a biocuration article needs to read:

Data curation from biomedical literature had been traditionally carried out as an entirely ad-hoc effort, after the author has submitted their paper for publication.

There is an enormous backlog of material that desperately needs biocuration and Argo (and other systems) have a vital role to play in that effort.

However, the situation of ad-hoc biocuration is never going to improve unless and until biocuration is addressed in the authoring of papers to appear in biomedical literature.

Who better to answer questions or resolve ambiguities that appear in biocuration than the authors of the papers?

That would require working to extend MS Office and Apache OpenOffice, to name two of the more common authoring platforms.

But the return would be higher quality publications earlier in the publication cycle, which would enable publishers to provide enhanced services based upon higher quality products and enhance tracing and searching of the end products.

No offense to ad-hoc efforts but higher quality sooner in the publication process seems like an unbeatable deal.

First complex, then simple

Filed under: Bioinformatics,Data Analysis,Data Mining,Data Models — Patrick Durusau @ 4:18 pm

First complex, then simple by James D Malley and Jason H Moore. (BioData Mining 2014, 7:13)

Abstract:

At the start of a data analysis project it is often suggested that the researcher look first at multiple simple models. That is, always begin with simple, one variable at a time analyses, such as multiple single-variable tests for association or significance. Then, later, somehow (how?) pull all the separate pieces together into a single comprehensive framework, an inclusive data narrative. For detecting true compound effects with more than just marginal associations, this is easily defeated with simple examples. But more critically, it is looking through the data telescope from wrong end.

I would have titled this article: “Data First, Models Later.”

That is, the authors start with no formal theories about what the data will prove and, upon finding signals in the data, then generate simple models to explain the signals.

I am sure their questions of the data are driven by a suspicion of what the data may prove, but that isn’t the same thing as asking questions designed to prove a model generated before the data is queried.
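
The abstract’s claim that one-variable-at-a-time analyses are “easily defeated with simple examples” is easy to reproduce. The sketch below uses simulated data (not data from the paper) in which the outcome depends only on the interaction of two binary variables: each single-variable test sees nothing, while a model that allows the compound effect finds it immediately.

    # Simulated example of a pure compound (interaction) effect.
    set.seed(42)
    n  <- 500
    x1 <- rbinom(n, 1, 0.5)
    x2 <- rbinom(n, 1, 0.5)
    # The outcome shifts only when exactly one of x1, x2 is 1 (an XOR pattern).
    y  <- rnorm(n, mean = ifelse(xor(x1 == 1, x2 == 1), 1, 0))

    summary(lm(y ~ x1))        # x1 alone: no detectable effect
    summary(lm(y ~ x2))        # x2 alone: no detectable effect
    summary(lm(y ~ x1 * x2))   # the interaction term is strongly significant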

July 16, 2014

…[S]emantically enriched open pharmacological space…

Filed under: Bioinformatics,Biomedical,Drug Discovery,Integration,Semantics — Patrick Durusau @ 2:25 pm

Scientific competency questions as the basis for semantically enriched open pharmacological space development by Kamal Azzaoui, et al. (Drug Discovery Today, Volume 18, Issues 17–18, September 2013, Pages 843–852)

Abstract:

Molecular information systems play an important part in modern data-driven drug discovery. They do not only support decision making but also enable new discoveries via association and inference. In this review, we outline the scientific requirements identified by the Innovative Medicines Initiative (IMI) Open PHACTS consortium for the design of an open pharmacological space (OPS) information system. The focus of this work is the integration of compound–target–pathway–disease/phenotype data for public and industrial drug discovery research. Typical scientific competency questions provided by the consortium members will be analyzed based on the underlying data concepts and associations needed to answer the questions. Publicly available data sources used to target these questions as well as the need for and potential of semantic web-based technology will be presented.

Pharmacology may not be your space but this is a good example of what it takes for semantic integration of resources in a complex area.

Despite the “…you too can be a brain surgeon with our new web-based app…” claims from various sources, semantic integration has been, is and will remain difficult under the best of circumstances.

I don’t say that to discourage anyone but to avoid the let-down when integration projects don’t provide easy returns.

It is far better to plan for incremental and measurable benefits along the way than to fashion grandiose goals that are ever receding on the horizon.

I first saw this in a tweet by ChemConnector.

July 12, 2014

Train online with EMBL-EBI

Filed under: Bioinformatics,Information Retrieval — Patrick Durusau @ 1:54 pm

Train online with EMBL-EBI

From the webpage:

Train online provides free courses on Europe’s most widely used data resources, created by experts at EMBL-EBI and collaborating institutes. You do not need to have any previous experience of bioinformatics to benefit from this training. We want to help you to be a highly competent user of our data resources; we are not trying to train you to become a bioinformatician.

You can use Train online to learn in your own time and at your own pace. You can repeat the courses as many times as you like, or just complete part of a course if you want to brush up on how to perform a specific task.

An interesting collection of training materials on bioinformatics resources.

As the webpage says, it won’t train you to be a bioinformatician but it can make you a more effective user of the resource covered.

Keep it in mind if you are working on a bioinformatics project or are interested in how other domains organize their information.

I first saw this in a tweet by Neil Saunders which pointed to: Scaling up bioinformatics training online by Ewan Birney.

July 6, 2014

Finding needles in haystacks:…

Filed under: Bioinformatics,Biology,Names,Taxonomy — Patrick Durusau @ 4:54 pm

Finding needles in haystacks: linking scientific names, reference specimens and molecular data for Fungi by Conrad L. Schoch, et al. (Database (2014) 2014: bau061, doi: 10.1093/database/bau061).

Abstract:

DNA phylogenetic comparisons have shown that morphology-based species recognition often underestimates fungal diversity. Therefore, the need for accurate DNA sequence data, tied to both correct taxonomic names and clearly annotated specimen data, has never been greater. Furthermore, the growing number of molecular ecology and microbiome projects using high-throughput sequencing require fast and effective methods for en masse species assignments. In this article, we focus on selecting and re-annotating a set of marker reference sequences that represent each currently accepted order of Fungi. The particular focus is on sequences from the internal transcribed spacer region in the nuclear ribosomal cistron, derived from type specimens and/or ex-type cultures. Re-annotated and verified sequences were deposited in a curated public database at the National Center for Biotechnology Information (NCBI), namely the RefSeq Targeted Loci (RTL) database, and will be visible during routine sequence similarity searches with NR_prefixed accession numbers. A set of standards and protocols is proposed to improve the data quality of new sequences, and we suggest how type and other reference sequences can be used to improve identification of Fungi.

Database URL: http://www.ncbi.nlm.nih.gov/bioproject/PRJNA177353

If you are interested in projects to update and correct existing databases, this is the article for you.

Fungi may not be on your regular reading list but consider one aspect of the problem described:

It is projected that there are ~400 000 fungal names already in existence. Although only 100 000 are accepted taxonomically, it still makes updates to the existing taxonomic structure a continuous task. It is also clear that these named fungi represent only a fraction of the estimated total, 1–6 million fungal species (93–95).

I would say that computer science isn’t the only discipline where “naming things” is hard.

You?

PS: The other lesson from this paper (and many others) is that semantic accuracy is not easy nor is it cheap. Anyone who says differently is lying.

July 2, 2014

circlize implements and enhances circular visualization in R

Filed under: Bioinformatics,Genomics,Multidimensional,R,Visualization — Patrick Durusau @ 6:03 pm

circlize implements and enhances circular visualization in R by Zuguang Gu, et al.

Abstract:

Summary: Circular layout is an efficient way for the visualization of huge amounts of genomic information. Here we present the circlize package, which provides an implementation of circular layout generation in R as well as an enhancement of available software. The flexibility of this package is based on the usage of low-level graphics functions such that self-defined high-level graphics can be easily implemented by users for specific purposes. Together with the seamless connection between the powerful computational and visual environment in R, circlize gives users more convenience and freedom to design figures for better understanding genomic patterns behind multi-dimensional data.

Availability and implementation: circlize is available at the Comprehensive R Archive Network (CRAN): http://cran.r-project.org/web/packages/circlize/

The article is behind a paywall but fortunately, the R code is not!

I suspect I know which one will get more “hits.” 😉
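
For readers stopped at the paywall: the package is on CRAN and a first plot takes only a few lines. The sketch below draws a generic chord diagram from a small random matrix; it is an illustration of the package’s entry point, not an example taken from the article.

    # Minimal circlize example: a chord diagram from a toy matrix.
    library(circlize)

    set.seed(1)
    mat <- matrix(sample(1:20, 12), nrow = 3, ncol = 4,
                  dimnames = list(paste0("S", 1:3), paste0("T", 1:4)))

    chordDiagram(mat)   # link widths reflect the matrix values
    circos.clear()      # reset circular layout parameters for the next plot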

Useful for exploring multidimensional data as well as presenting multidimensional data encoded using a topic map.

Sometimes displaying information as nodes and edges isn’t the best display.

Remember the map of Napoleon’s invasion of Russia?

[Image: map of Napoleon’s invasion of Russia]

You could display the same information with nodes (topics) and associations (edges) but it would not be nearly as compelling.

Although, you could make the same map a “cover” for the topics (read people) associated with segments of the map, enabling a reader to take in the whole map and then drill down to the detail for any location or individual.

It would still be a topic map, even though its primary rendering would not be as nodes and edges.

Verticalize

Filed under: Bioinformatics,Data Mining,Text Mining — Patrick Durusau @ 3:05 pm

Verticalize by Pierre Lindenbaum.

From the webpage:

Simple tool to verticalize text delimited files.

Pierre works in bioinformatics and is the author of many useful tools.

Definitely one for the *nix toolbox.
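
If you want the flavor of what “verticalizing” does before installing the tool, a rough equivalent in R is below: each delimited record is printed as one column-name/value pair per line. This is only an illustration of the idea, not Pierre’s implementation.

    # Rough R equivalent of verticalizing a tab-delimited file.
    verticalize <- function(path, sep = "\t") {
      df <- read.delim(path, sep = sep, stringsAsFactors = FALSE)
      for (i in seq_len(nrow(df))) {
        cat(sprintf(">>> record %d\n", i))
        for (col in names(df)) {
          cat(sprintf("  %s : %s\n", col, df[i, col]))
        }
      }
    }

    # verticalize("variants.tsv")   # hypothetical input file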

June 21, 2014

Online Bioinformatics / Computational Biology

Filed under: Bioinformatics,Computational Biology,Computer Science — Patrick Durusau @ 8:14 pm

An Annotated Online Bioinformatics / Computational Biology Curriculum by Stephen Turner.

From the post:

Two years ago David Searls published an article in PLoS Comp Bio describing a series of online courses in bioinformatics. Yesterday, the same author published an updated version, “A New Online Computational Biology Curriculum,” (PLoS Comput Biol 10(6): e1003662. doi: 10.1371/journal.pcbi.1003662).

This updated curriculum has a supplemental PDF describing hundreds of video courses that are foundational to a good understanding of computational biology and bioinformatics. The table of contents embedded into the PDF’s metadata (Adobe Reader: View>Navigation Panels>Bookmarks; Apple Preview: View>Table of Contents) breaks the curriculum down into 11 “departments” with links to online courses in each subject area:

  1. Mathematics Department
  2. Computer Science Department
  3. Data Science Department
  4. Chemistry Department
  5. Biology Department
  6. Computational Biology Department
  7. Evolutionary Biology Department
  8. Systems Biology Department
  9. Neurosciences Department
  10. Translational Sciences Department
  11. Humanities Department

The key term here is annotated. That is, the author isn’t just listing courses from someone else’s list but has some experience with each course.

Should be a great resource whether you are a CS person looking at bioinformatics/computational biology or a bioinformatics person trying to communicate with the CS side.

Enjoy!

Egas:…

Filed under: Bioinformatics,Biomedical,Medical Informatics,Text Mining — Patrick Durusau @ 7:42 pm

Egas: a collaborative and interactive document curation platform by David Campos, et al.

Abstract:

With the overwhelming amount of biomedical textual information being produced, several manual curation efforts have been set up to extract and store concepts and their relationships into structured resources. As manual annotation is a demanding and expensive task, computerized solutions were developed to perform such tasks automatically. However, high-end information extraction techniques are still not widely used by biomedical research communities, mainly because of the lack of standards and limitations in usability. Interactive annotation tools intend to fill this gap, taking advantage of automatic techniques and existing knowledge bases to assist expert curators in their daily tasks. This article presents Egas, a web-based platform for biomedical text mining and assisted curation with highly usable interfaces for manual and automatic in-line annotation of concepts and relations. A comprehensive set of de facto standard knowledge bases are integrated and indexed to provide straightforward concept normalization features. Real-time collaboration and conversation functionalities allow discussing details of the annotation task as well as providing instant feedback of curator’s interactions. Egas also provides interfaces for on-demand management of the annotation task settings and guidelines, and supports standard formats and literature services to import and export documents. By taking advantage of Egas, we participated in the BioCreative IV interactive annotation task, targeting the assisted identification of protein–protein interactions described in PubMed abstracts related to neuropathological disorders. When evaluated by expert curators, it obtained positive scores in terms of usability, reliability and performance. These results, together with the provided innovative features, place Egas as a state-of-the-art solution for fast and accurate curation of information, facilitating the task of creating and updating knowledge bases and annotated resources.

Database URL: http://bioinformatics.ua.pt/egas

Read this article and/or visit the webpage and tell me this doesn’t have topic map editor written all over it!

Domain specific to be sure but any decent interface for authoring topic maps is going to be domain specific.

Very, very impressive!

I am following up with the team to check on the availability of the software.

A controlled vocabulary for pathway entities and events

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 7:24 pm

A controlled vocabulary for pathway entities and events by Steve Jupe, et al.

Abstract:

Entities involved in pathways and the events they participate in require descriptive and unambiguous names that are often not available in the literature or elsewhere. Reactome is a manually curated open-source resource of human pathways. It is accessible via a website, available as downloads in standard reusable formats and via Representational State Transfer (REST)-ful and Simple Object Access Protocol (SOAP) application programming interfaces (APIs). We have devised a controlled vocabulary (CV) that creates concise, unambiguous and unique names for reactions (pathway events) and all the molecular entities they involve. The CV could be reapplied in any situation where names are used for pathway entities and events. Adoption of this CV would significantly improve naming consistency and readability, with consequent benefits for searching and data mining within and between databases.

Database URL: http://www.reactome.org

There is no doubt that “unambiguous and unique names for reactions (pathway events) and all the molecular entities they involve” would have all the benefits listed by the authors.

Unfortunately, the experience of the HUGO Gene Nomenclature Committee, for example, has been that “other” names for genes are used and then the HUGO designation is created, making the HUGO designation only one of several names a gene may have.

Another phrase for “universal name” is “an additional name.”

It is an impressive effort and should be useful in disambiguating the additional names for pathway entities and events.

FYI, from the homepage of the database:

Reactome is a free, open-source, curated and peer reviewed pathway database. Our goal is to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge to support basic research, genome analysis, modeling, systems biology and education.

June 6, 2014

A Methodology for Empirical Analysis of LOD Datasets

Filed under: Bioinformatics,Biomedical,LOD — Patrick Durusau @ 6:52 pm

A Methodology for Empirical Analysis of LOD Datasets by Vit Novacek.

Abstract:

CoCoE stands for Complexity, Coherence and Entropy, and presents an extensible methodology for empirical analysis of Linked Open Data (i.e., RDF graphs). CoCoE can offer answers to questions like: Is dataset A better than B for knowledge discovery since it is more complex and informative?, Is dataset X better than Y for simple value lookups due its flatter structure?, etc. In order to address such questions, we introduce a set of well-founded measures based on complementary notions from distributional semantics, network analysis and information theory. These measures are part of a specific implementation of the CoCoE methodology that is available for download. Last but not least, we illustrate CoCoE by its application to selected biomedical RDF datasets. (emphasis in original)

A deeply interesting work on the formal characteristics of LOD datasets, but as we learned in Community detection in networks:…, a relationship between a typology (another formal characteristic) and some hidden fact(s) may or may not exist.

Or to put it another way, formal characteristics are useful for rough evaluation of data sets but cannot replace a grounded actor considering their meaning. That would be you.

I first saw this in a tweet by Marin Dimitrov.

May 10, 2014

The Encyclopedia of Life v2:…

Filed under: Bioinformatics,Biology,Encyclopedia,Semantic Inconsistency — Patrick Durusau @ 4:11 pm

The Encyclopedia of Life v2: Providing Global Access to Knowledge About Life on Earth by Cynthia S. Parr, et al. (Biodiversity Data Journal 2: e1079 (29 Apr 2014) doi: 10.3897/BDJ.2.e1079)

Abstract:

The Encyclopedia of Life (EOL, http://eol.org) aims to provide unprecedented global access to a broad range of information about life on Earth. It currently contains 3.5 million distinct pages for taxa and provides content for 1.3 million of those pages. The content is primarily contributed by EOL content partners (providers) that have a more limited geographic, taxonomic or topical scope. EOL aggregates these data and automatically integrates them based on associated scientific names and other classification information. EOL also provides interfaces for curation and direct content addition. All materials in EOL are either in the public domain or licensed under a Creative Commons license. In addition to the web interface, EOL is also accessible through an Application Programming Interface.

In this paper, we review recent developments added for Version 2 of the web site and subsequent releases through Version 2.2, which have made EOL more engaging, personal, accessible and internationalizable. We outline the core features and technical architecture of the system. We summarize milestones achieved so far by EOL to present results of the current system implementation and establish benchmarks upon which to judge future improvements.

We have shown that it is possible to successfully integrate large amounts of descriptive biodiversity data from diverse sources into a robust, standards-based, dynamic, and scalable infrastructure. Increasing global participation and the emergence of EOL-powered applications demonstrate that EOL is becoming a significant resource for anyone interested in biological diversity.

This section on the organization of the taxonomy for the Encyclopedia of Life v2 seems particularly relevant:

Resource documents made available by content partners define the text and multimedia being provided as well as the taxa to which the content refers, the associations between content and taxa, and the associations among taxa (i.e. taxonomies). Expert taxonomists often disagree about the best classification for a given group of organisms, and there is no universal taxonomy for partners to adhere to (Patterson et al. 2008, Rotman et al. 2012a, Yoon and Rose 2001). As an aggregator, EOL accepts all taxonomic viewpoints from partners and attempts to assign them to existing Taxon Pages, or create new Taxon Pages when necessary. A reconciliation algorithm uses incoming taxon information, previously indexed data, and assertions from our curators to determine the best aggregation strategy. (links omitted)

Integration of information without agreement on a single view of the information. (Have we heard this before?)

If you think of the taxon pages as proxies, it is easier to see the topic map aspects of this project.
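
A toy version of that proxy view in R: two providers describe overlapping taxa under their own column conventions, and aggregation amounts to merging on the shared scientific name while keeping every provider’s content. The data and column names are invented for illustration; EOL’s actual reconciliation algorithm also weighs the partners’ taxonomies and curator assertions.

    # Toy aggregation of two providers' records into one "taxon page" per name.
    provider_a <- data.frame(
      scientific_name  = c("Puma concolor", "Lynx rufus"),
      common_name      = c("Cougar", "Bobcat"),
      stringsAsFactors = FALSE
    )
    provider_b <- data.frame(
      scientific_name  = c("Puma concolor", "Panthera onca"),
      habitat          = c("Americas, wide-ranging", "Neotropics"),
      stringsAsFactors = FALSE
    )

    # Full outer merge: one row (proxy) per scientific name, all content kept.
    taxon_pages <- merge(provider_a, provider_b,
                         by = "scientific_name", all = TRUE)
    taxon_pages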

May 3, 2014

Human Sense Making

Filed under: Bioinformatics,Interface Research/Design,Sense,Sensemaking,Workflow — Patrick Durusau @ 12:38 pm

Scientists’ sense making when hypothesizing about disease mechanisms from expression data and their needs for visualization support by Barbara Mirel and Carsten Görg.

Abstract:

A common class of biomedical analysis is to explore expression data from high throughput experiments for the purpose of uncovering functional relationships that can lead to a hypothesis about mechanisms of a disease. We call this analysis expression driven, -omics hypothesizing. In it, scientists use interactive data visualizations and read deeply in the research literature. Little is known, however, about the actual flow of reasoning and behaviors (sense making) that scientists enact in this analysis, end-to-end. Understanding this flow is important because if bioinformatics tools are to be truly useful they must support it. Sense making models of visual analytics in other domains have been developed and used to inform the design of useful and usable tools. We believe they would be helpful in bioinformatics. To characterize the sense making involved in expression-driven, -omics hypothesizing, we conducted an in-depth observational study of one scientist as she engaged in this analysis over six months. From findings, we abstracted a preliminary sense making model. Here we describe its stages and suggest guidelines for developing visualization tools that we derived from this case. A single case cannot be generalized. But we offer our findings, sense making model and case-based tool guidelines as a first step toward increasing interest and further research in the bioinformatics field on scientists’ analytical workflows and their implications for tool design.

From the introduction:

In other domains, improvements in data visualization designs have relied on models of analysts’ actual sense making for a complex analysis [2]. A sense making model captures analysts’ cumulative, looped (not linear) “process [es] of searching for a representation and encoding data in that representation to answer task-specific questions” relevant to an open-ended problem [3]: 269. As an end-to-end flow of application-level tasks, a sense making model may portray and categorize analytical intentions, associated tasks, corresponding moves and strategies, informational inputs and outputs, and progression and iteration over time. The importance of sense making models is twofold: (1) If an analytical problem is poorly understood developers are likely to design for the wrong questions, and tool utility suffers; and (2) if developers do not have a holistic understanding of the entire analytical process, developed tools may be useful for one specific part of the process but will not integrate effectively in the overall workflow [4,5].

As the authors admit, one case isn’t enough to be generalized but their methodology, with its focus on the work flow of a scientist, is a refreshing break from imagined and/or “ideal” work flows for scientists.

Until now semantic software has followed someone’s projection of an “ideal” work flow.

The next generation of semantic software should follow the actual work flows of people working with their data.

I first saw this in a tweet by Neil Saunders.

April 22, 2014

Innovations in peer review:…

Filed under: Bioinformatics,Biomedical,Peer Review,Publishing — Patrick Durusau @ 9:54 am

Innovations in peer review: join a discussion with our Editors by Shreeya Nanda.

From the post:

Innovation may not be an adjective often associated with peer review, indeed commentators have claimed that peer review slows innovation and creativity in science. Preconceptions aside, publishers are attempting to shake things up a little, with various innovations in peer review, and these are the focus of a panel discussion at BioMed Central’s Editors’ Conference on Wednesday 23 April in Doha, Qatar. This follows our spirited discussion at the Experimental Biology conference in Boston last year.

The discussion last year focussed on the limitations of the traditional peer review model (you can see a video here). This year we want to talk about innovations in the field and the ways in which the limitations are being addressed. Specifically, we will focus on open peer review, portable peer review – in which we help authors transfer their manuscript, often with reviewers’ reports, to a more appropriate journal – and decoupled peer review, which is undertaken by a company or organisation independent of, or on contract from, a journal.

We will be live tweeting from the session at 11.15am local time (9.15am BST), so if you want to join the discussion or put questions to our panellists, please follow #BMCEds14. If you want to brush up on any or all of the models that we’ll be discussing, have a look at some of the content from around BioMed Central’s journals, blogs and Biome below:

This post includes pointers to a number of useful resources concerning the debate around peer review.

But there are oddities as well. First, there is the claim that peer review “slows innovation and creativity in science,” which seems odd considering recent reports that peer review is no better than random chance for grants (…lotteries to pick NIH research-grant recipients) and the not infrequent reports of false papers, fraud in actual papers, and a general inability to replicate research described in papers (Reproducible Research/(Mapping?)).

A claim doesn’t have to appear on the alt.fringe.peer.review newsgroup (imaginary newsgroup) in order to be questionable on its face.

Secondly, despite the invitation to follow and participate on Twitter, holding the meeting in Qatar means potential attendees from the United States will have to rise at:

Eastern 4:15 AM (last year’s location)

Central 3:15 AM

Mountain 2:15 AM

Western 1:15 AM

I wonder how this year’s participation levels in Qatar will compare with last year’s in Boston?

Nothing against non-United States locations but non-junket locations, such as major educational/research hubs, should be the sites for such meetings.

April 21, 2014

Names are not (always) useful

Filed under: Bioinformatics,Biology,Taxonomy — Patrick Durusau @ 7:30 pm

PhyloCode names are not useful for phylogenetic synthesis

From the post:

Which brings me to the title of this post. In the PhyloCode, taxonomic names are not hypothetical concepts that can be refuted or refined by data-driven tests. Instead, they are definitions involving specifiers (designated specimens) that are simply applied to source trees that include those specifiers. This is problematic for synthesis because if two source trees differ in topology, and/or they fail to include the appropriate specifiers, it may be impossible to answer the basic question I began with: do the trees share any clades (taxa) in common? If taxa are functions of phylogenetic topology, then there can be no taxonomic basis for meaningfully comparing source trees that either differ in topology, or do not permit the application of taxon definitions. (emphasis added)

If you substitute “names” for “taxa” then it is easy to see my point in Plato, Shiva and A Social Graph about nodes that are “abstract concept devoid of interpretation.” There is nothing to compare.

This isn’t a new problem but a very old one that keeps being repeated.

For processing reasons it may be useful to act as though taxa (or names) are simply given. A digital or print index need not struggle to find a grounding for the terms it reports. For some purposes, that is completely unnecessary.

On the other hand, we should not forget the lack of grounding is purely a convenience for processing or other reasons. We can choose differently should an occasion merit it.

April 20, 2014

Google Genomics Preview

Filed under: Bioinformatics,Genomics — Patrick Durusau @ 3:40 pm

Google Genomics Preview by Kevin.

From the post:

Welcome to the Google Genomics Preview! You’ve been approved for early access to the API.

The goal of the Genomics API is to encourage interoperability and build a foundation to store, process, search, analyze and share tens of petabytes of genomic data.

We’ve loaded sample data from public BAM files:

  • The complete 1000 Genomes Project
  • Selections from the Personal Genome Project

How to get started:

You will need to obtain an invitation to begin playing.

Don’t be disappointed that Google is moving into genomics.

After all, gathering data and supplying a processing back-end for it is a critical task but not a terribly imaginative one.

The analysis you perform and the uses you enable, that’s the part that takes imagination.

March 25, 2014

…[S]uffix array construction algorithms

Filed under: Bioinformatics,String Matching,Suffix Array,Suffix Tree — Patrick Durusau @ 6:20 pm

A bioinformatician’s guide to the forefront of suffix array construction algorithms by Anish Man Singh Shrestha, Martin C. Frith, and Paul Horton.

Abstract:

The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bioinformatics. With the uninitiated in mind, we provide an accessible exposition of the SA-IS algorithm, which is the state of the art in suffix array construction. We also describe DisLex, a technique that allows standard suffix array construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching needed to support ‘spaced seeds’ and ‘subset seeds’ used in many biological applications.

If this doesn’t sound like a real page turner, consider the authors’ stated goal: an accessible exposition, written “with the uninitiated in mind,” of the state of the art in suffix array construction.

Reminds me that computer science departments need to start offering courses in “string theory,” to capitalize on the popularity of that phrase. 😉
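
For anyone meeting suffix arrays for the first time, the naive construction is only a couple of lines in R: list every suffix, sort them, and keep the starting positions. This is just the brute-force definition (roughly O(n^2 log n) because of the string comparisons); SA-IS, the algorithm the paper explains, reaches linear time without materializing the suffixes.

    # Naive suffix array: starting positions of the suffixes in sorted order.
    suffix_array <- function(s) {
      suffixes <- substring(s, 1:nchar(s), nchar(s))
      order(suffixes)
    }

    suffix_array("banana")
    # 6 4 2 1 5 3  -> suffixes "a", "ana", "anana", "banana", "na", "nana"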

March 18, 2014

Think globally and solve locally:… (GraphChi News)

Filed under: Bioinformatics,GraphChi,Graphs — Patrick Durusau @ 6:55 pm

Think globally and solve locally: secondary memory-based network learning for automated multi-species function prediction by Marco Mesiti, Matteo Re, and Giorgio Valentini.

Abstract:

Background: Network-based learning algorithms for Automated Function Prediction (AFP) are negatively affected by the limited coverage of experimental data and limited a priori known functional annotations. As a consequence their application to model organisms is often restricted to well characterized biological processes and pathways, and their effectiveness with poorly annotated species is relatively limited. A possible solution to this problem might consist in the construction of big networks including multiple species, but this in turn poses challenging computational problems, due to the scalability limitations of existing algorithms and the main memory requirements induced by the construction of big networks. Distributed computation or the usage of big computers could in principle respond to these issues, but raises further algorithmic problems and require resources not satisfiable with simple off-the-shelf computers.

Results: We propose a novel framework for scalable network-based learning of multi-species protein functions based on both a local implementation of existing algorithms and the adoption of innovative technologies: we solve “locally” the AFP problem, by designing “vertex-centric” implementations of network-based algorithms, but we do not give up thinking “globally” by exploiting the overall topology of the network. This is made possible by the adoption of secondary memory-based technologies that allow the efficient use of the large memory available on disks, thus overcoming the main memory limitations of modern off-the-shelf computers. This approach has been applied to the analysis of a large multi-species network including more than 300 species of bacteria and to a network with more than 200,000 proteins belonging to 13 Eukaryotic species. To our knowledge this is the first work where secondary-memory based network analysis has been applied to multi-species function prediction using biological networks with hundreds of thousands of proteins.

Conclusions: The combination of these algorithmic and technological approaches makes feasible the analysis of large multi-species networks using ordinary computers with limited speed and primary memory, and in perspective could enable the analysis of huge networks (e.g. the whole proteomes available in SwissProt), using well-equipped stand-alone machines.

The biomolecular network material may be deep wading but you will find GraphChi making a significant difference in this use case.

What I found particularly interesting in Table 7 (page 20) was the low impact that additional RAM has on GraphChi.

I take that to mean that GraphChi can run efficiently on low-end boxes (4 GB RAM).

Yes?

I first saw this in a tweet by Aapo Kyrola.

February 26, 2014

The Small Tools Manifesto For Bioinformatics [Security Too?]

Filed under: Bioinformatics,Cybersecurity,Security — Patrick Durusau @ 5:24 pm

The Small Tools Manifesto For Bioinformatics

From the post:

This MANIFESTO describes motives, rules and recommendations for designing software and pipelines for current day biological and biomedical research.

Large scale data acquisition in research has led to fundamental challenges in (1) scaling of calculations, (2) full data integration and (3) data interaction and visualisation. We think that, because of researchers reaching out to turn-key solutions, the research community is losing sight of the importance of building software on the shoulders of giants and providing solutions in a modular, flexible and open way.

This MANIFESTO counters current trends in bioinformatics where institutes and companies are creating monolithic software solutions aimed mostly at end-users. This MANIFESTO builds on the Unix computer tradition of providing small tools that can be used in a modular and pluggable way to create efficient computational solutions where individual parts can be easily replaced. The manifesto also counters current trends in software licensing which are not truly free and open source (FOSS). We think such a MANIFESTO is necessary, even though history suggests that software created with true FOSS licenses will ultimately prevail over less open licenses, including those licenses for academic use only.

Interesting that I should encounter this less than a week after Back to Basics: Beyond Network Hygiene by Felix ‘FX’ Lindner and Sandro Gaycken.

Lindner and Gaycken’s first recommendation:

Therefore, our first recommendation is to significantly increase the granularity of the building blocks, making the individual building blocks significantly smaller than what is done today. (emphasis in original)

Think about it.

More granular building blocks mean smaller blocks of code and fewer places for bugs to hide. Messaging between blocks allows for easy tracing of bad messages. Not to mention that small blocks can be repaired and/or replaced more easily than large monoliths.

The manifesto for small tools in bioinformatics is a great idea.

Shouldn’t we do the same for programming in general to enable robust computer security?*

*Note that computer security isn’t “in addition to” a computer system but enabled in the granular architecture itself.

I first saw this in a tweet by Vince Buffalo.
