Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 28, 2011

Pybedtools: a flexible Python library for manipulating genomic datasets and annotations

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 9:35 pm

Pybedtools: a flexible Python library for manipulating genomic datasets and annotations by Ryan K. Dale, Brent S. Pedersen and Aaron R. Quinlan.

Abstract:

Summary: pybedtools is a flexible Python software library for manipulating and exploring genomic datasets in many common formats. It provides an intuitive Python interface that extends upon the popular BEDTools genome arithmetic tools. The library is well documented and efficient, and allows researchers to quickly develop simple, yet powerful scripts that enable complex genomic analyses.

From the documentation:

Formats with different coordinate systems (e.g. BED vs GFF) are handled with uniform, well-defined semantics described in the documentation.

Starting to sound like HyTime, isn’t it? Transposition between coordinate systems.

If you venture into this area with a topic map, something to keep in mind.
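To make that concrete, here is a minimal sketch of intersecting a BED file with a GFF file through pybedtools; the file names are placeholders of my own, not files shipped with the library:

```python
import pybedtools

# GFF uses 1-based, inclusive coordinates; BED uses 0-based, half-open ones.
# pybedtools/BEDTools handle the conversion with the uniform semantics the
# documentation describes.
genes = pybedtools.BedTool("genes.gff")
peaks = pybedtools.BedTool("peaks.bed")

# Wraps "bedtools intersect"; u=True reports each peak that overlaps any gene.
peaks_in_genes = peaks.intersect(genes, u=True)

for interval in peaks_in_genes:
    print(interval.chrom, interval.start, interval.end)
```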

I first saw this in Christophe Lalanne’s A bag of tweets / Dec 2011.

December 26, 2011

Galaxy: Data Intensive Biology for Everyone

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 8:20 pm

Galaxy: Data Intensive Biology for Everyone (main site)

Galaxy 101: The first thing you should try (tutorial)

Work through the tutorial, keeping track of where you think subject identity (and any tests you care to suggest) would be useful.

I don’t know but suspect this is representative of what researchers in the field expect in terms of capabilities.

With a little effort, I suspect it would make a nice basis for starting a conversation about what subject identity could add.

December 25, 2011

New Entrez Genome

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 6:07 pm

New Entrez Genome Released on November 9, 2011

From the announcement:

Historically the Entrez Genome data model was designed for complete genomes of microorganisms (Archaea, Eubacteria, and Viruses) and a very few eukaryotic genomes such as human, yeast, worm, fly and thale cress (Arabidopsis thaliana). It also included individual complete genomes of organelles and plasmids. Despite the name, the Entrez Genome database record has been a chromosome (or organelle or plasmid) rather than a genome.

The new Genome resource uses a new data model where a single record provides information about the organism (usually a species), its genome structure, available assemblies and annotations, and related genome-scale projects such as transcriptome sequencing, epigenetic studies and variation analysis. As before, the Genome resource represents genomes from all major taxonomic groups: Archaea, Bacteria, Eukaryote, and Viruses. The old Genome database represented only Refseq genomes, while the new resource extends this scope to all genomes either provided by primary submitters (INSDC genomes) or curated by NCBI staff (RefSeq genomes).

The new Genome database shares a close relationship with the recently redesigned BioProject database (formerly Genome Project). Primary information about genome sequencing projects in the new Genome database is stored in the BioProject database. BioProject records of type “Organism Overview” have become Genome records with a Genome ID that maps uniquely to a BioProject ID. The new Genome database also includes all “genome sequencing” records in BioProject.

BTW, just in case you ever wonder about changes in identifiers causing problems:

The new Genome IDs cannot be directly mapped to the old Genome IDs because the data types are very different. Each old Genome ID represented a single sequence that can still be found in Entrez Nucleotide using standard Entrez searches or the E-utilities. We recommend that you convert old Genome IDs to Nucleotide GI numbers using the following remapping file available on the NCBI FTP site:
ftp://ftp.ncbi.nih.gov/genomes/old_genomeID2nucGI

The Genome site.
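Applying a remapping file like that is simple enough in code. A minimal sketch, assuming (my assumption, so check the file) a plain two-column layout of old Genome ID and nucleotide GI per line:

```python
def load_id_map(path):
    """Read a two-column mapping file: old Genome ID -> nucleotide GI."""
    mapping = {}
    with open(path) as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue
            old_id, nuc_gi = line.split()[:2]
            mapping[old_id] = nuc_gi
    return mapping

id_map = load_id_map("old_genomeID2nucGI")        # downloaded from the FTP site
print(id_map.get("12345", "no mapping found"))    # "12345" is a made-up old ID
```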

A Compressed Self-Index for Genomic Databases

Filed under: Bioinformatics,Biomedical,Indexing — Patrick Durusau @ 6:07 pm

A Compressed Self-Index for Genomic Databases by Travis Gagie, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi.

Abstract:

Advances in DNA sequencing technology will soon result in databases of thousands of genomes. Within a species, individuals’ genomes are almost exact copies of each other; e.g., any two human genomes are 99.9% the same. Relative Lempel-Ziv (RLZ) compression takes advantage of this property: it stores the first genome uncompressed or as an FM-index, then compresses the other genomes with a variant of LZ77 that copies phrases only from the first genome. RLZ achieves good compression and supports fast random access; in this paper we show how to support fast search as well, thus obtaining an efficient compressed self-index.

As the authors note, this is an area with a rapidly increasing need for efficient, effective indexing.
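For readers new to the idea, here is a toy sketch of the relative parse at the heart of RLZ: encode a second genome as phrases copied from a reference. It is a naive quadratic search for intuition only, nothing like the compressed, searchable index the paper builds:

```python
def rlz_parse(reference, target):
    """Greedy relative Lempel-Ziv parse of target against reference."""
    phrases = []
    i = 0
    while i < len(target):
        pos, length = -1, 0
        # Extend the longest substring of target[i:] that occurs in reference.
        while i + length < len(target):
            hit = reference.find(target[i:i + length + 1])
            if hit == -1:
                break
            pos, length = hit, length + 1
        if length == 0:
            phrases.append(("literal", target[i]))   # character absent from reference
            i += 1
        else:
            phrases.append((pos, length))            # (position in reference, length)
            i += length
    return phrases

reference = "ACGTACGTTTGACCA"
target    = "ACGTACGATTGACCA"   # one substitution relative to the reference
print(rlz_parse(reference, target))   # [(0, 7), (0, 1), (8, 7)]
```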

It would be a step forward to see a comparison of this method against other indexing techniques on a common genome data set.

I suppose I am presuming a common genome data set for indexing demonstrations.

Questions:

  • Is there a common genome data set for comparison of indexing techniques?
  • Are there other indexing techniques that should be included in a comparison?

Obviously important for topic maps used in genome projects.

But insights about identifying subjects that vary only slightly in one (or more) dimensions will be useful in other contexts.

An easy example would be isotopes. Let’s see, ah, celestial or other coordinate systems. Don’t know but would guess that spectra from stars/galaxies are largely common. (Do you know for sure?) What other data sets have subjects that are identified on the basis of small or incremental changes in a largely identical identifier?

A Faster Grammar-Based Self-Index

Filed under: Bioinformatics,Biomedical,Indexing — Patrick Durusau @ 6:06 pm

A Faster Grammar-Based Self-Index by Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, Simon J. Puglisi.

Abstract:

To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on straight-line programs and LZ77. In this paper we show how, given a balanced straight-line program for a string S[1..n] whose LZ77 parse consists of z phrases, we can add O(z log log z) words and obtain a compressed self-index for S such that, given a pattern P[1..m], we can list the occ occurrences of P in S in O(m^2 + (m + occ) log log n) time. All previous self-indexes are either larger or slower in the worst case.

Updated version of the paper I covered at: A Faster LZ77-Based Index.

In a very real sense, indexing is fundamental to information retrieval. That is to say that when information is placed in electronic storage, the only way to retrieve it is via indexing. The index may be one to one with a memory location and hence not terribly efficient, but the fact remains that an index is part of every information retrieval transaction.

December 20, 2011

Standard Measures in Genomic Studies

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 8:25 pm

Standard Measures in Genomic Studies

This news story caught my eye and leads off simply enough:

Standards can make our lives better. We have standards for manufacturing many items — from car parts to nuts and bolts — that improve the reliability and compatibility of all sorts of widgets that we use in our daily lives. Without them, many tasks would be difficult, a bit like trying to fit a square peg into a round hole. There are even standard measures to collect information from participants of large population genomic studies that can be downloaded for free from the Consensus Measures for Phenotype and eXposures (PhenX) Toolkit [phenxtoolkit.org]. However, researchers will only adopt such standard measures if they can be used easily.

That is why the NHGRI’s Office of Population Genomics has launched a new effort called the PhenX Real-world Implementation and Sharing (PhenX RISING) program. The National Human Genome Research Institute (NHGRI) has awarded nearly $900,000, with an additional $100,000 from NIH Office of Behavioral and Social Sciences Research (OBSSR), to seven investigators to use and evaluate the standards. Each investigator will incorporate a variety of PhenX measures into their ongoing genome-wide association or large population study. These researchers will also make recommendations as to how to fine-tune the PhenX Toolkit.

OK, good for them, or at least the researchers who get the grants, but what does that have to do with topic maps?

Just a bit further the announcement says:

GWAS have identified more than a thousand associations between genetic variants and common diseases such as cancer and heart disease, but the majority of the studies do not share standard measures. PhenX standard measures are important because they allow researchers to more easily combine data from different studies to see if there are overlapping genetic factors between or among different diseases. This ability will improve researchers’ understanding of disease and may eventually be used to assess a patient’s genetic risk of getting a disease such as diabetes or cancer and to customize treatment.

OK, so there are existing studies that don’t share standard measures, there will be more studies that don’t share standard measures while the PhenX RISING program goes on, and there may be future studies that don’t share standard measures while PhenX RISING is being adjusted.

Depending upon the nature of the measures that are not shared and the importance of mapping between these non-shared standards, this sounds like fertile ground for topic map prospecting.

Talking Glossary of Genetic Terms

Filed under: Bioinformatics — Patrick Durusau @ 8:23 pm

Talking Glossary of Genetic Terms

From the webpage:

The National Human Genome Research Institute (NHGRI) created the Talking Glossary of Genetic Terms to help everyone understand the terms and concepts used in genetic research. In addition to definitions, specialists in the field of genetics share their descriptions of terms, and many terms include images, animation and links to related terms.

Getting Started:

Enter a search term or explore the list of terms by selecting a letter from the alphabet on the left and then select from the terms revealed. (A text-only version is available from here.)

The Talking Glossary

At the bottom of most pages in the Talking Glossary are links to help you get the most out of this glossary.

Linked information explains how to cite a term from the Glossary in a reference paper. Another link allows you to suggest a term currently not in the glossary that you feel would be a valuable addition. And there is a link to email any of the 200+ terms to a friend.

Useful resource, particularly the links to additional information.

December 19, 2011

GENIA Project

Filed under: Bioinformatics — Patrick Durusau @ 8:11 pm

GENIA Project: Mining literature for knowledge in molecular biology.

From the webpage:

The GENIA project seeks to automatically extract useful information from texts written by scientists to help overcome the problems caused by information overload. We intend that while the methods are customized for application in the micro-biology domain, the basic methods should be generalisable to knowledge acquisition in other scientific and engineering domains.

We are currently working on the key task of extracting event information about protein interactions. This type of information extraction requires the joint effort of many sources of knowledge, which we are now developing. These include a parser, ontology, thesaurus and domain dictionaries as well as supervised learning models.

Be aware that the project uses the acronym “TM” for “text mining.” Anyone can clearly see that “TM” should be expanded to “topic map.” 😉 Just teasing.

GENIA has a corpus of texts and a number of tools for mining texts.

Visions of a semantic molecular future

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 8:10 pm

Visions of a semantic molecular future

I already have a post on the Journal of Cheminformatics but this looked like it needed a separate post.

This thematic issue arose from a symposium held in the Unilever Centre [for Molecular Science Informatics, Department of Chemistry, University of Cambridge] on 2011-01-15/17 to celebrate the career of Peter Murray-Rust. From the programme:

This symposium addresses the creativity of the maturing Semantic Web to the unrealized potential of Molecular Science. The world is changing and we are in the middle of many revolutions: Cloud computing; the Semantic Web; the Fourth Paradigm (data-driven science); web democracy; weak AI; pervasive devices; citizen science; Open Knowledge. Technologies can develop in months to a level where individuals and small groups can change the world. However science is hamstrung by archaic approaches to the publication, redistribution and re-use of information and much of the vision is (just) out of reach. Social, as well as technical, advances are required to realize the full potential. We’ve asked leading scientists to let their imagination explore the possible and show us how to get there.

This is a starting point for all of us – the potential of working with the virtual world of scientists and citizens, coordinated through organizations such as the Open Knowledge Foundation and continuing connection with the Cambridge academic community makes this one of the central points for my future.

The pages in this document represent vibrant communities of practice which are growing and are offered to the world as contributions to a semantic molecular future.

We have combined talks from the symposium with work from the Murray-Rust group into 15 articles.

Quickly, just a couple of the articles with abstracts to get you interested:

“Openness as infrastructure”
John Wilbanks Journal of Cheminformatics 2011, 3:36 (14 October 2011)

The advent of open access to peer reviewed scholarly literature in the biomedical sciences creates the opening to examine scholarship in general, and chemistry in particular, to see where and how novel forms of network technology can accelerate the scientific method. This paper examines broad trends in information access and openness with an eye towards their applications in chemistry.

“Open Bibliography for Science, Technology, and Medicine”
Richard Jones, Mark MacGillivray, Peter Murray-Rust, Jim Pitman, Peter Sefton, Ben O’Steen, William Waites Journal of Cheminformatics 2011, 3:47 (14 October 2011)

The concept of Open Bibliography in science, technology and medicine (STM) is introduced as a combination of Open Source tools, Open specifications and Open bibliographic data. An Openly searchable and navigable network of bibliographic information and associated knowledge representations, a Bibliographic Knowledge Network, across all branches of Science, Technology and Medicine, has been designed and initiated. For this large scale endeavour, the engagement and cooperation of the multiple stakeholders in STM publishing – authors, librarians, publishers and administrators – is sought.

It should be interesting when it is generally realized that the information people have hoarded over the years isn’t important. It is the human mind that perceives, manipulates, and draws conclusions from information that gives it any value at all.

OpenHelix

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 8:10 pm

OpenHelix

From the about page:

More efficient use of the most relevant resources means quicker and more effective research. OpenHelix empowers researchers by

  • providing a search portal to find the most relevant genomics resource and training on those resources.
  • distributing extensive and effective tutorials and training materials on the most powerful and popular genomics resources.
  • contracting with resource providers to provide comprehensive, long-term training and outreach programs.

If you are interested in learning the field of genomics research, other than by returning to graduate/medical school, this site will figure high on your list of resources.

It offers a very impressive gathering of both commercial and non-commercial resources under one roof.

I haven’t taken any of the tutorials produced by OpenHelix and so would appreciate comments from anyone who has.

Bioinformatics is an important subject area for topic maps for several reasons:

First, the comparatively long-term interest in, and actual use of, computers in biology indicates there is a need for information for which other people will spend money. There is a key phrase in that sentence: “…for which other people will spend money.” You are already spending your time working on topic maps, so it is important to identify other people who are willing to part with cash for your software or assistance. Bioinformatics is a field where that is already known to happen: other people spend their money on software or expertise.

Second, for all of the progress on identification issues in bioinformatics, any bioinformatics journal you pick up will have references to the need for greater integration of biological resources. There is plenty of opportunity now and, as far as anyone can tell, for many tomorrows to follow.

Third, for good or ill, any progress in the field attracts a disproportionate amount of coverage. The public rarely reads or sees coverage of discoveries turning out to be less than what was initially reported. And not only health professionals hear such news, so it would be good PR for topic maps.

Journal of Biomedical Semantics

Filed under: Bioinformatics,Biomedical,Semantics — Patrick Durusau @ 8:10 pm

Journal of Biomedical Semantics

From Aims and Scope:

Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain. The scope of the journal covers two main areas:

Infrastructure for biomedical semantics: focusing on semantic resources and repositories, meta-data management and resource description, knowledge representation and semantic frameworks, the Biomedical Semantic Web, and semantic interoperability.

Semantic mining, annotation, and analysis: focusing on approaches and applications of semantic resources; and tools for investigation, reasoning, prediction, and discoveries in biomedicine.

Research in biology and biomedicine relies on various types of biomedical data, information, and knowledge, represented in databases with experimental and/or curated data, ontologies, literature, taxonomies, and so on. Semantics is essential for accessing, integrating, and analyzing such data. The ability to explicitly extract, assign, and manage semantic representations is crucial for making computational approaches in the biomedical domain productive for a large user community.

Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain, and comprises practical and theoretical advances in biomedical semantics research with implications for data analysis.

In recent years, the availability and use of electronic resources representing biomedical knowledge has greatly increased, covering ontologies, taxonomies, literature, databases, and bioinformatics services. These electronic resources contribute to advances in the biomedical domain and require interoperability between them through various semantic descriptors. In addition, the availability and integration of semantic resources is a key part in facilitating semantic web approaches for life sciences leading into reasoning and other advanced ways to analyse biomedical data.

Random items to whet your appetite:

The 2nd DBCLS BioHackathon: interoperable bioinformatics Web services for integrated applications
Toshiaki Katayama, Mark D Wilkinson, Rutger Vos, Takeshi Kawashima, Shuichi Kawashima, Mitsuteru Nakao, Yasunori Yamamoto, Hong-Woo Chun, Atsuko Yamaguchi, Shin Kawano, Jan Aerts, Kiyoko F Aoki-Kinoshita, Kazuharu Arakawa, Bruno Aranda, Raoul JP Bonnal, José M Fernández, Takatomo Fujisawa, Paul MK Gordon, Naohisa Goto, Syed Haider, Todd Harris, Takashi Hatakeyama, Isaac Ho, Masumi Itoh, Arek Kasprzyk, Nobuhiro Kido, Young-Joo Kim, Akira R Kinjo, Fumikazu Konishi, Yulia Kovarskaya Journal of Biomedical Semantics 2011, 2:4 (2 August 2011)

Simple tricks for improving pattern-based information extraction from the biomedical literature
Quang Nguyen, Domonkos Tikk, Ulf Leser Journal of Biomedical Semantics 2010, 1:9 (24 September 2010)

Rewriting and suppressing UMLS terms for improved biomedical term identification
Kristina M Hettne, Erik M van Mulligen, Martijn J Schuemie, Bob JA Schijvenaars, Jan A Kors Journal of Biomedical Semantics 2010, 1:5 (31 March 2010)

Journal of Computing Science and Engineering

Filed under: Bioinformatics,Computer Science,Linguistics,Machine Learning,Record Linkage — Patrick Durusau @ 8:09 pm

Journal of Computing Science and Engineering

From the webpage:

Journal of Computing Science and Engineering (JCSE) is a peer-reviewed quarterly journal that publishes high-quality papers on all aspects of computing science and engineering. The primary objective of JCSE is to be an authoritative international forum for delivering both theoretical and innovative applied researches in the field. JCSE publishes original research contributions, surveys, and experimental studies with scientific advances.

The scope of JCSE covers all topics related to computing science and engineering, with a special emphasis on the following areas: embedded computing, ubiquitous computing, convergence computing, green computing, smart and intelligent computing, and human computing.

I got here from following a sponsor link at a bioinformatics conference.

Then just picking at random from the current issue I see:

A Fast Algorithm for Korean Text Extraction and Segmentation from Subway Signboard Images Utilizing Smartphone Sensors by Igor Milevskiy, Jin-Young Ha.

Abstract:

We present a fast algorithm for Korean text extraction and segmentation from subway signboards using smart phone sensors in order to minimize computational time and memory usage. The algorithm can be used as preprocessing steps for optical character recognition (OCR): binarization, text location, and segmentation. An image of a signboard captured by smart phone camera while holding smart phone by an arbitrary angle is rotated by the detected angle, as if the image was taken by holding a smart phone horizontally. Binarization is only performed once on the subset of connected components instead of the whole image area, resulting in a large reduction in computational time. Text location is guided by user’s marker-line placed over the region of interest in binarized image via smart phone touch screen. Then, text segmentation utilizes the data of connected components received in the binarization step, and cuts the string into individual images for designated characters. The resulting data could be used as OCR input, hence solving the most difficult part of OCR on text area included in natural scene images. The experimental results showed that the binarization algorithm of our method is 3.5 and 3.7 times faster than Niblack and Sauvola adaptive-thresholding algorithms, respectively. In addition, our method achieved better quality than other methods.

Secure Blocking + Secure Matching = Secure Record Linkage by Alexandros Karakasidis, Vassilios S. Verykios.

Abstract:

Performing approximate data matching has always been an intriguing problem for both industry and academia. This task becomes even more challenging when the requirement of data privacy rises. In this paper, we propose a novel technique to address the problem of efficient privacy-preserving approximate record linkage. The secure framework we propose consists of two basic components. First, we utilize a secure blocking component based on phonetic algorithms statistically enhanced to improve security. Second, we use a secure matching component where actual approximate matching is performed using a novel private approach of the Levenshtein Distance algorithm. Our goal is to combine the speed of private blocking with the increased accuracy of approximate secure matching.
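The blocking-plus-matching pattern that paper builds on is easy to sketch; here is a minimal, deliberately non-secure version (Soundex codes form the blocks, Levenshtein distance does the matching within each block). The privacy-preserving machinery that is the paper’s actual contribution is omitted, and the names are made up:

```python
from collections import defaultdict

def soundex(word):
    """A simplified Soundex code, good enough for a blocking sketch."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    out, last = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            out += code
        if ch not in "hw":
            last = code
    return (out + "000")[:4]

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

records = ["jonathan smith", "johnathan smyth", "maria garcia", "mario garza"]
blocks = defaultdict(list)
for rec in records:
    blocks[soundex(rec.split()[-1])].append(rec)   # block on surname code

for block in blocks.values():
    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            if levenshtein(block[i], block[j]) <= 3:
                print("candidate match:", block[i], "~", block[j])
```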

A Survey of Transfer and Multitask Learning in Bioinformatics by Qian Xu, Qiang Yang.

Abstract:

Machine learning and data mining have found many applications in biological domains, where we look to build predictive models based on labeled training data. However, in practice, high quality labeled data is scarce, and to label new data incurs high costs. Transfer and multitask learning offer an attractive alternative by allowing useful knowledge to be extracted and transferred from data in auxiliary domains, which helps counter the lack-of-data problem in the target domain. In this article, we survey recent advances in transfer and multitask learning for bioinformatics applications. In particular, we survey several key bioinformatics application areas, including sequence classification, gene expression data analysis, biological network reconstruction and biomedical applications.

And the ones I didn’t list from the current issue are just as interesting and relevant to identity/mapping issues.

This journal is a good example of people who have deliberately reached further across disciplinary boundaries than most.

About the only excuse left for not doing so is the discomfort of being the newbie in a field not your own.

Is that a good enough reason to miss possible opportunities to make critical advances in your home field? (Only you can answer that for yourself. No one can answer it for you.)

Journal of Bioinformatics and Computational Biology (JBCB)

Filed under: Bioinformatics,Computational Biology — Patrick Durusau @ 8:08 pm

Journal of Bioinformatics and Computational Biology (JBCB)

From the Aims and Scope page:

The Journal of Bioinformatics and Computational Biology aims to publish high quality, original research articles, expository tutorial papers and review papers as well as short, critical comments on technical issues associated with the analysis of cellular information.

The research papers will be technical presentations of new assertions, discoveries and tools, intended for a narrower specialist community. The tutorials, reviews and critical commentary will be targeted at a broader readership of biologists who are interested in using computers but are not knowledgeable about scientific computing, and equally, computer scientists who have an interest in biology but are not familiar with current thrusts nor the language of biology. Such carefully chosen tutorials and articles should greatly accelerate the rate of entry of these new creative scientists into the field.

To give you an idea of the type of content you will find, consider:

A Re-evaluation of Biomedical Named Entity–Term Relations by Tomoko Ohta, Sampo Pyysalo, Jin-Dong Kim, Jun’ichi Tsujii. Volume 8, Issue 5 (2010), pp. 917–928. DOI: 10.1142/S0219720010005014.

Abstract:

Text mining can support the interpretation of the enormous quantity of textual data produced in biomedical field. Recent developments in biomedical text mining include advances in the reliability of the recognition of named entities (NEs) such as specific genes and proteins, as well as movement toward richer representations of the associations of NEs. We argue that this shift in representation should be accompanied by the adoption of a more detailed model of the relations holding between NEs and other relevant domain terms. As a step toward this goal, we study NE–term relations with the aim of defining a detailed, broadly applicable set of relation types based on accepted domain standard concepts for use in corpus annotation and domain information extraction approaches.

as representative content.

Enjoy!

December 17, 2011

Broad Institute

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 6:30 am

Broad Institute

In their own words:

The Eli and Edythe L. Broad Institute of Harvard and MIT is founded on two core beliefs:

  1. This generation has a historic opportunity and responsibility to transform medicine by using systematic approaches in the biological sciences to dramatically accelerate the understanding and treatment of disease.
  2. To fulfill this mission, we need new kinds of research institutions, with a deeply collaborative spirit across disciplines and organizations, and having the capacity to tackle ambitious challenges.

The Broad Institute is essentially an “experiment” in a new way of doing science, empowering this generation of researchers to:

  • Act nimbly. Encouraging creativity often means moving quickly, and taking risks on new approaches and structures that often defy conventional wisdom.
  • Work boldly. Meeting the biomedical challenges of this generation requires the capacity to mount projects at any scale — from a single individual to teams of hundreds of scientists.
  • Share openly. Seizing scientific opportunities requires creating methods, tools and massive data sets — and making them available to the entire scientific community to rapidly accelerate biomedical advancement.
  • Reach globally. Biomedicine should address the medical challenges of the entire world, not just advanced economies, and include scientists in developing countries as equal partners whose knowledge and experience are critical to driving progress.

The Detecting Novel Associations in Large Data Sets software and data are from the Broad Institute.

Sounds like the sort of place that would be interested in enhancing research and sharing of information with topic maps.

December 16, 2011

Detecting Novel Associations in Large Data Sets

Filed under: Bioinformatics,Data Mining,Statistics — Patrick Durusau @ 8:23 am

Detecting Novel Associations in Large Data Sets by David N. Reshef, Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, Pardis C. Sabeti.

Abstract:

Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R2) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.

Lay version: Tool detects patterns hidden in vast data sets by Haley Bridger.

Data and software: http://exploredata.net/.

From the article:

Imagine a data set with hundreds of variables, which may contain important, undiscovered relationships. There are tens of thousands of variable pairs—far too many to examine manually. If you do not already know what kinds of relationships to search for, how do you efficiently identify the important ones? Data sets of this size are increasingly common in fields as varied as genomics, physics, political science, and economics, making this question an important and growing challenge (1, 2).

One way to begin exploring a large data set is to search for pairs of variables that are closely associated. To do this, we could calculate some measure of dependence for each pair, rank the pairs by their scores, and examine the top-scoring pairs. For this strategy to work, the statistic we use to measure dependence should have two heuristic properties: generality and equitability.

By generality, we mean that with sufficient sample size the statistic should capture a wide range of interesting associations, not limited to specific function types (such as linear, exponential, or periodic), or even to all functional relationships (3). The latter condition is desirable because not only do relationships take many functional forms, but many important relationships—for example, a superposition of functions—are not well modeled by a function (4–7).

By equitability, we mean that the statistic should give similar scores to equally noisy relationships of different types. For example, we do not want noisy linear relationships to drive strong sinusoidal relationships from the top of the list. Equitability is difficult to formalize for associations in general but has a clear interpretation in the basic case of functional relationships: An equitable statistic should give similar scores to functional relationships with similar R2 values (given sufficient sample size).

Here, we describe an exploratory data analysis tool, the maximal information coefficient (MIC), that satisfies these two heuristic properties. We establish MIC’s generality through proofs, show its equitability on functional relationships through simulations, and observe that this translates into intuitively equitable behavior on more general associations. Furthermore, we illustrate that MIC gives rise to a larger family of statistics, which we refer to as MINE, or maximal information-based nonparametric exploration. MINE statistics can be used not only to identify interesting associations, but also to characterize them according to properties such as nonlinearity and monotonicity. We demonstrate the application of MIC and MINE to data sets in health, baseball, genomics, and the human microbiota. (footnotes omitted)

As you can imagine the line:

MINE statistics can be used not only to identify interesting associations, but also to characterize them according to properties such as nonlinearity and monotonicity.

caught my eye.
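For a feel of the grid-based intuition behind MIC, here is a crude sketch: bin both variables, estimate mutual information from the joint histogram, and normalize. It is not the authors’ MINE algorithm (which searches over many grid resolutions), only a demonstration of why a grid-based score can see nonlinear relationships that correlation misses:

```python
import numpy as np

def grid_dependence(x, y, bins=8):
    """Normalized mutual information over a fixed bins-by-bins grid."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))
    return mi / np.log2(bins)                      # roughly scaled to [0, 1]

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 5000)
y = x ** 2 + rng.normal(0, 0.05, x.size)           # noisy parabola

print("Pearson r:  ", np.corrcoef(x, y)[0, 1])     # near zero
print("grid score: ", grid_dependence(x, y))       # clearly above zero
```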

I usually don’t post until the evening but this looks very important. I wanted everyone to have a chance to grab the data and software before the weekend.

New acronyms:

MIC – maximal information coefficient

MINE – maximal information-based nonparametric exploration

Good thing they chose acronyms we would not be likely to confuse with other usages. 😉

Full citation:

Science 16 December 2011:
Vol. 334 no. 6062 pp. 1518-1524
DOI: 10.1126/science.1205438

December 12, 2011

NLM Plus

Filed under: Bioinformatics,Biomedical,Search Algorithms,Search Engines — Patrick Durusau @ 10:22 pm

NLM Plus

From the webpage:

NLMplus is an award winning Semantic Search Engine and Biomedical Knowledge Base application that showcases a variety of natural language processing tools to provide an improved level of access to the vast collection of biomedical data and services of the National Library of Medicine.

Utilizing its proprietary Web Knowledge Base, WebLib LLC can apply the universal search and semantic technology solutions demonstrated by NLMplus to libraries, businesses, and research organizations in all domains of science and technology and Web applications.

Any medical librarians in the audience? Or ones you can forward this post to?

I’m curious what professional researchers make of NLM Plus. I don’t have the domain expertise to evaluate it.

Thanks!

December 5, 2011

Medical Text Indexer (MTI)

Filed under: Bioinformatics,Biomedical,Indexing — Patrick Durusau @ 7:42 pm

Medical Text Indexer (MTI) (formerly the Indexing Initiative System (IIS))

From the webpage:

The MTI system consists of software for applying alternative methods of discovering MeSH headings for citation titles and abstracts and then combining them into an ordered list of recommended indexing terms. The top portion of the diagram consists of three paths, or methods, for creating a list of recommended indexing terms: MetaMap Indexing, Trigrams and PubMed Related Citations. The first two paths actually compute UMLS Metathesaurus® concepts which are passed to the Restrict to MeSH process. The results from each path are weighted and combined using the Clustering process. The system is highly parameterized not only by path weights but also by several parameters specific to the Restrict to MeSH and Clustering processes.

A prototype MTI system described below had two additional indexing methods which were removed because their results were subsumed by the three remaining methods.

Deeply interesting and relevant work to topic maps.
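The “weighted and combined” step is worth a tiny illustration. The sketch below fuses heading suggestions from several paths with per-path weights; the paths, weights, and scores are invented for illustration, and NLM’s actual clustering process is more involved:

```python
from collections import defaultdict

path_results = {
    "metamap":           {"Neoplasms": 0.9, "Genes": 0.6},
    "trigrams":          {"Neoplasms": 0.4, "Mutation": 0.5},
    "related_citations": {"Genes": 0.7, "Mutation": 0.3},
}
path_weights = {"metamap": 1.0, "trigrams": 0.5, "related_citations": 0.8}

combined = defaultdict(float)
for path, headings in path_results.items():
    for heading, score in headings.items():
        combined[heading] += path_weights[path] * score

# Ordered list of recommended indexing terms, highest combined score first.
for heading, score in sorted(combined.items(), key=lambda kv: -kv[1]):
    print(f"{heading}: {score:.2f}")
```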

MetaMap Portal

Filed under: Bioinformatics,Biomedical,MetaMap,Metathesaurus — Patrick Durusau @ 7:41 pm

MetaMap Portal

About MetaMap:

MetaMap is a highly configurable program developed by Dr. Alan (Lan) Aronson at the National Library of Medicine (NLM) to map biomedical text to the UMLS Metathesaurus or, equivalently, to discover Metathesaurus concepts referred to in text. MetaMap uses a knowledge-intensive approach based on symbolic, natural-language processing (NLP) and computational-linguistic techniques. Besides being applied for both IR and data-mining applications, MetaMap is one of the foundations of NLM’s Medical Text Indexer (MTI) which is being used for both semiautomatic and fully automatic indexing of biomedical literature at NLM. For more information on MetaMap and related research, see the SKR Research Information Site.

Improvement in the October 2011 Release:

MetaMap2011 includes some significant enhancements, most notably algorithmic improvements that enable MetaMap to very quickly process input text that had previously been computationally intractable.

These enhancements include:

  • Algorithmic Improvements
  • Candidate Set Pruning
  • Re-Organization of Additional Data Models
  • Single-character alphabetic tokens
  • Improved Treatment of Apostrophe-“s”
  • New XML Command-Line Options
  • Numbered Mappings
  • User-Defined Acronyms and Abbreviations

Starting with MetaMap 2011, MetaMap is now available for Windows XP and Windows 7.

One of several projects that sound very close to being topic map mining programs.

December 4, 2011

FACTA

Filed under: Associations,Bioinformatics,Biomedical,Concept Detection,Text Analytics — Patrick Durusau @ 8:16 pm

FACTA – Finding Associated Concepts with Text Analysis

From the Quick Start Guide:

FACTA is a simple text mining tool to help discover associations between biomedical concepts mentioned in MEDLINE articles. You can navigate these associations and their corresponding articles in a highly interactive manner. The system accepts an arbitrary query term and displays relevant concepts on the spot. A broad range of concepts are retrieved by the use of large-scale biomedical dictionaries containing the names of important concepts such as genes, proteins, diseases, and chemical compounds.

A very good example of an exploration tool that isn’t overly complex to use.
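The core idea, counting which dictionary concepts co-occur in the same abstract, is simple to sketch; the toy dictionary and abstracts below are mine, while FACTA itself works over MEDLINE with large curated dictionaries:

```python
from collections import Counter
from itertools import combinations

concepts = {"p53", "apoptosis", "tamoxifen", "breast cancer"}
abstracts = [
    "p53 mediated apoptosis in breast cancer cells",
    "tamoxifen resistance in breast cancer",
    "apoptosis induced by tamoxifen treatment",
]

pair_counts = Counter()
for text in abstracts:
    found = {c for c in concepts if c in text.lower()}
    pair_counts.update(combinations(sorted(found), 2))

for (a, b), n in pair_counts.most_common():
    print(f"{a} <-> {b}: {n}")   # association candidates, ranked by support
```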

December 2, 2011

Toolset for Genomic Analysis, Data Management

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 4:56 pm

Toolset for Genomic Analysis, Data Management

From the post:

The informatics group at the Genome Institute at Washington University School of Medicine has released an integrated analysis and information-management system called the Genome Modeling System.

The system borrows concepts from traditional laboratory information-management systems — such as tracking methods and data-access interfaces — and applies them to genomic analysis. The result is a standardized system that integrates both analysis and management capabilities, David Dooling, the assistant director of informatics at Wash U and one of the developers of GMS, explained to BioInform.

Not exactly. The tools that will make up the “Genome Modeling System” have been released, but melding them into the full package is something we will see near the end of this year (as noted later in the article).

I remember the WU-FTPD software from before it fell into disrepute, so I have great expectations for this software. I will keep watch and post a note when it appears for download.

openSNP

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 4:51 pm

openSNP

Don’t recognize the name? I didn’t either when I came across it on Genoweb under the title Battle Over.

Then I read the homepage blurb:

openSNP allows customers of direct-to-customer genetic tests to publish their test results, find others with similar genetic variations, learn more about their results, find the latest primary literature on their variations and help scientists to find new associations.

I think we will be hearing more about openSNP in the not too distant future.

Sounds like a useful place to talk about topic maps, but in terms of their semantic impedances and their identifiers for subjects.

Hard to sell a product if we are fixing a “problem” that no one sees as a “problem.”

November 18, 2011

Deja vu: a Database of Highly Similar Citations

Filed under: Bioinformatics,Biomedical,Deja vu — Patrick Durusau @ 9:37 pm

Deja vu: a Database of Highly Similar Citations

From the webpage:

Deja vu is a database of extremely similar Medline citations. Many, but not all, of which contain instances of duplicate publication and potential plagiarism. Deja vu is a dynamic resource for the community, with manual curation ongoing continuously, and we welcome input and comments.

In the scientific research community plagiarism and multiple publications of the same data are considered unacceptable practices and can result in tremendous misunderstanding and waste of time and energy. Our peers and the public have high expectations for the performance and behavior of scientists during the execution and reporting of research. With little chance for discovery and decreasing budgets, yet sustained pressure to publish, or without a clear understanding of acceptable publication practices, the unethical practices of duplicate publication and plagiarism can be enticing to some. Until now, discovery has been through serendipity alone, so these practices have largely gone unchecked.

The application of text similarity searching can robustly detect highly similar text records, offering a new tool for ensuring integrity in scientific publications. Deja vu is a database of computationally identified, manually confirmed highly similar citations (abstracts and titles), as well as user provided commentary and evidence to affirm or deny a given document’s putative categorization. It is available via the web and to other database curators for tagging of their indexed articles. The availability of a search tool, eTBLAST, by which journal submissions can be compared to existing databases to identify potential duplicate citations and intercept them before they are published, and this database of highly similar citations (or exhaustive searching and tagging within Medline and other databases) could be deterrents to this questionable scientific behavior and excellent examples of citations that are highly similar but represent very distinct research publications.

I would broaden the statement:

multiple publications of the same data are considered unacceptable practices and can result in tremendous misunderstanding and waste of time and energy.

to include repeating the same analysis or discoveries out of sheer ignorance of prior work.

Not as an ethical issue but one of “…waste of time and energy.”

Given the semantic diversity in all fields, work is repeated simply due to “tribes” as Jack Park calls them, using different terminology.

Will be using Deja vu to explore topics in *informatics to discover related materials.

If you are already using Deja vu that way, your experience, observations, comments would be deeply appreciated.
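The text-similarity searching behind Deja vu and eTBLAST can be approximated with off-the-shelf tooling. A minimal sketch using TF-IDF and cosine similarity, with made-up citations standing in for Medline records:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

citations = [
    "Expression of gene X in murine cardiac tissue under hypoxic stress",
    "Gene X expression in murine cardiac tissue under hypoxic conditions",
    "A survey of graph databases for bibliographic data",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(citations)
sims = cosine_similarity(tfidf)

threshold = 0.5   # arbitrary cutoff for "highly similar"
for i in range(len(citations)):
    for j in range(i + 1, len(citations)):
        if sims[i, j] >= threshold:
            print(f"suspiciously similar ({sims[i, j]:.2f}): {i} vs {j}")
```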

November 16, 2011

“VCF annotation” with the NHLBI GO Exome Sequencing Project (JAX-WS)

Filed under: Annotation,Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 8:17 pm

“VCF annotation” with the NHLBI GO Exome Sequencing Project (JAX-WS) by Pierre Lindenbaum.

From the post:

The NHLBI Exome Sequencing Project (ESP) has released a web service to query their data. “The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders.”

In the current post, I’ll show how I’ve used this web service to annotate a VCF file with this information.

The web service provided by the ESP is based on the SOAP protocol.
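Pierre’s post has the full SOAP details; as a generic illustration of the final “annotate a VCF” step, here is a minimal sketch that adds an INFO key for variants found in a lookup table (a hard-coded dict below, standing in for the web service responses; the ESP_MAF tag, positions, and file names are made up):

```python
lookup = {("1", 10177): "0.42", ("2", 20302): "0.01"}   # (chrom, pos) -> frequency

def annotate_vcf(in_path, out_path):
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith("#"):              # pass headers through untouched
                dst.write(line)
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos, info = fields[0], int(fields[1]), fields[7]
            freq = lookup.get((chrom, pos))
            if freq is not None:
                fields[7] = f"{info};ESP_MAF={freq}"
            dst.write("\t".join(fields) + "\n")

annotate_vcf("input.vcf", "annotated.vcf")
```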

Important news/post for several reasons:

First and foremost, “for the potential to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders.”

Second, thanks to Pierre, we have a fully worked example of how to perform the annotation.

Last but not least, the NHLBI Exome Sequencing Project (ESP) did not try to go it alone for the annotations. It did what it does well and then offered the data up for others to use and extend.

I can’t count the number of projects of varying sorts that I have seen that tried to do every feature, every annotation, every imaging, every transcription on their own, all of which resulted in their being less than they could have been with greater openness.

I am not suggesting that vendors need to give away data. Vendors for the most part support all of us. It is disingenuous to pretend otherwise. So vendors making money means we get to pay our bills, buy books and computers, etc.

What I am suggesting is that vendors, researchers and users need to work (yelling at each other doesn’t count) towards commercially viable solutions that enable greater collaboration with regard to research and data.

Otherwise we will have impoverished data sets that are never quite what they could be, and vendors will charge many, many times the real cost of developing data. Those two conditions don’t benefit anyone. “You, me, them.” (Blues Brothers) 😉

November 7, 2011

When Gamers Innovate

When Gamers Innovate

The problem (partially):

Typically, proteins have only one correct configuration. Trying to virtually simulate all of them to find the right one would require enormous computational resources and time.

On top of that there are factors concerning translational-regulation. As the protein chain is produced in a step-wise fashion on the ribosome, one end of a protein might start folding quicker and dictate how the opposite end should fold. Other factors to consider are chaperones (proteins which guide its misfolded partner into the right shape) and post-translation modifications (bits and pieces removed and/or added to the amino acids), which all make protein prediction even harder. That is why homology modelling or “machine learning” techniques tend to be more accurate. However, they all require similar proteins to be already analysed and cracked in the first place.

The solution:

Rather than locking another group of structural shamans in a basement to perform their biophysical black magic, the “Fold It” team created a game. It uses human brainpower, which is fuelled by high-octane logic and catalysed by giving it a competitive edge. Players challenge their three-dimensional problem-solving skills by trying to: 1) pack the protein 2) hide the hydrophobics and 3) clear the clashes.

Read the post or jump to the Foldit site.

Seems to me there are a lot of subject identity and relationship (association) issues that are a lot less complex than protein folding. Not that topic mappers should shy away from protein folding, but we should be more imaginative about our authoring interfaces. Yes?

November 5, 2011

Expression cartography of human tissues using self organizing maps

Filed under: Bioinformatics,Biomedical,Self Organizing Maps (SOMs),Self-Organizing — Patrick Durusau @ 6:39 pm

Expression cartography of human tissues using self organizing maps by Henry Wirth, Markus Löffler, Martin von Bergen, and Hans Binder (BMC Bioinformatics. 2011;12:306).

Abstract:

Parallel high-throughput microarray and sequencing experiments produce vast quantities of multidimensional data which must be arranged and analyzed in a concerted way. One approach to addressing this challenge is the machine learning technique known as self organizing maps (SOMs). SOMs enable a parallel sample- and gene-centered view of genomic data combined with strong visualization and second-level analysis capabilities. The paper aims at bridging the gap between the potency of SOM-machine learning to reduce dimension of high-dimensional data on one hand and practical applications with special emphasis on gene expression analysis on the other hand.

A nice introduction to self organizing maps (SOMs) in a bioinformatics context. Think of them as being yet another way to discover subjects about which people want to make statements and to attach data and analysis.
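For anyone who wants to see the mechanics without a library, here is a bare-bones SOM training loop in NumPy; the grid size, learning rate schedule, and toy data are arbitrary choices of mine, not settings from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 10))                # 500 samples, 10 features ("genes")
rows, cols, epochs = 6, 6, 20
weights = rng.normal(size=(rows, cols, data.shape[1]))
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

for epoch in range(epochs):
    lr = 0.5 * (1 - epoch / epochs)              # decaying learning rate
    radius = max(1.0, 3.0 * (1 - epoch / epochs))
    for x in data:
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(dists.argmin(), dists.shape)     # best matching unit
        grid_dist = np.linalg.norm(grid - np.array(bmu), axis=-1)
        influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))[..., None]
        weights += lr * influence * (x - weights)  # pull BMU neighbourhood toward x

print("trained codebook:", weights.shape)        # one prototype per map node
```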

November 4, 2011

Paper about “BioStar” published in PLoS Computational Biology

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 6:09 pm

Paper about “BioStar” published in PLoS Computational Biology by Pierre Lindenbaum.

I have mentioned Biostar.

Pierre links to the paper, a blog entry about the paper and has collected tweets about it.

Be forewarned about the slides if you are sensitive to remarks comparing twelve year olds and politicians. Personally I think twelve year olds have just been insulted. 😉

November 1, 2011

Parallel approaches in next-generation sequencing analysis pipelines

Filed under: Bioinformatics,Parallel Programming,Parallelism — Patrick Durusau @ 3:34 pm

Parallel approaches in next-generation sequencing analysis pipelines

From the post:

My last post described a distributed exome analysis pipeline implemented on the CloudBioLinux and CloudMan frameworks. This was a practical introduction to running the pipeline on Amazon resources. Here I’ll describe how the pipeline runs in parallel, specifically diagramming the workflow to identify points of parallelization during lane and sample processing.

Incredible innovation in throughput makes parallel processing critical for next-generation sequencing analysis. When a single Hi-Seq run can produce 192 samples (2 flowcells x 8 lanes per flowcell x 12 barcodes per lane), the analysis steps quickly become limited by the number of processing cores available.

The heterogeneity of architectures utilized by researchers is a major challenge in building re-usable systems. A pipeline needs to support powerful multi-core servers, clusters and virtual cloud-based machines. The approach we took is to scale at the level of individual samples, lanes and pipelines, exploiting the embarrassingly parallel nature of the computation. An AMQP messaging queue allows for communication between processes, independent of the system architecture. This flexible approach allows the pipeline to serve as a general framework that can be easily adjusted or expanded to incorporate new algorithms and analysis methods.

The message-passing-based parallelism sounds a lot like Storm, doesn’t it? Will message passing be what frees us from the constraints of architecture? I wonder what sort of performance “hit” we will take when not working really close to the metal. But then the “metal” may become the basis for such message-passing systems. Not quite yet, but perhaps not so far away either.
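For the curious, the AMQP pattern the post describes looks roughly like this with the pika client; the queue name, host, and payloads are placeholders, and a real pipeline would run producer and workers as separate processes:

```python
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="ngs_lanes", durable=True)

# Producer: one message per (flowcell, lane) pair.
for flowcell in ("FC1", "FC2"):
    for lane in range(1, 9):
        channel.basic_publish(exchange="", routing_key="ngs_lanes",
                              body=json.dumps({"flowcell": flowcell, "lane": lane}))

# Worker: consume tasks and run the per-lane processing step.
def handle(ch, method, properties, body):
    task = json.loads(body)
    print("processing", task)                     # real alignment/variant calling here
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="ngs_lanes", on_message_callback=handle)
channel.start_consuming()
```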

aliquote

Filed under: Bioinformatics,Data Mining — Patrick Durusau @ 3:33 pm

aliquote

One of the odder blogs I have encountered, particularly the “bag of tweets” postings.

The postings on data and bioinformatics topics appear to be fairly high grade. It is a blog I will be watching and thought I would pass along.

October 28, 2011

Network Modeling and Analysis in Health Informatics and Bioinformatics (NetMAHIB)

Filed under: Bioinformatics,Biomedical,Health care — Patrick Durusau @ 3:14 pm

Network Modeling and Analysis in Health Informatics and Bioinformatics (NetMAHIB) Editor-in-Chief: Reda Alhajj, University of Calgary.

From Springer, a new journal of health informatics and bioinformatics.

From the announcement:

NetMAHIB publishes original research articles and reviews reporting how graph theory, statistics, linear algebra and machine learning techniques can be effectively used for modelling and knowledge discovery in health informatics and bioinformatics. It aims at creating a synergy between these disciplines by providing a forum for disseminating the latest developments and research findings; hence results can be shared with readers across institutions, governments, researchers, students, and the industry. The journal emphasizes fundamental contributions on new methodologies, discoveries and techniques that have general applicability and which form the basis for network based modelling and knowledge discovery in health informatics and bioinformatics.

The NetMAHIB journal is proud to have an outstanding group of editors who widely and rigorously cover the multidisciplinary scope of the journal. They are known to be research leaders in the field of Health Informatics and Bioinformatics. Further, the NetMAHIB journal is characterized by providing thorough constructive reviews by experts in the field and by the reduced turn-around time which allows research results to be disseminated and shared on a timely basis. The target of the editors is to complete the first round of the refereeing process within about 8 to 10 weeks of submission. Accepted papers go to the online first list and are immediately made available for access by the research community.

October 25, 2011

Adding bed/wig data to dalliance genome browser

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 7:34 pm

Adding bed/wig data to dalliance genome browser

From the post:

I have been playing a bit with the dalliance genome browser. It is quite useful and I have started using it to generate links to send to researchers to show regions of interest we find from bioinformatics analyses.

I added a document to my github repo describing how to display a bed file in the browser. That rst is here and displayed inline below.

It uses the UCSC binaries for creating BigWig/BigBed files because dalliance can request a subset of the data without downloading the entire file given the correct apache configuration (also described below).

This will require a recent version of dalliance because there was a bug in the BigBed parsing until recently.

Dalliance Data Tutorial

dalliance is a web-based scrolling genome-browser. It can display data from remote DAS servers or local or remote BigWig or BigBed files.

This will cover how to set up an html page that links to remote DAS services. It will also show how to create and serve BigWig and BigBed files.

Obviously of interest to the bioinformatics community (who are no doubt already aware of it) but I wanted to point out the ability to display data from remote servers/data sets.
