Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 3, 2012

Seal

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 4:53 pm

Seal

From the site:

Seal is a Hadoop-based distributed short read alignment and analysis toolkit. Currently Seal includes tools for: read demultiplexing, read alignment, duplicate read removal, sorting read mappings, and calculating statistics for empirical base quality recalibration. Seal scales, easily handling TB of data.

Features:

  • short read alignment (based on BWA)
  • duplicate read identification
  • sort read mappings
  • calculate empirical base quality recalibration tables
  • fast, scalable, reliable (runs on Hadoop)

Seal website with extensive documentation.
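For readers new to the pipeline vocabulary, here is a minimal, single-machine Python sketch of the duplicate-read-marking idea. It is only the concept, not Seal's distributed Hadoop implementation, and real tools also take mate pairs, clipping and library information into account.

```python
from collections import defaultdict

def mark_duplicates(reads):
    """reads: iterable of dicts with 'chrom', 'pos', 'strand', 'mapq', 'name'.
    Groups reads mapped to the same (chrom, pos, strand) and flags all but
    the highest-MAPQ read in each group as duplicates. Conceptual only."""
    groups = defaultdict(list)
    for r in reads:
        groups[(r['chrom'], r['pos'], r['strand'])].append(r)
    duplicates = set()
    for group in groups.values():
        group.sort(key=lambda r: r['mapq'], reverse=True)
        duplicates.update(r['name'] for r in group[1:])
    return duplicates
```

Seal's contribution is doing exactly this kind of grouping, plus alignment and sorting, as MapReduce jobs over terabytes of reads.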

January 31, 2012

Inside the Variation Toolkit: Tools for Gene Ontology

Filed under: Bioinformatics,Biomedical,Gene Ontology — Patrick Durusau @ 4:33 pm

Inside the Variation Toolkit: Tools for Gene Ontology by Pierre Lindenbaum.

From the post:

GeneOntologyDbManager is a C++ tool that is part of my experimental Variation Toolkit.

This program is a set of tools for GeneOntology; it is based on the sqlite3 library.

Pierre walks through building and using his GeneOntologyDbManager.

Rather appropriate to mention an area (bioinformatics) that is exploding with information on the same day as GPU and database posts. Plus I am sure you will find the Gene Ontology useful for topic map purposes.
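If you want a feel for what a sqlite3-backed Gene Ontology query looks like, here is a small Python sketch. The schema (a term2term table with child_acc/parent_acc columns holding is_a links) is an assumption for illustration; the tables GeneOntologyDbManager actually builds may differ.

```python
import sqlite3

def ancestors(db_path, acc):
    """Climb is_a links from a GO accession and collect all ancestor terms.
    Assumes a hypothetical table term2term(child_acc, parent_acc)."""
    conn = sqlite3.connect(db_path)
    seen, frontier = set(), {acc}
    while frontier:
        current = frontier.pop()
        rows = conn.execute(
            "SELECT parent_acc FROM term2term WHERE child_acc = ?", (current,))
        for (parent,) in rows:
            if parent not in seen:
                seen.add(parent)
                frontier.add(parent)
    conn.close()
    return seen

# ancestors('go.sqlite', 'GO:0006915')  # apoptotic process, for example
```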

January 11, 2012

Bio4j release 0.7 is out!

Filed under: Bioinformatics,Biomedical,Cypher,Graphs,Gremlin,Medical Informatics,Visualization — Patrick Durusau @ 8:02 pm

Bio4j release 0.7 is out!

A quick list of the new features:

  • Expasy Enzyme database integration
  • Node type indexing
  • Amazon Web Services availability in all regions
  • New CloudFormation templates
  • Bio4j REST server
  • Explore your database with the Data browser
  • Run queries with Cypher
  • Querying Bio4j with Gremlin

Wait! Did I say Cypher and Gremlin!?

Looks like this graph querying stuff is spreading. 🙂

Even if you are not working in bioinformatics, Bio4j is worth more than a quick look.
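As a taste of what querying Bio4j with Cypher might look like from Python, here is a hedged sketch. The REST endpoint path follows the Neo4j 1.x server convention, and the index name, relationship type and property names are placeholders rather than Bio4j's actual schema; check the Bio4j documentation before using any of them.

```python
import requests

# Endpoint and schema details are assumptions, not Bio4j's documented API.
CYPHER_URL = "http://localhost:7474/db/data/cypher"

query = """
START p = node:protein_accession_index(accession = {acc})
MATCH p -[:PROTEIN_PROTEIN_INTERACTION]- q
RETURN q.accession
"""

resp = requests.post(CYPHER_URL,
                     json={"query": query, "params": {"acc": "P04637"}})
print(resp.json())  # interaction partners of p53 (UniProt P04637), in principle
```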

January 9, 2012

SIMI 2012 : Semantic Interoperability in Medical Informatics

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 1:48 pm

SIMI 2012 : Semantic Interoperability in Medical Informatics

Dates:

When: May 27, 2012
Where: Heraklion (Crete), Greece
Submission Deadline: Mar 4, 2012
Notification Due: Apr 1, 2012
Final Version Due: Apr 15, 2012

From the call for papers:

To gather data on potential application to new diseases and disorders is increasingly to be not only a means for evaluating the effectiveness of new medicine and pharmaceutical formulas but also for experimenting on existing drugs and their appliance to new diseases and disorders. Although the wealth of published non-clinical and clinical information is increasing rapidly, the overall number of new active substances undergoing regulatory review is gradually falling, whereas pharmaceutical companies tend to prefer launching modified versions of existing drugs, which present reduced risk of failure and can generate generous profits. In the meanwhile, market numbers depict the great difficulty faced by clinical trials in successfully translating basic research into effective therapies for the patients. In fact, success rates, from first dose in man in clinical trials to registration of the drug and release in the market, are only about 11% across indications. But, even if a treatment reaches the broad patient population through healthcare, it may prove not to be as effective and/or safe as indicated in the clinical research findings.

Within this context, bridging basic science to clinical practice comprises a new scientific challenge which can result in successful clinical applications with low financial cost. The efficacy of clinical trials, in combination with the mitigation of patients’ health risks, requires the pursuit of a number of aspects that need to be addressed ranging from the aggregation of data from various heterogeneous distributed sources (such as electronic health records – EHRs, disease and drug data sources, etc) to the intelligent processing of this data based on the study-specific requirements for choosing the “right” target population for the therapy and in the end selecting the patients eligible for recruitment.

Data collection poses a significant challenge for investigators, due to the non-interoperable heterogeneous distributed data sources involved in the life sciences domain. A great amount of medical information crucial to the success of a clinical trial could be hidden inside a variety of information systems that do not share the same semantics and/or structure or adhere to widely deployed clinical data standards. Especially in the case of EHRs, the wealth of information within them, which could provide important information and allow of knowledge enrichment in the clinical trial domain (during test of hypothesis generation and study design) as well as act as a fast and reliable bridge between study requirements for recruitment and patients who would like to participate in them, still remains unlinked from the clinical trial lifecycle posing restrictions in the overall process. In addition, methods for efficient literature search and hypothesis validation are needed, so that principal investigators can research efficiently on new clinical trial cases.

The goal of the proposed workshop is to foster exchange of ideas and offer a suitable forum for discussions among researchers and developers on great challenges that are posed in the effort of combining information underlying the large number of heterogeneous data sources and knowledge bases in life sciences, including:

  • Strong multi-level (semantic, structural, syntactic, interface) heterogeneity issues in clinical research and healthcare domains
  • Semantic interoperability both at schema and data/instance level
  • Handling of unstructured information, i.e., literature articles
  • Reasoning on the wealth of existing data (published findings, background knowledge on diseases, drugs, targets, Electronic Health Records) can boost and enhance clinical research and clinical care processes
  • Acquisition/extraction of new knowledge from published information and Electronic Health Records
  • Enhanced matching between clinicians’ as well as patients’ needs and available informational content

Apologies for the length of the quote, but this is a tough nut that simply saying “topic maps” isn’t going to solve. As described above, there is a set of domains, each with its own information gathering, processing and storage practices, none of which are going to change rapidly, or consistently.

Although I think topic maps can play a role in solving this sort of issue, it will be by being the “integration rain drop” that starts with some obvious integration issue and solves it and only it, without trying to be a solution for every issue or requirement. Having solved one, it then spreads out to solve another.

The key is going to be the delivery of clear and practical advantages in concrete situations.

One approach could be to identify current semantic integration efforts (which tend to have global aspirations) and effect semantic mappings between those solutions. This has the advantage of allowing the advocates of those systems to continue their work, while a topic map offers other systems an integration of the data drawn from those parts.

International Symposium on Bioinformatics Research and Applications

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 1:30 pm

ISBRA 2012 : International Symposium on Bioinformatics Research and Applications

Dates:

When: May 21, 2012 – May 23, 2012
Where: Dallas, Texas
Submission Deadline: Feb 6, 2012
Notification Due: Mar 5, 2012
Final Version Due: Mar 15, 2012

From the call for papers:

The International Symposium on Bioinformatics Research and Applications (ISBRA) provides a forum for the exchange of ideas and results among researchers, developers, and practitioners working on all aspects of bioinformatics and computational biology and their applications. Submissions presenting original research are solicited in all areas of bioinformatics and computational biology, including the development of experimental or commercial systems.

January 7, 2012

The Variation Toolkit

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 4:00 pm

The Variation Toolkit by Pierre Lindenbaum.

From the post:

During the last weeks, I’ve worked on an experimental C++ package named The Variation Toolkit (varkit). It was originally designed to provide some command lines equivalent to knime4bio but I’ve added more tools over time. Some of those tools are very simple-and-stupid (fasta2tsv), reinvent the wheel (“numericsplit”), are part of an answer to biostar, are some old tools (e.g. bam2wig) that have been moved to this package, but some others like “samplepersnp”, “groupbygene” might be useful to people.

The package is available at : http://code.google.com/p/variationtoolkit/.

See the post for documentation.

January 5, 2012

Interoperability Driven Integration of Biomedical Data Sources

Interoperability Driven Integration of Biomedical Data Sources by Douglas Teodoro, Rémy Choquet, Daniel Schober, Giovanni Mels, Emilie Pasche, Patrick Ruch, and Christian Lovis.

Abstract:

In this paper, we introduce a data integration methodology that promotes technical, syntactic and semantic interoperability for operational healthcare data sources. ETL processes provide access to different operational databases at the technical level. Furthermore, data instances have their syntax aligned according to biomedical terminologies using natural language processing. Finally, semantic web technologies are used to ensure common meaning and to provide ubiquitous access to the data. The system’s performance and solvability assessments were carried out using clinical questions against seven healthcare institutions distributed across Europe. The architecture managed to provide interoperability within the limited heterogeneous grid of hospitals. Preliminary scalability result tests are provided.

Appears in:

Studies in Health Technology and Informatics
Volume 169, 2011
User Centred Networked Health Care – Proceedings of MIE 2011
Edited by Anne Moen, Stig Kjær Andersen, Jos Aarts, Petter Hurlen
ISBN 978-1-60750-805-2

I have been unable to find a copy online, well, other than the publisher’s copy, at $20 for four pages. I have written to one of the authors requesting a personal use copy as I would like to report back on what it proposes.

January 3, 2012

Topical Classification of Biomedical Research Papers – Details

Filed under: Bioinformatics,Biomedical,Medical Informatics,MeSH,PubMed,Topic Maps — Patrick Durusau @ 5:11 pm

OK, I registered both on the site and for the contest.

From the Task:

Our team has invested a significant amount of time and effort to gather a corpus of documents containing 20,000 journal articles from the PubMed Central open-access subset. Each of those documents was labeled by biomedical experts from PubMed with several MeSH subheadings that can be viewed as different contexts or topics discussed in the text. With the use of our automatic tagging algorithm, which we will describe in detail after completion of the contest, we associated all the documents with the most related MeSH terms (headings). The competition data consists of information about strengths of those bonds, expressed as numerical values. Intuitively, they can be interpreted as values of a rough membership function that measures the degree to which a term is present in a given text. The task for the participants is to devise algorithms capable of accurately predicting MeSH subheadings (topics) assigned by the experts, based on the association strengths of the automatically generated tags. Each document can be labeled with several subheadings and this number is not fixed. In order to ensure that participants who are not familiar with biomedicine, and with the MeSH ontology in particular, have equal chances as domain experts, the names of concepts and topical classifications are removed from the data. Those names and relations between data columns, as well as a dictionary translating decision class identifiers into MeSH subheadings, can be provided on request after completion of the challenge.

Data format: The data set is provided in a tabular form as two tab-separated values files, namely trainingData.csv (the training set) and testData.csv (the test set). They can be downloaded only after a successful registration to the competition. Each row of those data files represents a single document and, in the consecutive columns, it contains integers ranging from 0 to 1000, expressing association strengths to corresponding MeSH terms. Additionally, there is a trainingLables.txt file, whose consecutive rows correspond to entries in the training set (trainingData.csv). Each row of that file is a list of topic identifiers (integers ranging from 1 to 83), separated by commas, which can be regarded as a generalized classification of a journal article. This information is not available for the test set and has to be predicted by participants.

It is worth noting that, due to the nature of the considered problem, the data sets are highly dimensional – the number of columns roughly corresponds to the MeSH ontology size. The data sets are also sparse, since usually only a small fraction of the MeSH terms is assigned to a particular document by our tagging algorithm. Finally, a large number of data columns have few (or even no) non-zero values (the corresponding concepts are rarely assigned to documents). It is up to participants to decide which of them are still useful for the task.
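Based on the format described above, loading the competition files into memory is straightforward. A minimal Python sketch; parsing details such as possible header rows are unverified, so treat it as a starting point only:

```python
import csv

def load_competition_data(features_path, labels_path=None):
    """Load tab-separated rows of integer association strengths (0-1000),
    one row per document, plus an optional labels file with comma-separated
    topic ids (1-83) per row. Assumes no header rows."""
    with open(features_path, newline='') as fh:
        X = [[int(v) for v in row] for row in csv.reader(fh, delimiter='\t')]
    y = None
    if labels_path is not None:
        with open(labels_path) as fh:
            y = [[int(t) for t in line.strip().split(',')]
                 for line in fh if line.strip()]
    return X, y

# X_train, y_train = load_competition_data('trainingData.csv', 'trainingLables.txt')
# X_test, _ = load_competition_data('testData.csv')
```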

I am looking at it as an opportunity to learn a good bit about automatic text classification and what role, if any, topic maps can play in such a scenario.

Suggestions as well as team members are most welcome!

January 2, 2012

Topical Classification of Biomedical Research Papers

Filed under: Bioinformatics,Biomedical,Contest,Medical Informatics,MeSH,PubMed — Patrick Durusau @ 6:36 pm

JRS 2012 Data Mining Competition: Topical Classification of Biomedical Research Papers

From the webpage:

JRS 2012 Data Mining Competition: Topical Classification of Biomedical Research Papers, is a special event of Joint Rough Sets Symposium (JRS 2012, http://sist.swjtu.edu.cn/JRS2012/) that will take place in Chengdu, China, August 17-20, 2012. The task is related to the problem of predicting topical classification of scientific publications in a field of biomedicine. Money prizes worth 1,500 USD will be awarded to the most successful teams. The contest is funded by the organizers of the JRS 2012 conference, Southwest Jiaotong University, with support from University of Warsaw, SYNAT project and TunedIT.

Introduction: Development of freely available biomedical databases allows users to search for documents containing highly specialized biomedical knowledge. Rapidly increasing size of scientific article meta-data and text repositories, such as MEDLINE [1] or PubMed Central (PMC) [2], emphasizes the growing need for accurate and scalable methods for automatic tagging and classification of textual data. For example, medical doctors often search through biomedical documents for information regarding diagnostics, drugs dosage and effect or possible complications resulting from specific treatments. In the queries, they use highly sophisticated terminology, that can be properly interpreted only with a use of a domain ontology, such as Medical Subject Headings (MeSH) [3]. In order to facilitate the searching process, documents in a database should be indexed with concepts from the ontology. Additionally, the search results could be grouped into clusters of documents, that correspond to meaningful topics matching different information needs. Such clusters should not necessarily be disjoint since one document may contain information related to several topics. In this data mining competition, we would like to raise both of the above mentioned problems, i.e. we are interested in identification of efficient algorithms for topical classification of biomedical research papers based on information about concepts from the MeSH ontology, that were automatically assigned by our tagging algorithm. In our opinion, this challenge may be appealing to all members of the Rough Set Community, as well as other data mining practitioners, due to its strong relations to well-founded subjects, such as generalized decision rules induction [4], feature extraction [5], soft and rough computing [6], semantic text mining [7], and scalable classification methods [8]. In order to ensure scientific value of this challenge, each of participating teams will be required to prepare a short report describing their approach. Those reports can be used for further validation of the results. Apart from prizes for top three teams, authors of selected solutions will be invited to prepare a paper for presentation at JRS 2012 special session devoted to the competition. Chosen papers will be published in the conference proceedings.

Data sets became available today.

This is one of those “praxis” opportunities for topic maps.

Using Bio4j + Neo4j Graph-algo component…

Filed under: Bio4j,Bioinformatics,Biomedical,Neo4j — Patrick Durusau @ 3:00 pm

Using Bio4j + Neo4j Graph-algo component for finding protein-protein interaction paths

From the post:

Today I managed to find some time to check out the Graph-algo component from Neo4j and after playing with it plus Bio4j a bit, I have to say it seems pretty cool.

For those who don’t know what I’m talking about, here you have the description you can find in Neo4j wiki:

This is a component that offers implementations of common graph algorithms on top of Neo4j. It is mostly focused around finding paths, like finding the shortest path between two nodes, but it also contains a few different centrality measures, like betweenness centrality for nodes.

The algorithm for finding the shortest path between two nodes caught my attention and I started to wonder how could I give it a try applying it to the data included in Bio4j.

Suggestions of other data sets where shortest path would yield interesting results?

BTW, isn’t the shortest path an artifact of the basis chosen for nearness between nodes? Shortest path expressed as relatedness between gene fragments would be different than shortest path as physical distance. (see: Nearness key in microbe DNA swaps: Proximity trumps relatedness in influencing how often bacteria pick up each other’s genes.)
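For anyone who wants to see the shortest-path idea without a graph database in the way, here is a tiny breadth-first search over a toy, made-up protein-interaction graph in Python. It is not the Neo4j Graph-algo component, just the underlying notion of "nearness as fewest hops":

```python
from collections import deque

def shortest_path(graph, source, target):
    """Breadth-first search over an unweighted graph given as
    {node: set(neighbours)}. Returns one shortest path as a list of nodes,
    or None if the target is unreachable."""
    if source == target:
        return [source]
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in parents:
                parents[neighbour] = node
                if neighbour == target:
                    path = [neighbour]
                    while parents[path[-1]] is not None:
                        path.append(parents[path[-1]])
                    return list(reversed(path))
                queue.append(neighbour)
    return None

# Toy protein-interaction graph (names invented).
ppi = {'P1': {'P2', 'P3'}, 'P2': {'P1', 'P4'}, 'P3': {'P1'},
       'P4': {'P2', 'P5'}, 'P5': {'P4'}}
print(shortest_path(ppi, 'P1', 'P5'))  # ['P1', 'P2', 'P4', 'P5']
```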

December 28, 2011

Pybedtools: a flexible Python library for manipulating genomic datasets and annotations

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 9:35 pm

Pybedtools: a flexible Python library for manipulating genomic datasets and annotations by Ryan K. Dale, Brent S. Pedersen and Aaron R. Quinlan.

Abstract:

Summary: pybedtools is a flexible Python software library for manipulating and exploring genomic datasets in many common formats. It provides an intuitive Python interface that extends upon the popular BEDTools genome arithmetic tools. The library is well documented and efficient, and allows researchers to quickly develop simple, yet powerful scripts that enable complex genomic analyses.

From the documentation:

Formats with different coordinate systems (e.g. BED vs GFF) are handled with uniform, well-defined semantics described in the documentation.

Starting to sound like HyTime, isn't it? Transposition between coordinate systems.

If you venture into this area with a topic map, something to keep in mind.

I first saw this in Christophe Lalanne’s A bag of tweets / Dec 2011.
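A small usage sketch, assuming hypothetical BED/GFF files on disk, to show the flavor of the pybedtools interface; consult the library's documentation for the exact options your version supports:

```python
import pybedtools

# Hypothetical input files; intersect() wraps "bedtools intersect".
reads = pybedtools.BedTool('reads.bed')
genes = pybedtools.BedTool('genes.gff')

# Reads that overlap at least one gene (u=True reports each read once).
overlapping = reads.intersect(genes, u=True)
for interval in overlapping:
    print(interval.chrom, interval.start, interval.end)
```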

December 26, 2011

Galaxy: Data Intensive Biology for Everyone

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 8:20 pm

Galaxy: Data Intensive Biology for Everyone (main site)

Galaxy 101: The first thing you should try (tutorial)

Work through the tutorial, keeping track of where you think subject identity (and any tests you care to suggest) would be useful.

I don’t know but suspect this is representative of what researchers in the field expect in terms of capabilities.

With a little effort I suspect it would make a nice basis to start a conversation about what subject identity could add that would be of interest.

Mondeca helps to bring Electronic Patient Record to reality

Filed under: Biomedical,Data Integration,Health care,Medical Informatics — Patrick Durusau @ 8:13 pm

Mondeca helps to bring Electronic Patient Record to reality

This has been out for a while but I just saw it today.

From the post:

Data interoperability is one of the key issues in assembling unified Electronic Patient Records, both within and across healthcare providers. ASIP Santé, the French national healthcare agency responsible for implementing nation-wide healthcare management systems, has been charged to ensure such interoperability for the French national healthcare.

The task is a daunting one since most healthcare providers use their own custom terminologies and medical codes. This is due to a number of issues with standard terminologies: 1) standard terminologies take too long to be updated with the latest terms; 2) significant internal data, systems, and expertise rely on the usage of legacy custom terminologies; and 3) a part of the business domain is not covered by a standard terminology.

The only way forward was to align the local custom terminologies and codes with the standard ones. This way local data can be automatically converted into the standard representation, which will in turn allow to integrate it with the data coming from other healthcare providers.

I assume the alignment of local custom terminologies is an ongoing process, so that as the local terminologies change, re-alignment occurs as well?

Kudos to Mondeca; they played an active role in the early days of XTM and I suspect that experience has influenced (for the good) their approach to this project.
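As a toy illustration of what a local-to-standard alignment boils down to at the data level (the local codes are invented, the ICD-10 targets are real), consider a crosswalk table plus a lookup; the unmapped cases are exactly where ongoing curation and re-alignment come in:

```python
# Hypothetical local-to-standard code crosswalk.
LOCAL_TO_ICD10 = {
    'CARD-017': 'I21.9',   # local "acute MI" code -> ICD-10
    'CARD-018': 'I50.9',   # local "heart failure, unspecified"
}

def to_standard(local_code, crosswalk=LOCAL_TO_ICD10):
    """Return the standard code for a local one, or None if unmapped;
    unmapped codes are the cases a curation workflow has to catch."""
    return crosswalk.get(local_code)
```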

December 25, 2011

New Entrez Genome

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 6:07 pm

New Entrez Genome Released on November 9, 2011

From the announcement:

Historically the Entrez Genome data model was designed for complete genomes of microorganisms (Archaea, Eubacteria, and Viruses) and a very few eukaryotic genomes such as human, yeast, worm, fly and thale cress (Arabidopsis thaliana). It also included individual complete genomes of organelles and plasmids. Despite the name, the Entrez Genome database record has been a chromosome (or organelle or plasmid) rather than a genome.

The new Genome resource uses a new data model where a single record provides information about the organism (usually a species), its genome structure, available assemblies and annotations, and related genome-scale projects such as transcriptome sequencing, epigenetic studies and variation analysis. As before, the Genome resource represents genomes from all major taxonomic groups: Archaea, Bacteria, Eukaryote, and Viruses. The old Genome database represented only Refseq genomes, while the new resource extends this scope to all genomes either provided by primary submitters (INSDC genomes) or curated by NCBI staff (RefSeq genomes).

The new Genome database shares a close relationship with the recently redesigned BioProject database (formerly Genome Project). Primary information about genome sequencing projects in the new Genome database is stored in the BioProject database. BioProject records of type “Organism Overview” have become Genome records with a Genome ID that maps uniquely to a BioProject ID. The new Genome database also includes all “genome sequencing” records in BioProject.

BTW, just in case you ever wonder about changes in identifiers causing problems:

The new Genome IDs cannot be directly mapped to the old Genome IDs because the data types are very different. Each old Genome ID represented a single sequence that can still be found in Entrez Nucleotide using standard Entrez searches or the E-utilities. We recommend that you convert old Genome IDs to Nucleotide GI numbers using the following remapping file available on the NCBI FTP site:
ftp://ftp.ncbi.nih.gov/genomes/old_genomeID2nucGI
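A small Python sketch of using that remapping file; I am assuming a simple two-column, whitespace-separated layout (old Genome ID, nucleotide GI), so check the actual file before relying on this:

```python
def load_remap(path='old_genomeID2nucGI'):
    """Read the NCBI remapping file into a dict: old Genome ID -> nucleotide GI.
    The two-column layout is an assumption; verify against the downloaded file."""
    mapping = {}
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) >= 2:
                mapping[fields[0]] = fields[1]
    return mapping

# gi = load_remap().get('12345')  # returns the nucleotide GI, if present
```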

The Genome site.

A Compressed Self-Index for Genomic Databases

Filed under: Bioinformatics,Biomedical,Indexing — Patrick Durusau @ 6:07 pm

A Compressed Self-Index for Genomic Databases by Travis Gagie, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi.

Abstract:

Advances in DNA sequencing technology will soon result in databases of thousands of genomes. Within a species, individuals’ genomes are almost exact copies of each other; e.g., any two human genomes are 99.9% the same. Relative Lempel-Ziv (RLZ) compression takes advantage of this property: it stores the first genome uncompressed or as an FM-index, then compresses the other genomes with a variant of LZ77 that copies phrases only from the first genome. RLZ achieves good compression and supports fast random access; in this paper we show how to support fast search as well, thus obtaining an efficient compressed self-index.
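To make the relative-compression idea concrete, here is a deliberately naive Python sketch of a greedy copy-from-reference parse. It is not the authors' index construction (they build on an FM-index and a fast LZ77 variant), just the "encode each genome as copies from the first one" intuition:

```python
def relative_lz_parse(target, reference):
    """Greedy relative parse: encode target as (start, length) copies from
    reference, falling back to a literal character when nothing matches.
    Quadratic-time scan, for illustration only."""
    phrases = []
    i = 0
    while i < len(target):
        best_len, best_start = 0, -1
        for s in range(len(reference)):
            l = 0
            while (i + l < len(target) and s + l < len(reference)
                   and reference[s + l] == target[i + l]):
                l += 1
            if l > best_len:
                best_len, best_start = l, s
        if best_len == 0:
            phrases.append(('literal', target[i]))
            i += 1
        else:
            phrases.append(('copy', best_start, best_len))
            i += best_len
    return phrases

print(relative_lz_parse('ACGTTACGA', 'ACGTACGT'))
```

Because two genomes of the same species are nearly identical, almost the whole target collapses into a handful of long copy phrases, which is why the approach compresses so well.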

As the authors note, an area with rapidly increasing need for efficient effective indexing.

It would be a step forward to see a comparison of this method with other indexing techniques on a common genome data set.

I suppose I am presuming a common genome data set for indexing demonstrations.

Questions:

  • Is there a common genome data set for comparison of indexing techniques?
  • Are there other indexing techniques that should be included in a comparison?

Obviously important for topic maps used in genome projects.

But insights about identifying, as different subjects, things that vary only slightly in one (or more) dimensions will be useful in other contexts.

An easy example would be isotopes. Let’s see, ah, celestial or other coordinate systems. Don’t know but would guess that spectra from stars/galaxies are largely common. (Do you know for sure?) What other data sets have subjects that are identified on the basis of small or incremental changes in a largely identical identifier?

A Faster Grammar-Based Self-Index

Filed under: Bioinformatics,Biomedical,Indexing — Patrick Durusau @ 6:06 pm

A Faster Grammar-Based Self-Index by Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, Simon J. Puglisi.

Abstract:

To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on straight-line programs and LZ77. In this paper we show how, given a balanced straight-line program for a string S[1..n] whose LZ77 parse consists of z phrases, we can add O(z log log z) words and obtain a compressed self-index for S such that, given a pattern P[1..m], we can list the occ occurrences of P in S in O(m² + (m + occ) log log n) time. All previous self-indexes are either larger or slower in the worst case.

Updated version of the paper I covered at: A Faster LZ77-Based Index.

In a very real sense, indexing is fundamental to information retrieval. That is to say that when information is placed in electronic storage, the only way to retrieve it is via indexing. The index may be one to one with a memory location and hence not terribly efficient, but the fact remains that an index is part of every information retrieval transaction.

December 20, 2011

Standard Measures in Genomic Studies

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 8:25 pm

Standard Measures in Genomic Studies

This news story caught my eye and leads off simply enough:

Standards can make our lives better. We have standards for manufacturing many items — from car parts to nuts and bolts — that improve the reliability and compatibility of all sorts of widgets that we use in our daily lives. Without them, many tasks would be difficult, a bit like trying to fit a square peg into a round hole. There are even standard measures to collect information from participants of large population genomic studies that can be downloaded for free from the Consensus Measures for Phenotype and eXposures (PhenX) Toolkit [phenxtoolkit.org]. However, researchers will only adopt such standard measures if they can be used easily.

That is why the NHGRI’s Office of Population Genomics has launched a new effort called the PhenX Real-world Implementation and Sharing (PhenX RISING) program. The National Human Genome Research Institute (NHGRI) has awarded nearly $900,000, with an additional $100,000 from NIH Office of Behavioral and Social Sciences Research (OBSSR), to seven investigators to use and evaluate the standards. Each investigator will incorporate a variety of PhenX measures into their ongoing genome-wide association or large population study. These researchers will also make recommendations as to how to fine-tune the PhenX Toolkit.

OK, good for them, or at least the researchers who get the grants, but what does that have to do with topic maps?

Just a bit further the announcement says:

GWAS have identified more than a thousand associations between genetic variants and common diseases such as cancer and heart disease, but the majority of the studies do not share standard measures. PhenX standard measures are important because they allow researchers to more easily combine data from different studies to see if there are overlapping genetic factors between or among different diseases. This ability will improve researchers’ understanding of disease and may eventually be used to assess a patient’s genetic risk of getting a disease such as diabetes or cancer and to customize treatment.

OK, so there are existing studies that don’t share standard measures, there will be more studies while the PhenX RISING program goes on that don’t share standard measures and there may be future studies while PhenX RISING is being adjusted that don’t share standard measures.

Depending upon the nature of the measures that are not shared and the importance of mapping between these non-shared standards, this sounds like fertile ground for topic map prospecting.

December 19, 2011

Visions of a semantic molecular future

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 8:10 pm

Visions of a semantic molecular future

I already have a post on the Journal of Cheminformatics but this looked like it needed a separate post.

This thematic issue arose from a symposium held in the Unilever Centre [for Molecular Science Informatics, Department of Chemistry, University of Cambridge] on 2011-01-15/17 to celebrate the career of Peter Murray-Rust. From the programme:

This symposium addresses the creativity of the maturing Semantic Web to the unrealized potential of Molecular Science. The world is changing and we are in the middle of many revolutions: Cloud computing; the Semantic Web; the Fourth Paradigm (data-driven science); web democracy; weak AI; pervasive devices; citizen science; Open Knowledge. Technologies can develop in months to a level where individuals and small groups can change the world. However science is hamstrung by archaic approaches to the publication, redistribution and re-use of information and much of the vision is (just) out of reach. Social, as well as technical, advances are required to realize the full potential. We’ve asked leading scientists to let their imagination explore the possible and show us how to get there.

This is a starting point for all of us – the potential of working with the virtual world of scientists and citizens, coordinated through organizations such as the Open Knowledge Foundation and continuing connection with the Cambridge academic community makes this one of the central points for my future.

The pages in this document represent vibrant communities of practice which are growing and are offered to the world as contributions to a semantic molecular future.

We have combined talks from the symposium with work from the Murray-Rust group into 15 articles.

Quickly, just a couple of the articles with abstracts to get you interested:

“Openness as infrastructure”
John Wilbanks Journal of Cheminformatics 2011, 3:36 (14 October 2011)

The advent of open access to peer reviewed scholarly literature in the biomedical sciences creates the opening to examine scholarship in general, and chemistry in particular, to see where and how novel forms of network technology can accelerate the scientific method. This paper examines broad trends in information access and openness with an eye towards their applications in chemistry.

“Open Bibliography for Science, Technology, and Medicine”
Richard Jones, Mark MacGillivray, Peter Murray-Rust, Jim Pitman, Peter Sefton, Ben O’Steen, William Waites Journal of Cheminformatics 2011, 3:47 (14 October 2011)

The concept of Open Bibliography in science, technology and medicine (STM) is introduced as a combination of Open Source tools, Open specifications and Open bibliographic data. An Openly searchable and navigable network of bibliographic information and associated knowledge representations, a Bibliographic Knowledge Network, across all branches of Science, Technology and Medicine, has been designed and initiated. For this large scale endeavour, the engagement and cooperation of the multiple stakeholders in STM publishing – authors, librarians, publishers and administrators – is sought.

It should be interesting when generally realized that the information people have hoarded over the years isn’t important. It is the human mind that perceives, manipulates, and draws conclusions from information that gives it any value at all.

OpenHelix

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 8:10 pm

OpenHelix

From the about page:

More efficient use of the most relevant resources means quicker and more effective research. OpenHelix empowers researchers by

  • providing a search portal to find the most relevant genomics resources and training on those resources.
  • distributing extensive and effective tutorials and training materials on the most powerful and popular genomics resources.
  • contracting with resource providers to provide comprehensive, long-term training and outreach programs.

If you are interested in learning the field of genomics research, other than by returning to graduate/medical school, this site will figure high on your list of resources.

It offers a very impressive gathering of both commercial and non-commercial resources under one roof.

I haven’t taken any of the tutorials produced by OpenHelix and so would appreciate comments from anyone who has.

Bioinformatics is an important subject area for topic maps for several reasons:

First, the long-standing (comparatively speaking) interest in, and actual use of, computers in biology indicates there is a need for information for which other people will spend money. There is a key phrase in that sentence: “…for which other people will spend money.” You are already spending your time working on topic maps, so it is important to identify other people who are willing to part with cash for your software or assistance. Bioinformatics is a field where that is already known to happen; other people spend their money on software or expertise.

Second, for all of the progress on identification issues in bioinformatics, any bioinformatics journal you pick up will have references to the need for greater integration of biological resources. There is plenty of opportunity now and, as far as anyone can tell, for many tomorrows to follow.

Third, for good or ill, any progress in the field attracts a disproportionate amount of coverage. The public rarely reads or sees coverage of discoveries turning out to be less than what was initially reported. And it is not only health professionals who hear such news, so it would be good PR for topic maps.

Journal of Biomedical Semantics

Filed under: Bioinformatics,Biomedical,Semantics — Patrick Durusau @ 8:10 pm

Journal of Biomedical Semantics

From Aims and Scope:

Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain. The scope of the journal covers two main areas:

Infrastructure for biomedical semantics: focusing on semantic resources and repositories, meta-data management and resource description, knowledge representation and semantic frameworks, the Biomedical Semantic Web, and semantic interoperability.

Semantic mining, annotation, and analysis: focusing on approaches and applications of semantic resources; and tools for investigation, reasoning, prediction, and discoveries in biomedicine.

Research in biology and biomedicine relies on various types of biomedical data, information, and knowledge, represented in databases with experimental and/or curated data, ontologies, literature, taxonomies, and so on. Semantics is essential for accessing, integrating, and analyzing such data. The ability to explicitly extract, assign, and manage semantic representations is crucial for making computational approaches in the biomedical domain productive for a large user community.

Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain, and comprises practical and theoretical advances in biomedical semantics research with implications for data analysis.

In recent years, the availability and use of electronic resources representing biomedical knowledge has greatly increased, covering ontologies, taxonomies, literature, databases, and bioinformatics services. These electronic resources contribute to advances in the biomedical domain and require interoperability between them through various semantic descriptors. In addition, the availability and integration of semantic resources is a key part in facilitating semantic web approaches for life sciences leading into reasoning and other advanced ways to analyse biomedical data.

Random items to whet your appetite:

The 2nd DBCLS BioHackathon: interoperable bioinformatics Web services for integrated applications
Toshiaki Katayama, Mark D Wilkinson, Rutger Vos, Takeshi Kawashima, Shuichi Kawashima, Mitsuteru Nakao, Yasunori Yamamoto, Hong-Woo Chun, Atsuko Yamaguchi, Shin Kawano, Jan Aerts, Kiyoko F Aoki-Kinoshita, Kazuharu Arakawa, Bruno Aranda, Raoul JP Bonnal, José M Fernández, Takatomo Fujisawa, Paul MK Gordon, Naohisa Goto, Syed Haider, Todd Harris, Takashi Hatakeyama, Isaac Ho, Masumi Itoh, Arek Kasprzyk, Nobuhiro Kido, Young-Joo Kim, Akira R Kinjo, Fumikazu Konishi, Yulia Kovarskaya Journal of Biomedical Semantics 2011, 2:4 (2 August 2011)

Simple tricks for improving pattern-based information extraction from the biomedical literature
Quang Nguyen, Domonkos Tikk, Ulf Leser Journal of Biomedical Semantics 2010, 1:9 (24 September 2010)

Rewriting and suppressing UMLS terms for improved biomedical term identification
Kristina M Hettne, Erik M van Mulligen, Martijn J Schuemie, Bob JA Schijvenaars, Jan A Kors Journal of Biomedical Semantics 2010, 1:5 (31 March 2010)

December 17, 2011

Broad Institute

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 6:30 am

Broad Institute

In their own words:

The Eli and Edythe L. Broad Institute of Harvard and MIT is founded on two core beliefs:

  1. This generation has a historic opportunity and responsibility to transform medicine by using systematic approaches in the biological sciences to dramatically accelerate the understanding and treatment of disease.
  2. To fulfill this mission, we need new kinds of research institutions, with a deeply collaborative spirit across disciplines and organizations, and having the capacity to tackle ambitious challenges.

The Broad Institute is essentially an “experiment” in a new way of doing science, empowering this generation of researchers to:

  • Act nimbly. Encouraging creativity often means moving quickly, and taking risks on new approaches and structures that often defy conventional wisdom.
  • Work boldly. Meeting the biomedical challenges of this generation requires the capacity to mount projects at any scale — from a single individual to teams of hundreds of scientists.
  • Share openly. Seizing scientific opportunities requires creating methods, tools and massive data sets — and making them available to the entire scientific community to rapidly accelerate biomedical advancement.
  • Reach globally. Biomedicine should address the medical challenges of the entire world, not just advanced economies, and include scientists in developing countries as equal partners whose knowledge and experience are critical to driving progress.

The Detecting Novel Associations in Large Data Sets software and data is from the Broad Institute.

Sounds like the sort of place that would be interested in enhancing research and sharing of information with topic maps.

December 12, 2011

NLM Plus

Filed under: Bioinformatics,Biomedical,Search Algorithms,Search Engines — Patrick Durusau @ 10:22 pm

NLM Plus

From the webpage:

NLMplus is an award-winning Semantic Search Engine and Biomedical Knowledge Base application that showcases a variety of natural language processing tools to provide an improved level of access to the vast collection of biomedical data and services of the National Library of Medicine.

Utilizing its proprietary Web Knowledge Base, WebLib LLC can apply the universal search and semantic technology solutions demonstrated by NLMplus to libraries, businesses, and research organizations in all domains of science and technology and Web applications

Any medical librarians in the audience? Or ones you can forward this post to?

Curious what professional researchers make of NLM Plus? I don’t have the domain expertise to evaluate it.

Thanks!

December 5, 2011

Medical Text Indexer (MTI)

Filed under: Bioinformatics,Biomedical,Indexing — Patrick Durusau @ 7:42 pm

Medical Text Indexer (MTI) (formerly the Indexing Initiative System (IIS))

From the webpage:

The MTI system consists of software for applying alternative methods of discovering MeSH headings for citation titles and abstracts and then combining them into an ordered list of recommended indexing terms. The top portion of the diagram consists of three paths, or methods, for creating a list of recommended indexing terms: MetaMap Indexing, Trigrams and PubMed Related Citations. The first two paths actually compute UMLS Metathesaurus® concepts which are passed to the Restrict to MeSH process. The results from each path are weighted and combined using the Clustering process. The system is highly parameterized not only by path weights but also by several parameters specific to the Restrict to MeSH and Clustering processes.

A prototype MTI system described below had two additional indexing methods which were removed because their results were subsumed by the three remaining methods.

Deeply interesting and relevant work to topic maps.
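The "weight and combine the paths" step can be pictured with a few lines of Python. MTI's actual clustering and parameterization are more elaborate; this only sketches the merge of scored term lists coming from the three methods:

```python
from collections import defaultdict

def combine_recommendations(paths, weights):
    """paths: {method_name: {mesh_term: score}}; weights: {method_name: w}.
    Returns MeSH terms ordered by their weighted, combined score."""
    combined = defaultdict(float)
    for method, terms in paths.items():
        w = weights.get(method, 1.0)
        for term, score in terms.items():
            combined[term] += w * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Toy example with made-up scores from the three paths.
paths = {'metamap': {'Neoplasms': 0.9, 'Apoptosis': 0.4},
         'trigrams': {'Neoplasms': 0.6},
         'related_citations': {'Apoptosis': 0.7, 'Neoplasms': 0.3}}
print(combine_recommendations(paths, {'metamap': 2.0, 'trigrams': 1.0,
                                      'related_citations': 1.5}))
```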

MetaMap Portal

Filed under: Bioinformatics,Biomedical,MetaMap,Metathesaurus — Patrick Durusau @ 7:41 pm

MetaMap Portal

About MetaMap:

MetaMap is a highly configurable program developed by Dr. Alan (Lan) Aronson at the National Library of Medicine (NLM) to map biomedical text to the UMLS Metathesaurus or, equivalently, to discover Metathesaurus concepts referred to in text. MetaMap uses a knowledge-intensive approach based on symbolic, natural-language processing (NLP) and computational-linguistic techniques. Besides being applied for both IR and data-mining applications, MetaMap is one of the foundations of NLM’s Medical Text Indexer (MTI) which is being used for both semiautomatic and fully automatic indexing of biomedical literature at NLM. For more information on MetaMap and related research, see the SKR Research Information Site.

Improvement in the October 2011 Release:

MetaMap2011 includes some significant enhancements, most notably algorithmic improvements that enable MetaMap to very quickly process input text that had previously been computationally intractable.

These enhancements include:

  • Algorithmic Improvements
  • Candidate Set Pruning
  • Re-Organization of Additional Data Models
  • Single-character alphabetic tokens
  • Improved Treatment of Apostrophe-“s”
  • New XML Command-Line Options
  • Numbered Mappings
  • User-Defined Acronyms and Abbreviations

Starting with MetaMap 2011, MetaMap is now available for Windows XP and Windows 7.

One of several projects that sound very close to being topic map mining programs.
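MetaMap's pipeline (parsing, variant generation, candidate scoring, mapping) is far more sophisticated than anything that fits in a blog post, but a toy longest-match dictionary lookup conveys the basic "text span to concept identifier" move. The sample terms and CUIs below are for illustration only:

```python
def map_concepts(text, thesaurus):
    """Toy longest-match lookup of thesaurus phrases in free text.
    Not MetaMap's algorithm; punctuation and word variants are ignored."""
    tokens = text.lower().split()
    found, i = [], 0
    while i < len(tokens):
        match = None
        for length in range(min(4, len(tokens) - i), 0, -1):  # longest first
            phrase = ' '.join(tokens[i:i + length])
            if phrase in thesaurus:
                match = (phrase, thesaurus[phrase], length)
                break
        if match:
            found.append((match[0], match[1]))
            i += match[2]
        else:
            i += 1
    return found

thesaurus = {'myocardial infarction': 'C0027051', 'aspirin': 'C0004057'}  # sample CUIs
print(map_concepts('Aspirin after myocardial infarction reduces mortality', thesaurus))
```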

December 4, 2011

FACTA

Filed under: Associations,Bioinformatics,Biomedical,Concept Detection,Text Analytics — Patrick Durusau @ 8:16 pm

FACTA – Finding Associated Concepts with Text Analysis

From the Quick Start Guide:

FACTA is a simple text mining tool to help discover associations between biomedical concepts mentioned in MEDLINE articles. You can navigate these associations and their corresponding articles in a highly interactive manner. The system accepts an arbitrary query term and displays relevant concepts on the spot. A broad range of concepts are retrieved by the use of large-scale biomedical dictionaries containing the names of important concepts such as genes, proteins, diseases, and chemical compounds.

A very good example of an exploration tool that isn’t overly complex to use.
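The raw signal behind "associated concepts" is co-occurrence counting. A minimal Python sketch, assuming each abstract has already been reduced to a set of dictionary-tagged concept names (the examples are made up):

```python
from collections import Counter
from itertools import combinations

def concept_cooccurrence(documents):
    """documents: iterable of sets of concept names found in each abstract.
    Returns a Counter of unordered concept pairs by co-occurrence count."""
    pairs = Counter()
    for concepts in documents:
        for a, b in combinations(sorted(concepts), 2):
            pairs[(a, b)] += 1
    return pairs

docs = [{'BRCA1', 'breast cancer'},
        {'BRCA1', 'breast cancer', 'tamoxifen'},
        {'tamoxifen', 'breast cancer'}]
print(concept_cooccurrence(docs).most_common(3))
```

FACTA adds large biomedical dictionaries, indexing over all of MEDLINE and interactive navigation on top of this kind of signal.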

December 3, 2011

How to Execute the Research Paper

Filed under: Annotation,Biomedical,Dynamic Updating,Linked Data,RDF — Patrick Durusau @ 8:21 pm

How to Execute the Research Paper by Anita de Waard.

I had to create the category, “dynamic updating,” to at least partially capture what Anita describes in this presentation. I would have loved to be present to see it in person!

The gist of the presentation is that we need to create mechanisms to support research papers being dynamically linked to the literature and other resources. One example that Anita uses is linking a patient’s medical records to reports in literature with professional tools for the diagnostician.

It isn’t clear how Linked Data (no matter how generously described by Jeni Tennison) could be the second technology for making research papers linked to other data. In part because, as Jeni points out, URIs are simply more names for some subject. We don’t know if that name is for the resource or something the resource represents. That makes reliable linking rather difficult.

BTW, the web lost its ability to grow in a “gradual and sustainable way” when RDF/Linked Data introduced the notion that URIs cannot be allowed to fail. If you try to reason based on something that fails, the reasoner falls on its side. Not nearly as robust as allowing semantic 404’s.

Anita’s third step, an integrated workflow, is certainly the goal toward which we should be striving. I am less convinced that the mechanisms, such as generating linked data stores in addition to the documents we already have, are the way forward. For documents, for instance, why do we need to repeat data they already possess? Why can’t documents represent their contents themselves? Oh, because that isn’t how Linked Data/RDF stores work.

Still, I would highly recommend this slide deck and that you catch any presentation by Anita that you can.

December 2, 2011

Toolset for Genomic Analysis, Data Management

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 4:56 pm

Toolset for Genomic Analysis, Data Management

From the post:

The informatics group at the Genome Institute at Washington University School of Medicine has released an integrated analysis and information-management system called the Genome Modeling System.

The system borrows concepts from traditional laboratory information-management systems — such as tracking methods and data-access interfaces — and applies them to genomic analysis. The result is a standardized system that integrates both analysis and management capabilities, David Dooling, the assistant director of informatics at Wash U and one of the developers of GMS, explained to BioInform.

Not exactly. The tools that will make up the “Genome Modeling System” have been released but melding them into the full package is something we will see near the end of this year. (later in the article)

I remember the WU-FTPD software before it fell into disrepute so I have great expectations for this software. I will keep watch and post a note when it appears for download.

openSNP

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 4:51 pm

openSNP

Don’t recognize the name? I didn’t either when I came across it on Genoweb under the title Battle Over.

Then I read the homepage blurb:

openSNP allows customers of direct-to-customer genetic tests to publish their test results, find others with similar genetic variations, learn more about their results, find the latest primary literature on their variations and help scientists to find new associations.

I think we will be hearing more about openSNP in the not too distant future.

Sounds like a useful place to talk about topic maps, but in terms of their semantic impedances and their identifiers for subjects.

Hard to sell a product if we are fixing a “problem” that no one sees as a “problem.”

November 28, 2011

Interesting papers coming up at NIPS’11

Filed under: Biomedical,Conferences,Neural Networks — Patrick Durusau @ 7:13 pm

Interesting papers coming up at NIPS’11

Yaroslav Bulatov has tracked down papers that have been accepted for NIPS’11. Not abstracts or summaries but the actual papers.

Well worth a visit to take advantage of his efforts.

While looking at the NIPS’11 site (will post that tomorrow) I ran across a paper on a proposal for a “…array/matrix/n-dimensional base object implementations for GPUs.” Will post that tomorrow as well.

November 18, 2011

Deja vu: a Database of Highly Similar Citations

Filed under: Bioinformatics,Biomedical,Deja vu — Patrick Durusau @ 9:37 pm

Deja vu: a Database of Highly Similar Citations

From the webpage:

Deja vu is a database of extremely similar Medline citations. Many, but not all, of which contain instances of duplicate publication and potential plagiarism. Deja vu is a dynamic resource for the community, with manual curation ongoing continuously, and we welcome input and comments.

In the scientific research community plagiarism and multiple publications of the same data are considered unacceptable practices and can result in tremendous misunderstanding and waste of time and energy. Our peers and the public have high expectations for the performance and behavior of scientists during the execution and reporting of research. With little chance for discovery and decreasing budgets, yet sustained pressure to publish, or without a clear understanding of acceptable publication practices, the unethical practices of duplicate publication and plagiarism can be enticing to some. Until now, discovery has been through serendipity alone, so these practices have largely gone unchecked.

The application of text similarity searching can robustly detect highly similar text records, offering a new tool for ensuring integrity in scientific publications. Deja vu is a database of computationally identified, manually confirmed highly similar citations (abstracts and titles), as well as user provided commentary and evidence to affirm or deny a given document’s putative categorization. It is available via the web and to other database curators for tagging of their indexed articles. The availability of a search tool, eTBLAST, by which journal submissions can be compared to existing databases to identify potential duplicate citations and intercept them before they are published, and this database of highly similar citations (or exhaustive searching and tagging within Medline and other databases) could be deterrents to this questionable scientific behavior and excellent examples of citations that are highly similar but represent very distinct research publications.

I would broaden the statement:

multiple publications of the same data are considered unacceptable practices and can result in tremendous misunderstanding and waste of time and energy.

to include repeating the same analysis or discoveries out of sheer ignorance of prior work.

Not as an ethical issue but one of “…waste of time and energy.”

Given the semantic diversity in all fields, work is repeated simply due to “tribes” as Jack Park calls them, using different terminology.

Will be using Deja vu to explore topics in *informatics, to discover related materials.

If you are already using Deja vu that way, your experience, observations, comments would be deeply appreciated.
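For the curious, the kind of similarity scoring that drives tools like eTBLAST can be approximated, very crudely, with a bag-of-words cosine over titles and abstracts. A minimal Python sketch, not the actual eTBLAST algorithm:

```python
import math
import re
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity between two abstracts."""
    vec_a = Counter(re.findall(r'[a-z]+', text_a.lower()))
    vec_b = Counter(re.findall(r'[a-z]+', text_b.lower()))
    dot = sum(vec_a[w] * vec_b[w] for w in set(vec_a) & set(vec_b))
    norm = math.sqrt(sum(v * v for v in vec_a.values())) * \
           math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / norm if norm else 0.0

# Pairs scoring above some threshold (say 0.8) would be candidates for manual review.
```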
