Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 2, 2013

Big Data and Healthcare Infographic

Filed under: BigData,Health care,Medical Informatics — Patrick Durusau @ 3:09 pm

Big Data and Healthcare Infographic by Shar Steed.

From the post:

Big Data could revolutionize healthcare by replacing up to 80% of what doctors do while still maintaining over 91% accuracy. Please take a look at the infographic below to learn more.

An interesting graphic, even if I don’t buy the line that computers are better than doctors at:

Integrating and balancing considerations of patient symptoms, history, demeanor, environmental factors, and population management guidelines.

Note that the next graphic block, citing a 91% accuracy rate using a “diagnostic knowledge system,” doesn’t say what sort of “clinical trials” were used.

Makes a difference if we are talking brain surgery or differential diagnosis versus seeing patients in an out-patient clinic.

Still, an interesting graphic.

Curious where you see semantic integration issues, large or small, in this graphic?

January 29, 2013

Bad News From UK: … brows up, breasts down

Filed under: Data,Dataset,Humor,Medical Informatics — Patrick Durusau @ 6:51 pm

UK plastic surgery statistics 2012: brows up, breasts down by Ami Sedghi.

From the post:

Despite a recession and the government launching a review into cosmetic surgery following the breast implant scandal, plastic surgery procedures in the UK were up last year.

A total of 43,172 surgical procedures were carried out in 2012 according to the British Association of Aesthetic Plastic Surgeons (BAAPS), an increase of 0.2% on the previous year. Although there wasn’t a big change for overall procedures, anti-ageing treatments such as eyelid surgery and face lifts saw double digit increases.

Breast augmentation (otherwise known as ‘boob jobs’) were still the most popular procedure overall although the numbers dropped by 1.6% from 2011 to 2012. Last year’s stats took no account of the breast implant scandal so this is the first release of figures from BAAPS to suggest what impact the scandal has had on the popular procedure.

Just for comparison purposes:

Country   Procedures   Population    Percent of Population Treated
UK        43,172       62,641,000    0.069%
US        9,200,000    313,914,000   2.93%
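
The percentages above can be recomputed directly from the raw counts; a quick sketch:

```python
# Recompute the per-capita rates from the raw counts in the table above.
rows = {
    "UK": (43_172, 62_641_000),
    "US": (9_200_000, 313_914_000),
}

for country, (procedures, population) in rows.items():
    pct = 100 * procedures / population
    print(f"{country}: {pct:.3f}% of the population treated")
# UK: 0.069% of the population treated
# US: 2.931% of the population treated
```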

Perhaps beauty isn’t one of the claimed advantages of socialized medicine?

January 22, 2013

BioNLP-ST 2013

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 2:42 pm

BioNLP-ST 2013

Dates:

Training Data Release 12:00 IDLW, 17 Jan. 2013
Test Data Release 22 Mar. 2013
Result Submission 29 Mar. 2013
BioNLP-ST 2013 Workshop 8-9 Aug. 2013

From the website:

The BioNLP Shared Task (BioNLP-ST) series represents a community-wide trend in text-mining for biology toward fine-grained information extraction (IE). The two previous events, BioNLP-ST 2009 and 2011, attracted wide attention, with over 30 teams submitting final results. The tasks and their data have since served as the basis of numerous studies, released event extraction systems, and published datasets. The upcoming BioNLP-ST 2013 follows the general outline and goals of the previous tasks. It identifies biologically relevant extraction targets and proposes a linguistically motivated approach to event representation. The tasks in BioNLP-ST 2013 cover many new hot topics in biology that are close to biologists’ needs. BioNLP-ST 2013 broadens the scope of the text-mining application domains in biology by introducing new issues on cancer genetics and pathway curation. It also builds on the well-known previous datasets GENIA, LLL/BI and BB to propose more realistic tasks than considered previously, closer to the actual needs of biological data integration.

The first event in 2009 triggered active research in the community on a specific fine-grained IE task. Expanding on this, the second BioNLP-ST was organized under the theme “Generalization”, which was well received by participants, who introduced numerous systems that could be straightforwardly applied to multiple tasks. This time, the BioNLP-ST takes a step further and pursues the grand theme of “Knowledge base construction”, which is addressed in various ways: semantic web (GE, GRO), pathways (PC), molecular mechanisms of cancer (CG), regulation networks (GRN) and ontology population (GRO, BB).

As in previous events, manually annotated data will be provided for training, development and evaluation of information extraction methods. According to their relevance for biological studies, the annotations are either bound to specific expressions in the text or represented as structured knowledge. Many tools for the detailed evaluation and graphical visualization of annotations and system outputs will be available for participants. Support in performing linguistic processing will be provided to the participants in the form of analyses created by various state-of-the-art tools on the dataset texts.

Participation in the task will be open to academia, industry, and all other interested parties.

Tasks:

Quick question: Do you think there is semantically diverse data available for each of these tasks?

I first saw this at: BioNLP Shared Task: Text Mining for Biology Competition.

January 8, 2013

PLOS Computational Biology: Translational Bioinformatics

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 11:42 am

PLOS Computational Biology: Translational Bioinformatics. Maricel Kann, Guest Editor, and Fran Lewitter, PLOS Computational Biology Education Editor.

Following up on the collection in which Biomedical Knowledge Integration appears, I found:

Introduction to Translational Bioinformatics Collection by Russ B. Altman. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002796

Chapter 1: Biomedical Knowledge Integration by Philip R. O. Payne. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002826

Chapter 2: Data-Driven View of Disease Biology by Casey S. Greene and Olga G. Troyanskaya. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002816

Chapter 3: Small Molecules and Disease by David S. Wishart. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002805

Chapter 4: Protein Interactions and Disease by Mileidy W. Gonzalez and Maricel G. Kann. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002819

Chapter 5: Network Biology Approach to Complex Diseases by Dong-Yeon Cho, Yoo-Ah Kim and Teresa M. Przytycka. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002820

Chapter 6: Structural Variation and Medical Genomics by Benjamin J. Raphael. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002821

Chapter 7: Pharmacogenomics by Konrad J. Karczewski, Roxana Daneshjou and Russ B. Altman. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002817

Chapter 8: Biological Knowledge Assembly and Interpretation by Han Kim. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002858

Chapter 9: Analyses Using Disease Ontologies by Nigam H. Shah, Tyler Cole and Mark A. Musen. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002827

Chapter 10: Mining Genome-Wide Genetic Markers by Xiang Zhang, Shunping Huang, Zhaojun Zhang and Wei Wang. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002828

Chapter 11: Genome-Wide Association Studies by William S. Bush and Jason H. Moore. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002822

Chapter 12: Human Microbiome Analysis by Xochitl C. Morgan and Curtis Huttenhower. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002808

Chapter 13: Mining Electronic Health Records in the Genomics Era by Joshua C. Denny. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002823

Chapter 14: Cancer Genome Analysis by Miguel Vazquez, Victor de la Torre and Alfonso Valencia. PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002824

An example of scholarship at its best!

Biomedical Knowledge Integration

Filed under: Bioinformatics,Biomedical,Data Integration,Medical Informatics — Patrick Durusau @ 11:41 am

Biomedical Knowledge Integration by Philip R. O. Payne.

Abstract:

The modern biomedical research and healthcare delivery domains have seen an unparalleled increase in the rate of innovation and novel technologies over the past several decades. Catalyzed by paradigm-shifting public and private programs focusing upon the formation and delivery of genomic and personalized medicine, the need for high-throughput and integrative approaches to the collection, management, and analysis of heterogeneous data sets has become imperative. This need is particularly pressing in the translational bioinformatics domain, where many fundamental research questions require the integration of large scale, multi-dimensional clinical phenotype and bio-molecular data sets. Modern biomedical informatics theory and practice has demonstrated the distinct benefits associated with the use of knowledge-based systems in such contexts. A knowledge-based system can be defined as an intelligent agent that employs a computationally tractable knowledge base or repository in order to reason upon data in a targeted domain and reproduce expert performance relative to such reasoning operations. The ultimate goal of the design and use of such agents is to increase the reproducibility, scalability, and accessibility of complex reasoning tasks. Examples of the application of knowledge-based systems in biomedicine span a broad spectrum, from the execution of clinical decision support, to epidemiologic surveillance of public data sets for the purposes of detecting emerging infectious diseases, to the discovery of novel hypotheses in large-scale research data sets. In this chapter, we will review the basic theoretical frameworks that define core knowledge types and reasoning operations with particular emphasis on the applicability of such conceptual models within the biomedical domain, and then go on to introduce a number of prototypical data integration requirements and patterns relevant to the conduct of translational bioinformatics that can be addressed via the design and use of knowledge-based systems.

A chapter in “Translational Bioinformatics” collection for PLOS Computational Biology.

A very good survey of the knowledge integration area, which alas does not include topic maps. 🙁

Well, but it does include biomedicine-specific use cases at the end of the chapter.

Thinking those would be good cases to illustrate the use of topic maps for biomedical knowledge integration.

Yes?

January 5, 2013

Semantically enabling a genome-wide association study database

Filed under: Bioinformatics,Biomedical,Genomics,Medical Informatics,Ontology — Patrick Durusau @ 2:20 pm

Semantically enabling a genome-wide association study database by Tim Beck, Robert C Free, Gudmundur A Thorisson and Anthony J Brookes. Journal of Biomedical Semantics 2012, 3:9 doi:10.1186/2041-1480-3-9.

Abstract:

Background

The amount of data generated from genome-wide association studies (GWAS) has grown rapidly, but considerations for GWAS phenotype data reuse and interchange have not kept pace. This impacts on the work of GWAS Central — a free and open access resource for the advanced querying and comparison of summary-level genetic association data. The benefits of employing ontologies for standardising and structuring data are widely accepted. The complex spectrum of observed human phenotypes (and traits), and the requirement for cross-species phenotype comparisons, calls for reflection on the most appropriate solution for the organisation of human phenotype data. The Semantic Web provides standards for the possibility of further integration of GWAS data and the ability to contribute to the web of Linked Data.

Results

A pragmatic consideration when applying phenotype ontologies to GWAS data is the ability to retrieve all data, at the most granular level possible, from querying a single ontology graph. We found the Medical Subject Headings (MeSH) terminology suitable for describing all traits (diseases and medical signs and symptoms) at various levels of granularity and the Human Phenotype Ontology (HPO) most suitable for describing phenotypic abnormalities (medical signs and symptoms) at the most granular level. Diseases within MeSH are mapped to HPO to infer the phenotypic abnormalities associated with diseases. Building on the rich semantic phenotype annotation layer, we are able to make cross-species phenotype comparisons and publish a core subset of GWAS data as RDF nanopublications.

Conclusions

We present a methodology for applying phenotype annotations to a comprehensive genome-wide association dataset and for ensuring compatibility with the Semantic Web. The annotations are used to assist with cross-species genotype and phenotype comparisons. However, further processing and deconstructions of terms may be required to facilitate automatic phenotype comparisons. The provision of GWAS nanopublications enables a new dimension for exploring GWAS data, by way of intrinsic links to related data resources within the Linked Data web. The value of such annotation and integration will grow as more biomedical resources adopt the standards of the Semantic Web.
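
A minimal sketch of the kind of annotation layer described above, using rdflib; the MeSH and HPO identifiers are intended to denote hypertension, and skos:closeMatch is my stand-in for whatever mapping property GWAS Central actually uses:

```python
# Sketch: map a MeSH disease term to an HPO phenotype term as RDF triples.
from rdflib import Graph, Namespace

MESH = Namespace("http://id.nlm.nih.gov/mesh/")
OBO = Namespace("http://purl.obolibrary.org/obo/")
SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

g = Graph()
g.bind("mesh", MESH)
g.bind("obo", OBO)
g.bind("skos", SKOS)

# MeSH "Hypertension" mapped to the corresponding HPO phenotypic abnormality.
g.add((MESH.D006973, SKOS.closeMatch, OBO.HP_0000822))

print(g.serialize(format="turtle"))
```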

Rather than:

The benefits of employing ontologies for standardising and structuring data are widely accepted.

I would rephrase that to read:

The benefits and limitations of employing ontologies for standardising and structuring data are widely known.

Decades of use of relational database schemas, informal equivalents of ontologies, leave no doubt that governing structures for data have benefits.

Less often acknowledged is that those same governing structures impose limitations on data and on what may be represented.

That’s not a dig at relational databases.

Just an observation that ontologies and their equivalents aren’t unalloyed precious metals.

December 27, 2012

Utopia Documents

Filed under: Annotation,Bioinformatics,Biomedical,Medical Informatics,PDF — Patrick Durusau @ 3:58 pm

I was checking the “sponsored by” link for pdfx v1.0 when I discovered Utopia Documents.

From the homepage:

Reading, redefined.

Utopia Documents brings a fresh new perspective to reading the scientific literature, combining the convenience and reliability of the PDF with the flexibility and power of the web. Free for Linux, Mac and Windows.

Building Bridges

The scientific article has been described as a Story That Persuades With Data, but all too often the link between data and narrative is lost somewhere in the modern publishing process. Utopia Documents helps to rebuild these connections, linking articles to underlying datasets, and making it easy to access online resources relating to an article’s content.

A Living Resource

Published articles form the ‘minutes of science’, creating a stable record of ideas and discoveries. But no idea exists in isolation, and just because something has been published doesn’t mean that the story is over. Utopia Documents reconnects PDFs with the ongoing discussion, keeping you up-to-date with the latest knowledge and metrics.

Comment

Make private notes for yourself, annotate a document for others to see or take part in an online discussion.

Explore article content

Looking for clarification of given terms? Or more information about them? Do just that, with integrated semantic search.

Interact with live data

Interact directly with curated database entries: play with molecular structures; edit sequence and alignment data; even plot and export tabular data.

A finger on the pulse

Stay up to date with the latest news. Utopia connects what you read with live data from Altmetric, Mendeley, CrossRef, Scibite and others.

A user can register for an account (enabling comments on documents) or use the application anonymously.

Utopia Documents is presently focused on the life sciences, but there is no impediment to expansion into, for example, computer science.

It doesn’t solve semantic diversity issues, so there is an opportunity for topic maps there.

Nor does it address the issue of documents being good at information delivery but not so good at information storage.

But issues of semantic diversity and information storage are growth areas for Utopia Documents, not reservations about its use.

Suggest you start using and exploring Utopia Documents sooner rather than later!

December 15, 2012

Neuroscience Information Framework (NIF)

Filed under: Bioinformatics,Biomedical,Medical Informatics,Neuroinformatics,Searching — Patrick Durusau @ 8:21 pm

Neuroscience Information Framework (NIF)

From the about page:

The Neuroscience Information Framework is a dynamic inventory of Web-based neuroscience resources: data, materials, and tools accessible via any computer connected to the Internet. An initiative of the NIH Blueprint for Neuroscience Research, NIF advances neuroscience research by enabling discovery and access to public research data and tools worldwide through an open source, networked environment.

An example of a subject-specific information resource that provides much deeper coverage than is possible with Google, for example.

If you aren’t trying to index everything, you can outperform more general search solutions.

November 12, 2012

An Ontological Representation of Biomedical Data Sources and Records [Data & Record as Subjects]

Filed under: Bioinformatics,Biomedical,Medical Informatics,Ontology,RDF — Patrick Durusau @ 7:27 pm

An Ontological Representation of Biomedical Data Sources and Records by Michael Bada, Kevin Livingston, and Lawrence Hunter.

Abstract:

Large RDF-triple stores have been the basis of prominent recent attempts to integrate the vast quantities of data in semantically divergent databases. However, these repositories often conflate data-source records, which are information content entities, and the biomedical concepts and assertions denoted by them. We propose an ontological model for the representation of data sources and their records as an extension of the Information Artifact Ontology. Using this model, we have consistently represented the contents of 17 prominent biomedical databases as a 5.6-billion RDF-triple knowledge base, enabling querying and inference over this large store of integrated data.

Recognition of the need to treat data containers as subjects, along with the data they contain, is always refreshing.

In particular because the evolution of data sources can be captured, as the authors remark:

Our ontology is fully capable of handling the evolution of data sources: If the schema of a given data set is changed, a new instance of the schema is simply created, along with the instances of the fields of the new schema. If the data sets of a data source change (or a new set is made available), an instance for each new data set can be created, along with instances for its schema and fields. (Modeling of incremental change rather than creation of new instances may be desirable but poses significant representational challenges.) Additionally, using our model, if a researcher wishes to work with multiple versions of a given data source (e.g., to analyze some aspect of multiple versions of a given database), an instance for each version of the data source can be created. If different versions of a data source consist of different data sets (e.g., different file organizations) and/or different schemas and fields, the explicit representation of all of these elements and their linkages will make the respective structures of the disparate data-source versions unambiguous. Furthermore, it may be the case that only a subset of a data source needs to be represented; in such a case, only instances of the data sets, schemas, and fields of interest are created.

I first saw this in a tweet by Anita de Waard.
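
A rough sketch of that modeling pattern in rdflib follows; the class and property IRIs are placeholders for illustration, not the authors’ actual Information Artifact Ontology extension:

```python
# Sketch: keep the record (an information content entity) distinct from the
# biomedical entity it denotes, and tie it to its data set and schema version.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/model/")     # placeholder ontology terms
REC = Namespace("http://example.org/records/")  # placeholder record IRIs

g = Graph()
g.bind("ex", EX)

record = REC.uniprot_P04637          # a single database record
g.add((record, RDF.type, EX.DataSourceRecord))
g.add((record, EX.partOfDataSet, EX.UniProt_release_2012_08))
g.add((record, EX.hasSchema, EX.UniProt_schema_v1))
g.add((record, EX.hasFieldValue, Literal("P04637")))

# What the record denotes is a separate resource: the protein itself.
g.add((record, EX.denotes, EX.Protein_TP53))

print(g.serialize(format="turtle"))
```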

November 9, 2012

Semantic Technologies — Biomedical Informatics — Individualized Medicine

Filed under: Bioinformatics,Biomedical,Medical Informatics,Ontology,Semantic Web — Patrick Durusau @ 11:14 am

Joint Workshop on Semantic Technologies Applied to Biomedical Informatics and Individualized Medicine (SATBI+SWIM 2012) (In conjunction with International Semantic Web Conference (ISWC 2012) Boston, Massachusetts, U.S.A. November 11-15, 2012)

If you are at ISWC, consider attending.

To help with that choice, the accepted papers:

Jim McCusker, Jeongmin Lee, Chavon Thomas and Deborah L. McGuinness. Public Health Surveillance Using Global Health Explorer. [PDF]

Anita de Waard and Jodi Schneider. Formalising Uncertainty: An Ontology of Reasoning, Certainty and Attribution (ORCA). [PDF]

Alexander Baranya, Luis Landaeta, Alexandra La Cruz and Maria-Esther Vidal. A Workflow for Improving Medical Visualization of Semantically Annotated CT-Images. [PDF]

Derek Corrigan, Jean Karl Soler and Brendan Delaney. Development of an Ontological Model of Evidence for TRANSFoRm Utilizing Transition Project Data. [PDF]

Amina Chniti, Abdelali Boussadi, Patrice Degoulet, Patrick Albert and Jean Charlet. Pharmaceutical Validation of Medication Orders Using an OWL Ontology and Business Rules. [PDF]

November 7, 2012

Next Level Doctor Social Graph

Filed under: Medical Informatics,Social Graphs — Patrick Durusau @ 4:48 pm

Next Level Doctor Social Graph by Fred Trotter.

From the webpage:

At Strata RX 2012, we (NotOnly Dev) released the Doctor Social Graph — a project that visually displays the connections between doctors, hospitals and other healthcare organizations in the US. This conglomeration of data sets shows everything — from the connections between doctors who refer their patients to each other to any other data collected by state and national databases. It displays real names and will eventually show every city.

This is THE data set that any academic, scientist, or health policy junkie could ever want to conduct almost any study.

Our goal is to empower the patient, make the system transparent and accountable, and release this data to the people who can use it to revitalize our health system.

This is important.

It is as simple as that.

The project seeks to collate data on relationships in the medical system.

Can’t tell the players (or plan for reform) without a program. This is potentially that program.

I once met a person being treated by a physician who owned a drug store. Under 25 years of age with 30 separate prescriptions.

This is a step towards correcting that sort of abuse.

Contribute, volunteer, help.

October 25, 2012

Health Design Challenge [$50K in Prizes – Deadline 30th Nov 2012]

Filed under: Challenges,Health care,Medical Informatics — Patrick Durusau @ 10:01 am

Health Design Challenge

More details at the site but:

ONC & VA invite you to rethink how the medical record is presented. We believe designers can use their talents to make health information patient-centered and improve the patient experience.

Being able to access your health information on demand can be lifesaving in an emergency situation, can help prevent medication errors, and can improve care coordination so everyone who is caring for you is on the same page. However, too often health information is presented in an unwieldy and unintelligible way that makes it hard for patients, their caregivers, and their physicians to use. There is an opportunity for talented designers to reshape the way health records are presented to create a better patient experience.

Learn more at http://healthdesignchallenge.com

The purpose of this effort is to improve the design of the medical record so it is more usable by and meaningful to patients, their families, and others who take care of them. This is an opportunity to take the plain-text Blue Button file and enrich it with visuals and a better layout. Innovators will be invited to submit their best designs for a medical record that can be printed and viewed digitally.

This effort will focus on the content defined by a format called the Continuity of Care Document (CCD). A CCD is a common template used to describe a patient’s health history and can be output by electronic medical record (EMR) software. Submitted designs should use the sections and fields found in a CCD. See http://blue-button.github.com/challenge/files/health-design-challenge-fields.pdf for CCD sections and fields.

Entrants will submit a design that:

  • Improves the visual layout and style of the information from the medical record
  • Makes it easier for a patient to manage his/her health
  • Enables a medical professional to digest information more efficiently
  • Aids a caregiver such as a family member or friend in his/her duties and responsibilities with respect to the patient

Entrants should be conscious of how the wide variety of personas will affect their design. Our healthcare system takes care of the following types of individuals:

  • An underserved inner-city parent with lower health literacy
  • A senior citizen that has a hard time reading
  • A young adult who is engaged with technology and mobile devices
  • An adult whose first language is not English
  • A patient with breast cancer receiving care from multiple providers
  • A busy mom managing her kids’ health and helping her aging parents

This is an opportunity for talented individuals to touch the lives of Americans across the country through design. The most innovative designs will be showcased in an online gallery and in a physical exhibit at the Annual ONC Meeting in Washington DC.

should be enough to capture your interest.

Winners will be announced December 12, 2012.

Only the design is required, no working code.

Still, a topic map frame of mind may give you more options than other approaches.

October 16, 2012

The “O” Word (Ontology) Isn’t Enough

Filed under: Bioinformatics,Biomedical,Gene Ontology,Genome,Medical Informatics,Ontology — Patrick Durusau @ 10:36 am

The Units Ontology makes reference to the Gene Ontology as an example of a successful web ontology effort.

As it should. The Gene Ontology (GO) is the only successful web ontology effort. A universe with one (1) inhabitant.

The GO has a number of differences from wannabe successful ontology candidates. (see the article below)

The first difference echoes loudly across the semantic engineering universe:

One of the factors that account for GO’s success is that it originated from within the biological community rather than being created and subsequently imposed by external knowledge engineers. Terms were created by those who had expertise in the domain, thus avoiding the huge effort that would have been required for a computer scientist to learn and organize large amounts of biological functional information. This also led to general acceptance of the terminology and its organization within the community. This is not to say that there have been no disagreements among biologists over the conceptualization, and there is of course a protocol for arriving at a consensus when there is such a disagreement. However, a model of a domain is more likely to conform to the shared view of a community if the modelers are within or at least consult to a large degree with members of that community.

Did you catch that first line?

One of the factors that account for GO’s success is that it originated from within the biological community rather than being created and subsequently imposed by external knowledge engineers.

Saying the “O” word, ontology, that will benefit everyone if they will just listen to you, isn’t enough.

There are other factors to consider:

A Short Study on the Success of the Gene Ontology by Michael Bada, Robert Stevens, Carole Goble, Yolanda Gil, Michael Ashburner, Judith A. Blake, J. Michael Cherry, Midori Harris, Suzanna Lewis.

Abstract:

While most ontologies have been used only by the groups who created them and for their initially defined purposes, the Gene Ontology (GO), an evolving structured controlled vocabulary of nearly 16,000 terms in the domain of biological functionality, has been widely used for annotation of biological-database entries and in biomedical research. As a set of learned lessons offered to other ontology developers, we list and briefly discuss the characteristics of GO that we believe are most responsible for its success: community involvement; clear goals; limited scope; simple, intuitive structure; continuous evolution; active curation; and early use.

October 15, 2012

Information needs of public health practitioners: a review of the literature [Indexing Needs]

Filed under: Biomedical,Indexing,Medical Informatics,Searching — Patrick Durusau @ 4:37 am

Information needs of public health practitioners: a review of the literature by Jennifer Ford and Helena Korjonen.

Abstract:

Objective

To review published literature covering the information needs of public health practitioners and papers highlighting gaps and potential solutions in order to summarise what is already known about this population and models tested to support them.

Methods

The search strategy included bibliographic databases LISTA, LISA, PubMed and Web of Knowledge. The results of this literature review were used to create two tables displaying published literature.

Findings

The literature highlighted that some research has taken place into different public health subgroups with consistent findings. Gaps in information provision have also been identified by looking at the information services provided.

Conclusion

There is a need for further research into information needs in subgroups of public health practitioners as this group is diverse, has different needs and needs varying information. Models of informatics that can support public health must be developed and published so that the public health information community can share experiences and solutions and begin to build an evidence-base to produce superior information systems for the goal of a healthier society.

One of the key points for topic map advocates:

The need for improved indexing of public health information was highlighted by Alpi, discussing the role of expert searching in public health information retrieval [2]. Existing taxonomies such as the MeSH system used by PubMed/Medline are perceived as inadequate for indexing the breadth of public health literature and are seen to be too clinically focussed [2]. There is also concern at the lack of systematic indexing of grey literature [2]. Given that more than one study has highlighted the high level of use of grey literature by public health practitioners, this lack of indexing should be of real concern to public health information specialists and practitioners. LaPelle also found that participants in her research had experienced difficulties with search terms for public health, which is indicative of the current inadequacy of public health indexing [1].

Other opportunities for topic maps are apparent in the literature review but inadequate indexing should be topic maps bread and butter.

September 7, 2012

EU-ADR Web Platform

Filed under: Bioinformatics,Biomedical,Drug Discovery,Medical Informatics — Patrick Durusau @ 10:29 am

EU-ADR Web Platform

I was disappointed not to find the UMLS concepts and related terms mapping for participants in the EU-ADR project.

I did find these workflows at the EU-ADR Web Platform:

MEDLINE ADR

In the filtering process of well known signals, the aim of the “MEDLINE ADR” workflow is to automate the search of publications related to ADRs corresponding to a given drug/adverse event association. To do so, we defined an approach based on the MeSH thesaurus, using the subheadings «chemically induced» and «adverse effects» with the “Pharmacological Action” knowledge. Using a threshold of ≥3 extracted publications, the automated search method presented a sensitivity of 93% and a specificity of 97% on the true positive and true negative sets (WP 2.2). We then determined a threshold number of extracted publications ≥ 3 to confirm the knowledge of this association in the literature. This approach offers the opportunity to automatically determine if an ADR (association of a drug and an adverse event) has already been described in MEDLINE. However, the causality relationship between the drug and an event may be judged only by an expert reading the full text article and determining if the methodology of this article was correct and if the association is statistically significant.

MEDLINE Co-occurrence

The “MEDLINE Co-occurrence” workflow performs a comprehensive data processing operation, searching the given Drug-Event combination in the PubMed database. Final workflow results include a final score, measuring found drugs relevance regarding the initial Drug-Event pair, as well as pointers to web pages for the discovered drugs.

DailyMed

The “DailyMed” workflow performs a comprehensive data processing operation, searching the given Drug-Event combination in the DailyMed database. Final workflow results include a final score, measuring found drugs relevance regarding the initial Drug-Event pair, as well as pointers to web pages for the discovered drugs.

DrugBank

The “DrugBank” workflow performs a comprehensive data processing operation, searching the given Drug-Event combination in the DrugBank database. Final workflow results include a final score, measuring found drugs relevance regarding the initial Drug-Event pair, as well as pointers to web pages for the discovered drugs.

Substantiation

The “Substantiation” workflow tries to establish a connection between the clinical event and the drug through a gene or protein, by identifying the proteins that are targets of the drug and are also associated with the event. In addition it also considers information about drug metabolites in this process. In such cases it can be argued that the binding of the drug to the protein would lead to the observed event phenotype. Associations between the event and proteins are found by querying our integrated gene-disease association database (Bauer-Mehren, et al., 2010). As this database provides annotations of the gene-disease associations to the articles reporting the association and in case of text-mining derived associations even the exact sentence, the article or sentence can be studied in more detail in order to inspect the supporting evidence for each gene-disease association. It has to be mentioned that our gene-disease association database also contains information about genetic variants or SNPs and their association to diseases or adverse drug events. The methodology for providing information about the binding of a drug (or metabolite) to protein targets is reported in deliverable 4.2, and includes extraction from different databases (annotated chemical libraries) and application of prediction methods based on chemical similarity.

A glimpse of what is state of the art today and a basis for building better tools for tomorrow.
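
The “MEDLINE ADR” workflow above can be approximated with an ordinary PubMed query: count citations indexed with the drug under “adverse effects” and the event under “chemically induced,” then apply the ≥3 publication threshold. A minimal sketch using the NCBI E-utilities (not the project’s actual implementation), with an illustrative drug/event pair:

```python
# Count PubMed citations linking a drug (adverse effects) to an event
# (chemically induced), then apply the >= 3 publications threshold.
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def medline_adr_count(drug: str, event: str) -> int:
    term = (f'"{drug}/adverse effects"[Mesh] AND '
            f'"{event}/chemically induced"[Mesh]')
    resp = requests.get(ESEARCH,
                        params={"db": "pubmed", "term": term, "retmode": "json"})
    resp.raise_for_status()
    return int(resp.json()["esearchresult"]["count"])

count = medline_adr_count("aspirin", "gastrointestinal hemorrhage")
print(count, "publications;",
      "already described in MEDLINE" if count >= 3 else "not (yet) described")
```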

Harmonization of Reported Medical Events in Europe

Filed under: Bioinformatics,Biomedical,Health care,Medical Informatics — Patrick Durusau @ 10:00 am

Harmonization process for the identification of medical events in eight European healthcare databases: the experience from the EU-ADR project by Paul Avillach, et al. (J Am Med Inform Assoc doi:10.1136/amiajnl-2012-000933)

Abstract

Objective Data from electronic healthcare records (EHR) can be used to monitor drug safety, but in order to compare and pool data from different EHR databases, the extraction of potential adverse events must be harmonized. In this paper, we describe the procedure used for harmonizing the extraction from eight European EHR databases of five events of interest deemed to be important in pharmacovigilance: acute myocardial infarction (AMI); acute renal failure (ARF); anaphylactic shock (AS); bullous eruption (BE); and rhabdomyolysis (RHABD).

Design The participating databases comprise general practitioners’ medical records and claims for hospitalization and other healthcare services. Clinical information is collected using four different disease terminologies and free text in two different languages. The Unified Medical Language System was used to identify concepts and corresponding codes in each terminology. A common database model was used to share and pool data and verify the semantic basis of the event extraction queries. Feedback from the database holders was obtained at various stages to refine the extraction queries.

….

Conclusions The iterative harmonization process enabled a more homogeneous identification of events across differently structured databases using different coding based algorithms. This workflow can facilitate transparent and reproducible event extractions and understanding of differences between databases.

Not to be overly critical, but the one thing left out of the abstract was some hint about the “…procedure used for harmonizing the extraction…”, which is what interests me.

The workflow diagram from figure 2 is worth transposing into HTML markup:

  • Event definition
    • Choice of the event
    • Event Definition Form (EDF) containing the medical definition and diagnostic criteria for the event
  • Concepts selection and projection into the terminologies
    • Search for Unified Medical Language System (UMLS) concepts corresponding to the medical definition as reported in the EDF
    • Projection of UMLS concepts into the different terminologies used in the participating databases
    • Publication on the project’s forum of the first list of UMLS concepts and corresponding codes and terms for each terminology
  • Revision of concepts and related terms
    • Feedback from database holders about the list of concepts with corresponding codes and related terms that they have previously used to identify the event of interest
    • Report on literature review on search criteria being used in previous observational studies that explored the event of interest
    • Text mining in database to identify potentially missing codes through the identification of terms associated with the event in databases
    • Conference call for finalizing the list of concepts
    • Search for new UMLS concepts from the proposed terms
    • Final list of UMLS concepts and related codes posted on the forum
  • Translation of concepts and coding algorithms into queries
    • Queries in each database were built using:
      1. the common data model;
      2. the concept projection into different terminologies; and
      3. the chosen algorithms for event definition
    • Query Analysis
      • Database holders extract data on the event of interest using codes and free text from pre-defined concepts and with database-specific refinement strategies
      • Database holders calculate incidence rates and comparisons are made among databases
      • Database holders compare search queries via the forum

At least for non-members, the EU-ADR website does not appear to offer access to the UMLS concepts and related codes mapping. That mapping could be used to increase accessibility to any database using those codes.
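
For anyone wanting to approximate the “projection of UMLS concepts into the different terminologies” step from a local UMLS installation, a minimal sketch; it assumes the standard pipe-delimited MRCONSO.RRF layout (CUI in field 0, source vocabulary in field 11, code in field 13, term string in field 14), and the example CUI and vocabulary abbreviations are illustrative:

```python
# Project a set of UMLS concepts (CUIs) into codes and terms per source
# vocabulary by scanning a local MRCONSO.RRF file.
from collections import defaultdict

def project_cuis(mrconso_path, cuis,
                 vocabularies=("ICD9CM", "ICPC2EENG", "RCD", "MDR")):
    projection = defaultdict(set)  # (CUI, source vocabulary) -> {(code, term), ...}
    wanted = set(cuis)
    with open(mrconso_path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.rstrip("\n").split("|")
            cui, sab, code, term = fields[0], fields[11], fields[13], fields[14]
            if cui in wanted and sab in vocabularies:
                projection[(cui, sab)].add((code, term))
    return projection

# e.g. projection of a CUI for one of the five EU-ADR events
# (the CUI below is illustrative; verify it against your UMLS release):
# proj = project_cuis("META/MRCONSO.RRF", ["C0002792"])  # anaphylactic shock
```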

August 10, 2012

[C]rowdsourcing … knowledge base construction

Filed under: Biomedical,Crowd Sourcing,Data Mining,Medical Informatics — Patrick Durusau @ 1:48 pm

Development and evaluation of a crowdsourcing methodology for knowledge base construction: identifying relationships between clinical problems and medications by Allison B McCoy, Adam Wright, Archana Laxmisan, Madelene J Ottosen, Jacob A McCoy, David Butten, and Dean F Sittig. (J Am Med Inform Assoc 2012; 19:713-718 doi:10.1136/amiajnl-2012-000852)

Abstract:

Objective We describe a novel, crowdsourcing method for generating a knowledge base of problem–medication pairs that takes advantage of manually asserted links between medications and problems.

Methods Through iterative review, we developed metrics to estimate the appropriateness of manually entered problem–medication links for inclusion in a knowledge base that can be used to infer previously unasserted links between problems and medications.

Results Clinicians manually linked 231 223 medications (55.30% of prescribed medications) to problems within the electronic health record, generating 41 203 distinct problem–medication pairs, although not all were accurate. We developed methods to evaluate the accuracy of the pairs, and after limiting the pairs to those meeting an estimated 95% appropriateness threshold, 11 166 pairs remained. The pairs in the knowledge base accounted for 183 127 total links asserted (76.47% of all links). Retrospective application of the knowledge base linked 68 316 medications not previously linked by a clinician to an indicated problem (36.53% of unlinked medications). Expert review of the combined knowledge base, including inferred and manually linked problem–medication pairs, found a sensitivity of 65.8% and a specificity of 97.9%.

Conclusion Crowdsourcing is an effective, inexpensive method for generating a knowledge base of problem–medication pairs that is automatically mapped to local terminologies, up-to-date, and reflective of local prescribing practices and trends.

I would not apply the term “crowdsourcing” here, in part because the “crowd” is hardly unknown: it is not a crowd at all, but an identifiable group of clinicians.

That doesn’t invalidate the results, which show the utility of data mining for creating knowledge bases.

As a matter of usage, let’s not confuse anonymous “crowds” with specific groups of people.
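
The filtering step described in the abstract can be sketched very roughly as “keep the pairs whose manually asserted links clear an appropriateness threshold.” The counts and the appropriateness estimate below are invented for illustration; the paper’s actual metrics are more involved:

```python
# Keep problem-medication pairs whose estimated appropriateness
# (asserted links / co-occurrences) clears a 95% threshold.
link_counts = {          # times clinicians explicitly linked the pair
    ("hypertension", "lisinopril"): 480,
    ("hypertension", "metformin"): 6,
}
cooccurrence_counts = {  # times the problem and medication appeared together
    ("hypertension", "lisinopril"): 500,
    ("hypertension", "metformin"): 400,
}

THRESHOLD = 0.95
knowledge_base = {
    pair for pair, linked in link_counts.items()
    if linked / cooccurrence_counts[pair] >= THRESHOLD
}
print(knowledge_base)  # {('hypertension', 'lisinopril')}
```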

Phenol-Explorer 2.0:… [Topic Maps As Search Templates]

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 12:31 pm

Phenol-Explorer 2.0: a major update of the Phenol-Explorer database integrating data on polyphenol metabolism and pharmacokinetics in humans and experimental animals by Joseph A. Rothwell, Mireia Urpi-Sarda, Maria Boto-Ordoñez, Craig Knox, Rafael Llorach, Roman Eisner, Joseph Cruz, Vanessa Neveu, David Wishart, Claudine Manach, Cristina Andres-Lacueva, and Augustin Scalbert.

Abstract:

Phenol-Explorer, launched in 2009, is the only comprehensive web-based database on the content in foods of polyphenols, a major class of food bioactives that receive considerable attention due to their role in the prevention of diseases. Polyphenols are rarely absorbed and excreted in their ingested forms, but extensively metabolized in the body, and until now, no database has allowed the recall of identities and concentrations of polyphenol metabolites in biofluids after the consumption of polyphenol-rich sources. Knowledge of these metabolites is essential in the planning of experiments whose aim is to elucidate the effects of polyphenols on health. Release 2.0 is the first major update of the database, allowing the rapid retrieval of data on the biotransformations and pharmacokinetics of dietary polyphenols. Data on 375 polyphenol metabolites identified in urine and plasma were collected from 236 peer-reviewed publications on polyphenol metabolism in humans and experimental animals and added to the database by means of an extended relational design. Pharmacokinetic parameters have been collected and can be retrieved in both tabular and graphical form. The web interface has been enhanced and now allows the filtering of information according to various criteria. Phenol-Explorer 2.0, which will be periodically updated, should prove to be an even more useful and capable resource for polyphenol scientists because bioactivities and health effects of polyphenols are dependent on the nature and concentrations of metabolites reaching the target tissues. The Phenol-Explorer database is publicly available and can be found online at http://www.phenol-explorer.eu.

I wanted to call your attention to Table 1: Search Strategy and Terms, step 4 which reads:

Polyphenol* or flavan* or flavon* or anthocyan* or isoflav* or phytoestrogen* or phyto-estrogen* or lignin* or stilbene* or chalcon* or phenolic acid* or ellagic* or coumarin* or hydroxycinnamic* or quercetin* or kaempferol* or rutin* or apigenin* or luteolin* or catechin* or epicatechin* or gallocatechin* or epigallocatechin* or procyanidin* or hesperetin* or naringenin* or cyanidin* or malvidin* or petunid* or peonid* or daidz* or genist* or glycit* or equol* or gallic* or vanillic* or chlorogenic* or tyrosol* or hydoxytyrosol* or resveratrol* or viniferin*

Which of these terms are synonyms for “tyrosol”?

No peeking!

Wikipedia (a generalist source) lists five (5) names, including tyrosol, and 5 different identifiers.

Common Chemistry, which you can access by the CAS number, has twenty-one (21) synonyms.

Ready?

Would you believe 0?

See for yourself: Wikipedia Tyrosol; Common Chemistry – CAS 501-94-0.

Another question: In one week (or even tomorrow), how much of the query in step 4 will you remember?

Some obvious comments:

  • The creators of Phenol-Explorer 2.0 have done a great service to the community by curating this data resource.
  • Creating comprehensive queries is a creative enterprise and not easy to duplicate.

Perhaps less obvious comments:

  • The terms in the query have synonyms, which is no great surprise.
  • If the terms were represented as topics in a topic map, synonyms could be captured for those terms.
  • Capturing of synonyms for terms would support expansion or contraction of search queries.
  • Capturing terms (and their synonyms) in a topic map would permit merging of terms/synonyms from other researchers.

Final question: Have you thought about using topic maps as search templates?
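
To make the “search template” idea concrete: if each query term were a topic carrying its synonyms as variant names, the step-4 query could be expanded or contracted mechanically rather than reconstructed from memory. A minimal sketch with a hand-built (and deliberately tiny) synonym table:

```python
# Expand a search query from topics whose variant names are synonyms.
# The synonym lists are illustrative, not curated chemistry.
topics = {
    "tyrosol": ["tyrosol*", "4-hydroxyphenethyl alcohol",
                "2-(4-hydroxyphenyl)ethanol"],
    "resveratrol": ["resveratrol*", "trans-resveratrol*",
                    "3,5,4'-trihydroxystilbene"],
}

def expand(selected, include_synonyms=True):
    terms = []
    for name in selected:
        terms.extend(topics[name] if include_synonyms else topics[name][:1])
    return " OR ".join(terms)

print(expand(["tyrosol", "resveratrol"]))
# tyrosol* OR 4-hydroxyphenethyl alcohol OR 2-(4-hydroxyphenyl)ethanol OR ...
print(expand(["tyrosol", "resveratrol"], include_synonyms=False))
# tyrosol* OR resveratrol*
```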

MedLingMap

Filed under: Medical Informatics,Natural Language Processing — Patrick Durusau @ 9:27 am

MedLingMap

From the “welcome” entry:

MedLingMap is a growing resource providing a map of NLP systems and research in the Medical Domain. The site is being developed as part of the NLP Systems in the Medical Domain course in Brandeis University’s Computational Linguistics Master’s Program, taught by Dr. Marie Meteer. Learn more about the students doing the work.

MedLingMap brings together the many different references, resources, organizations, and people in this very diverse domain. By using a faceted indexing approach to organizing the materials, MedLingMap can capture not only the content, but also the context by including facets such as the applications of the technology, the research or development group it was done by, and the techniques and algorithms that were utilized in developing the technology.

Not many resources are listed yet, but every project has to start somewhere.

Capturing the use of specific techniques and algorithms will make this a particularly useful resource.

i2b2: Informatics for Integrating Biology and the Bedside

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 4:43 am

i2b2: Informatics for Integrating Biology and the Bedside

I discovered this site while chasing down a coreference resolution workshop. From the homepage:

Informatics for Integrating Biology and the Bedside (i2b2) is an NIH-funded National Center for Biomedical Computing (NCBC) based at Partners HealthCare System in Boston, Mass. Established in 2004 in response to an NIH Roadmap Initiative RFA, this NCBC is one of four national centers awarded in this first competition (http://www.bisti.nih.gov/ncbc/); currently there are seven NCBCs. One of 12 specific initiatives in the New Pathways to Discovery Cluster, the NCBCs will initiate the development of a national computational infrastructure for biomedical computing. The NCBCs and related R01s constitute the National Program of Excellence in Biomedical Computing.

The i2b2 Center, led by Director Isaac Kohane, M.D., Ph.D., Professor of Pediatrics at Harvard Medical School at Children’s Hospital Boston, is comprised of seven cores involving investigators from the Harvard-affiliated hospitals, MIT, Harvard School of Public Health, Joslin Diabetes Center, Harvard Medical School and the Harvard/MIT Division of Health Sciences and Technology. This Center is funded under a Cooperative agreement with the National Institutes of Health.

The i2b2 Center is developing a scalable computational framework to address the bottleneck limiting the translation of genomic findings and hypotheses in model systems relevant to human health. New computational paradigms (Core 1) and methodologies (Cores 2) are being developed and tested in several diseases (airways disease, hypertension, type 2 diabetes mellitus, Huntington’s Disease, rheumatoid arthritis, and major depressive disorder) (Core 3 Driving Biological Projects).

The i2b2 Center (Core 5) offers a Summer Institute in Bioinformatics and Integrative Genomics for qualified undergraduate students, supports an Academic Users’ Group of over 125 members, sponsors annual Shared Tasks for Challenges in Natural Language Processing for Clinical Data, distributes an NLP DataSet for research purpose, and sponsors regular Symposia and Workshops for the community.

Sounds like prime hunting grounds for vocabularies that cross disciplinary boundaries and the like.

Extensive resources. Will explore and report back.

August 9, 2012

Evaluating the state of the art in coreference resolution for electronic medical records

Evaluating the state of the art in coreference resolution for electronic medical records by Ozlem Uzuner, Andreea Bodnari, Shuying Shen, Tyler Forbush, John Pestian, and Brett R South. (J Am Med Inform Assoc 2012; 19:786-791 doi:10.1136/amiajnl-2011-000784)

Abstract:

Background The fifth i2b2/VA Workshop on Natural Language Processing Challenges for Clinical Records conducted a systematic review on resolution of noun phrase coreference in medical records. Informatics for Integrating Biology and the Bedside (i2b2) and the Veterans Affairs (VA) Consortium for Healthcare Informatics Research (CHIR) partnered to organize the coreference challenge. They provided the research community with two corpora of medical records for the development and evaluation of the coreference resolution systems. These corpora contained various record types (ie, discharge summaries, pathology reports) from multiple institutions.

Methods The coreference challenge provided the community with two annotated ground truth corpora and evaluated systems on coreference resolution in two ways: first, it evaluated systems for their ability to identify mentions of concepts and to link together those mentions. Second, it evaluated the ability of the systems to link together ground truth mentions that refer to the same entity. Twenty teams representing 29 organizations and nine countries participated in the coreference challenge.

Results The teams’ system submissions showed that machine-learning and rule-based approaches worked best when augmented with external knowledge sources and coreference clues extracted from document structure. The systems performed better in coreference resolution when provided with ground truth mentions. Overall, the systems struggled in solving coreference resolution for cases that required domain knowledge.

That systems “struggled in solving coreference resolution for cases that required domain knowledge” isn’t surprising.

But, as we saw in > 4,000 Ways to say “You’re OK” [Breast Cancer Diagnosis], for any given diagnosis, there is a finite number of ways to say it.

Usually far fewer than 4,000. If we capture the ways as they are encountered, our systems don’t need “domain knowledge.”

As the lead character in O Brother, Where Art Thou? says, our applications can be as “dumb as a bag of hammers.”

PS: Apologies but I could not find an accessible version of this article. Will run down the details on the coreference workshop tomorrow and hopefully some accessible materials on it.

The Cell: An Image Library

Filed under: Bioinformatics,Biomedical,Data Source,Medical Informatics — Patrick Durusau @ 3:50 pm

The Cell: An Image Library

For the casual user, an impressive collection of cell images.

For the professional user, the advanced search page gives you an idea of the depth of images in this collection.

A good source of images for curated (not “mash up”) alignment with other materials, such as instructional resources on biology or medicine.

August 5, 2012

Journal of the American Medical Informatics Association (JAMIA)

Filed under: Bioinformatics,Informatics,Medical Informatics,Pathology Informatics — Patrick Durusau @ 10:53 am

Journal of the American Medical Informatics Association (JAMIA)

Aims and Scope

JAMIA is AMIA‘s premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA’s articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives and reviews also help readers stay connected with the most important informatics developments in implementation, policy and education.

Another informatics journal to whitelist for searching.

Content is freely available after twelve (12) months.

Cancer, NLP & Kaiser Permanente Southern California (KPSC)

Filed under: Bioinformatics,Medical Informatics,Pathology Informatics,Uncategorized — Patrick Durusau @ 10:38 am

Kaiser Permanente Southern California (KPSC) deserves high marks for the research in:

Identifying primary and recurrent cancers using a SAS-based natural language processing algorithm by Justin A Strauss, et al.

Abstract:

Objective Significant limitations exist in the timely and complete identification of primary and recurrent cancers for clinical and epidemiologic research. A SAS-based coding, extraction, and nomenclature tool (SCENT) was developed to address this problem.

Materials and methods SCENT employs hierarchical classification rules to identify and extract information from electronic pathology reports. Reports are analyzed and coded using a dictionary of clinical concepts and associated SNOMED codes. To assess the accuracy of SCENT, validation was conducted using manual review of pathology reports from a random sample of 400 breast and 400 prostate cancer patients diagnosed at Kaiser Permanente Southern California. Trained abstractors classified the malignancy status of each report.

Results Classifications of SCENT were highly concordant with those of abstractors, achieving κ of 0.96 and 0.95 in the breast and prostate cancer groups, respectively. SCENT identified 51 of 54 new primary and 60 of 61 recurrent cancer cases across both groups, with only three false positives in 792 true benign cases. Measures of sensitivity, specificity, positive predictive value, and negative predictive value exceeded 94% in both cancer groups.

Discussion Favorable validation results suggest that SCENT can be used to identify, extract, and code information from pathology report text. Consequently, SCENT has wide applicability in research and clinical care. Further assessment will be needed to validate performance with other clinical text sources, particularly those with greater linguistic variability.

Conclusion SCENT is proof of concept for SAS-based natural language processing applications that can be easily shared between institutions and used to support clinical and epidemiologic research.

Before I forget:

Data sharing statement SCENT is freely available for non-commercial use and modification. Program source code and requisite support files may be downloaded from: http://www.kp-scalresearch.org/research/tools_scent.aspx

Topic map promotion point: the application was built to account for linguistic variability, not to stamp it out.

Tools built to fit users are more likely to succeed, don’t you think?

Journal of Pathology Informatics (JPI)

Filed under: Bioinformatics,Biomedical,Medical Informatics,Pathology Informatics — Patrick Durusau @ 10:09 am

Journal of Pathology Informatics (JPI)

About:

The Journal of Pathology Informatics (JPI) is an open access peer-reviewed journal dedicated to the advancement of pathology informatics. This is the official journal of the Association for Pathology Informatics (API). The journal aims to publish broadly about pathology informatics and freely disseminate all articles worldwide. This journal is of interest to pathologists, informaticians, academics, researchers, health IT specialists, information officers, IT staff, vendors, and anyone with an interest in informatics. We encourage submissions from anyone with an interest in the field of pathology informatics. We publish all types of papers related to pathology informatics including original research articles, technical notes, reviews, viewpoints, commentaries, editorials, book reviews, and correspondence to the editors. All submissions are subject to peer review by the well-regarded editorial board and by expert referees in appropriate specialties.

Another site to add to your whitelist of sites to search for informatics information.

> 4,000 Ways to say “You’re OK” [Breast Cancer Diagnosis]

The feasibility of using natural language processing to extract clinical information from breast pathology reports by Julliette M Buckley, et al.

Abstract:

Objective: The opportunity to integrate clinical decision support systems into clinical practice is limited due to the lack of structured, machine readable data in the current format of the electronic health record. Natural language processing has been designed to convert free text into machine readable data. The aim of the current study was to ascertain the feasibility of using natural language processing to extract clinical information from >76,000 breast pathology reports.

Approach and Procedure: Breast pathology reports from three institutions were analyzed using natural language processing software (Clearforest, Waltham, MA) to extract information on a variety of pathologic diagnoses of interest. Data tables were created from the extracted information according to date of surgery, side of surgery, and medical record number. The variety of ways in which each diagnosis could be represented was recorded, as a means of demonstrating the complexity of machine interpretation of free text.

Results: There was widespread variation in how pathologists reported common pathologic diagnoses. We report, for example, 124 ways of saying invasive ductal carcinoma and 95 ways of saying invasive lobular carcinoma. There were >4000 ways of saying invasive ductal carcinoma was not present. Natural language processor sensitivity and specificity were 99.1% and 96.5% when compared to expert human coders.

Conclusion: We have demonstrated how a large body of free text medical information such as seen in breast pathology reports, can be converted to a machine readable format using natural language processing, and described the inherent complexities of the task.

The advantages of using current language practices include:

  • No new vocabulary needs to be developed.
  • No adoption curve for a new vocabulary.
  • No training required to introduce users to a new vocabulary.
  • Works with historical data.

and I am sure there are others.

Add natural language usage to your topic map for immediately useful results for your clients.
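
The scale of the variation is also why naive keyword matching falls over: a matcher that fires on "invasive ductal carcinoma" will also fire on "no evidence of invasive ductal carcinoma." Purely to fix ideas, here is a toy Python sketch (my own, not the ClearForest software used in the study) that maps variant phrasings to canonical labels and applies a crude negation-cue window; the variant list and cues are invented for the example.

    import re

    DIAGNOSIS_VARIANTS = {
        "invasive ductal carcinoma": "invasive_ductal_carcinoma",
        "infiltrating duct carcinoma": "invasive_ductal_carcinoma",
        "invasive lobular carcinoma": "invasive_lobular_carcinoma",
    }

    NEGATION_CUES = ("no evidence of", "negative for", "without")

    def extract_diagnoses(sentence):
        """Map variant phrasings to canonical labels, flagging crudely negated mentions."""
        sentence = sentence.lower()
        results = []
        for variant, canonical in DIAGNOSIS_VARIANTS.items():
            match = re.search(re.escape(variant), sentence)
            if not match:
                continue
            # Look in a short window to the left of the mention for a negation cue.
            window = sentence[max(0, match.start() - 30):match.start()]
            negated = any(cue in window for cue in NEGATION_CUES)
            results.append((canonical, negated))
        return results

    print(extract_diagnoses("No evidence of invasive ductal carcinoma."))
    # [('invasive_ductal_carcinoma', True)]
    print(extract_diagnoses("Infiltrating duct carcinoma, grade 2, is present."))
    # [('invasive_ductal_carcinoma', False)]

Real clinical NLP needs far more than a 30-character window, which is exactly what the paper's >4,000 negative phrasings demonstrate, but the structure (variants mapped to canonical labels, plus assertion status) is what ends up in your topic map.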

July 15, 2012

The Ontology for Biomedical Investigations (OBI)

Filed under: Bioinformatics,Biomedical,Medical Informatics,Ontology — Patrick Durusau @ 9:40 am

The Ontology for Biomedical Investigations (OBI)

From the webpage:

The Ontology for Biomedical Investigations (OBI) project is developing an integrated ontology for the description of biological and clinical investigations. This includes a set of ‘universal’ terms, that are applicable across various biological and technological domains, and domain-specific terms relevant only to a given domain. This ontology will support the consistent annotation of biomedical investigations, regardless of the particular field of study. The ontology will represent the design of an investigation, the protocols and instrumentation used, the material used, the data generated and the type analysis performed on it. Currently OBI is being built under the Basic Formal Ontology (BFO).

  • Develop an Ontology for Biomedical Investigations in collaboration with groups representing different biological and technological domains involved in Biomedical Investigations
  • Make OBI compatible with other bio-ontologies
  • Develop OBI using an open source approach
  • Create a valuable resource for the biomedical communities to provide a source of terms for consistent annotation of investigations

An ontology that will be of interest if you are integrating biomedical materials.

At least as a starting point.
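
If you want to poke at OBI before committing to it, a sketch along these lines should work. The download URL below follows the standard OBO Library PURL pattern; verify the current release location (and be patient, the file is large) against the project pages.

    from rdflib import Graph
    from rdflib.namespace import OWL, RDF, RDFS

    # Standard OBO Library PURL pattern; check the OBI project pages for the
    # current release location before relying on it.
    OBI_URL = "http://purl.obolibrary.org/obo/obi.owl"

    g = Graph()
    g.parse(OBI_URL, format="xml")   # OBI is distributed as RDF/XML OWL

    # Print labels for the first few OWL classes as a quick sanity check.
    count = 0
    for cls in g.subjects(RDF.type, OWL.Class):
        for label in g.objects(cls, RDFS.label):
            print(cls, label)
            count += 1
            break
        if count >= 10:
            break

The class IRIs printed here are exactly the sort of subject identifiers you would reuse when merging OBI-annotated data with material annotated against other bio-ontologies.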

My listings of ontologies, vocabularies, etc., are woefully incomplete for any field and represent, at best, starting points for your own, more comprehensive investigations. If you do find these starting points useful, please send pointers to the more complete resources you discover.

July 14, 2012

Journal of Data Mining in Genomics and Proteomics

Filed under: Bioinformatics,Biomedical,Data Mining,Genome,Medical Informatics,Proteomics — Patrick Durusau @ 12:20 pm

Journal of Data Mining in Genomics and Proteomics

From the Aims and Scope page:

Journal of Data Mining in Genomics & Proteomics (JDMGP), a broad-based journal was founded on two key tenets: To publish the most exciting researches with respect to the subjects of Proteomics & Genomics. Secondly, to provide a rapid turn-around time possible for reviewing and publishing, and to disseminate the articles freely for research, teaching and reference purposes.

In today’s wired world information is available at the click of the button, curtsey the Internet. JDMGP-Open Access gives a worldwide audience larger than that of any subscription-based journal in OMICS field, no matter how prestigious or popular, and probably increases the visibility and impact of published work. JDMGP-Open Access gives barrier-free access to the literature for research. It increases convenience, reach, and retrieval power. Free online literature is available for software that facilitates full-text searching, indexing, mining, summarizing, translating, querying, linking, recommending, alerting, “mash-ups” and other forms of processing and analysis. JDMGP-Open Access puts rich and poor on an equal footing for these key resources and eliminates the need for permissions to reproduce and distribute content.

One of the many online publications sponsored by the OMICS Publishing Group.

It has the potential to be an interesting source of information. There is not much in the way of back files, but then it is a very young journal.

April 15, 2012

Constructing Case-Control Studies With Hadoop

Filed under: Bioinformatics,Biomedical,Giraph,Hadoop,Medical Informatics — Patrick Durusau @ 7:13 pm

Constructing Case-Control Studies With Hadoop by Josh Wills.

From the post:

San Francisco seems to be having an unusually high number of flu cases/searches this April, and the Cloudera Data Science Team has been hit pretty hard. Our normal activities (working on Crunch, speaking at conferences, finagling a job with the San Francisco Giants) have taken a back seat to bed rest, throat lozenges, and consuming massive quantities of orange juice. But this bit of downtime also gave us an opportunity to focus on solving a large-scale data science problem that helps some of the people who help humanity the most: epidemiologists.

Case-Control Studies

A case-control study is a type of observational study in which a researcher attempts to identify the factors that contribute to a medical condition by comparing a set of subjects who have that condition (the ‘cases’) to a set of subjects who do not have the condition, but otherwise resemble the case subjects (the ‘controls’). They are useful for exploratory analysis because they are relatively cheap to perform, and have led to many important discoveries- most famously, the link between smoking and lung cancer.

Epidemiologists and other researchers now have access to data sets that contain tens of millions of anonymized patient records. Tens of thousands of these patient records may include a particular disease that a researcher would like to analyze. In order to find enough unique control subjects for each case subject, a researcher may need to execute tens of thousands of queries against a database of patient records, and I have spoken to researchers who spend days performing this laborious task. Although they would like to parallelize these queries across multiple machines, there is a constraint that makes this problem a bit more interesting: each control subject may only be matched with at most one case subject. If we parallelize the queries across the case subjects, we need to check to be sure that we didn’t assign a control subject to multiple cases. If we parallelize the queries across the control subjects, we need to be sure that each case subject ends up with a sufficient number of control subjects. In either case, we still need to query the data an arbitrary number of times to ensure that the matching of cases and controls we come up with is feasible, let alone optimal.

Analyzing a case-control study is a problem for a statistician. Constructing a case-control study is a problem for a data scientist.

A great walk-through on constructing a case-control study, including the use of the Apache Giraph library.
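
The constraint Josh highlights, each control subject usable by at most one case, makes this a bipartite matching problem; the Cloudera post solves it at scale with Giraph. Purely to fix ideas, here is a tiny in-memory greedy sketch of the same constraint in Python, with made-up matching criteria (same gender, age within five years) that are illustrative rather than epidemiologically meaningful.

    from collections import defaultdict

    # Toy records: (id, age, gender).
    cases    = [("c1", 54, "F"), ("c2", 61, "M"), ("c3", 47, "F")]
    controls = [("k1", 52, "F"), ("k2", 60, "M"), ("k3", 49, "F"),
                ("k4", 50, "F"), ("k5", 63, "M")]

    def is_match(case, control):
        # Illustrative criterion only: same gender and age within 5 years.
        return case[2] == control[2] and abs(case[1] - control[1]) <= 5

    def greedy_match(cases, controls, per_case=2):
        """Assign up to per_case controls to each case; each control is used at most once."""
        used = set()
        assignment = defaultdict(list)
        for case in cases:
            for control in controls:
                if control[0] in used or not is_match(case, control):
                    continue
                assignment[case[0]].append(control[0])
                used.add(control[0])
                if len(assignment[case[0]]) == per_case:
                    break
        return dict(assignment)

    print(greedy_match(cases, controls))
    # {'c1': ['k1', 'k3'], 'c2': ['k2', 'k5'], 'c3': ['k4']}

A greedy pass like this can leave later cases short of controls (c3 only gets one here), which is why the post treats feasibility, let alone optimality, as the hard part once the data no longer fits on one machine.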

April 2, 2012

The 1000 Genomes Project

The 1000 Genomes Project

If Amazon is hosting a single dataset > 200 TB, is your data "big data"? 😉

This merits quoting in full:

We're very pleased to welcome the 1000 Genomes Project data to Amazon S3. 

The original human genome project was a huge undertaking. It aimed to identify every letter of our genetic code, 3 billion DNA bases in total, to help guide our understanding of human biology. The project ran for over a decade, cost billions of dollars and became the corner stone of modern genomics. The techniques and tools developed for the human genome were also put into practice in sequencing other species, from the mouse to the gorilla, from the hedgehog to the platypus. By comparing the genetic code between species, researchers can identify biologically interesting genetic regions for all species, including us.

A few years ago there was a quantum leap in the technology for sequencing DNA, which drastically reduced the time and cost of identifying genetic code. This offered the promise of being able to compare full genomes from individuals, rather than entire species, leading to a much more detailed genetic map of where we, as individuals, have genetic similarities and differences. This will ultimately give us better insight into human health and disease.

The 1000 Genomes Project, initiated in 2008, is an international public-private consortium that aims to build the most detailed map of human genetic variation available, ultimately with data from the genomes of over 2,661 people from 26 populations around the world. The project began with three pilot studies that assessed strategies for producing a catalog of genetic variants that are present at one percent or greater in the populations studied. We were happy to host the initial pilot data on Amazon S3 in 2010, and today we're making the latest dataset available to all, including results from sequencing the DNA of approximately 1,700 people.

The data is vast (the current set weighs in at over 200Tb), so hosting the data on S3 which is closely located to the computational resources of EC2 means that anyone with an AWS account can start using it in their research, from anywhere with internet access, at any scale, whilst only paying for the compute power they need, as and when they use it. This enables researchers from laboratories of all sizes to start exploring and working with the data straight away. The Cloud BioLinux AMIs are ready to roll with the necessary tools and packages, and are a great place to get going.

Making the data available via a bucket in S3 also means that customers can crunch the information using Hadoop via Elastic MapReduce, and take advantage of the growing collection of tools for running bioinformatics job flows, such as CloudBurst and Crossbow

You can find more information, the location of the data and how to get started using it on our 1000 Genomes web page, or from the project pages.

If that sounds like a lot of data, just imagine all of the recorded mathematical texts and the relationships among the concepts they represent.

Whether data looks smooth and simple, or complex, depends on the view we take of it.
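
If you would rather poke at the files from Python than spin up Elastic MapReduce, something along these lines should work. The bucket name below is the publicly documented 1000 Genomes bucket; check the AWS 1000 Genomes page for the current location. The dataset is public, so no credentials are required.

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Public bucket name as documented on the AWS 1000 Genomes page;
    # verify it against the current project pages before relying on it.
    BUCKET = "1000genomes"

    # Anonymous (unsigned) access: the dataset is public.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    # List a few top-level prefixes and objects as a quick smoke test.
    response = s3.list_objects_v2(Bucket=BUCKET, Delimiter="/", MaxKeys=20)
    for prefix in response.get("CommonPrefixes", []):
        print(prefix["Prefix"])
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])

Only the listing traffic leaves AWS here; the point of co-locating the data with EC2 is that the heavy computation can run next to the bucket instead of after a 200 TB download.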
