Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

July 16, 2012

International BASP Frontiers Workshop 2013

Filed under: Astroinformatics,Biomedical,Conferences,Signal/Collect — Patrick Durusau @ 1:28 pm

International BASP Frontiers Workshop 2013

January 27th – February 1st, 2013 Villars-sur-Ollon (Switzerland)

The international biomedical and astronomical signal processing (BASP) Frontiers workshop was created to promote synergies between selected topics in astronomy and biomedical sciences, around common challenges for signal processing.

The 2013 workshop will concentrate on the themes of sparse signal sampling and reconstruction, for radio interferometry and MRI, but also open its floor to many other interesting hot topics in theoretical, astrophysical, and biomedical signal processing.

Signal processing is one form of “big data” and is rich in subjects, both in the literature and in the data.

Proceedings from the first BASP workshop are available. Be advised it is a 354 MB zip file. If you aren’t on an airport wifi, you can find those proceedings here.

July 15, 2012

The Ontology for Biomedical Investigations (OBI)

Filed under: Bioinformatics,Biomedical,Medical Informatics,Ontology — Patrick Durusau @ 9:40 am

The Ontology for Biomedical Investigations (OBI)

From the webpage:

The Ontology for Biomedical Investigations (OBI) project is developing an integrated ontology for the description of biological and clinical investigations. This includes a set of ‘universal’ terms, that are applicable across various biological and technological domains, and domain-specific terms relevant only to a given domain. This ontology will support the consistent annotation of biomedical investigations, regardless of the particular field of study. The ontology will represent the design of an investigation, the protocols and instrumentation used, the material used, the data generated and the type of analysis performed on it. Currently OBI is being built under the Basic Formal Ontology (BFO).

  • Develop an Ontology for Biomedical Investigations in collaboration with groups representing different biological and technological domains involved in Biomedical Investigations
  • Make OBI compatible with other bio-ontologies
  • Develop OBI using an open source approach
  • Create a valuable resource for the biomedical communities to provide a source of terms for consistent annotation of investigations

An ontology that will be of interest if you are integrating biomedical materials.

At least as a starting point.

My listings of ontologies, vocabularies, etc., are woefully incomplete for any field and represent, at best, starting points for your own, more comprehensive investigations. If you do find these starting points useful, please send pointers to your more complete investigations for any field.

Functional Genomics Data Society – FGED

Filed under: Bioinformatics,Biomedical,Functional Genomics — Patrick Durusau @ 9:29 am

Functional Genomics Data Society – FGED

While searching out the MAGE-TAB standard, I found:

The Functional Genomics Data Society – FGED Society, founded in 1999 as the MGED Society, advocates for open access to genomic data sets and works towards providing concrete solutions to achieve this. Our goal is to assure that investment in functional genomics data generates the maximum public benefit. Our work on defining minimum information specifications for reporting data in functional genomics papers have already enabled large data sets to be used and reused to their greater potential in biological and medical research.

We work with other organisations to develop standards for biological research data quality, annotation and exchange. We facilitate the creation and use of software tools that build on these standards and allow researchers to annotate and share their data easily. We promote scientific discovery that is driven by genome wide and other biological research data integration and meta-analysis.

Home of:

Along with links to other resources and collaborations.

ISA-TAB

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 9:06 am

ISA-TAB format page at SourceForge.

Where you will find:

ISA-TAB 1.0 – Candidate release (PDF file)

Example ISA-TAB files.

ISAValidator

Abstract from ISA-TAB 1.0:

This document describes ISA-TAB, a general purpose framework with which to capture and communicate the complex metadata required to interpret experiments employing combinations of technologies, and the associated data files. Sections 1 to 3 introduce the ISA-TAB proposal, describe the rationale behind its development, provide an overview of its structure and relate it to other formats. Section 4 describes the specification in detail; section 5 provides examples of design patterns.

ISA-TAB builds on the existing paradigm that is MAGE-TAB – a tab-delimited format to exchange microarray data. ISA-TAB necessarily maintains backward compatibility with existing MAGE-TAB files to facilitate adoption; conserving the simplicity of MAGE-TAB for simple experimental designs, while incorporating new features to capture the full complexity of experiments employing a combination of technologies. Like MAGE-TAB before it, ISA-TAB is simply a format; the decision on how to regulate its use (i.e. enforcing completion of mandatory fields or use of a controlled terminology) is a matter for those communities, which will implement the format in their systems and for which submission and exchange of minimal information is critical. In this case, an additional layer of constraints should be agreed and required on top of the ISA-TAB specification.

Knowledge of the MAGE-TAB format is required, on which see: MAGE-TAB.
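To make the tab-delimited idea concrete, here is a minimal Python sketch that reads an ISA-TAB-style file with the standard csv module. The file name and column headings are illustrative assumptions, not taken from the specification.

```python
import csv

def read_isatab_like(path):
    """Read a tab-delimited ISA-TAB-style file into a list of row dicts.

    Assumes the first line holds column headers, e.g. 'Sample Name',
    'Protocol REF', 'Raw Data File' (illustrative names only).
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [dict(row) for row in reader]

if __name__ == "__main__":
    for row in read_isatab_like("a_example_assay.txt"):
        # Each row ties a sample to the protocols and data files used on it.
        print(row.get("Sample Name"), "->", row.get("Raw Data File"))
```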

As terminologies/vocabularies/ontologies evolve, ISA-TAB formatted files are a good example of targets for topic maps.

Researchers can continue their use of ISA-TAB formatted files undisturbed by changes in terminology, vocabulary or even ontology due to the semantic navigation layer provided by topic maps.

Or perhaps more correctly, one researcher or librarian can create a mapping of such changes that benefit all the other members of their lab.

GigaScience

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 8:32 am

GigaScience

From the description:

GigaScience is a new integrated database and journal co-published in collaboration between BGI Shenzhen and BioMed Central, to meet the needs of a new generation of biological and biomedical research as it enters the era of “big-data.” BGI (formerly known as Beijing Genomics Institute) was founded in 1999 and has since become the largest genomic organization in the world and has a proven track record of innovative, high profile research.

To achieve its goals, GigaScience has developed a novel publishing format that integrates manuscript publication with a database that will provide DOI assignment to every dataset. Supporting the open-data movement, we require that all supporting data and source code be publically available in a suitable public repository and/or under a public domain CC0 license in the BGI GigaScience database. Using the BGI cloud as a test environment, we also consider open-source software tools/methods for the analysis or handling of large-scale data. When submitting a manuscript, please contact us if you have datasets or cloud applications you would like us to host. To maximize data usability submitters are encouraged to follow best practice for metadata reporting and are given the opportunity to submit in ISA-Tab format.

A new journal to watch. One of the early articles is accompanied by an 83 GB data file.

Doing a separate post on the ISA-Tab format.

While I write that, imagine a format that carries with it known subject mappings into the literature. Or references to subject mappings into the literature.

July 14, 2012

Journal of Data Mining in Genomics and Proteomics

Filed under: Bioinformatics,Biomedical,Data Mining,Genome,Medical Informatics,Proteomics — Patrick Durusau @ 12:20 pm

Journal of Data Mining in Genomics and Proteomics

From the Aims and Scope page:

Journal of Data Mining in Genomics & Proteomics (JDMGP), a broad-based journal was founded on two key tenets: To publish the most exciting researches with respect to the subjects of Proteomics & Genomics. Secondly, to provide a rapid turn-around time possible for reviewing and publishing, and to disseminate the articles freely for research, teaching and reference purposes.

In today’s wired world information is available at the click of the button, curtsey the Internet. JDMGP-Open Access gives a worldwide audience larger than that of any subscription-based journal in OMICS field, no matter how prestigious or popular, and probably increases the visibility and impact of published work. JDMGP-Open Access gives barrier-free access to the literature for research. It increases convenience, reach, and retrieval power. Free online literature is available for software that facilitates full-text searching, indexing, mining, summarizing, translating, querying, linking, recommending, alerting, “mash-ups” and other forms of processing and analysis. JDMGP-Open Access puts rich and poor on an equal footing for these key resources and eliminates the need for permissions to reproduce and distribute content.

A publication (among many) from the OMICS Publishing Group, which sponsors a large number of online publications.

Has the potential to be an interesting source of information. Not much in the way of back files but then it is a very young journal.

July 8, 2012

MicrobeDB: a locally maintainable database of microbial genomic sequences

Filed under: Bioinformatics,Biomedical,Database,Genome,MySQL — Patrick Durusau @ 3:54 pm

MicrobeDB: a locally maintainable database of microbial genomic sequences by Morgan G. I. Langille, Matthew R. Laird, William W. L. Hsiao, Terry A. Chiu, Jonathan A. Eisen, and Fiona S. L. Brinkman. (Bioinformatics (2012) 28 (14): 1947-1948. doi: 10.1093/bioinformatics/bts273)

Abstract

Summary: Analysis of microbial genomes often requires the general organization and comparison of tens to thousands of genomes both from public repositories and unpublished sources. MicrobeDB provides a foundation for such projects by the automation of downloading published, completed bacterial and archaeal genomes from key sources, parsing annotations of all genomes (both public and private) into a local database, and allowing interaction with the database through an easy to use programming interface. MicrobeDB creates a simple to use, easy to maintain, centralized local resource for various large-scale comparative genomic analyses and a back-end for future microbial application design.

Availability: MicrobeDB is freely available under the GNU-GPL at: http://github.com/mlangill/microbedb/

No doubt a useful project but the article seems to be at war with itself:

Although many of these centers provide genomic data in a variety of static formats such as Genbank and Fasta, these are often inadequate for complex queries. To carry out these analyses efficiently, a relational database such as MySQL (http://mysql.com) can be used to allow rapid querying across many genomes at once. Some existing data providers such as CMR allow downloading of their database files directly, but these databases are designed for large web-based infrastructures and contain numerous tables that demand a steep learning curve. Also, addition of unpublished genomes to these databases is often not supported. A well known and widely used system is the Generic Model Organism Database (GMOD) project (http://gmod.org). GMOD is an open-source project that provides a common platform for building model organism databases such as FlyBase (McQuilton et al., 2011) and WormBase (Yook et al., 2011). GMOD supports a variety of options such as GBrowse (Stein et al., 2002) and a variety of database choices including Chado (Mungall and Emmert, 2007) and BioSQL (http://biosql.org). GMOD provides a comprehensive system, but for many researchers such a complex system is not needed.

On one hand, current solutions are “…often inadequate for complex queries” and just a few lines later, “…such a complex system is not needed.”

I have no doubt that using unfamiliar and complex table structures is a burden on any user. Not to mention lacking the ability to add “unpublished genomes” or to fix versions of data for analysis.

What concerns me is the “solution” being seen as yet another set of “local” options, which impedes the future use of the now “localized” data.

The issues raised here need to be addressed, but one-off solutions seem like a particularly poor choice.
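For a sense of what “rapid querying across many genomes at once” can look like against a local relational store, here is a toy sqlite3 sketch. The schema and data are invented for illustration and are not MicrobeDB’s actual table layout.

```python
import sqlite3

# Toy schema, not MicrobeDB's real one: one table of genomes, one of genes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genome (genome_id INTEGER PRIMARY KEY, organism TEXT, gc_content REAL);
CREATE TABLE gene   (gene_id INTEGER PRIMARY KEY, genome_id INTEGER, name TEXT, length INTEGER);
""")
conn.executemany("INSERT INTO genome VALUES (?, ?, ?)",
                 [(1, "Escherichia coli", 50.8), (2, "Bacillus subtilis", 43.5)])
conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?)",
                 [(1, 1, "recA", 1062), (2, 2, "recA", 1047), (3, 1, "lacZ", 3075)])

# One query across all genomes at once: average gene length per organism.
for organism, avg_len in conn.execute("""
    SELECT g.organism, AVG(gn.length)
    FROM genome g JOIN gene gn ON gn.genome_id = g.genome_id
    GROUP BY g.organism
"""):
    print(organism, round(avg_len, 1))
```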

July 7, 2012

Genome-scale analysis of interaction dynamics reveals organization of biological networks

Filed under: Bioinformatics,Biomedical,Genome,Graphs,Networks — Patrick Durusau @ 5:25 am

Genome-scale analysis of interaction dynamics reveals organization of biological networks by Jishnu Das, Jaaved Mohammed, and Haiyuan Yu. (Bioinformatics (2012) 28 (14): 1873-1878. doi: 10.1093/bioinformatics/bts283)

Summary:

Analyzing large-scale interaction networks has generated numerous insights in systems biology. However, such studies have primarily been focused on highly co-expressed, stable interactions. Most transient interactions that carry out equally important functions, especially in signal transduction pathways, are yet to be elucidated and are often wrongly discarded as false positives. Here, we revisit a previously described Smith–Waterman-like dynamic programming algorithm and use it to distinguish stable and transient interactions on a genomic scale in human and yeast. We find that in biological networks, transient interactions are key links topologically connecting tightly regulated functional modules formed by stable interactions and are essential to maintaining the integrity of cellular networks. We also perform a systematic analysis of interaction dynamics across different technologies and find that high-throughput yeast two-hybrid is the only available technology for detecting transient interactions on a large scale.

Research of obvious importance to anyone investigating biological networks, but I mention it for the problem it raises: how do we represent transient relationships/interactions in a network?

Assuming a graph/network topology, how does a transient relationship impact a path traversal?

Assuming a graph/network topology, do we ignore the transience for graph theoretical properties such as shortest path?

Do we need graph theoretical queries versus biological network queries? Are the results always the same?

Can transient relationships result in transient properties? How do we record those?

Better yet, how do we ignore transient properties and under what conditions? (Leaving to one side how we would formally/computationally accomplish that ignorance.) What are the theoretical issues?
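As a toy illustration of the questions above, here is a networkx sketch that tags interactions as stable or transient and compares a shortest-path query with and without the transient edges. The mini-network is invented.

```python
import networkx as nx

# Invented mini-network: edges carry a 'transient' flag.
G = nx.Graph()
G.add_edge("A", "B", transient=False)
G.add_edge("B", "E", transient=False)
G.add_edge("E", "C", transient=False)
G.add_edge("A", "D", transient=True)   # transient links bridging two modules
G.add_edge("D", "C", transient=True)

# Path traversal over the full network (transient edges included).
print(nx.shortest_path(G, "A", "C"))          # ['A', 'D', 'C']

# Same query ignoring transience: keep only the stable interactions.
stable = nx.subgraph_view(G, filter_edge=lambda u, v: not G[u][v]["transient"])
print(nx.shortest_path(stable, "A", "C"))     # ['A', 'B', 'E', 'C']
```

With these made-up edges, the transient links shorten the path from A to C; filtering them out changes the answer, which is exactly the modeling decision the questions above are asking about.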

You can find the full text of this article at Professor Yu’s site: http://yulab.icmb.cornell.edu/PDF/Das_B2012.pdf

July 6, 2012

Tutorial on biological networks [The Heterogeneity of Nature]

Filed under: Bioinformatics,Biomedical,Graphs,Heterogeneous Data,Networks — Patrick Durusau @ 3:54 pm

Tutorial on biological networks by Francisco G. Vital-Lopez, Vesna Memišević, and Bhaskar Dutta. (Vital-Lopez, F. G., Memišević, V. and Dutta, B. (2012), Tutorial on biological networks. WIREs Data Mining Knowl Discov, 2: 298–325. doi: 10.1002/widm.1061)

Abstract:

Understanding how the functioning of a biological system emerges from the interactions among its components is a long-standing goal of network science. Fomented by developments in high-throughput technologies to characterize biomolecules and their interactions, network science has emerged as one of the fastest growing areas in computational and systems biology research. Although the number of research and review articles on different aspects of network science is increasing, updated resources that provide a broad, yet concise, review of this area in the context of systems biology are few. The objective of this article is to provide an overview of the research on biological networks to a general audience, who have some knowledge of biology and statistics, but are not necessarily familiar with this research field. Based on the different aspects of network science research, the article is broadly divided into four sections: (1) network construction, (2) topological analysis, (3) network and data integration, and (4) visualization tools. We specifically focused on the most widely studied types of biological networks, which are, metabolic, gene regulatory, protein–protein interaction, genetic interaction, and signaling networks. In future, with further developments on experimental and computational methods, we expect that the analysis of biological networks will assume a leading role in basic and translational research.

This article is a frozen artifact in time, so I suggest reading it before it is too badly out of date. It will be sad to see it ravaged by time and pitted by later research that renders entire sections obsolete. Or of interest only to medical literature spelunkers of some future time.

Developers of homogeneous and “correct” models of biological networks should take warning from the closing lines of this survey article:

Currently different types of networks, such as PPI, GRN, or metabolic networks are analyzed separately. These heterogeneous networks have to be integrated systematically to generate comprehensive network, which creates a realistic representation of biological systems.[cite omitted] The integrated networks have to be combined with different types of molecular profiling data that measures different facades of the biological system. A recent multi institutional collaborative project, named The Cancer Genome Atlas,[cite omitted] has already started generating much multi-‘omics’ data for large cancer patient cohorts. Thus, we can expect to witness an exciting and fast paced growth on biological network research in the coming years.

Interesting.

Nature uses heterogeneous networks, with great success.

We can keep building homogeneous networks or we can start building heterogeneous networks (at least to the extent we are capable).

What do you think?

July 3, 2012

The Science Network of Medical Data Mining

Filed under: Biomedical,Data Mining — Patrick Durusau @ 4:14 pm

The Science Network of Medical Data Mining

From the description of Unit 1:

Bar-Ilan University & The Chaim Sheba Medical Center – The Biomedical Informatics Program – The Science Network of Medical Data Mining

Course 80-665 – Medical Data Mining Spring, 2012

Lecturer: Dr. Ronen Tal-Botzer

Lectures as of today:

  • Unit 01 – Introduction & Scientific Background
  • Unit 02 – From Data to Information to Knowledge
  • Unit 03 – From Knowledge to Wisdom to Decision
  • Unit 04 – The Electronic Medical Record
  • Unit 05 – Artificial Intelligence in Medicine – Part A
  • Unit 06 – Science Network A: System Requirement Description

An enthusiastic lecturer, which counts for a lot!

The presentation of medical information as intertwined with data mining strikes me as a sound approach. Assuming students are grounded in medical information (or some other field), adding data mining is an extension of the familiar.

June 29, 2012

MuteinDB

Filed under: Bioinformatics,Biomedical,Genome,XML — Patrick Durusau @ 3:16 pm

MuteinDB: the mutein database linking substrates, products and enzymatic reactions directly with genetic variants of enzymes by Andreas Braun, Bettina Halwachs, Martina Geier, Katrin Weinhandl, Michael Guggemos, Jan Marienhagen, Anna J. Ruff, Ulrich Schwaneberg, Vincent Rabin, Daniel E. Torres Pazmiño, Gerhard G. Thallinger, and Anton Glieder.

Abstract:

Mutational events as well as the selection of the optimal variant are essential steps in the evolution of living organisms. The same principle is used in laboratory to extend the natural biodiversity to obtain better catalysts for applications in biomanufacturing or for improved biopharmaceuticals. Furthermore, single mutation in genes of drug-metabolizing enzymes can also result in dramatic changes in pharmacokinetics. These changes are a major cause of patient-specific drug responses and are, therefore, the molecular basis for personalized medicine. MuteinDB systematically links laboratory-generated enzyme variants (muteins) and natural isoforms with their biochemical properties including kinetic data of catalyzed reactions. Detailed information about kinetic characteristics of muteins is available in a systematic way and searchable for known mutations and catalyzed reactions as well as their substrates and known products. MuteinDB is broadly applicable to any known protein and their variants and makes mutagenesis and biochemical data searchable and comparable in a simple and easy-to-use manner. For the import of new mutein data, a simple, standardized, spreadsheet-based data format has been defined. To demonstrate the broad applicability of the MuteinDB, first data sets have been incorporated for selected cytochrome P450 enzymes as well as for nitrilases and peroxidases.

Database URL: http://www.muteindb.org/

Why is this relevant to topic maps or semantic diversity you ask?

I will let the authors answer:

Information about specific proteins and their muteins are widely spread in the literature. Many studies only describe single mutation and its effects without comparison to already known muteins. Possible additive effects of single amino acid changes are scarcely described or used. Even after a thorough and time-consuming literature search, researchers face the problem of assembling and presenting the data in an easy understandable and comprehensive way. Essential information may be lost such as details about potentially cooperative mutations or reactions one would not expect in certain protein families. Therefore, a web-accessible database combining available knowledge about a specific enzyme and its muteins in a single place are highly desirable. Such a database would allow researchers to access relevant information about their protein of interest in a fast and easy way and accelerate the engineering of new and improved variants. (Third paragraph of the introduction)

I would have never dreamed that gene data would be spread to Hell and back. 😉

The article will give you insight into how gene data is collected, searched, organized, etc. All of which will be valuable to you whether you are designing or using information systems in this area.

I was a bit let down when I read about data formats:

Most of them are XML based, which can be difficult to create and manipulate. Therefore, simpler, spreadsheet-based formats have been introduced which are more accessible for the individual researcher.

I’ve never had any difficulties with XML-based formats, but will admit that may not be a universal experience. It sounds to me like the XML community should concentrate a bit less on making people write angle-bang syntax and more on long-term useful results. (Which I think XML can deliver.)
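For what it’s worth, here is a minimal sketch of reading the same invented mutein record from a tab-delimited row and from XML with the Python standard library; the field names are made up and are not MuteinDB’s import format.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented record, once as a tab-delimited row, once as XML.
tsv = "enzyme\tmutation\tkcat\nCYP102A1\tF87A\t12.5\n"
xml = "<mutein enzyme='CYP102A1' mutation='F87A'><kcat>12.5</kcat></mutein>"

# Spreadsheet-style parsing.
row = next(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
print(row["enzyme"], row["mutation"], float(row["kcat"]))

# XML parsing of the equivalent record.
elem = ET.fromstring(xml)
print(elem.get("enzyme"), elem.get("mutation"), float(elem.findtext("kcat")))
```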

June 25, 2012

In the red corner – PubMed and in the blue corner – Google Scholar

Filed under: Bioinformatics,Biomedical,PubMed,Search Engines,Searching — Patrick Durusau @ 7:40 pm

Medical literature searches: a comparison of PubMed and Google Scholar by Eva Nourbakhsh, Rebecca Nugent, Helen Wang, Cihan Cevik and Kenneth Nugent. (Health Information & Libraries Journal, Article first published online: 19 JUN 2012)

From the abstract:

Background

Medical literature searches provide critical information for clinicians. However, the best strategy for identifying relevant high-quality literature is unknown.

Objectives

We compared search results using PubMed and Google Scholar on four clinical questions and analysed these results with respect to article relevance and quality.

Methods

Abstracts from the first 20 citations for each search were classified into three relevance categories. We used the weighted kappa statistic to analyse reviewer agreement and nonparametric rank tests to compare the number of citations for each article and the corresponding journals’ impact factors.

Results

Reviewers ranked 67.6% of PubMed articles and 80% of Google Scholar articles as at least possibly relevant (P = 0.116) with high agreement (all kappa P-values < 0.01). Google Scholar articles had a higher median number of citations (34 vs. 1.5, P < 0.0001) and came from higher impact factor journals (5.17 vs. 3.55, P = 0.036).

Conclusions

PubMed searches and Google Scholar searches often identify different articles. In this study, Google Scholar articles were more likely to be classified as relevant, had higher numbers of citations and were published in higher impact factor journals. The identification of frequently cited articles using Google Scholar for searches probably has value for initial literature searches.
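As a rough sketch of the two statistics named in the Methods, here is how one might compute a weighted kappa for two reviewers and a nonparametric rank test on citation counts with scikit-learn and SciPy; all of the numbers are made up, not the study’s data.

```python
from sklearn.metrics import cohen_kappa_score
from scipy.stats import mannwhitneyu

# Made-up relevance ratings (0 = not, 1 = possibly, 2 = definitely relevant).
reviewer_1 = [2, 1, 0, 2, 1, 1, 0, 2, 2, 1]
reviewer_2 = [2, 1, 1, 2, 1, 0, 0, 2, 1, 1]
kappa = cohen_kappa_score(reviewer_1, reviewer_2, weights="linear")
print(f"weighted kappa: {kappa:.2f}")

# Made-up citation counts for the top hits from each engine.
pubmed_citations  = [0, 1, 3, 1, 0, 2, 5, 1, 0, 4]
scholar_citations = [34, 12, 50, 8, 20, 3, 41, 15, 9, 27]
stat, p = mannwhitneyu(pubmed_citations, scholar_citations, alternative="two-sided")
print(f"Mann-Whitney U = {stat}, p = {p:.4f}")
```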

I have several concerns that may or may not be allayed by further investigation:

  • Four queries seem like an inadequate basis for evaluation. Not that I expect to see one “winner” and one “loser,” but I am more concerned with what led to the differences in results.
  • It is unclear why a citation from a journal with a higher impact factor is superior to one from a journal with a lower impact factor. I assume the point of the query is to obtain a useful result (in the sense of medical treatment, not tenure).
  • Neither system enabled users to build upon the query experience of prior users with a similar query.
  • Neither system enabled users to avoid re-reading the same texts as others had read before them.

Thoughts?

Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE

Filed under: Bioinformatics,Biomedical,Text Mining — Patrick Durusau @ 7:15 pm

Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE by Neveol, A., Wilbur, W. J., Lu, Z.

Abstract:

High-throughput experiments and bioinformatics techniques are creating an exploding volume of data that are becoming overwhelming to keep track of for biologists and researchers who need to access, analyze and process existing data. Much of the available data are being deposited in specialized databases, such as the Gene Expression Omnibus (GEO) for microarrays or the Protein Data Bank (PDB) for protein structures and coordinates. Data sets are also being described by their authors in publications archived in literature databases such as MEDLINE and PubMed Central. Currently, the curation of links between biological databases and the literature mainly relies on manual labour, which makes it a time-consuming and daunting task. Herein, we analysed the current state of link curation between GEO, PDB and MEDLINE. We found that the link curation is heterogeneous depending on the sources and databases involved, and that overlap between sources is low, <50% for PDB and GEO. Furthermore, we showed that text-mining tools can automatically provide valuable evidence to help curators broaden the scope of articles and database entries that they review. As a result, we made recommendations to improve the coverage of curated links, as well as the consistency of information available from different databases while maintaining high-quality curation.

Database URLs: MEDLINE http://www.ncbi.nlm.nih.gov/PubMed, GEO http://www.ncbi.nlm.nih.gov/geo/, PDB http://www.rcsb.org/pdb/.

A good illustration of the use of automated means to augment the capacity of curators of data links.

Or topic map authors performing the same task.

June 22, 2012

Sage Bionetworks and Amazon SWF

Sage Bionetworks and Amazon SWF

From the post:

Over the past couple of decades the medical research community has witnessed a huge increase in the creation of genetic and other bio molecular data on human patients. However, their ability to meaningfully interpret this information and translate it into advances in patient care has been much more modest. The difficulty of accessing, understanding, and reusing data, analysis methods, or disease models across multiple labs with complimentary expertise is a major barrier to the effective interpretation of genomic data. Sage Bionetworks is a non-profit biomedical research organization that seeks to revolutionize the way researchers work together by catalyzing a shift to an open, transparent research environment. Such a shift would benefit future patients by accelerating development of disease treatments, and society as a whole by reducing costs and efficacy of health care.

To drive collaboration among researchers, Sage Bionetworks built an on-line environment, called Synapse. Synapse hosts clinical-genomic datasets and provides researchers with a platform for collaborative analyses. Just like GitHub and Source Forge provide tools and shared code for software engineers, Synapse provides a shared compute space and suite of analysis tools for researchers. Synapse leverages a variety of AWS products to handle basic infrastructure tasks, which has freed the Sage Bionetworks development team to focus on the most scientifically-relevant and unique aspects of their application.

Amazon Simple Workflow Service (Amazon SWF) is a key technology leveraged in Synapse. Synapse relies on Amazon SWF to orchestrate complex, heterogeneous scientific workflows. Michael Kellen, Director of Technology for Sage Bionetworks states, “SWF allowed us to quickly decompose analysis pipelines in an orderly way by separating state transition logic from the actual activities in each step of the pipeline. This allowed software engineers to work on the state transition logic and our scientists to implement the activities, all at the same time. Moreover by using Amazon SWF, Synapse is able to use a heterogeneity of computing resources including our servers hosted in-house, shared infrastructure hosted at our partners’ sites, and public resources, such as Amazon’s Elastic Compute Cloud (Amazon EC2). This gives us immense flexibility is where we run computational jobs which enables Synapse to leverage the right combination of infrastructure for every project.”

The Sage Bionetworks case study (above) and another one, NASA JPL and Amazon SWF, will get you excited about reaching out to the documentation on Amazon Simple Workflow Service (Amazon SWF).

In ways that presentations consisting of slides read aloud about the management advantages of Amazon SWF simply can’t. At least not for me.

Take the tip and follow the case studies, then onto the documentation.

Full disclosure: I have always been fascinated by space and really hard bioinformatics problems. And I have less than zero interest in DRM antics over material that, if piped to /dev/null, would raise a user’s IQ.

BADREX: In situ expansion and coreference of biomedical abbreviations using dynamic regular expressions

Filed under: Biomedical,Regexes — Patrick Durusau @ 2:16 pm

BADREX: In situ expansion and coreference of biomedical abbreviations using dynamic regular expressions by Phil Gooch.

Abstract:

BADREX uses dynamically generated regular expressions to annotate term definition-term abbreviation pairs, and corefers unpaired acronyms and abbreviations back to their initial definition in the text. Against the Medstract corpus BADREX achieves precision and recall of 98% and 97%, and against a much larger corpus, 90% and 85%, respectively. BADREX yields improved performance over previous approaches, requires no training data and allows runtime customisation of its input parameters. BADREX is freely available from https://github.com/philgooch/BADREX-Biomedical-Abbreviation-Expander as a plugin for the General Architecture for Text Engineering (GATE) framework and is licensed under the GPLv3.

From the conclusion:

The use of regular expressions dynamically generated from document content yields modestly improved performance over previous approaches to identifying term definition–term abbreviation pairs, with the benefit of providing in-place annotation, expansion and coreference in a single pass. BADREX requires no training data and allows runtime customisation of its input parameters.
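Not BADREX itself, but a toy illustration of the “dynamically generated regular expression” idea: for each abbreviation found in parentheses, build a regex from its letters and search the preceding text for a plausible long form.

```python
import re

def find_definitions(text):
    """Toy long-form finder: for each parenthesised abbreviation, build a
    regex from its letters and search the preceding words for a definition.
    A real system (like BADREX) is far more careful than this."""
    pairs = {}
    for match in re.finditer(r"\(([A-Z][A-Za-z]{1,9})\)", text):
        abbrev = match.group(1)
        # One word per letter, each word starting with that letter.
        pattern = r"\b" + r"\W+".join(ch + r"\w*" for ch in abbrev.lower()) + r"\s*\($"
        window = text[:match.start() + 1]
        hit = re.search(pattern, window, flags=re.IGNORECASE)
        if hit:
            pairs[abbrev] = window[hit.start():].rstrip(" (")
    return pairs

print(find_definitions("Patients with chronic obstructive pulmonary disease (COPD) ..."))
# {'COPD': 'chronic obstructive pulmonary disease'}
```

A real system also handles out-of-order letters, nested parentheses and coreference back to the first definition, which this sketch does not.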

Although not mentioned by the author, a reader can agree/disagree with an expansion as they are reading the text. That could provide faster feedback/correction of the expansions.

Assuming you accept a correct/incorrect view of expansions. I prefer agree/disagree as the more general rule. Correct/incorrect is the result of the application of a specified rule.

June 12, 2012

Network Medicine: Using Visualization to Decode Complex Diseases

Filed under: Bioinformatics,Biomedical,Genome,Graphs,Networks — Patrick Durusau @ 6:26 pm

Network Medicine: Using Visualization to Decode Complex Diseases

From the post:

Albert-László Barabási is a physicist, but maybe best known for his work in the field of network theory. In his TEDMED talk titled “Network Medicine: A Network Based Approach to Decode Complex Diseases” [tedmed.com], Albert-László applies advanced network theory to the field of biology.

Using a metaphor of Manhattan maps, he explains how an all-encompassing map of the relationships between genes, proteins and metabolites can form the key to truly understanding the mechanisms behind many diseases. He further makes the point that diseases should not be divided up into separate organ-based branches of medicine, but rather seen as a tightly interconnected network.

More information and movies at the post (information aesthetics)

Turns out that relationships (can you say graph/network?) are going to be critical in the treatment of disease. (Not treatment of symptoms, treatment of disease.)

June 5, 2012

Capturing…Quantitative and Semantic Information in Radiology Images

Filed under: Biomedical,Ontology — Patrick Durusau @ 7:55 pm

Daniel Rubin from Stanford University on “Capturing and Computer Reasoning with Quantitative and Semantic Information in Radiology Images” at 10:00am PT, Wednesday, June 6.

ABSTRACT:

The use of semantic Web technologies to make the myriad of data in cyberspace accessible to intelligent agents is well established. However, a crucial type of information on the Web–and especially in life sciences–is imaging, which is largely being overlooked in current semantic Web endeavors. We are developing methods and tools to enable the transparent discovery and use of large distributed collections of medical images within hospital information systems and ultimately on the Web. Our approach is to make the human and machine descriptions of image content machine-accessible through “semantic annotation” using ontologies, capturing semantic and quantitative information from images as physicians view them in a manner that minimally affects their current workflow. We exploit new standards for making image contents explicit and publishable on the semantic Web. We will describe tools and methods we are developing and preliminary results using them for response assessment in cancer. While this work is focused on images in the life sciences, it has broader applicability to all images on the Web. Our ultimate goal is to enable semantic integration of images and all the related scientific data pertaining to their content so that physicians and basic scientists can have the best understanding of the biological and physiological significance of image content.

SPEAKER BIO:

Daniel L. Rubin, MD, MS is Assistant Professor of Radiology and Medicine (Biomedical Informatics Research) at Stanford University. He is a Member of the Stanford Cancer Center and the Bio-X interdisciplinary research program. His NIH-funded research program focuses on the intersection of biomedical informatics and imaging science, developing computational methods and applications to extract quantitative information and meaning from clinical, molecular, and imaging data, and to translate these methods into practice through applications to improve diagnostic accuracy and clinical effectiveness. He is Principal Investigator of one of the centers in the National Cancer Institute’s recently-established Quantitative Imaging Network (QIN), Chair of the RadLex Steering Committee of the Radiological Society of North America (RSNA), and Chair of the Informatics Committee of the American College of Radiology Imaging Network (ACRIN). Dr. Rubin has published over 100 scientific publications in biomedical imaging informatics and radiology.

WEBEX DETAILS:
——————————————————-
To start or join the online meeting
——————————————————-
Go to https://stanford.webex.com/stanford/j.php?ED=175352027&UID=481527042&PW=NYjM4OTVlZTFj&RT=MiM0

——————————————————-
Audio conference information
——————————————————-
To receive a call back, provide your phone number when you join the meeting, or call the number below and enter the access code.
Call-in toll number (US/Canada): 1-650-429-3300
Global call-in numbers: https://stanford.webex.com/stanford/globalcallin.php?serviceType=MC&ED=175352027&tollFree=0

Access code: 925 343 903

Whether you are using topic maps for image annotation or mapping between systems of image annotation, this promises to be an interesting presentation.

May 8, 2012

Downloading the XML data from the Exome Variant Server

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 10:44 am

Downloading the XML data from the Exome Variant Server

Pierre Lindenbaum writes:

From EVS: “The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders.”

The NHLBI Exome Sequencing Project provides a download area but I wanted to build a local database for the richer XML data returned by their Web Services (previously described here on my blog). The following Java program sends some XML/SOAP requests to the EVS server for each chromosome using a genomic window of 150000 bp and parses the XML response.
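A rough Python sketch of the same windowing idea follows; the endpoint URL, the SOAP envelope and the element names are placeholders rather than the actual EVS web service, so see Pierre’s post and Java code for the real request format.

```python
import xml.etree.ElementTree as ET
import urllib.request

WINDOW = 150_000  # genomic window size, as in the post

# Approximate chromosome lengths (rounded); illustrative only.
CHROM_LENGTHS = {"1": 249_250_621, "2": 243_199_373, "21": 48_129_895}

def fetch_window(chrom, start, end):
    """Placeholder: POST a SOAP envelope for one window and return parsed XML.
    The URL and envelope below are NOT the real EVS service definition."""
    envelope = f"""<?xml version="1.0"?>
    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Body><getVariants chrom="{chrom}" start="{start}" end="{end}"/></soap:Body>
    </soap:Envelope>"""
    req = urllib.request.Request("http://example.org/evs-soap",  # placeholder URL
                                 data=envelope.encode("utf-8"),
                                 headers={"Content-Type": "text/xml"})
    with urllib.request.urlopen(req) as resp:
        return ET.parse(resp).getroot()

for chrom, length in CHROM_LENGTHS.items():
    for start in range(1, length, WINDOW):
        root = fetch_window(chrom, start, min(start + WINDOW - 1, length))
        # ...parse and store the variant elements from `root` here...
```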

If you are interested in tools that will assist you in populating a genome-centric topic map, Pierre’s blog is an important one to follow.

April 17, 2012

Using the Disease ontology (DO) to map the genes involved in a category of disease

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 7:11 pm

Using the Disease ontology (DO) to map the genes involved in a category of disease by Pierre Lindenbaum.

Of particular interest if you are developing topic maps for bioinformatics.

The medical community has created a number of mapping term resources. In this particular case a mapping from the DO (disease ontology) to OMIM (Online Mendelian Inheritance in Man) and to NCBI Gene (Gene).

April 15, 2012

Constructing Case-Control Studies With Hadoop

Filed under: Bioinformatics,Biomedical,Giraph,Hadoop,Medical Informatics — Patrick Durusau @ 7:13 pm

Constructing Case-Control Studies With Hadoop by Josh Wills.

From the post:

San Francisco seems to be having an unusually high number of flu cases/searches this April, and the Cloudera Data Science Team has been hit pretty hard. Our normal activities (working on Crunch, speaking at conferences, finagling a job with the San Francisco Giants) have taken a back seat to bed rest, throat lozenges, and consuming massive quantities of orange juice. But this bit of downtime also gave us an opportunity to focus on solving a large-scale data science problem that helps some of the people who help humanity the most: epidemiologists.

Case-Control Studies

A case-control study is a type of observational study in which a researcher attempts to identify the factors that contribute to a medical condition by comparing a set of subjects who have that condition (the ‘cases’) to a set of subjects who do not have the condition, but otherwise resemble the case subjects (the ‘controls’). They are useful for exploratory analysis because they are relatively cheap to perform, and have led to many important discoveries- most famously, the link between smoking and lung cancer.

Epidemiologists and other researchers now have access to data sets that contain tens of millions of anonymized patient records. Tens of thousands of these patient records may include a particular disease that a researcher would like to analyze. In order to find enough unique control subjects for each case subject, a researcher may need to execute tens of thousands of queries against a database of patient records, and I have spoken to researchers who spend days performing this laborious task. Although they would like to parallelize these queries across multiple machines, there is a constraint that makes this problem a bit more interesting: each control subject may only be matched with at most one case subject. If we parallelize the queries across the case subjects, we need to check to be sure that we didn’t assign a control subject to multiple cases. If we parallelize the queries across the control subjects, we need to be sure that each case subject ends up with a sufficient number of control subjects. In either case, we still need to query the data an arbitrary number of times to ensure that the matching of cases and controls we come up with is feasible, let alone optimal.

Analyzing a case-control study is a problem for a statistician. Constructing a case-control study is a problem for a data scientist.

Great walk through on constructing a case-control study, including the use of the Apache Giraph library.
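As a toy version of the matching constraint described in the post (each control assigned to at most one case), here is a sketch using networkx’s bipartite maximum matching; the subjects and the eligibility rule are invented.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Invented subjects: cases and controls with (age, sex) attributes.
cases    = {"case1": (62, "F"), "case2": (64, "M"), "case3": (70, "F")}
controls = {"ctrl1": (61, "F"), "ctrl2": (63, "M"), "ctrl3": (71, "F"), "ctrl4": (66, "M")}

def eligible(case, ctrl):
    """Toy matching rule: same sex, age within 3 years."""
    return case[1] == ctrl[1] and abs(case[0] - ctrl[0]) <= 3

G = nx.Graph()
G.add_nodes_from(cases, bipartite=0)
G.add_nodes_from(controls, bipartite=1)
for c, cattr in cases.items():
    for k, kattr in controls.items():
        if eligible(cattr, kattr):
            G.add_edge(c, k)

# Maximum matching guarantees each control is assigned to at most one case.
matching = bipartite.maximum_matching(G, top_nodes=set(cases))
print({c: matching[c] for c in cases if c in matching})
```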

April 13, 2012

Operations, machine learning and premature babies

Filed under: Bioinformatics,Biomedical,Machine Learning — Patrick Durusau @ 4:40 pm

Operations, machine learning and premature babies: An astonishing connection between web ops and medical care. By Mike Loukides.

From the post:

Julie Steele and I recently had lunch with Etsy’s John Allspaw and Kellan Elliott-McCrea. I’m not sure how we got there, but we made a connection that was (to me) astonishing between web operations and medical care for premature infants.

I’ve written several times about IBM’s work in neonatal intensive care at the University of Toronto. In any neonatal intensive care unit (NICU), every baby is connected to dozens of monitors. And each monitor is streaming hundreds of readings per second into various data systems. They can generate alerts if anything goes severely out of spec, but in normal operation, they just generate a summary report for the doctor every half hour or so.

IBM discovered that by applying machine learning to the full data stream, they were able to diagnose some dangerous infections a full day before any symptoms were noticeable to a human. That’s amazing in itself, but what’s more important is what they were looking for. I expected them to be looking for telltale spikes or irregularities in the readings: perhaps not serious enough to generate an alarm on their own, but still, the sort of things you’d intuitively expect of a person about to become ill. But according to Anjul Bhambhri, IBM’s Vice President of Big Data, the telltale signal wasn’t spikes or irregularities, but the opposite. There’s a certain normal variation in heart rate, etc., throughout the day, and babies who were about to become sick didn’t exhibit the variation. Their heart rate was too normal; it didn’t change throughout the day as much as it should.

That observation strikes me as revolutionary. It’s easy to detect problems when something goes out of spec: If you have a fever, you know you’re sick. But how do you detect problems that don’t set off an alarm? How many diseases have early symptoms that are too subtle for a human to notice, and only accessible to a machine learning system that can sift through gigabytes of data?

The post goes on to discuss how our servers may exhibit behaviors that machine learning could recognize but that we can’t specify.

That may be Rumsfeld’s “unknown unknowns,” however much we all laughed at the time.

There are “unknown unknowns” and tireless machine learning may be the only way to identify them.

In topic map lingo, I would say there are subjects that we haven’t yet learned to recognize.
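A minimal sketch of the “too normal” signal described above: compute the rolling variability of a vital-sign stream and flag windows where it drops below a threshold. The data and the threshold are fabricated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fabricated heart-rate stream: normal variation, then a suspiciously flat stretch.
normal = 140 + 6 * np.sin(np.linspace(0, 20, 600)) + rng.normal(0, 3, 600)
flat   = 140 + rng.normal(0, 0.5, 200)            # variability collapses
hr = np.concatenate([normal, flat])

WINDOW, THRESHOLD = 60, 1.5                        # samples per window, std threshold
for start in range(0, len(hr) - WINDOW + 1, WINDOW):
    spread = hr[start:start + WINDOW].std()
    if spread < THRESHOLD:
        print(f"window starting at sample {start}: std {spread:.2f} -- too little variation")
```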

April 12, 2012

From Beaker to Bits: Graph Theory Yields Computational Model of Human Tissue

Filed under: Bioinformatics,Biomedical,Graphs — Patrick Durusau @ 7:04 pm

From Beaker to Bits: Graph Theory Yields Computational Model of Human Tissue

An all-too-rare example of how reaching across disciplinary lines can lead to fundamental breakthroughs in more than one area.

First step, alert any graph or data store people you know, along with any medical research types.

Second step, if you are in CS/Math, think about another department that interests you. If you are in other sciences or humanities, strike up a conversation with the CS/Math department types.

In both cases, don’t take “no” or lack of interest as an answer. Talk to the newest faculty or even faculty at other institutions. Or even established companies.

No guarantees that you will strike up a successful collaboration, much less have a successful result. But, we all know how successful a project that never begins will be, don’t we?

Here is a story of a collaborative project that persisted and succeeded:

Computer scientists and biologists in the Data Science Research Center at Rensselaer Polytechnic Institute have developed a rare collaboration between the two very different fields to pick apart a fundamental roadblock to progress in modern medicine. Their unique partnership has uncovered a new computational model called “cell graphs” that links the structure of human tissue to its corresponding biological function. The tool is a promising step in the effort to bring the power of computational science together with traditional biology to the fight against human diseases, such as cancer.

The discovery follows a more than six-year collaboration, breaking ground in both fields. The work will serve as a new method to understand and predict relationships between the cells and tissues in the human body, which is essential to detect, diagnose and treat human disease. It also serves as an important reminder of the power of collaboration in the scientific process.

The new research led by Professor of Biology George Plopper and Professor of Computer Science Bulent Yener is published in the March 30, 2012, edition of the journal PLoS One in a paper titled, “ Coupled Analysis of in Vitro and Histology Tissue Samples to Quantify Structure-Function Relationship.” They were joined in the research by Evrim Acar, a graduate student at Rensselaer in Yener’s lab currently at the University of Copenhagen. The research is funded by the National Institutes of Health and the Villum Foundation.

The new, purely computational tool models the relationship between the structure and function of different tissues in body. As an example of this process, the new paper analyzes the structure and function of healthy and cancerous brain, breast and bone tissues. The model can be used to determine computationally whether a tissue sample is cancerous or not, rather than relying on the human eye as is currently done by pathologists around the world each day. The objective technique can be used to eliminate differences of opinion between doctors and as a training tool for new cancer pathologists, according to Yener and Plopper. The tool also helps fill an important gap in biological knowledge, they said.

BTW, if you want to see all the details: Coupled Analysis of in Vitro and Histology Tissue Samples to Quantify Structure-Function Relationship
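A toy sketch of the “cell graph” idea: treat cells as nodes, link cells that lie within a distance threshold and compute simple structural features that could feed a classifier. The coordinates, the threshold and the feature set are invented; the paper’s actual construction differs.

```python
import itertools
import math
import networkx as nx

# Invented cell centroids from a hypothetical tissue image (x, y in pixels).
cells = {0: (10, 12), 1: (14, 15), 2: (40, 42), 3: (43, 40), 4: (44, 45), 5: (80, 10)}
LINK_DISTANCE = 10.0

G = nx.Graph()
G.add_nodes_from(cells)
for a, b in itertools.combinations(cells, 2):
    if math.dist(cells[a], cells[b]) <= LINK_DISTANCE:
        G.add_edge(a, b)

# A few simple structural features one might feed to a classifier.
features = {
    "average_degree": sum(d for _, d in G.degree()) / G.number_of_nodes(),
    "clustering": nx.average_clustering(G),
    "components": nx.number_connected_components(G),
}
print(features)
```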

April 8, 2012

Indexing the content of Gene Ontology with apache SOLR

Filed under: Bioinformatics,Biomedical,Gene Ontology,Solr — Patrick Durusau @ 4:21 pm

Indexing the content of Gene Ontology with apache SOLR by Pierre Lindenbaum.

Pierre walks you through the use of Solr to index GeneOntology. As with all of his work, impressive!

Of course, one awesome post deserves another! So Pierre follows with:

Apache SOLR and GeneOntology: Creating the JQUERY-UI client (with autocompletion)

So you get to learn JQuery/UI stuff as well.
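A bare-bones sketch of pushing term documents to a Solr core over the JSON update handler with the requests library; the core name, field names and example GO terms are assumptions rather than Pierre’s actual schema.

```python
import requests

SOLR_UPDATE = "http://localhost:8983/solr/geneontology/update"   # core name assumed

docs = [
    {"id": "GO:0006915", "name": "apoptotic process", "namespace": "biological_process"},
    {"id": "GO:0005634", "name": "nucleus", "namespace": "cellular_component"},
]

# Post the documents and commit; the update handler accepts a JSON array of docs.
resp = requests.post(SOLR_UPDATE, json=docs, params={"commit": "true"})
resp.raise_for_status()

# Query it back, e.g. for autocompletion experiments.
resp = requests.get("http://localhost:8983/solr/geneontology/select",
                    params={"q": "name:apopt*", "wt": "json"})
print(resp.json()["response"]["numFound"])
```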

April 2, 2012

The 1000 Genomes Project

The 1000 Genomes Project

If Amazon is hosting a single dataset > 200 TB, is your data “big data?” 😉

This merits quoting in full:

We're very pleased to welcome the 1000 Genomes Project data to Amazon S3. 

The original human genome project was a huge undertaking. It aimed to identify every letter of our genetic code, 3 billion DNA bases in total, to help guide our understanding of human biology. The project ran for over a decade, cost billions of dollars and became the corner stone of modern genomics. The techniques and tools developed for the human genome were also put into practice in sequencing other species, from the mouse to the gorilla, from the hedgehog to the platypus. By comparing the genetic code between species, researchers can identify biologically interesting genetic regions for all species, including us.

A few years ago there was a quantum leap in the technology for sequencing DNA, which drastically reduced the time and cost of identifying genetic code. This offered the promise of being able to compare full genomes from individuals, rather than entire species, leading to a much more detailed genetic map of where we, as individuals, have genetic similarities and differences. This will ultimately give us better insight into human health and disease.

The 1000 Genomes Project, initiated in 2008, is an international public-private consortium that aims to build the most detailed map of human genetic variation available, ultimately with data from the genomes of over 2,661 people from 26 populations around the world. The project began with three pilot studies that assessed strategies for producing a catalog of genetic variants that are present at one percent or greater in the populations studied. We were happy to host the initial pilot data on Amazon S3 in 2010, and today we're making the latest dataset available to all, including results from sequencing the DNA of approximately 1,700 people.

The data is vast (the current set weighs in at over 200Tb), so hosting the data on S3 which is closely located to the computational resources of EC2 means that anyone with an AWS account can start using it in their research, from anywhere with internet access, at any scale, whilst only paying for the compute power they need, as and when they use it. This enables researchers from laboratories of all sizes to start exploring and working with the data straight away. The Cloud BioLinux AMIs are ready to roll with the necessary tools and packages, and are a great place to get going.

Making the data available via a bucket in S3 also means that customers can crunch the information using Hadoop via Elastic MapReduce, and take advantage of the growing collection of tools for running bioinformatics job flows, such as CloudBurst and Crossbow

You can find more information, the location of the data and how to get started using it on our 1000 Genomes web page, or from the project pages.

If that sounds like a lot of data, just imagine all of the recorded mathematical texts and the relationships between the concepts represented in those texts.

It is in our view that data looks smooth or simple. Or complex.
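A small boto3 sketch of browsing the S3-hosted dataset without downloading anything; treat the bucket name and prefix as assumptions to be checked against the 1000 Genomes AWS page.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# An anonymous client is enough for a public dataset; bucket and prefix are assumptions.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(Bucket="1000genomes", Prefix="release/", MaxKeys=20)
for obj in resp.get("Contents", []):
    print(f"{obj['Size']:>14,}  {obj['Key']}")
```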

March 28, 2012

GWAS Central

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 4:22 pm

GWAS Central

From the website:

GWAS Central (previously the Human Genome Variation database of Genotype-to-Phenotype information) is a database of summary level findings from genetic association studies, both large and small. We actively gather datasets from public domain projects, and encourage direct data submission from the community.

GWAS Central is built upon a basal layer of Markers that comprises all known SNPs and other variants from public databases such as dbSNP and the DBGV. Allele and genotype frequency data, plus genetic association significance findings, are added on top of the Marker data, and organised the same way that investigations are reported in typical journal manuscripts. Critically, no individual level genotypes or phenotypes are presented in GWAS Central – only group level aggregated (summary level) data. The largest unit in a data submission is a Study, which can be thought of as being equivalent to one journal article. This may contain one or more Experiments, one or more Sample Panels of test subjects, and one or more Phenotypes. Sample Panels may be characterised in terms of various Phenotypes, and they also may be combined and/or split into Assayed Panels. The Assayed Panels are used as the basis for reporting allele/genotype frequencies (in `Genotype Experiments`) and/or genetic association findings (in ‘Analysis Experiments’). Environmental factors are handled as part of the Sample Panel and Assayed Panel data structures.

Although I mentioned GWAS some time ago, I saw it mentioned in Christophe Lalanne’s Bag of Tweets for March 2012 and, on taking another look, thought I should mention it again.

In part because, as the project description above makes clear, this is an aggregation-level site, not one that reaches into the details of studies, which may or may not matter for some researchers. That aggregation leaves a gap for analysis of the underlying data, plus mapping it to other data!

Openfmri.org

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 4:22 pm

Openfmri.org

From the webpage:

OpenfMRI.org is a project dedicated to the free and open sharing of functional magnetic resonance imaging (fMRI) datasets, including raw data.

Now that’s a data set you don’t see every day!

Not to mention being one that would be ripe to link into medical literature, hospital/physician records, etc.

First seen in Christophe Lalanne’s Bag of Tweets for March, 2012.

March 18, 2012

Drug data reveal sneaky side effects

Filed under: Bioinformatics,Biomedical,Knowledge Economics,Medical Informatics — Patrick Durusau @ 8:54 pm

Drug data reveal sneaky side effects

From the post:

An algorithm designed by US scientists to trawl through a plethora of drug interactions has yielded thousands of previously unknown side effects caused by taking drugs in combination.

The work, published today in Science Translational Medicine [Tatonetti, N. P., Ye, P. P., Daneshjou, R. and Altman, R. B. Sci. Transl. Med. 4, 125ra31 (2012).], provides a way to sort through the hundreds of thousands of ‘adverse events’ reported to the US Food and Drug Administration (FDA) each year. “It’s a step in the direction of a complete catalogue of drug–drug interactions,” says the study’s lead author, Russ Altman, a bioengineer at Stanford University in California.

From later in the post:

The team then used this method to compile a database of 1,332 drugs and possible side effects that were not listed on the labels for those drugs. The algorithm came up with an average of 329 previously unknown adverse events for each drug — far surpassing the average of 69 side effects listed on most drug labels.

Double trouble

The team also compiled a similar database looking at interactions between pairs of drugs, which yielded many more possible side effects than could be attributed to either drug alone. When the data were broken down by drug class, the most striking effect was seen when diuretics called thiazides, often prescribed to treat high blood pressure and oedema, were used in combination with a class of drugs called selective serotonin reuptake inhibitors, used to treat depression. Compared with people who used either drug alone, patients who used both drugs were significantly more likely to experience a heart condition known as prolonged QT, which is associated with an increased risk of irregular heartbeats and sudden death.

A search of electronic medical records from Stanford University Hospital confirmed the relationship between these two drug classes, revealing a roughly 1.5-fold increase in the likelihood of prolonged QT when the drugs were combined, compared to when either drug was taken alone. Altman says that the next step will be to test this finding further, possibly by conducting a clinical trial in which patients are given both drugs and then monitored for prolonged QT.

This data could be marketed to drug companies, trial lawyers (both sides), medical malpractice insurers, etc. This is an example of the data marketing I mentioned in Knowledge Economics II.
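To make the 1.5-fold figure concrete, here is the arithmetic for a relative risk from a 2x2 exposure table; the counts are invented for illustration.

```python
# Invented 2x2 table: rows = drug combination vs. single drug,
# columns = prolonged QT observed / not observed.
combo_qt, combo_ok   = 30, 970     # patients on thiazide + SSRI
single_qt, single_ok = 20, 980     # patients on either drug alone

risk_combo  = combo_qt / (combo_qt + combo_ok)
risk_single = single_qt / (single_qt + single_ok)
relative_risk = risk_combo / risk_single

print(f"risk on combination: {risk_combo:.3f}")
print(f"risk on single drug: {risk_single:.3f}")
print(f"relative risk:       {relative_risk:.2f}")   # 1.5 with these made-up counts
```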

March 6, 2012

Extending the GATK for custom variant comparisons using Clojure

Filed under: Bioinformatics,Biomedical,Clojure,MapReduce — Patrick Durusau @ 8:09 pm

Extending the GATK for custom variant comparisons using Clojure by Brad Chapman.

From the post:

The Genome Analysis Toolkit (GATK) is a full-featured library for dealing with next-generation sequencing data. The open-source Java code base, written by the Genome Sequencing and Analysis Group at the Broad Institute, exposes a Map/Reduce framework allowing developers to code custom tools taking advantage of support for: BAM Alignment files through Picard, BED and other interval file formats through Tribble, and variant data in VCF format.

Here I’ll show how to utilize the GATK API from Clojure, a functional, dynamic programming language that targets the Java Virtual Machine. We’ll:

  • Write a GATK walker that plots variant quality scores using the Map/Reduce API.
  • Create a custom annotation that adds a mean neighboring base quality metric using the GATK VariantAnnotator.
  • Use the VariantContext API to parse and access variant information in a VCF file.

The Clojure variation library is freely available and is part of a larger project to provide variant assessment capabilities for the Archon Genomics XPRIZE competition.

Interesting data, commercial potential, cutting edge technology and subject identity issues galore. What more could you want?
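
The post itself drives the GATK’s Java API from Clojure; as a rough, GATK-free sketch of what the first walker does (walking variant records and collecting quality scores), here is a plain Python version that reads the QUAL column of a VCF directly. The file name is hypothetical, and real work would go through the GATK or a dedicated VCF library rather than hand-parsing.

```python
# A rough, GATK-free sketch of the first walker's job: walk variant records
# and collect their quality scores. "variants.vcf" is a hypothetical input
# file; the post uses the GATK Map/Reduce API from Clojure instead.
from collections import Counter

def variant_quality_scores(vcf_path):
    """Yield the QUAL field (column 6) of each data line in a VCF file."""
    with open(vcf_path) as handle:
        for line in handle:
            if line.startswith("#"):      # skip meta-information and header lines
                continue
            fields = line.rstrip("\n").split("\t")
            qual = fields[5]
            if qual != ".":               # "." marks a missing quality value
                yield float(qual)

if __name__ == "__main__":
    # Bin scores into buckets of ten as a crude stand-in for a histogram plot.
    histogram = Counter(int(q) // 10 * 10 for q in variant_quality_scores("variants.vcf"))
    for bucket in sorted(histogram):
        print(f"QUAL {bucket:>4}-{bucket + 9}: {histogram[bucket]}")
```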

February 16, 2012

Effectopedia

Filed under: Bioinformatics,Biomedical,Collaboration — Patrick Durusau @ 7:03 pm

Effectopedia – An Open Data Project for Collaborative Scientific Research, with the aim of reducing Animal Testing, by Velichka Dimitrova, Coordinator of the Open Economics Working Group, and Hristo Alajdov, Associate Professor at the Institute of Biomedical Engineering at the Bulgarian Academy of Sciences.

From the post:

One of the key problems in natural science research is the lack of effective collaboration. A lot of research is conducted by scientists from different disciplines, yet cross-discipline collaboration is rare. Even within a discipline, research is often duplicated, which wastes resources and valuable scientific potential. Furthermore, without a common framework and context, research that involves animal testing often becomes phenomenological and little or no general knowledge can be gained from it. The peer reviewed publishing process is also not very effective in stimulating scientific collaboration, mainly due to the loss of an underlying machine readable structure for the data and the duration of the process itself.

If research results were more effectively shared and re-used by a wider scientific community – including scientists with different disciplinary backgrounds – many of these problems could be addressed. We could hope to see a more efficient use of resources, an accelerated rate of academic publications, and, ultimately, a reduction in animal testing.

Effectopedia is a project of the International QSAR Foundation. Effectopedia itself is an open knowledge aggregation and collaboration tool that provides a means of describing adverse outcome pathways (AOPs)1 in an encyclopedic manner. Effectopedia defines internal organizational space which helps scientist with different backgrounds to know exactly where their knowledge belongs and aids them in identifying both the larger context of their research and the individual experts who might be actively interested in it. Using automated notifications when researchers create causal linkage between parts of the pathways, they can simultaneously create a valuable contact with a fellow researcher interested in the same topic who might have a different background or perspective towards the subject. Effectopedia allows creation of live scientific documents which are instantly open for focused discussions and feedback whilst giving credit to the original authors and reviewers involved. The review process is never closed and if new evidence arises it can be presented immediately, allowing the information in Effectopedia to remain current, while keeping track of its complete evolution.

Sounds interesting, but there is no link to the Effectopedia website in the post. I followed links a bit and found: Effectopedia at SourceForge.

Apparently still in pre-alpha state.

I remember more than one workspace project, so how do we decide whose identifications/terminology gets used?

Isn’t that the tough nut of collaboration? When scholars (given my background, in biblical studies) decide to collaborate beyond their departments, they form projects, but projects that are less inclusive than all workers in a particular area. The end result is multiple projects with different identifications/terminologies. How do we bridge those gaps?

As you know, my suggestion is that everyone keeps their own identifications/terminologies.

I am curious, though, if everyone does keep their own identifications/terminologies, whether they will be able to read enough of another project’s content to recognize that it is meaningful for their own quest.

That is, a topic map author’s decision that two or more representatives stand for the same subject may not carry over to users of the topic map, who may not share that judgment.
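
A minimal sketch of that suggestion, with invented project names, terms, and subject identifiers: each project keeps its native terminology untouched, and a separate mapping layer records the topic map author’s judgment that two native terms identify the same subject.

```python
# A minimal sketch of "everyone keeps their own identifications/terminologies":
# each project keeps its native term untouched, and a separate mapping layer
# records the author's judgment that two native terms identify the same subject.
# Project names, terms, and subject identifiers are invented for illustration.

# Each project's own vocabulary, left exactly as its authors wrote it.
native_terms = {
    ("project_a", "AOP-17"): "oxidative stress pathway",
    ("project_b", "pathway/ox-stress"): "oxidative stress response",
}

# The topic map author's merging decisions: native term -> shared subject.
same_subject = {
    ("project_a", "AOP-17"): "subject:oxidative-stress",
    ("project_b", "pathway/ox-stress"): "subject:oxidative-stress",
}

def co_identified(term_one, term_two):
    """True if the author mapped both native terms to the same subject."""
    return (term_one in same_subject
            and same_subject.get(term_one) == same_subject.get(term_two))

a = ("project_a", "AOP-17")
b = ("project_b", "pathway/ox-stress")
print(native_terms[a], "<->", native_terms[b], ":", co_identified(a, b))  # True
```

Whether users of the merged map would draw the same equivalence on their own is exactly the open question above.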

February 8, 2012

PSEUDOMARKER: a powerful program for joint linkage…

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 5:13 pm

PSEUDOMARKER: a powerful program for joint linkage and/or linkage disequilibrium analysis on mixtures of singletons and related individuals. By Hiekkalinna T, Schäffer AA, Lambert B, Norrgrann P, Göring HH, Terwilliger JD.

Abstract:

A decade ago, there was widespread enthusiasm for the prospects of genome-wide association studies to identify common variants related to common chronic diseases using samples of unrelated individuals from populations. Although technological advancements allow us to query more than a million SNPs across the genome at low cost, a disappointingly small fraction of the genetic portion of common disease etiology has been uncovered. This has led to the hypothesis that less frequent variants might be involved, stimulating a renaissance of the traditional approach of seeking genes using multiplex families from less diverse populations. However, by using the modern genotyping and sequencing technology, we can now look not just at linkage, but jointly at linkage and linkage disequilibrium (LD) in such samples. Software methods that can look simultaneously at linkage and LD in a powerful and robust manner have been lacking. Most algorithms cannot jointly analyze datasets involving families of varying structures in a statistically or computationally efficient manner. We have implemented previously proposed statistical algorithms in a user-friendly software package, PSEUDOMARKER. This paper is an announcement of this software package. We describe the motivation behind the approach, the statistical methods, and software, and we briefly demonstrate PSEUDOMARKER’s advantages over other packages by example.

I didn’t set out to find this particular article but was trying to update references on CRI-MAP, which is now somewhat dated software for:

… rapid, largely automated construction of multilocus linkage maps (and facilitate the attendant tasks of assessing support relative to alternative locus orders, generating LOD tables, and detecting data errors). Although originally designed to handle codominant loci (e.g. RFLPs) scored on pedigrees “without missing individuals”, such as CEPH or nuclear families, it can now (with some caveats described below) be used on general pedigrees, and some disease loci.

Just as background, you may wish to see:

CRI-MAP – Introduction

And, Multilocus linkage analysis

With multilocus linkage analysis, more than two loci are simultaneously considered for linkage. When mapping a disease gene relative to a group of markers with known intermarker recombination fractions, it is possible to perform parametric (lod score) as well as nonparametric analysis.

My interest is in the use of additional information (in the lead article, “linkage and linkage disequilibrium”) in determining linkage issues.
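
For readers new to the terminology, the lod score mentioned in the excerpt above is the log10 ratio of the likelihood of the observed meioses at a recombination fraction theta to the likelihood at theta = 0.5 (no linkage). A minimal two-point sketch, assuming a fully informative pedigree and hypothetical counts (real packages such as CRI-MAP or PSEUDOMARKER handle missing data, multiple loci, and mixed family structures):

```python
# A minimal two-point lod score sketch: log10 of the likelihood of the observed
# recombinant/non-recombinant counts at recombination fraction theta, relative
# to the likelihood at theta = 0.5 (no linkage). Counts are hypothetical and
# the pedigree is assumed fully informative.
import math

def lod_score(recombinants, non_recombinants, theta):
    n = recombinants + non_recombinants
    log_l_theta = (recombinants * math.log10(theta)
                   + non_recombinants * math.log10(1 - theta))
    log_l_null = n * math.log10(0.5)
    return log_l_theta - log_l_null

# Hypothetical data: 2 recombinants out of 20 informative meioses.
for theta in (0.05, 0.1, 0.2, 0.3):
    print(f"theta={theta:.2f}  lod={lod_score(2, 18, theta):.2f}")
# Peaks near theta = 0.1 with a lod of roughly 3.2 for these counts.
```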

Not that every issue of subject identification needs to be, or should be, probabilistic or richly nuanced.

In a prison there are “free men” and prisoners.

Rather sharp and useful distinction. Doesn’t require a URL. Or a subject identifier. What does your use case require?

