Archive for the ‘Biomedical’ Category

Can You Replicate Your Searches?

Thursday, February 16th, 2017

A comment at PubMed raises the question of replicating reported literature searches:

From the comment:

Melissa Rethlefsen

I thank the authors of this Cochrane review for providing their search strategies in the document Appendix. Upon trying to reproduce the Ovid MEDLINE search strategy, we came across several errors. It is unclear whether these are transcription errors or represent actual errors in the performed search strategy, though likely the former.

For instance, in line 39, the search is “tumour bed boost.sh.kw.ti.ab” [quotes not in original]. The correct syntax would be “tumour bed boost.sh,kw,ti,ab” [no quotes]. The same is true for line 41, where the commas are replaced with periods.

In line 42, the search is “Breast Neoplasms /rt.sh” [quotes not in original]. It is not entirely clear what the authors meant here, but likely they meant to search the MeSH heading Breast Neoplasms with the subheading radiotherapy. If that is the case, the search should have been “Breast Neoplasms/rt” [no quotes].

In lines 43 and 44, it appears as though the authors were trying to search for the MeSH term “Radiotherapy, Conformal” with two different subheadings, which they spell out and end with a subject heading field search (i.e., Radiotherapy, Conformal/adverse events.sh). In Ovid syntax, however, the correct search syntax would be “Radiotherapy, Conformal/ae” [no quotes] without the subheading spelled out and without the extraneous .sh.

In line 47, there is another minor error, again with .sh being extraneously added to the search term “Radiotherapy/” [quotes not in original].

Though these errors are minor and are highly likely to be transcription errors, when attempting to replicate this search, each of these lines produces an error in Ovid. If a searcher is unaware of how to fix these problems, the search becomes unreplicable. Because the search could not have been completed as published, it is unlikely this was actually how the search was performed; however, it is a good case study to examine how even small details matter greatly for reproducibility in search strategies.

A great reminder that replication of searches is a non-trivial task and that search engines are literal to the point of idiocy.
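The error classes Rethlefsen catalogs are regular enough to be caught mechanically. A minimal sketch in Python (the patterns are my own guesses at the two error classes she describes, not an official Ovid grammar):

```python
import re

# Hypothetical lint for the two Ovid error classes quoted above:
# period-separated field lists and an extraneous ".sh" after a
# subheading search. A sketch, not an Ovid parser.
FIELD = r"(?:sh|kw|ti|ab)"

def lint_ovid_line(line: str) -> list[str]:
    problems = []
    # Field lists must be comma-separated (".sh,kw,ti,ab"),
    # not period-separated (".sh.kw.ti.ab").
    if re.search(rf"\.{FIELD}(?:\.{FIELD})+\b", line):
        problems.append("fields separated by periods; use commas: .sh,kw,ti,ab")
    # A subheading search is "Heading/xx"; a trailing ".sh" is extraneous.
    if re.search(r"/\s*\w*\.sh\b", line):
        problems.append("extraneous .sh after a subheading search")
    return problems

for bad in ("tumour bed boost.sh.kw.ti.ab", "Breast Neoplasms /rt.sh"):
    print(bad, "->", lint_ovid_line(bad))
```

Even a crude check like this, run before publication, would have flagged every line the comment complains about.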

Unmet Needs for Analyzing Biological Big Data… [Data Integration #1 – Spells Market Opportunity]

Wednesday, February 15th, 2017

Unmet Needs for Analyzing Biological Big Data: A Survey of 704 NSF Principal Investigators by Lindsay Barone, Jason Williams, David Micklos.

Abstract:

In a 2016 survey of 704 National Science Foundation (NSF) Biological Sciences Directorate principal investigators (BIO PIs), nearly 90% indicated they are currently or will soon be analyzing large data sets. BIO PIs considered a range of computational needs important to their work, including high performance computing (HPC), bioinformatics support, multi-step workflows, updated analysis software, and the ability to store, share, and publish data. Previous studies in the United States and Canada emphasized infrastructure needs. However, BIO PIs said the most pressing unmet needs are training in data integration, data management, and scaling analyses for HPC, acknowledging that data science skills will be required to build a deeper understanding of life. This portends a growing data knowledge gap in biology and challenges institutions and funding agencies to redouble their support for computational training in biology.

In particular, the needs that topic maps can address rank #1, #2, #6, #7, and #10, or as the authors found:

A majority of PIs—across bioinformatics/other disciplines, larger/smaller groups, and the four NSF programs—said their institutions are not meeting nine of 13 needs (Figure 3). Training on integration of multiple data types (89%), on data management and metadata (78%), and on scaling analysis to cloud/HP computing (71%) were the three greatest unmet needs. High performance computing was an unmet need for only 27% of PIs—with similar percentages across disciplines, different sized groups, and NSF programs.

or graphically (figure 3):

So, cloud, distributed, parallel, pipelining, etc., processing is insufficient?

Pushing undocumented and unintegratable data at ever increasing speeds is impressive but gives no joy?

This report will provoke another round of Esperanto fantasies, that is, proposals for “universal” vocabularies which, if used by everyone and back-mapped to all existing literature, would solve the problem.

The number of competing Esperanto fantasies and the cost/delay of back-mapping to legacy data defeat all such efforts. Those defeats haven’t prevented repeated funding of such fantasies in the past and present, and no doubt won’t in the future.

Perhaps those defeats are a question of scope.

That is, rather than even attempting some “universal” interchange of data, why not approach it incrementally?

I suspect the PIs surveyed each had some particular data set in mind when they mentioned data integration (which is itself a very broad term).

Why not seek out, develop, and publish data integrations for particular instances, as opposed to theorizing about what might work for data yet unseen?
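One way to read “incremental” here: publish the mapping for one concrete pair of data sets rather than a universal vocabulary. A sketch with pandas (the data frames and values are invented for illustration; the mapping table itself is the published artifact):

```python
import pandas as pd

# Two labs with different local identifier schemes (hypothetical data).
lab_a = pd.DataFrame({"gene_symbol": ["TP53", "BRCA1"], "expr": [2.1, 0.7]})
lab_b = pd.DataFrame({"hgnc_id": ["HGNC:11998", "HGNC:1100"], "count": [40, 12]})

# The incremental integration: a small, reviewable mapping between
# exactly these two identifier schemes, published alongside the data.
mapping = pd.DataFrame({
    "gene_symbol": ["TP53", "BRCA1"],
    "hgnc_id": ["HGNC:11998", "HGNC:1100"],
})

merged = lab_a.merge(mapping, on="gene_symbol").merge(lab_b, on="hgnc_id")
print(merged)
```

No one has to agree on a universal gene vocabulary for this to work; the next pair of data sets gets its own mapping, and the mappings accumulate.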

The need topic maps sought to meet remains unmet, with no signs of lessening.

Opportunity knocks. Will we answer?

New Virus Breaks The Rules Of Infection – Cyber Analogies?

Friday, August 26th, 2016

New Virus Breaks The Rules Of Infection by Michaeleen Doucleff.

From the post:

Human viruses are like a fine chocolate truffle: It takes only one to get the full experience.

At least, that’s what scientists thought a few days ago. Now a new study published Thursday is making researchers rethink how some viruses could infect animals.

A team at the U.S. Army Medical Research Institute of Infectious Diseases has found a mosquito virus that’s broken up into pieces. And the mosquito needs to catch several of the pieces to get an infection.

“It’s the most bizarre thing,” says Edward Holmes, a virologist at the University of Sydney, who wasn’t involved in the study. It’s like the virus is dismembered, he says.

“If you compare it to the human body, it’s like a person would have their legs, trunk and arms all in different places,” Holmes says. “Then all the pieces come together in some way to work as one single virus. I don’t think anything else in nature moves this way.”

Also from the post:

These are insect cells infected with the Guaico Culex virus. The different colors denote cells infected with different pieces of the virus. Only the brown-colored cells are infectious, because they contain the complete virus. Michael Lindquist/Cell Press


The full scale image.

How very cool!

Any known analogies in computer viruses?
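For what the analogy would even mean in code: a toy model in which an “infection” completes only once the host has collected every segment (purely illustrative, no claim about real malware or about the Guaico Culex virus’s actual segment count):

```python
# Toy model of a multipartite infection: the payload activates only
# when every segment has been caught. Segment names are invented.
SEGMENTS = frozenset({"S1", "S2", "S3", "S4", "S5"})

def infectious(caught: set[str]) -> bool:
    """True once the host holds a complete set of segments."""
    return SEGMENTS <= caught

print(infectious({"S1", "S3"}))   # False
print(infectious(set(SEGMENTS)))  # True
```

A multi-stage dropper that assembles its payload from separately delivered pieces would be the closest software cousin, though I don’t know of one that requires the pieces to arrive by independent “bites.”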

The Gene Hackers [Chaos Remains King]

Tuesday, November 10th, 2015

The Gene Hackers by Michael Specter.

From the post:

It didn’t take Zhang or other scientists long to realize that, if nature could turn these molecules into the genetic equivalent of a global positioning system, so could we. Researchers soon learned how to create synthetic versions of the RNA guides and program them to deliver their cargo to virtually any cell. Once the enzyme locks onto the matching DNA sequence, it can cut and paste nucleotides with the precision we have come to expect from the search-and-replace function of a word processor. “This was a finding of mind-boggling importance,” Zhang told me. “And it set off a cascade of experiments that have transformed genetic research.”

With CRISPR, scientists can change, delete, and replace genes in any animal, including us. Working mostly with mice, researchers have already deployed the tool to correct the genetic errors responsible for sickle-cell anemia, muscular dystrophy, and the fundamental defect associated with cystic fibrosis. One group has replaced a mutation that causes cataracts; another has destroyed receptors that H.I.V. uses to infiltrate our immune system.

The potential impact of CRISPR on the biosphere is equally profound. Last year, by deleting all three copies of a single wheat gene, a team led by the Chinese geneticist Gao Caixia created a strain that is fully resistant to powdery mildew, one of the world’s most pervasive blights. In September, Japanese scientists used the technique to prolong the life of tomatoes by turning off genes that control how quickly they ripen. Agricultural researchers hope that such an approach to enhancing crops will prove far less controversial than using genetically modified organisms, a process that requires technicians to introduce foreign DNA into the genes of many of the foods we eat.

The technology has also made it possible to study complicated illnesses in an entirely new way. A few well-known disorders, such as Huntington’s disease and sickle-cell anemia, are caused by defects in a single gene. But most devastating illnesses, among them diabetes, autism, Alzheimer’s, and cancer, are almost always the result of a constantly shifting dynamic that can include hundreds of genes. The best way to understand those connections has been to test them in animal models, a process of trial and error that can take years. CRISPR promises to make that process easier, more accurate, and exponentially faster.
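The word-processor analogy in the excerpt can be made literal. A toy sketch of “search and replace” on a DNA string (sequences invented; real genome editing is of course nothing this simple):

```python
# Toy illustration of the "search-and-replace" analogy: locate the
# site matching a guide sequence and swap in an edited sequence.
genome = "ATGGCCTTTAGACCGGTA"
guide = "TTTAGA"   # plays the role of the RNA guide locating the site
edit = "TTCAGA"    # the repaired sequence to splice in

site = genome.find(guide)
edited = genome[:site] + edit + genome[site + len(guide):]
print(edited)  # ATGGCCTTCAGACCGGTA
```

The hard part CRISPR solved is not the string operation but performing it reliably inside a living cell.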

Deeply compelling read on the stellar career of Feng Zhang and his use of “clustered regularly interspaced short palindromic repeats” (CRISPR) for genetic engineering.

If you are up for the technical side, try PubMed on CRISPR, which returns 2,306 “hits” as of today.

If not, continue with Michael’s article. You will get enough background to realize this is a very profound moment in the development of genetic engineering.

A profound moment that can be made all the more valuable by linking its results to the results (not articles or summaries of articles) of prior research.

Proposals for repackaging data in some yet-to-be-invented format are a non-starter from my perspective. That is more akin to the EU science/WPA projects than a realistic prospect for value-add.

Let’s start with the assumption that data held in electronic format, whether in lab books, databases, triple stores, etc., has its native format as a given. That part of the access problem is not something we can change.

That one assumption reduces worries about corrupting the original data and introduces a sense of “tinkering” with existing data interfaces. (Watch for a post tomorrow on the importance of “tinkering.”)

Hmmm, nodes anyone?

PS: I am not overly concerned about genetic “engineering.” My money is riding on chaos in genetics and environmental factors.

The Economics of Reproducibility in Preclinical Research

Wednesday, June 10th, 2015

The Economics of Reproducibility in Preclinical Research by Leonard P. Freedman, Iain M. Cockburn, Timothy S. Simcoe. PLOS Published: June 9, 2015 DOI: 10.1371/journal.pbio.1002165.

Abstract:

Low reproducibility rates within life science research undermine cumulative knowledge production and contribute to both delays and costs of therapeutic drug development. An analysis of past studies indicates that the cumulative (total) prevalence of irreproducible preclinical research exceeds 50%, resulting in approximately US$28,000,000,000 (US$28B)/year spent on preclinical research that is not reproducible—in the United States alone. We outline a framework for solutions and a plan for long-term improvements in reproducibility rates that will help to accelerate the discovery of life-saving therapies and cures.

The authors find four categories of irreproducibility:

(1) study design, (2) biological reagents and reference materials, (3) laboratory protocols, and (4) data analysis and reporting.

But they only address “(1) study design” and “(2) biological reagents and reference materials.”

Once again, documentation doesn’t make the cut. 🙁

I find that curious because, judging just from the flood of social media data, people in general spend a good part of every day capturing and transmitting information. Where is the pain point between that activity and formal documentation that makes the latter anathema?

Documentation, among other things, could lead to higher reproducibility rates for medical and other research areas, to say nothing of saving data scientists time puzzling out data and/or programmers debugging old code.

Memantic Is Online!

Monday, June 1st, 2015

Memantic

I first blogged about the Memantic paper, Memantic: A Medical Knowledge Discovery Engine, in March of this year and am very happy to now find it online!

From the about page:

Memantic captures relationships between medical concepts by mining biomedical literature and organises these relationships visually according to a well-known medical ontology. For example, a search for “Vitamin B12 deficiency” will yield a visual representation of all related diseases, symptoms and other medical entities that Memantic has discovered from the 25 million medical publications and abstracts mentioned above, as well as a number of medical encyclopaedias.

The user can explore a relationship of interest (such as the one between “Vitamin B12 deficiency” and “optic neuropathy”, for instance) by clicking on it, which will bring up links to all the scientific texts that have been discovered to support that relationship. Furthermore, the user can select the desired type of related concepts — such as “diseases”, “symptoms”, “pharmacological agents”, “physiological functions”, and so on — and use it as a filter to make the visualisation even more concise. Finally, the related concepts can be semantically grouped into an expandable tree hierarchy to further reduce screen clutter and to let the user quickly navigate to the relevant area of interest.

Concisely organising related medical entities without duplication

Memantic first presents all medical terms related to the query concept and then groups publications by the presence of each such term in addition to the query itself. The hierarchical nature of this grouping allows the user to quickly establish previously unencountered relationships and to drill down into the hierarchy to only look at the papers concerning such relationships. Contrast this with the same search performed on Google, where the user normally gets a number of links, many of which have the same title; the user has to go through each link to see if it contains any novel information that is relevant to their query.

Keeping the index of relationships up-to-date

Memantic perpetually renews its index by continuously mining the biomedical literature, extracting new relationships and adding supporting publications to the ones already discovered. The key advantage of Memantic’s user interface is that novel relationships become apparent to the user much quicker than on standard search engines. For example, Google may index a new research paper that exposes a previously unexplored connection between a particular drug and the disease that is being searched for by the user. However, Google may not assign that paper the sufficient weight for it to appear in the first few pages of the search results, thus making it invisible to the people searching for the disease who do not persevere in clicking past those initial pages.
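The grouping Memantic describes, bucketing publications by which related concept co-occurs with the query term, can be sketched in a few lines (titles and concept sets invented for illustration):

```python
from collections import defaultdict

# Sketch of Memantic-style grouping: for a query concept, bucket
# publications by the related concepts they mention alongside it.
query = "vitamin b12 deficiency"
publications = [
    ("Paper A", {"vitamin b12 deficiency", "optic neuropathy"}),
    ("Paper B", {"vitamin b12 deficiency", "anemia"}),
    ("Paper C", {"vitamin b12 deficiency", "optic neuropathy", "anemia"}),
]

groups = defaultdict(list)
for title, concepts in publications:
    for concept in sorted(concepts - {query}):
        groups[concept].append(title)

print(dict(groups))
```

Each bucket is one clickable relationship in the interface; the supporting papers are the bucket’s contents, so nothing has to be read twice to spot a novel connection.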

To get a real feel for what the site is capable of, you need to create an account (free) and try it for yourself.

I am not a professional medical researcher but was able to duplicate fairly quickly some prior research I have done on edge-case conditions. Whether that was due to the interface and its techniques or to my knowledge of the subject area is hard to say.

The interface alone is worth the visit.

Do give Memantic a spin! I think you will like what you find.

Computational drug repositioning through heterogeneous network clustering

Tuesday, November 11th, 2014

Computational drug repositioning through heterogeneous network clustering by Wu C, Gudivada RC, Aronow BJ, Jegga AG. (BMC Syst Biol. 2013;7 Suppl 5:S6. doi: 10.1186/1752-0509-7-S5-S6. Epub 2013 Dec 9.)

Abstract:

BACKGROUND:

Given the costly and time consuming process and high attrition rates in drug discovery and development, drug repositioning or drug repurposing is considered as a viable strategy both to replenish the drying out drug pipelines and to surmount the innovation gap. Although there is a growing recognition that mechanistic relationships from molecular to systems level should be integrated into drug discovery paradigms, relatively few studies have integrated information about heterogeneous networks into computational drug-repositioning candidate discovery platforms.

RESULTS:

Using known disease-gene and drug-target relationships from the KEGG database, we built a weighted disease and drug heterogeneous network. The nodes represent drugs or diseases while the edges represent shared gene, biological process, pathway, phenotype or a combination of these features. We clustered this weighted network to identify modules and then assembled all possible drug-disease pairs (putative drug repositioning candidates) from these modules. We validated our predictions by testing their robustness and evaluated them by their overlap with drug indications that were either reported in published literature or investigated in clinical trials.

CONCLUSIONS:

Previous computational approaches for drug repositioning focused either on drug-drug and disease-disease similarity approaches whereas we have taken a more holistic approach by considering drug-disease relationships also. Further, we considered not only gene but also other features to build the disease drug networks. Despite the relative simplicity of our approach, based on the robustness analyses and the overlap of some of our predictions with drug indications that are under investigation, we believe our approach could complement the current computational approaches for drug repositioning candidate discovery.

A reminder that data clustering isn’t just of academic interest but is useful in highly remunerative fields as well. 😉

There is a vast literature on data clustering, but I don’t know of a collection of data-clustering patterns, that is, a work that summarizes, by domain, where data clustering has been used and the similarities on which the clustering was performed.

In this article, the clustering was described as:

The nodes represent drugs or diseases while the edges represent shared gene, biological process, pathway, phenotype or a combination of these features.

Has that been used elsewhere in medical research?

Not that clustering should be limited to prior patterns, but prior patterns could stimulate new ones.
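The construction quoted above is easy to sketch with networkx (nodes, features, and weights are invented; `greedy_modularity_communities` stands in for whichever clustering algorithm the authors actually used):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Nodes are drugs or diseases; an edge's weight counts the features
# (gene, pathway, phenotype, ...) the pair shares. Hypothetical data.
shared_features = {
    ("metformin", "type 2 diabetes"): ["gene:PRKAA1", "pathway:AMPK"],
    ("metformin", "polycystic ovary syndrome"): ["pathway:AMPK"],
}

G = nx.Graph()
for (drug, disease), feats in shared_features.items():
    G.add_edge(drug, disease, weight=len(feats))

# Cluster the weighted network into modules; drug-disease pairs that
# land in the same module are putative repositioning candidates.
modules = greedy_modularity_communities(G, weight="weight")
print([sorted(m) for m in modules])
```

The pattern itself, heterogeneous nodes plus feature-counting edge weights plus modularity clustering, is exactly the kind of entry a catalog of clustering patterns could record.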

Thoughts?

Recognizing patterns in genomic data

Friday, October 10th, 2014

Recognizing patterns in genomic data – New visualization software uncovers cancer subtypes from a vast repository of biomedical information by Stephanie Dutchen.

From the post:

Much of biomedical research these days is about big data—collecting and analyzing vast, detailed repositories of information about health and disease. These data sets can be treasure troves for investigators, often uncovering genetic mutations that drive a particular kind of cancer, for example.

Trouble is, it’s impossible for humans to browse that much data, let alone make any sense of it.

“It’s [StratomeX] a tool to help you make sense of the data you’re collecting and find the right questions to ask,” said Nils Gehlenborg, research associate in biomedical informatics at HMS and co-senior author of the correspondence in Nature Methods. “It gives you an unbiased view of patterns in the data. Then you can explore whether those patterns are meaningful.”

The software, called StratomeX, was developed to help researchers distinguish subtypes of cancer by crunching through the incredible amount of data gathered as part of The Cancer Genome Atlas, a National Institutes of Health–funded initiative. Identifying distinct cancer subtypes can lead to more effective, personalized treatments.

When users input a query, StratomeX compares tumor data at the molecular level that was collected from hundreds of patients and detects patterns that might indicate significant similarities or differences between groups of patients. The software presents those connections in an easy-to-grasp visual format.

“It helps you make meaningful distinctions,” said co-first author Alexander Lex, a postdoctoral researcher in the Pfister group.

Other than the obvious merits of this project, note the role of the software as an assistant to the user. It crunches the numbers in a specific domain and presents the results in a meaningful fashion.

It is up to the user to decide which patterns are useful and which are not. Shades of “recommending” other instances of the “same” subject?

StratomeX is available for download.

I first saw this in a tweet by Harvard SEAS.

Medical Heritage Library (MHL)

Sunday, September 21st, 2014

Medical Heritage Library (MHL)

From the post:

The Medical Heritage Library (MHL) and DPLA are pleased to announce that MHL content can now be discovered through DPLA.

The MHL, a specialized research collection stored in the Internet Archive, currently includes nearly 60,000 digital rare books, serials, audio and video recordings, and ephemera in the history of medicine, public health, biomedical sciences, and popular medicine from the medical special collections of 22 academic, special, and public libraries. MHL materials have been selected through a rigorous process of curation by subject specialist librarians and archivists and through consultation with an advisory committee of scholars in the history of medicine, public health, gender studies, digital humanities, and related fields. Items, selected for their educational and research value, extend from 1235 (Liber Aristotil[is] de nat[u]r[a] a[nima]li[u]m ag[res]tium [et] marino[rum]), to 2014 (The Grog Issue 40 2014) with the bulk of the materials dating from the 19th century.

“The rich history of medicine content curated by the MHL is available for the first time alongside collections like those from the Biodiversity Heritage Library and the Smithsonian, and offers users a single access point to hundreds of thousands of scientific and history of science resources,” said DPLA Assistant Director for Content Amy Rudersdorf.

The collection is particularly deep in American and Western European medical publications in English, although more than a dozen languages are represented. Subjects include anatomy, dental medicine, surgery, public health, infectious diseases, forensics and legal medicine, gynecology, psychology, anatomy, therapeutics, obstetrics, neuroscience, alternative medicine, spirituality and demonology, diet and dress reform, tobacco, and homeopathy. The breadth of the collection is illustrated by these popular items: the United States Naval Bureau of Medical History’s audio oral history with Doctor Walter Burwell (1994) who served in the Pacific theatre during World War II and witnessed the first Japanese kamikaze attacks; History and medical description of the two-headed girl : sold by her agents for her special benefit, at 25 cents (1869), the first edition of Gray’s Anatomy (1858) (the single most-downloaded MHL text at more than 2,000 downloads annually), and a video collection of Hanna – Barbera Production Flintstones (1960) commercials for Winston cigarettes.

“As is clear from today’s headlines, science, health, and medicine have an impact on the daily lives of Americans,” said Scott H. Podolsky, chair of the MHL’s Scholarly Advisory Committee. “Vaccination, epidemics, antibiotics, and access to health care are only a few of the ongoing issues the history of which are well documented in the MHL. Partnering with the DPLA offers us unparalleled opportunities to reach new and underserved audiences, including scholars and students who don’t have access to special collections in their home institutions and the broader interested public.“

Quick links:

Digital Public Library of America

Internet Archive

Medical Heritage Library website

I remember the Flintstones commercials for Winston cigarettes. Not all that effective a campaign: I smoked Marlboros (reds in a box) for almost forty-five (45) years. 😉

As old vices die out, new ones, like texting while driving, take their place. On behalf of current and former smokers, I am confident that smoking was not a factor in those 1,600,000 accidents per year and 11 teen deaths every day.

2015 Medical Subject Headings (MeSH) Now Available

Thursday, September 18th, 2014

2015 Medical Subject Headings (MeSH) Now Available

From the post:

Introduction to MeSH 2015
The Introduction to MeSH 2015 is now available, including information on its use and structure, as well as recent updates and availability of data.

MeSH Browser
The default year in the MeSH Browser remains 2014 MeSH for now, but the alternate link provides access to 2015 MeSH. The MeSH Section will continue to provide access via the MeSH Browser for two years of the vocabulary: the current year and an alternate year. Sometime in November or December, the default year will change to 2015 MeSH and the alternate link will provide access to the 2014 MeSH.

Download MeSH
Download 2015 MeSH in XML and ASCII formats. Also available for 2015 from the same MeSH download page are:

  • Pharmacologic Actions (Forthcoming)
  • New Headings with Scope Notes
  • MeSH Replaced Headings
  • MeSH MN (tree number) changes
  • 2015 MeSH in MARC format
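The XML download is straightforward to process. A minimal sketch (the element names follow the MeSH descriptor XML as I recall it, with a tiny inline sample standing in for the real descriptor file; verify against the 2015 DTD):

```python
import xml.etree.ElementTree as ET

# Sketch: pull DescriptorUI / DescriptorName pairs from the MeSH XML
# download. The inline sample stands in for the full descriptor file.
sample = """<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorUI>D009369</DescriptorUI>
    <DescriptorName><String>Neoplasms</String></DescriptorName>
  </DescriptorRecord>
</DescriptorRecordSet>"""

root = ET.fromstring(sample)
headings = {
    rec.findtext("DescriptorUI"): rec.findtext("DescriptorName/String")
    for rec in root.iter("DescriptorRecord")
}
print(headings)  # {'D009369': 'Neoplasms'}
```

For the full multi-megabyte file, `ET.iterparse` over the downloaded XML would avoid loading everything into memory at once.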

Enjoy!

multiMiR R package and database:…

Sunday, August 10th, 2014

The multiMiR R package and database: integration of microRNA–target interactions along with their disease and drug associations by Yuanbin Ru, et al. (Nucl. Acids Res. (2014) doi: 10.1093/nar/gku631)

Abstract:

microRNAs (miRNAs) regulate expression by promoting degradation or repressing translation of target transcripts. miRNA target sites have been catalogued in databases based on experimental validation and computational prediction using various algorithms. Several online resources provide collections of multiple databases but need to be imported into other software, such as R, for processing, tabulation, graphing and computation. Currently available miRNA target site packages in R are limited in the number of databases, types of databases and flexibility. We present multiMiR, a new miRNA–target interaction R package and database, which includes several novel features not available in existing R packages: (i) compilation of nearly 50 million records in human and mouse from 14 different databases, more than any other collection; (ii) expansion of databases to those based on disease annotation and drug microRNA response, in addition to many experimental and computational databases; and (iii) user-defined cutoffs for predicted binding strength to provide the most confident selection. Case studies are reported on various biomedical applications including mouse models of alcohol consumption, studies of chronic obstructive pulmonary disease in human subjects, and human cell line models of bladder cancer metastasis. We also demonstrate how multiMiR was used to generate testable hypotheses that were pursued experimentally.

Amazing what you can do with R and a MySQL database!

The authors briefly describe their “cleaning” process for the consolidation of these databases on page 2 but then note on page 4:

For many of the databases, the links are available. However, in Supplementary Table S2 we have listed the databases where links may be broken due to outdated identifiers in those databases. We also listed the databases that do not have the option to search by miRNA-gene pairs.

Perhaps due to editing standards (I am available for freelance work), I have an allergy to terms like “many,” especially when it is possible to enumerate the “many.”

In this particular case, you have to download and consult Supplementary Table S2, which reads:

[image: Supplementary Table S2]

The explanation for this table reads:

For each database, the columns indicate whether external links are available to include as part of multiMiR, whether those databases use identifiers that are updated and whether the links are based on miRNA-gene pairs. For those database that do not have updated identifiers, some links may be broken. For the other databases, where you can only search by miRNA or gene but not pairs, the links are provided by gene, except for ElMMo which is by miRNA because of its database structure.

Counting, I see ten (10) databases with a blank under “Updated Identifiers,” “Search by miRNA-gene,” or both.

I guess ten (10) out of fourteen (14) qualifies as “many,” but saying seventy-one percent (71%) of the databases in this study lack either “Updated Identifiers,” “Search by miRNA-gene,” or both, would have been more informative.

Potential records with these issues? ElMMo, version 4, has human (50M) and mouse (15M) records; MicroCosm/miRBase human has 879,054; and miRanda (assuming human, good mirSVR score, conserved miRNA) has 1,097,069. For the rest, consult Supplemental Table S1, which lists URLs for the databases and dates of access, but not, where multiple human options are available, which one(s) were selected.

The number of records for each database that may have these problems also merits mention in the description of the data.

I can’t comment on the usefulness of this R package for exploring the data but the condition of the data it explores needs more prominent mention.

Expanded 19th-century Medical Collection

Wednesday, July 30th, 2014

Wellcome Library and Jisc announce partners in 19th-century medical books digitisation project

From the post:

The libraries of six universities have joined the partnership – UCL (University College London), the University of Leeds, the University of Glasgow, the London School of Hygiene & Tropical Medicine, King’s College London and the University of Bristol – along with the libraries of the Royal College of Physicians of London, the Royal College of Physicians of Edinburgh and the Royal College of Surgeons of England.

Approximately 15 million pages of printed books and pamphlets from all ten partners will be digitised over a period of two years and will be made freely available to researchers and the public under an open licence. By pooling their collections the partners will create a comprehensive online library. The content will be available on multiple platforms to broaden access, including the Internet Archive, the Wellcome Library and Jisc Historic Books.

The project’s focus is on books and pamphlets from the 19th century that are on the subject of medicine or its related disciplines. This will include works relating to the medical sciences, consumer health, sport and fitness, as well as different kinds of medical practice, from phrenology to hydrotherapy. Works on food and nutrition will also feature: around 1400 cookery books from the University of Leeds are among those lined up for digitisation. They, along with works from the other partner institutions, will be transported to the Wellcome Library in London where a team from the Internet Archive will undertake the digitisation work. The project will build on the success of the US-based Medical Heritage Library consortium, of which the Wellcome Library is a part, which has already digitised over 50 000 books and pamphlets.

Digital coverage of the 19th century is taking another leap forward!

Given the changes in medical terminology (and practices!) since the 19th century, this should be a gold mine for topic map applications.

Tools and Resources Development Fund [bioscience ODF UK]

Friday, July 25th, 2014

Tools and Resources Development Fund

Application deadline: 17 September 2014, 4pm

From the summary:

Our Tools and Resources Development Fund (TRDF) aims to pump prime the next generation of tools, technologies and resources that will be required by bioscience researchers in scientific areas within our remit. It is anticipated that successful grants will not exceed £150k (£187k FEC) (ref 1) and a fast-track, light touch peer review process will operate to enable researchers to respond rapidly to emerging challenges and opportunities.

Projects are expected to have a maximum value of £150k (ref 1). The duration of projects should be between 6 and 18 months, although community networks to develop standards could be supported for up to 3 years.

A number of different types of proposal are eligible for consideration.

  • New approaches to the analysis, modelling and interpretation of research data in the biological sciences, including development of software tools and algorithms. Of particular interest will be proposals that address challenges arising from emerging new types of data and proposals that address known problems associated with data handling (e.g. next generation sequencing, high-throughput phenotyping, the extraction of data from challenging biological images, metagenomics).
  • New frameworks for the curation, sharing, and re-use/re-purposing of research data in the biological sciences, including embedding data citation mechanisms (e.g. persistent identifiers for datasets within research workflows) and novel data management planning (DMP) implementations (e.g. integration of DMP tools within research workflows)
  • Community approaches to the sharing of research data including the development of standards (this could include coordinating UK input into international standards development activities).
  • Approaches designed to exploit the latest computational technology to further biological research; for example, to facilitate the use of cloud computing approaches or high performance computing architectures.

Projects may extend existing software resources; however, the call is designed to support novel tools and methods. Incremental improvement and maintenance of existing software that does not provide new functionality or significant performance improvements (e.g. by migration to an advanced computing environment) does not fall within the scope of the call.

Very timely since the UK announcement that OpenDocument Format (ODF) is among the open standards:

The standards set out the document file formats that are expected to be used across all government bodies. Government will begin using open formats that will ensure that citizens and people working in government can use the applications that best meet their needs when they are viewing or working on documents together. (Open document formats selected to meet user needs)

ODF as a format supports RDFa as metadata but lacks an implementation that makes full use of that capability.
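To see how approachable ODF metadata already is from code, here is a minimal sketch that reads the metadata part of an ODF document using only the Python standard library. An .odt file is a zip archive whose meta.xml part carries Dublin Core fields and user-defined entries; the file name and entry names below are invented for illustration, and this only scratches the surface of what RDFa-capable metadata could carry.

```python
# Sketch: read metadata entries from an ODF document (an .odt file is a
# zip archive; meta.xml holds Dublin Core and user-defined entries).
import xml.etree.ElementTree as ET
import zipfile

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "meta": "urn:oasis:names:tc:opendocument:xmlns:meta:1.0",
}

def odf_metadata(odt_file):
    """Return {name: value} for every entry in meta.xml."""
    with zipfile.ZipFile(odt_file) as odt:
        root = ET.fromstring(odt.read("meta.xml"))
    entries = {}
    for el in root.find("office:meta", NS):
        # user-defined entries keep their key in meta:name; built-in
        # entries (dc:title, meta:keyword, ...) use the local tag name
        name = el.get("{%s}name" % NS["meta"]) or el.tag.split("}")[-1]
        entries[name] = el.text
    return entries
```

A biocuration workflow could write curated identifiers into exactly this layer as the author types, which is the point of the wish list that follows.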

Imagine biocuration that:

  • Starts with authors writing a text and is delivered to
  • Publishers, who can proof or augment the author’s biocuration
  • Results are curated at publication (not months or years later)
  • Results are immediately available for collation with other results.

The only way to match the explosive growth of bioscience publications with equally explosive growth of bioscience curation is to use tools the user already knows. Like word processing software.

Please pass this along and let me know of other grants or funding opportunities where adaptation of office standards or software could change the fundamentals of workflow.

…[S]emantically enriched open pharmacological space…

Wednesday, July 16th, 2014

Scientific competency questions as the basis for semantically enriched open pharmacological space development by Kamal Azzaoui, et al. (Drug Discovery Today, Volume 18, Issues 17–18, September 2013, Pages 843–852)

Abstract:

Molecular information systems play an important part in modern data-driven drug discovery. They do not only support decision making but also enable new discoveries via association and inference. In this review, we outline the scientific requirements identified by the Innovative Medicines Initiative (IMI) Open PHACTS consortium for the design of an open pharmacological space (OPS) information system. The focus of this work is the integration of compound–target–pathway–disease/phenotype data for public and industrial drug discovery research. Typical scientific competency questions provided by the consortium members will be analyzed based on the underlying data concepts and associations needed to answer the questions. Publicly available data sources used to target these questions as well as the need for and potential of semantic web-based technology will be presented.

Pharmacology may not be your space but this is a good example of what it takes for semantic integration of resources in a complex area.

Despite the “…you too can be a brain surgeon with our new web-based app…” from various sources, semantic integration has been, is and will remain difficult under the best of circumstances.

I don’t say that to discourage anyone but to avoid the let-down when integration projects don’t provide easy returns.

It is far better to plan for incremental and measurable benefits along the way than to fashion grandiose goals that are ever receding on the horizon.

I first saw this in a tweet by ChemConnector.

Egas:…

Saturday, June 21st, 2014

Egas: a collaborative and interactive document curation platform by David Campos, et al.

Abstract:

With the overwhelming amount of biomedical textual information being produced, several manual curation efforts have been set up to extract and store concepts and their relationships into structured resources. As manual annotation is a demanding and expensive task, computerized solutions were developed to perform such tasks automatically. However, high-end information extraction techniques are still not widely used by biomedical research communities, mainly because of the lack of standards and limitations in usability. Interactive annotation tools intend to fill this gap, taking advantage of automatic techniques and existing knowledge bases to assist expert curators in their daily tasks. This article presents Egas, a web-based platform for biomedical text mining and assisted curation with highly usable interfaces for manual and automatic in-line annotation of concepts and relations. A comprehensive set of de facto standard knowledge bases are integrated and indexed to provide straightforward concept normalization features. Real-time collaboration and conversation functionalities allow discussing details of the annotation task as well as providing instant feedback of curator’s interactions. Egas also provides interfaces for on-demand management of the annotation task settings and guidelines, and supports standard formats and literature services to import and export documents. By taking advantage of Egas, we participated in the BioCreative IV interactive annotation task, targeting the assisted identification of protein–protein interactions described in PubMed abstracts related to neuropathological disorders. When evaluated by expert curators, it obtained positive scores in terms of usability, reliability and performance. These results, together with the provided innovative features, place Egas as a state-of-the-art solution for fast and accurate curation of information, facilitating the task of creating and updating knowledge bases and annotated resources.

Database URL: http://bioinformatics.ua.pt/egas

Read this article and/or visit the webpage and tell me this doesn’t have topic map editor written all over it!

Domain specific to be sure but any decent interface for authoring topic maps is going to be domain specific.

Very, very impressive!

I am following up with the team to check on the availability of the software.

A controlled vocabulary for pathway entities and events

Saturday, June 21st, 2014

A controlled vocabulary for pathway entities and events by Steve Jupe, et al.

Abstract:

Entities involved in pathways and the events they participate in require descriptive and unambiguous names that are often not available in the literature or elsewhere. Reactome is a manually curated open-source resource of human pathways. It is accessible via a website, available as downloads in standard reusable formats and via Representational State Transfer (REST)-ful and Simple Object Access Protocol (SOAP) application programming interfaces (APIs). We have devised a controlled vocabulary (CV) that creates concise, unambiguous and unique names for reactions (pathway events) and all the molecular entities they involve. The CV could be reapplied in any situation where names are used for pathway entities and events. Adoption of this CV would significantly improve naming consistency and readability, with consequent benefits for searching and data mining within and between databases.

Database URL: http://www.reactome.org

There is no doubt that “unambiguous and unique names for reactions (pathway events) and all the molecular entities they involve” would have all the benefits listed by the authors.

Unfortunately, the experience of the HUGO Gene Nomenclature Committee, for example, has been that “other” names for genes are already in use by the time the HUGO designation is created, making the HUGO designation only one of several names a gene may have.

Another phrase for “universal name” is “an additional name.”

It is an impressive effort and should be useful in disambiguating the additional names for pathway entities and events.
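The reconciliation work this implies can be sketched in a few lines: the CV name becomes one more entry in a synonym table rather than a replacement for the names already in the literature. The identifier and names below are invented, not real Reactome or HUGO entries.

```python
# Sketch: resolve any known synonym to a single canonical identifier.
# The id "R-HSA-0001" and the names are invented for illustration.
CANONICAL = {
    "R-HSA-0001": {"PKC-alpha", "protein kinase C alpha", "PRKCA"},
}

# invert to a lookup keyed on each known synonym (case-folded)
SYNONYM_INDEX = {
    name.casefold(): cid
    for cid, names in CANONICAL.items()
    for name in names
}

def resolve(name):
    """Canonical id for any known synonym, None if unrecognised."""
    return SYNONYM_INDEX.get(name.casefold())
```

The table only disambiguates names it already knows, which is exactly why every “universal name” ends up being an additional name to track.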

FYI, from the homepage of the database:

Reactome is a free, open-source, curated and peer reviewed pathway database. Our goal is to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge to support basic research, genome analysis, modeling, systems biology and education.

Elasticsearch, RethinkDB and the Semantic Web

Wednesday, June 11th, 2014

Elasticsearch, RethinkDB and the Semantic Web by Michel Dumontier.

From the post:

Everyone is handling big data nowadays, or at least, so it seems. Hadoop is very popular among the Big Data wranglers and it is often mentioned as the de facto solution. I have dabbled into working with Hadoop over the past years and found that: yes, it is very suitable for certain kinds of data mining/analysis and for those it provides high data crunching throughput, but, no, it cannot answer queries quickly and you cannot port every algorithm into Hadoop’s map/reduce paradigm. I have since turned to Elasticsearch and more recently to RethinkDB. It is a joy to work with the latter and it performs faceting just as well as Elasticsearch for the benchmark data that I used, but still permits me to carry out more complex data mining and analysis too.

The story here describes the data that I am working with a bit, it shows how it can be turned into a data format that both Elasticsearch and RethinkDB understand, how the data is being loaded and indexed, and finally, how to get some facets out of the systems.

Interesting post on biomedical data in RDF N-Quads format which is converted into JSON and then processed with ElasticSearch and RethinkDB.

I first saw this in a tweet by Joachim Baran.

A Methodology for Empirical Analysis of LOD Datasets

Friday, June 6th, 2014

A Methodology for Empirical Analysis of LOD Datasets by Vit Novacek.

Abstract:

CoCoE stands for Complexity, Coherence and Entropy, and presents an extensible methodology for empirical analysis of Linked Open Data (i.e., RDF graphs). CoCoE can offer answers to questions like: Is dataset A better than B for knowledge discovery since it is more complex and informative?, Is dataset X better than Y for simple value lookups due its flatter structure?, etc. In order to address such questions, we introduce a set of well-founded measures based on complementary notions from distributional semantics, network analysis and information theory. These measures are part of a specific implementation of the CoCoE methodology that is available for download. Last but not least, we illustrate CoCoE by its application to selected biomedical RDF datasets. (emphasis in original)

A deeply interesting work on the formal characteristics of LOD datasets, but as we learned in Community detection in networks:… a relationship between a topology (another formal characteristic) and some hidden fact(s) may or may not exist.

Or to put it another way, formal characteristics are useful for rough evaluation of data sets but cannot replace a grounded actor considering their meaning. That would be you.

I first saw this in a tweet by Marin Dimitrov.

Innovations in peer review:…

Tuesday, April 22nd, 2014

Innovations in peer review: join a discussion with our Editors by Shreeya Nanda.

From the post:

Innovation may not be an adjective often associated with peer review, indeed commentators have claimed that peer review slows innovation and creativity in science. Preconceptions aside, publishers are attempting to shake things up a little, with various innovations in peer review, and these are the focus of a panel discussion at BioMed Central’s Editors’ Conference on Wednesday 23 April in Doha, Qatar. This follows our spirited discussion at the Experimental Biology conference in Boston last year.

The discussion last year focussed on the limitations of the traditional peer review model (you can see a video here). This year we want to talk about innovations in the field and the ways in which the limitations are being addressed. Specifically, we will focus on open peer review, portable peer review – in which we help authors transfer their manuscript, often with reviewers’ reports, to a more appropriate journal – and decoupled peer review, which is undertaken by a company or organisation independent of, or on contract from, a journal.

We will be live tweeting from the session at 11.15am local time (9.15am BST), so if you want to join the discussion or put questions to our panellists, please follow #BMCEds14. If you want to brush up on any or all of the models that we’ll be discussing, have a look at some of the content from around BioMed Central’s journals, blogs and Biome below:

This post includes pointers to a number of useful resources concerning the debate around peer review.

But there are oddities as well. First, the claim that peer review “slows innovation and creativity in science” is hard to credit, considering recent reports that peer review is no better than random chance for grants (…lotteries to pick NIH research-grant recipients) and the not infrequent reports of false papers, fraud in actual papers, and a general inability to replicate research described in papers (Reproducible Research/(Mapping?)).

A claim doesn’t have to appear on the alt.fringe.peer.review newsgroup (imaginary newsgroup) in order to be questionable on its face.

Secondly, despite the invitation to follow and participate on Twitter, holding the meeting in Qatar means potential attendees from the United States will have to rise at:

Eastern 4:15 AM (last year’s location)

Central 3:15 AM

Mountain 2:15 AM

Pacific 1:15 AM

I wonder how participation levels for Qatar this year will compare to those for Boston last year?

Nothing against non-United States locations but non-junket locations, such as major educational/research hubs, should be the sites for such meetings.

tagtog: interactive and text-mining-assisted annotation…

Monday, April 14th, 2014

tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles by Juan Miguel Cejuela, et al.

Abstract:

The breadth and depth of biomedical literature are increasing year upon year. To keep abreast of these increases, FlyBase, a database for Drosophila genomic and genetic information, is constantly exploring new ways to mine the published literature to increase the efficiency and accuracy of manual curation and to automate some aspects, such as triaging and entity extraction. Toward this end, we present the ‘tagtog’ system, a web-based annotation framework that can be used to mark up biological entities (such as genes) and concepts (such as Gene Ontology terms) in full-text articles. tagtog leverages manual user annotation in combination with automatic machine-learned annotation to provide accurate identification of gene symbols and gene names. As part of the BioCreative IV Interactive Annotation Task, FlyBase has used tagtog to identify and extract mentions of Drosophila melanogaster gene symbols and names in full-text biomedical articles from the PLOS stable of journals. We show here the results of three experiments with different sized corpora and assess gene recognition performance and curation speed. We conclude that tagtog-named entity recognition improves with a larger corpus and that tagtog-assisted curation is quicker than manual curation.

Database URL: www.tagtog.net, www.flybase.org.

Encouraging because the “tagging” is not wholly automated nor is it wholly hand-authored. Rather the goal is to create an interface that draws on the strengths of automated processing as moderated by human expertise.

Annotation remains at the document level, which consigns subsequent users to mining full text, but this is definitely a step in the right direction.
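The division of labour tagtog embodies can be sketched simply: automatic annotations seed the document, and curator decisions win wherever the two overlap. Spans here are (start, end, label) tuples and the data is invented for illustration.

```python
# Sketch: merge machine annotations with curator overrides.
# A manual span displaces any automatic span it overlaps.
def merge_annotations(automatic, manual):
    """Keep every manual span; keep automatic spans only where no
    manual span overlaps them."""
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]
    kept = [a for a in automatic if not any(overlaps(a, m) for m in manual)]
    return sorted(kept + manual)

auto = [(0, 4, "GENE"), (10, 15, "GENE")]
curated = [(0, 4, "PROTEIN")]   # curator relabels the first span
merged = merge_annotations(auto, curated)
```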

Open Science Leaps Forward! (Johnson & Johnson)

Friday, January 31st, 2014

In Stunning Win For Open Science, Johnson & Johnson Decides To Release Its Clinical Trial Data To Researchers by Matthew Herper.

From the post:

Drug companies tend to be secretive, to say the least, about studies of their medicines. For years, negative trials would not even be published. Except for the U.S. Food and Drug Administration, nobody got to look at the raw information behind those studies. The medical data behind important drugs, devices, and other products was kept shrouded.

Today, Johnson & Johnson is taking a major step toward changing that, not only for drugs like the blood thinner Xarelto or prostate cancer pill Zytiga but also for the artificial hips and knees made for its orthopedics division or even consumer products. “You want to know about Listerine trials? They’ll have it,” says Harlan Krumholz of Yale University, who is overseeing the group that will release the data to researchers.

….

Here’s how the process will work: J&J has enlisted The Yale School of Medicine’s Open Data Access Project (YODA) to review requests from physicians to obtain data from J&J products. Initially, this will only include products from the drug division, but it will expand to include devices and consumer products. If YODA approves a request, raw, anonymized data will be provided to the physician. That includes not just the results of a study, but the results collected for each patient who volunteered for it with identifying information removed. That will allow researchers to re-analyze or combine that data in ways that would not have been previously possible.

….

Scientists can make a request for data on J&J drugs by going to www.clinicaltrialstudytransparency.com.

The ability to “…re-analyze or combine that data in ways that would not have been previously possible…” is the public benefit of Johnson & Johnson’s sharing of data.

With any luck, this will be the start of a general trend among drug companies.

Mappings of the semantics of such data sets should be contributed back to the Yale School of Medicine’s Open Data Access Project (YODA), to further enhance re-use of these data sets.

Map of Preventable Diseases

Wednesday, January 29th, 2014

[Image: map of preventable disease outbreaks]

Be sure to see the interactive version of this map by the Council on Foreign Relations.

I first saw this at Chart Porn, which was linking to Map of preventable disease outbreaks shows the influence of anti-vaccination movements by Rich McCormick, which in turn pointed to the CFR map.

The dataset is downloadable from the CFR.

Vaccination being more a matter of public health, I have always wondered why anyone would be allowed an option to decline. Certainly some people will have adverse reactions, even die, and they or their families should be cared for and/or compensated. But they should not be allowed to put large numbers of others at risk.

BTW, when you look at the interactive map, locate Georgia in the United States and you will see the large green dot reporting 247 cases of whooping cough for Georgia. The next green dot, which slightly overlaps with it, reports 2 cases while being more than half the size of the dot for Georgia.

Disproportionate scaling of icons reduces the accuracy of the information conveyed by the map. Unfortunate because this is an important public health issue.
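The fix is well known in cartography: if a dot's radius grows linearly with case count, perceived area grows with the square of the count. Scaling the radius by the square root keeps area, which is what the eye reads, proportional to the data. A minimal sketch:

```python
# Sketch: size map dots so that *area*, not radius, tracks case count.
import math

def dot_radius(cases, scale=1.0):
    """Radius such that dot area is proportional to case count."""
    return scale * math.sqrt(cases)

# 247 cases vs 2 cases: the area ratio stays 247/2 = 123.5,
# instead of the (247/2)**2 distortion linear-radius scaling gives.
ratio = dot_radius(247) ** 2 / dot_radius(2) ** 2
```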

Applying linked data approaches to pharmacology:…

Wednesday, January 29th, 2014

Applying linked data approaches to pharmacology: Architectural decisions and implementation by Alasdair J.G. Gray, et al.

Abstract:

The discovery of new medicines requires pharmacologists to interact with a number of information sources ranging from tabular data to scientific papers, and other specialized formats. In this application report, we describe a linked data platform for integrating multiple pharmacology datasets that form the basis for several drug discovery applications. The functionality offered by the platform has been drawn from a collection of prioritised drug discovery business questions created as part of the Open PHACTS project, a collaboration of research institutions and major pharmaceutical companies. We describe the architecture of the platform focusing on seven design decisions that drove its development with the aim of informing others developing similar software in this or other domains. The utility of the platform is demonstrated by the variety of drug discovery applications being built to access the integrated data.

An alpha version of the OPS platform is currently available to the Open PHACTS consortium and a first public release will be made in late 2012, see http://www.openphacts.org/ for details.

The paper acknowledges that present database entries lack semantics.

A further challenge is the lack of semantics associated with links in traditional database entries. For example, the entry in UniProt for the protein “kinase C alpha type homo sapien” contains a link to the Enzyme database record, which has complementary data about the same protein and thus the identifiers can be considered as being equivalent. One approach to resolve this, proposed by Identifiers.org, is to provide a URI for the concept which contains links to the database records about the concept [27]. However, the UniProt entry also contains a link to the DrugBank compound “Phosphatidylserine”. Clearly, these concepts are not identical as one is a protein and the other a chemical compound. The link in this case is representative of some interaction between the compound and the protein, but this is left to a human to interpret. Thus, for successful data integration one must devise strategies that address such inconsistencies within the existing data.

I would have said databases lack properties to identify the subjects in question but there is little difference in the outcome of our respective positions, i.e., we need more semantics to make robust use of existing data.

Perhaps even more importantly, the paper treats “equality” as context dependent:

Equality is context dependent

Datasets often provide links to equivalent concepts in other datasets. These result in a profusion of “equivalent” identifiers for a concept. Identifiers.org provide a single identifier that links to all the underlying equivalent dataset records for a concept. However, this constrains the system to a single view of the data, albeit an important one.

A novel approach to instance level links between the datasets is used in the OPS platform. Scientists care about the types of links between entities: different scientists will accept concepts being linked in different ways and for different tasks they are willing to accept different forms of relationships. For example, when trying to find the targets that a particular compound interacts with, some data sources may have created mappings to gene rather than protein identifiers: in such instances it may be acceptable to users to treat gene and protein IDs as being in some sense equivalent. However, in other situations this may not be acceptable and the OPS platform needs to allow for this dynamic equivalence within a scientific context. As a consequence, rather than hard coding the links into the datasets, the OPS platform defers the instance level links to be resolved during query execution by the Identity Mapping Service (IMS). Thus, by changing the set of dataset links used to execute the query, different interpretations over the data can be provided.

Opaque mappings between datasets, i.e., mappings that don’t assign properties to source, target and then say what properties or conditions must be met for the mapping to be valid, are of little use. Rely on opaque mappings at your own risk.

On the other hand, I fully agree that equality is context dependent and the choice of the criteria for equivalence should be left up to users. I suppose in that sense if users wanted to rely on opaque mappings, that would be their choice.
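Deferring link resolution to query time, as the IMS does, can be sketched in a few lines: the same lookup resolves through a different link set depending on the lens the scientist selects. The identifiers and link sets below are invented for illustration, not real Open PHACTS data.

```python
# Sketch: context-dependent equivalence. The chosen context selects
# which instance-level links are honoured at query time.
LINKSETS = {
    # same protein only
    "strict": [("uniprot:P17252", "enzyme:2.7.11.13")],
    # additionally accept a gene id as a stand-in for its protein
    "gene-as-protein": [("uniprot:P17252", "enzyme:2.7.11.13"),
                        ("uniprot:P17252", "ensembl:ENSG00000154229")],
}

def equivalents(identifier, context):
    """Identifiers reachable from `identifier` under the chosen
    context, closed transitively over that context's link set."""
    out = {identifier}
    changed = True
    while changed:
        changed = False
        for a, b in LINKSETS[context]:
            if (a in out) != (b in out):
                out.update((a, b))
                changed = True
    return out
```

Swapping the context swaps the interpretation of the data without touching the datasets themselves, which is the architectural point the paper is making.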

While an exciting paper, it is discussing architectural decisions and so we are not at the point of debating these issues in detail. It promises to be an exciting discussion!

EVEX

Sunday, January 26th, 2014

EVEX

From the about page:

EVEX is a text mining resource built on top of PubMed abstracts and PubMed Central full text articles. It contains over 40 million bio-molecular events among more than 76 million automatically extracted gene/protein name mentions. The text mining data further has been enriched with gene identifiers and gene families from Ensembl and HomoloGene, providing homology-based event generalizations. EVEX presents both direct and indirect associations between genes and proteins, enabling explorative browsing of relevant literature.

Ok, it’s not web-scale but it is important information. 😉

What I find the most interesting is the “…direct and indirect associations between genes and proteins, enabling explorative browsing of the relevant literature.”

See their tutorial on direct and indirect associations.
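The distinction is easy to state operationally: A and C are indirectly associated when each has a direct event with some shared B. A minimal sketch over an invented event list:

```python
# Sketch: direct vs. indirect association. Two entities are indirectly
# associated when they share a direct neighbour. Events are invented.
from collections import defaultdict

events = [("geneA", "protein1"), ("geneB", "protein1"),
          ("geneB", "protein2")]

direct = defaultdict(set)
for a, b in events:
    direct[a].add(b)
    direct[b].add(a)

def indirect(entity):
    """Entities two hops away, excluding direct neighbours and self."""
    two_hop = set()
    for mid in direct[entity]:
        two_hop.update(direct[mid])
    return two_hop - direct[entity] - {entity}
```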

I think part of the lesson here is that no matter how gifted its author, a topic map with static associations limits a user’s ability to explore the infoverse.

That may work quite well where uniform advice, even if incorrect, is preferred over exploration. However, in rapidly changing areas like medical research, static associations could be more of a hindrance than a boon.

Open Educational Resources for Biomedical Big Data

Friday, January 17th, 2014

Open Educational Resources for Biomedical Big Data (R25)

Deadline for submission: April 1, 2014

Additional information: bd2k_training@mail.nih.gov

As part of the NIH Big Data to Knowledge (BD2K) project, BD2K R25 FOA will support:

Curriculum or Methods Development of innovative open educational resources that enhance the ability of the workforce to use and analyze biomedical Big Data.

The challenges:

The major challenges to using biomedical Big Data include the following:

Locating data and software tools: Investigators need straightforward means of knowing what datasets and software tools are available and where to obtain them, along with descriptions of each dataset or tool. Ideally, investigators should be able to easily locate all published and resource datasets and software tools, both basic and clinical, and, to the extent possible, unpublished or proprietary data and software.

Gaining access to data and software tools: Investigators need straightforward means of 1) releasing datasets and metadata in standard formats; 2) obtaining access to specific datasets or portions of datasets; 3) studying datasets with the appropriate software tools in suitable environments; and 4) obtaining analyzed datasets.

Standardizing data and metadata: Investigators need data to be in standard formats to facilitate interoperability, data sharing, and the use of tools to manage and analyze the data. The datasets need to be described by standard metadata to allow novel uses as well as reuse and integration.

Sharing data and software: While significant progress has been made in broad and rapid sharing of data and software, it is not yet the norm in all areas of biomedical research. More effective data- and software-sharing would be facilitated by changes in the research culture, recognition of the contributions made by data and software generators, and technical innovations. Validation of software to ensure quality, reproducibility, provenance, and interoperability is a notable goal.

Organizing, managing, and processing biomedical Big Data: Investigators need biomedical data to be organized and managed in a robust way that allows them to be fully used; currently, most data are not sufficiently well organized. Barriers exist to releasing, transferring, storing, and retrieving large amounts of data. Research is needed to design innovative approaches and effective software tools for organizing biomedical Big Data for data integration and sharing while protecting human subject privacy.

Developing new methods for analyzing biomedical Big Data: The size, complexity, and multidimensional nature of many datasets make data analysis extremely challenging. Substantial research is needed to develop new methods and software tools for analyzing such large, complex, and multidimensional datasets. User-friendly data workflow platforms and visualization tools are also needed to facilitate the analysis of Big Data.

Training researchers for analyzing biomedical Big Data: Advances in biomedical sciences using Big Data will require more scientists with the appropriate data science expertise and skills to develop methods and design tools, including those in many quantitative science areas such as computational biology, biomedical informatics, biostatistics, and related areas. In addition, users of Big Data software tools and resources must be trained to utilize them well.

Another big data biomedical data integration funding opportunity!

I do wonder about the suggestion:

The datasets need to be described by standard metadata to allow novel uses as well as reuse and integration.

Do they mean:

“Standard” metadata for a particular academic lab?

“Standard” metadata for a particular industry lab?

“Standard” metadata for either one, five (5) years ago?

“Standard” metadata for either one, five (5) years from now?

The problem being the familiar one that knowledge that isn’t moving forward is outdated.

It’s hard to do good research with outdated information.

Making metadata dynamic, so that it reflects yesterday’s terminology, today’s and someday tomorrow’s, would be far more useful.

The metadata displayed to any user would be their choice of metadata and not the complexities that make the metadata dynamic.
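One way to make that concrete: record each term once, together with the vocabulary version it was recorded under, and render it in whichever vocabulary the reader prefers. The crosswalk below is invented for illustration (though “dropsy” really is an older name for edema).

```python
# Sketch: dynamic metadata via a versioned terminology crosswalk.
# (term, recorded_version) -> {display_version: term}
CROSSWALK = {
    ("dropsy", "1900"): {"2014": "edema"},
    ("edema", "2014"): {"1900": "dropsy"},
}

def display_term(term, recorded_in, reader_prefers):
    """Show the reader their preferred terminology; fall back to the
    recorded term when no mapping exists."""
    if recorded_in == reader_prefers:
        return term
    return CROSSWALK.get((term, recorded_in), {}).get(reader_prefers, term)
```

The complexity lives in the crosswalk, not in what the user sees, which is the point of the paragraph above.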

Interested?

Courses for Skills Development in Biomedical Big Data Science

Thursday, January 16th, 2014

Courses for Skills Development in Biomedical Big Data Science

Deadline for submission: April 1, 2014

Additional information: bd2k_training@mail.nih.gov

As part of the NIH Big Data to Knowledge (BD2K) project, BD2K R25 FOA will support:

Courses for Skills Development in topics necessary for the utilization of Big Data, including the computational and statistical sciences in a biomedical context. Courses will equip individuals with additional skills and knowledge to utilize biomedical Big Data.

Challenges in biomedical Big Data?

The major challenges to using biomedical Big Data include the following:

Locating data and software tools: Investigators need straightforward means of knowing what datasets and software tools are available and where to obtain them, along with descriptions of each dataset or tool. Ideally, investigators should be able to easily locate all published and resource datasets and software tools, both basic and clinical, and, to the extent possible, unpublished or proprietary data and software.

Gaining access to data and software tools: Investigators need straightforward means of 1) releasing datasets and metadata in standard formats; 2) obtaining access to specific datasets or portions of datasets; 3) studying datasets with the appropriate software tools in suitable environments; and 4) obtaining analyzed datasets.

Standardizing data and metadata: Investigators need data to be in standard formats to facilitate interoperability, data sharing, and the use of tools to manage and analyze the data. The datasets need to be described by standard metadata to allow novel uses as well as reuse and integration.

Sharing data and software: While significant progress has been made in broad and rapid sharing of data and software, it is not yet the norm in all areas of biomedical research. More effective data- and software-sharing would be facilitated by changes in the research culture, recognition of the contributions made by data and software generators, and technical innovations. Validation of software to ensure quality, reproducibility, provenance, and interoperability is a notable goal.

Organizing, managing, and processing biomedical Big Data: Investigators need biomedical data to be organized and managed in a robust way that allows them to be fully used; currently, most data are not sufficiently well organized. Barriers exist to releasing, transferring, storing, and retrieving large amounts of data. Research is needed to design innovative approaches and effective software tools for organizing biomedical Big Data for data integration and sharing while protecting human subject privacy.

Developing new methods for analyzing biomedical Big Data: The size, complexity, and multidimensional nature of many datasets make data analysis extremely challenging. Substantial research is needed to develop new methods and software tools for analyzing such large, complex, and multidimensional datasets. User-friendly data workflow platforms and visualization tools are also needed to facilitate the analysis of Big Data.

Training researchers for analyzing biomedical Big Data: Advances in biomedical sciences using Big Data will require more scientists with the appropriate data science expertise and skills to develop methods and design tools, including those in many quantitative science areas such as computational biology, biomedical informatics, biostatistics, and related areas. In addition, users of Big Data software tools and resources must be trained to utilize them well.
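The “standardizing data and metadata” challenge above is easy to state and hard to do, but even a small machine-readable record helps. Here is a hedged sketch of one: the field names are invented for illustration and do not follow any particular NIH or community standard, but a checksum plus format and license fields is the kind of minimum that makes later reuse and verification possible.

```python
import hashlib
import json

def describe_dataset(name, data: bytes, fmt, license_name):
    """Build a minimal, machine-readable metadata record for a dataset.

    The sha256 checksum lets a later user confirm they are analyzing
    exactly the same bytes, one small piece of reproducibility.
    """
    return {
        "name": name,
        "format": fmt,
        "license": license_name,
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
    }

record = describe_dataset("variants.vcf", b"##fileformat=VCFv4.2\n", "VCF", "CC0")
print(json.dumps(record, indent=2))
```

Standard metadata of even this modest sort is what makes the “locating” and “gaining access” challenges tractable, since tools can index it without human mediation.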

It’s hard for me to read that list and not see subject identity as playing some role in meeting all of those challenges. Not a complete solution, because each challenge bundles a variety of problems, but for preserving access to data sets, issues, and approaches over time, subject identity is a necessary component of any solution.

Applicants have to be institutions of higher education but I assume they can hire expertise as required.

Needles in Stacks of Needles:…

Monday, January 6th, 2014

Needles in Stacks of Needles: genomics + data mining by Martin Krzywinski. (ICDM2012 Keynote)

Abstract:

In 2001, the first human genome sequence was published. Now, just over 10 years later, we are capable of sequencing a genome in just a few days. Massive parallel sequencing projects now make it possible to study the cancers of thousands of individuals. New data mining approaches are required to robustly interrogate the data for causal relationships among the inherently noisy biology. How does one identify genetic changes that are specific and causal to a disease within the rich variation that is either natural or merely correlated? The problem is one of finding a needle in a stack of needles. I will provide a non-specialist introduction to data mining methods and challenges in genomics, with a focus on the role visualization plays in the exploration of the underlying data.

This page links to the slides Martin used in his presentation.

Excellent graphics and a number of amusing points, even without the presentation itself:

Cheap Data: A fruit fly that expresses high sensitivity to alcohol.

Kenny: A fruit fly without this gene dies in two days, named for the South Park character who dies in each episode.

Ken and Barbie: Fruit flies that fail to develop external genitalia.

One observation that rings true across disciplines:

Literature is still largely composed and published opaquely.

I searched for a video recording of the presentation but came up empty.

Need a Human

Monday, January 6th, 2014

Need a Human

Shamelessly stolen from Martin Krzywinski’s ICDM2012 Keynote — Needles in Stacks of Needles.

I am about to post on that keynote but thought the image merited a post of its own.

Galaxy:…

Friday, December 27th, 2013

Galaxy: Data Intensive Biology For Everyone

From the website:

Galaxy is an open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses.

From the Galaxy wiki:

Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research.

  • Accessible: Users without programming experience can easily specify parameters and run tools and workflows.
  • Reproducible: Galaxy captures information so that any user can repeat and understand a complete computational analysis.
  • Transparent: Users share and publish analyses via the web and create Pages, interactive, web-based documents that describe a complete analysis.

This is the Galaxy Community Wiki. It describes all things Galaxy.

Whether you are a home bio-hacker or an IT person looking to understand computational biology, Galaxy may be a good fit for you.

You can try out the public server before going to the trouble of a local install. Unless, of course, you are paranoid about your bits going over the network. 😉
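Galaxy also exposes its functionality through a REST API under `/api`, which is what makes scripted, reproducible analyses possible. A minimal sketch of building such a request, without sending it, is below; the server URL and API key are placeholders, and endpoint names like `histories` follow Galaxy’s documented routes.

```python
from urllib.parse import urlencode
from urllib.request import Request

def galaxy_api_request(server, endpoint, api_key, **params):
    """Build (but do not send) a GET request for a Galaxy REST endpoint.

    Galaxy authenticates API calls with a per-user key, passed here as
    a query parameter; the server URL and key are placeholders.
    """
    query = urlencode({"key": api_key, **params})
    url = f"{server.rstrip('/')}/api/{endpoint}?{query}"
    return Request(url, headers={"Accept": "application/json"})

req = galaxy_api_request("https://usegalaxy.org", "histories", "YOUR_API_KEY")
print(req.full_url)
```

Passing the resulting request to `urllib.request.urlopen` (with a real key) would return your histories as JSON; higher-level client libraries exist as well.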