Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

August 13, 2012

Using an In-Memory Data Grid for Near Real-Time Data Analysis

Filed under: Data Analysis,MapReduce — Patrick Durusau @ 3:19 pm

Using an In-Memory Data Grid for Near Real-Time Data Analysis by Dr. William Bain, ScaleOut Software, Inc.

Vendor content, so the usual disclaimers apply, but this may signal an important, if subtle, shift in computing environments.

From the post:

Introduction

In today’s competitive world, businesses need to make fast decisions to respond to changing market conditions and to maintain a competitive edge. The explosion of data that must be analyzed to find trends or hidden insights intensifies this challenge. Both the private and public sectors are turning to parallel computing techniques, such as “map/reduce” to quickly sift through large data volumes.

In some cases, it is practical to analyze huge sets of historical, disk-based data over the course of minutes or hours using batch processing platforms such as Hadoop. For example, risk modeling to optimize the handling of insurance claims potentially needs to analyze billions of records and tens of terabytes of data. However, many applications need to continuously analyze relatively small but fast-changing data sets measured in the hundreds of gigabytes and reaching into terabytes. Examples include clickstream data to optimize online promotions, stock trading data to implement trading strategies, machine log data to tune manufacturing processes, smart grid data, and many more.

Over the last several years, in-memory data grids (IMDGs) have proven their value in storing fast-changing application data and scaling application performance. More recently, IMDGs have integrated map/reduce analytics into the grid to achieve powerful, easy-to-use analysis and enable near real-time decision making. For example, the following diagram illustrates an IMDG used to store and analyze incoming streams of market and news data to help generate alerts and strategies for optimizing financial operations. This article explains how using an IMDG with integrated map/reduce capabilities can simplify data analysis and provide important competitive advantages.
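
If map/reduce is unfamiliar, the pattern over in-memory data is small enough to sketch. A toy Python sketch with made-up cart data (not ScaleOut's API):

    from collections import Counter
    from functools import reduce

    # Hypothetical in-memory partitions of shopping-cart data (no file I/O anywhere).
    partitions = [
        [{"cart": 1, "items": ["tv", "hdmi cable"]}, {"cart": 2, "items": ["tv"]}],
        [{"cart": 3, "items": ["grill", "charcoal", "tv"]}],
    ]

    def map_phase(partition):
        # In a real IMDG each grid node would count its own partition in parallel.
        return Counter(item for cart in partition for item in cart["items"])

    def reduce_phase(left, right):
        # Merge the partial counts into a single result.
        return left + right

    totals = reduce(reduce_phase, map(map_phase, partitions), Counter())
    print(totals.most_common(3))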

Lowering the complexity of map/reduce, increasing operation speed (no file I/O), and enabling easier parallelism are all good things.

But they are differences in degree, not in kind.

I find IMDGs interesting because of the potential to increase the complexity of relationships between data, including data that is the output of operations.

From the post:

For example, an e-commerce Web site may need to monitor online shopping carts to see which products are selling.

Yawn.

That is probably a serious technical/data issue for Walmart or Home Depot, but it is a difference in degree. You could do the same operations with a shoebox and paper receipts, although that would take a while.

Consider the beginning of something a bit more imaginative: What if sales at stores were treated differently than online shopping carts (due to delivery factors) and models built using weather forecasts three to five days out, time of year, local holidays and festivals? Multiple relationships between different data nodes.

That is just a back-of-the-envelope sketch, and I am sure successful retailers do even more than I have suggested.

Complex relationships between data elements are almost at our fingertips.

Are you still counting shopping cart items?

August 12, 2012

Semantic physical science

Filed under: Science,Semantics — Patrick Durusau @ 7:56 pm

Semantic physical science by Peter Murray-Rust and Henry S Rzepa. (Journal of Cheminformatics 2012, 4:14 doi:10.1186/1758-2946-4-14)

Abstract:

The articles in this special issue arise from a workshop and symposium held in January 2012 (‘Semantic Physical Science’). We invited people who shared our vision for the potential of the web to support chemical and related subjects. Other than the initial invitations, we have not exercised any control over the content of the contributed articles.

There are pointers to videos and other materials for the following workshop presentations:

  • Introduction – Peter Murray-Rust [11]
  • Why we (PNNL) are supporting semantic science – Bill Shelton
  • Adventures in Semantic Materials Informatics – Nico Adams
  • Semantic Crystallographic Publishing – Brian McMahon [12]
  • Service-oriented science: why good code matters and why a fundamental change in thinking is required – Cameron Neylon [13]
  • On the use of CML in computational materials research – Martin Dove [14]
  • FoX, CML and semantic tools for atomistic simulation – Andrew Walker [15]
  • Semantic Physical Science: the CML roadmap – Marcus Hanwell [16]
  • CMLisation of NWChem and development strategy for FoXification and dictionaries – Bert de Jong
  • NMR working group – Nancy Washton

A remarkable workshop with which I have only one minor difference:

There was remarkable and exciting unanimity that semantics should and could be introduced now and rapidly into the practice of large areas of chemistry. We agreed that we should concentrate on the three main areas of crystallography, computation and NMR spectroscopy. In crystallography, this is primarily a strategy of working very closely with the IUCr, being able to translate crystallographic data automatically into semantic form and exploring the value of semantic publication and repositories. The continued development of Chempound for crystal structures is Open and so can be fed back regularly into mainstream crystallography.

When computers were being introduced to the indexing of chemistry and other physical sciences in the 1950s and ’60s, the practitioners of the day were under the impression that their data already had semantics; that it did not have to await the next turn of the century in order to have semantics.

Not to take anything away from the remarkable progress that CML and related efforts have made, but they are not the advent of semantics for chemistry.

Clarification of semantics, documentation of semantics, refinement of semantics, all true.

But chemistry (and data) has always had semantics.

4th International Workshop on Graph Data Management: Techniques and Applications (GDM 2013)

Filed under: Conferences,Graphs,Networks — Patrick Durusau @ 6:13 pm

4th International Workshop on Graph Data Management: Techniques and Applications (GDM 2013)

Important Dates:

Paper submission deadline: October 7, 2011
Author Notification: November 21, 2011
Final Camera-ready Copy Deadline: November 28, 2011
Workshop: April 11, 2012 (Brisbane, Australia)

From the call for papers:

The GDM workshop targets researchers that concentrate on all aspects of managing graph data such as: “How to store graph data?”, “How to efficiently index graph data?” and “How to query graph databases?”. Hence, we are interested in applications of graph databases in different practical domains. The workshop invites original research contributions as well as reports on prototype systems from research communities dealing with different theoretical and applied aspects of graph data management. Submitted papers will be evaluated on the basis of significance, originality, technical quality, and exposition. Papers should clearly establish the research contribution, and relation to previous research. Position and survey papers are also welcome.

Topics of interest include, but are not limited to:

  • Methods/Techniques for storing, indexing and querying graph data.
  • Methods/Techniques for estimating the selectivity of graph queries.
  • Methods/Techniques for graph mining.
  • Methods/Techniques for compact (compressed) representation of graph data.
  • Methods/Techniques for measuring graph similarity.
  • Methods/Techniques for large scale and distributed graph processing.
  • Tools/Techniques for graph data management for social network applications.
  • Tools/Techniques for graph data management of chemical compounds.
  • Tools/Techniques for graph data management of protein networks.
  • Tools/Techniques for graph data management of multimedia databases.
  • Tools/Techniques for graph data management of semantic web data (RDF).
  • Tools/Techniques for graph data management for spatio-temporal applications.
  • Tools/Techniques for graph data management for Business Process Management applications.
  • Tools/Techniques for visualizing, browsing, or navigating graph data
  • Analysis/Proposals for graph query languages.
  • Advanced applications and tools for managing graph databases in different domains.
  • Benchmarking and testing of graph data management techniques.

Being held in conjunction with the 29th IEEE International Conference on Data Engineering, Brisbane, Australia, April 8-12, 2013.

Scalding for the Impatient

Filed under: Cascading,Scala,Scalding,TF-IDF — Patrick Durusau @ 5:39 pm

Scalding for the Impatient by Sujit Pal.

From the post:

Few weeks ago, I wrote about Pig, a DSL that allows you to specify a data processing flow in terms of PigLatin operations, and results in a sequence of Map-Reduce jobs on the backend. Cascading is similar to Pig, except that it provides a (functional) Java API to specify a data processing flow. One obvious advantage is that everything can now be in a single language (no more having to worry about UDF integration issues). But there are others as well, as detailed here and here.

Cascading is well documented, and there is also a very entertaining series of articles titled Cascading for the Impatient that builds up a Cascading application to calculate TF-IDF of terms in a (small) corpus. The objective is to showcase the features one would need to get up and running quickly with Cascading.

Scalding is a Scala DSL built on top of Cascading. As you would expect, Cascading code is an order of magnitude shorter than equivalent Map-Reduce code. But because Java is not a functional language, implementing functional constructs leads to some verbosity in Cascading that is eliminated in Scalding, leading to even shorter and more readable code.

I was looking for something to try my newly acquired Scala skills on, so I hit upon the idea of building up a similar application to calculate TF-IDF for terms in a corpus. The table below summarizes the progression of the Cascading for the Impatient series. I’ve provided links to the original articles for the theory (which is very nicely explained there) and links to the source codes for both the Cascading and Scalding versions.
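
If TF-IDF itself is unfamiliar, the quantity being computed is only a few lines of arithmetic. A plain Python sketch of the math, with made-up documents (the Scalding version in the post expresses the same steps as a flow of pipe operations):

    import math
    from collections import Counter

    # Two tiny "documents"; a real corpus would be tokenized properly.
    docs = {
        "doc1": "the quick brown fox".split(),
        "doc2": "the lazy brown dog".split(),
    }

    def tf_idf(docs):
        n_docs = len(docs)
        # Document frequency: how many documents does each term appear in?
        df = Counter(term for tokens in docs.values() for term in set(tokens))
        scores = {}
        for doc_id, tokens in docs.items():
            tf = Counter(tokens)
            scores[doc_id] = {
                term: (count / len(tokens)) * math.log(n_docs / df[term])
                for term, count in tf.items()
            }
        return scores

    print(tf_idf(docs)["doc1"])  # "quick" and "fox" score highest; "the" and "brown" score 0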

A very nice side by side comparison and likely to make you interested in Scalding.

Measuring similarity and distance function

Filed under: Distance,Similarity — Patrick Durusau @ 4:42 pm

Measuring similarity and distance function by Ricky Ho.

Ricky covers:

  • Distance between numeric data points
  • Distance between categorical data points
  • Distance between mixed categorical and numeric data points
  • Distance between sequence (String, TimeSeries)
  • Distance between nodes in a network
  • Distance between population distribution

Not every measure or function but enough to get you started.
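
If you want to experiment alongside Ricky's post, several of the standard measures fit in a handful of lines. A Python sketch (my examples, not Ricky's code):

    import math

    def euclidean(a, b):
        # Numeric data points.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def hamming(a, b):
        # Categorical data points or equal-length sequences: count the mismatched positions.
        return sum(x != y for x, y in zip(a, b))

    def jaccard_distance(a, b):
        # Set-valued data, e.g. the neighbor sets of two nodes in a network.
        a, b = set(a), set(b)
        return 1 - len(a & b) / len(a | b)

    print(euclidean([1, 2], [4, 6]))       # 5.0
    print(hamming("karolin", "kathrin"))   # 3
    print(jaccard_distance("abc", "bcd"))  # 0.5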

St. Laurent on Balisage

Filed under: JSON,XML — Patrick Durusau @ 4:29 pm

Applying markup to complexity: The blurry line between markup and programming by Simon St. Laurent.

Simon’s review of Balisage will make you want to attend next year, if you missed this year.

He misses an important issue with JSON (and XML) when he writes:

JSON gave programmers much of what they wanted: a simple format for shuttling (and sometimes storing) loosely structured data. Its simpler toolset, freed of a heritage of document formats and schemas, let programmers think less about information formats and more about the content of what they were sending.

XML and JSON look at data through different lenses. XML is a tree structure of elements, attributes, and content, while JSON is arrays, objects, and values. Element order matters by default in XML, while JSON is far less ordered and contains many more anonymous structures. (emphasis added)

The problem with JSON in a nutshell (apologies to O’Reilly): anonymous structures.

How is a subsequent programmer going to discover the semantics of “anonymous structures?”
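
A quick illustration, with made-up data:

    import json

    # Anonymous structure: position carries the meaning, but nothing records what the positions mean.
    anonymous = json.dumps([["AAPL", 665.15, 1344614400], ["GOOG", 642.0, 1344614400]])

    # The same data with the relationships named explicitly.
    named = json.dumps([
        {"ticker": "AAPL", "price": 665.15, "timestamp": 1344614400},
        {"ticker": "GOOG", "price": 642.0, "timestamp": 1344614400},
    ])
    print(anonymous)
    print(named)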

Works great for job security, works less well for information integration several “generations” of programmers later.

XML can be poorly documented, just like JSON, but relationships between elements are explicit.

Anonymity, of all kinds, is the enemy of re-use of data, semantic integration and useful archiving of data.

If those aren’t your use cases, use anonymous JSON structures. (Or undocumented XML.)

DATA MINING: Accelerating Drug Discovery by Text Mining of Patents

Filed under: Contest,Data Mining,Drug Discovery,Patents,Text Mining — Patrick Durusau @ 1:34 pm

DATA MINING: Accelerating Drug Discovery by Text Mining of Patents

From the contest page:

Patent documents contain important research that is valuable to the industry, business, law, and policy-making communities. Take the patent documents from the United States Patent and Trademark Office (USPTO) as examples. The structured data include: filing date, application date, assignees, UPC (US Patent Classification) codes, IPC codes, and others, while the unstructured segments include: title, abstract, claims, and description of the invention. The description of the invention can be further segmented into field of the invention, background, summary, and detailed description.

Given a set of “Source” patents or documents, we can use text mining to identify patents that are “similar” and “relevant” for the purpose of discovery of drug variants. These relevant patents could further be clustered and visualized appropriately to reveal implicit, previously unknown, and potentially useful patterns.

The eventual goal is to obtain a focused and relevant subset of patents, relationships and patterns to accelerate discovery of variations or evolutions of the drugs represented by the “source” patents.

Timeline:

  • July 19, 2012 – Start of the Contest Part 1
  • August 23, 2012 – Deadline for Submission of Ontology deliverables
  • August 24 to August 29, 2012 – Crowdsourced And Expert Evaluation for Part 1. NO SUBMISSIONS ACCEPTED for contest during this week.
  • Milestone 1: August 30, 2012 – Winner for Part 1 contest announced and Ontology release to the community for Contest Part 2
  • Aug. 31 to Sept. 21, 2012 – Contest Part 2 Begins – Data Exploration / Text Mining of Patent Data
  • Milestone 2: Sept. 21, 2012 – Deadline for Submission Contest Part 2. FULL CONTEST CLOSING.
  • Sept. 22 to Oct. 5, 2012 – Crowdsourced and Expert Evaluation for contest Part 2
  • Milestone 3: Oct. 5, 2012 – Conditional Winners Announcement 

Possibly fertile ground for demonstrating the value of topic maps.

Particularly if you think of topic maps as curating search strategies and results.

Think about that for a moment: curating search strategies and results.

We have all asked reference librarians or other power searchers for assistance and watched while they discovered resources we didn’t imagine existed.

What if for medical expert searchers, we curate the “search request” along with the “search strategy” and the “result” of that search?

Such that we can match future search requests up with likely search strategies?

What we are capturing is the expert's understanding and recognition of subjects not apparent to the average user, in a form that lets us make use of it again in the future.
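
A toy sketch of what curating requests, strategies and results could look like. Everything here is hypothetical, the matcher most of all:

    # Hypothetical curated record: the request, the strategy that answered it, the results.
    curated = [
        {
            "request": "variants of tyrosol in olive oil patents",
            "strategy": 'tyrosol* OR hydroxytyrosol* OR "4-hydroxyphenylethanol"',
            "results": ["US1234567", "US2345678"],  # placeholder patent numbers
        },
    ]

    def suggest_strategy(new_request):
        # Toy matcher: reuse the strategy whose curated request shares the most words.
        words = set(new_request.lower().split())
        scored = [(len(words & set(c["request"].lower().split())), c) for c in curated]
        overlap, best = max(scored, key=lambda pair: pair[0])
        return best["strategy"] if overlap else None

    print(suggest_strategy("tyrosol variants"))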

If you aren’t interested in medical research, how about: Accelerating Discovery of Trolls by Text Mining of Patents? 😉

I first saw this at KDNuggets.


Update: 13 August 2012

Tweet by Lars Marius Garshol points to: Patent troll Intellectual Ventures is more like a HYDRA.

Even a low-end estimate – the patents actually recorded in the USPTO as being assigned to one of those shells – identifies around 10,000 patents held by the firm.

At the upper end of the researchers’ estimates, Intellectual Ventures would rank as the fifth-largest patent holder in the United States and among the top fifteen patent holders worldwide.

As sad as that sounds, remember this is one (1) troll. There are others.

Systematic benchmark of substructure search in molecular graphs – From Ullmann to VF2

Filed under: Algorithms,Bioinformatics,Graphs,Molecular Graphs — Patrick Durusau @ 1:11 pm

Systematic benchmark of substructure search in molecular graphs – From Ullmann to VF2 by Hans-Christian Ehrlich and Matthias Rarey. (Journal of Cheminformatics 2012, 4:13 doi:10.1186/1758-2946-4-13)

Abstract:

Background

Searching for substructures in molecules belongs to the most elementary tasks in cheminformatics and is nowadays part of virtually every cheminformatics software. The underlying algorithms, used over several decades, are designed for the application to general graphs. Applied on molecular graphs, little effort has been spend on characterizing their performance. Therefore, it is not clear how current substructure search algorithms behave on such special graphs. One of the main reasons why such an evaluation was not performed in the past was the absence of appropriate data sets.

Results

In this paper, we present a systematic evaluation of Ullmann’s and the VF2 subgraph isomorphism algorithms on molecular data. The benchmark set consists of a collection of 1236 SMARTS substructure expressions and selected molecules from the ZINC database. The benchmark evaluates substructures search times for complete database scans as well as individual substructure-molecule-pairs. In detail, we focus on the influence of substructure formulation and size, the impact of molecule size, and the ability of both algorithms to be used on multiple cores.

Conclusions

The results show a clear superiority of the VF2 algorithm in all test scenarios. In general, both algorithms solve most instances in less than one millisecond, which we consider to be acceptable. Still, in direct comparison, the VF2 is most often several folds faster than Ullmann’s algorithm. Additionally, Ullmann’s algorithm shows a surprising number of run time outliers.

Questions:

How do your graphs compare to molecular graphs? Similarities? Differences?

For searching molecular graphs, what algorithm does your software use for substructure searches?
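
If you want to experiment, networkx's GraphMatcher (a VF2 implementation) is an easy place to start. A minimal sketch with a toy "molecule":

    import networkx as nx
    from networkx.algorithms import isomorphism

    # Toy molecular graph: nodes carry an element label, edges are bonds.
    mol = nx.Graph()
    mol.add_nodes_from([(0, {"element": "C"}), (1, {"element": "C"}), (2, {"element": "O"})])
    mol.add_edges_from([(0, 1), (1, 2)])

    # Substructure query: a carbon bonded to an oxygen.
    query = nx.Graph()
    query.add_nodes_from([(0, {"element": "C"}), (1, {"element": "O"})])
    query.add_edge(0, 1)

    node_match = isomorphism.categorical_node_match("element", None)
    matcher = isomorphism.GraphMatcher(mol, query, node_match=node_match)
    print(matcher.subgraph_is_isomorphic())  # True: the query occurs in the molecule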

The Semantics of Chemical Markup Language (CML) for Computational Chemistry : CompChem

Filed under: Chemical Markup Language (CML),Cheminformatics,CompChem — Patrick Durusau @ 12:52 pm

The Semantics of Chemical Markup Language (CML) for Computational Chemistry : CompChem by Weerapong Phadungsukanan, Markus Kraft, Joe A Townsend and Peter Murray-Rust (Journal of Cheminformatics 2012, 4:15 doi:10.1186/1758-2946-4-15)

Abstract (provisional):

This paper introduces a subdomain chemistry format for storing computational chemistry data called CompChem. It has been developed based on the design, concepts and methodologies of Chemical Markup Language (CML) by adding computational chemistry semantics on top of the CML Schema. The format allows a wide range of ab initio quantum chemistry calculations of individual molecules to be stored. These calculations include, for example, single point energy calculation, molecular geometry optimization, and vibrational frequency analysis. The paper also describes the supporting infrastructure, such as processing software, dictionaries, validation tools and database repository. In addition, some of the challenges and difficulties in developing common computational chemistry dictionaries are being discussed. The uses of CompChem are illustrated on two practical applications.

Important contribution if you are working with computational chemistry semantics.

Also important for its demonstration of the value of dictionaries and not trying to be all inclusive.

Integrate the data you have at hand and make allowance for the yet to be known.

Besides, there is always the next topic map that may consume the first with new merging rules.

CompChem Convention http://www.xml-cml.org/convention/compchem

CompChem dictionary http://www.xml-cml.org/dictionary/compchem/

CompChem validation stylesheet https://bitbucket.org/wwmm/cml-specs

CMLValidator http://bitbucket.org/cml/cmllite-validator-code

Chemical Markup Language (CML) http://www.xml-cml.org

Calais Release 4.6 is Available for Beta Testing [Through 23rd of August]

Filed under: Natural Language Processing,OpenCalais — Patrick Durusau @ 12:16 pm

Calais Release 4.6 is Available for Beta Testing [Through 23rd of August]

From the post:

As we mentioned in our prior post, Version 4.6 of OpenCalais is now available for beta testing. While we should have 100% backward compatibility – it’s always a good idea to run a set of transaction through and make sure there are no issues.

You’ll see a number of new things in this release:

  • Under the covers we’ve upgraded our core processing engine. While this won’t directly affect you as an end user – it does set the stage for further improvements in the future.
  • We’ve improved the quality of the Company and Person extraction. Not surprisingly, these are two of our most frequently used concepts and we want them to be insanely great – we’re getting there.
  • We’ve updated and refreshed our Social Tags feature. If you haven’t had a chance to experiment with Social Tags in the past, give it a try. This is a great way to immediately improve the “findability” of your content.
  • We’ve introduced six new concepts that we’ll discuss below.

  • PersonParty extracts information about the affiliation of a person with a political party.
  • CandidatePosition extracts information on past, current and aspirational political positions for a candidate.
  • ArmedAttack extracts information regarding an attack by a person or organization on a country, organization or political figure.
  • MilitaryAction extracts references to non-combative military actions such as troop deployments or movements.
  • ArmsPurchaseSale extracts information on planned, proposed or consummated arms sales.
  • PersonLocation extracts information on where a person lives or is traveling.

So, it’s the Politics and Conflict pack – always popular topics.

More details at the post (including release notes).

Get your comments in early! Planned end of beta test: 23rd of August 2012.

X3DOM

Filed under: Graphics,Interface Research/Design,X3DOM — Patrick Durusau @ 12:06 pm

From the about page:

X3DOM (pronounced X-Freedom) is an experimental open source framework and runtime to support the ongoing discussion in the Web3D and W3C communities how an integration of HTML5 and declarative 3D content could look like. It tries to fulfill the current HTML5 specification for declarative 3D content and allows including X3D elements as part of any HTML5 DOM tree.

I had to get past two empty PR releases and finally search the Web for a useful URL.

Even the about page has a great demo. It also has links to more information.

Not stable yet, but it merits your attention for authoring topic map interfaces.

Pointers to your interfaces, topic map or otherwise, using X3DOM, greatly appreciated!

August 11, 2012

Themes in streaming algorithms (workshop at TU Dortmund)

Filed under: Algorithms,Data Streams,Stream Analytics — Patrick Durusau @ 8:31 pm

Themes in streaming algorithms (workshop at TU Dortmund) by Anna C. Gilbert.

From the post:

I recently attended the streaming algorithms workshop at Technische Universität Dortmund. It was a follow-on to the very successful series of streaming algorithms workshops held in Kanpur over the last six years. Suresh and Andrew have both given excellent summaries of the individual talks at the workshop (see day 1, day 2, and day 3) so, as both a streaming algorithms insider and outsider, I thought it would be good to give a high-level overview of what themes there are in streaming algorithms research these days, to identify new areas of research and to highlight advances in existing areas.

Anna gives the briefest of summaries but I think they will entice you to look further.

Curious, how would you distinguish a “stream of data” from “read-once data”?

That is, in the second case you only get one pass at reading the data. Errors are just that, errors, but you can’t look back to be sure.
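
Reservoir sampling is the classic illustration of that constraint: each item is examined exactly once and there is no going back. A minimal sketch:

    import random

    def reservoir_sample(stream, k, seed=0):
        # One pass, bounded memory: every item is seen once and never revisited.
        rng = random.Random(seed)
        sample = []
        for i, item in enumerate(stream):
            if i < k:
                sample.append(item)
            else:
                j = rng.randint(0, i)  # uniform over all positions seen so far
                if j < k:
                    sample[j] = item
        return sample

    print(reservoir_sample(range(1_000_000), k=5))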

Is data “integrity” an artifact of small data sets and under-powered computers?

Yu and Robinson on The Ambiguity of “Open Government”

Filed under: Ambiguity,Government,Law,Open Government — Patrick Durusau @ 8:14 pm

Yu and Robinson on The Ambiguity of “Open Government”

Legal Informatics calls our attention to the use of ambiguity to blunt, at least in one view, the potency of the phrase “open government.”

Whatever your politics, it is a reminder that for good or ill, semantics originate with us.

Topic maps are one tool to map those semantics, to remove (or enhance) ambiguity.

Lima on Visualization and Legislative Memory of the Brazilian Civil Code

Filed under: Law,Law - Sources,Legal Informatics — Patrick Durusau @ 6:28 pm

Lima on Visualization and Legislative Memory of the Brazilian Civil Code

Legal Informatics reports the publication of the legislative history of the Brazilian Civil Code, along with a visualization of the code.

Tying in Planiol’s Treatise on Civil Law (or other commentators) to such resources would make a nice showcase for topic maps.

Neo4j and Bioinformatics

Filed under: Bio4j,Bioinformatics,Graphs,Neo4j — Patrick Durusau @ 6:01 pm

Neo4j and Bioinformatics

From the description:

Pablo Pareja will give an overview of the Bio4j project, and then move to some of its recent applications: BG7, a new system for bacterial genome annotation designed for NGS data; MG7, metagenomics + taxonomy integration; and evolutionary studies, transcriptional networks, network analysis…

It may just be me but the sound seems “faint.” Even when set to full volume, it is difficult to hear Pablo clearly.

I have tried this on two different computers with different OSes so I don’t think it is a problem on my end.

Your experience?

BTW, slides are here.

Getting Started With Hyperdex

Filed under: HyperDex,Hypergraphs,Hyperspace — Patrick Durusau @ 3:44 pm

Getting Started With Hyperdex by Ṣeyi Ogunyẹ́mi.

From the post:

Alright, let’s start this off with a fitting soundtrack just because we can. Open it up in a tab and come back?

Greetings, valiant adventurer!

So, I heard you care about data. You aren’t storing your precious data in anything that acknowledges PUT requests before being certain it’ll be able to return it to you? Well then, you’ve come to the right place.

Okay, I’m clearly excited, but with good reason. Some time in the past few months, I ran into a paper; “HyperDex: A Distributed, Searchable Key-Value Store”1 from a team at Cornell. By now the typical reaction to NoSQL news tends to be that your eyes glaze over and you start mouthing “…is Web-Scale™”, but this isn’t “yet another NoSQL database”. So, I’ve finally gotten round to writing this piece in hopes of sharing it with others.

Before plunging into the deep end, it’s probably a good idea to discuss why I’ve found HyperDex to be particularly exciting. For reasons that will probably be in a different blog post, I’ve been researching the design of a distributed key/value store with support for strong consistency (for the morbidly curious, it’s connected to Ampify). You must realise that the state-of-the-art distributed key/value stores such as Dynamo (and it’s open-source clone, Riak) tend to aim for eventual consistency.

If you aren’t already experimenting with Hyperdex you may well be after reading this post.

Confusing Statistical Term #7: GLM

Filed under: Names,Statistics — Patrick Durusau @ 3:43 pm

Confusing Statistical Term #7: GLM by Karen Grace-Martin.

From the post:

Like some of the other terms in our list–level and beta–GLM has two different meanings.

It’s a little different than the others, though, because it’s an abbreviation for two different terms:

General Linear Model and Generalized Linear Model.

It’s extra confusing because their names are so similar on top of having the same abbreviation.

And, oh yeah, Generalized Linear Models are an extension of General Linear Models.

And neither should be confused with Generalized Linear Mixed Models, abbreviated GLMM.

Naturally.

So what’s the difference? And does it really matter?

As you probably have guessed, yes.

You will need a reading knowledge of statistics to really appreciate the post. If you don’t have such knowledge, now would be a good time to pick it up.

Statistics are a way of summarizing information about subjects. You can rely on the judgements of others on such summaries or you can have your own.
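
To see the two meanings side by side, here is a minimal sketch using statsmodels (assuming it is installed; the data is simulated):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(100, 2)))                  # design matrix with intercept
    y_cont = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=100)  # continuous response
    y_count = rng.poisson(np.exp(X @ np.array([0.2, 0.5, -0.3])))   # count response

    # General Linear Model: ordinary least squares, normal errors, identity link.
    general = sm.OLS(y_cont, X).fit()

    # Generalized Linear Model: exponential-family errors plus a link function.
    generalized = sm.GLM(y_count, X, family=sm.families.Poisson()).fit()

    print(general.params)
    print(generalized.params)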

HBase FuzzyRowFilter: Alternative to Secondary Indexes

Filed under: HBase — Patrick Durusau @ 3:43 pm

HBase FuzzyRowFilter: Alternative to Secondary Indexes by Alex Baranau.

From the post:

In this post we’ll explain the usage of FuzzyRowFilter which can help in many situations where secondary indexes solutions seems to be the only choice to avoid full table scans.

Background

When it comes to HBase the way you design your row key affects everything. It is a common pattern to have composite row key which consists of several parts, e.g. userId_actionId_timestamp. This allows for fast fetching of rows (or single row) based on start/stop row keys which have to be a prefix of the row keys you want to select. E.g. one may select last time of userX logged in by specifying row key prefix “userX_login_”. Or last action of userX by fetching the first row with prefix “userX_”. These partial row key scans work very fast and does not require scanning the whole table: HBase storage is optimized to make them fast.

Problem

However, there are cases when you need to fetch data based on key parts which happen to be in the middle of the row key. In the example above you may want to find last logged in users. When you don’t know the first parts of the key partial row key scan turns into full table scan which might be very slow and resource intensive.

Although Alex notes the solution he presents is no “silver bullet,” it illustrates:

  • The impact of key design on later usage.
  • The importance of knowing all your options for query performance.
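
The fuzzy-matching idea itself is small enough to sketch. Plain Python, not the actual HBase FuzzyRowFilter API, and the row keys are made up:

    def fuzzy_match(row_key, pattern, wildcard="?"):
        # Fixed positions must match exactly; wildcard positions may hold anything.
        if len(row_key) != len(pattern):
            return False
        return all(p == wildcard or p == k for k, p in zip(row_key, pattern))

    rows = ["userA_login_20120810", "userA_click_20120810", "userB_login_20120811"]
    # Fix the "login" action in the middle of the key, leave userId and timestamp fuzzy.
    pattern = "?????_login_????????"
    print([r for r in rows if fuzzy_match(r, pattern)])  # the two login rows, no secondary index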

I would capture the availability of the “FuzzyRowFilter,” key structure and cardinality of data using a topic map. Saves the next HBase administrator time and effort.

True, they can always work out the details for themselves, but then they may not have your analytical skills.

Scale and NoSQL Data Models

Filed under: Graphs,Neo4j,Networks — Patrick Durusau @ 3:42 pm

I am sure you have seen the Neo4j graphic:

NoSQL graphic

in almost every Neo4j presentation.

Seeing the graphic dozens, if not hundreds, of times made me realize it has two fundamental flaws.

First, if the dotted line represents 90% on the size axis, the scale of the size axis must change at the 90% mark or thereabouts.

Otherwise, key/value stores are at 180% of size. A marketing point for them but an unlikely one for anyone to credit.

Second, the complexity axis has no scale at all. Or at least not one that I can discern.

If you take a standard document database, say a CMS system, why is it more complex than a key/value store?

Or a bigtable clone for that matter?

Don’t get me wrong, I still think the future of data processing lies with graph databases.

Or more accurately, with the explicit identification/representation of relationships.

But I don’t need misleading graphics to make that case.

OPLSS 2012

Filed under: Language,Language Design,Programming,Types — Patrick Durusau @ 3:42 pm

OPLSS 2012 by Robert Harper.

From the post:

The 2012 edition of the Oregon Programming Languages Summer School was another huge success, drawing a capacity crowd of 100 eager students anxious to learn the latest ideas from an impressive slate of speakers. This year, as last year, the focus was on languages, logic, and verification, from both a theoretical and a practical perspective. The students have a wide range of backgrounds, some already experts in many of the topics of the school, others with little or no prior experience with logic or semantics. Surprisingly, a large percentage (well more than half, perhaps as many as three quarters) had some experience using Coq, a large uptick from previous years. This seems to represent a generational shift—whereas such topics were even relatively recently seen as the province of a few zealots out in left field, nowadays students seem to accept the basic principles of functional programming, type theory, and verification as a given. It’s a victory for the field, and extremely gratifying for those of us who have been pressing forward with these ideas for decades despite resistance from the great unwashed. But it’s also bittersweet, because in some ways it’s more fun to be among the very few who have created the future long before it comes to pass. But such are the proceeds of success.

As if a post meriting your attention wasn’t enough, it concludes with:

Videos of the lectures, and course notes provided by the speakers, are all available at the OPLSS 12 web site.

Just a summary of what you will find:

  • Logical relations — Amal Ahmed
  • Category theory foundations — Steve Awodey
  • Proofs as Processes — Robert Constable
  • Polarization and focalization — Pierre-Louis Curien
  • Type theory foundations — Robert Harper
  • Monads and all that — John Hughes
  • Compiler verification — Xavier Leroy
  • Language-based security — Andrew Myers
  • Proof theory foundations — Frank Pfenning
  • Software foundations in Coq — Benjamin Pierce

Enjoy!

August 10, 2012

The Story Behind “Scaling Metagenome Assembly with Probabilistic de Bruijn Graphs”

Filed under: Bioinformatics,Biomedical,De Bruijn Graphs,Genome,Graphs — Patrick Durusau @ 3:11 pm

The Story Behind “Scaling Metagenome Assembly with Probabilistic de Bruijn Graphs” by C. Titus Brown.

From the post:

This is the story behind our PNAS paper, “Scaling Metagenome Assembly with Probabilistic de Bruijn Graphs” (released from embargo this past Monday).

Why did we write it? How did it get started? Well, rewind the tape 2 years and more…

There we were in May 2010, sitting on 500 million Illumina reads from shotgun DNA sequencing of an Iowa prairie soil sample. We wanted to reconstruct the microbial community contents and structure of the soil sample, but we couldn’t figure out how to do that from the data. We knew that, in theory, the data contained a number of partial microbial genomes, and we had a technique — de novo genome assembly — that could (again, in theory) reconstruct those partial genomes. But when we ran the software, it choked — 500 million reads was too big a data set for the software and computers we had. Plus, we were looking forward to the future, when we would get even more data; if the software was dying on us now, what would we do when we had 10, 100, or 1000 times as much data?
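
If de Bruijn graphs are new to you, the basic (non-probabilistic) construction is tiny; the paper's trick, as I read it, is storing the node set probabilistically in Bloom filters so it fits in memory. A toy sketch of the basic construction:

    from collections import defaultdict

    def de_bruijn(reads, k):
        # Nodes are (k-1)-mers; an edge records that a k-mer was observed in some read.
        graph = defaultdict(set)
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                graph[kmer[:-1]].add(kmer[1:])
        return graph

    reads = ["ACGTAC", "CGTACG"]
    for node, successors in sorted(de_bruijn(reads, k=4).items()):
        print(node, "->", sorted(successors))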

A perfect post to read over the weekend!

Not all research ends successfully, but when it does, it is a story that inspires.

From Overload to Impact: An Industry Scorecard on Big Data Business Challenges [Oracle Report]

From Overload to Impact: An Industry Scorecard on Big Data Business Challenges [Oracle Report]

Summary:

IT powers today’s enterprises, which is particularly true for the world’s most data-intensive industries. Organizations in these highly specialized industries increasingly require focused IT solutions, including those developed specifically for their industry, to meet their most pressing business challenges, manage and extract insight from ever-growing data volumes, improve customer service, and, most importantly, capitalize on new business opportunities.

The need for better data management is all too acute, but how are enterprises doing? Oracle surveyed 333 C-level executives from U.S. and Canadian enterprises spanning 11 industries to determine the pain points they face regarding managing the deluge of data coming into their organizations and how well they are able to use information to drive profit and growth.

Key Findings:

  • 94% of C-level executives say their organization is collecting and managing more business information today than two years ago, by an average of 86% more
  • 29% of executives give their organization a “D” or “F” in preparedness to manage the data deluge
  • 93% of executives believe their organization is losing revenue – on average, 14% annually – as a result of not being able to fully leverage the information they collect
  • Nearly all surveyed (97%) say their organization must make a change to improve information optimization over the next two years
  • Industry-specific applications are an important part of the mix; 77% of organizations surveyed use them today to run their enterprise—and they are looking for more tailored options

What key finding did they miss?

They cover it in the forty-two (42) page report but it doesn’t appear here.

Care to guess what it is?

Forgotten key finding post coming Monday, 13 August 2012. Watch for it!

I first saw this at Beyond Search.

[C]rowdsourcing … knowledge base construction

Filed under: Biomedical,Crowd Sourcing,Data Mining,Medical Informatics — Patrick Durusau @ 1:48 pm

Development and evaluation of a crowdsourcing methodology for knowledge base construction: identifying relationships between clinical problems and medications by Allison B McCoy, Adam Wright, Archana Laxmisan, Madelene J Ottosen, Jacob A McCoy, David Butten, and Dean F Sittig. (J Am Med Inform Assoc 2012; 19:713-718 doi:10.1136/amiajnl-2012-000852)

Abstract:

Objective We describe a novel, crowdsourcing method for generating a knowledge base of problem–medication pairs that takes advantage of manually asserted links between medications and problems.

Methods Through iterative review, we developed metrics to estimate the appropriateness of manually entered problem–medication links for inclusion in a knowledge base that can be used to infer previously unasserted links between problems and medications.

Results Clinicians manually linked 231 223 medications (55.30% of prescribed medications) to problems within the electronic health record, generating 41 203 distinct problem–medication pairs, although not all were accurate. We developed methods to evaluate the accuracy of the pairs, and after limiting the pairs to those meeting an estimated 95% appropriateness threshold, 11 166 pairs remained. The pairs in the knowledge base accounted for 183 127 total links asserted (76.47% of all links). Retrospective application of the knowledge base linked 68 316 medications not previously linked by a clinician to an indicated problem (36.53% of unlinked medications). Expert review of the combined knowledge base, including inferred and manually linked problem–medication pairs, found a sensitivity of 65.8% and a specificity of 97.9%.

Conclusion Crowdsourcing is an effective, inexpensive method for generating a knowledge base of problem–medication pairs that is automatically mapped to local terminologies, up-to-date, and reflective of local prescribing practices and trends.

I would not apply the term “crowdsourcing” here, in part because the “crowd” is hardly unknown. Not a crowd at all, but an identifiable group of clinicians.

That doesn’t invalidate the results, which show the utility of data mining for creating knowledge bases.
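
For reference, the two evaluation numbers quoted above reduce to a few lines. The counts below are made up, chosen only to reproduce the quoted rates:

    def sensitivity_specificity(tp, fp, tn, fn):
        # Sensitivity: share of true links recovered. Specificity: share of non-links rejected.
        return tp / (tp + fn), tn / (tn + fp)

    sens, spec = sensitivity_specificity(tp=658, fn=342, tn=979, fp=21)
    print(f"sensitivity={sens:.1%}  specificity={spec:.1%}")  # 65.8% / 97.9%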

As a matter of usage, let’s not confuse anonymous “crowds” with specific groups of people.

Data-Intensive Librarians for Data-Intensive Research

Data-Intensive Librarians for Data-Intensive Research by Chelcie Rowell.

From the post:

A packed house heard Tony Hey and Clifford Lynch present on The Fourth Paradigm: Data-Intensive Research, Digital Scholarship and Implications for Libraries at the 2012 ALA Annual Conference.

Jim Gray coined The Fourth Paradigm in 2007 to reflect a movement toward data-intensive science. Adapting to this change would, Gray noted, require an infrastructure to support the dissemination of both published work and underlying research data. But the return on investment for building the infrastructure would be to accelerate the transformation of raw data to recombined data to knowledge.

In outlining the current research landscape, Hey and Lynch underscored how right Gray was.

Hey led the audience on a whirlwind tour of how scientific research is practiced in the Fourth Paradigm. He showcased several projects that manage data from capture to curation to analysis and long-term preservation. One example he mentioned was the Dataverse Network Project that is working to preserve diverse scholarly outputs from published work to data, images and software.

Lynch reflected on the changing nature of the scientific record and the different collaborative structures that will be needed to define, generate and preserve that record. He noted that we tend to think of the scholarly record in terms of published works. In light of data-intensive science, Lynch said the definition must be expanded to include the datasets which underlie results and the software required to render data.

I wasn’t able to find a video of the presentations and/or slides but while you wait for those to appear, you can consult the homepages of Lynch and Hey for related materials.

Librarians already have searching and bibliographic skills, which are appropriate to the Fourth Paradigm.

What if they were to add big data design, if not processing, skills to their resumes?

What if articles in professional journals carried a byline in addition to the authors: Librarian(s): ?

Phenol-Explorer 2.0:… [Topic Maps As Search Templates]

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 12:31 pm

Phenol-Explorer 2.0: a major update of the Phenol-Explorer database integrating data on polyphenol metabolism and pharmacokinetics in humans and experimental animals by Joseph A. Rothwell, Mireia Urpi-Sarda, Maria Boto-Ordoñez, Craig Knox, Rafael Llorach, Roman Eisner, Joseph Cruz, Vanessa Neveu, David Wishart, Claudine Manach, Cristina Andres-Lacueva, and Augustin Scalbert.

Abstract:

Phenol-Explorer, launched in 2009, is the only comprehensive web-based database on the content in foods of polyphenols, a major class of food bioactives that receive considerable attention due to their role in the prevention of diseases. Polyphenols are rarely absorbed and excreted in their ingested forms, but extensively metabolized in the body, and until now, no database has allowed the recall of identities and concentrations of polyphenol metabolites in biofluids after the consumption of polyphenol-rich sources. Knowledge of these metabolites is essential in the planning of experiments whose aim is to elucidate the effects of polyphenols on health. Release 2.0 is the first major update of the database, allowing the rapid retrieval of data on the biotransformations and pharmacokinetics of dietary polyphenols. Data on 375 polyphenol metabolites identified in urine and plasma were collected from 236 peer-reviewed publications on polyphenol metabolism in humans and experimental animals and added to the database by means of an extended relational design. Pharmacokinetic parameters have been collected and can be retrieved in both tabular and graphical form. The web interface has been enhanced and now allows the filtering of information according to various criteria. Phenol-Explorer 2.0, which will be periodically updated, should prove to be an even more useful and capable resource for polyphenol scientists because bioactivities and health effects of polyphenols are dependent on the nature and concentrations of metabolites reaching the target tissues. The Phenol-Explorer database is publicly available and can be found online at http://www.phenol-explorer.eu.

I wanted to call your attention to Table 1: Search Strategy and Terms, step 4 which reads:

Polyphenol* or flavan* or flavon* or anthocyan* or isoflav* or phytoestrogen* or phyto-estrogen* or lignin* or stilbene* or chalcon* or phenolic acid* or ellagic* or coumarin* or hydroxycinnamic* or quercetin* or kaempferol* or rutin* or apigenin* or luteolin* or catechin* or epicatechin* or gallocatechin* or epigallocatechin* or procyanidin* or hesperetin* or naringenin* or cyanidin* or malvidin* or petunid* or peonid* or daidz* or genist* or glycit* or equol* or gallic* or vanillic* or chlorogenic* or tyrosol* or hydoxytyrosol* or resveratrol* or viniferin*

Which of these terms are synonyms for “tyrosol”?

No peeking!

Wikipedia (a generalist source) lists five (5) names, including tyrosol, and 5 different identifiers.

Common Chemistry, which you can access by the CAS number, has twenty-one (21) synonyms.

Ready?

Would you believe 0?

See for yourself: Wikipedia Tyrosol; Common Chemistry – CAS 501-94-0.

Another question: In one week (or even tomorrow), how much of the query in step 4 will you remember?

Some obvious comments:

  • The creators of Phenol-Explorer 2.0 have done a great service to the community by curating this data resource.
  • Creating comprehensive queries is a creative enterprise and not easy to duplicate.

Perhaps less obvious comments:

  • The terms in the query have synonyms, which is no great surprise.
  • If the terms were represented as topics in a topic map, synonyms could be captured for those terms.
  • Capturing of synonyms for terms would support expansion or contraction of search queries.
  • Capturing terms (and their synonyms) in a topic map, would permit merging of terms/synonyms from other researchers.

Final question: Have you thought about using topic maps as search templates?
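
A minimal sketch of the idea, with an illustrative synonym table (don't treat the entries as authoritative):

    # Illustrative synonym table; a topic map would let curators merge tables like this one.
    synonyms = {
        "tyrosol": ["4-hydroxyphenylethanol", "2-(4-hydroxyphenyl)ethanol", "CAS 501-94-0"],
        "resveratrol": ["trans-resveratrol", "3,5,4'-trihydroxystilbene"],
    }

    def expand(term):
        # Expand a single search term into an OR query over the term and its recorded synonyms.
        names = [term] + synonyms.get(term, [])
        return " OR ".join(f'"{name}"' for name in names)

    print(expand("tyrosol"))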

MedLingMap

Filed under: Medical Informatics,Natural Language Processing — Patrick Durusau @ 9:27 am

MedLingMap

From the “welcome” entry:

MedLingMap is a growing resource providing a map of NLP systems and research in the Medical Domain. The site is being developed as part of the NLP Systems in the Medical Domain course in Brandeis University’s Computational Linguistics Master’s Program, taught by Dr. Marie Meteer. Learn more about the students doing the work.

MedLingMap brings together the many different references, resources, organizations, and people in this very diverse domain. By using a faceted indexing approach to organizing the materials, MedLingMap can capture not only the content, but also the context by including facets such as the applications of the technology, the research or development group it was done by, and the techniques and algorithms that were utilized in developing the technology.

Not a lot of resources listed but every project has to start somewhere.

Capturing the use of specific techniques and algorithms will make this a particularly useful resource.

First BOSS Data: 3-D Map of 500,000 Galaxies, 100,000 Quasars

Filed under: Astroinformatics,Data,Science — Patrick Durusau @ 9:02 am

First BOSS Data: 3-D Map of 500,000 Galaxies, 100,000 Quasars

From the post:

The Third Sloan Digital Sky Survey (SDSS-III) has issued Data Release 9 (DR9), the first public release of data from the Baryon Oscillation Spectroscopic Survey (BOSS). In this release BOSS, the largest of SDSS-III’s four surveys, provides spectra for 535,995 newly observed galaxies, 102,100 quasars, and 116,474 stars, plus new information about objects in previous Sloan surveys (SDSS-I and II).

“This is just the first of three data releases from BOSS,” says David Schlegel of the U.S. Department of Energy’s Lawrence Berkeley National Laboratory (Berkeley Lab), an astrophysicist in the Lab’s Physics Division and BOSS’s principal investigator. “By the time BOSS is complete, we will have surveyed more of the sky, out to a distance twice as deep, for a volume more than five times greater than SDSS has surveyed before — a larger volume of the universe than all previous spectroscopic surveys combined.”

Spectroscopy yields a wealth of information about astronomical objects including their motion (called redshift and written “z”), their composition, and sometimes also the density of the gas and other material that lies between them and observers on Earth. The BOSS spectra are now freely available to a public that includes amateur astronomers, astronomy professionals who are not members of the SDSS-III collaboration, and high-school science teachers and their students.

The new release lists spectra for galaxies with redshifts up to z = 0.8 (roughly 7 billion light years away) and quasars with redshifts between z = 2.1 and 3.5 (from 10 to 11.5 billion light years away). When BOSS is complete it will have measured 1.5 million galaxies and at least 150,000 quasars, as well as many thousands of stars and other “ancillary” objects for scientific projects other than BOSS’s main goal.

For data access, software tools, tutorials, etc., see: http://sdss3.org/

Interesting data set but also instructive for the sharing of data and development of tools for operations on shared data. You don’t have to have a local supercomputer to process the data. Dare I say a forerunner of the “cloud?”

Be the alpha geek at your local astronomy club this weekend!

….Comparing Digital Preservation Glossaries [Why Do We Need Common Vocabularies?]

Filed under: Archives,Digital Library,Glossary,Preservation — Patrick Durusau @ 8:28 am

From AIP to Zettabyte: Comparing Digital Preservation Glossaries

Emily Reynolds (2012 Junior Fellow) writes:

As we mentioned in our introductory post last month, the OSI Junior Fellows are working on a project involving a draft digital preservation policy framework. One component of our work is revising a glossary that accompanies the framework. We’ve spent the last two weeks poring through more than two dozen glossaries relating to digital preservation concepts to locate and refine definitions to fit the terms used in the document.

We looked at dictionaries from well-established archival entities like the Society of American Archivists, as well as more strictly technical organizations like the Internet Engineering Task Force. While some glossaries take a traditional archival approach, others were more technical; we consulted documents primarily focusing on electronic records, archives, digital storage and other relevant fields. Because of influential frameworks like the OAIS Reference Model, some terms were defined similarly across the glossaries that we looked at. But the variety in the definitions for other terms points to the range of practitioners discussing digital preservation issues, and highlights the need for a common vocabulary. Based on what we found, that vocabulary will have to be broadly drawn and flexible to meet different kinds of requirements.

OSI = Office of Strategic Initiatives (Library of Congress)

Not to be overly critical, but I stumble over:

Because of influential frameworks like the OAIS Reference Model, some terms were defined similarly across the glossaries that we looked at. But the variety in the definitions for other terms points to the range of practitioners discussing digital preservation issues, and highlights the need for a common vocabulary.

Why does a “variety in the definitions for other terms…highlight[s] the need for a common vocabulary?”

I take it as a given that we have diverse vocabularies.

And that attempts at “common” vocabularies succeed in creating yet another “diverse” vocabulary.

So, why would anyone looking at “diverse” vocabularies jump to the conclusion that a “common” vocabulary is required?

Perhaps what is missing is the definition of the problem presented by “diverse” vocabularies.

Hard to solve a problem if you don’t know what it is. (Hasn’t stopped some people that I know, but that is a story for another day.)

I put it to you (and in your absence I will answer, so answer quickly):

What is the problem (or problems) presented by diverse vocabularies? (Feel free to use examples.)

Or if you prefer, Why do we need common vocabularies?

i2b2: Informatics for Integrating Biology and the Bedside

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 4:43 am

i2b2: Informatics for Integrating Biology and the Bedside

I discovered this site while chasing down a coreference resolution workshop. From the homepage:

Informatics for Integrating Biology and the Bedside (i2b2) is an NIH-funded National Center for Biomedical Computing (NCBC) based at Partners HealthCare System in Boston, Mass. Established in 2004 in response to an NIH Roadmap Initiative RFA, this NCBC is one of four national centers awarded in this first competition (http://www.bisti.nih.gov/ncbc/); currently there are seven NCBCs. One of 12 specific initiatives in the New Pathways to Discovery Cluster, the NCBCs will initiate the development of a national computational infrastructure for biomedical computing. The NCBCs and related R01s constitute the National Program of Excellence in Biomedical Computing.

The i2b2 Center, led by Director Isaac Kohane, M.D., Ph.D., Professor of Pediatrics at Harvard Medical School at Children’s Hospital Boston, is comprised of seven cores involving investigators from the Harvard-affiliated hospitals, MIT, Harvard School of Public Health, Joslin Diabetes Center, Harvard Medical School and the Harvard/MIT Division of Health Sciences and Technology. This Center is funded under a Cooperative agreement with the National Institutes of Health.

The i2b2 Center is developing a scalable computational framework to address the bottleneck limiting the translation of genomic findings and hypotheses in model systems relevant to human health. New computational paradigms (Core 1) and methodologies (Cores 2) are being developed and tested in several diseases (airways disease, hypertension, type 2 diabetes mellitus, Huntington’s Disease, rheumatoid arthritis, and major depressive disorder) (Core 3 Driving Biological Projects).

The i2b2 Center (Core 5) offers a Summer Institute in Bioinformatics and Integrative Genomics for qualified undergraduate students, supports an Academic Users’ Group of over 125 members, sponsors annual Shared Tasks for Challenges in Natural Language Processing for Clinical Data, distributes an NLP DataSet for research purpose, and sponsors regular Symposia and Workshops for the community.

Sounds like prime hunting grounds for vocabularies that cross disciplinary boundaries and the like.

Extensive resources. Will explore and report back.
