Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 7, 2012

Networks, Crowds, and Markets

Filed under: Graphs,Networks — Patrick Durusau @ 10:46 am

Networks, Crowds, and Markets: Reasoning About a Highly Connected World by David Easley and Jon Kleinberg. (post by Max De Marzi)

Max has pointers to the Cambridge University Press edition, the web version and a PDF of the book, along with three Jon Kleinberg videos and the complete table of contents.

At 833 pages, this should keep you occupied for a while.

EU-ADR Web Platform

Filed under: Bioinformatics,Biomedical,Drug Discovery,Medical Informatics — Patrick Durusau @ 10:29 am

EU-ADR Web Platform

I was disappointed not to find the UMLS concepts and related terms mapping for participants in the EU-ADR project.

I did find these workflows at the EU-ADR Web Platform:

MEDLINE ADR

In the filtering process of well known signals, the aim of the “MEDLINE ADR” workflow is to automate the search of publications related to ADRs corresponding to a given drug/adverse event association. To do so, we defined an approach based on the MeSH thesaurus, using the subheadings «chemically induced» and «adverse effects» with the “Pharmacological Action” knowledge. Using a threshold of ≥3 extracted publications, the automated search method presented a sensitivity of 93% and a specificity of 97% on the true positive and true negative sets (WP 2.2). We then determined a threshold number of extracted publications ≥ 3 to confirm the knowledge of this association in the literature. This approach offers the opportunity to automatically determine if an ADR (association of a drug and an adverse event) has already been described in MEDLINE. However, the causality relationship between the drug and an event may be judged only by an expert reading the full text article and determining if the methodology of this article was correct and if the association is statistically significant.

MEDLINE Co-occurrence

The “MEDLINE Co-occurrence” workflow performs a comprehensive data processing operation, searching the given Drug-Event combination in the PubMed database. Final workflow results include a final score measuring the relevance of the drugs found with respect to the initial Drug-Event pair, as well as pointers to web pages for the discovered drugs.

DailyMed

The “DailyMed” workflow performs a comprehensive data processing operation, searching the given Drug-Event combination in the DailyMed database. Final workflow results include a final score measuring the relevance of the drugs found with respect to the initial Drug-Event pair, as well as pointers to web pages for the discovered drugs.

DrugBank

The “DrugBank” workflow performs a comprehensive data processing operation, searching the given Drug-Event combination in the DrugBank database. Final workflow results include a final score measuring the relevance of the drugs found with respect to the initial Drug-Event pair, as well as pointers to web pages for the discovered drugs.

Substantiation

The “Substantiation” workflow tries to establish a connection between the clinical event and the drug through a gene or protein, by identifying the proteins that are targets of the drug and are also associated with the event. In addition it also considers information about drug metabolites in this process. In such cases it can be argued that the binding of the drug to the protein would lead to the observed event phenotype. Associations between the event and proteins are found by querying our integrated gene-disease association database (Bauer-Mehren, et al., 2010). As this database provides annotations of the gene-disease associations to the articles reporting the association and in case of text-mining derived associations even the exact sentence, the article or sentence can be studied in more detail in order to inspect the supporting evidence for each gene-disease association. It has to be mentioned that our gene-disease association database also contains information about genetic variants or SNPs and their association to diseases or adverse drug events. The methodology for providing information about the binding of a drug (or metabolite) to protein targets is reported in deliverable 4.2, and includes extraction from different databases (annotated chemical libraries) and application of prediction methods based on chemical similarity.

A glimpse of what is state of the art today and a basis for building better tools for tomorrow.
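To make the first workflow a little more concrete: the “MEDLINE ADR” step boils down to building a MeSH-qualified PubMed query for the drug/event pair and comparing the hit count against the ≥3 threshold. A minimal sketch using Biopython’s Entrez module (my own approximation of that query, not the EU-ADR code; the drug and event names in the example are placeholders):

```python
from Bio import Entrez  # Biopython

Entrez.email = "you@example.org"  # NCBI requires a contact address

def medline_adr_signal(drug, event, threshold=3):
    """Count MEDLINE publications linking a drug to an adverse event via
    MeSH heading/subheading pairs, then apply the >=3 publication threshold.
    A rough approximation of the EU-ADR 'MEDLINE ADR' workflow, not its code."""
    query = (f'"{drug}/adverse effects"[MeSH Terms] AND '
             f'"{event}/chemically induced"[MeSH Terms]')
    handle = Entrez.esearch(db="pubmed", term=query, retmax=0)
    count = int(Entrez.read(handle)["Count"])
    handle.close()
    return count >= threshold, count

# Example (drug/event names are illustrative only):
# flagged, n = medline_adr_signal("rosiglitazone", "myocardial infarction")
```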

Harmonization of Reported Medical Events in Europe

Filed under: Bioinformatics,Biomedical,Health care,Medical Informatics — Patrick Durusau @ 10:00 am

Harmonization process for the identification of medical events in eight European healthcare databases: the experience from the EU-ADR project by Paul Avillach, et al. (J Am Med Inform Assoc doi:10.1136/amiajnl-2012-000933)

Abstract

Objective Data from electronic healthcare records (EHR) can be used to monitor drug safety, but in order to compare and pool data from different EHR databases, the extraction of potential adverse events must be harmonized. In this paper, we describe the procedure used for harmonizing the extraction from eight European EHR databases of five events of interest deemed to be important in pharmacovigilance: acute myocardial infarction (AMI); acute renal failure (ARF); anaphylactic shock (AS); bullous eruption (BE); and rhabdomyolysis (RHABD).

Design The participating databases comprise general practitioners’ medical records and claims for hospitalization and other healthcare services. Clinical information is collected using four different disease terminologies and free text in two different languages. The Unified Medical Language System was used to identify concepts and corresponding codes in each terminology. A common database model was used to share and pool data and verify the semantic basis of the event extraction queries. Feedback from the database holders was obtained at various stages to refine the extraction queries.

….

Conclusions The iterative harmonization process enabled a more homogeneous identification of events across differently structured databases using different coding based algorithms. This workflow can facilitate transparent and reproducible event extractions and understanding of differences between databases.

Not to be overly critical, but the one thing left out of the abstract was any hint about the “…procedure used for harmonizing the extraction…”, which is what interests me.

The workflow diagram from figure 2 is worth transposing into HTML markup:

  • Event definition
    • Choice of the event
    • Event Definition Form (EDF) containing the medical definition and diagnostic criteria for the event
  • Concepts selection and projection into the terminologies
    • Search for Unified Medical Language System (UMLS) concepts corresponding to the medical definition as reported in the EDF
    • Projection of UMLS concepts into the different terminologies used in the participating databases
    • Publication on the project’s forum of the first list of UMLS concepts and corresponding codes and terms for each terminology
  • Revision of concepts and related terms
    • Feedback from database holders about the list of concepts with corresponding codes and related terms that they have previously used to identify the event of interest
    • Report on literature review on search criteria being used in previous observational studies that explored the event of interest
    • Text mining in database to identify potentially missing codes through the identification of terms associated with the event in databases
    • Conference call for finalizing the list of concepts
    • Search for new UMLS concepts from the proposed terms
    • Final list of UMLS concepts and related codes posted on the forum
  • Translation of concepts and coding algorithms into queries
    • Queries in each database were built using:
      1. the common data model;
      2. the concept projection into different terminologies; and
      3. the chosen algorithms for event definition
    • Query Analysis
      • Database holders extract data on the event of interest using codes and free text from pre-defined concepts and with database-specific refinement strategies
      • Database holders calculate incidence rates and comparisons are made among databases
      • Database holders compare search queries via the forum

At least for non-members, the EU-ADR website does not appear to offer access to the UMLS concepts and related codes mapping. That mapping could be used to increase the accessibility of any database using those codes.
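For what it is worth, the shape of such a mapping is easy to sketch: one UMLS concept keyed to the codes that express it in each source terminology. The concept identifier and codes below are illustrative only (acute myocardial infarction as the example), not the project’s actual tables:

```python
# Illustrative only: one UMLS concept projected into several terminologies.
# The CUI and codes are examples for acute myocardial infarction, not the
# EU-ADR project's actual mapping.
UMLS_PROJECTION = {
    "C0155626": {                      # assumed CUI for acute MI
        "label": "Acute myocardial infarction",
        "ICD-9-CM": ["410"],           # plus fourth/fifth digits in practice
        "ICD-10": ["I21"],
        "READ": ["G30.."],             # placeholder READ code
    },
}

def codes_for(cui, terminology):
    """Return the codes expressing a UMLS concept in a given terminology."""
    return UMLS_PROJECTION.get(cui, {}).get(terminology, [])

# codes_for("C0155626", "ICD-10")  ->  ["I21"]
```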

Graph Motif Resume

Filed under: Graph Coloring,Graphs,Networks — Patrick Durusau @ 9:01 am

An (almost complete) state of the art around the Graph Motif problem by Florian Sikora.

A listing of current (March, 2012) results on the Graph Motif problem, including references to software.

Starts with an intuitive illustration of the problem.

Makes it easy to see why this problem is going to command a lot of attention as more complex (and realistic) modeling becomes commonplace.

Or to put it another way, normalized data is just that, normalized data. That’s why we don’t call it reality.
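To make the problem statement concrete: Graph Motif asks whether a vertex-colored graph contains a connected set of vertices whose multiset of colors matches a given motif. Below is a brute-force sketch of that definition (my own, not from Sikora’s note); it is exponential in the number of vertices, which is exactly why the parameterized results he surveys matter.

```python
from itertools import combinations
from collections import Counter
import networkx as nx

def has_motif(G, color, motif):
    """Brute-force Graph Motif: is there a connected set of |motif| vertices
    whose multiset of colors equals `motif`?  `color` maps vertex -> color.
    Exponential in |V|; only meant to make the problem statement concrete."""
    target = Counter(motif)
    for nodes in combinations(G.nodes(), len(motif)):
        if Counter(color[v] for v in nodes) == target \
           and nx.is_connected(G.subgraph(nodes)):
            return True
    return False

# Tiny example: a path 0-1-2 colored r, b, r; the motif {r, b} is present.
# G = nx.path_graph(3)
# has_motif(G, {0: "r", 1: "b", 2: "r"}, ["r", "b"])   ->  True
```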

Graph motif in O*(2^k) time by narrow sieves [Comparing Graph Software/Databases]

Filed under: Graphs,Networks — Patrick Durusau @ 8:51 am

Graph motif in O*(2^k) time by narrow sieves by Lukasz Kowalik.

Abstract:

We show an O*(2^k)-time polynomial space algorithm for the Graph Motif problem. Moreover, we show evidence that our result might be essentially tight: the existence of an O((2-\epsilon)^k)-time algorithm for the Graph Motif problem implies an O((2-\epsilon')^n)-time algorithm for Set Cover.

The assault on graph problems continues!

It does make me wonder: Is there a comparison listing of graph software/databases and the algorithms they support?

It is one thing to support arbitrary nodes and edges, quite another to do something useful with them.

September 6, 2012

Meet the Committer, Part One: Alan Gates

Filed under: Hadoop,Hortonworks,MapReduce,Pig — Patrick Durusau @ 7:52 pm

Meet the Committer, Part One: Alan Gates by Kim Truong.

From the post:

Series Introduction

Hortonworks is on a mission to accelerate the development and adoption of Apache Hadoop. Through engineering open source Hadoop, our efforts with our distribution, Hortonworks Data Platform (HDP), a 100% open source data management platform, and partnerships with the likes of Microsoft, Teradata, Talend and others, we will accomplish this, one installation at a time.

What makes this mission possible is our all-star team of Hadoop committers. In this series, we’re going to profile those committers, to show you the face of Hadoop.

Alan Gates, Apache Pig and HCatalog Committer

Education is a key component of this mission. Helping companies gain a better understanding of the value of Hadoop through transparent communications of the work we’re doing is paramount. In addition to explaining core Hadoop projects (MapReduce and HDFS) we also highlight significant contributions to other ecosystem projects including Apache Ambari, Apache HCatalog, Apache Pig and Apache Zookeeper.

Alan Gates is a leader in our Hadoop education programs. That is why I’m incredibly excited to kick off the next phase of our “Future of Apache Hadoop” webinar series. We’re starting off this segment with a 4-webinar series on September 12 with “Pig out to Hadoop” with Alan Gates (twitter: @alanfgates). Alan is an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan is also a member of the Apache Software Foundation and a co-founder of Hortonworks.

My only complaint is that the interview is too short!

Looking forward to the Pig webinar!

MapReduce Makes Further Inroads in Academia

Filed under: MapReduce — Patrick Durusau @ 7:42 pm

MapReduce Makes Further Inroads in Academia by Ian Armas Foster.

From the post:

Most conversations about Hadoop and MapReduce tend to filter in from enterprise quarters, but if the recent uptick in scholarly articles extolling its benefit for scientific and technical computing applications is any indication, the research world might have found its next open source darling.

Of course, it’s not just about making use of the approach—for many researchers, it’s about expanding, refining and tweaking the tool to make it suitable for a new, heavy-hitting class of applications. As a result, research to improve MapReduce’s functionality and efficiency flourishes, which could eventually provide some great trickle-down technology for the business users as well.

As one case among an increasing number, researchers Marcelo Neves, Tiago Ferreto, and Cesar De Rose of PUCRS in Brazil are working to extend the capabilities of MapReduce. Their approach to MapReduce sought to tackle one of the more complex issues for MapReduce on high performance computing hardware. In this case, the mighty scheduling problem was the target.

The team recently proposed a new algorithm, called MapReduce Job Adaptor, that would enhance MapReduce’s work rate and job scheduling. Neves et al. presented their algorithm in a recent paper.

I am not sure how (or if) it will be documented, but users of MapReduce should watch for how their analysis of a problem changes, based on the anticipated use of MapReduce.

Some academic is going to write the history of MapReduce on one or more problems. Could be you.

Progress on Partial Edge Drawings

Filed under: Graphs,Networks,Visualization — Patrick Durusau @ 4:55 pm

Progress on Partial Edge Drawings by Till Bruckdorfer, Sabine Cornelsen, Carsten Gutwenger, Michael Kaufmann, Fabrizio Montecchiani, Martin Nöllenburg and Alexander Wolff.

Abstract:

Recently, a new way of avoiding crossings in straight-line drawings of non-planar graphs has been investigated. The idea of partial edge drawings (PED) is to drop the middle part of edges and rely on the remaining edge parts called stubs. We focus on a symmetric model (SPED) that requires the two stubs of an edge to be of equal length. In this way, the stub at the other endpoint of an edge assures the viewer of the edge’s existence. We also consider an additional homogeneity constraint that forces the stub lengths to be a given fraction $\delta$ of the edge lengths ($\delta$-SHPED). Given length and direction of a stub, this model helps to infer the position of the opposite stub.

We show that, for a fixed stub–edge length ratio $\delta$, not all graphs have a $\delta$-SHPED. Specifically, we show that $K_{241}$ does not have a 1/4-SHPED, while bandwidth-$k$ graphs always have a $\Theta(1/\sqrt{k})$-SHPED. We also give bounds for complete bipartite graphs. Further, we consider the problem \textsc{MaxSPED} where the task is to compute the SPED of maximum total stub length that a given straight-line drawing contains. We present an efficient solution for 2-planar drawings and a 2-approximation algorithm for the dual problem.

I like hair-ball, brightly colored graphs as much as anyone, but I have to confess that discerning useful information from them is problematic.

As graphs become more popular as a methodology, I suspect you will see more and more “default” presentations of hair-ball visualizations.

This and similar research may help you move beyond cluttered visualizations to useful ones.

A dynamic data structure for counting subgraphs in sparse graphs

Filed under: Graphs,Networks,Subject Identity — Patrick Durusau @ 4:35 pm

A dynamic data structure for counting subgraphs in sparse graphs by Zdenek Dvorak and Vojtech Tuma.

Abstract:

We present a dynamic data structure representing a graph G, which allows addition and removal of edges from G and can determine the number of appearances of a graph of a bounded size as an induced subgraph of G. The queries are answered in constant time. When the data structure is used to represent graphs from a class with bounded expansion (which includes planar graphs and more generally all proper classes closed on topological minors, as well as many other natural classes of graphs with bounded average degree), the amortized time complexity of updates is polylogarithmic.

Work on data structures seems particularly appropriate when discussing graphs.

Subject identity, beyond string equivalence, can be seen as a graph isomorphism or subgraph isomorphism problem.

Has anyone proposed “bounded” subject identity mechanisms that correspond to the bounds necessary on graphs to make them processable?

We know how to do string equivalence, and the “ideal” solution would be unlimited relationships to other subjects, but that is known to be intractable. For one thing, we don’t know every relationship for any subject.

I suspect there are boundary conditions for constructing subject identities that are more complex than string equivalence but still result in tractable identification.
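One way to make “bounded” concrete, strictly as a sketch: represent each subject by a small identity graph, its name plus a k-hop neighborhood of properties and relationships, and treat identity as label-preserving isomorphism of those bounded graphs. The bound keeps the comparison tractable, at the price of missing distinctions that lie outside it. A toy version using networkx (my own illustration, not a proposed mechanism):

```python
import networkx as nx
from networkx.algorithms import isomorphism

def identity_graph(store, subject, hops=1):
    """Bounded identity: the subject plus everything within `hops` edges.
    `store` is any networkx graph whose nodes carry a 'label' attribute."""
    nodes = nx.single_source_shortest_path_length(store, subject, cutoff=hops)
    return store.subgraph(nodes)

def same_subject(g1, s1, g2, s2, hops=1):
    """Toy bounded identity test: the two k-hop identity graphs must be
    isomorphic with matching node labels.  A sketch, not a full mechanism."""
    nm = isomorphism.categorical_node_match("label", None)
    return nx.is_isomorphic(identity_graph(g1, s1, hops),
                            identity_graph(g2, s2, hops),
                            node_match=nm)
```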

Suggestions?

Ternary graph isomorphism in polynomial time, after Luks

Filed under: Graphs,Networks,Python — Patrick Durusau @ 4:04 pm

Ternary graph isomorphism in polynomial time, after Luks by Adria Alcala Mena and Francesc Rossello.

Abstract:

The graph isomorphism problem has a long history in mathematics and computer science, with applications in computational chemistry and biology, and it is believed to be neither solvable in polynomial time nor NP-complete. E. Luks proposed in 1982 the best algorithm so far for the solution of this problem, which moreover runs in polynomial time if an upper bound for the degrees of the nodes in the graphs is taken as a constant. Unfortunately, Luks’ algorithm is purely theoretical, very difficult to use in practice, and, in particular, we have not been able to find any implementation of it in the literature. The main goal of this paper is to present an efficient implementation of this algorithm for ternary graphs in the SAGE system, as well as an adaptation to fully resolved rooted phylogenetic networks on a given set of taxa.

Building on his master’s thesis, Adria focuses on implementation issues of Luks’ graph isomorphism algorithm.

Trivalent Graph isomorphism in polynomial time

Filed under: Graphs,Networks — Patrick Durusau @ 3:57 pm

Trivalent Graph isomorphism in polynomial time by Adria Alcala Mena.

Abstract:

It’s important to design polynomial time algorithms to test if two graphs are isomorphic at least for some special classes of graphs.

An approach to this was presented by Eugene M. Luks (1981) in the work Isomorphism of Graphs of Bounded Valence Can Be Tested in Polynomial Time. Unfortunately, it was a theoretical algorithm and was very difficult to put into practice. On the other hand, there is no known implementation of the algorithm, although Galil, Hoffman and Luks (1983) show an improvement of this algorithm running in $O(n^3 \log n)$.

The two main goals of this master’s thesis are to explain the algorithm of Luks (1981) more carefully, including a detailed study of its complexity, and then to provide an efficient implementation in the SAGE system. It is divided into four chapters plus an appendix.

Work like this makes graph isomorphism sound vulnerable.

Other resources you may find useful:

Eugene M. Luks (homepage)

Isomorphism of graphs of bounded valence can be tested in polynomial time, Eugene M. Luks, Journal of Computer and System Sciences, 25, 1982, 42-65.

Eugene M. Luks (DBLP)

TwitterScope gets best paper at GD

Filed under: Graphs,Networks,Visualization — Patrick Durusau @ 3:24 pm

TwitterScope gets best paper at GD by David Eppstein.

From the post:

This year’s Graph Drawing Symposium is coming up in less than two weeks, and has already announced the winners of the best paper awards in each of the two submission tracks, “combinatorial and algorithmic aspects” and “visualization systems and interfaces”. The winner in the theory track was my paper on Lombardi drawing, but I already posted here about that, so instead I wanted to say some more about the other winner, “Visualizing Streaming Text Data with Dynamic Graphs and Maps” by Emden Gansner, Yifan Hu, and Stephen North. A preprint version of their paper is online at arXiv:1206.3980.

The paper describes the TwitterScope project, which provides visualization tools for high-volume streams of text data (e.g. from Twitter). Currently it exists as a publicly viewable prototype, set up to choose among nine preselected topics. Recent tweets are shown as small icons, grouped into colored regions (representing subtopics) within what looks like a map. Hovering the mouse over an icon shows the corresponding tweet. It’s updated dynamically as new tweets come in, and has a timeline for viewing past tweets. My feeling from the description is that the work involved in putting this system together was less about coming up with new technical methods for visualization (although there is some of that there, particularly in how they handle disconnected graphs) and more about making a selection among many different ideas previously seen in this area and putting them together into a single coherent and well-engineered system. Which I guess should be what that track is all about.

My musings on the same paper: Visualizing Streaming Text Data with Dynamic Maps.

Human Ignorance, Deep and Profound

Filed under: Bioinformatics,Biomedical,Graphs,Networks — Patrick Durusau @ 3:33 am

Scientists have discovered over 4 million gene switches, formerly known as “junk” (a scientific shorthand for “we don’t know what this means”), in the human genome. From the New York Times article Bits of Mystery DNA, Far From ‘Junk,’ Play Crucial Role (Gina Kolata):

…The human genome is packed with at least four million gene switches that reside in bits of DNA that once were dismissed as “junk” but that turn out to play critical roles in controlling how cells, organs and other tissues behave. The discovery, considered a major medical and scientific breakthrough, has enormous implications for human health because many complex diseases appear to be caused by tiny changes in hundreds of gene switches.

The findings, which are the fruit of an immense federal project involving 440 scientists from 32 laboratories around the world, will have immediate applications for understanding how alterations in the non-gene parts of DNA contribute to human diseases, which may in turn lead to new drugs. They can also help explain how the environment can affect disease risk. In the case of identical twins, small changes in environmental exposure can slightly alter gene switches, with the result that one twin gets a disease and the other does not.

As scientists delved into the “junk” — parts of the DNA that are not actual genes containing instructions for proteins — they discovered a complex system that controls genes. At least 80 percent of this DNA is active and needed. The result of the work is an annotated road map of much of this DNA, noting what it is doing and how. It includes the system of switches that, acting like dimmer switches for lights, control which genes are used in a cell and when they are used, and determine, for instance, whether a cell becomes a liver cell or a neuron.

Reminds me of the discovery that glial cells aren’t packing material to support neurons. We were missing about half the human brain by size.

While I find both discoveries exciting, I am also mindful that we are not getting any closer to complete knowledge.

Rather, they open up opportunities to correct prior mistakes and, at some future time, to discover present ones.

PS: As you probably suspect, relationships between gene switches are extremely complex. New graph databases/algorithms anyone?

September 5, 2012

YarcData Architect on Hadoop’s Fatal Flaw

Filed under: Graphs,YarcData — Patrick Durusau @ 5:53 pm

YarcData Architect on Hadoop’s Fatal Flaw

From the post:

Systems like Hadoop and MapReduce are great at slicing problems into multiple pieces, evaluating each little piece, and plugging them back into the whole accordingly, much like an integral in calculus. But what if those little pieces interact with each other constantly, like sections of an ocean?

According to YarcData’s Solutions Architect, James Maltby, Hadoop and MapReduce are less suited to store these graphs than his company’s uRIKA database.

“Many graphs are tightly connected and not easily cut up into small pieces,” said Maltby. “A good example might be a map of genomic networks, which may contain 500 times as many connections as data nodes. Many MapReduce steps are required to solve this problem, and performance suffers. In contrast, uRIKA stores its graph in a large, shared memory pool, and no partitioning is necessary at all.”

Genomics is one of the more complicated and more exciting big data research fields. Medical scientists are working on genomics in hopes to ascertain precisely where diseases originate. However, the vast amount of genes per genome and the many connections those genes make amongst themselves makes genomics a complex big data problem. Slicing that problem severs those all-important connections.

Note that RAM for the uRIKA system is measured in TBs.

For more details: YarcData.

uRIKA is specialized hardware today, but ten years ago so were large computing clusters. Access to large computing clusters today requires only a credit card and an Internet connection.

The time to start redefining computing is now. The future will be here sooner than you think.

IWCS 2013 Workshop: Towards a formal distributional semantics

Filed under: Conferences,Semantics — Patrick Durusau @ 4:43 pm

IWCS 2013 Workshop: Towards a formal distributional semantics

When: Mar 19, 2013 – Mar 22, 2013
Where: Potsdam, Germany
Submission Deadline: Nov 30, 2012
Notification Due: Jan 4, 2013
Final Version Due: Jan 25, 2013

From the call for papers:

The Tenth International Conference for Computational Semantics (IWCS) will be held March 20–22, 2013 in Potsdam, Germany.

The aim of the IWCS conference is to bring together researchers interested in the computation, annotation, extraction, and representation of meaning in natural language, whether this is from a lexical or structural semantic perspective. IWCS embraces both symbolic and statistical approaches to computational semantics, and everything in between.

Topics of Interest

Areas of special interest for the conference will be computational aspects of meaning of natural language within written, spoken, or multimodal communication. Papers are invited that are concerned with topics in these and closely related areas, including the following:

  • representation of meaning
  • syntax-semantics interface
  • representing and resolving semantic ambiguity
  • shallow and deep semantic processing and reasoning
  • hybrid symbolic and statistical approaches to representing semantics
  • alternative approaches to compositional semantics
  • inference methods for computational semantics
  • recognizing textual entailment
  • learning by reading
  • methodologies and practices for semantic annotation
  • machine learning of semantic structures
  • statistical semantics
  • computational aspects of lexical semantics
  • semantics and ontologies
  • semantic web and natural language processing
  • semantic aspects of language generation
  • semantic relations in discourse and dialogue
  • semantics and pragmatics of dialogue acts
  • multimodal and grounded approaches to computing meaning
  • semantics-pragmatics interface

Definitely sounds like a topic map sort of meeting!

Exploring Twitter Data

Filed under: Splunk,Tweets — Patrick Durusau @ 4:32 pm

Exploring Twitter Data

From the post:

Want to explore popular content on Twitter with Splunk queries? The new Twitter App for Splunk 4.3 provides a scripted input that automatically extracts data from Twitter’s public 1% sample stream.

What could be better? Watching a Twitter stream and calling it work. 😉

Machine Learning in All Languages: Introduction

Filed under: Javascript,Machine Learning,Perl,PHP,Ruby — Patrick Durusau @ 4:20 pm

Machine Learning in All Languages: Introduction by Burak Kanber.

From the post:

I love machine learning algorithms. I’ve taught classes and seminars and given talks on ML. The subject is fascinating to me, but like all skills fascination simply isn’t enough. To get good at something, you need to practice!

I also happen to be a PHP and Javascript developer. I’ve taught classes on both of these as well — but like any decent software engineer I have experience with Ruby, Python, Perl, and C. I just prefer PHP and JS. Before you flame PHP, I’ll just say that while it has its problems, I like it because it gets stuff done.

Whenever I say that Tidal Labs’ ML algorithms are in PHP, they look at me funny and ask me how it’s possible. Simple: it’s possible to write ML algorithms in just about any language. Most people just don’t care to learn the fundamentals strongly enough that they can write an algorithm from scratch. Instead, they rely on Python libraries to do the work for them, and end up not truly grasping what’s happening inside the black box.

Through this series of articles, I’ll teach you the fundamental machine learning algorithms in a variety of languages, including:

  • PHP
  • Javascript
  • Perl
  • C
  • Ruby

Just started, so it is too soon to comment, but I thought it might be of interest.
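In the spirit of the series’ “write it from scratch” argument, even k-means fits in a page of any of those languages. A bare-bones Python sketch (mine, not code from the series):

```python
import random

def kmeans(points, k, iterations=100):
    """Plain k-means on 2-D points, no libraries: assign each point to its
    nearest centroid, recompute centroids, repeat.  A from-scratch sketch
    in the spirit of the series, not code from it."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]           # keep old centroid for empty cluster
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# pts = [(1, 1), (1.2, 0.8), (5, 5), (5.3, 4.9)]
# centroids, clusters = kmeans(pts, 2)
```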

New ‘The Future of Apache Hadoop’ Season!

Filed under: Hadoop,Hadoop YARN,Hortonworks,Zookeeper — Patrick Durusau @ 3:37 pm

OK, the real title is: Four New Installments in ‘The Future of Apache Hadoop’ Webinar Series

From the post:

During the ‘Future of Apache Hadoop’ webinar series, Hortonworks founders and core committers will discuss the future of Hadoop and related projects including Apache Pig, Apache Ambari, Apache Zookeeper and Apache Hadoop YARN.

Apache Hadoop has rapidly evolved to become the leading platform for managing, processing and analyzing big data. Consequently there is a thirst for knowledge on the future direction for Hadoop related projects. The Hortonworks webinar series will feature core committers of the Apache projects discussing the essential components required in a Hadoop Platform, current advances in Apache Hadoop, relevant use-cases and best practices on how to get started with the open source platform. Each webinar will include a live Q&A with the individuals at the center of the Apache Hadoop movement.

Coming to a computer near you:

  • Pig Out on Hadoop (Alan Gates): Wednesday, September 12 at 10:00 a.m. PT / 1:00 p.m. ET
  • Deployment and Management of Hadoop Clusters with Ambari (Matt Foley): Wednesday, September 26 at 10:00 a.m. PT / 1:00 p.m. ET
  • Scaling Apache Zookeeper for the Next Generation of Hadoop Applications (Mahadev Konar): Wednesday, October 17 at 10:00 a.m. PT / 1:00 p.m. ET
  • YARN: The Future of Data Processing with Apache Hadoop (Arun C. Murthy): Wednesday, October 31 at 10:00 a.m. PT / 1:00 p.m. ET

Registration is open so get it on your calendar!

What Do Real-Life Hadoop Workloads Look Like?

Filed under: Cloudera,Hadoop — Patrick Durusau @ 3:23 pm

What Do Real-Life Hadoop Workloads Look Like? by Yanpei Chen.

From the post:

Organizations in diverse industries have adopted Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.

Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces. These traces come from Cloudera customers in e-commerce, telecommunications, media, and retail (Table 1). Here I will explain a subset of the observations, and the thoughts they triggered about challenges and opportunities in the Hadoop ecosystem, both present and in the future.

Specific (and useful) to Hadoop installations, but I suspect the same kind of workload analysis would be useful for semantic processing in general.

Questions like:

  • What topics are “hot spots” of merging activity?
  • Where do those topics originate?
  • How do changes in merging rules impact the merging process?

are only some of the ones that may be of interest.

Naming and the Curse of Dimensionality

Filed under: Dimension Reduction,Names — Patrick Durusau @ 3:23 pm

Wikipedia introduces its article on the Curse of Dimensionality with:

In numerical analysis the curse of dimensionality refers to various phenomena that arise when analyzing and organizing high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the physical space commonly modeled with just three dimensions.

There are multiple phenomena referred to by this name in domains such as sampling, combinatorics, machine learning and data mining. The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. Also organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data however all objects appear to be sparse and dissimilar in many ways which prevents common data organization strategies from being efficient.

The term curse of dimensionality was coined by Richard E. Bellman when considering problems in dynamic optimization.[1][2]

The “curse of dimensionality” is often used as a blanket excuse for not dealing with high-dimensional data. However, the effects are not yet completely understood by the scientific community, and there is ongoing research. On one hand, the notion of intrinsic dimension refers to the fact that any low-dimensional data space can trivially be turned into a higher dimensional space by adding redundant (e.g. duplicate) or randomized dimensions, and in turn many high-dimensional data sets can be reduced to lower dimensional data without significant information loss. This is also reflected by the effectiveness of dimension reduction methods such as principal component analysis in many situations. For distance functions and nearest neighbor search, recent research also showed that data sets that exhibit the curse of dimensionality properties can still be processed unless there are too many irrelevant dimensions, while relevant dimensions can make some problems such as cluster analysis actually easier.[3][4] Secondly, methods such as Markov chain Monte Carlo or shared nearest neighbor methods[3] often work very well on data that were considered intractable by other methods due to high dimensionality.
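Since the quote leans on principal component analysis as the canonical dimension reduction method, it is worth seeing how small that operation actually is. A minimal numpy sketch that projects data onto its top k components:

```python
import numpy as np

def pca_reduce(X, k):
    """Project an (n_samples, n_features) array onto its top-k principal
    components via SVD of the centered data.  Minimal sketch; real use
    would go through scikit-learn or a similar library."""
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # shape: (n_samples, k)

# X = np.random.rand(100, 50)
# Y = pca_reduce(X, 3)          # 50 dimensions reduced to 3
```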

But dimensionality isn’t limited to numerical analysis. Nor is its reduction.

Think about the number of dimensions along which you have information about your significant other, friends or co-authors. Or any other subject, abstract or concrete, that you care to name.

However many dimensions you can name for any given subject, in human discourse we don’t refer to that dimensionality as the “curse of dimensionality.”

In fact, we don’t notice the dimensionality at all. Why?

We reduce all those dimensions into a name for the subject and that name is what we use in human discourse.

Dimensional reduction to names goes a long way to explaining why we get confused by names.

Another speaker may have reduced a different set of dimensions (not shown, as it were) to the same name that we use as the reduction of our own, different set of dimensions.

Sometimes the same name will expand into a different set of dimensions and sometimes different names expand into the same set of dimensions.

One of those dimensions is the context of usage: when our expansion of the name doesn’t fit the context, we ask the speaker for one or more additional dimensions to identify the subject of discussion.

We do that effortlessly, reducing and expanding dimensions to and from names in the course of a conversation. Or when reading or writing.

The number of dimensions for any name increases as we learn more about the subject. Those dimensions are also shaped by our interactions with others who use the same name, as we adjust, repair or change the dimensions we expand or reduce for that name.

Dimensionality isn’t a curse. The difficulties we associate with dimensionality and numeric analysis are a consequence of using an underpowered tool, that’s all.

September 4, 2012

Getting data on your government

Filed under: Data Mining,Government Data,R — Patrick Durusau @ 6:52 pm

Getting data on your government

From the post:

I created an R package a while back to interact with some APIs that serve up data on what our elected representatives are up to, including the New York Times Congress API and the Sunlight Labs API.

What kinds of things can you do with govdat? Here are a few examples.

How do the two major parties differ in the use of certain words (searches the congressional record using the Sunlight Labs Capitol Words API)?

[text and code omitted]

Let’s get some data on donations to individual elected representatives.

[text and code omitted]

Or we may want to get a bio of a congressperson. Here we get Todd Akin of MO. And some twitter searching too? Indeed.

[text and code omitted]

I waver between thinking that mining government data is a good thing and remembering that the government released it voluntarily. In the latter case, it may be nothing more than a distraction.

Call for contribution: the RDataMining package…

Filed under: Data Mining,R — Patrick Durusau @ 6:35 pm

Call for contribution: the RDataMining package – an R package for data mining by Yanchang Zhao.

Join the RDataMining project to build a comprehensive R package for data mining http://www.rdatamining.com/package

We have started the RDataMining project on R-Forge to build an R package for data mining. The package will provide various functionalities for data mining, with contributions from many R users. If you have developed or will implement any data mining algorithms in R, please participate in the project to make your work available to R users worldwide.

Background

Although there are many R packages for various data mining functionalities, there are many more new algorithms designed and published every year without any R implementations for them. It is far beyond the capability of a single team, even several teams, to build packages for oncoming new data mining algorithms. On the other hand, many R users have developed their own implementations of new data mining algorithms but, unfortunately, use them for their own work only, without sharing with other R users. The reason could be that they do not know how or do not have time to build packages to share their code, or they might think that it is not worth building a package with only one or two functions.

Objective

To foster the development of data mining capability in R and facilitate sharing of data mining codes/functions/algorithms among R users, we started this project on R-Forge to collaboratively build an R package for data mining, with contributions from many R users, including ourselves.

Definitely worth considering if you are using R for data mining.

It also makes me think of the various public data dumps. I assume someone has mined some (most?) of those and has gained insights into their quirks.

Are there any projects gathering data mining tips or experiences with public data sets? Or are those buried in footnotes or asides, when they are recorded at all?

Sockets and Streams [Registration Open – Event 12 September – Hurry]

Filed under: Data Streams,News,Stream Analytics — Patrick Durusau @ 6:25 pm

Sockets and Streams

Wednesday, September 12
7 p.m.–10 p.m.

The New York Times
620 Eighth Avenue
New York, NY
15th Floor

From the webpage:

Explore innovations in real-time web systems and content, as well as related topics in interaction design.

Nice way to spend an evening in New York City!

Expect to hear good reports!

Accumulo: Why The World Needs Another NoSQL Database

Filed under: Accumulo,NoSQL — Patrick Durusau @ 4:43 pm

Accumulo: Why The World Needs Another NoSQL Database by Jeff Kelly.

From the post:

If you’ve been unable to keep up with all the competing NoSQL databases that have hit the market over the last several years, you’re not alone. To name just a few, there’s HBase, Cassandra, MongoDB, Riak, CouchDB, Redis, and Neo4J.

To that list you can add Accumulo, an open source database originally developed at the National Security Agency. You may be wondering why the world needs yet another database to handle large volumes of multi-structured data. The answer is, of course, that no one of these NoSQL databases has yet checked all the feature/functionality boxes that most enterprises require before deploying a new technology.

In the Big Data world, that means the ability to handle the three V’s (volume, variety and velocity) of data, the ability to process multiple types of workloads (analytical vs. transactional), and the ability to maintain ACID (atomicity, consistency, isolation and durability) compliance at scale. With each new NoSQL entrant, hope springs eternal that this one will prove the NoSQL messiah.

So what makes Accumulo different than all the rest? According to proponents, Accumulo is capable of maintaining consistency even as it scales to thousands of nodes and petabytes of data; it can both read and write data in near real-time; and, most importantly, it was built from the ground up with cell-level security functionality.

It’s the third feature – cell-level security – that has the Big Data community most excited. Accumulo is being positioned as an all-purpose Hadoop database and a competitor to HBase. While HBase, like Accumulo, is able to scale to thousands of machines while maintaining a relatively high level of consistency, it was not designed with any security, let alone cell-level security, in mind.

The current security documentation on Accumulo reads (in part):

Accumulo extends the BigTable data model to implement a security mechanism known as cell-level security. Every key-value pair has its own security label, stored under the column visibility element of the key, which is used to determine whether a given user meets the security requirements to read the value. This enables data of various security levels to be stored within the same row, and users of varying degrees of access to query the same table, while preserving data confidentiality.

Security labels consist of a set of user-defined tokens that are required to read the value the label is associated with. The set of tokens required can be specified using syntax that supports logical AND and OR combinations of tokens, as well as nesting groups of tokens together.
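To see what those label expressions look like in use, here is a toy evaluator of the AND (&) / OR (|) / parentheses syntax described above. It exists only to make the semantics concrete; Accumulo parses and enforces visibilities server-side, and the labels and tokens shown are invented:

```python
import re

def satisfies(label, tokens):
    """Toy check of whether a user's token set satisfies a cell visibility
    label such as 'admin&(audit|security)'.  Illustration only; Accumulo
    evaluates visibilities itself and its parser is stricter than this."""
    expr = label
    for tok in set(re.findall(r"[A-Za-z0-9_]+", label)):
        expr = re.sub(rf"\b{re.escape(tok)}\b", str(tok in tokens), expr)
    expr = expr.replace("&", " and ").replace("|", " or ")
    return eval(expr)  # acceptable for a toy; never eval untrusted input

# satisfies("admin&(audit|security)", {"admin", "audit"})   -> True
# satisfies("admin&(audit|security)", {"audit"})            -> False
```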

If that sounds impressive, realize that:

  • Users can overwrite data they cannot see, unless you set the table visibility constraint.
  • Users can avoid the table visibility constraint by using the bulk import method (which you can also disable).

More secure than a completely insecure solution but nothing to write home about, yet.

Can you imagine the complexity that is likely to be exhibited in an inter-agency context for security labels?

BTW, how do I determine the semantics of a proposed security label? What if it conflicts with another security label?

Helpful links: Apache Accumulo.

I first saw this at Alex Popescu’s myNoSQL.

Pig Performance and Optimization Analysis

Filed under: Hortonworks,Pig — Patrick Durusau @ 3:57 pm

Pig Performance and Optimization Analysis by Li Jie.

From the post:

In this post, Hortonworks Intern Li Jie talks about his work this summer on performance analysis and optimization of Apache Pig. Li is a PhD candidate in the Department of Computer Science at Duke University. His research interests are in the area of database systems and big data computing. He is currently working with Associate Professor Shivnath Babu.

If you need to optimize Pig operations, this is a very good starting place.

Be sure to grab a copy of Running TPC-H on Pig by Li Jie, Koichi Ishida, Xuan Wang and Muzhi Zhao, with its “Six Rules of Writing Efficient Pig Scripts.”

Expect to see all of these authors in DBLP sooner rather than later.

DBLP: Shivnath Babu

Proceedings of the RuleML2012@ECAI Challenge

Filed under: Reasoning,RuleML — Patrick Durusau @ 2:33 pm

Proceedings of the RuleML2012@ECAI Challenge

The paper I mentioned yesterday, Legal Rules, Text and Ontologies Over Time [The eternal “now?”], is part of these proceedings.

Which is a very good paper.

You will also find “reasoning” about complex tax transactions, such as seeking reimbursement from the government for taxes you have not paid. (What’s complex about that I cannot say; I am merely reporting the description of Missing Trader Fraud given in one of the papers.) The taxes reported lost every year remind me of RIAA estimates on piracy.

And papers that fall in between.

The Spirit of XLDB (Extremely Large Databases) Past and Present

Filed under: Database,XLDB — Patrick Durusau @ 2:10 pm

The events page for XLDB has:

XLDB 2011 (Slides/Videos), as well as reports back to the 1st XLDB workshop.

Check back to find later proceedings.

Solr vs. ElasticSearch: Part 2 – Data Handling

Filed under: ElasticSearch,Search Engines,Solr — Patrick Durusau @ 1:47 pm

Solr vs. ElasticSearch: Part 2 – Data Handling by Rafał Kuć.

In the previous part of the Solr vs. ElasticSearch series we talked about the general architecture of these two great search engines based on Apache Lucene. Today, we will look at their ability to handle your data and perform indexing and language analysis.

  1. Solr vs. ElasticSearch: Part 1 – Overview
  2. Solr vs. ElasticSearch: Part 2 – Data Handling
  3. Solr vs. ElasticSearch: Part 3 – Querying
  4. Solr vs. ElasticSearch: Part 4 – Faceting
  5. Solr vs. ElasticSearch: Part 5 – API Usage Possibilities

Rafal takes a dive into indexing and data handling under Solr and ElasticSearch.

PS: Can you suggest a search engine that does not befoul URLs with tracking information? Or at least consistently presents a “clean” version alongside a tracking version?

Author Identifiers (At Least for CS)

Filed under: Bibliography,Identifiers — Patrick Durusau @ 1:33 pm

I enhanced the VLDB 2012 program with author queries to the DBLP Computer Science Bibliography for my own purposes.

After using that listing myself for a few days, it occurred to me that I should be using DBLP entries as author identifiers throughout my posts, at least when such entries exist.

For several reasons, but mostly:

  • DBLP maintains the publication listings (not by me!)
  • DBLP maintains pointers to other databases and resources (also not by me!)
  • DBLP maintains advanced search capabilities beyond authors (again, not by me!)

If you noticed not by me forming a pattern, you would be correct. There is a pattern.

The pattern?

Using DBLP author pages as identifiers, I leverage (rather than duplicate) the work of the DBLP project.

To the benefit of my readers. (Not to mention myself.)

The DBLP link brings together an author’s publication history, their co-authors, and additional bibliographic resources. (That’s a triple I like.)

It takes a moment to insert the link but the payoff is substantial.

When you cite a CS author in your blog, include their DBLP link. We will all thank you for it.

(I did that once upon a time but lapsed. Will be cleaning up older entries and trying to do better in the future.)

PS: Similar sources of identifiers for other disciplines?

September 3, 2012

Sell-an-Elephant-to-your-Boss-HOWTO

Filed under: Design,Marketing — Patrick Durusau @ 7:12 pm

Sell-an-Elephant-to-your-Boss-HOWTO by Aurimas Mikalauskas.

From the post:

Spoiler alert: If your boss does not need an elephant, he is definitely NOT going to buy one from you. If he will, he will regret it and eventually you will too.

I must apologize to the reader who was expecting to find advice on selling useless goods to his boss. While I do use a similar technique to get a quarterly raise (no, I don’t), this article is actually about convincing your team, your manager or anyone else who has influence over a project’s priorities that pending system performance optimizations are a priority (assuming they indeed are). However this headline was not very catchy and way too long, so I decided to go with the elephant instead.

System performance optimization is what I do day to day here at Percona. Looking back at the duration of an optimization project, I find that with bigger companies (bigger here means it’s not a one-man show) it’s not the identification of performance problems that takes most of the time. Nor is it looking for the right solution. The biggest bottleneck in the optimization project is where the solution gets approved and prioritized appropriately inside the company that came for performance optimization in the first place. Sometimes I would follow up with the customer after a few weeks or a month just to find that nothing was done to implement the suggested changes. When I would ask why, most of the time the answer is something along these lines: my manager didn’t schedule/approve it yet.

I don’t want to say that all performance improvements are a priority and should be done right away, not at all. I want to suggest that you can check if optimizations at hand should be prioritized and if so – how to make it happen if you’re not the one who sets priorities.

Steps to follow:

  1. Estimate harm being done
  2. Estimate the cost of the solution
  3. Make it a short and clear statement
  4. Show the method
  5. The main problem
  6. The solution
  7. Overcome any obstacles
  8. Kick it off

I like number one (1) in particular.

If your client doesn’t feel a need, no amount of selling is going to make a project happen.

All steps to follow in any IT/semantic project.
