Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 12, 2013

Big Data – Genomics – Bio4j

Filed under: BigData,Bio4j,Bioinformatics,Genomics — Patrick Durusau @ 3:12 pm

Berkeley Phylogenomics Group receives an NSF grant to develop a graph DB for Big Data challenges in genomics building on Bio4j

From the post:

The Sjölander Lab at the University of California, Berkeley, has recently been awarded a 250K US dollars EAGER grant from the National Science Foundation to build a graph database for Big Data challenges in genomics. Naturally, they’re building on Bio4j.

The project “EAGER: Towards a self-organizing map and hyper-dimensional information network for the human genome” aims to create a graph database of genome and proteome data for the human genome and related species to allow biologists and computational biologists to mine the information in gene family trees, biological networks and other graph data that cannot be represented effectively in relational databases. For these goals, they will develop on top of the pioneering graph-based bioinformatics platform Bio4j.

“We are excited to see how Bio4j is used by top research groups to build cutting-edge bioinformatics solutions” said Eduardo Pareja, Era7 Bioinformatics CEO. “To reach an even broader user base, we are pleased to announce that we now provide versions for both Neo4j and Titan graph databases, for which we have developed another layer of abstraction for the domain model using Blueprints.”

“EAGER stands for Early-concept Grants for Exploratory Research”, explained Professor Kimmen Sjölander, head of the Berkeley Phylogenomics Group: “NSF awards these grants to support exploratory work in its early stages on untested, but potentially transformative, research ideas or approaches”. “My lab’s focus is on machine learning methods for Big Data challenges in biology, particularly for graphical data such as gene trees, networks, pathways and protein structures. The limitations of relational database technologies for graph data, particularly BIG graph data, restrict scientists’ ability to get any real information from that data. When we decided to switch to a graph database, we did a lot of research into the options. When we found out about Bio4j, we knew we’d found our solution. The Bio4j team has made our development tasks so much easier, and we look forward to a long and fruitful collaboration in this open-source project”.

Always nice to see great projects get ahead!

Kudos to the Berkeley Phylogenomics Group!

November 11, 2013

Hadoop – 100x Faster… [With NO ETL!]

Filed under: ETL,Hadoop,HDFS,MapReduce,Topic Maps — Patrick Durusau @ 8:32 pm

Hadoop – 100x Faster. How we did it… by Nikita Ivanov.

From the post:

Almost two years ago, Dmitriy and I stood in front of a white board at GridGain’s office thinking: “How can we deliver the real-time performance of GridGain’s in-memory technology to Hadoop customers without asking them to rip and replace their systems and without asking them to move their datasets off Hadoop?”.

Given Hadoop’s architecture – the task seemed daunting; and it proved to be one of the more challenging engineering puzzles we have had to solve.

After two years of development, tens of thousands of lines of Java, Scala and C++ code, multiple design iterations, several releases and dozens of benchmarks later, we finally built a product that can deliver real-time performance to Hadoop customers with seamless integration and no tedious ETL. Actual customers deployments can now prove our performance claims and validate our product’s architecture.

Here’s how we did it.

The Idea – In-Memory Hadoop Accelerator

Hadoop is based on two primary technologies: HDFS for storing data, and MapReduce for processing these data in parallel. Everything else in Hadoop and the Hadoop ecosystem sits atop these foundation blocks.

Originally, neither HDFS nor MapReduce were designed with real-time performance in mind. In order to deliver real-time processing without moving data out of Hadoop onto another platform, we had to improve the performance of both of these subsystems. (emphasis added)

The highlighted phrase is the key, isn’t it?

In order to deliver real-time processing without moving data out of Hadoop onto another platform

ETL is down time, expense and risk of data corruption.

Given a choice between making your current data platform (of whatever type) more robust or risking a migration to a new data platform, which one would you choose?

Bear in mind those 2.5 million spreadsheets that Felienne mentions in her presentation.

Are you really sure you want to ETL all your data?

As opposed to making your most critical data more robust and enhanced by other data? All while residing where it lives right now.

Are you ready to get off the ETL merry-go-round?

Day 14: Stanford NER…

Day 14: Stanford NER–How To Setup Your Own Name, Entity, and Recognition Server in the Cloud by Shekhar Gulati.

From the post:

I am not a huge fan of machine learning or natural text processing (NLP) but I always have ideas in mind which require them. The idea that I will explore during this post is the ability to build a real time job search engine using twitter data. Tweets will contain the name of the company which is offering a job, the location of the job, and name of the contact person at the company. This requires us to parse the tweet for Person, Location, and Organisation. This type of problem falls under Named Entity Recognition.

A continuation of Shekhar’s Learning 30 Technologies in 30 Days… but one that merits a special shout out.

In part because you can consume the entities that others “recognize” or you can be in control of the recognition process.

It isn’t easy but on the other hand, it isn’t free from hidden choices and selection biases.

I would prefer those were my hidden choices and selection biases, if you don’t mind. 😉

Using Hive to interact with HBase, Part 1

Filed under: HBase,Hive — Patrick Durusau @ 8:07 pm

Using Hive to interact with HBase, Part 1 by Nick Dimiduk.

From the post:

This is the first of two posts examining the use of Hive for interaction with HBase tables. Check back later in the week for the concluding article.

One of the things I’m frequently asked about is how to use HBase from Apache Hive. Not just how to do it, but what works, how well it works, and how to make good use of it. I’ve done a bit of research in this area, so hopefully this will be useful to someone besides myself. This is a topic that we did not get to cover in HBase in Action, perhaps these notes will become the basis for the 2nd edition 😉 These notes are applicable to Hive 0.11.x used in conjunction with HBase 0.94.x. They should be largely applicable to 0.12.x + 0.96.x, though I haven’t tested everything yet.

The Hive project includes an optional library for interacting with HBase. This is where the bridge layer between the two systems is implemented. The primary interface you use when accessing HBase from Hive queries is called the HBaseStorageHandler. You can also interact with HBase tables directly via Input and Output formats, but the handler is simpler and works for most uses.

If you want to be on the edge of Hive/HBase interaction, start here.

Be forewarned that you will be relying on folklore, JIRA issues, and the like, but you will be ahead of the less brave.

Spreadsheets:… [95% Usage]

Filed under: Documentation,Spreadsheets,Topic Maps — Patrick Durusau @ 7:57 pm

Spreadsheets: The Ununderstood Dark Matter of IT by Felienne Hermans.


Spreadsheets are used extensively in industry: they are the number one tool for financial analysis and are also prevalent in other domains, such as logistics and planning. Their flexibility and immediate feedback make them easy to use for non-programmers. But they are as easy to build, as they are difficult to analyze, maintain and check. Felienne’s research aims at developing methods to support spreadsheet users to understand, update and improve spreadsheets. Inspiration was taken from classic software engineering, as this field is specialized in the analysis of data and calculations. In this talk Felienne will summarize her recently completed PhD research on the topic of spreadsheet structure visualization, spreadsheet smells and clone detection, as well as presenting a sneak peek into the future of spreadsheet research at Delft University.

Some tidbits to interest you in the video:

“95% of all U.S. corporations still use spreadsheets.”

“Spreadsheets can have a long life, 5 years on average.”

“No docs, errors, long life. It looks like software!”

Designing a tool for the software users are actually using, as opposed to designing tools users ought to be using.

What a marketing concept!

Not a lot of details at the PerfectXL website.

PerfectXL analyzes spreadsheets but doesn’t address the inability of spreadsheets to capture robust metadata about data or its processing in a spreadsheet.

Pay particular attention to how Felienne distinguishes a BI dashboard from a spreadsheet. You have seen that before in this blog. (Hint: Search for “F-16” or “VW.”)

No doubt you will also like Felienne’s blog.

I first saw this in a tweet by Lars Marius Garshol.

November 10, 2013

Are You A Facebook Slacker? (Or, “Don’t “Like” Me, Support Me!”)

Filed under: Facebook,Marketing,Psychology,Social Media — Patrick Durusau @ 8:09 pm

Their title reads: The Nature of Slacktivism: How the Social Observability of an Initial Act of Token Support Affects Subsequent Prosocial Action by Kirk Kristofferson, Katherine White, John Peloza. (Kirk Kristofferson, Katherine White, John Peloza. The Nature of Slacktivism: How the Social Observability of an Initial Act of Token Support Affects Subsequent Prosocial Action. Journal of Consumer Research, 2013; : 000 DOI: 10.1086/674137)


Prior research offers competing predictions regarding whether an initial token display of support for a cause (such as wearing a ribbon, signing a petition, or joining a Facebook group) subsequently leads to increased and otherwise more meaningful contributions to the cause. The present research proposes a conceptual framework elucidating two primary motivations that underlie subsequent helping behavior: a desire to present a positive image to others and a desire to be consistent with one’s own values. Importantly, the socially observable nature (public vs. private) of initial token support is identified as a key moderator that influences when and why token support does or does not lead to meaningful support for the cause. Consumers exhibit greater helping on a subsequent, more meaningful task after providing an initial private (vs. public) display of token support for a cause. Finally, the authors demonstrate how value alignment and connection to the cause moderate the observed effects.

From the introduction:

We define slacktivism as a willingness to perform a relatively costless, token display of support for a social cause, with an accompanying lack of willingness to devote significant effort to enact meaningful change (Davis 2011; Morozov 2009a).

From the section: The Moderating Role of Social Observability: The Public versus Private Nature of Support:

…we anticipate that consumers who make an initial act of token support in public will be no more likely to provide meaningful support than those who engaged in no initial act of support.

Four (4) detailed studies and an extensive review of the literature are offered to support the authors’ conclusions.

The only source that I noticed missing was:

10 Two men went up into the temple to pray; the one a Pharisee, and the other a publican.

11 The Pharisee stood and prayed thus with himself, God, I thank thee, that I am not as other men are, extortioners, unjust, adulterers, or even as this publican.

12 I fast twice in the week, I give tithes of all that I possess.

13 And the publican, standing afar off, would not lift up so much as his eyes unto heaven, but smote upon his breast, saying, God be merciful to me a sinner.

14 I tell you, this man went down to his house justified rather than the other: for every one that exalteth himself shall be abased; and he that humbleth himself shall be exalted.

King James Version, Luke 18: 10-14.

The authors would reverse the roles of the Pharisee and the publican, to find the Pharisee contributes “meaningful support,” and the publican has not.

We contrast token support with meaningful support, which we define as consumer contributions that require a significant cost, effort, or behavior change in ways that make tangible contributions to the cause. Examples of meaningful support include donating money and volunteering time and skills.

If you are trying to attract “meaningful support” for your cause or organization, i.e., avoid slackers, there is much to learn here.

If you are trying to move beyond the “cheap grace” (Bonhoeffer)* of “meaningful support” and towards “meaningful change,” there is much to be learned here as well.

Governments, corporations, ad agencies and even your competitors are manipulating the public understanding of “meaningful support” and “meaningful change.” And acceptable means for both.

You can play on their terms and lose, or you can define your own terms and roll the dice.


* I know the phrase “cheap grace” from Bonhoeffer but in running a reference to ground, I saw a statement in Wikipedia that Bonhoeffer learned that phrase from Adam Clayton Powell, Sr. Homiletics have never been a strong interest of mine but I will try to run down some sources on sermons by Adam Clayton Powell, Sr.

Erik Meijer and Rich Hickey – Clojure and Datomic

Filed under: Clojure,Datomic,Programming — Patrick Durusau @ 2:12 pm

Expert to Expert: Erik Meijer and Rich Hickey – Clojure and Datomic

From the description:

At GOTO Chicago Functional Programming Night, Erik Meijer and Rich Hickey sat down for a chat about the latest in Rich’s programming language, Clojure, and also had a short discussion about one of Rich’s latest projects, Datomic, a database written in Clojure. Always a pleasure to get a few titans together for a random discussion. Thank you Erik and Rich!

A bit dated (2012) but very enjoyable!

Ten Simple Rules for Reproducible Computational Research

Filed under: Documentation,Science — Patrick Durusau @ 12:03 pm

Ten Simple Rules for Reproducible Computational Research by Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, Eivind Hovig. (Sandve GK, Nekrutenko A, Taylor J, Hovig E (2013) Ten Simple Rules for Reproducible Computational Research. PLoS Comput Biol 9(10): e1003285. doi:10.1371/journal.pcbi.1003285)

From the article:

Replication is the cornerstone of a cumulative science [1]. However, new tools and technologies, massive amounts of data, interdisciplinary approaches, and the complexity of the questions being asked are complicating replication efforts, as are increased pressures on scientists to advance their research [2]. As full replication of studies on independently collected data is often not feasible, there has recently been a call for reproducible research as an attainable minimum standard for assessing the value of scientific claims [3]. This requires that papers in experimental science describe the results and provide a sufficiently clear protocol to allow successful repetition and extension of analyses based on original data [4].

The importance of replication and reproducibility has recently been exemplified through studies showing that scientific papers commonly leave out experimental details essential for reproduction [5], studies showing difficulties with replicating published experimental results [6], an increase in retracted papers [7], and through a high number of failing clinical trials [8], [9]. This has led to discussions on how individual researchers, institutions, funding bodies, and journals can establish routines that increase transparency and reproducibility. In order to foster such aspects, it has been suggested that the scientific community needs to develop a “culture of reproducibility” for computational science, and to require it for published claims [3].

We want to emphasize that reproducibility is not only a moral responsibility with respect to the scientific field, but that a lack of reproducibility can also be a burden for you as an individual researcher. As an example, a good practice of reproducibility is necessary in order to allow previously developed methodology to be effectively applied on new data, or to allow reuse of code and results for new projects. In other words, good habits of reproducibility may actually turn out to be a time-saver in the longer run.

The rules:

Rule 1: For Every Result, Keep Track of How It Was Produced

Rule 2: Avoid Manual Data Manipulation Steps

Rule 3: Archive the Exact Versions of All External Programs Used

Rule 4: Version Control All Custom Scripts

Rule 5: Record All Intermediate Results, When Possible in Standardized Formats

Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds

Rule 7: Always Store Raw Data behind Plots

Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected

Rule 9: Connect Textual Statements to Underlying Results

Rule 10: Provide Public Access to Scripts, Runs, and Results
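
A few of these rules are easy to make concrete. What follows is only a minimal sketch, not anything taken from the paper: `run_analysis` and `provenance_record` are hypothetical names, and the "analysis" is a toy. It shows Rule 6 (an explicit, recorded random seed) and Rule 1 (storing how a result was produced alongside the result).

```python
import hashlib
import json
import platform
import random
import sys


def run_analysis(seed, n=5):
    """Toy analysis: draw n pseudo-random numbers from an explicitly seeded generator."""
    rng = random.Random(seed)  # Rule 6: the seed is an explicit input, not hidden state
    return [rng.random() for _ in range(n)]


def provenance_record(seed, result):
    """Rule 1: keep a record of how the result was produced next to the result itself."""
    payload = json.dumps(result).encode("utf-8")
    return {
        "seed": seed,
        "python_version": platform.python_version(),
        "argv": sys.argv,
        "result_sha256": hashlib.sha256(payload).hexdigest(),
        "result": result,
    }


# Hypothetical run: seed 42 is an arbitrary choice for illustration.
record = provenance_record(42, run_analysis(42))
print(json.dumps(record, indent=2))
```

Re-running with the same seed gives the same numbers and the same checksum, which is the property a second researcher would need in order to check your work.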

To bring this a little closer to home, would another researcher be able to modify your topic map or RDF store with some certainty as to the result?

Or take over the maintenance/modification of a Hadoop ecosystem without hand holding by the current operator?

Being unable to answer either of those questions with “yes,” doesn’t show up as a line item in your current budget.

However, when the need to “reproduce” or modify your system becomes mission critical, it may be a budget (and job) busting event.

What’s your tolerance for job-ending risk?

I forgot to mention I first saw this in “Ten Simple Rules for Reproducible Computational Research” – An Excellent Read for Data Scientists by Sean Murphy.

November 9, 2013

Full-Text Indexing PDFs in Javascript

Filed under: Indexing,Javascript,PDF — Patrick Durusau @ 8:35 pm

Full-Text Indexing PDFs in Javascript by Gary Sieling.

From the post:

Mozilla Labs received a lot of attention lately for a project impressive in its ambitions: rendering PDFs in a browser using only Javascript. The PDF spec is incredibly complex, so best of luck to the pdf.js team! In a different vein, Oliver Nightingale is implementing a full-text indexer in Javascript – combining these two projects allows reproducing the PDF processing pipeline entirely in web browsers.

As a refresher, full text indexing lets a user search unstructured text, ranking resulting documents by a relevance score determined by word frequencies. The indexer counts how often each word occurs per document and makes minor modifications to the text, removing grammatical features which are irrelevant to search. E.g. it might subtract “-ing” and change vowels to phonetic common denominators. If a word shows up frequently across the document set it is automatically considered less important, and its effect on resulting ranking is minimized. This differs from the basic concept behind Google PageRank, which boosts the rank of documents based on a citation graph.

Most database software provides full-text indexing support, but large scale installations are typically handled in more powerful tools. The predominant open-source product is Solr/Lucene, Solr being a web-app wrapper around the Lucene library. Both are written in Java.

Building a Javascript full-text indexer enables search in places that were previously difficult such as Phonegap apps, end-user machines, or on user data that will be stored encrypted. There is a whole field of research to encrypted search indices, but indexing and encrypting data on a client machine seems like a good way around this naturally challenging problem. (Emphasis added.)
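
The refresher in the quoted post can be sketched in a few lines. This is a rough illustration only, written in Python rather than Javascript, and it is not the indexer the post describes: the function names are my own, the "stemmer" just strips a trailing "ing", and the scoring is plain term frequency times inverse document frequency.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on non-letters, and strip a trailing "ing" as a crude stemmer."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-3] if w.endswith("ing") and len(w) > 5 else w for w in words]


def build_index(docs):
    """Map each term to {doc_id: number of times the term occurs in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(index, n_docs, query):
    """Rank documents by summed tf * idf; a term found in every document adds zero."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))  # common terms get a smaller weight
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up documents for illustration.
docs = {
    "a": "Indexing PDF text in the browser",
    "b": "Rendering PDF pages in the browser",
    "c": "Searching encrypted text",
}
index = build_index(docs)
results = search(index, len(docs), "index pdf")
print(results)
```

Document "a" ranks first because "index" occurs only there, while "pdf" is shared with "b" and so counts for less.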

The need for a full-text indexer without using one of the major indexing packages had not occurred to me.

Access to the user’s machine might be limited by time, for example. You would not want to waste cycles spinning up a major indexer when you don’t know the installed software.

Something to add to your USB stick. 😉

Analyzing Social Media Networks using NodeXL [D.C., Nov. 13th]

Filed under: Graphs,Microsoft,Networks,NodeXL,Visualization — Patrick Durusau @ 8:22 pm

Analyzing Social Media Networks using NodeXL by Marc Smith.

From the post:

I am excited to have the opportunity to present a NodeXL workshop with Data Community DC on November 13th at 6pm in Washington, D.C.

In this session I will describe the ways NodeXL can simplify the process of collecting, storing, analyzing, visualizing and publishing reports about connected structures. NodeXL supports the exploration of social media with import features that pull data from personal email indexes on the desktop, Twitter, Flickr, Youtube, Facebook and WWW hyperlinks.

NodeXL allows non-programmers to quickly generate useful network statistics and metrics and create visualizations of network graphs. Filtering and display attributes can be used to highlight important structures in the network. Innovative automated layouts make creating quality network visualizations simple and quick.

Apologies for the short notice but I just saw the workshop announcement today.

If you are in the D.C. area and have any interest in graphs or visualization at all, you need to catch this presentation.

If you don’t believe me, take a look at the NodeXL gallery that Marc mentions in his post.

Putting graph visualization into the hands of users?

Migrating to MapReduce 2 on YARN (For Users)

Filed under: Hadoop YARN,MapReduce 2.0 — Patrick Durusau @ 8:10 pm

Migrating to MapReduce 2 on YARN (For Users) by Sandy Ryza.

From the post:

In Apache Hadoop 2, YARN and MapReduce 2 (MR2) are long-needed upgrades for scheduling, resource management, and execution in Hadoop. At their core, the improvements separate cluster resource management capabilities from MapReduce-specific logic. They enable Hadoop to share resources dynamically between MapReduce and other parallel processing frameworks, such as Cloudera Impala; allow more sensible and finer-grained resource configuration for better cluster utilization; and permit Hadoop to scale to accommodate more and larger jobs.

In this post, users of CDH (Cloudera’s distribution of Hadoop and related projects) who program MapReduce jobs will get a guide to the architectural and user-facing differences between MapReduce 1 (MR1) and MR2. (MR2 is the default processing framework in CDH 5, although MR1 will continue to be supported.) Operators/administrators can read a similar post designed for them here.

From further within the post:

MR2 supports both the old (“mapred”) and new (“mapreduce”) MapReduce APIs used for MR1, with a few caveats. The difference between the old and new APIs, which concerns user-facing changes, should not be confused with the difference between MR1 and MR2, which concerns changes to the underlying framework. CDH 4 and CDH 5 support the new and old MapReduce APIs as well as both MR1 and MR2. (Now, go back and read this paragraph again, because the naming is often a source of confusion.) (Emphasis added.)

And under Job Configuration:

As in MR1, job configuration options can be specified on the command line, in Java code, or in the mapred-site.xml on the client machine in the same way they previously were. Most job configuration options, with rare exceptions, that were available in MR1 work in MR2 as well. For consistency and clarity, many options have been given new names. The older names are deprecated, but will still work for the time being. The exceptions are mapred.child.ulimit and all options relating to JVM reuse, which are no longer supported. (Emphasis added.)

That’s all very reassuring.

Are your MapReduce engineers using the old names (deprecated) or the new names or some combination of both?

As software evolves, changing of names cannot be avoided and no doubt Cloudera has tried to avoid gratuitous name changes.

But at the bottom line, isn’t it your responsibility to track internal use of names? For consistency and maintenance?
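
Tracking those names need not be elaborate. The sketch below is one possible starting point, not a Cloudera tool: the `audit` function and the sample job settings are hypothetical, and the old-to-new mapping is a three-entry subset that should be checked against the deprecated-properties list for your Hadoop version. The unsupported option is the one named in the quoted post.

```python
# Old -> new option names: a small illustrative subset, to be verified
# against the deprecated-properties list shipped with your Hadoop version.
DEPRECATED = {
    "mapred.job.name": "mapreduce.job.name",
    "mapred.map.tasks": "mapreduce.job.maps",
    "mapred.reduce.tasks": "mapreduce.job.reduces",
}

# Named in the quoted post as no longer supported in MR2.
UNSUPPORTED = {"mapred.child.ulimit"}


def audit(config):
    """Return (renames, unsupported) for a dict of job configuration options."""
    renames = {name: DEPRECATED[name] for name in config if name in DEPRECATED}
    unsupported = sorted(name for name in config if name in UNSUPPORTED)
    return renames, unsupported


# Hypothetical job settings mixing old and new names.
job_config = {
    "mapred.reduce.tasks": "8",
    "mapreduce.job.name": "nightly-report",
    "mapred.child.ulimit": "1048576",
}
renames, unsupported = audit(job_config)
print("rename:", renames)
print("unsupported:", unsupported)
```

Run over every job configuration you own, something like this at least tells you which vocabulary your engineers are actually using.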

Hue: New Search feature: Graphical facets

Filed under: Hadoop,Hue — Patrick Durusau @ 4:54 pm

Hue: New Search feature: Graphical facets

A very short video demonstrating graphical facets in Hue.

If you aren’t already interested in Hue, you will be!

November 8, 2013

Restructuring the Web with Git

Filed under: Git,Github,Subject Identity — Patrick Durusau @ 8:04 pm

Restructuring the Web with Git by Simon St. Laurent.

From the post:

Web designers? Git? Github? Aren’t those for programmers? At Artifact, Christopher Schmitt showed designers how much their peers are already doing with Github, and what more they can do. Github (and the underlying Git toolset) changes the way that all kinds of people work together.

Sharing with Git

As amazing as Linux may be, I keep thinking that Git may prove to be Linus Torvalds’ most important contribution to computing. Most people think of it, if they think of it at all, as a tool for managing source code. It can do far more, though, providing a drastically different (and I think better) set of tools for managing distributed projects, especially those that use text.

Git tackles an unwieldy problem, managing the loosely structured documents that humans produce. Text files are incredibly flexible, letting us store everything from random notes to code of all kinds to tightly structured data. As awesome as text files are—readable, searchable, relatively easy to process—they tend to become a mess when there’s a big pile of them.

Simon makes a good argument for the version control and sharing aspects of Github.

But Github doesn’t offer any features (that I am aware of) to manage the semantics of the data stored at Github.

For example, if I search for “greek,” I am returned results that include the Greek language, Greek mythology, New Testament Greek, etc.

There are only four hundred and sixty-five (465) results as of today but even if I look at all of them, I have no reason to think I have found all the relevant resources.

For example, a search on Greek Mythology would miss:

Myths-and-myth-makers–Old-Tales-and-Superstitions-Interpreted-by-Comparative-Mythology_1061, which has one hundred and four (104) references to Greek gods/mythology.

Moreover, having now discovered that this work should be returned on a search for Greek Mythology, how do I impart that knowledge to the system so that future users will find that work?

Github works quite well, but it has a ways to go before it improves on the finding of documents.

Diagrams for hierarchical models: New drawing tools

Filed under: Graphics,Visualization — Patrick Durusau @ 7:41 pm

Diagrams for hierarchical models: New drawing tools

From the post:

Two new drawing tools for making hierarchical diagrams have been recently developed. One tool is a set of distribution and connector templates in LibreOffice Draw and R, created by Rasmus Bååth. Another tool is scripts for making the drawings in LaTeX via TikZ, created by Tinu Schneider. Here is an example of a diagram made by Tinu Schneider, using TikZ/LaTeX with Rasmus Bååth’s distribution icons:

New tools for your diagram drawing toolbelt!

Pragmatic Cypher Optimization (2.0 M06)

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 7:34 pm

Pragmatic Cypher Optimization (2.0 M06)

From the post:

I’ve seen a few stack overflow and google group questions about queries that are slow, and I think there are some things that need to be said regarding Cypher optimization. These techniques are a few ways of improving your queries that aren’t necessarily intuitive. Before reading this, you should have an understanding of WITH (see my other post: The Mythical With).

First, let me throw out a nice disclaimer that these rules of thumb I’ve discovered are by no means definitively best practices, and you should measure your own results with cold and warm caches, running queries 3+ times to see realistic results with a warm cache.

Second, let me throw out another disclaimer, that Cypher is improving rapidly, and that these rules of thumb may only be valid for a few milestone releases. I’ll try to make future updates, but I’m sure there’s always danger of becoming out of date.

Ok, let’s get to it.

If you are looking for faster Cypher query results (who isn’t?), this is a good starting place for you!
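
The advice in the quoted post about measuring with cold and warm caches, and running queries 3+ times, can be sketched as a small timing harness. This is only an illustration in Python: `time_runs`, `summarize` and `fake_query` are hypothetical names, `fake_query` stands in for whatever call submits your Cypher query, and a truly cold measurement would also require the database's own caches to be empty.

```python
import time


def time_runs(run_query, runs=3):
    """Call run_query `runs` times; return the elapsed seconds of each call, in order."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Treat the first call as the cold-cache run and the rest as warm-cache runs."""
    cold, warm = timings[0], timings[1:]
    return {"cold": cold, "warm_avg": sum(warm) / len(warm) if warm else None}


# Stand-in for a real query: slow on the first call, cached afterwards.
_cache = {}


def fake_query():
    if "result" not in _cache:
        time.sleep(0.05)  # simulate the cost of a cold cache
        _cache["result"] = 42
    return _cache["result"]


timings = time_runs(fake_query, runs=3)
summary = summarize(timings)
print(summary)
```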

How to use R … in MapReduce and Hive

Filed under: Hadoop,Hive,Hortonworks,R — Patrick Durusau @ 7:28 pm

How to use R and other non-Java languages in MapReduce and Hive by Tom Hanlon.

From the post:

I teach for Hortonworks and in class just this week I was asked to provide an example of using the R statistics language with Hadoop and Hive. The good news was that it can easily be done. The even better news is that it is actually possible to use a variety of tools: Python, Ruby, shell scripts and R to perform distributed fault tolerant processing of your data on a Hadoop cluster.

In this blog post I will provide an example of using R, with Hive. I will also provide an introduction to other non-Java MapReduce tools.

If you wanted to follow along and run these examples in the Hortonworks Sandbox you would need to install R.

The Hortonworks Sandbox just keeps getting better!
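
For a feel of what non-Java MapReduce looks like, here is a minimal sketch in Python, one of the languages the post lists. It is not code from Tom's post. It assumes the usual streaming convention of tab-separated key/value text records, with the reducer receiving records sorted by key, and it simulates map, sort and reduce locally on made-up input rather than on a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one "word<TAB>1" record per word found in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each key; assumes the records arrive sorted by key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort -> reduce on made-up input.
sample = ["Hive and R", "R and Hadoop"]
counts = list(reducer(sorted(mapper(sample))))
for record in counts:
    print(record)
```

On a cluster the mapper and reducer would be separate scripts reading standard input and writing standard output, with the framework doing the sorting in between.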

Facebook’s Presto 10X Hive Speed (mostly)

Filed under: Facebook,Hive,Presto — Patrick Durusau @ 5:59 pm

Facebook open sources its SQL-on-Hadoop engine, and the web rejoices by Derrick Harris.

From the post:

Facebook has open sourced Presto, the interactive SQL-on-Hadoop engine the company first discussed in June. Presto is Facebook’s take on Cloudera’s Impala or Google’s Dremel, and it already has some big-name fans in Dropbox and Airbnb.

Technologically, Presto and other query engines of its ilk can be viewed as faster versions of Hive, the data warehouse framework for Hadoop that Facebook created several years ago. Facebook and many other Hadoop users still rely heavily on Hive for batch-processing jobs such as regular reporting, but there has been a demand for something letting users perform ad hoc, exploratory queries on Hadoop data similar to how they might do them using a massively parallel relational database.

Presto is 10 times faster than Hive for most queries, according to Facebook software engineer Martin Traverso in a blog post detailing today’s news.

I think my headline is the more effective one. 😉

You won’t know anything until you download Presto, read the documentation, etc.

Presto homepage.

The first job is to get your attention, then you have to get the information necessary to be informed.

From Derrick’s post, which points to other SQL-on-Hadoop options, interesting times are ahead!

JML [Java Machine Learning]

Filed under: Java,Machine Learning — Patrick Durusau @ 5:48 pm

JML [Java Machine Learning] by Mingjie Qian.

From the webpage:

JML is a pure Java library for machine learning. The goal of JML is to make machine learning methods easy to use and speed up the code translation from MATLAB to Java. Tutorial-JML.pdf

Current version implements logistic regression, Maximum Entropy modeling (MaxEnt), AdaBoost, LASSO, KMeans, spectral clustering, Nonnegative Matrix Factorization (NMF), sparse NMF, Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA) (by Gibbs sampling based on by Gregor Heinrich), joint l_{2,1}-norms minimization, Hidden Markov Model (HMM), Conditional Random Field (CRF), etc. just for examples of implementing machine learning methods by using this general framework. The SVM package LIBLINEAR is also incorporated. I will try to add more important models such as Markov Random Field (MRF) to this package if I get the time:)

JML library’s another advantage is its complete independence from feature engineering, thus any preprocessed data could be run. For example, in the area of natural language processing, feature engineering is a crucial part for MaxEnt, HMM, and CRF to work well and is often embedded in model training. However, we believe that it is better to separate feature engineering and parameter estimation. On one hand, modularization could be achieved so that people can simply focus on one module without need to consider other modules; on the other hand, implemented modules could be reused without incompatibility concerns.

JML also provides implementations of several efficient, scalable, and widely used general purpose optimization algorithms, which are very important for machine learning methods be applicable on large scaled data, though particular optimization strategy that considers the characteristics of a particular problem is more effective and efficient (e.g., dual coordinate descent for bound constrained quadratic programming in SVM). Currently supported optimization algorithms are limited-memory BFGS, projected limited-memory BFGS (non-negative constrained or bound constrained), nonlinear conjugate gradient, primal-dual interior-point method, general quadratic programming, accelerated proximal gradient, and accelerated gradient descent. I would always like to implement more practical efficient optimization algorithms. (emphasis in original)

Something else “practical” for your weekend. 😉

OrientDB becomes distributed…

Filed under: Cloud Computing,Graphs,OrientDB,Zookeeper — Patrick Durusau @ 5:21 pm

OrientDB becomes distributed using Hazelcast, leading open source in-memory data grid

From the post:

Hazelcast and Orient Technologies today announced that OrientDB has gained a multi-master replication feature powered by Hazelcast.

Clustering multiple server nodes is the most significant feature of OrientDB 1.6. Databases can be replicated across heterogeneous server nodes in multi-master mode achieving the best of scalability and performance.

“I think one of the added value of OrientDB against all the NoSQL products is the usage of Hazelcast while most of the others use Yahoo ZooKeeper to manage the cluster (discovery, split brain network, etc) and something else for the transport layer.” said Luca Garulli, CEO of Orient Technologies. “With ZooKeeper configuration is a nightmare, while Hazelcast let you to add OrientDB servers with ZERO configuration. This has been a big advantage for our clients and everything is much more ‘elastic’, specially when deployed on the Cloud. We’ve used Hazelcast not only for the auto-discovery, but also for the transport layer. Thanks to this new architecture all our clients can scale up horizontally by adding new servers without stopping or reconfigure the cluster”.

“We are amazed by the speed with which OrientDB has adopted Hazelcast and we are delighted to see such excellent technologists teaming up with Hazelcast.” said Talip Ozturk, CEO of Hazelcast. “We work hard to make the best open source in-memory data grid on the market and are happy to see it being used in this way.” (emphasis added)

It was just yesterday that I was writing about configuration issues in the Hadoop ecosystem, which includes ZooKeeper. Hadoop Ecosystem Configuration Woes?

Where there is smoke, is there fire?

Property Graphs Model and API Community Group

Filed under: Graphs,W3C — Patrick Durusau @ 5:00 pm

Property Graphs Model and API Community Group

From the webpage:

This group will explore the Property Graph data model and API and decide whether this area is ripe for standardization. Property Graphs are used to analyze social networks and in other Big Data applications using NoSQL databases.

The group may want to investigate several extensions to the data model. For example, should nodes be typed; what datatypes are allowed for property values; can properties have multiple values and should we add collection types such as sets and maps to the data model? At the same time, we need to bear in mind that there are several Property Graph vendors and implementations and we may not want to deviate significantly from current practice.

Existing Property Graph APIs are either navigational e.g. Tinkerpop or declarative e.g. Neo4j. For a W3C standard we may want to design a more HTTP and REST-oriented interface in the style of OData Protocol and OData URL Conventions. In this style, you construct URLs for collections of nodes and edges. For example, a GET on http://server/nodes would return the collection of nodes on the server. A GET on http://server/nodes/in(type = ‘knows’ ) would return the collection of incoming arcs with type = ‘knows’ and a GET on http://server/nodes/out(type = ‘created’ ) would return the collection of outgoing arcs with type = ‘created’. Once a collection of nodes or arcs is selected with the URI, query operators can be used to add functions to select properties to be returned. Functions can also be used to return aggregate properties such as count and average.

The group will deliver a recommendation to the W3C regarding whether and how the Property Graph work should be taken forward towards standardization.

Potentially an interesting blend of property graphs and OData.
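To make the URL style in the group's description concrete, here is a rough sketch of how a client might build those collection URLs. It is only an illustration: `collection_url` is a hypothetical helper, `http://server` is the placeholder host from the quoted text, and percent-encoding the spaces is my assumption, since the group has not specified any encoding.

```python
from urllib.parse import quote

BASE = "http://server"  # placeholder host taken from the quoted description


def collection_url(direction=None, edge_type=None):
    """Build a URL for all nodes, or for incoming/outgoing arcs of one type.

    Hypothetical helper: the path layout mirrors the examples in the
    group's description, nothing more.
    """
    url = f"{BASE}/nodes"
    if direction is None:
        return url
    if direction not in ("in", "out"):
        raise ValueError("direction must be 'in' or 'out'")
    segment = f"{direction}(type = '{edge_type}')"
    # Keep the parentheses, quotes and '=' readable; encode only the spaces.
    encoded = quote(segment, safe="()'=")
    return f"{url}/{encoded}"


print(collection_url())               # all nodes
print(collection_url("in", "knows"))  # incoming arcs of type 'knows'
print(collection_url("out", "created"))
```

Whether filters like `type = 'knows'` belong in the path or in a query string is exactly the kind of question the group would have to settle.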

No W3C membership is required for this community group, so if you are interested, join the community!

Sqooping Data with Hue

Filed under: Cloudera,Hadoop,Hue — Patrick Durusau @ 4:47 pm

Sqooping Data with Hue by Abraham Elmahrek.

From the post:

Hue, the open source Web UI that makes Apache Hadoop easier to use, has a brand-new application that enables transferring data between relational databases and Hadoop. This new application is driven by Apache Sqoop 2 and has several user experience improvements, to boot.

Sqoop is a batch data migration tool for transferring data between traditional databases and Hadoop. The first version of Sqoop is a heavy client that drives and oversees data transfer via MapReduce. In Sqoop 2, the majority of the work was moved to a server that a thin client communicates with. Also, any client can communicate with the Sqoop 2 server over its JSON-REST protocol. Sqoop 2 was chosen instead of its predecessors because of its client-server design.

I knew I was missing one or more Hadoop ecosystem components yesterday! Hadoop Ecosystem Configuration Woes? I left Hue out but also some others.

The Hadoop “ecosystem” varies depending on which open source supporter you read. I didn’t take the time to cross-check my list against all the major supporters. Will be correcting that over the weekend.

This will give you something “practical” to do over the weekend. 😉

ParLearning 2014

ParLearning 2014 The 3rd International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics.


Workshop Paper Due: December 30, 2013
Author Notification: February 14, 2014
Camera-ready Paper Due: March 14, 2014
Workshop: May 23, 2014 Phoenix, AZ, USA

From the webpage:

Data-driven computing needs no introduction today. The case for using data for strategic advantages is exemplified by web search engines, online translation tools and many more examples. The past decade has seen 1) the emergence of multicore architectures and accelerators as GPGPUs, 2) widespread adoption of distributed computing via the map-reduce/hadoop eco-system and 3) democratization of the infrastructure for processing massive datasets ranging into petabytes by cloud computing. The complexity of the technological stack has grown to an extent where it is imperative to provide frameworks to abstract away the system architecture and orchestration of components for massive-scale processing. However, the growth in volume and heterogeneity in data seems to outpace the growth in computing power. A “collect everything” culture stimulated by cheap storage and ubiquitous sensing capabilities contribute to increasing the noise-to-signal ratio in all collected data. Thus, as soon as the data hits the processing infrastructure, determining the value of information, finding its rightful place in a knowledge representation and determining subsequent actions are of paramount importance. To use this data deluge to our advantage, a convergence between the field of Parallel and Distributed Computing and the interdisciplinary science of Artificial Intelligence seems critical. From application domains of national importance as cyber-security, health-care or smart-grid to providing real-time situational awareness via natural interface based smartphones, the fundamental AI tasks of Learning and Inference need to be enabled for large-scale computing across this broad spectrum of application domains.

Many of the prominent algorithms for learning and inference are notorious for their complexity. Adopting parallel and distributed computing appears as an obvious path forward, but the mileage varies depending on how amenable the algorithms are to parallel processing and secondly, the availability of rapid prototyping capabilities with low cost of entry. The first issue represents a wider gap as we continue to think in a sequential paradigm. The second issue is increasingly recognized at the level of programming models, and building robust libraries for various machine-learning and inferencing tasks will be a natural progression. As an example, scalable versions of many prominent graph algorithms written for distributed shared memory architectures or clusters look distinctly different from the textbook versions that generations of programmers have grown with. This reformulation is difficult to accomplish for an interdisciplinary field like Artificial Intelligence for the sheer breadth of the knowledge spectrum involved. The primary motivation of the proposed workshop is to invite leading minds from AI and Parallel & Distributed Computing communities for identifying research areas that require most convergence and assess their impact on the broader technical landscape.

Taking full advantage of parallel processing remains a distant goal. This workshop looks like a good concrete step towards that goal.

November 7, 2013

16 Reasons Data Scientists are Difficult to Manage

Filed under: Data Science,Humor — Patrick Durusau @ 7:22 pm

16 Reasons Data Scientists are Difficult to Manage

No spoilers. Go read Amy’s post.

I think it would have worked better as:

Data Scientist Scoring Test.

With values associated with the answers.


Hot Topics: The DuraSpace Community Webinar Series

Filed under: Archives,Data Preservation,DSpace,Preservation — Patrick Durusau @ 7:14 pm

Hot Topics: The DuraSpace Community Webinar Series

From the DuraSpace about page:

DuraSpace supported open technology projects provide long-term, durable access to and discovery of digital assets. We put together global, strategic collaborations to sustain DSpace and Fedora, two of the most widely-used repository solutions in the world. More than fifteen hundred institutions use and help develop these open source software repository platforms. DSpace and Fedora are directly supported with in-kind contributions of development resources and financial donations through the DuraSpace community sponsorship program.

Like most of you, I’m familiar with DSpace and Fedora but I wasn’t familiar with the “Hot Topics” webinar series. I was following a link from Recommended! “Metadata and Repository Services for Research Data Curation” Webinar by Imma Subirats, when I encountered the “Hot Topics” page.

  • Series Six: Research Data in Repositories
  • Series Five: VIVO–Research Discovery and Networking
  • Series Four: Research Data Management Support
  • Series Three: Get a Head on Your Repository with Hydra End-to-End Solutions
  • Series Two: Managing and Preserving Audio and Video in your Digital Repository
  • Series One: Knowledge Futures: Digital Preservation Planning

Each series consists of three (3) webinars, all with recordings, most with slides as well.

Warning: Data curation doesn’t focus on the latest and coolest GPU processing techniques.

But, in ten to fifteen years when GPU techniques are like COBOL is now, good data curation will enable future students to access those techniques.

I think that is worthwhile.


Creating Knowledge out of Interlinked Data…

Filed under: Linked Data,LOD,Open Data,Semantic Web — Patrick Durusau @ 6:55 pm

Creating Knowledge out of Interlinked Data – STATISTICAL OFFICE WORKBENCH by Bert Van Nuffelen and Karel Kremer.

From the slides:

LOD2 is a large-scale integrating project co-funded by the European Commission within the FP7 Information and Communication Technologies Work Programme. This 4-year project comprises leading Linked Open Data technology researchers, companies, and service providers. Coming from across 12 countries the partners are coordinated by the Agile Knowledge Engineering and Semantic Web Research Group at the University of Leipzig, Germany.

LOD2 will integrate and syndicate Linked Data with existing large-scale applications. The project shows the benefits in the scenarios of Media and Publishing, Corporate Data intranets and eGovernment.

LOD2 Stack Release 3.0 overview

Connecting the dots: Workbench for Statistical Office

In case you are interested:

LOD2 homepage

Ubuntu 12.04 Repository

VM User / Password: lod2demo / lod2demo

LOD2 blog

The LOD2 project expires in August of 2014.

Linked Data is going to be around, one way or the other, for quite some time.

My suggestion: Grab the last VM from LOD2 and a copy of its OS, and store them in a location that migrates as data systems change.

Enhancing Time Series Data by Applying Bitemporality

Filed under: Data,Time,Time Series,Timelines,Topic Maps — Patrick Durusau @ 5:30 pm

Enhancing Time Series Data by Applying Bitemporality (It’s not just what you know, it’s when you know it) by Jeffrey Shmain.

A “white paper” and all that implies but it raises the interesting question of setting time boundaries for the validity of data.

From the context of the paper, “bitemporality” means setting a start and end time for the validity of some unit of data.

We all know the static view of the world presented by most data systems is false. But it works well enough in some cases.

The problem is that most data systems don’t allow you to choose static versus some other view of the world.

That is in part because getting a non-static view of the world means either modifying your data system (often not a good idea) or migrating to another data system (which is expensive and not risk free).

Jeffrey remarks in the paper that “all data is time series data” and he’s right. Data arrives at time X, was sent at time T, was logged at time Y, was seen by the CIO at Z, etc. To say nothing of tracking changes to that data.
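A minimal sketch of what tracking those times could look like, assuming the common reading of bitemporality in which each fact carries both a validity interval and the time it was recorded. This is not the model from the paper: the `Fact` fields and the `as_of` helper are illustrative names of my own.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Fact:
    key: str
    value: float
    valid_from: datetime           # start of the period the value applies to
    valid_to: Optional[datetime]   # end of that period; None means open-ended
    recorded_at: datetime          # when the system learned this value


def as_of(facts, key, valid_at, known_at):
    """Return what was believed at `known_at` about `key` at time `valid_at`."""
    candidates = [
        f for f in facts
        if f.key == key
        and f.recorded_at <= known_at
        and f.valid_from <= valid_at
        and (f.valid_to is None or valid_at < f.valid_to)
    ]
    # The most recently recorded matching fact wins (later corrections override).
    return max(candidates, key=lambda f: f.recorded_at, default=None)


facts = [
    Fact("price", 10.0, datetime(2013, 11, 1), None, datetime(2013, 11, 1)),
    # A correction recorded on Nov 5 for the same validity period.
    Fact("price", 12.0, datetime(2013, 11, 1), None, datetime(2013, 11, 5)),
]
before = as_of(facts, "price", datetime(2013, 11, 2), known_at=datetime(2013, 11, 3))
after = as_of(facts, "price", datetime(2013, 11, 2), known_at=datetime(2013, 11, 6))
print(before.value, after.value)
```

The same question about November 2 gets two answers depending on when you ask it, which is the "when you know it" part of the paper's subtitle.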

Not all cases require that much detail but if you need it, wouldn’t it be nice to have?

Your present system may limit you to static views but topic maps can enhance your system in place. Avoiding the dangers of upgrading in place and/or migrating into unknown perils and hazards.

When did you know you needed time based validity for your data?

For a bit more technical view of bitemporality. (authored by Robbert van Dalen)

Hadoop Ecosystem Configuration Woes?

Filed under: Documentation,Hadoop,Topic Maps — Patrick Durusau @ 3:15 pm

After listening to Kathleen Ting (Cloudera) describe how 44% of support tickets for the Hadoop ecosystem arise from misconfiguration (Dealing with Data in the Hadoop Ecosystem…), I started to wonder how many opportunities there are for misconfiguration in the Hadoop ecosystem?

That’s probably not an answerable question, but we can look at how configurations are documented in the Hadoop ecosystem:

Comment syntax in the Hadoop ecosystem:

  • Accumulo – XML <!-- comment -->
  • Avro – Schemas defined in JSON (no comment facility)
  • Cassandra – “#” comment indicator
  • Chukwa – XML <!-- comment -->
  • Falcon – XML <!-- comment -->
  • Flume – “#” comment indicator
  • Hadoop – XML <!-- comment -->
  • Hama – XML <!-- comment -->
  • HBase – XML <!-- comment -->
  • Hive – XML <!-- comment -->
  • Knox – XML <!-- comment -->
  • Mahout – XML <!-- comment -->
  • Pig – C-style comments
  • Sqoop – “#” comment indicator
  • Tez – XML <!-- comment -->
  • ZooKeeper – text but no apparent ability to comment (ZooKeeper Administrator’s Guide)

I read that to mean:

1 component, Pig, uses C-style comments.

2 components, Avro and ZooKeeper, have no ability for comments at all.

3 components, Cassandra, Flume and Sqoop, use “#” for comments.

10 components, Accumulo, Chukwa, Falcon, Hama, Hadoop, HBase, Hive, Knox, Mahout and Tez, presumably support XML comments.

A full one third of the Hadoop ecosystem (6 of the 16 components listed) uses non-XML comments, if comments are permitted at all. The other two-thirds of the ecosystem uses XML comments in some files and not others.
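The tally above is easy to check. A small sketch, using only the per-style counts from my list (the style labels are my own shorthand):

```python
# Number of components per comment style, taken from the list above.
STYLE_COUNTS = {
    "XML <!-- -->": 10,
    "# indicator": 3,
    "C style": 1,
    "no comment facility": 2,
}


def non_xml_share(counts):
    """Return (non-XML components, total components, non-XML fraction)."""
    total = sum(counts.values())
    non_xml = sum(n for style, n in counts.items() if not style.startswith("XML"))
    return non_xml, total, non_xml / total


non_xml, total, share = non_xml_share(STYLE_COUNTS)
print(f"{non_xml} of {total} components ({share:.1%}) do not use XML comments")
```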

The entire ecosystem lacks a standard way to associate values or settings in one component with values or settings in another component.

To say nothing of associating values or settings with releases of different components.

Without looking at the details of the possible settings for each component, does that seem problematic to you?

Dealing with Data in the Hadoop Ecosystem…

Filed under: Cloudera,Data,Hadoop — Patrick Durusau @ 1:15 pm

Dealing with Data in the Hadoop Ecosystem – Hadoop, Sqoop, and ZooKeeper by Rachel Roumeliotis.

From the post:

Kathleen Ting (@kate_ting), Technical Account Manager at Cloudera, and our own Andy Oram … [Discussed at 0:22]

  • ZooKeeper, the canary in the Hadoop coal mine [Discussed at 1:10]
  • Leaky clients are often a problem ZooKeeper detects [Discussed at 2:10]
  • Sqoop is a bulk data transfer tool [Discussed at 2:47]
  • Sqoop helps to bring together structured and unstructured data [Discussed at 3:50]
  • ZooKeeper is not for storage, but coordination, reliability, availability [Discussed at 4:44]

Conference interview so not deep but interesting.

For example, Kathleen reported that 44% of production errors could be traced to misconfiguration.

    dagre – Graph layout for JavaScript

    Filed under: D3,Graphs,Graphviz,Javascript,Visualization — Patrick Durusau @ 10:39 am

    dagre – Graph layout for JavaScript by Chris Pettitt.

    From the webpage:

    Dagre is a JavaScript library that makes it easy to lay out directed graphs on the client-side.

    Key priorities for this library are:

    1. Completely client-side computed layout. There are great, feature-rich alternatives, like graphviz, if client-side layout is not a requirement for you.

    2. Speed. Dagre must be able to draw medium sized graphs quickly, potentially at the cost of not being able to adopt more optimal or exact algorithms.

    3. Rendering agnostic. Dagre requires only very basic information to lay out graphs, such as the dimensions of nodes. You’re free to render the graph using whatever technology you prefer. We use D3 in some of our examples and highly recommend it if you plan to render using CSS and SVG.

    Note that dagre is currently a pre-1.0.0 library. We will do our best to maintain backwards compatibility for patch level increases (e.g. 0.0.1 to 0.0.2) but make no claim to backwards compatibility across minor releases (e.g. 0.0.1 to 0.1.0). Watch our CHANGELOG for details on changes.

    You are delivering content to the client side, yes?

    I don’t have a feel for what “medium sized graphs” are the target so would appreciate comments on your experiences with this library.

    One of the better readmes I have seen on GitHub.

    I first saw this in a tweet by Chris Diehl.

    A Proposed Taxonomy of Plagiarism

    Filed under: Plagiarism,Taxonomy — Patrick Durusau @ 10:09 am

    A Proposed Taxonomy of Plagiarism Or, what we talk about when we talk about plagiarism by Rick Webb.

    From the post:

    What with the recent Rand Paul plagiarism scandal, I’d like to propose a new taxonomy of plagiarism. Some plagiarism is worse than others, and the basic definition of plagiarism that most people learned in school is only part of it.

    Chris Hayes started off his show today by referencing the Wikipedia definition of plagiarism: “the ‘wrongful appropriation’ and ‘purloining and publication’ of another author’s ‘language, thoughts, ideas, or expressions,’ and the representation of them as one’s own original work.” The important point here that most people overlook is the theft of ideas. We all learn in school that plagiarism exists if we wholesale copy and paste other people’s words. But ideas are actually a big part of it.

    Interesting read but I am not sure the taxonomy is fine grained enough.

    Topic maps, like any other publication, have the potential for plagiarism. But I would make plagiarism distinctions for topic map content based upon its intended audience.

    For example, if I were writing a topic map about topic maps, there would be a lot of terms and subjects which I would use, relying on the background of the audience to know they did not originate with me.

    But when I moved into the first instance of an idea being proposed, etc., then I should be using more formal citation because that enables the reader to track the development of a particular idea or strategy. It would be inappropriate to talk about tolog, for example, without crediting Lars Marius Garshol with its creation and clearly distinguishing any statements about tolog as being from particular sources.

    All topic map followers already know those facts but in formal writing, you should help the reader with tracking down the sources you relied upon.

    A committee discussion of tolog is a completely different case: no one is going to footnote their comments and, hopefully, if you are participating in a discussion of tolog, you are aware of its origins.

    On the Rand Paul “scandal,” I think the media reaction cheapens the notion of plagiarism.

    A better response to Rand Paul (you pick the topic) would be:

    [Senator Paul], what you’ve just said is one of the most insanely idiotic things I have ever heard. At no point in your rambling, incoherent response were you even close to anything that could be considered a rational thought. Everyone in this room is now dumber for having listened to it. I award you no points, and may God have mercy on your soul. (Billy Madison)

    A new slogan for CNN (original): CNN: Spreading Dumbness 24X7.
