Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 26, 2011

Using Tools You Already Have

Filed under: Data Analysis,Shell Scripting — Patrick Durusau @ 6:22 pm

Using Tools You Already Have

A useful post on why every data scientist should know something about bash scripting.

The Vizosphere

Filed under: Visualization — Patrick Durusau @ 6:22 pm

The Vizosphere from Flowing Data:

From the post:

There are lots of people on Twitter who talk visualization. Moritz Stefaner had some fun with Gephi for a view of a whole lot of those people. He calls it the Vizosphere.

This map shows 1645 twitter accounts related to the topic of information visualization. The accounts were determined as follows: For a subjective selection of “seed accounts”, the twitter API was queried for followers and friends. In order to be included into the map, a user account needed to have at least 5 links (i.e. follow or being followed) to one of these accounts. The size of the network nodes indicates the number of followers within this network.
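The inclusion rule quoted above (at least 5 follow links to the seed set) can be sketched in a few lines. This is a hypothetical reconstruction for illustration only; the account names, the link format, and the `filter_accounts` helper are invented, not taken from Moritz's actual pipeline.

```python
# Hypothetical sketch of the inclusion rule: keep a non-seed account only if
# it has at least 5 follow links (in either direction) to the seed accounts.

def filter_accounts(links, seeds, min_links=5):
    """links: iterable of (follower, followed) pairs; seeds: set of seed accounts.
    Returns the set of non-seed accounts with >= min_links links to seeds."""
    counts = {}
    for follower, followed in links:
        if followed in seeds and follower not in seeds:
            counts[follower] = counts.get(follower, 0) + 1
        if follower in seeds and followed not in seeds:
            counts[followed] = counts.get(followed, 0) + 1
    return {user for user, n in counts.items() if n >= min_links}
```

Node sizing by follower count within the filtered network would then be a second pass over the same link list.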

Interesting visualization but it occurs to me that visualizing relationships, on Twitter or elsewhere, is like drawing out a family tree. Interesting for a while but only for a while.

But then I wonder what would make such a visualization more interesting/useful?

Some possibilities:

  1. What are the directions of the who-follows-whom relationships?
  2. What are the directions of retweeting in the relationships?

And beyond the world of Twitter:

  1. Geographic locations of the users.
  2. (Other dimension of your choice)

The visualization by Moritz demonstrates an interesting technique.

My concern is that we may not ask what’s missing in a visualization.

Using MySQL as a NoSQL…

Filed under: MySQL — Patrick Durusau @ 6:21 pm

Using MySQL as a NoSQL – A story for exceeding 750,000 qps on a commodity server by Yoshinori Matsunobu.

From the post:

Most of high scale web applications use MySQL + memcached. Many of them use also NoSQL like TokyoCabinet/Tyrant. In some cases people have dropped MySQL and have shifted to NoSQL. One of the biggest reasons for such a movement is that it is said that NoSQL performs better than MySQL for simple access patterns such as primary key lookups. Most of queries from web applications are simple so this seems like a reasonable decision.

Like many other high scale web sites, we at DeNA(*) had similar issues for years. But we reached a different conclusion. We are using “only MySQL”. We still use memcached for front-end caching (i.e. preprocessed HTML, count/summary info), but we do not use memcached for caching rows. We do not use NoSQL, either. Why? Because we could get much better performance from MySQL than from other NoSQL products. In our benchmarks, we could get 750,000+ qps on a commodity MySQL/InnoDB 5.1 server from remote web clients. We also have got excellent performance on production environments.

Maybe you can’t believe the numbers, but this is a real story. In this long blog post, I’d like to share our experiences.

Perhaps MySQL will be part of your next topic map system!

Spring Data Graph 1.1.0.RC1

Filed under: Neo4j,Spring Data — Patrick Durusau @ 6:20 pm

Spring Data Graph 1.1.0.RC1 with Neo4j support Released

From the post:

We are pleased to announce that a new release candidate (1.1.0.RC1) of the Spring Data Graph project with Neo4j support is now available!

The primary goal of the Spring Data project is to make it easier to build Spring-powered applications that use new data access technologies such as non-relational databases, map-reduce frameworks, and cloud based data services.

The Graph Neo4j module provides integration with the Neo4j graph database. Back in 2010, Rod Johnson and Emil Eifrem started brainstorming about Spring and Neo4j integration including transparent persistence and cross-store support. After an initial prototype it has been further developed in close cooperation between the VMware and Neo Technology development teams.

To learn more about the project, visit the Spring Data Graph Project Homepage.

Feedback requested!

July 25, 2011

From Technologist to Philosopher

Filed under: TMRM,Topic Maps — Patrick Durusau @ 6:44 pm

From Technologist to Philosopher: Why you should quit your technology job and get a Ph.D. in the humanities by Damon Horowitz.

Horowitz created a startup that was acquired by Google. That is some measure of success.

If you want to create an exceptional company, hire humanists.

Read the essay to find out why.

Subject Recognition Measure?

Filed under: Similarity — Patrick Durusau @ 6:43 pm

I ran across the following passage this weekend:

Speed of Processing: Reaction Time. The speed with which subjects can judge statements about category membership is one of the most widely used measures of processing in semantic memory research within the human information-processing framework. Subjects typically are required to respond true or false to statements of the form: X item is a member of Y category, where the dependent variable of interest is reaction time. In such tasks, for natural language categories, responses of true are invariably faster for the items that have been rated more prototypical.

Principles of Categorization by Eleanor Rosch, in Cognition and Categorization, edited by Eleanor Rosch and Barbara Lloyd, Lawrence Erlbaum Associates, Publishers, Hillsdale, New Jersey, 1978.

This could be part of a topic map authoring UI that asks users to recognize and place subjects into categories. The faster a user responds, the greater the confidence in their answer.
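A minimal sketch of how such a reaction-time signal might feed a confidence score in an authoring UI. The time bounds and the linear mapping are my assumptions, not anything from Rosch's work:

```python
# Hypothetical: map a categorization reaction time to a confidence weight.
# Responses at or below `fastest` seconds get full confidence (1.0);
# responses at or above `slowest` get none (0.0); linear in between.

def confidence_from_reaction_time(rt_seconds, fastest=0.5, slowest=5.0):
    rt = min(max(rt_seconds, fastest), slowest)  # clamp to the valid range
    return (slowest - rt) / (slowest - fastest)
```

A real system would want to calibrate the bounds per user, since baseline reaction times vary widely.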

I borrowed the book where the essay appears to read Amos Tversky’s challenge to the geometric approach to similarity. More on that later this week.

Fully De-Amortized Cuckoo Hashing for Cache-Oblivious Dictionaries and Multimaps

Filed under: Key-Value Stores,Multimaps — Patrick Durusau @ 6:42 pm

Fully De-Amortized Cuckoo Hashing for Cache-Oblivious Dictionaries and Multimaps by Michael T. Goodrich, Daniel S. Hirschberg, Michael Mitzenmacher, and Justin Thaler.

Abstract:

A dictionary (or map) is a key-value store that requires all keys be unique, and a multimap is a key-value store that allows for multiple values to be associated with the same key. We design hashing-based indexing schemes for dictionaries and multimaps that achieve worst-case optimal performance for lookups and updates, with a small or negligible probability the data structure will require a rehash operation, depending on whether we are working in the external-memory (I/O) model or one of the well-known versions of the Random Access Machine (RAM) model. One of the main features of our constructions is that they are fully de-amortized, meaning that their performance bounds hold without one having to tune their constructions with certain performance parameters, such as the constant factors in the exponents of failure probabilities or, in the case of the external-memory model, the size of blocks or cache lines and the size of internal memory (i.e., our external-memory algorithms are cache oblivious). Our solutions are based on a fully de-amortized implementation of cuckoo hashing, which may be of independent interest. This hashing scheme uses two cuckoo hash tables, one “nested” inside the other, with one serving as a primary structure and the other serving as an auxiliary supporting queue/stash structure that is super-sized with respect to traditional auxiliary structures but nevertheless adds negligible storage to our scheme. This auxiliary structure allows the success probability for cuckoo hashing to be very high, which is useful in cryptographic or data-intensive applications.
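For readers new to the underlying technique, here is a minimal sketch of basic cuckoo hashing with two tables. It omits the de-amortization and the auxiliary queue/stash structure that are the paper's actual contribution; the table size, hash functions, and give-up policy are simplifications for illustration.

```python
# Toy cuckoo hashing: each key has one candidate slot in each of two tables.
# Inserting into an occupied slot evicts the occupant, which is then pushed
# into its alternate table, and so on, up to a bounded number of kicks.

class CuckooHash:
    def __init__(self, size=11):
        self.size = size
        self.t1 = [None] * size  # primary table
        self.t2 = [None] * size  # secondary table

    def _h1(self, key):
        return hash(key) % self.size

    def _h2(self, key):
        return (hash(key) // self.size) % self.size

    def get(self, key):
        for table, h in ((self.t1, self._h1), (self.t2, self._h2)):
            slot = table[h(key)]
            if slot is not None and slot[0] == key:
                return slot[1]
        return None

    def put(self, key, value, max_kicks=32):
        # update in place if the key is already stored
        for table, h in ((self.t1, self._h1), (self.t2, self._h2)):
            idx = h(key)
            if table[idx] is not None and table[idx][0] == key:
                table[idx] = (key, value)
                return
        # otherwise displace occupants until a slot frees up
        item = (key, value)
        for _ in range(max_kicks):
            for table, h in ((self.t1, self._h1), (self.t2, self._h2)):
                idx = h(item[0])
                table[idx], item = item, table[idx]
                if item is None:
                    return
        raise RuntimeError("displacement cycle; a real implementation rehashes")
```

Lookups touch at most two slots, which is the worst-case guarantee the paper extends to updates as well.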

Would you say that topic maps qualify as “data-intensive applications?”

Or does that depend upon the topic map?

Stratified B-Tree and Versioned Dictionaries

Filed under: B-trees,Data Structures — Patrick Durusau @ 6:41 pm

Stratified B-Tree and Versioned Dictionaries by Andy Twigg (Acunu). (video)

Abstract:

A classic versioned data structure in storage and computer science is the copy-on-write (CoW) B-tree — it underlies many of today’s file systems and databases, including WAFL, ZFS, Btrfs and more. Unfortunately, it doesn’t inherit the B-tree’s optimality properties; it has poor space utilization, cannot offer fast updates, and relies on random IO to scale. Yet, nothing better has been developed since. We describe the `stratified B-tree’, which beats all known semi-external memory versioned B-trees, including the CoW B-tree. In particular, it is the first versioned dictionary to achieve optimal tradeoffs between space, query and update performance.

I haven’t had time to watch the video but you can find some other resources on stratified B-Trees at Andy’s post All about stratified B-trees.

Performance of Graph vs. Relational Databases

Filed under: Database,Graphs,SQL — Patrick Durusau @ 6:41 pm

Performance of Graph vs. Relational Databases by Josh Adell.

Short but interesting exploration of performance differences between relational and graph databases.

Visualizing NetworkX graphs in the browser using D3

Filed under: Graphs,Networks,Visualization — Patrick Durusau @ 6:40 pm

Visualizing NetworkX graphs in the browser using D3 by Drew Conway.

From the post:

During one of our impromptu sprints at SciPy 2011, the NetworkX team decided it would be nice to add the ability to export networks for visualization with the D3 JavaScript library. This would allow people to post their visualizations online very easily. Mike Bostock, the creator and maintainer of D3, also has a wonderful example of how to render a network using a force-directed layout in the D3 examples gallery.

So, we decided to insert a large portion of Mike’s code into the development version of NetworkX in order to allow people to quickly export networks to JSON and visualize them in the browser. Unfortunately, I have not had the chance to write any tests for this code, so it is only available in my fork of the main NetworkX repository on Github. But, if you clone this repository and install it you will have the new features (along with an additional example file for building networks for web APIs in NX).

You have to see the visualization in the post to get the full impact. You won’t be disappointed!

Interesting Neural Network Papers at ICML 2011

Filed under: Machine Learning,Neural Networks — Patrick Durusau @ 6:39 pm

Interesting Neural Network Papers at ICML 2011 by Richard Socher.

Brief comments on eight (8) papers and the ICML 2011 conference.

Highly recommended, particularly if you are interested in neural networks and/or machine learning in connection with your topic maps.

The conference website: The 28th International Conference on Machine Learning, has pointers to the complete proceedings as well as videos of all Session A talks.

Kudos to the conference and its organizers for making materials from the conference available!

From Big Data to New Insights

Filed under: Daytona,NSF — Patrick Durusau @ 6:38 pm

From Big Data to New Insights

From the Office of Science and Technology Policy:

Today [18 July 2011], Microsoft is announcing the availability of a new tool called Daytona that will make it easier for researchers to harness the power of “cloud computing” to discover insights in huge quantities of data.

Daytona, which will be freely available to the research community, builds on an existing cloud computing collaboration between the National Science Foundation and Microsoft. In April, NSF announced that it was funding 13 teams to take advantage of Microsoft’s offer to provide free access to its Windows Azure cloud. Among other things, these projects will improve our understanding of large watersheds such as the Savannah River Basin, enable more and better use of renewable energy through improved weather forecasting, predict the interactions between proteins, and make cloud computing more secure, reliable, and accessible over mobile devices.

The new partnership, along with NSF collaborations with other leading IT companies, will help researchers access the computing power and storage capacity they need to tackle the big questions in their field. That’s important because researchers in a growing number of fields are generating extremely large data sets, commonly referred to as “Big Data.” For example, the size of DNA sequencing databases is increasing by a factor of 10 every 18 months! Researchers need better tools to help them store, index, search, visualize, and analyze these data, allowing them to discover new patterns and connections.

So far as I know, the issues of heterogeneous data remain largely unexplored in connection with Big Data. Since heterogeneous data has proven problematic with “Small Data,” I have no doubt it will prove equally, if not more, difficult with Big Data.

This is one of the offices to contact in the United States on such issues. Other US offices?

Similar offices in other countries?

Whiz-Kid on Hadoop

Filed under: Daytona,Excel Datascope,Hadoop — Patrick Durusau @ 6:37 pm

Cloudera Whiz-Kid Lipcon Talks Hadoop, Big Data with SiliconANGLE’s Furrier

From the post:

Hadoop, the Big Data processing and analytics framework, isn’t your average open source project.

“If you look at a lot of the open source software that’s been popular out of Apache and elsewhere, its sort of like an open source replacement for something you can already get elsewhere,” said Todd Lipcon, a senior software engineer at Cloudera. “I think Hadoop is kind of unique in that it’s the only option for doing this kind of analysis.”

Lipcon is right. Open Office is an open source office suite alternative to Microsoft Office. MySQL is an open source database alternative to Oracle. Hadoop is an open source Big Data framework alternative for …. Well, there is no alternative.

Now that Daytona has been released by MS along with Excel DataScope, it would be interesting to know how Todd Lipcon sees the ease-of-use issue.

Powerful technology (LaTeX anyone?) may far exceed the capabilities of (insert your favorite word processor) but if the difficulty of use factor is too high, poorer alternatives will occupy most of the field.

That may give people with the more powerful technology a righteous feeling, but I am not interested in feeling righteous.

I am interested in winning, which means having a powerful technology that can be used by a wide variety of users of varying skill levels.

Some will use it poorly, barely invoking its capabilities. Others will make good but unimaginative use of it. Still others will push the envelope in terms of what it can do. All are legitimate and all are valuable in their own way.

July 24, 2011

User Generated Content and the Social Graph
(thoughts on merging)

Filed under: FlockDB,Gizzard,Merging,Social Graphs — Patrick Durusau @ 6:48 pm

User Generated Content and the Social Graph by Chris Chandler.

Uses Twitter as a case study. Covers Gizzard and FlockDB, both of which were written in Scala.

Wants to coin the term “MoSQL! (More than SQL).”

A couple of points of interest to topic mappers.

Relationships maintained as forward and backward edges. That is:

“A follows B” and

“B is followed by A”

Twitter design decision: Never delete edges!

I am curious whether any of the current topic map implementations follow that strategy with regard to merging. The idea: the act of creating a new item either references other existing items (in which case, create new versions of those items) or it is an entirely new item.

In either case, a subsequent call returns a set of matching items and, if there is more than one, takes the most recent by timestamp.
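The append-only scheme just described can be sketched as follows; the `AppendOnlyStore` class and its API are hypothetical, intended only to show writes never deleting and reads resolving to the latest timestamped version:

```python
# Hypothetical sketch of the "never delete" strategy: every change appends
# a new version, and reads take the most recent version by timestamp.

import itertools

class AppendOnlyStore:
    def __init__(self):
        self.items = []                  # (timestamp, item_id, data), never deleted
        self._clock = itertools.count()  # stand-in for a real timestamp source

    def write(self, item_id, data):
        self.items.append((next(self._clock), item_id, data))

    def read(self, item_id):
        """Return the most recent version of item_id, or None."""
        matches = [rec for rec in self.items if rec[1] == item_id]
        return max(matches)[2] if matches else None
```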

As Chris says, disk space is cheap.

Works for Twitter.

Data Triage with SAS

Filed under: Data — Patrick Durusau @ 6:47 pm

Data Triage with SAS

Deeply amusing and useful post on basics of looking at data to spot obvious issues.

It really doesn’t matter how clever your analysis may be if your data is incorrect or more likely, your assumptions about the data are incorrect.

Take heed.

A practical introduction to MochiWeb

Filed under: Erlang,Web Applications — Patrick Durusau @ 6:47 pm

A practical introduction to MochiWeb

From the post:

Bob Ippolito, creator of MochiWeb, describes it as “an Erlang library for building lightweight HTTP servers”. It’s not a framework: it doesn’t come with URL dispatch, templating or data persistence. Despite not having an official website or narrative documentation, MochiWeb is a popular choice to build web services in Erlang. The purpose of this article is to help you to get started by gradually building a microframework featuring templates and URL dispatch. Persistence will not be covered.

Just in case you are interested in building web services in Erlang for your topic map application.

MongoDB and the Democratic Party

Filed under: MongoDB,NoSQL — Patrick Durusau @ 6:46 pm

MongoDB and the Democratic Party – A Case Study by Pramod Sadalage.

Interesting case study for an application that managed contacts of the Democratic Party (US) for fund raising and voter turnout efforts on election day.

Talks about elimination of duplicate records but given the breadth of the talk, the speaker doesn’t go into any detail.

Pay particular attention to the data structure that is created for this project.

Note that any organization can have a different ID for any particular person. That is, a local organization can query by its identifier and its ID for a person, and it gets back the information on that person. (I assume the IDs used by other organizations are filtered out of the return.)

Granted, it isn’t aggregation of unbounded information for any particular voter from an unknown number of sources, but it is a low-cost solution to the problem of maintaining a national ID (for this data set) while providing access via local IDs. That “pattern” could prove useful in other cases.
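A toy sketch of that lookup pattern. The organization names, IDs, and record fields are invented, and the filtering of other organizations' IDs out of the result is my assumption from the talk:

```python
# Hypothetical: one shared record per person, keyed by many org-local IDs.
# A caller supplies its own org identifier and local ID, and gets back the
# record with only its own ID visible.

people = [
    {"name": "Jane Voter", "org_ids": {"org_a": "A-17", "org_b": "B-903"}},
]

def lookup(org, local_id):
    for person in people:
        if person["org_ids"].get(org) == local_id:
            return {"name": person["name"], "id": person["org_ids"][org]}
    return None
```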

Real World CouchDB

Filed under: CouchDB,NoSQL,Web Applications — Patrick Durusau @ 6:46 pm

Real World CouchDB by John Wood.

Very good overview of CouchDB, including its limitations.

Two parts really caught my attention:

First, the “crash only” design. CouchDB doesn’t shut down, its process is killed. There’s a data integrity test!

Second, the “scale down architecture.” Can run CouchDB plus data on a mobile device. Synches up when connectivity is restored but otherwise, application based on CouchDB can continue working. CouchDB supports delivery of HTML and Javascript so supports basic web apps.

CouchDB looks like a good candidate for delivery of topic map content.


I wanted to include a link to a CouchDB app for the Afghan War Diaries but the site isn’t responding. You can see the source code for the app at: https://github.com/benoitc/afgwardiary.

KNIME Version 2.4.0 released

Filed under: Data Analysis,Data Integration,Data Mining — Patrick Durusau @ 6:45 pm

KNIME Version 2.4.0 released

From the release notice:

We have just released KNIME v2.4, a feature release with a lot of new functionality and some bug fixes. The highlights of this release are:

  • Enhancements around meta node handling (collapse/expand & custom dialogs)
  • Usability improvements (e.g. auto-layout, fast node insertion by double-click)
  • Polished loop execution (e.g. parallel loop execution available from labs)
  • Better PMML processing (added PMML preprocessing, which will also be presented at this year's KDD conference)
  • Many new nodes, including a whole suite of XML processing nodes, cross-tabulation and nodes for data preprocessing and data mining, including ensemble learning methods.

In case you aren’t familiar with KNIME, it is self-described as:

KNIME (Konstanz Information Miner) is a user-friendly and comprehensive open-source data integration, processing, analysis, and exploration platform. From day one, KNIME has been developed using rigorous software engineering practices and is currently being used actively by over 6,000 professionals all over the world, in both industry and academia.

What would you do the same/differently for a topic map interface?

Neo4j, the open source Java graph database, and Windows Azure

Filed under: Java,Neo4j,Windows Azure — Patrick Durusau @ 6:45 pm

Neo4j, the open source Java graph database, and Windows Azure by Josh Sandhu.

From the post:

Recently I was travelling in Europe. I always find it a pleasure to see a mixture of varied things nicely co-mingling together. Old and new, design and technology, function and form all blend so well together and there is no better place to see this than in Malmö Sweden at the offices of Diversify Inc., situated in a building built in the 1500s with a new savvy workstyle. This also echoed at the office of Neo Technology in a slick and fancy incubator, Minc, situated next to the famous Turning Torso building and Malmö University in the new modern development of the city.

My new good friends, Diversify’s Magnus Mårtensson, Micael Carlstedt, Björn Ekengren, Martin Stenlund and Neo Technology’s Peter Neubauer hosted my colleague Anders Wendt from Microsoft Sweden, and me. The topic of this meeting was about Neo Technology’s Neo4j, open source graph database, and Windows Azure. Neo4j is written in Java, but also has a RESTful API and supports multiple languages. The database works as an object-oriented, flexible network structure rather than as strict and static tables. Neo4j is also based on graph theory and it has the ability to digest and work with lots of data and scale is well suited to the cloud. Diversify has been doing some great work getting Java to work with Windows Azure and has given us on the Interoperability team a lot of great feedback on the tools Microsoft is building for Java. They have also been working with some live customers and have released a new case study published in Swedish and an English version made available by Diversify on their blog.

The most interesting part of the interviews was the statement that getting a Java application to run in Azure wasn’t hard. Getting a Java application to run well in Azure was another matter.

That was the disappointing aspect of this post as well. So other steps are required to get Neo4j to run well on Azure. How about something more than the general statement? Something that developers could use to judge the difficulty in considering a move to Azure?

Supplemental materials on getting Neo4j to run well on Azure would take this from a “we are all excited” piece, despite there being some disclosed set of issues, to being a substantive contribution towards overcoming interoperability issues to everyone’s benefit.

July 23, 2011

Information Propagation in Twitter’s Network

Filed under: Networks,Similarity,Social Networks — Patrick Durusau @ 3:12 pm

Information Propagation in Twitter’s Network

From the post:

It’s well-known that Twitter’s most powerful use is as a tool for real-time journalism. Trying to understand its social connections and outstanding capacity to propagate information, we have developed a mathematical model to identify the evolution of a single tweet.

The way a tweet is spread through the network is closely related with Twitter’s retweet functionality, but retweet information is fairly incomplete due to the fight for earning credit/users by means of being the original source/author. We have taken into consideration this behavior and our approach uses text similarity measures as complement of retweet information. In addition, #hashtags and urls are included in the process since they have an important role in Twitter’s information propagation.

Once we designed (and implemented) our mathematical model, we tested it with some Twitter topics we had tracked using a visualization tool (Life of a Tweet). Our conclusions after the experiments were:

  1. Twitter’s real propagation is based on information (tweets’ content) and not on Twitter’s structure (retweet).
  2. Because we can detect Twitter’s real propagation, we can retrieve Twitter’s real networks.
  3. Text similarity scores allow us to select how fuzzy the tweets’ connections are and, by extension, the network’s connections. This means that we can set a minimum threshold to determine when two tweets contain the same concept.
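The thresholding idea in point 3 can be sketched with a simple word-overlap (Jaccard) measure; the actual similarity measure the authors use is not specified in the quoted summary, so this is a stand-in:

```python
# Stand-in similarity: Jaccard overlap of the word sets of two tweets.
# Two tweets are connected ("same concept") when the score clears a threshold.

def jaccard(tweet_a, tweet_b):
    a, b = set(tweet_a.lower().split()), set(tweet_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def same_concept(tweet_a, tweet_b, threshold=0.5):
    return jaccard(tweet_a, tweet_b) >= threshold
```

Raising the threshold tightens the network's connections; lowering it admits fuzzier matches, which is exactly the dial the authors describe.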

Interesting. Useful for anyone who wants to grab “real” connections and networks to create topics for merging further information about the same.

You may want to also look at: Meme Diffusion Through Mass Social Media which is about a $900K NSF project on tracking memes through social media.

Admittedly an important area of research but the results I would view with a great deal of caution. Here’s why:

  1. Memes travel through news outlets, print, radio, TV, websites
  2. Memes travel through social outlets, such as churches, synagogues, mosques, social clubs
  3. Memes travel through business relationships and work places
  4. Memes travel through family gatherings and relationships
  5. Memes travel over cell phone conversations as well as tweets

That some social media is easier to obtain and process than others doesn’t make it a reliable basis for decision making.

Scala Style Guide

Filed under: Scala — Patrick Durusau @ 3:11 pm

Scala Style Guide

From the webpage:

In lieu of an official style guide from EPFL, or even an unofficial guide from a community site like Artima, this document is intended to outline some basic Scala stylistic guidelines which should be followed with more or less fervency. Wherever possible, this guide attempts to detail why a particular style is encouraged and how it relates to other alternatives. As with all style guides, treat this document as a list of rules to be broken. There are certainly times when alternative styles should be preferred over the ones given here.

Question: Is it a sign of maturity for a programming language to start having religious wars over styles?

Just curious. Thought this might mark a milestone in the development of Scala.

Introduction to Oozie

Filed under: Hadoop,MapReduce,Oozie,Pig — Patrick Durusau @ 3:10 pm

Introduction to Oozie

From the post:

Tasks performed in Hadoop sometimes require multiple Map/Reduce jobs to be chained together to complete its goal. [1] Within the Hadoop ecosystem, there is a relatively new component Oozie [2], which allows one to combine multiple Map/Reduce jobs into a logical unit of work, accomplishing the larger task. In this article we will introduce Oozie and some of the ways it can be used.

What is Oozie ?

Oozie is a Java Web-Application that runs in a Java servlet-container – Tomcat and uses a database to store:

  • Workflow definitions
  • Currently running workflow instances, including instance states and variables

Oozie workflow is a collection of actions (i.e. Hadoop Map/Reduce jobs, Pig jobs) arranged in a control dependency DAG (Directed Acyclic Graph), specifying a sequence of actions execution. This graph is specified in hPDL (an XML Process Definition Language).

Workflow management for Hadoop!
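The control-dependency idea can be sketched as a topological ordering over actions; the action names below are invented, and a real Oozie workflow would of course be declared in hPDL (XML) rather than computed like this:

```python
# Hypothetical sketch of a control dependency DAG: an action runs only after
# every action it depends on has completed.

def execution_order(dag):
    """dag maps action -> set of actions it depends on; returns a valid order."""
    order, done = [], set()
    while len(order) < len(dag):
        ready = [a for a, deps in dag.items() if a not in done and deps <= done]
        if not ready:
            raise ValueError("cycle detected: not a DAG")
        for action in sorted(ready):  # sorted only to keep output deterministic
            order.append(action)
            done.add(action)
    return order

workflow = {
    "ingest": set(),
    "mapreduce-clean": {"ingest"},
    "pig-aggregate": {"mapreduce-clean"},
    "export": {"pig-aggregate"},
}
```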

50 Billion Things on the Internet by 2020

Filed under: Marketing — Patrick Durusau @ 3:09 pm

Cisco: 50 Billion Things on the Internet by 2020 [Infographic]

Read the post for estimates up to 1 trillion by 2013 or 2015, depending upon whose speculation you are being paid to agree with.

It is an amusing infographic but more for what it doesn’t say than for what it does.

A fairly flat place that only talks about devices, cows, sensors and the like. Which is good, but surely only part of the story.

What about all the relationships between those devices? How do we identify/address them? Or their states over some time sequence?

The Cisco “Planetary Skin” sounds like an exciting project but even more so if those sensors and their data are correlated with other information.

To be sure we are in for “a really big show,” whatever number you happen to prefer.

Lucene.net is back on track

Filed under: Lucene,Search Algorithms,Search Engines — Patrick Durusau @ 3:08 pm

Lucene.net is back on track by Simone Chiaretta

From the post:

More than 6 months ago I blogged about Lucene.net starting its path toward extinction. Soon after that, due to the “stubbornness” of the main committer, a few forks appeared, the biggest of which was Lucere.net by Troy Howard.

At the end of the year, despite the promises of the main committer of complying with the request of the Apache board by himself, nothing happened and Lucene.net came really close to being shut down. But luckily, the same Troy Howard that forked Lucene.net a few months before decided, together with a bunch of other volunteers, to resubmit the documents required by the Apache Board for starting a new project into the Apache Incubator; by the beginning of February the new proposal was voted for by the Board and the project re-entered the incubator.

If you are interested in search engines and have .Net skills (or want to acquire them), this would be a good place to start.

The Pathology of Graph Databases

Filed under: Graphs,Gremlin — Patrick Durusau @ 3:07 pm

The Pathology of Graph Databases by Marko A. Rodriguez.

If you want to learn Gremlin as a graph traversal language you would be hard pressed to find a better starting place.

The Beauty of Simplicity: Mastering Database Design Using Redis

Filed under: NoSQL,Redis — Patrick Durusau @ 3:07 pm

The Beauty of Simplicity: Mastering Database Design Using Redis by Ryan Briones.

Not so much teaching database design as illustrating how Redis forces you to think about the structure of the data you are storing.

Covers some Redis commands, other can be found at http://redis.io, along with the Redis distribution.
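To give a flavor of what "thinking in Redis structures" means, here is a toy sketch that mimics two real Redis commands (HSET and SADD) with a plain dict; the key-naming convention shown is a common community pattern, not something prescribed by Redis itself:

```python
# With only keys and a handful of data types, the "schema" lives entirely in
# key-naming conventions: one hash per entity, one set per relationship.

store = {}

def hset(key, field, value):   # like Redis HSET: set a field in a hash
    store.setdefault(key, {})[field] = value

def sadd(key, member):         # like Redis SADD: add a member to a set
    store.setdefault(key, set()).add(member)

hset("user:1", "name", "alice")        # entity data under user:<id>
sadd("user:1:followers", "user:2")     # relationship under user:<id>:followers
```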

Apache Hadoop to get more user-friendly

Filed under: Hadoop,MapReduce — Patrick Durusau @ 3:07 pm

Apache Hadoop to get more user-friendly

From Paul Krill at InfoWorld:

Relief is on the way for users of the open source Apache Hadoop distributed computing platform who have wrestled with the complexity of the technology.

A planned upgrade to the Hadoop distributed computing platform, which has become popular for analyzing large volumes of data, is intended to make the platform more user-friendly, said Eric Baldeschwieler, CEO of HortonWorks, which was unveiled as a Yahoo spinoff last month with the intent of building a support and training business around Hadoop. The upgrade also will feature improvements for high availability, installation, and data management. Due in beta releases later this year with a general availability release eyed for the second quarter of 2012, the release is probably going to be called Hadoop 0.23.

I don’t remember seeing any announcements that a product would become “less user-friendly.” You? 😉

Still, good news because it means not only will Hadoop become easier to use, so will its competitors.

July 22, 2011

You Too Can Use Hadoop Inefficiently!!!

Filed under: Algorithms,Graphs,Hadoop,RDF,SPARQL — Patrick Durusau @ 6:15 pm

The headline Hadoop’s tremendous inefficiency on graph data management (and how to avoid it) certainly got my attention.

But when you read the paper, Scalable SPARQL Querying of Large RDF Graphs, it isn’t Hadoop’s “tremendous inefficiency,” but actually that of SHARD, an RDF triple store that uses flat text files for storage.

Or as the authors say in their paper (6.3 Performance Comparison):

Figure 6 shows the execution time for LUBM in the four benchmarked systems. Except for query 6, all queries take more time on SHARD than on the single-machine deployment of RDF-3X. This is because SHARD’s use of hash partitioning only allows it to optimize subject-subject joins. Every other type of join requires a complete redistribution of data over the network within a Hadoop job, which is extremely expensive. Furthermore, its storage layer is not at all optimized for RDF data (it stores data in flat files).
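The subject-partitioning point the authors make can be sketched as follows: triples sharing a subject hash to the same partition, so a subject-subject join never crosses partitions. The triples and helper functions below are invented for illustration:

```python
# Hypothetical sketch: hash-partition RDF triples by subject, then join on a
# shared subject entirely within each partition (no network redistribution).

def partition_by_subject(triples, n_partitions):
    parts = [[] for _ in range(n_partitions)]
    for s, p, o in triples:
        parts[hash(s) % n_partitions].append((s, p, o))
    return parts

def local_subject_join(partition, pred_a, pred_b):
    """Within one partition, join triples that share a subject."""
    left = {s: o for s, p, o in partition if p == pred_a}
    return [(s, left[s], o) for s, p, o in partition if p == pred_b and s in left]
```

Any join that pairs, say, the object of one triple with the subject of another breaks this locality, which is the "extremely expensive" case the paper describes.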

Saying that SHARD (not as well known as Hadoop), was using Hadoop inefficiently, would not have the “draw” of allegations about Hadoop’s failure to process graph data efficiently.

Sure, I write blog lines for “draw” but let’s ‘fess up in the body of the blog article. Readers shouldn’t have to run down other sources to find the real facts.

Implementing Electronic Lab Notebooks

Filed under: ELN Integration — Patrick Durusau @ 6:12 pm

Implementing Electronic Lab Notebooks

Implementing Electronic Lab Notebooks: Building the foundation

Bennett Lass is doing a series on electronic lab notebooks and I will be gathering them here.

There are two questions I have in mind:

  1. What happens when the description of the data being recorded in the ELN changes? How is old/new data captured for post-change searches?
  2. Not realistic, I know, but what happens when a researcher changes labs and consequently ELN solutions?

Update:

Implementing Electronic Lab Notebooks: Documenting Experiments (Part 3)
Implementing Electronic Lab Notebooks: Enabling Collaboration (Part 4)
Implementing Electronic Lab Notebooks: System Integration (Part 5)
Implementing Electronic Lab Notebooks: Research Management (Part 6)
