Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 22, 2013

Creating a Simple Bloom Filter

Filed under: Bloom Filters,Searching — Patrick Durusau @ 4:10 pm

Creating a Simple Bloom Filter by Max Burstein.

From the post:

Bloom filters are super efficient data structures that allow us to tell if an object is most likely in a data set or not by checking a few bits. Bloom filters return some false positives but no false negatives. Luckily we can control the amount of false positives we receive with a trade off of time and memory.

You may have never heard of a bloom filter before but you’ve probably interacted with one at some point. For instance if you use Chrome, Chrome has a bloom filter of malicious URLs. When you visit a website it checks if that domain is in the filter. This prevents you from having to ping Google’s servers every time you visit a website to check if it’s malicious or not. Large databases such as Cassandra and Hadoop use bloom filters to see if they should do a large query or not.
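Here is a minimal sketch of the idea in Python. It is my own illustration, not Burstein’s code: the bit-array size and the salted SHA-1 hashes are arbitrary choices, and a real filter would size both from the expected number of items and the target false-positive rate.

import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions set/check bits in a fixed-size array."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k bit positions from salted SHA-1 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # False means definitely absent; True means probably present.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("http://example.com/malicious")
print("http://example.com/malicious" in bf)  # True (probably present)
print("http://example.com/safe" in bf)       # False (definitely absent), barring a false positive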

I think you will appreciate the lookup performance difference. 😉

I first saw this at: Alex Popescu: Creating a Simple Bloom Filter in Python.

Everything You Wanted to Know About Machine Learning…

Filed under: Machine Learning — Patrick Durusau @ 3:51 pm

Everything You Wanted to Know About Machine Learning, But Were Too Afraid To Ask (Part One)

Everything You Wanted to Know About Machine Learning, But Were Too Afraid To Ask (Part Two)

by Charles Parker.

From Part One:

Recently, Professor Pedro Domingos, one of the top machine learning researchers in the world, wrote a great article in the Communications of the ACM entitled “A Few Useful Things to Know about Machine Learning”. In it, he not only summarizes the general ideas in machine learning in fairly accessible terms, but he also manages to impart most of the things we’ve come to regard as common sense or folk wisdom in the field.

It’s a great article because it’s a brilliant man with deep experience who is an excellent teacher writing for “the rest of us”, and writing about things we need to know. And he manages to cover a huge amount of ground in nine pages.

Now, while it’s very light reading for the academic literature, it’s fairly dense by other comparisons. Since so much of it is relevant to anyone trying to use BigML, I’m going to try to give our readers the Cliff’s Notes version right here in our blog, with maybe a few more examples and a little less academic terminology. Often I’ll be rephrasing Domingos, and I’ll indicate it where I’m quoting directly.

Perhaps not “everything” but certainly enough to spark an interest in knowing more!

My takeaway is: understanding machine learning, like understanding data, is critical to success with machine learning.

Not surprising but does get overlooked.

…O’Reilly Book on NLP with Java?

Filed under: Java,Natural Language Processing — Patrick Durusau @ 2:25 pm

Anyone Want to Write an O’Reilly Book on NLP with Java? by Bob Carpenter.

From the post:

Mitzi and I pitched O’Reilly books a revision of the Text Processing in Java book that she’s been finishing off.

The response from their editor was that they’d love to have an NLP book based on Java, but what we provided looked like everything-but-the-NLP you’d need for such a book. Insightful, these editors. That’s exactly how the book came about, when the non-proprietary content was stripped out of the LingPipe Book.

I happen to still think that part of the book is incredibly useful. It covers all of Unicode, ICU for normalization and detection, all of the streaming I/O interfaces, codings in HTML, XML and JSON, as well as in-depth coverage of reg-exes, Lucene, and Solr. All of the stuff that is continually misunderstood and misconfigured so that I have to spend way too much of my time sorting it out. (Mitzi finished the HTML, XML and JSON chapter, and is working on Solr; she tuned Solr extensively on her last consulting gig, by the way, if anyone’s looking for a Lucene/Solr developer).

Read Bob’s post and give him a shout if you are interested.

Would be a good exercise in learning how choices influence the “objective” outcomes.

…Obtaining Original Data from Published Graphs and Plots

Filed under: Data,Data Mining,Graphs — Patrick Durusau @ 2:13 pm

A Simple Method for Obtaining Original Data from Published Graphs and Plots

From the post:

Was thinking of how to extract data points for infant age and weight distribution from a printed graph and I landed at this old paper: http://www.ajronline.org/content/174/5/1241.full. It pointed me to NIH Image, which reminds me of old software I used to use for lab practicals as an undergrad, and upon reaching the NIH Image site, indeed! ImageJ is an ‘update’ of sorts to the NIH Image software.

The “old paper”? “A Simple Method for Obtaining Original Data from Published Graphs and Plots,” by Chris L. Sistrom and Patricia J. Mergo, American Journal of Roentgenology, May 2000, vol. 174, no. 5, pp. 1241-1244.

An update to the URI in the article: http://rsb.info.nih.gov/nih-image/ is the correct address. (The URI as printed is missing a hyphen, “-”.)
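Whatever tool you use, the arithmetic underneath is a linear mapping from pixel coordinates to axis values, calibrated from two known points on each axis. A quick sketch, my own illustration with made-up calibration points rather than ImageJ code:

def make_axis_mapper(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel coordinate to an axis value,
    given two calibration points read off the published graph."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale

# Hypothetical calibration: x-axis pixels 100 and 500 correspond to 0 and 24 months.
to_months = make_axis_mapper(100, 0.0, 500, 24.0)
# Hypothetical calibration: y-axis pixels 400 and 50 correspond to 2 kg and 16 kg.
to_kg = make_axis_mapper(400, 2.0, 50, 16.0)

print(to_months(300), to_kg(225))  # 12.0 months, 9.0 kg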

The mailing list archives don’t show much traffic for the last several years.

When you need to harvest data from published graphs/plots, what do you use?

typeahead.js [Autocompletion Library]

Filed under: Interface Research/Design,Javascript,JQuery,Search Interface,Searching — Patrick Durusau @ 1:58 pm

typeahead.js

From the webpage:

Inspired by twitter.com‘s autocomplete search functionality, typeahead.js is a fast and fully-featured autocomplete library.

Features

  • Displays suggestions to end-users as they type
  • Shows top suggestion as a hint (i.e. background text)
  • Works with hardcoded data as well as remote data
  • Rate-limits network requests to lighten the load
  • Allows for suggestions to be drawn from multiple datasets
  • Supports customized templates for suggestions
  • Plays nice with RTL languages and input method editors

Why not use X?

At the time Twitter was looking to implement a typeahead, there wasn’t a solution that allowed for prefetching data, searching that data on the client, and then falling back to the server. It’s optimized for quickly indexing and searching large datasets on the client. That allows for sites without datacenters on every continent to provide a consistent level of performance for all their users. It plays nicely with Right-To-Left (RTL) languages and Input Method Editors (IMEs). We also needed something instrumented for comprehensive analytics in order to optimize relevance through A/B testing. Although logging and analytics are not currently included, it’s something we may add in the future.
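The prefetch-then-search-locally pattern is easy to sketch. The fragment below is my own illustration, written in Python rather than JavaScript; the dataset and the remote fallback are stand-ins, not typeahead.js internals.

# Hypothetical sketch: prefetch a dataset once, answer prefix queries locally,
# and fall back to a remote source only when local matches run out.
PREFETCHED = ["new york", "new orleans", "newcastle", "nairobi", "naples"]

def local_suggestions(query, limit=5):
    q = query.lower()
    return [s for s in PREFETCHED if s.startswith(q)][:limit]

def suggest(query, limit=5, remote_lookup=None):
    hits = local_suggestions(query, limit)
    if len(hits) < limit and remote_lookup is not None:
        # Ask the server only for what the prefetched data missed.
        hits += [s for s in remote_lookup(query) if s not in hits]
    return hits[:limit]

print(suggest("new"))  # ['new york', 'new orleans', 'newcastle']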

A bit on the practical side for me, ;-), but I can think of several ways that autocompletion could be useful with a topic map interface.

Not just the traditional completion of a search term or phrase but offering possible roles for subjects already in a map and other uses.

If experience with XML and OpenOffice is any guide, the easier authoring becomes (assuming the authoring outcome is useful), the greater the adoption of topic maps.

It really is that simple.

I first saw this at: typeahead.js : Fully-featured jQuery Autocomplete Library.

Hadoop Adds Red Hat [More Hadoop Silos Coming]

Filed under: Hadoop,MapReduce,Red Hat,Semantic Diversity,Semantic Inconsistency — Patrick Durusau @ 1:27 pm

Red Hat Unveils Big Data and Open Hybrid Cloud Direction

From the post:

Red Hat, Inc. (NYSE: RHT), the world’s leading provider of open source solutions, today announced its big data direction and solutions to satisfy enterprise requirements for highly reliable, scalable, and manageable solutions to effectively run their big data analytics workloads. In addition, Red Hat announced that the company will contribute its Red Hat Storage Hadoop plug-in to the Apache™ Hadoop® open community to transform Red Hat Storage into a fully-supported, Hadoop-compatible file system for big data environments, and that Red Hat is building a robust network of ecosystem and enterprise integration partners to deliver comprehensive big data solutions to enterprise customers. This is another example of Red Hat’s strategic commitment to big data customers and its continuing efforts to provide them with enterprise solutions through community-driven innovation.

The more Hadoop grows, the more Hadoop silos will as well.

You will need Hadoop and semantic skills to wire Hadoop silos together.

Re-wire with topic maps to avoid re-wiring the same Hadoop silos over and over again.

I first saw this at Red Hat reveal big data plans, open sources HDFS replacement by Elliot Bentley.

Stopping Theft: Don’t Lock Your Door, The Soap Opera Approach

Filed under: Cybersecurity,Security — Patrick Durusau @ 10:39 am

I am disappointed with the White House response to my suggestion on combating cybersecurity issues (Crowdsourcing Cybersecurity: A Proposal (Part 1) and Crowdsourcing Cybersecurity: A Proposal (Part 2)).

The White House proposal and my posts were on the same day, so it may not be a direct response to my posts. 😉

But in any event, here is the White House plan:

The plan, said Espinel, will increase U.S. diplomatic engagement on the issue. She didn’t specifically mention trade sanctions, though. The U.S., she said, will convey concerns to countries with high incidents of trade secret theft “with coordinated and sustained messages from the most senior levels of the administration.” The U.S. government, she said, will also work to establish coalitions with other countries that have been targeted by Cyber thieves, as well as use trade policy “to press other governments for better protection and enforcement.”

blah, blah, blah,….

Two foreign officials (in translation):

“How many calls did you get from the Assistant to the Under Secretary of the Adjutant from the Office of Cybersecurity Maintenance (AUAOCM)?” “Five yesterday.”

“I had four from the summer intern in the Office of the Coordination of National Infrastructure, but I think they were trying to order pizza. The message was different from the duplicate/robocalls on cybersecurity.”

“I’m going to get caller id. Easier than trying to hunt down hackers.”

Hackers around the world are about to piss themselves over “coordinated and sustained messages from the most senior levels of the administration.”

Translated into everyday crime: don’t lock your doors, leave the TV tuned to soap operas.

I can’t speak for anyone else, but I plan on keeping my doors locked.

You?

Machine Biases: Stop Word Lists

Filed under: Indexing,Searching — Patrick Durusau @ 7:09 am

Oracle Text Search doesn’t work on some words (Stack Overflow)

From the post:

I am using Oracle’s Text Search for my project. I created a ctxsys.context index on my column and inserted one entry “Would you like some wine???”. I executed the query

select guid, text, score(10) from triplet where contains (text, 'Would', 10) > 0

it gave me no results. Querying ‘you’ and ‘some’ also return zero results. Only ‘like’ and ‘wine’ matches the record. Does Oracle consider you, would, some as stop words?? How can I let Oracle match these words? Thank you.

I mention this because Oracle Text Workaround for Stop Words List (Beyond Search) reports this as “interesting.”

Hardly. Stop lists are a common feature of text searching.

Moreover, the stop list in question dates back to MySQL 3.23.

The MySQL manual section “11.9.4. Full-Text Stopwords” excludes “you” and “would” from indexing.
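To see why a query on “Would” finds nothing while “like” and “wine” still match, here is a toy sketch of stop-word filtering at index time. The stop list is a stand-in chosen for illustration, not Oracle’s or MySQL’s actual list.

# Hypothetical sketch of stop-word filtering at index time.
# Real engines (Oracle Text, MySQL full-text) ship much longer lists.
STOP_WORDS = {"would", "you", "some", "the", "a", "an", "of", "to"}

def index_tokens(text):
    tokens = [t.strip("?!.,").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]

print(index_tokens("Would you like some wine???"))  # ['like', 'wine']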

Suggestions of other stop word lists that exclude “you” and “would”?

“…the flawed man versus machine dichotomy”

Filed under: Artificial Intelligence,BigData — Patrick Durusau @ 6:50 am

The backlash against Big Data has started

Kaiser Fung critiques a recent criticism of big data saying:

Andrew Gelman has a beef with David Brooks over his New York Times column called “What Data Can’t Do”. (link) I will get to Brooks’s critique soon–my overall feeling is, he created a bunch of sound bites, and could have benefited from interviewing people like Andrew and myself, who are skeptical of Big Data claims but not maniacally dismissive.

The biggest issue with Brooks’s column is the incessant use of the flawed man versus machine dichotomy. He warns: “It’s foolish to swap the amazing machine in your skull for the crude machine on your desk.” The machine he has in his mind is the science-fictional, self-sufficient, intelligent computer, as opposed to the algorithmic, dumb-and-dumber computer as it exists today and for the last many decades. A more appropriate analogy of today’s computer (and of the foreseeable future) is a machine that the human brain creates to automate mechanical, repetitious tasks at scale. This machine cannot function without human piloting so it’s man versus man-plus-machine, not man versus machine. (emphasis added)

I would have to plead guilty to falling into that “…flawed man versus machine dichotomy.”

And why not?

When machinery gives absurd answers, such as matching children to wanted terrorists, and their human counterparts blindly accept the conclusion, there is cause for concern.

Kaiser concludes:

Brooks made a really great point at the end of the piece, which I will paraphrase: any useful data is cooked. “The end result looks disinterested, but, in reality, there are value choices all the way through, from construction to interpretation.” Instead of thinking about this as cause for concern, we should celebrate these “value choices” because they make the data more useful.

This brings me back to Gelman’s reaction in which he differentiates between good analysis and bad analysis. Except for the simplest problems, any good analysis uses cooked data but an analysis using cooked data could be good or bad.

Perhaps my criticism should be of people who conceal their “value choices” amidst machinery.

There may be disinterested machines, but only in the absence of people and their input.

Yes?

February 21, 2013

Precursors to Simple Web Semantics

Filed under: RDF,Semantic Web,Semantics — Patrick Durusau @ 9:04 pm

A couple of precursors to Simple Web Semantics have been brought to my attention.

Wanted to alert you so you can consider these prior/current approaches while evaluating Simple Web Semantics.

The first one was from Rob Weir (IBM), who suggested I look at “smart tags” from Microsoft and sent the link to Smart tags (Wikipedia).

The second one was from Nick Howard (a math wizard I know) who pointed out the similarity to bookmarklets. On that see: Bookmarklet (Wikipedia).

I will be diving deeper into both of these technologies.

Not so much as a historical study, but to see what did and did not work.

Other suggestions, directions, etc. are most welcome!

I have another refinement to the syntax that I will be posting tomorrow.

Teradata Announces Aster Discovery Platform

Filed under: SQL,Teradata,Visualization — Patrick Durusau @ 8:53 pm

Teradata Announces Aster Discovery Platform

Teradata didn’t get the memo about February 20th being performance day either. So that makes two of us. 😉

From the post:

Teradata today introduced Teradata Aster Discovery Platform 5.10, a discovery solution with more than 20 new big data analytic capabilities, including purpose-built visualizations.

The platform was designed for customers to acquire, prepare, analyze, and visualize petabyte-sized volumes of multi-structured data in a single platform with a single structured query language (SQL) command.

“Existing and aspiring data scientists should take note. The Teradata Aster Discovery Platform is full of new capabilities that can empower them to accelerate their innovation and supply new options to their business users,” said Scott Gnau, president, Teradata Labs.

Teradata’s open platform is a suite of integrated hardware, software, and best-of-breed partner solutions, using business intelligence (BI), data integration, analytics, and visualization tools. It was built for use by any SQL-savvy analyst or business user, while being powerful and flexible enough for the most sophisticated data scientists.

“With newly added analytics and visualization functionality, the Teradata Aster Discovery Platform offers the convenience of a ‘data scientist in a box,’” said Dan Vesset, program vice president of business analytics and big data, IDC. “Much of the market attention has been on vendors trying to build SQL engines on Hadoop. Teradata Aster Discovery Platform already provides an ANSI SQL-compliant method with its SQL-MapReduce framework to acquire, prepare, analyze, and visualize data from any data source including Hadoop. Without the need to integrate multiple point solutions, customers using this Teradata technology are able to accelerate the discovery process and visualize information in new and exciting ways, and to focus the scarce expertise of data scientists on the highest value-added tasks.”

Or maybe they did.

Aster Discovery Platform 5.10 will appear by the end of the 2nd quarter 2013.

See the post for a nice summary of coming features.

BTW, I hope you still have your SQL books. Looks like SQL is making a serious comeback. 😉

Neo4j: A Developer’s Perspective

Filed under: Graphs,Neo4j — Patrick Durusau @ 8:47 pm

Neo4j: A Developer’s Perspective

From the post:

In the age of NoSQL databases, where a new database seems to pop up every week, it is not surprising that even a larger number of articles related to them are written everyday. So when I started writing this blog on Neo4j, instead of describing how freaking awesome it is, I aimed to address the most common issues that a “regular” developer faces. By regular, I mean that, a developer, who is familiar with databases in general and knows the basics for Neo4j, but is a novice when it comes to actually using it.

A brief overview for those not familiar with Neo4j. Neo4j is a graph database. A graph database uses the concept of graph theory to store data. Graph Theory is the study of graphs, which are structures containing vertices and edges or in other words nodes and relationships. So, in a graph database, data is modeled in terms of nodes and relationships. Neo4j, at a first glance seems pretty much similar to any other graph database model that we encountered before. It has nodes, it has relationships, they are interconnected to form a complex graph and you traverse the graph in a specific pattern to get desired results.
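If the node/relationship model is new to you, a toy structure may help. This is plain Python to illustrate the model, not Neo4j’s API or query language.

# Toy property graph: nodes with properties, typed relationships between them.
nodes = {
    1: {"name": "Alice"},
    2: {"name": "Bob"},
    3: {"name": "Graph Databases"},
}
relationships = [
    (1, "KNOWS", 2),
    (1, "LIKES", 3),
    (2, "LIKES", 3),
]

def neighbors(node_id, rel_type=None):
    """Follow outgoing relationships from a node, optionally filtered by type."""
    return [
        nodes[dst]
        for src, rel, dst in relationships
        if src == node_id and (rel_type is None or rel == rel_type)
    ]

print(neighbors(1, "LIKES"))  # [{'name': 'Graph Databases'}]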

I don’t think you will see anything new here but it is a useful post if you are unfamiliar with Neo4j.

I mention it primarily because of a comment objecting to the AGPL licensing of Neo4j.

Err, if I am writing a web application to sell to a client, why would I object to paying for a commercial license for Neo4j?

Or is there some subtlety to profiting off of free software that I am missing?

I first saw this at: DZone.

Why Business Intelligence Software Is Failing Business

Filed under: Design,Interface Research/Design,Usability — Patrick Durusau @ 8:37 pm

Why Business Intelligence Software Is Failing Business

From the post:

Business intelligence software is supposed to help businesses access and analyze data and communicate analytics and metrics. I have witnessed improvements to BI software over the years, from mobile and collaboration to interactive discovery and visualization, and our Value Index for Business Intelligence finds a mature set of technology vendors and products. But even as these products mature in capabilities, the majority lack features that would make them easy to use. Our recent research on next-generation business intelligence found that usability is the most important evaluation criteria for BI technology, outpacing functionality (49%) and even manageability (47%). The pathetic state of dashboards and the stupidity of KPI illustrate some of the obvious ways the software needs to improve for businesses to gain the most value from it. We need smarter business intelligence, and that means not just more advanced sets of capabilities that are designed for the analysts, but software designed for those who need to use BI information.

BI considerations

Our research finds the need to collaborate and share (67%) and inform and deliver (61%) are in the top five evaluation categories for software. A few communication improvements, highlighted below, would help organizations better utilize analytics and BI information.

Imagine that: usability ahead of functionality.

Successful semantic software vendors will draw several lessons from this post.

Integrating Structured and Unstructured Data

Filed under: Data Integration,Integration,Structured Data,Unstructured Data — Patrick Durusau @ 8:05 pm

Integrating Structured and Unstructured Data by David Loshin.

It’s a checklist report but David comes up with useful commentary on the following seven points:

  1. Document clearly defined business use cases.
  2. Employ collaborative tools for the analysis, use, and management of semantic metadata.
  3. Use pattern-based analysis tools for unstructured text.
  4. Build upon methods to derive meaning from content, context, and concept.
  5. Leverage commodity components for performance and scalability.
  6. Manage the data life cycle.
  7. Develop a flexible data architecture.

It’s not going to save you planning time but may keep you from overlooking important issues.

My only quibble is that David doesn’t call out data structures as needing defined and preserved semantics.

Data is a no-brainer but the containers of data, dare I say “Hadoop silos,” need to have semantics defined as well.

Data or data containers without defined and preserved semantics are much more costly in the long run.

Both in lost opportunity costs and in after-the-fact integration costs.

Hadoop silos need integration…

Filed under: Data Integration,Hadoop,Semantic Diversity,Semantic Inconsistency — Patrick Durusau @ 7:50 pm

Hadoop silos need integration, manage all data as asset, say experts by Brian McKenna.

From the post:

Big data hype has caused infantile disorders in corporate organisations over the past year. Hadoop silos, an excess of experimentation, and an exaggeration of the importance of data scientists are among the teething problems of big data, according to experts, who suggest organisations should manage all data as an asset.

Steve Shelton, head of data services at consultancy Detica, part of BAE Systems, said Hadoop silos have become part of the enterprise IT landscape, both in the private and public sectors. “People focused on this new thing called big data and tried to isolate it [in 2011 and 2012],” he said.

The focus has been too concentrated on non-traditional data types, and that has been driven by the suppliers. The business value of data is more effectively understood when you look at it all together, big or otherwise, he said.

Have big data technologies been a distraction? “I think it has been an evolutionary learning step, but businesses are stepping back now. When it comes to information governance, you have to look at data across the patch,” said Shelton.

He said Detica had seen complaints about Hadoop silos, and these were created by people going through a proof-of-concept phase, setting up a Hadoop cluster quickly and building a team. But a Hadoop platform involves extra costs on top, in terms of managing it and integrating it into your existing business processes.

“It’s not been a waste of time and money, it is just a stage. And it is not an insurmountable challenge. The next step is to integrate those silos, but the thinking is immature relative to the technology itself,” said Shelton.

I take this as encouraging news for topic maps.

Semantically diverse data has been stored in semantically diverse datastores. Data which, if integrated, could provide business value.

Again.

There will always be a market for topic maps because people can’t stop creating semantically diverse data and data stores.

How’s that for long term market security?

No matter what data or data storage technology arises, semantic inconsistency will be with us always.

Pfizer swaps out ETL for data virtualization tools

Filed under: Data Virtualization,Data Warehouse,ETL,Virtualization,Visualization — Patrick Durusau @ 7:35 pm

Pfizer swaps out ETL for data virtualization tools by Nicole Laskowski.

From the post:

Pfizer Inc.’s Worldwide Pharmaceutical Sciences division, which determines what new drugs will go to market, was at a technological fork in the road. Researchers were craving a more iterative approach to their work, but when it came to integrating data from different sources, the tools were so inflexible that work slowdowns were inevitable.

At the time, the pharmaceutical company was using one of the most common integration practices known as extract, transform, load (ETL). When a data integration request was made, ETL tools were used to reach into databases or other data sources, copy the requested data sets and transfer them to a data mart for users and applications to access.

But that’s not all. The Business Information Systems (BIS) unit of Pfizer, which processes data integration requests from the company’s Worldwide Pharmaceutical Sciences division, also had to collect specific requirements from the internal customer and thoroughly investigate the data inventory before proceeding with the ETL process.

“Back then, we were basically kind of in this data warehousing information factory mode,” said Michael Linhares, a research fellow and the BIS team leader.

Requests were repetitious and error-prone because ETL tools copy and then physically move the data from one point to another. Much of the data being accessed was housed in Excel spreadsheets, and by the time that information made its way to the data mart, it often looked different from how it did originally.

Plus, the integration requests were time-consuming since ETL tools process in batches. It wasn’t outside the realm of possibility for a project to take up to a year and cost $1 million, Linhares added. Sometimes, his team would finish an ETL job only to be informed it was no longer necessary.

“That’s just a sign that something takes too long,” he said.

Cost, quality and time issues aside, not every data integration request deserved this kind of investment. At times, researchers wanted quick answers; they wanted to test an idea, cross it off if it failed and move to the next one. But ETL tools meant working under rigid constraints. Once Linhares and his team completed an integration request, for example, they were unable to quickly add another field and introduce a new data source. Instead, they would have to build another ETL for that data source to be added to the data mart.

Bear in mind that we were just reminded, in Leveraging Ontologies for Better Data Integration, that you have to understand data to integrate data.

That lesson holds true for integrating data after data virtualization.

Where are you going to write down your understanding of the meaning of the data you virtualize?

So subsequent users can benefit from your understanding of that data?

Or perhaps add their understanding to yours?

Or to have the capacity to merge collections of such understandings?

I would say a topic map.

You?

Leveraging Ontologies for Better Data Integration

Filed under: Data Integration,Ontology — Patrick Durusau @ 7:24 pm

Leveraging Ontologies for Better Data Integration by David Linthicum.

From the post:

If you don’t understand application semantics ‑ simply put, the meaning of data ‑ then you have no hope of creating the proper data integration solution. I’ve been stating this fact since the 1990s, and it has proven correct over and over again.

Just to be clear: You must understand the data to define the proper integration flows and transformation scenarios, and provide service-oriented frameworks to your data integration domain, meaning levels of abstraction. This is applicable both in the movement of data from source to target systems, as well as the abstraction of the data using data virtualization approaches and technology, such as technology for the host of this blog.

This is where many data integration projects fall down. Most data integration occurs at the information level. So, you must always deal with semantics and how to describe semantics relative to a multitude of information systems. There is also a need to formalize this process, putting some additional methodology and technology behind the management of metadata, as well as the relationships therein.

Many in the world of data integration have begun to adopt the notion of ontology (or the instances of ontology: ontologies). Ontology is a term borrowed from philosophy that refers to the science of describing the kinds of entities in the world and how they are related.

Why should we care? Ontologies are important to data integration solutions because they provide a shared and common understanding of data that exists within the business domain. Moreover, ontologies illustrate how to facilitate communication between people and information systems. You can think of ontologies as the understanding of everything, and how everything should interact to reach a common objective. In this case the optimization of the business. (emphasis added)

The two bolded lines I wanted to call to your attention:

If you don’t understand application semantics ‑ simply put, the meaning of data ‑ then you have no hope of creating the proper data integration solution. I’ve been stating this fact since the 1990s, and it has proven correct over and over again.

I wasn’t aware that understanding the “meaning of data” as a prerequisite for data integration was ever contested.

You?

I am equally unsure that having a “…common and shared understanding of data…” qualifies as an ontology.

Which is a restatement of the first point.

What interests me is how to go from non-common and non-shared understandings of data to capturing all the currently known understandings of that data.

Repeating what is uncontested or already agreed upon, isn’t going to help with that task.

NetGestalt for Data Visualization in the Context of Pathways

Filed under: Bioinformatics,Biomedical,Graphs,Networks,Visualization — Patrick Durusau @ 7:06 pm

NetGestalt for Data Visualization in the Context of Pathways by Stephen Turner.

From the post:

Many of you may be familiar with WebGestalt, a wonderful web utility developed by Bing Zhang at Vanderbilt for doing basic gene-set enrichment analyses. Last year, we invited Bing to speak at our annual retreat for the Vanderbilt Graduate Program in Human Genetics, and he did not disappoint! Bing walked us through his new tool called NetGestalt.

NetGestalt provides users with the ability to overlay large-scale experimental data onto biological networks. Data are loaded using continuous and binary tracks that can contain either single or multiple lines of data (called composite tracks). Continuous tracks could be gene expression intensities from microarray data or any other quantitative measure that can be mapped to the genome. Binary tracks are usually insertion/deletion regions, or called regions like ChIP peaks. NetGestalt extends many of the features of WebGestalt, including enrichment analysis for modules within a biological network, and provides easy ways to visualize the overlay of multiple tracks with Venn diagrams.

Stephen also points to documentation and video tutorials.

NetGestalt uses gene symbol as the gene identifier. Data that uses other gene identifiers must be mapped to gene symbols before uploading. (Manual, page 4)

An impressive alignment of data sources even with the restriction to gene symbols.

February 20, 2013

Crowdsourcing Cybersecurity: A Proposal (Part 2)

Filed under: Crowd Sourcing,Cybersecurity,Security — Patrick Durusau @ 9:29 pm

As you may already suspect, my proposal for increasing cybersecurity is transparency.

A transparency born of crowdsourcing cybersecurity.

What are the consequences of the current cult of secrecy around cybersecurity?

Here’s my short list (feel free to contribute):

  • Governments have no source of reliable information on the security of their contractors, vendors, etc.
  • Corporations have no source of reliable information on the security of their contractors, partners and others.
  • Sysadmins outside the “inner circle” have no notice of the details of hacks, with which to protect their systems.
  • Consumers of software have no source of reliable information on how insecure software may or may not be.

Secrecy puts everyone at greater cybersecurity risk, not less.

Let’s end cybersecurity secrecy and crowdsource cybersecurity.

Here is a sketch of one way to do just that:

  1. Establish or re-use an agency or organization to offer bounties on hacks into systems.
  2. Use a sliding scale, where penetration using published root passwords is worth less than more sophisticated hacks. But even a minimal hack is worth, say, $5,000.
  3. To collect the funds, a hacker must provide full hack details and proof of the hack.
  4. A hacker submitting a “proof of hackability” attack has legal immunity (civil and criminal).
  5. Hack has to be verified using the hack as submitted.
  6. Upon verification of the hack, the hacker is paid the bounty.
  7. One Hundred and Eighty (180) days after the verification of the hack, the name of the hacked organization, the full details of the hack and the hacker’s identity (subject to their permission), are published to a public website.

Finance such a proposal, if run by a government, by fines on government contractors who get hacked.

Defense contractors who aren’t cybersecure should not be defense contractors.

That’s how you stop loss of national security information.

Surprised it hasn’t occurred to anyone inside the beltway.


With greater transparency, hacks, software, origins of software, authors of software, managers of security, all become subject to mapping.

Would you hire your next security consultant from a firm that gets hacked on a regular basis?

Or would you hire a defense contractor that changed its skin to avoid identification as an “easy” hack?

Or retain a programmer who keeps being responsible for security flaws?

Transparency and a topic map could give you better answers to those questions than you have today.

Crowdsourcing Cybersecurity: A Proposal (Part 1)

Filed under: Crowd Sourcing,Cybersecurity,Security — Patrick Durusau @ 9:28 pm

Mandiant’s provocative but hardly conclusive report has created a news wave on cybersecurity.

Hardly conclusive because as Mandiant states:

we have analyzed the group’s intrusions against nearly 150 victims over seven years (page 2)

A little over twenty-one victims a year. And I thought hacking was commonplace. 😉

Allegations of hacking should require a factual basis other than “more buses were going the other way.” (A logical fallacy because you get on the first bus going your way.)

Here we have a tiny subset (if general hacking allegations have any credibility) of all hacking every year.

Who is responsible for the intrusions?

It is easy and commonplace to blame hackers, but there are other responsible parties.

The security industry that continues to protect the identity of the “victims” of hacks and shares hacking information with a group of insiders comes to mind.

That long standing cult of secrecy has not prevented, if you believe the security PR, a virtual crime wave of hacking.

In fact, every non-disclosed hack leaves thousands if not hundreds of thousands of users, institutions, governments and businesses with no opportunity to protect themselves.

And, if you are hiring a contractor, say a defense contractor, isn’t their record with protecting your data from hackers a relevant concern?

If users, institutions, governments and businesses had access to the details of hacking reports, who was hacked, who in the organization was responsible for computer security, how the hack was performed, etc., then we could all better secure our computers.

Or be held accountable for failing to secure our computers. By management, customers and/or governments.

Decades of diverting attention from poor security practices, hiding those who practice poor security, and cultivating a cult of secrecy around computer security, hasn’t diminished hacking.

What part of that lesson is unclear?

Or do you deny the reports by Mandiant and others?

It really is that clear: Either Mandiant and others are inventing hacking figures out of whole cloth or the cult of cybersecurity secrecy has failed to stop hacking.

Interested? See Crowdsourcing Cybersecurity: A Proposal (Part 2) for my take on a solution.


Just as a side note, President Obama’s Executive Order — Improving Critical Infrastructure Cybersecurity appeared on February 12, 2013. Compare: Mandiant Releases Report Exposing One of China’s Cyber Espionage Groups released February 19, 2013.

Is Mandiant trying to ride on the President’s coattails as they say?

Or just being opportunistic with the news cycle?

Connected into the beltway security cult?

Hard to say, probably impossible to know. Interesting timing nonetheless.

I wonder who will be on the various panels, experts, contractors under the Cybersecurity executive order?

Don’t you?

Graph Databases, GPUs, and Graph Analytics

Filed under: GPU,Graph Analytics,Graphs — Patrick Durusau @ 9:25 pm

Graph Databases, GPUs, and Graph Analytics by Bryan Thompson.

From the post:

For people who want to track what we’ve been up to on the XDATA project, there are three survey articles that we’ve produced:

Literature Review of Graph Databases (Bryan Thompson, SYSTAP)
Large Scale Graph Algorithms on the GPU (Yangzihao Wang and John Owens, UC Davis)
Graph Pattern Matching, Search, and OLAP (Dr. Xifeng Yan, UCSB)

Simply awesome reading.

It may be too early to take off for a weekend of reading but I wish….

Co-EM algorithm in GraphChi

Filed under: GraphChi,Graphs — Patrick Durusau @ 9:25 pm

Co-EM algorithm in GraphChi by Danny Bickson.

From the post:

Following the previous post about label propagation, as well as a request from a US-based startup to implement this method in GraphChi, I have decided to write a quick tutorial for the Co-EM algorithm.

You cannot live by press reports alone. 😉

Enjoy some time with GraphChi!

Cascading into Hadoop with SQL

Filed under: Cascading,Hadoop,Lingual,SQL — Patrick Durusau @ 9:24 pm

Cascading into Hadoop with SQL by Nicole Hemsoth.

From the post:

Today Concurrent, the company behind the Cascading Hadoop abstraction framework, announced a new trick to help developers tame the elephant.

The company, which is focused on simplifying Hadoop, has introduced a SQL parser that sits on top of Cascading with a JDBC interface. Concurrent says that they’ll be pushing it out over the next couple of weeks with hopes that developers will take it under their wing and support the project.

According to the company’s CTO and founder, Chris Wensel, the goal is to get the community to rally around a new way to let non-programmers make use of data that’s locked in Hadoop clusters and let them more easily move applications onto Hadoop clusters.

The newly-announced approach to extending the abstraction is called Lingual, which is aimed at putting Hadoop within closer sights for those familiar with SQL, JDBC and traditional BI tools. It provides what the company calls “true SQL for Cascading and Hadoop” to enable easier creation and running of applications on Hadoop and again, to tap into that growing pool of Hadoop-seekers who lack the expertise to back mission-critical apps on the platform.

Wensel says that Lingual’s goal is to provide an ANSI-standard SQL interface that is designed to play well with all of the big name distros running on site or in cloud environments. This will allow a “cut and paste” capability for existing ANSI SQL code from traditional data warehouses so users can access data that’s locked away on a Hadoop cluster. It’s also possible to query and export data from Hadoop right into a wide range of BI tools.

Another example of meeting a large community of users where they are, not where you would like for them to be.

Targeting a market that already exists is easier than building a new one from the ground up.

LucidWorks™ Teams with MapR™… [Not 26% but 5-6% + not from Big Data]

Filed under: LucidWorks,MapR — Patrick Durusau @ 9:24 pm

LucidWorks™ Teams with MapR™ Technologies to Offer Best-in-Class Big Data Analytics Solution

Performance Day just keeps on going!

From the press release:

REDWOOD CITY, Calif. – February 20, 2013 – Big Data provides a very real opportunity for organizations to drive business decisions by utilizing new information that has yet to be tapped. However, it is increasingly apparent that organizations are struggling to make effective use of this new multi-structured content for data-driven decision-making. According to a report from the Economist Intelligence Unit, the challenge is not so much the volume, but instead it is the pressing need to analyze and act on Big Data in real-time.

Existing business intelligence (BI) tools have simply not been designed to provide spontaneous search on multi-structured data in motion. Responding directly to this need, LucidWorks, the company transforming the way people access information, and MapR Technologies, the Hadoop technology leader, today announced the integration between LucidWorks Search™ and MapR. Available now, the combined solution allows organizations to easily search their MapR Distributed File System (DFS) in a natural way to discover actionable insights from information maintained in Hadoop.

“Organizations that wait to address big data until this evolution is well under way will lose out competitively in their vertical markets, compared to organizations that have aggressively pursued big data flexibility. Aggressive organizations will demonstrate faster, more accurate analysis and decisions relating to their tactical operations and strategic planning.”

  • Source: Big Data Drives Rapid Changes in Infrastructure and $232 Billion in IT Spending Through 2016, Gartner Group

Integration Solution Highlights

  • Combines the best of Big Data with Search with an integrated and fully distributed solution
  • Supports a pre-defined MapR target data source within LucidWorks Search
  • Enables users to create and configure the MapR data source directly from the LucidWorks Search administration console
  • Leverages enterprise security features offered by both MapR and LucidWorks Search

The Economist Intelligence Unit study found that global companies experienced a 26 percent improvement in performance over the last three years when big data analytics were applied to the decision-making process. And now, those data-savvy executives are forecasting a 41 percent improvement over the next three years. The integration between LucidWorks Search and MapR makes it easier to put Big Data analytics in motion.

I’m really excited about this match up but you know I can’t simply let claims like “…global companies experienced a 26 percent improvement in performance….” slide by. 😉

If you go read the report, The Deciding Factor: Big Data & Decision Making, you will find at page six (6):

On average, survey participants say that big data has improved their organisations’ performance in the past three years by 26%, and they are optimistic that it will improve performance by an average of 41% in the next three years. While “performance” in this instance is not rigorously specified, it is a useful gauge of mood.

The measured difference in performance, from the same report:

firms that emphasise decision-making based on data and analytics performed 5-6% better—as measured by output and performance—than those that rely on intuition and experience for decision-making.

So, not 26% but a measured 5-6%, and the 5-6% is for decision-making based on data and analytics, not big data.

You don’t find code written at either LucidWorks or MapR that is “close enough.” Both have well-deserved reputations for clean code and hard work.

Why should communications fall short of that mark?

NoSQL is Great, But You Still Need Indexes [MongoDB for example]

Filed under: Fractal Trees,Indexing,MongoDB,NoSQL,TokuDB,Tokutek — Patrick Durusau @ 9:23 pm

NoSQL is Great, But You Still Need Indexes by Martin Farach-Colton.

From the post:

I’ve said it before, and, as is the nature of these things, I’ll almost certainly say it again: your database performance is only as good as your indexes.

That’s the grand thesis, so what does that mean? In any DB system — SQL, NoSQL, NewSQL, PostSQL, … — data gets ingested and organized. And the system answers queries. The pain point for most users is around the speed to answer queries. And the query speed (both latency and throughput, to be exact) depend on how the data is organized. In short: Good Indexes, Fast Queries; Poor Indexes, Slow Queries.

But building indexes is hard work, or at least it has been for the last several decades, because almost all indexing is done with B-trees. That’s true of commercial databases, of MySQL, and of most NoSQL solutions that do indexing. (The ones that don’t do indexing solve a very different problem and probably shouldn’t be confused with databases.)

It’s not true of TokuDB. We build Fractal Tree Indexes, which are much easier to maintain but can still answer queries quickly. So with TokuDB, it’s Fast Indexes, More Indexes, Fast Queries. TokuDB is usually thought of as a storage engine for MySQL and MariaDB. But it’s really a B-tree substitute, so we’re always on the lookout for systems where we can improve the indexing.

Enter MongoDB. MongoDB is beloved because it makes deployment fast. But when you peel away the layers, you get down to a B-tree, with all the performance headaches and workarounds that they necessitate.

That’s the theory, anyway. So we did some testing. We ripped out the part of MongoDB that takes care of secondary indices and plugged in TokuDB. We’ve posted the blogs before, but here they are again, the greatest hits of TokuDB+MongoDB: we show a 10x insertion performance, a 268x query performance, and a 532x (or 53,200% if you prefer) multikey index insertion performance. We also discussed covered indexes vs. clustered Fractal Tree Indexes.
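The “good indexes, fast queries” point is easy to demonstrate at toy scale. The sketch below compares a full scan against a binary search over a sorted list; it says nothing about B-trees versus Fractal Trees, only about why any index matters.

import bisect
import random
import time

# One million synthetic records, queried with and without an "index".
records = [random.randrange(10_000_000) for _ in range(1_000_000)]
index = sorted(records)  # built once, reused for every query

def scan(value):
    return value in records  # no index: examine records until a match

def indexed(value):
    i = bisect.bisect_left(index, value)  # binary search over the sorted index
    return i < len(index) and index[i] == value

target = records[-1]
t0 = time.perf_counter(); scan(target)
t1 = time.perf_counter(); indexed(target)
t2 = time.perf_counter()
print(f"scan: {t1 - t0:.4f}s  indexed: {t2 - t1:.6f}s")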

Did somebody declare February 20th to be performance release day?

Did I miss that memo? 😉

Like every geek, I like faster. But, here’s my question:

Have there been any studies on the impact of faster systems on searching and decision making by users?

My assumption is the faster I get a non-responsive result from a search, the sooner I can improve it.

But that’s an assumption on my part.

Is that really true?

Introducing… Tez: Accelerating processing of data stored in HDFS

Filed under: DAG,Graphs,Hadoop YARN,MapReduce,Tez — Patrick Durusau @ 9:23 pm

Introducing… Tez: Accelerating processing of data stored in HDFS by Arun Murthy.

From the post:

MapReduce has served us well. For years it has been THE processing engine for Hadoop and has been the backbone upon which a huge amount of value has been created. While it is here to stay, new paradigms are also needed in order to enable Hadoop to serve an even greater number of usage patterns. A key and emerging example is the need for interactive query, which today is challenged by the batch-oriented nature of MapReduce. A key step to enabling this new world was Apache YARN and today the community proposes the next step… Tez

What is Tez?

Tez – Hindi for “speed” – (currently under incubation vote within Apache) provides a general-purpose, highly customizable framework that simplifies data-processing tasks across both small-scale (low-latency) and large-scale (high-throughput) workloads in Hadoop. It generalizes the MapReduce paradigm to a more powerful framework by providing the ability to execute a complex DAG (directed acyclic graph) of tasks for a single job so that projects in the Apache Hadoop ecosystem such as Apache Hive, Apache Pig and Cascading can meet requirements for human-interactive response times and extreme throughput at petabyte scale (clearly MapReduce has been a key driver in achieving this).

With the emergence of Apache Hadoop YARN as the basis of next generation data-processing architectures, there is a strong need for an application which can execute a complex DAG of tasks which can then be shared by Apache Pig, Apache Hive, Cascading and others. The constrained DAG expressible in MapReduce (one set of maps followed by one set of reduces) often results in multiple MapReduce jobs which harm latency for short queries (overhead of launching multiple jobs) and throughput for large-scale queries (too much overhead for materializing intermediate job outputs to the filesystem). With Tez, we introduce a more expressive DAG of tasks, within a single application or job, that is better aligned with the required processing task – thus, for e.g., any given SQL query can be expressed as a single job using Tez.
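To picture the difference between the fixed map-then-reduce shape and an arbitrary DAG of tasks, here is a tiny scheduler sketch. It is a hypothetical illustration of running tasks in dependency order, not Tez’s API; the task names are made up.

from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical job expressed as one DAG of tasks (task -> its prerequisites),
# rather than a chain of separate MapReduce jobs materializing intermediate files.
dag = {
    "scan_orders": [],
    "scan_customers": [],
    "join": ["scan_orders", "scan_customers"],
    "aggregate": ["join"],
    "write_result": ["aggregate"],
}

def run(dag):
    for task in TopologicalSorter(dag).static_order():
        # In a real engine each task would stream its output to its dependents.
        print("running", task)

run(dag)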

If you are familiar with Michael Sperberg-McQueen and Claus Huitfeldt’s work on DAGs, you will be as excited as I am! (GODDAG, for example.)

On any day this would be awesome work.

Even more so coming on the heels of two other major project announcements. Securing Hadoop with Knox Gateway and The Stinger Initiative: Making Apache Hive 100 Times Faster, both from Hortonworks.

The Stinger Initiative: Making Apache Hive 100 Times Faster

Filed under: Hive,Hortonworks — Patrick Durusau @ 9:23 pm

The Stinger Initiative: Making Apache Hive 100 Times Faster by Alan Gates.

From the post:

Introduced by Facebook in 2007, Apache Hive and its HiveQL interface has become the de facto SQL interface for Hadoop. Today, companies of all types and sizes use Hive to access Hadoop data in a familiar way and to extend value to their organization or customers either directly or though a broad ecosystem of existing BI tools that rely on this key proven interface. The who’s who of business analytics have already adopted Hive.

Hive was originally built for large-scale operational batch processing and it is very effective with reporting, data mining and data preparation use cases. These usage patterns remain very important but with widespread adoption of Hadoop, the enterprise requirement for Hadoop to become more real time or interactive has increased in importance as well. At Hortonworks, we believe in the power of the open source community to innovate faster than any proprietary offering and the Stinger initiative is proof of this once again as we collaborate with others to improve Hive performance.

So, What is Stinger?

Enabling Hive to answer human-time use cases (i.e. queries in the 5-30 second range) such as big data exploration, visualization, and parameterized reports without needing to resort to yet another tool to install, maintain and learn can deliver a lot of value to the large community of users with existing Hive skills and investments.

To this end, we have launched the Stinger Initiative, with input and participation from the broader community, to enhance Hive with more SQL and better performance for these human-time use cases. All the while, HiveQL remains the same before and after these advancements so it just gets better. And in keeping with the ecosystem of existing tools, it is complementary to best-of-breed data warehouses and analytic platforms.

Leveraging existing skills and infrastructure.

Who knows? Hortonworks may be about to start a trend!

Securing Hadoop with Knox Gateway

Filed under: Hadoop,Knox Gateway,Security — Patrick Durusau @ 9:23 pm

Securing Hadoop with Knox Gateway by Kevin Minder

From the post:

Back in the day, in order to secure a Hadoop cluster all you needed was a firewall that restricted network access to only authorized users. This eventually evolved into a more robust security layer in Hadoop… a layer that could augment firewall access with strong authentication. Enter Kerberos. Around 2008, Owen O’Malley and a team of committers led this first foray into security and today, Kerberos is still the primary way to secure a Hadoop cluster.

Fast-forward to today… Widespread adoption of Hadoop is upon us. The enterprise has placed requirements on the platform to not only provide perimeter security, but to also integrate with all types of authentication mechanisms. Oh yeah, and all the while, be easy to manage and to integrate with the rest of the secured corporate infrastructure. Kerberos can still be a great provider of the core security technology but with all the touch-points that a user will have with Hadoop, something more is needed.

The time has come for Knox.

Timely news of an effort at security that doesn’t depend upon obscurity (or inner circle secrecy).

Hadoop installations, whether in topic map workflows or not, need to pay attention to this project.

February 19, 2013

Literature Survey of Graph Databases

Filed under: 4store,Accumulo,Diplodocus,Graphs,Networks,RDF,SHARD,Urika,Virtuoso,YARS2 — Patrick Durusau @ 3:39 pm

Literature Survey of Graph Databases by Bryan Thompson.

I can understand Danny Bickson (Literature survey of graph databases) being excited about the coverage of GraphChi in this survey.

However, there are other names you will recognize as well (TOC order):

  • RDF3X
  • Diplodocus
  • GraphChi
  • YARS2
  • 4store
  • Virtuoso
  • Bigdata
  • SHARD
  • Graph partitioning
  • Accumulo
  • Urika
  • Scalable RDF query processing on clusters and supercomputers (a system with no name at Rensselaer Polytechnic)

As you can tell from the system names, the survey focuses on processing of RDF.

In reviewing one system, Bryan remarks:

Only small data sets were considered (100s of millions of edges). (emphasis added)

I think that captures the focus of the paper better than any comment I can make.

A must read for graph heads!
