Archive for July, 2013

U.S. Code Available in Bulk XML

Wednesday, July 31st, 2013

House of Representatives Makes U.S. Code Available in Bulk XML.

From the press release:

As part of an ongoing effort to make Congress more open and transparent, House Speaker John Boehner (R-OH) and Majority Leader Eric Cantor (R-VA) today announced that the House of Representatives is making the United States Code available for download in XML format.

The data is compiled, updated, and published by the Office of Law Revision Counsel (OLRC). You can download individual titles – or the full code in bulk – and read supporting documentation here.

“Providing free and open access to the U.S. Code in XML is another win for open government,” said Speaker Boehner and Leader Cantor. “And we want to thank the Office of Law Revision Counsel for all of their work to make this project a reality. Whether it’s our ‘read the bill’ reforms, streaming debates and committee hearings live online, or providing unprecedented access to legislative data, we’re keeping our pledge to make Congress more transparent and accountable to the people we serve.”

In 2011, Speaker Boehner and Leader Cantor called for the adoption of new electronic data standards to make legislative information more open and accessible. With those standards in place, the House created the Legislative Branch Bulk Data Task Force in 2012 to expedite the process of providing bulk access to legislative information and to increase transparency for the American people.

Since then, the Government Printing Office (GPO) has begun providing bulk access to House legislation in XML. The Office of the Clerk makes full sessions of House floor summaries available in bulk as well.

The XML version of the U.S. Code will be updated quickly, on an ongoing basis, as new laws are enacted.
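Once you have a title downloaded, pulling structure out of the XML is only a few lines of work. The fragment below is illustrative only: the real OLRC downloads use the USLM schema with namespaced elements, so treat these tag names as placeholders.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only -- the actual OLRC files use the USLM
# schema with namespaced elements; these tag names are placeholders.
SAMPLE = """\
<title num="17">
  <section num="106"><heading>Exclusive rights</heading></section>
  <section num="107"><heading>Fair use</heading></section>
</title>
"""

def section_headings(xml_text):
    """Return (section number, heading) pairs from a title fragment."""
    root = ET.fromstring(xml_text)
    return [(s.get("num"), s.findtext("heading")) for s in root.iter("section")]

print(section_headings(SAMPLE))
```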

You can see a full list of open government projects underway in the House at

While applauding Congress, don’t forget that the Legal Information Institute at Cornell University Law School has been working on free access to public law for the past twenty-one (21) years.

I first saw this at: U.S. House of Representatives Makes U.S. Code Available in Bulk XML.

Applied Cryptography Engineering

Wednesday, July 31st, 2013

Applied Cryptography Engineering

From the post:

If you’re reading this, you’re probably a red-blooded American programmer with a simmering interest in cryptography. And my guess is your interest came from Bruce Schneier’s Applied Cryptography.

Applied Cryptography is a deservedly famous book that lies somewhere between survey, pop-sci advocacy, and almanac. It taught two generations of software developers everything they know about crypto. It’s literate, readable, and ambitious. What’s not to love?

Just this: as an instruction manual, Applied Cryptography is dreadful. Even Schneier seems to concede the point. This article was written with several goals: to hurry along the process of getting Applied Cryptography off the go-to stack of developer references, to point out the right book to replace it with, and to spell out what else you need to know even after reading that replacement. Finally, I wrote this as a sort of open letter to Schneier and his co-authors.

Highly entertaining review of Applied Cryptography, its successor, Cryptography Engineering, and further reading on cryptography.

Personally I would pick up a copy of Applied Cryptography because of its place in the history of cryptography.

I first saw this in Nat Torkington’s Four short links: 29 July 2013.

To index is to translate

Tuesday, July 30th, 2013

To index is to translate by Fran Alexander.

From the post:

Living in Montreal means I am trying to improve my very limited French and in trying to communicate with my Francophone neighbours I have become aware of a process of attempting to simplify my thoughts and express them using the limited vocabulary and grammar that I have available. I only have a few nouns, fewer verbs, and a couple of conjunctions that I can use so far and so trying to talk to people is not so much a process of thinking in English and translating that into French, as considering the basic core concepts that I need to convey and finding the simplest ways of expressing relationships. So I will say something like “The sun shone. It was big. People were happy” because I can’t properly translate “We all loved the great weather today”.

This made me realise how similar this is to the process of breaking down content into key concepts for indexing. My limited vocabulary is much like the controlled vocabulary of an indexing system, forcing me to analyse and decompose my ideas into simple components and basic relationships. This means I am doing quite well at fact-based communication, but my storytelling has suffered as I have only one very simple emotional register to work with. The best I can offer is a rather laconic style with some simple metaphors: “It was like a horror movie.”

It is regularly noted that ontology work in the sciences has forged ahead of that in the humanities, and the parallel with my ability to express facts but not tell stories struck me. When I tell my simplified stories I rely on shared understanding of a broad cultural context that provides the emotional aspect – I can use the simple expression “horror movie” because the concept has rich emotional associations, connotations, and resonances for people. The concept itself is rather vague, broad, and open to interpretation, so the shared understanding is rather thin. The opposite is true of scientific concepts, which are honed into precision and a very constrained definitive shared understanding. So, I wonder how much of sense that I can express facts well is actually an illusion, and it is just that those factual concepts have few emotional resonances.

Is mapping a process of translation?

Are translations always less rich than the source?

Or are translations as rich but differently rich?

Parquet 1.0: Columnar Storage for Hadoop

Tuesday, July 30th, 2013

Announcing Parquet 1.0: Columnar Storage for Hadoop by Justin Kestelyn.

From the post:

In March we announced the Parquet project, the result of a collaboration between Twitter and Cloudera intended to create an open-source columnar storage format library for Apache Hadoop.

Today, we’re happy to tell you about a significant Parquet milestone: a 1.0 release, which includes major features and improvements made since the initial announcement. But first, we’ll revisit why columnar storage is so important for the Hadoop ecosystem.

What is Parquet and Columnar Storage?

Parquet is an open-source columnar storage format for Hadoop. Its goal is to provide a state of the art columnar storage layer that can be taken advantage of by existing Hadoop frameworks, and can enable a new generation of Hadoop data processing architectures such as Impala, Drill, and parts of the Hive ‘Stinger’ initiative. Parquet does not tie its users to any existing processing framework or serialization library.

The idea behind columnar storage is simple: instead of storing millions of records row by row (employee name, employee age, employee address, employee salary…) store the records column by column (all the names, all the ages, all the addresses, all the salaries). This reorganization provides significant benefits for analytical processing:

  • Since all the values in a given column have the same type, generic compression tends to work better and type-specific compression can be applied.
  • Since column values are stored consecutively, a query engine can skip loading columns whose values it doesn’t need to answer a query, and use vectorized operators on the values it does load.

These effects combine to make columnar storage a very attractive option for analytical processing.
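The row-to-column reorganization is easy to see in miniature (toy data, no actual Parquet involved):

```python
# Row layout: one tuple per record. Column layout: one list per field.
rows = [
    ("Ann", 34, 70000),
    ("Bob", 41, 82000),
    ("Eve", 29, 65000),
]

# Transpose rows into columns -- what a columnar format does on write.
names, ages, salaries = (list(col) for col in zip(*rows))

# An analytical query like "average salary" touches only one column;
# the names and ages never need to be read at all.
avg_salary = sum(salaries) / len(salaries)
print(avg_salary)
```

Each resulting list holds values of one type, which is exactly why the type-specific compression mentioned above becomes possible.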

A little over four (4) months from announcement to a 1.0 release!

Now that’s performance!

The Hadoop ecosystem just keeps getting better.

Lucene 4 Performance Tuning

Tuesday, July 30th, 2013

From the description:

Apache Lucene has undergone a major overhaul influencing many of the key characteristics dramatically. New features and modifications allow for new as well as fundamentally different ways of tuning the engine for best performance.

Tuning performance is essential for almost every Lucene-based application these days – search and performance are almost synonyms. Knowing the details of the underlying software provides the basic tools to get the best out of your application. Knowing the limitations can save you and your company a massive amount of time and money. This talk tries to explain design decisions made in Lucene 4 compared to older versions and to provide technical details on how those implementations and design decisions can help to improve the performance of your application. The talk will mainly focus on core features like: realtime & batch indexing, filter and query performance, highlighting, and custom scoring.

The talk will contain a lot of technical details that require a basic understanding of Lucene, data structures, and algorithms. You don’t need to be an expert to attend, but be prepared for a deep dive into Lucene. Attendees don’t need to be direct Lucene users; the fundamentals provided in this talk are also essential for Apache Solr or elasticsearch users.

If you want to catch some of the highlights of Lucene 4, this is the presentation for you!

It will be hard to not go dig deeper in a number of areas.

The new codec features were particularly impressive!

Large File/Data Tools

Tuesday, July 30th, 2013

Essential tools for manipulating big data files by Daniel Rubio.

From the post:

You can leverage several tools that are commonly used to manipulate big data files, which include: Regular expressions, sed, awk, WYSIWYG editors (e.g. Emacs, vi and others), scripting languages (e.g. Bash, Perl, Python and others), parsers (e.g. Expat, DOM, SAX and others), compression utilities (e.g. zip, tar, bzip2 and others) and miscellaneous Unix/Linux utilities (e.g. split, wc, sort, grep)


10 Awesome Examples for Viewing Huge Log Files in Unix by Ramesh Natarajan.

Viewing huge log files for troubleshooting is a mundane, routine task for sysadmins and programmers.

In this article, let us review how to effectively view and manipulate huge log files using 10 awesome examples.

The two articles cover the same topic but with very little overlap (only grep, as far as I can determine).

Are there other compilations of “tools” that would be handy for large data files?

Turning visitors into sales: seduction vs. analytics

Tuesday, July 30th, 2013

Turning visitors into sales: seduction vs. analytics by Mirko Krivanek.

From the post:

The context here is about increasing conversion rate, from website visitor to active, converting user. Or from passive newsletter subscriber to a lead (a user who opens the newsletter, clicks on the links, and converts). Here we will discuss the newsletter conversion problem, although it applies to many different settings.


Of course, to maximize the total number of leads (in any situation), you need to use both seduction and analytics:

sales = f(seduction, analytics, product, price, competition, reputation)

How to assess the weight attached to each factor in the above formula is beyond the scope of this article. First, even measuring “seduction” or “analytics” is very difficult. But you could use a 0-10 scale, with seduction = 9 representing a company making significant efforts to seduce prospects, and analytics = 0 representing a company totally ignoring analytics.
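A toy version of the formula, with made-up equal weights and scores on the post’s 0-10 scale, just to make the shape of the model concrete:

```python
# Purely illustrative: the factor scores and the equal weights are
# invented, not something the post prescribes.
def lead_score(factors, weights):
    """A weighted-sum proxy for sales = f(seduction, analytics, ...)."""
    return sum(weights[k] * factors[k] for k in factors)

factors = {"seduction": 9, "analytics": 0, "product": 7,
           "price": 5, "competition": 3, "reputation": 6}
weights = {k: 1 / len(factors) for k in factors}  # equal weights as a default

print(round(lead_score(factors, weights), 2))
```

The real work, as the post says, is in estimating the weights, which this sketch sidesteps entirely.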

I did not add a category for “seduction.” Perhaps if someone writes a topic map on seduction I will. 😉

Mirko’s seduction vs. analytics resonates with Kahneman’s fast versus slow thinking.

“Fast” thinking takes less effort by a reader and “slow” thinking takes more.

Forcing your readers to work harder, for marketing purposes, sounds like a bad plan to me.

“Fast”/seductive thinking should be the goal of your marketing efforts.

RTextTools: A Supervised Learning Package for Text Classification

Tuesday, July 30th, 2013

RTextTools: A Supervised Learning Package for Text Classification by Timothy P. Jurka, Loren Collingwood, Amber E. Boydstun, Emiliano Grossman, and Wouter van Atteveldt.


Social scientists have long hand-labeled texts to create datasets useful for studying topics from congressional policymaking to media reporting. Many social scientists have begun to incorporate machine learning into their toolkits. RTextTools was designed to make machine learning accessible by providing a start-to-finish product in less than 10 steps. After installing RTextTools, the initial step is to generate a document term matrix. Second, a container object is created, which holds all the objects needed for further analysis. Third, users can use up to nine algorithms to train their data. Fourth, the data are classified. Fifth, the classification is summarized. Sixth, functions are available for performance evaluation. Seventh, ensemble agreement is conducted. Eighth, users can cross-validate their data. Finally, users write their data to a spreadsheet, allowing for further manual coding if required.
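The R package aside, step one of that workflow (the document-term matrix) looks like this in a bare-bones Python sketch, with counts stored per document rather than in an R matrix:

```python
from collections import Counter

docs = ["tax bill passed", "media reports on tax bill", "budget bill debate"]

def document_term_matrix(docs):
    """Step one of the RTextTools workflow: term counts per document,
    here as a list of Counter objects rather than a sparse matrix."""
    return [Counter(d.split()) for d in docs]

dtm = document_term_matrix(docs)
print(dtm[0]["bill"], dtm[1]["tax"])
```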

Another software package that comes with a sample data set!

The congressional bills example reminds me of a comment by Trey Grainger in Building a Real-time, Big Data Analytics Platform with Solr.

Trey makes the point that “document” in Solr depends on how you define document. Which enables processing/retrieval at a much lower level than a traditional “document.”

If the congressional bills were broken down at a clause level, would the results be different?

Not something I am going to pursue today but will appreciate comments and suggestions if you have seen that tried in other contexts.

Fast Graph Kernels for RDF

Tuesday, July 30th, 2013

Fast Graph Kernels for RDF

From the post:

As a complement to two papers that we will present at the ECML/PKDD 2013 conference in Prague in September we created a webpage with additional material.

The first paper: “A Fast Approximation of the Weisfeiler-Lehman Graph Kernel for RDF Data” was accepted into the main conference and the second paper: “A Fast and Simple Graph Kernel for RDF” was accepted at the DMoLD workshop.

We include links to the papers, to the software and to the datasets used in the experiments, which are stored in figshare. Furthermore, we explain how to rerun the experiments from the papers using a precompiled JAR file, to make the effort required as minimal as possible.

Kudos to the authors for enabling others to duplicate their work!

Interesting to think of processing topics as sub-graphs consisting only of the subject identity properties, deferring processing of other properties until the topic is requested.

Comparing MongoDB, MySQL, and TokuMX Data Layout

Tuesday, July 30th, 2013

Comparing MongoDB, MySQL, and TokuMX Data Layout by Zardosht Kasheff.

From the post:

A lot is said about the differences in the data between MySQL and MongoDB. Things such as “MongoDB is document based”, “MySQL is relational”, “InnoDB has a clustering key”, etc.. Some may wonder how TokuDB, our MySQL storage engine, and TokuMX, our MongoDB product, fit in with these data layouts. I could not find anything describing the differences with a simple google search, so I figured I’d write a post explaining how things compare.

So who are the players here? With MySQL, users are likely familiar with two storage engines: MyISAM, the original default up until MySQL 5.5, and InnoDB, the current default since MySQL 5.5. MongoDB has only one storage engine, and we’ll refer to it as “vanilla Mongo storage”. And of course, there is TokuDB for MySQL, and TokuMX.

First, let’s get some quick terminology out of the way. Documents and collections in MongoDB can be thought of as rows and tables in MySQL, respectively. And while not identical, fields in MongoDB are similar to columns in MySQL. A full SQL to MongoDB mapping can be found here. When I refer to MySQL, what I say applies to TokuDB, InnoDB, and MyISAM. When I say MongoDB, what I say applies to TokuMX and vanilla Mongo storage.

Great contrast of MongoDB and MySQL data formats.

Data formats are essential to understanding the capabilities and limitations of any software package.
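The post’s terminology mapping is easy to pin down in a sketch. The record shapes below are invented for illustration:

```python
# The terminology mapping from the post, as a lookup table.
SQL_TO_MONGO = {
    "table": "collection",
    "row": "document",
    "column": "field",
}

# The same record in both shapes: a MySQL-style row (fixed columns)
# versus a MongoDB-style document (free-form, possibly nested fields).
columns = ("id", "name", "age")
row = (1, "Ann", 34)
document = {"_id": 1, "name": "Ann", "age": 34, "tags": ["admin"]}

# A row only becomes self-describing once paired with its column names;
# a document carries its field names with it.
as_dict = dict(zip(columns, row))
print(SQL_TO_MONGO["row"], as_dict["name"] == document["name"])
```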

Cryptology ePrint Archive

Tuesday, July 30th, 2013

Cryptology ePrint Archive

As a result of finding the paper in: Subject Identity Obfuscation?, I stumbled upon the Cryptography ePrint Archive.

A forbidding title list awaits, unless you are a cryptography expert.

Still, a resource to be aware of for the latest developments in cryptography.

Orwell’s Nineteen Eighty-Four has moved from fiction to fact, more or less everywhere.

Savvy clients/customers will expect you to secure their data against surveillance by competitors and governments.

Reading in the Wall Street Journal stories like:

Microsoft provided a lengthy statement to the Guardian and other news outlets at the time the story was published. Microsoft on Tuesday released a blog post that largely repeated its earlier statements.

In the post Tuesday, for the first time, the company did address the encryption-cracking issue. Microsoft said in its statement that it “does not provide any government with the ability to break the encryption, nor does it provide the government with the encryption keys.”

Yet that’s not exactly what the Guardian claimed. The Guardian said Microsoft worked with the FBI to “come up with a solution that allowed the NSA to circumvent encryption” on online chats via Microsoft’s Web-based email service.

is not going to increase the good will of your clients/customers.

Subject Identity Obfuscation?

Tuesday, July 30th, 2013

Computer Scientists Develop ‘Mathematical Jigsaw Puzzles’ to Encrypt Software

From the post:

UCLA computer science professor Amit Sahai and a team of researchers have designed a system to encrypt software so that it only allows someone to use a program as intended while preventing any deciphering of the code behind it. This is known in computer science as “software obfuscation,” and it is the first time it has been accomplished.

It was the line “…and this is the first time it has been accomplished.” that caught my attention.

I could name several popular scripting languages, at the expense of starting a flame war, that would qualify as “software obfuscation.” 😉

Further from the post:

According to Sahai, previously developed techniques for obfuscation presented only a “speed bump,” forcing an attacker to spend some effort, perhaps a few days, trying to reverse-engineer the software. The new system, he said, puts up an “iron wall,” making it impossible for an adversary to reverse-engineer the software without solving mathematical problems that take hundreds of years to work out on today’s computers — a game-change in the field of cryptography.

The researchers said their mathematical obfuscation mechanism can be used to protect intellectual property by preventing the theft of new algorithms and by hiding the vulnerability a software patch is designed to repair when the patch is distributed.

“You write your software in a nice, reasonable, human-understandable way and then feed that software to our system,” Sahai said. “It will output this mathematically transformed piece of software that would be equivalent in functionality, but when you look at it, you would have no idea what it’s doing.”

The key to this successful obfuscation mechanism is a new type of “multilinear jigsaw puzzle.” Through this mechanism, attempts to find out why and how the software works will be thwarted with only a nonsensical jumble of numbers.

The paper has this title: Candidate Indistinguishability Obfuscation and Functional Encryption for all circuits, by Sanjam Garg, Craig Gentry, Shai Halevi, Mariana Raykova, Amit Sahai, and Brent Waters.


In this work, we study indistinguishability obfuscation and functional encryption for general circuits:

Indistinguishability obfuscation requires that given any two equivalent circuits C_0 and C_1 of similar size, the obfuscations of C_0 and C_1 should be computationally indistinguishable.

In functional encryption, ciphertexts encrypt inputs x and keys are issued for circuits C. Using the key SK_C to decrypt a ciphertext CT_x = Enc(x), yields the value C(x) but does not reveal anything else about x. Furthermore, no collusion of secret key holders should be able to learn anything more than the union of what they can each learn individually.

We give constructions for indistinguishability obfuscation and functional encryption that supports all polynomial-size circuits. We accomplish this goal in three steps:

  • We describe a candidate construction for indistinguishability obfuscation for NC1 circuits. The security of this construction is based on a new algebraic hardness assumption. The candidate and assumption use a simplified variant of multilinear maps, which we call Multilinear Jigsaw Puzzles.
  • We show how to use indistinguishability obfuscation for NC1 together with Fully Homomorphic Encryption (with decryption in NC1) to achieve indistinguishability obfuscation for all circuits.
  • Finally, we show how to use indistinguishability obfuscation for circuits, public-key encryption, and non-interactive zero knowledge to achieve functional encryption for all circuits. The functional encryption scheme we construct also enjoys succinct ciphertexts, which enables several other applications.
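For the formally inclined, the indistinguishability requirement quoted above is usually written along these lines (a standard asymptotic phrasing, not copied verbatim from the paper):

```latex
\text{For all equivalent circuits } C_0 \equiv C_1 \text{ with } |C_0| = |C_1|,
\text{ and every PPT distinguisher } D:
\quad
\bigl| \Pr[D(i\mathcal{O}(C_0)) = 1] - \Pr[D(i\mathcal{O}(C_1)) = 1] \bigr|
\le \mathrm{negl}(\lambda),
```

where $i\mathcal{O}$ is the obfuscator and $\lambda$ the security parameter. In words: nothing efficiently computable can tell the two obfuscations apart.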

When a paper has a table of contents following the abstract, you know it isn’t a short paper. Forty-three (43) pages counting the supplemental materials. Most of it very heavy sledding.

I think this paper has important implications for sharing topic map based data.

In general as with other data but especially with regard to subject identity and merging rules.

It may well be the case that a subject of interest to you exists in a topic map but if you can’t access its subject identity sufficient to create merging, it will not exist for you.

One can even imagine that a subject may be accessible for screen display but not for copying to a “Snowden drive.” 😉

BTW, I have downloaded a copy of the paper. Suggest you do the same.

Just in case it goes missing several years from now when government security agencies realize its potential.

New Community Forums for Cloudera Customers and Users

Monday, July 29th, 2013

New Community Forums for Cloudera Customers and Users by Justin Kestelyn.

From the post:

This is a great day for technical end-users – developers, admins, analysts, and data scientists alike. Starting now, Cloudera complements its traditional mailing lists with new, feature-rich community forums intended for users of Cloudera’s Platform for Big Data! (Login using your existing credentials or click the link to register.)

Although mailing lists have long been a standard for user interaction, and will undoubtedly continue to be, they have flaws. For example, they lack structure or taxonomy, which makes consumption difficult. Search functionality is often less than stellar and users are unable to build reputations that span an appreciable period of time. For these reasons, although they’re easy to create and manage, mailing lists inherently limit access to knowledge and hence limit adoption.

The new service brings key additions to the conversation: functionality, search, structure and scalability. It is now considerably easier to ask questions, find answers (or questions to answer), follow and share threads, and create a visible and sustainable reputation in the community. And for Cloudera customers, there’s a bonus: your questions will be escalated as bona fide support cases under certain circumstances (see below).

Another way for you to participate in the Hadoop ecosystem!

BTW, on the discussion taxonomy:

What is the reasoning behind your taxonomy?

We made a sincere effort to balance the requirements of simplicity and thoroughness. Of course, we’re always open to suggestions for improvements.

I don’t doubt the sincerity of the taxonomy authors. Not one bit.

But all taxonomies represent the “intuitive” view of some small group. There is no means to escape the narrow view of all taxonomies.

What we can do, at least with topic maps, is to allow groups to have their own taxonomies and to view data through those taxonomies.

Mapping between taxonomies means that addition via any of the taxonomies results in new data appearing as appropriate in other taxonomies.
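The “data added under one taxonomy appears under another” idea in miniature (category names invented for illustration):

```python
# One group's categories mapped onto another group's. Real mappings
# are many-to-many and need curation; this is the one-to-one toy case.
A_TO_B = {"HDFS": "Storage", "Impala": "Query Engines"}

posts_by_a = {"HDFS": ["post-1"], "Impala": ["post-2"]}

def view_through_b(posts_by_a, mapping):
    """Re-present data filed under taxonomy A using taxonomy B's terms."""
    view = {}
    for cat_a, posts in posts_by_a.items():
        view.setdefault(mapping.get(cat_a, cat_a), []).extend(posts)
    return view

print(view_through_b(posts_by_a, A_TO_B))
```

A post filed once, under either vocabulary, is visible through both. That is the whole point.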

Perhaps it was necessary to champion one taxonomy when information systems were fixed, printed representations of data and access systems.

But the need for a single taxonomy, if it ever existed, does not exist now. We are free to have any number of taxonomies for any data set, visible or invisible to other users/taxonomies.

More than thirty (30) years after the invention of the personal computer, we are still laboring under the traditions of printed information systems.

Isn’t it time to move on?

Big Data Garbage In, Even Bigger Garbage Out

Monday, July 29th, 2013

Big Data Garbage In, Even Bigger Garbage Out by Alex Woodie.

From the post:

People are doing some truly amazing things with big data sets and analytic tools. Tools like Hadoop have given us astounding capabilities to drive insights out of huge expanses of loosely structured data. And while the big data breakthroughs are expected to continue, don’t expect any progress to be made against that oldest of computer adages: “garbage in, garbage out.”

In fact, big data may even exacerbate the GIGO problem, according to Andrew Anderson, CEO of Celaton, a UK company that makes software designed to prevent bad data from being introduced into customer’s accounting systems.

“The ideal payoff for accumulating data is rapidly compounding returns,” Anderson writes in an essay on Economia, a publication of a UK accounting association. “By gaining more data on your own business, your clients, and your prospects, the idea is that you can make more informed decisions about your business and theirs based on clear insight. Too often however, these insights are based on invalid data, which can lead to a negative version of this payoff, to the power of ten.”

The problem may compound to the power of 100 if bad data is left to fester. Anderson calls this the “1-10-100 rule.” If a clerk makes a mistake entering data, it costs $1 to fix it immediately. After an hour–when the data has begun propagating across the system–the cost to fix it increases to $10.

Several months later, after the piece of data has become part of the company’s data reality and mailings have gone out to the wrong people and invoices have gone unpaid and new clients have not been contacted about new services, the cost of that single data error balloons to $100.
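The rule itself is one line of arithmetic (the dollar figures are the post’s illustration, not measurements):

```python
def error_cost(stage, base=1.0):
    """1-10-100 rule: fixing a data-entry error costs `base` at entry,
    10x after it propagates, 100x once downstream work depends on it."""
    return base * 10 ** stage

print([error_cost(s) for s in range(3)])  # -> [1.0, 10.0, 100.0]
```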

If you read the essay in Economia, you will find the 1-10-100 rule expressed in British pounds. With the current exchange rate, the cost would be higher here in the United States.

Still, the point is a valid one.

Decisions made on faulty data may be the correct decisions, but your odds worsen as the quality of the data goes down.

Majority voting and information theory: What am I missing?

Monday, July 29th, 2013

Majority voting and information theory: What am I missing? by Panos Ipeirotis.

From the post:

In crowdsourcing, redundancy is a common approach to ensure quality. One of the questions that arises in this setting is the question of equivalence. Let’s assume that a worker has a known probability q of giving a correct answer, when presented with a choice of n possible answers. If I want to simulate one high-quality worker of quality q_h, how many workers of lower quality q_l &lt; q_h do we need?

If you step away from match / no match type merging tests for topics, the question that Panos poses comes into play.

There has been prior work in the area where the question was the impact of quality (q) being less than or greater than 0.5. Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers by Victor S. Sheng, Foster Provost, Panagiotis G. Ipeirotis.

Panos’ question: why can’t he achieve a theoretical quality of 1.0 if he uses two workers with q = 0.85?

I agree that using high-quality workers in series can improve overall results. However, as I responded on his blog post, probabilities are not additive.

They remain probabilities. Two 0.85 workers in series could, on occasion, transmit an answer perfectly. But that is only one possible outcome out of a number of possible outcomes.
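A quick way to see why probabilities are not additive is to compute what q = 0.85 workers actually buy you. This sketch assumes a binary choice, which is simpler than Panos’ n-answer setting:

```python
from math import comb

def majority_correct(n, q):
    """Probability that a majority of n independent workers, each correct
    with probability q, gets a binary answer right (odd n, so no ties)."""
    return sum(comb(n, k) * q**k * (1 - q)**(n - k)
               for k in range(n // 2 + 1, n + 1))

def workers_needed(q_low, q_high):
    """Smallest odd crowd whose majority vote is at least as reliable
    as a single worker of quality q_high."""
    n = 1
    while majority_correct(n, q_low) < q_high:
        n += 2
    return n

# Two q = 0.85 workers are BOTH right only 0.85 * 0.85 = 0.7225 of the
# time -- the probabilities multiply; they never pile up to 1.0.
print(majority_correct(3, 0.85))
print(workers_needed(0.85, 0.95))
```

Three 0.85 workers voting reach roughly 0.94, and five reach about 0.97; the crowd improves on a single worker, but certainty stays out of reach.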

What would your response be?

Twitter4j and Scala

Monday, July 29th, 2013

Using twitter4j with Scala to access streaming tweets by Jason Baldridge.

From the introduction:

My previous post provided a walk-through for using the Twitter streaming API from the command line, but tweets can be more flexibly obtained and processed using an API for accessing Twitter using your programming language of choice. In this tutorial, I walk through basic setup and some simple uses of the twitter4j library with Scala. Much of what I show here should be useful for those using other JVM languages like Clojure and Java. If you haven’t gone through the previous tutorial, have a look now before going on as this tutorial covers much of the same material but using twitter4j rather than HTTP requests.

I’ll introduce code, bit by bit, for accessing the Twitter data in different ways. If you get lost with what should go where, all of the code necessary to run the commands is available in this github gist, so you can compare to that as you move through the tutorial.

Update: The tutorial is set up to take you from nothing to being able to obtain tweets in various ways, but you can also get all the relevant code by looking at the twitter4j-tutorial repository. For this tutorial, the tag is v0.1.0, and you can also download a tarball of that version.

Using Twitter4j with Scala to perform user actions by Jason Baldridge.

From the introduction:

My previous post showed how to use Twitter4j in Scala to access Twitter streams. This post shows how to control a Twitter user’s actions using Twitter4j. The primary purpose of this functionality is perhaps to create interfaces for Twitter like TweetDeck, but it can also be used to create bots that take automated actions on Twitter (one bot I’m playing around with is @tshrdlu, using the code in this tutorial and the code in the tshrdlu repository).

This post will only cover a small portion of the things you can do, but they are some of the more common things and I include a couple of simple but interesting use cases. Once you have these things in place, it is straightforward to figure out how to use the Twitter4j API docs (and Stack Overflow) to do the rest.

Jason continues his tutorial on accessing/processing Twitter streams using Twitter4j and Scala.

Since Twitter has enough status for royal baby names, your data should feel no shame being on Twitter. 😉

Not to mention tweeted IRIs can inform readers of content in excess of one hundred and forty (140) characters in length.

NIH Big Data to Knowledge (BD2K) Initiative [TM Opportunity?]

Sunday, July 28th, 2013

NIH Big Data to Knowledge (BD2K) Initiative by Shar Steed.

From the post:

The National Institutes of Health (NIH) has announced the Centers of Excellence for Big Data Computing in the Biomedical Sciences (U54) funding opportunity announcement, the first in its Big Data to Knowledge (BD2K) Initiative.

The purpose of the BD2K initiative is to help biomedical scientists fully utilize Big Data being generated by research communities. As technology advances, scientists are generating and using large, complex, and diverse datasets, which is making the biomedical research enterprise more data-intensive and data-driven. According to the BD2K website:

[further down in the post]

Data integration: An applicant may propose a Center that will develop efficient and meaningful ways to create connections across data types (i.e., unimodal or multimodal data integration).

That sounds like topic maps, doesn’t it?

At least if we get away from black/white merging practices, where a subject either matches one of a set of IRIs or it doesn’t.

For more details:

A webinar for applicants is scheduled for Thursday, September 12, 2013, from 3 – 4:30 pm EDT. Click here for more information.

Be aware of this workshop:

August 21, 2013 – August 22, 2013
NIH Data Catalogue
Francine Berman, Ph.D.

This workshop seeks to identify the least duplicative and burdensome, and most sustainable and scalable method to create and maintain an NIH Data Catalog. An NIH Data Catalog would make biomedical data findable and citable, as PubMed does for scientific publications, and would link data to relevant grants, publications, software, or other relevant resources. The Data Catalog would be integrated with other BD2K initiatives as part of the broad NIH response to the challenges and opportunities of Big Data and seek to create an ongoing dialog with stakeholders and users from the biomedical community.


Let’s see: “…least duplicative and burdensome, and most sustainable and scalable method to create and maintain an NIH Data Catalog.”

Recast existing data as RDF with a suitable OWL ontology: duplicative, burdensome, not sustainable or scalable.

Accept existing data as it stands and write subject identity and merging rules: non-duplicative; existing systems persist, so less burdensome; re-use of existing data = sustainable; the only open question is scalability.
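The second approach can be sketched in a few lines of Python. Everything here (the field names, the identity rule) is invented for illustration; the point is only that records keep their original form while an identity rule decides what merges:

```python
# Toy sketch of rule-based subject merging: records keep their
# original schemas; an identity rule maps each record to a subject key.

def identity_key(record):
    # Invented identity rule for this sketch: records are about the
    # same subject if they share a normalized gene symbol, whatever
    # field name their source system used for it.
    for field in ("gene_symbol", "symbol", "gene"):
        if field in record:
            return record[field].strip().upper()
    return None

def merge(records):
    """Group records by subject identity without rewriting them."""
    subjects = {}
    for rec in records:
        subjects.setdefault(identity_key(rec), []).append(rec)
    return subjects

# Two sources, two schemas, no recasting of either.
source_a = [{"gene_symbol": "brca1", "assay": "expression"}]
source_b = [{"symbol": "BRCA1", "publication": "PMID:12345"}]

merged = merge(source_a + source_b)
print(len(merged["BRCA1"]))  # 2 -- both records attach to one subject
```

Scaling then becomes a question of distributing the key computation, not of re-encoding the data.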

Sounds like a topic map opportunity to me.


Death & Taxes 2014 Poster and Interview

Sunday, July 28th, 2013

Death and Taxes by Randy Krum.

The new 2014 Death & Taxes poster has been released, and it is fantastic! Visualizing the President’s proposed budget for next year, each department and major expense item is represented with proportionally sized circles so the viewer can understand how big they are in comparison to the rest of the budget.

You can purchase the 24” x 36” printed poster for $24.95.

Great poster, even if I disagree with some of the arrangement of agencies. Homeland Security, for example, should be grouped with the military on the left side of the poster.

If you are an interactive graphics type, it would be really cool to have sliders for the agency budgets that display the results of changes.

Say we took $30 billion from the Department of Homeland Security and gave it to NASA. What space projects, scientific research, or rebuilding of higher education would that shift fund?

I’m not sure how you would graphically represent fewer delays at airports, no groping of children (no TSA), etc.

Also interesting from a subject identity perspective.

Identifying specific programs can be done by budget numbers, for example.

But here the question would be: How much funding results in program N being included in the “potentially” funded set of programs?

Unless every request is funded, there would have to be a ranking of requests against some fixed budget allocation.

This is another aspect of Steve Pepper’s question about types being a binary choice in the current topic map model.

Very few real world choices, or should I say the basis for real world choices, are ever that clear.

Watching the Watchers

Saturday, July 27th, 2013

Hillbilly Tracking of Low Earth Orbit Satellites by Travis Goodspeed.

From the post:

At Black Hat DC in 2008, I watched Adam Laurie present a tool for mapping Ku-band satellite downlinks, which he has since rewritten as Satmap. His technique involves using a DVB-S card in a Linux computer as a receiver through a 90cm Ku-band dish with fixed elevation and a DiSEqC motor for azimuth motion. It was among the most inspirational talks I’d ever seen, and I had a blast recreating his setup and scanning the friendly skies. However, such a rig is limited to geostationary satellites in a small region of the sky; I wanted to see the whole sky, especially the moving targets.

In this article, I’ll demonstrate a method for modifying a naval telecommunications dish to track moving targets in the sky, such as those in Low Earth Orbit. My dish happily sits in Tennessee, while I direct it using my laptop or cellphone here in Europe. It can also run unattended, tracking moving targets and looking for downlink channels.

Low Earth orbit satellites? Oh, yeah, the ones that take high resolution pictures for surveillance purposes.

Probably good practice to developing a low-cost drone detection unit.

Not hard to imagine a loosely organized group sharing signal records for big data crunching and creation of flight maps.

I first saw this in Pete Warden’s Five Short Links for July 23, 2013.

Good UI

Saturday, July 27th, 2013

Good UI

Sixteen tips on creating a better UI, with more promised to be on the way.

Newsletter promises two (2) per month.

The test of a UI is not whether the designer or you find it intuitive.

The test of a UI is whether an untrained user finds it intuitive.

I first saw this at Nat Torkington’s Four short links: 26 July 2013.

…Spatial Analytics with Hive and Hadoop

Saturday, July 27th, 2013

How To Perform Spatial Analytics with Hive and Hadoop by Carter Shanklin.

From the post:

One of the big opportunities that Hadoop provides is the processing power to unlock value in big datasets of varying types from the ‘old’ such as web clickstream and server logs, to the new such as sensor data and geolocation data.

The explosion of smart phones in the consumer space (and smart devices of all kinds more generally) has continued to accelerate the next generation of apps such as Foursquare and Uber which depend on the processing of and insight from huge volumes of incoming data.

In the slides below we look at a sample, anonymized data set from Uber that is available on Infochimps. We step through the basics of analyzing the data in Hive and learn how to use spatial analysis to decide whether a new product offering is viable or not.

Great tutorial and slides!

My only reservation is the use of geo-location data to make a judgement about the potential for a new ride service.

Geo-location data is only one way to determine the potential for a ride service. Surveying potential riders would be another.

Or to put it another way, having data to crunch doesn’t mean crunching data will lead to the best answer.
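For readers who want the flavor of the tutorial without standing up Hive, its core trick, binning latitude/longitude pairs into grid cells and counting events per cell, can be sketched in plain Python. The cell size and sample coordinates below are invented:

```python
# Sketch of the spatial-binning idea behind the Hive tutorial: snap
# each latitude/longitude to a grid cell and count events per cell
# to see where demand concentrates.

from collections import Counter

def cell(lat, lon, size=0.01):
    """Snap a coordinate to the corner of its grid cell."""
    return (round(lat // size * size, 4), round(lon // size * size, 4))

pickups = [
    (37.7750, -122.4194),   # two points close together downtown...
    (37.7752, -122.4190),
    (37.8044, -122.2711),   # ...and one across the bay
]

counts = Counter(cell(lat, lon) for lat, lon in pickups)
print(counts.most_common(1)[0][1])  # 2 -- the downtown cell dominates
```

In Hive the same snap-to-cell expression goes in a GROUP BY; the logic is identical.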

Targeting Phishing Victims

Friday, July 26th, 2013

Profile of Likely E-mail Phishing Victims Emerges in Human Factors/Ergonomics Research

From the webpage:

The author of a paper to be presented at the upcoming 2013 International Human Factors and Ergonomics Society Annual Meeting has described behavioral, cognitive, and perceptual attributes of e-mail users who are vulnerable to phishing attacks. Phishing is the use of fraudulent e-mail correspondence to obtain passwords and credit card information, or to send viruses.

In “Keeping Up With the Joneses: Assessing Phishing Susceptibility in an E-mail Task,” Kyung Wha Hong, Christopher M. Kelley, Rucha Tembe, Emerson Murphy-Hill, and Christopher B. Mayhorn discovered that people who were overconfident, introverted, or women were less able to accurately distinguish between legitimate and phishing e-mails. She had participants complete a personality survey and then asked them to scan through both legitimate and phishing e-mails and either delete suspicious or spam e-mails, leave legitimate e-mails as is, or mark e-mails that required actions or responses as “important.”

“The results showed a disconnect between confidence and actual skill, as the majority of participants were not only susceptible to attacks but also overconfident in their ability to protect themselves,” says Hong. Although 89% of the participants indicated they were confident in their ability to identify malicious e-mails, 92% of them misclassified phishing e-mails. Almost 52% in the study misclassified more than half the phishing e-mails, and 54% deleted at least one authentic e-mail.

I would say that “behavioral, cognitive, and perceptual attributes” are a basis for identifying users. Or at least a certain type of users as a class.

Or to put it another way, a class of users is just as much a subject for discussion in a topic map as any user individually.

It may be more important, whether targeting users for exploitation or for protection, to treat them as a class rather than as individuals.

BTW, these attributes don’t sound amenable to IRI identifiers or binary assignment choices.

Welcome BigCouch

Friday, July 26th, 2013

Welcome BigCouch

From the post:

Good news! Cloudant has announced the completion of the BigCouch merge. This is a huge step forward for CouchDB. So thank you to Cloudant, and thank you to the committers (particularly Robert Newson and Paul Davis) who slogged (and travelled the world to pair with each other) to make this happen.

What does this mean? Well, right now, the code is merged, but not released. So hold your clicks just a moment! Once the code has been tested, we will include it in one of our regular releases. (If you want to help us test, hop on to the dev@ mailing list!)

What’s new? The key accomplishment of the merged code is that BigCouch’s clustering capability, along with the rest of Cloudant’s other enhancements to CouchDB’s code base, will now be available in Apache CouchDB. This also includes improvements in compaction and replication speed, as well as boosts for high-concurrency access performance.

Painless replication has always been CouchDB’s biggest feature. Now we get to take advantage of Cloudant’s experience running large distributed clusters in production for four years. With BigCouch merged in, CouchDB will be able to replicate data at a much larger scale.

But wait! That’s not all! Cloudant has decided to terminate their BigCouch fork of CouchDB, and instead focus future development on Apache CouchDB. This is excellent news for CouchDB, even more excellent news for the CouchDB community.

Just a quick reminder about the CouchTM project that used CouchDB as its backend.

Fingerprinting Data/Relationships/Subjects?

Friday, July 26th, 2013

Virtual image library fingerprints data

From the post:

It’s inevitable. Servers crash. Applications misbehave. Even if you troubleshoot and figure out the problem, the process of problem diagnosis will likely involve numerous investigative actions to examine the configurations of one or more systems—all of which would be difficult to describe in any meaningful way. And every time you encounter a similar problem, you could end up repeating the same complex process of problem diagnosis and remediation.

As someone who deals with just such scenarios in my role as manager of the Scalable Datacenter Analytics Department at IBM Research, my team and I realized we needed a way to “fingerprint” known bad configuration states of systems. This way, we could reduce the problem diagnosis time by relying on fingerprint recognition techniques to narrow the search space.

Project Origami was thus born from this desire to develop an easier-to-use problem diagnosis system to troubleshoot misconfiguration problems in the data center. Origami, today a collaboration between IBM Open Collaborative Research, Carnegie Mellon University, the University of Toronto, and the University of California at San Diego, is a collection of tools for fingerprinting, discovering, and mining configuration information on a data center-wide scale. It uses the public domain virtual image library Olive, an idea created under this Open Collaborative Research a few years ago.

It even provides an ad-hoc interface to the users, as there is no rule language for them to learn. Instead, users give Origami an example of what they deem to be a bad configuration, which Origami fingerprints and adds to its knowledge base. Origami then continuously crawls systems in the data center, monitoring the environment for configuration patterns that match known bad fingerprints in its knowledge base. A match triggers deeper analytics that then examine those systems for problematic configuration settings.

Identifications of data, relationships and subjects could be expressed as “fingerprints.”

Searching by “fingerprints” would be far easier than any query language.
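As a toy illustration of the fingerprinting idea (not Origami’s actual mechanism, which is far richer), one can hash a normalized configuration and check new systems against a knowledge base of known-bad fingerprints. The configuration keys and the exact-match scheme below are invented for this sketch:

```python
# Toy version of configuration "fingerprinting": canonicalize a
# config, hash it, and match systems against known-bad fingerprints.

import hashlib
import json

def fingerprint(config):
    """Stable hash of a configuration dictionary (key order ignored)."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# A user flags one example of a bad state; we store its fingerprint.
known_bad = {fingerprint({"swappiness": 100, "somaxconn": 16})}

def scan(system_config):
    """Does this system match any known-bad fingerprint?"""
    return fingerprint(system_config) in known_bad

print(scan({"somaxconn": 16, "swappiness": 100}))   # True: same state
print(scan({"somaxconn": 1024, "swappiness": 10}))  # False
```

The appeal for users is exactly what the post describes: they supply an example, not a query, and matching does the rest.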

The reasoning: searching already challenges users to bridge the semantic gap between themselves and content authors.

Query languages add another semantic gap, between users and query language designers.

Why useful results are obtained at all using query languages remains unexplained.

Announcing Solr Usability contest

Friday, July 26th, 2013

Announcing Solr Usability contest by Alexandre Rafalovitch.

From the post:

In collaboration with Packt Publishing and to celebrate the release of my new book Instant Apache Solr for Indexing Data How-to, we are organizing a contest to collect Solr Usability ideas.

I have written about the reasons behind the book before and the contest builds on that idea. Basically, I feel that a lot of people are able to start with Solr and get basic setup running, either directly or as part of other projects Solr is in. But then, they get stuck at a local-maximum of their understanding and have difficulty moving forward because they don’t fully comprehend how their configuration actually works or which of the parameters can be tuned to get results. And the difficulty is even greater when the initial Solr configuration is generated by an external system, such as Nutch, Drupal or SiteCore automatically behind the scenes.

The contest will run for 4 weeks (until mid-August 2013) and people suggesting the five ideas with most votes will get free electronic copies of my book. Of course, if you want to get the book now, feel free. I’ll make sure you will get rewarded in some other way, such as through advanced access to the upcoming Solr tools like SolrLint.

The results of the contest will be analyzed and fed into Solr improvement by better documentation, focused articles or feature requests on issue trackers. The end goal is not to give away a couple of books. There are much easier ways to do that. The goal is to improve Solr with specific focus on learning curve and easy adoption and integration.

Only five (5) suggestions so far?

Solr must have better tuning documentation than I have found. 😉

Do you have a suggestion?

Lucene/Solr Revolution EU 2013 – Reminder

Friday, July 26th, 2013

Lucene/Solr Revolution EU 2013 – Reminder

The deadline for submitting an abstract is August 2, 2013.

Key Dates:

June 3, 2013: CFP opens
August 2, 2013: CFP closes
August 12, 2013: Community voting begins
September 1, 2013: Community voting ends
September 22, 2013: All speakers notified of submission status

Top Five Reasons to Attend (according to conference organizers):

  • Learn:  Meet, socialize, collaborate, and network with fellow Lucene/Solr enthusiasts.
  • Innovate:  From field-collapsing to flexible indexing to integration with NoSQL technologies, you get the freshest thinking on solving the deepest, most interesting problems in open source search and big data.
  • Connect: The power of open source is demolishing traditional barriers and forging new opportunity for killer code and new search apps.
  • Enjoy:  We’ve scheduled fun into the conference! Networking breaks, Stump-the-Chump, Lightning talks and a big conference party!
  • Save:  Take advantage of packaged deals on accelerated two-day training workshops, coupled with conference sessions on real-world implementations presented by Solr/Lucene experts.

Let’s be honest. The real reason to attend is Dublin, Ireland in early November. (On average, 22 rainy days in November.) 😉

Take an umbrella, extra sweater or coat and enjoy!

Classification accuracy is not enough

Thursday, July 25th, 2013

Classification accuracy is not enough by Bob L. Sturm.

From the post:

Finally published is my article, Classification accuracy is not enough: On the evaluation of music genre recognition systems. I made it completely open access and free for anyone.

Some background: In my paper Two Systems for Automatic Music Genre Recognition: What Are They Really Recognizing?, I perform three different experiments to determine how well two state-of-the-art systems for music genre recognition are recognizing genre. In the first experiment, I find the two systems are consistently making extremely bad misclassifications. In the second experiment, I find the two systems can be fooled by such simple transformations that they cannot possibly be listening to the music. In the third experiment, I find their internal models of the genres do not match how humans think the genres sound. Hence, it appears that the systems are not recognizing genre in the least. However, this seems to contradict the fact that they achieve extremely good classification accuracies, and have been touted as superior solutions in the literature. Turns out, Classification accuracy is not enough!


I look closely at what kinds of mistakes the systems make, and find they all make very poor yet “confident” mistakes. I demonstrate the latter by looking at the decision statistics of the systems. There is little difference for a system between making a correct classification, and an incorrect one. To judge how poor the mistakes are, I test with humans whether the labels selected by the classifiers describe the music. Test subjects listen to a music excerpt and select between two labels which they think was given by a human. Not one of the systems fooled anyone. Hence, while all the systems had good classification accuracies, good precisions, recalls, and F-scores, and confusion matrices that appeared to make sense, a deeper evaluation shows that none of them are recognizing genre, and thus that none of them are even addressing the problem. (They are all horses, making decisions based on irrelevant but confounded factors.)


If you have ever wondered what a detailed review of classification efforts would look like, you need wonder no longer!

Bob’s Two Systems for Automatic Music Genre Recognition: What Are They Really Recognizing? is thirty-six (36) pages that examines efforts at music genre recognition (MGR) in detail.

I would highly recommend this paper as a demonstration of good research technique.
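Sturm’s central point, that a good accuracy number can mask a system that is not solving the problem, is easy to demonstrate with a toy example in pure Python (the labels and counts are invented):

```python
# Toy demonstration that accuracy alone can mislead: on a test set of
# 90 "rock" and 10 "jazz" excerpts, a degenerate classifier that
# always answers "rock" scores 90% accuracy while never recognizing
# jazz at all.

truth = ["rock"] * 90 + ["jazz"] * 10
predictions = ["rock"] * 100  # the "classifier" ignores its input

accuracy = sum(t == p for t, p in zip(truth, predictions)) / len(truth)

def recall(label):
    """Fraction of excerpts with this true label that were recovered."""
    preds = [p for t, p in zip(truth, predictions) if t == label]
    return sum(p == label for p in preds) / len(preds)

print(accuracy)        # 0.9
print(recall("jazz"))  # 0.0
```

Sturm’s experiments are much subtler (irrelevant confounds, not class skew), but the moral is the same: look past the headline number at what the system actually does per class.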

Sense

Thursday, July 25th, 2013

Sense

From the webpage:

A JSON aware developer console to ElasticSearch.

A JSON aware interface to ElasticSearch. Comes with handy machinery such as syntax highlighting, autocomplete, formatting and code folding.

Once installed, you can click on the ElasticSearch icon next to your url bar to open Sense in a new tab.

Works with Chrome.

Has good reviews.

Do you know of a similar tool for Solr?

Made with D3.js

Thursday, July 25th, 2013

Made with D3.js Curated by Scott Murray.

From the webpage:

This gallery showcases a range of projects made with D3, arguably the most powerful JavaScript library for making visualizations on the web. Unlike many other software packages, D3 has a broad, interdisciplinary appeal. Released officially only in 2011, D3 has quickly been adopted as a tool of choice by practitioners creating interactive visualizations to be published on the web. And since D3 uses only JavaScript and web standards built into every current browser, no plug-ins are needed, and projects will typically run well on mobile devices. It can be used for dry quantitative charts, of course, but D3 really shines for custom work. Here is a selection work that shows off some of D3’s strengths.

Examples of the capabilities of D3.js.

These images may not accurately reflect your level of artistic talent.

Comparing text to data by importing tags

Thursday, July 25th, 2013

Comparing text to data by importing tags by Jonathan Stray.

From the post:

Overview sorts documents into folders based on the topic of each document, as determined by analyzing every word in each document. But it can also be used to see how the document text relates to the date of publication, document type, or any other field related to each document.

This is possible because Overview can import tags. To use this feature, you will need to get your documents into a CSV file, which is a simple rows-and-columns spreadsheet format. As usual, the text of each document goes in the “text” column. But you can also add a “tags” column which gives the tag or tags to be initially assigned to each document, separated by commas if more than one.

Jonathan demonstrates this technique on the Afghanistan War Logs.
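The CSV layout described above, a “text” column plus a comma-separated “tags” column, can be generated with a short script; the documents and tags below are invented:

```python
# Build a CSV in the layout Overview imports: a "text" column for the
# document body and a "tags" column of comma-separated initial tags.

import csv
import io

docs = [
    {"text": "Patrol report from Kandahar...", "tags": "2009,report"},
    {"text": "Threat assessment summary...", "tags": "2010,assessment"},
]

buf = io.StringIO()  # use open("docs.csv", "w", newline="") for a real file
writer = csv.DictWriter(buf, fieldnames=["text", "tags"])
writer.writeheader()
writer.writerows(docs)

# Reading it back shows each row carries its initial tags; the csv
# module quotes the tags field so its internal commas survive.
rows = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(rows[0]["tags"])  # 2009,report
```

Any metadata you can compute per document (date, source, type) can be folded into the tags column the same way.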

Associations at the level of a document are useful.

Such as Jonathan suggests, document + date of publication; document + document type, etc.

But doesn’t that leave the reader with the last semantic mile to travel on their own?

That is, I would rather have document + source/author + term in document + date of publication, and a host of other associations, represented.

Otherwise, once I find the document, using tags perhaps, I have to retrace the steps of anyone who discovered the “document + source/author + term in document + date of publication” relationship before I did.

And anyone following me will have to retrace my steps.

How many searches get retraced in your department every month?