Archive for January, 2012

Multiple Recognitions

Tuesday, January 31st, 2012

Yesterday in The “L&O” Shortage I asked the question:

“…can something be recognized more than once?”

That may not be an artful way to frame the question. Perhaps better:

When an author uses some means for identification, whatever that may be, can it be recognized differently by different users?

One case that comes to mind is the interpretation of Egyptian hieroglyphics over time. In addition to the attempts of the 16th and 17th centuries, which are now thought to be completely fanciful, there are the modern “accepted” translations, as well as ancient Egyptian texts where it appears the scribe did not understand what was being copied.

If we are going to faithfully record the history of interpretation of such literature, we cannot flatten the “translated” texts to have the meanings we would assign to them today. The references of the then current literature would make no sense if we did.

Google Books is a valuable service but also a dangerous one for research purposes, in part because semantic drift occurs in any living language (or in the interpretation of dead ones) and results are reported without any warning about such shifts.

Did you know, for example, that “cab” was at one time a slang reference to a house of prostitution? That gives new meaning to the statement “I will call you a cab,” doesn’t it?

Before we can assign semantics to any word, we need to know what is being identified by that word, while knowing that any one word may represent multiple identifications.

Requirement: A system of identification must support the same identifiers resolving to different identifications.
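A toy sketch of what that requirement implies. The “scope” key that keeps resolutions apart, and the readings themselves, are mine, purely for illustration:

```python
from collections import defaultdict

# Requirement: the same identifier may resolve to different
# identifications. A scope key keeps the resolutions apart.
# Scope names and readings here are invented for illustration.
resolutions = defaultdict(dict)

def register(identifier, scope, identification):
    resolutions[identifier][scope] = identification

def resolve(identifier, scope):
    return resolutions[identifier].get(scope)

register("cab", "modern English", "a taxi")
register("cab", "period slang", "a house of prostitution")

print(resolve("cab", "modern English"))  # a taxi
print(resolve("cab", "period slang"))    # a house of prostitution
```

The point is not the data structure but the contract: a lookup without some notion of scope or context cannot honor the requirement.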

The consequences of deciding otherwise on such a requirement, I will try to take up tomorrow.

Search and Exogenous Complexity – (inside vs. outside?)

Tuesday, January 31st, 2012

Search and Exogenous Complexity

Stephen Arnold writes:

I am now using the phrase “exogenous complexity” to describe systems, methods, processes, and procedures which are likely to fail due to outside factors. This initial post focuses on indexing, but I will extend the concept to other content centric applications in the future. Disagree with me? Use the comments section of this blog, please.

What is an outside factor?

Let’s think about value adding indexing, content enrichment, or metatagging. The idea is that unstructured text contains entities, facts, bound phrases, and other identifiable entities. A key word search system is mostly blind to the meaning of a number in the form nnn nn nnnn, which in the United States is the pattern for a Social Security Number. There are similar patterns in Federal Express, financial, and other types of sequences. The idea is that a system will recognize these strings and tag them appropriately; for example:

nnn nn nnn Social Security Number

Thus, a query for Social Security Numbers will return a string of digits matching the pattern. The same logic can be applied to certain entities and with the help of a knowledge base, Bayesian numerical recipes, and other techniques such as synonym expansion determine that a query for Obama residence will return White House or a query for the White House will return links to the Obama residence.
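The pattern-tagging idea in the quoted passage can be sketched in a few lines. The regex and the tag label are illustrative, not any vendor’s actual implementation:

```python
import re

# Tag strings matching the nnn nn nnnn layout (the US Social
# Security Number pattern) so a later query for "Social Security
# Number" can return them. Pattern and tag are illustrative.
SSN_PATTERN = re.compile(r"\b(\d{3})[- ](\d{2})[- ](\d{4})\b")

def tag_ssns(text):
    return SSN_PATTERN.sub(lambda m: f"<ssn>{m.group(0)}</ssn>", text)

print(tag_ssns("Applicant 078 05 1120 was approved."))
# Applicant <ssn>078 05 1120</ssn> was approved.
```

Federal Express tracking numbers, financial identifiers, and the like would each get their own pattern and tag in the same style.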

I am not sure the inside/outside division is helpful.

For example, Arnold starts with the issue:

First, there is the issue of humans who use language in unexpected or what some poets call “fresh” or “metaphoric” methods. English is synthetic in that any string of sounds can be used in quite unexpected ways.

True, but recall the overloading of owl:sameAs, which is entirely within a semantic system.

I mention that to make the point that while inside/outside may be useful informal metaphors, semantics are with us, always. Even in systems where one or more parties think the semantics are “obvious” or “defined.” Maybe, depends on who you ask.

The second issue is:

Second, there is the quite real problem of figuring out the meaning of short, mostly context free snippets of text.

But isn’t that an inside problem too? Search engines vacuum up content from a variety of contexts, not preserving the context that would make the “snippets of text” meaningful. Snippets of text have very different meanings in comp.compilers than in alt.religion.scientology. It is the searcher’s choice to treat both as a single pile of text.

His third point is:

Third, there is the issue of people and companies desperate for a solution or desperate for revenue. The coin has two sides. Individuals who are looking for a silver bullet find vendors who promise not just one silver bullet but an ammunition belt stuffed with the rounds. The buyers and vendors act out a digital kabuki.

But isn’t this an issue of design and requirements, which are “inside” issues as well?

No system can meet a requirement for universal semantic resolution with little or no human involvement. The questions are: How much better information do you need? How much are you willing to pay for it? There is no free lunch when it comes to semantics, ever. That includes the semantics of the systems we use and the information to which they are applied.

The requirements for any search system must address both “inside” and “outside” issues and semantics.

(Apologies for the length but semantic complexity is one of my pet topics.)

The Heat in SharePoint Semantics: January 20 – January 27

Tuesday, January 31st, 2012

The Heat in SharePoint Semantics: January 20 – January 27

Stephen Arnold writes:

As always, SharePoint Semantics has delivered many posts that are vitally important to both SharePoint end users and search enthusiasts alike.

Read Stephen’s post and then see: SharePoint Semantics for yourself.

From the tone of the posts I would say there are at least two very large issues that topic maps can address:

First, there is the issue of working with SharePoint itself. From these posts and other reports, it would be very generous to say that SharePoint has “documentation.” True, there are materials that come with it, but either they don’t answer the questions users have or they don’t answer any questions at all. Opinions differ.

Using a topic map to provide a portal with useful and findable information about SharePoint itself seems like an immediate commercial opportunity. I suspect that, like most technical advice sites, you would have to rely on ad revenue, but from the numbers it looks like the number of people needing SharePoint help is only going to increase.

Second, it is readily apparent that it is one thing to create data and store it in SharePoint. It is quite another to make that information findable by others.

I don’t think that is entirely a matter of poor design or coding on the part of MS. I have never seen a useful SharePoint site but site design is left up to users. Even MS can’t increase the information management capabilities of the average user. Or at least I have never seen MS software have that result. 😉

The findability inside a SharePoint installation is an issue that topic maps can address. Like SharePoint, topic maps won’t make users more capable but they can put better tools at their disposal to assist in finding data. That isn’t speculation on my part, there is at least one topic map vendor that provides that sort of service for SharePoint installations.

At the risk of sounding repetitive, I think offering better findability with topic maps isn’t going to be sufficient to drive market adoption. On the other hand, enhancing findability within contexts and applications that users are already using may be the sweet spot we have been looking for.

Wordmap Taxonomy Connectors for SharePoint and Endeca (Logician/Ontologist not included or required)

Tuesday, January 31st, 2012

Wordmap Taxonomy Connectors for SharePoint and Endeca

From the post:

Wordmap SharePoint Taxonomy Connector

Integrate Wordmap taxonomies directly with Microsoft® SharePoint to classify documents as well as support SharePoint browsing and search capabilities. The SharePoint Taxonomy Connector allows you to overcome many of the difficulties of managing taxonomies inside the SharePoint environment by allowing you to use Wordmap to do all of the daily taxonomy management tasks.

Wordmap Endeca Taxonomy Connector

The Endeca® Information Access Platform thrives on robust, well-constructed and well-maintained taxonomies. Use Wordmap to do your taxonomy management and allow our Endeca Taxonomy Connector to push the taxonomy to Endeca as the foundation of the guided navigation experience. Wordmap’s Endeca Taxonomy Connector directly translates your taxonomies into the Endeca dimension.xsd format, pulled into Endeca on system start-up. The Wordmap Endeca Taxonomy Connector also allows you to leverage taxonomy in Endeca’s powerful auto-classification engine for improved content indexing.

Wordmap Taxonomy Connector Highlights:

  • No configuration needed for consuming systems,
  • Wordmap taxonomy data is sent in the preferred format of the search application for easy ingestion
  • Manage the taxonomy centrally and push out only relevant sections for indexing, navigation and search
  • Taxonomy is seamlessly integrated into the content lifecycle

Definitely a step in the right direction.

Points to consider as you plan your next topic map project:

  1. Require no configuration for consuming systems,
  2. Send in the preferred format of the search application for easy ingestion
  3. Taxonomy not managed by end users, automatically push out relevant sections for indexing, navigation and search
  4. Seamlessly integrate topic map into the content lifecycle

Interesting that “tagging” of the data is a requirement for use of this tool. I would have thought otherwise.

Any pointers to how often this has been chosen as a solution? The last entry on their news page is dated in 2009. Which may mean they don’t keep up their website very well or that they aren’t active enough to have any news.

They are owned by Earley and Associates, which does have an active website but I still didn’t see much news about Wordmap. Searching turned up some old materials but nothing really recent.

Little orange circles spell trouble

Tuesday, January 31st, 2012

Little orange circles spell trouble from Junk Charts.

From the post:

Economists Banerjee and Duflo are celebrated for their work on global poverty but they won’t be winning awards for the graphics on the website that supports their book “Poor Economics” (link). Thanks to a reader for the pointer.

Since it is visual you will have to go to the post (and the book’s website) to get the full impact of what the Junk Charts post is addressing.

Take this as fair warning: A graphic may make perfect sense to you and be nearly unintelligible to anyone else. Even being forewarned by the Junk Chart post, I was unprepared for how difficult and confusing the graphic was in fact.

It is the sort of graphic that I would expect to see on Jon Stewart, mocking some economic report.

Inside the Variation Toolkit: Tools for Gene Ontology

Tuesday, January 31st, 2012

Inside the Variation Toolkit: Tools for Gene Ontology by Pierre Lindenbaum.

From the post:

GeneOntologyDbManager is a C++ tool that is part of my experimental Variation Toolkit.

This program is a set of tools for GeneOntology, it is based on the sqlite3 library.

Pierre walks through building and using his GeneOntologyDbManager.

Rather appropriate to mention an area (bioinformatics) that is exploding with information on the same day as GPU and database posts. Plus I am sure you will find the Gene Ontology useful for topic map purposes.

SAP MaxDB Downloads

Tuesday, January 31st, 2012

SAP MaxDB Downloads

In a post I was reading it was mentioned that the download area for this database wasn’t easy to find. Indeed not!

But that, plus a page that says:

SAP MaxDB is the database management system developed and supported by SAP AG. It can be used as an alternative to databases from other vendors for your own or third-party applications.

Note: These installation packages can be used free of charge according to the SAP Community License Agreement for SAP MaxDB. These packages are not for use with SAP applications. For that purpose, refer to the Download Area in SAP Service Marketplace (login required). (emphasis added)

made it worthy of mention.

I suspect that use with “SAP applications” brings additional features or tighter integration. An interesting approach to the community-versus-commercial-edition practice.

Interested in hearing your experiences with using the SAP MaxDB with topic maps.

Accelerating SQL Database Operations on a GPU with CUDA (merging spreadsheet data?)

Tuesday, January 31st, 2012

Accelerating SQL Database Operations on a GPU with CUDA by Peter Bakkum and Kevin Skadron.

From the abstract:

Prior work has shown dramatic acceleration for various database operations on GPUs, but only using primitives that are not part of conventional database languages such as SQL. This paper implements a subset of the SQLite command processor directly on the GPU. This dramatically reduces the effort required to achieve GPU acceleration by avoiding the need for database programmers to use new programming languages such as CUDA or modify their programs to use non-SQL libraries.

This paper focuses on accelerating SELECT queries and describes the considerations in an efficient GPU implementation of the SQLite command processor. Results on an NVIDIA Tesla C1060 achieve speedups of 20-70X depending on the size of the result set.

Important lessons to be learned from this paper:

  • Don’t invent new languages for the average user to learn.
  • Avoid the need to modify existing programs
  • Write against common software

Remember that 75% of the BI market is still using spreadsheets, for all sorts of data but numeric data in particular.

I don’t have any experience with importing files into Excel, but I assume there is a macro language that can be used to create import processes.

Curious if there has been any work on creating import macros for Excel that incorporate merging as part of those imports?

That would:

  • Not be a new language for users to learn.
  • Avoid modification of existing programs (or data)
  • Be written against common software

I am not sure about the requirements for merging numeric data but that should make the exploration process all the more enjoyable.
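For the merging-on-import idea, here is a minimal sketch in Python rather than an Excel macro language. The file layouts and key column are invented for illustration; the point is only that two imports sharing a key can be merged into one record as they arrive:

```python
import csv
import io

# Two spreadsheet exports that identify the same entities by a
# shared key column; merge their rows on that key at import time.
# Column names and data are invented for illustration.
export_a = io.StringIO("id,name\n1,Acme\n2,Blivet\n")
export_b = io.StringIO("id,revenue\n1,100\n2,250\n")

def rows_by_key(f, key="id"):
    return {row[key]: row for row in csv.DictReader(f)}

merged = rows_by_key(export_a)
for key, row in rows_by_key(export_b).items():
    merged.setdefault(key, {}).update(row)

print(merged["1"])  # {'id': '1', 'name': 'Acme', 'revenue': '100'}
```

Merging on anything subtler than a shared key, say different identifiers for the same subject, is exactly where the topic map questions start.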

PGStrom (PostgreSQL + GPU)

Tuesday, January 31st, 2012


From the webpage:

PG-Strom is a module of FDW (foreign data wrapper) of PostgreSQL database. It was designed to utilize GPU devices to accelarate sequential scan on massive amount of records with complex qualifiers. Its basic concept is CPU and GPU should focus on the workload with their advantage, and perform concurrently. CPU has much more flexibility, thus, it has advantage on complex stuff such as Disk-I/O, on the other hand, GPU has much more parallelism of numerical calculation, thus, it has advantage on massive but simple stuff such as check of qualifiers for each rows.

The below figure is a basic concept of PG-Strom. Now, on sequential scan workload, vanilla PostgreSQL does iteration of fetch a tuple and checks of qualifiers for each tuples. If we could consign GPU the workload of green portion, it enables to reduce workloads of CPU, thus, it shall be able to load more tuples in advance. Eventually, it should allow to provide shorter response-time on complex queries towards large amount of data.

Requires setting up the table for the GPU ahead of time, but the performance increase is reported to be 10x – 20x.
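The division of labor PG-Strom describes can be caricatured in plain Python, with a thread pool standing in for the GPU cores that check qualifiers over blocks of tuples handed off by the “CPU” side. This is a concept sketch only, nothing like the real CUDA implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# Caricature of the PG-Strom split: fetch tuples in blocks, hand
# each block to a pool of workers (standing in for GPU cores)
# that evaluate the qualifier in parallel. Data is invented.
rows = [{"id": i, "value": i * 3 % 101} for i in range(10_000)]

def qualifier(row):
    # The "complex qualifier" checked per tuple.
    return row["value"] > 90

def scan_block(block):
    return [r for r in block if qualifier(r)]

blocks = [rows[i:i + 1000] for i in range(0, len(rows), 1000)]
with ThreadPoolExecutor() as pool:
    matches = [r for part in pool.map(scan_block, blocks) for r in part]

print(len(matches))
```

The real win, of course, comes from the massive parallelism of the GPU, not a thread pool, but the shape of the pipeline is the same.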

It occurs to me that GPUs should be well suited for graph processing. Yes? Will have to look into that and report back.

The Dwarf OLAP Engine

Tuesday, January 31st, 2012

The Dwarf OLAP Engine

From the webpage:

Dwarf is a patented (US Patent 7,133,876) highly compressed structure for computing, storing, and querying Data Cubes. It is a highly compressed structure with reduction reaching 1:1,000,000 depending on the data distribution. The method is based on finding prefix and suffix redundancies in high dimensional data. Prefix redundancies occur in dense areas of the cube and some existing techniques have utilized. However, we discovered suffix dependency is a lot more higher in sparse areas of multi-dimensional space. The two put together fuse the exponential sizes of high dimensional cubes into a dramatically condensed LOSSLESS store.

With the Dwarf technology, we managed to create the first lossless full PetaCube in a Dwarf store of 2.1GBytes and construction time 80 minutes. The PetaCube is on a 25-dimensional fact table which generates a full cube of a Petabyte in size if stored in binary (all possible 2^^25 un-indexed views/summary tables with two aggregate values). This a 1000-fold bigger than Microsoft’s TeraCube of the future. We also surpassed the fastest OLAP Council APB-1 benchmark density 5 published by Oracle. The Dwarf Cube creation time is 20 minutes and the size of it 3GB compared to Oracle’s 4.5 hours and 30+GB. We further pushed the APB-1 benchmark to its maximum possible density 40 in just 7 hours compute time and about 10GB in size. To the best of our knowledge, no one else has even tried this. This enormous storage reduction comes with NO loss of information and provides a fully indexed cube that includes the original fact table.

The most important aspect of this patented Dwarf technology is that its data fusion (prefix and suffix redundancy elimination) is discovered and eliminated BEFORE the cube is computed and this explains the dramatic reduction in compute time. A complete version of the Dwarf Cube software with full support of hierarchies is available to interested parties under an NDA and a 90-day evaluation agreement.

The Dwarf cube was mentioned in a thread on GPUs and database engines with one commenter lamenting the fact it is under patent.

Patented or not, a quick look at the literature and results for the Dwarf Cube make it an “item of interest” for complex data cubes.

True, not “real time” since you have to build the cube but “real time” is not a universal requirement.

Curious if any of the topic map vendors or research labs have investigated the use of Dwarf cubes as delivery structures for topic maps? (Saying delivery since the cubes would lend themselves to delivery of a computed artifact.)

The “L&O” Shortage

Monday, January 30th, 2012

Last week I mentioned that we are facing a critical shortage of both logicians and ontologists: Alarum – World Wide Shortage of Logicians and Ontologists.

This is the first of a number of posts on what we can do, facing this tidal wave of data with nary a logician or ontologist in sight.

I have a question that I think we need to answer before we get to the question of semantics.

Is it fair to say that identification comes before semantics? That is, do we have to recognize something (whatever that may be) before we can talk about its semantics?

I ask because I think it is important to take the requirements for data and its semantics one step at a time. And in particular to not jump ahead of ourselves with half-remembered bits of doggerel from grade school to propose syntactic solutions.

Or to put it differently, let’s make sure of what order steps need to be taken before we trip over our own feet.

That would be the requirements phase, as is well known to the successful programmers and startup folks among the audience.

So, is requirement #1 that something be recognized? Whether that is a file, format, subject of any sort or description. I don’t know but suspect we can’t even use logic on things we have yet to recognize.

Just to give you a hint about tomorrow (or perhaps the next day; I have meetings tomorrow): can something be recognized more than once?

This may seem like a slow start but the time will pass more quickly than you think it will. There are a number of “perennial” issues that I will argue can be side-lined, in part because they have no answer other than personal preference.

The Most Brutal Man Page

Monday, January 30th, 2012

The most brutal man page

John Cook quotes The Linux Command Line by William Shotts as singling out the bash man page as the most brutal of all the man pages.

Maybe so, I don’t ever recall trying to read it in its entirety. But I haven’t made a systematic comparison of all the man pages for that matter.

But let’s take Shotts at his word, that the man page for bash is the worst.

Recalling that topic maps got their start in an X Windows documentation project, it seems appropriate to see if topic maps could provide an “assist” with the most brutal man page ever.

I get seventy (70) pages as a PDF version of the Ubuntu bash man page. (Not the eighty-plus (80+) pages Shotts reported.)

A topic map for the bash man page would be of interest only to geeks but hopefully influential geeks. And if nothing else, it would be good warm up to take on something like USC Title 26 and its regulations, court decisions and opinions. 😉 (The tax code of the federal government in the United States.)


Ålenkå

Monday, January 30th, 2012

Ålenkå

If you don’t mind alpha code, Ålenkå was pointed out in the bitmap posting I cited earlier today.

From its homepage:

Alenka is a modern analytical database engine written to take advantage of vector based processing and high bandwidth of modern GPUs.

Features include:

  • Vector-based processing: CUDA programming model allows a single operation to be applied to an entire set of data at once.
  • Self optimizing compression: ultra fast compression and decompression performed directly inside GPU.
  • Column-based storage: minimize disk I/O by only accessing the relevant data.
  • Fast database loads: data load times measured in minutes, not in hours.
  • Open source and free.

Apologies for the name spelling differences, Ålenkå versus Alenka. I suspect it has something to do with character support in whatever produced the readme file, but can’t say for sure.

The benchmarks (there is that term again) are impressive.

Would semantic benchmarks be different from the ones currently used in IR? Different from precision and recall? What about range (the same subject identified differently) or accuracy (different subjects under the same identification, that is, how many false positives)?
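For concreteness, the classic IR measures any “semantic” benchmark would have to improve on. The result sets below are invented for illustration:

```python
# Precision and recall over a result set, the baseline against
# which a semantic benchmark (range, accuracy) would be judged.
# Document identifiers below are invented for illustration.
def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant)

# Three documents identify the same subject under different names;
# the system found two of them, plus one false positive.
relevant = {"doc-obama", "doc-potus44", "doc-white-house"}
retrieved = {"doc-obama", "doc-potus44", "doc-unrelated"}

print(round(precision(retrieved, relevant), 3))  # 0.667
print(round(recall(retrieved, relevant), 3))     # 0.667
```

A “range” measure would need the extra step of declaring, ahead of time, which differently identified documents count as the same subject, which is itself a topic map exercise.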

Topic maps and graphical structures

Monday, January 30th, 2012

Topic maps and graphical structures

Interesting webpage that explores the potential for adding probabilistic measures and operators to topic maps.

Moreover, it points out the lack of benchmarks for topic maps.

You might want to make note the last update was 4 November 2000.

Anyone care to point out any work on benchmarks for topic maps?

Suggestions for how to formulate benchmarks for topic maps?

Questions to myself would include:

  • Is the topic map being generated from source or is this a pre-created topic map being loaded into a topic map engine?
  • If a pre-created topic map, what syntax and/or data model is being tested?
  • What information items in the topic map will meet merging requirements? (by overall percentage and per item)
  • If created from source, what set of subjects need to result in items?
  • Use a common memory size/setting for comparisons.
  • Can we use existing corpora and tests to bootstrap topic map benchmarks?

What others would you ask?

1 Billion Insertions – The Wait is Over!

Monday, January 30th, 2012

1 Billion Insertions – The Wait is Over! by Tim Callaghan.

From the post:

iiBench measures the rate at which a database can insert new rows while maintaining several secondary indexes. We ran this for 1 billion rows with TokuDB and InnoDB starting last week, right after we launched TokuDB v5.2. While TokuDB completed it in 15 hours, InnoDB took 7 days.

The results are shown below. At the end of the test, TokuDB’s insertion rate remained at 17,028 inserts/second whereas InnoDB had dropped to 1,050 inserts/second. That is a difference of over 16x. Our complete set of benchmarks for TokuDB v5.2 can be found here.

Kudos to TokuDB team! Impressive performance!

Tim comments on iiBench:

iiBench [Indexed Insertion Benchmark] simulates a pattern of usage for always-on applications that:

  • Require fast query performance and hence require indexes
  • Have high data insert rates
  • Cannot wait for offline batch processing and hence require the indexes be maintained as data comes in

If this sounds familiar, could be an important benchmark to keep in mind.
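An iiBench-style loop is easy to sketch. The sqlite3 module below is merely a convenient stand-in for the engines the benchmark actually compares, and the table shape and row count are invented:

```python
import random
import sqlite3
import time

# iiBench-style loop: insert rows while maintaining secondary
# indexes, then report the sustained insert rate. sqlite3 is a
# stand-in for TokuDB/InnoDB; schema and sizes are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INT, b INT)")
conn.execute("CREATE INDEX idx_a ON t (a)")  # secondary index 1
conn.execute("CREATE INDEX idx_b ON t (b)")  # secondary index 2

n = 50_000
start = time.perf_counter()
conn.executemany(
    "INSERT INTO t (a, b) VALUES (?, ?)",
    ((random.randrange(1000), random.randrange(1000)) for _ in range(n)))
conn.commit()
rate = n / (time.perf_counter() - start)
print(f"{rate:,.0f} inserts/second")
```

The interesting behavior in the real benchmark only shows up at a billion rows, when the indexes no longer fit in memory; a sketch like this just shows the measurement shape.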

BTW, do you know of any topic map benchmarks? Just curious.

Big Data is More Than Hadoop

Monday, January 30th, 2012

Big Data is More Than Hadoop by David Menninger.

From the post:

We recently published the results of our benchmark research on Big Data to complement the previously published benchmark research on Hadoop and Information Management. Ventana Research undertook this research to acquire real-world information about levels of maturity, trends and best practices in organizations’ use of large-scale data management systems now commonly called Big Data. The results are illuminating.

Volume, velocity and variety of data (the so-called three V’s) are often cited as characteristics of big data. Our research offers insight into each of these three categories. Regarding volume, over half the participating organizations process more than 10 terabytes of data, and 10% process more than 1 petabyte of data. In terms of velocity, 30% are producing more than 100 gigabytes of data per day. In terms of the variety of data, the most common types of big data are structured, containing information about customers and transactions.

However, one-third (31%) of participants are working with large amounts of unstructured data. Of the three V’s, nine out of 10 participants rate scalability and performance as the most important evaluation criteria, suggesting that volume and velocity of big data are more important concerns than variety.

This research shows that big data is not a single thing with one uniform set of requirements. Hadoop, a well-publicized technology for dealing with big data, gets a lot of attention (including from me), but there are other technologies being used to store and analyze big data.

Interesting work but especially for what the enterprises surveyed are missing about Big Data.

When I read “Volume, velocity and variety of data (the so-called three V’s) are often cited as characteristics of big data.” I was thinking that “variety” meant the varying semantics of the data. As is natural when collecting data from a variety of sources.

Nope. Completely off-base. “Variety” in the three V’s, at least for Ventana Research, means:

The data being analyzed consists of a variety of data types. Rapidly increasing unstructured data and social media receive much of the attention in the big-data market, and the research shows these types of data are common among Hadoop users.

While the Ventana work is useful, at least for the variety leg of the Big Data stool you will be better off with Ed Dumbill’s What is Big Data?, where he points out for variety:

A common use of big data processing is to take unstructured data and extract ordered meaning, for consumption either by humans or as a structured input to an application. One such example is entity resolution, the process of determining exactly what a name refers to. Is this city London, England, or London, Texas? By the time your business logic gets to it, you don’t want to be guessing.
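Dumbill’s London example can be caricatured as context matching. The entity names and context cues below are invented; real entity resolution weighs far richer evidence:

```python
# Toy entity resolution: pick the reading of "London" whose
# context cues best match the surrounding words. Entities and
# cues are invented for illustration.
ENTITIES = {
    "London, England": {"uk", "thames", "england", "british"},
    "London, Texas": {"texas", "ranch", "county"},
}

def resolve(mention, context_words):
    words = set(context_words)
    # Score each candidate by overlap with the context.
    return max(ENTITIES, key=lambda e: len(ENTITIES[e] & words))

print(resolve("London", ["flight", "to", "texas", "ranch"]))
# London, Texas
```

Even this toy makes the point: without context, the string “London” identifies nothing in particular.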

While data type variety is an issue, it isn’t one that is difficult to solve. Semantic variety on the other hand, is an issue that keeps on giving.

I think the promotional question for topic maps with regard to Big Data is: Do you still like the answer you got yesterday?

Topic maps can not only keep the question you asked yesterday and its answer, but the new question you want to ask today (and its answer). (Try that with fixed schemas.)

Google Analytics Tutorial: 8 Valuable Tips To Hustle With Data!

Monday, January 30th, 2012

Google Analytics Tutorial: 8 Valuable Tips To Hustle With Data! by Avinash Kaushik.

This is simply awesome! For several reasons.

I started to say because it’s an excellent guide to Google Analytics!

I started to say because it has so many useful outlinks to other resources and software.

And all that is very true, but not my “take away” from the post.

My “take away” from the post is that to succeed, “Analysis Ninjas” need to deliver useful results to users.

That means both information they are interested in seeing and delivered in a way that works for them.

The corollary is that data of no interest to users, or delivered in ways users can’t understand or easily use, is a losing strategy.

That means you don’t create web interfaces that mimic interfaces that failed for applications.

That means given the choice of doing a demo with Sumerian (something I would like to see) or something with the interest level of American Idol, you choose the American Idol type project.

Avinash has outlined some of the tools for data analysis. What you make of them is limited only by your imagination.

Using Bitmap Indexes in Query Processing

Monday, January 30th, 2012

Why are column oriented databases so much faster than row oriented databases?

Be sure to read all the comments. Some of the techniques described are covered by patents (according to the comments) but there are open source implementations of alternatives. There is also a good discussion of the trade-offs in using this technique.

Search terms: Hybrid Word Aligned Bitmaps, HWAB, EWAB, FastBit.

Follow the links in the post and comments for more resources.
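The core bitmap-index trick is small enough to sketch with plain Python integers. Real systems compress these bitmaps (the word-aligned hybrid encodings mentioned above); the column and values here are invented:

```python
# A minimal bitmap index: one integer bitmap per column value,
# with OR/AND of query predicates done as bitwise operations.
# Real systems compress the bitmaps; data here is invented.
rows = ["red", "blue", "red", "green", "blue", "red"]

bitmaps = {}
for i, value in enumerate(rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)

def matching_rows(bitmap):
    return [i for i in range(len(rows)) if bitmap >> i & 1]

# WHERE color = 'red' OR color = 'green'
print(matching_rows(bitmaps["red"] | bitmaps["green"]))  # [0, 2, 3, 5]
```

The speed comes from that single bitwise OR standing in for a scan over every row, which is why the technique favors low-cardinality columns.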


From a topic map perspective, how would you structure a set of relational tables to represent the information items defined by the Topic Maps Data Model? (Yes, it has been done before, but no peeking! Your result will likely be very similar, but I am interested in how you would structure the data. If you want to think ahead, ask the same question for the various NoSQL options.)

For the relational database, how would you structure a chain of selects to choose all the information items that should merge for any particular item. In other words, start off with the values of an item that should merge and construct a select that gathers up the other items with which it should merge.

Enumerate the operations you would need to perform post-select to present a “merged” information item to the final user.

Observation: Isn’t indexing the first step towards merging? That is, don’t we have to gather up all the relevant representatives of a subject before we can consider the mechanics of merging?
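As a sketch of that “gather up the representatives” step, here is a drastically simplified pair of tables and one select. This is nowhere near the full TMDM (no names, occurrences, associations, or item identifiers), and the schema and data are mine:

```python
import sqlite3

# A drastic simplification of the TMDM: topics plus a table of
# subject identifiers. The select gathers every topic sharing a
# subject identifier with a starting topic, the step that comes
# before any mechanics of merging. Schema and data are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE topic (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE subject_identifier (topic_id INT, iri TEXT);
INSERT INTO topic VALUES
 (1, 'Mark Twain'), (2, 'Samuel Clemens'), (3, 'Unrelated');
INSERT INTO subject_identifier VALUES
 (1, 'http://example.org/twain'),
 (2, 'http://example.org/twain'),
 (3, 'http://example.org/other');
""")

merge_set = conn.execute("""
SELECT DISTINCT t.name
FROM subject_identifier s1
JOIN subject_identifier s2 ON s1.iri = s2.iri
JOIN topic t ON t.id = s2.topic_id
WHERE s1.topic_id = ?
""", (1,)).fetchall()

print(sorted(name for (name,) in merge_set))
# ['Mark Twain', 'Samuel Clemens']
```

Post-select you would still have to union the names, occurrences, and played roles of the gathered items to present one “merged” item to the user, and re-run the gathering step if that union triggers further merges.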

First seen at myNoSQL.

PHP and MongoDB Tutorial

Monday, January 30th, 2012

PHP and MongoDB Tutorial

Presentation by Derick Rethans on MongoDB and PHP. Walks through the most common aspects of using PHP with MongoDB.

From myNoSQL.

Why were they laughing?

Monday, January 30th, 2012

Why were they laughing?

An amusing posting from Junk Charts with charts of laughter in Federal Reserve’s FOMC meetings up to the recent crash.

Readers are cautioned about making comparisons based on time-series data.

The same caution applies to creating associations based on time-series data.

Still, an amusing post to start the week.

Munging, Modeling and Visualizing Data with R

Sunday, January 29th, 2012

Munging, Modeling and Visualizing Data with R by Xavier Léauté.

With a title like that, how could I resist?

From the post:

Yesterday evening Romy Misra from invited us to teach an introductory workshop to R for the San Francisco Data Mining meetup. Todd Holloway was kind enough to host the event at Trulia headquarters.

R can be a little daunting for beginners, so I wanted to give everyone a quick overview of its capabilities and enough material to get people started. Most importantly, the objective of this interactive session was to give everyone some time to try out some simple examples that would be useful in the future.

I hope everyone enjoyed learning some fun and easy ways to slice, model and visualize data, and that I piqued their interest enough to start exploring datasets on their own.

Slides and sample scripts follow.

First seen at Christophe Lalanne’s Bag of Tweets for January 2012.

BigCouch 0.4

Sunday, January 29th, 2012

BigCouch 0.4 by Alan Hoffman.

From the post:

It is a big day here at Cloudant HQ; we are announcing the release of BigCouch 0.4! This release, which brings BigCouch into API equivalence with Apache CouchDB 1.1.1, has been baking for a while, and we are excited that it is now ready for public consumption. Instructions for installing and using BigCouch 0.4 can be found on the BigCouch page. Users running Debian Squeeze, Ubuntu (LTS or newer) or RedHat / CentOS / Amazon Linux are welcome and encouraged to use our prebuilt distributions based on Erlang/OTP R14B01 and SpiderMonkey 1.8.5.

Later in the post Alan points out that Cloudant is dedicated to donating and integrating BigCouch into Apache CouchDB.

Is it just me, or does it seem that, at least for “big data,” vendors are trying to build out open infrastructures? And doing so by supporting open source versions of their software? Good to see the potential for an evolving, common infrastructure that will support commercial, governmental or other services.

Semantic Enterprise Wiki (SMW)

Sunday, January 29th, 2012

Semantic Enterprise Wiki (SMW)

When I ran across:

Run gardening bots to detect inconsistencies in your wiki and continuously improve the quality of the authored knowledge

I thought of Jack Park and his interest in “knowledge gardening.”

But I don’t think Jack was interested only in weeding; he was also interested in cultivating diverse ideas, even those inconsistent with existing ones.

I think of it as being the difference between a vibrant heritage garden versus a mono-culture Monsanto field.

Are you using this software? Thoughts/comments?

I haven’t installed it yet but am interested in the vocabulary and annotation features.

HadoopDB: Efficient Processing of Data Warehousing Queries in a Split Execution Environment

Sunday, January 29th, 2012

HadoopDB: Efficient Processing of Data Warehousing Queries in a Split Execution Environment

From the post:

The buzz about Hadapt and HadoopDB has been around for a while now as it is one of the first systems to combine ideas from two different approaches, namely parallel databases based on a shared-nothing architecture and map-reduce, to address the problem of large scale data storage and analysis.

This early paper that introduced HadoopDB crisply summarizes some reasons why parallel database solutions haven’t scaled to hundreds of machines. The reasons include:

  1. As the number of nodes in a system increases failures become more common.
  2. Parallel databases usually assume a homogeneous array of machines, which becomes impractical as the number of machines rises.
  3. They have not been tested at larger scales, as applications haven’t demanded more than tens of nodes for performance until recently.

Interesting material to follow on the HPCC vs. Hadoop post.

Not to take sides, just the beginning of the type of analysis that will be required.

HPCC vs Hadoop

Sunday, January 29th, 2012

HPCC vs Hadoop

Four factors are said to distinguish HPCC from Hadoop:

  • Enterprise Control Language (ECL)
  • Beyond MapReduce
  • Roxie Delivery Engine
  • Enterprise Ready

After viewing these summaries you may feel you lack information on which to make a choice between these two.

So you follow: Detailed Comparison of HPCC vs. Hadoop.

I’m afraid you are going to be disappointed there as well.

Not enough information to make an investment choice in an enterprise context in favor of either HPCC or Hadoop.

Do you have pointers to meaningful comparisons of these two platforms?

Or perhaps suggestions for what would make a meaningful comparison?

Are there features of HPCC that Hadoop should emulate?

Rooting out redundancy – The new Neo4j Property Store

Sunday, January 29th, 2012

Rooting out redundancy – The new Neo4j Property Store by Chris Gioran.

From the post:

So, for the last 2 months we’ve been working diligently, trying to create the 1.5 release of Neo4j. While on the surface it may look like little has changed, under the hood a huge amount of work has gone into a far more stable and usable HA implementation and rewriting the property storage layer to use far less disk space while maintaining all its features and providing a speed boost at the same time. In this post I will deal exclusively with the latter.

If you are interested in how Neo4j is so damned good, you have come to the right place!

I would recommend this post to all programmers and certainly to system architects. The former so they will understand “why” some higher-level choices work as they do, the latter because our systems shape how we perceive both problems and solutions.

Not that anyone can step outside of their context, but being sensitive to the explicit choices a context makes may make other choices possible.

SEALS – Community Page

Sunday, January 29th, 2012

SEALS – Semantic Evaluation At Large Scale – Community Page

The community page was added after my first post on the SEALS project.

The next community event:

SEALS to present evaluation results at ESWC 2012

SEALS is pleased to announce that the workshop Evaluation of Semantic Technologies (IWEST 2012) has been confirmed to take place at the leading semantic web conference, ESWC (Extended Semantic Web Conference) 2012, scheduled to take place May 27-31, 2012 in beautiful Crete, Greece.

This workshop will be a venue for researchers and tool developers, firstly, to initiate discussion about the current trends and future challenges of evaluating semantic technologies. Secondly, to support communication and collaboration with the goal of aligning the various evaluation efforts within the community and accelerating innovation in all the associated fields as has been the case with both the TREC benchmarks in information retrieval and the TPC benchmarks in database research.

A call for papers will be published soon. All SEALS community members and evaluation campaign participants are especially encouraged to submit and participate.

If you attend, I am particularly interested in the results of the discussion about “aligning the various evaluation efforts within the community….”

I say that because when the project started, the “about” page reported:

This is a very active research area, currently supported by more than 3000 individuals integrated in 360 organisations which have produced around 700 tools, but still suffers from a lack of standard benchmarks and infrastructures for assessing research outcomes. Due to its physically boundless nature, it remains relatively disorganized and lacks common grounds for assessing research and technological outcomes.

Sounds untidy, even diverse, doesn’t it? 😉

To tell the truth, I am not bothered by the repetition of semantic diversity in efforts to reduce semantic diversity. I find it refreshing that our languages burst the bonds that would be imposed upon them on a regular basis. Tyrants of thought, social, political and economic arrangements, the well- and the ill-intended, all fail. (Some last longer than others but on a historical time scale, the governments of the East and West are ephemera. Their peoples, the originators of language and semantics, will persist.)

We can work to reduce semantic diversity when it is needful, or to account for it, but even those efforts, as SEALS points out, exhibit the same semantic diversity as the area they purport to address.

Statistics Finland is making further utilisation of statistical data easier

Sunday, January 29th, 2012

Statistics Finland is making further utilisation of statistical data easier

From the post:

Statistics Finland has confirmed new Terms of Use for the utilisation of already published statistical data. In them, Statistics Finland grants a universal, irrevocable right to the use of the data published in its website service and in related free statistical databases. The right extends to use for both commercial and non-commercial purposes. The aim is to make further utilisation of the data easier and thereby increase the exploitation and effectiveness of statistics in society.

At the same time, an open interface has been built to the StatFin database. The StatFin database is a general database built with PC-Axis tools that is free-of-charge and contains a wide array of statistical data on a variety of areas in society. It contains data from some 200 sets of statistics, thousands of tables and hundreds of millions of individual data cells. The contents of the StatFin database have been systematically widened in the past few years and its expansion with various information contents and regional divisions will be continued even further.

Curious whether the free commercial re-use of government-collected data (paid for by taxpayers) favors established re-sellers of data or startups that will combine existing data in interesting ways. Thoughts?

First seen at Christophe Lalanne’s Bag of Tweets for January 2012.

the time for libraries is NOW

Saturday, January 28th, 2012

the time for libraries is NOW

Ned Potter outlines a call to arms for librarians!

Librarians need to aggressively make the case for libraries…, but I would tweak Ned’s message a bit.

Once upon a time, being the best, most complete, most skilled collection point or guide to knowledge was enough for libraries. People knew libraries and education were their tickets to economic/social mobility, out of the slums, to a better life.

Today people are mired in the vast sea of the “middle class,” and information is pushed upon willing and/or unwilling information consumers. Infotainment, advertising, and spam of all types vie for our attention, with little basis for distinguishing the useful from the useless, the important from the idiotic, and the graceful from the graceless.

Libraries and librarians cannot be heard in the vortex of noise that surrounds the average information consumer, while passively waiting for a question or reference interview.

Let’s drop the pose of passivity. Librarians are passionate about libraries and the principles they represent.

Become information pushers. Bare-fisted information brawlers who fight for the attention of information consumers.

Push information on news channels, politicians, even patrons. When a local story breaks, feed the news sources with background material and expect credit for it. The same goes for political issues: position papers that explore both sides. Not bland finding aids, but web-based multimedia resources with a mixture of web and more traditional resources.

Information consumers can be dependent on the National Enquirer, Jon Stewart, Rupert Murdoch, or libraries/librarians. Your choice.

Citogenesis in science and the importance of real problems

Saturday, January 28th, 2012

Citogenesis in science and the importance of real problems

Daniel Lemire writes:

Many papers in Computer Science tell the following story:

  • There is a pre-existing problem P.
  • There are a few relatively simple but effective solutions to problem P. Among them is solution X.
  • We came up with a new solution X+ which is a clever variation on X. It looks good on paper.
  • We ran some experiments and tweaked our results until X+ looked good. We found a clever way to avoid comparing X+ and X directly and fairly, as it might then become obvious that the gains are small, or even negative! We would gladly report negative results, but then our paper could not be published.

It is a very convenient story for reviewers: the story is simple and easy to assess superficially. The problem is that sometimes, especially if the authors are famous and the idea is compelling, the results will spread. People will adopt X+ and cite it in their work. And the more they cite it, the more enticing it is to use X+ as every citation becomes further validation for X+. And why bother with algorithm X given that it is older and X+ is the state-of-the-art?

Occasionally, someone might try both X and X+, and they may report results showing that the gains due to X+ are small, or negative. But they have no incentive to make a big deal of it because they are trying to propose yet another better algorithm (X++).

But don’t we see the same thing in blogs, where writers say “some,” “many,” “often,” etc., but make no claims that can be evaluated by others?

Make no mistake, given the rate of mis-citation that I find in published proceedings, I really want to agree with Daniel, but I think the matter is more complex than simply saying that “engineers” work with “real” tests.

One of my pet peeves is the lack of history that I find in most CS papers. They may go back ten years, but what about thirty or even forty years ago?

But as far as engineers go, why is there so little code re-use if they are so interested in being efficient? Is re-writing code really more efficient, or just NWH (Not Written Here)?