Archive for June, 2012

R-Uni (A List of Free R Tutorials and Resources in Universities webpages)

Saturday, June 30th, 2012

R-Uni (A List of Free R Tutorials and Resources in Universities webpages) by Pairach Piboonrungroj.

A list of eighty-seven (87) university-based resources on R.

I suspect there is a fair amount of duplication just in terms of resources cited at each of those resources.

Duplication/repetition isn’t necessarily bad, but imagine having a unique list of resources on R.

Or tagging in articles on R that link back into resources on R, in case you need a quick reminder on a function.

Time saver?
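A rough sketch of how such a unique list could be built, assuming the cited resources have already been collected as plain URL strings. The URLs below are hypothetical placeholders, and the normalization rule (ignore scheme, a leading "www." and a trailing slash) is my own assumption about what counts as "the same" resource:

```python
from urllib.parse import urlsplit

def normalize(url: str) -> str:
    """Reduce a URL to a comparison key: lowercase host without 'www.', path without trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")

def unique_resources(lists):
    """Merge several lists of URLs, keeping the first spelling seen for each resource."""
    seen = {}
    for urls in lists:
        for url in urls:
            seen.setdefault(normalize(url), url)
    return list(seen.values())

# Hypothetical citations gathered from two university pages.
page_a = ["http://www.example.edu/r/intro/", "http://example.org/r-manual"]
page_b = ["https://example.edu/r/intro", "http://example.org/r-graphics"]
merged = unique_resources([page_a, page_b])
print(merged)
```

Real duplicates will be messier than this (mirrors, redirects, renamed pages), so treat the key function as a starting point only.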

Simple network diagrams in R

Saturday, June 30th, 2012

Simple network diagrams in R by Steve Powell.

From the post:

Why study networks?

Development and aid projects these days are more and more often focussing on supporting networks, so tools to analyse networks are always welcome.

In this post I am going to present a very easy-to-use package for the stats program R which makes nice-looking graphs of these kinds of networks.

In a recent project for a client, one of the outcomes is to improve how a bunch of different local and regional organisations work together. The local organisations in particular are associated with one of three ethnicities, and one project goal is to encourage these organisations to work with one another as peers.

One tool we used to look at this is the old friend of the educational psychologist, the sociogram. We made a numbered list of about 80 relevant local and regional organisations. Then we sent this list to each of the local organisations and asked them to list the five with which they communicated the most, the five with which they cooperated the most, and the five which they think make the biggest contribution to solving their collective problems.
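The survey described above is easy to picture as data. Here is a minimal sketch in Python (not the R package from the post) that treats each nomination as a directed edge, then counts how often each organisation was named and which pairs named each other. The organisation numbers and answers are hypothetical:

```python
from collections import Counter

# Hypothetical survey answers: organisation number -> organisations it named.
nominations = {
    1: [2, 3],
    2: [3],
    3: [1, 2],
    4: [3],
}

# Each nomination is a directed edge (who named whom).
edges = [(src, dst) for src, named in nominations.items() for dst in named]

# In-degree: how often each organisation was named by others.
in_degree = Counter(dst for _, dst in edges)

# Reciprocated ties: pairs of organisations that named each other.
edge_set = set(edges)
mutual = sorted({tuple(sorted(e)) for e in edges if (e[1], e[0]) in edge_set})

print(in_degree.most_common())
print(mutual)
```

An edge list like this is the usual input for graph plotting packages, whichever one you end up using.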

You won’t always need to spin up a server farm at Amazon or Google for preliminary data analysis.

A tool for quick-and-dirty analysis that is also capable of client-friendly presentation.

igraph 0.6 Release

Saturday, June 30th, 2012

igraph 0.6 Release

From the introduction:

igraph is a free software package for creating and manipulating undirected and directed graphs. It includes implementations for classic graph theory problems like minimum spanning trees and network flow, and also implements algorithms for some recent network analysis methods, like community structure search.

The efficient implementation of igraph allows it to handle graphs with millions of vertices and edges. The rule of thumb is that if your graph fits into the physical memory then igraph can handle it.

OK, I’m packing n x GB of RAM, so should be able to do some serious damage.

There are too many changes, features and fixes to easily summarize them. See Release Notes 0.6.

Documentation for igraph is available for R, C and Python interfaces. Features are not always the same across interfaces.

I mention that because in this release, in the R interface, vertexes and edges are numbered from one. For C and Python, vertexes and edges continue to be counted from zero.
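If you move edge lists between the interfaces, the off-by-one is easy to trip over. A small sketch of the conversion in plain Python, with no igraph calls; the function names and the sample edge list are my own:

```python
def to_zero_based(edges):
    """Shift 1-based vertex ids (R interface convention) to 0-based ids (C/Python convention)."""
    if any(u < 1 or v < 1 for u, v in edges):
        raise ValueError("expected 1-based vertex ids")
    return [(u - 1, v - 1) for u, v in edges]

def to_one_based(edges):
    """Shift 0-based vertex ids to 1-based ids."""
    if any(u < 0 or v < 0 for u, v in edges):
        raise ValueError("expected 0-based vertex ids")
    return [(u + 1, v + 1) for u, v in edges]

# A hypothetical triangle written with R-style numbering.
r_edges = [(1, 2), (2, 3), (3, 1)]
py_edges = to_zero_based(r_edges)
print(py_edges)
```

The guard against ids below the expected base is there because a silently shifted edge list still looks like a perfectly valid graph.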


Algorithms

Saturday, June 30th, 2012

Algorithms by S. Dasgupta, C.H. Papadimitriou, and U.V. Vazirani.

From the webpage:

This is a penultimate draft of our soon to appear textbook.

For more information, visit

Table of contents


Chapter 0: Prologue

Chapter 1: Algorithms with numbers

Chapter 2: Divide-and-conquer algorithms

Chapter 3: Decompositions of graphs

Chapter 4: Paths in graphs

Chapter 5: Greedy algorithms

Chapter 6: Dynamic programming

Chapter 7: Linear programming

Chapter 8: NP-complete problems

Chapter 9: Coping with NP-completeness

Chapter 10: Quantum algorithms

Entire book (draft)

The published version was reviewed by Dean Kelley (Dean Kelley. 2009. Joint review of algorithms by Richard Johnsonbaugh and Marcus Schaefer (Pearson/Prentice-Hall, 2004) and algorithms by Sanjoy Dasgupta, Christos Papadimitriou and Umesh Vazirani (McGraw-Hill, 2008). SIGACT News 40, 2 (June 2009), 23-25. DOI=10.1145/1556154.1556159)

Who noted:

….Eschewing a formal and traditional presentation, the authors focus on distilling the core of a problem and/or the fundamental idea that makes an algorithm work. Definitions, theorems and proofs are present, of course, but less visibly so and are less formally presented than in the other text reviewed here.

The result is a book which finds a rigorous, but nicely direct path through standard subjects such as divide-and-conquer, graph algorithms, greedy algorithms, dynamic programming, and NP-completeness. You won’t necessarily find every topic that you might want in all of these subjects, but the book doesn’t claim to be encyclopedic and the authors’ presentation doesn’t suffer as a result of their choice of specific topics.

Nice collections of chapter problems provide opportunities to formalize parts of the presentation and explore additional topics. The text contains plenty of “asides” (digressions, illuminations, addenda, perspectives, etc.) presented as boxes on some of the pages. These little side trips are fun to read, enhance the presentation and can often lead to opportunities to include additional material in the course. It has been my experience that even a disinterested, academically self-destructive student can find something of sufficient interest in these excursions to grab their attention.

A good text on algorithms and one that merits a hard copy for the shelf!

An example of what to strive for when writing a textbook.

Inside the Open Data white paper: what does it all mean?

Saturday, June 30th, 2012

Inside the Open Data white paper: what does it all mean?

The Guardian reviews a recent white paper on open data in the UK:

Does anyone disagree with more open data? It’s a huge part of the coalition government’s transparency strategy, championed by Francis Maude in the Cabinet Office and key to the government’s self-image.

And – following on from a less-than enthusiastic NAO report on its achievements in April – today’s Open Data White Paper is the government’s chance to seize the initiative.

Launching the paper, Maude said:

Today we’re at a pivotal moment – where we consider the rules and ways of working in a data‑rich world and how we can use this resource effectively, creatively and responsibly. This White Paper sets out clearly how the UK will continue to unlock and seize the benefits of data sharing in the future in a responsible way

And this one comes with a spreadsheet too – a list of each department’s commitments.

So, what does it actually include? White Papers are traditionally full of official, yet positive-sounding waffle, but what about specific announcements? We’ve extracted the key commitments below.

Just in case you are interested in open data from the UK or open data more generally.

It is amusing that the Guardian touts privacy concerns while at the same time bemoaning that access to “The Postcode Address File (PAF®) is a database that lists all known UK Postcodes and addresses.” remains in doubt.

I would rather a little less privacy and a little less junk mail if you please.

50 Open Source Replacements for Proprietary Business Intelligence Software

Saturday, June 30th, 2012

50 Open Source Replacements for Proprietary Business Intelligence Software by Cynthia Harvey.

From the post:

In a recent Gartner survey, CIOs picked business intelligence and analytics as their top technology priority for 2012. The market research firm predicts that enterprises will spend more than $12 billion on business intelligence (BI), analytics and performance management software this year alone.

As the market for business intelligence solutions continues to grow, the open source community is responding with a growing number of applications designed to help companies store and analyze key business data. In fact, many of the best tools in the field are available under an open source license. And enterprises that need commercial support or other services will find many options available.

This month, we’ve put together a list of 50 of the top open source business intelligence tools that can replace proprietary solutions. It includes complete business intelligence platforms, data warehouses and databases, data mining and reporting tools, ERP suites with built-in BI capabilities and even spreadsheets. If we’ve overlooked any tools that you feel should be on the list, please feel free to note them in the comments section below.

A very useful listing of “replacements” for proprietary software in part because it includes links to the software to be replaced.

You will find it helpful in identifying software packages with common goals but diverse outputs, grist for topic map mills.

I tried to find a one-page display (print usually works) but you will have to endure the advertising clutter to see the listing.

PS: Remember that MS Excel holds seventy-five percent (75%) of the BI market. Improve upon or use an MS Excel result and you are closer to a commercially viable product. (BI’s Dirty Secrets – Why Business People are Addicted to Spreadsheets)

Predictive Analytics World

Saturday, June 30th, 2012

Predictive Analytics World

I mention a patent on “predictive coding” and now a five (5) day conference on predictive analytics?

The power of blogging? Or self-delusion. Your call. 😉

Seriously, if you are interested in predictive analytics, this looks like a good opportunity to learn more.

It has all the earmarks of a “vendor” conference so I predict you will be spending money but the contacts and basic information should be worth your while.

Suggestions of other predictive analytics resources that aren’t vendor posturing and are useful as a general introduction?

Reasoning that if it is information, then you should be using a topic map to either trail blaze or navigate it.

Guide to Intelligent Data Analysis

Saturday, June 30th, 2012

Guide to Intelligent Data Analysis: How to Intelligently Make Sense of Real Data, by Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. Series: Texts in Computer Science Springer Verlag, 1st Edition., 2010. ISBN 978-1-84882-259-7.

Review snippet posted to book’s website:

“The clear and complete exposition of arguments, along with the attention to formalization and the balanced number of bibliographic references, make this book a bright introduction to intelligent data analysis. It is an excellent choice for graduate or advanced undergraduate courses, as well as for researchers and professionals who want to get acquainted with this field of study. … Overall, the authors hit their target producing a textbook that aids in understanding the basic processes, methods, and issues for intelligent data analysis.” (Corrado Mencar, ACM Computing Reviews, April, 2011)

In some sense dated by not including the very latest improvements in the Hadoop ecosystem but all the more valuable for not focusing on ephemera. Rather it focuses on the principles of data analysis that are broadly applicable across data sets and tools.

The website includes slides and bibliographic references for use in teaching these materials.

I first saw this at KDNuggets.

Dilbert Summary – GOOD : Cirro Data Hub (CDH)

Saturday, June 30th, 2012

Cirro Data Hub (CDH)

A new product has appeared that promises:

The Cirro product suite provides a solution for accessing any data on any platform in any environment without having to be a developer or programmer. Cirro’s solution represents a new paradigm “to consistently ask questions and extract value from structured and unstructured data sources” using tools already available on user desktops. Designed to be used by non-technical analysts, Cirro’s products are cloud based and can run on public, virtual private and on-premise cloud environments. This solution seamlessly integrates with existing data warehouse and leverages existing in-house BI analytic investments and can also be used as a standalone departmental solution for data marts and mash up analytics. The result is unparalleled data accessibility, new insights to your business and more informed decisions – faster.

And when I looked for more detail I found:

The Cirro Data Hub offers a revolutionary method that simplifies total data access by federating queries across multiple sources of structured, semi-structured, and unstructured data. With Cirro single query joins can be done between data residing in HDFS and a RDBMS. In short, Cirro removes the complexity of accessing any data, at any time, on any platform. Cirro Data Hub is a fresh approach to the challenge of federated processing. Federation of query processing is about taking the processing to the data. When using Cirro Data Hub users do not need to concern themselves with the complexities of having to stage data, various operating systems and multiple query languages. Rather, users need only concern themselves with what data they want and what they want to do with it. Cirro Data Hub determines where the processing of a query occurs and issues appropriate data requests to all data sources involved. Supporting this new approach to the federation of query processing are a number of patent pending technologies such as a federated cost based optimizer, smart caching, dynamic query plan re-optimization, normalization of cost estimates and a metadata repository for unstructured data sources.

Total data processing, encompassing NO SQL, Hadoop, or large traditional RDBMS data, requires new approaches for the querying of massive volumes of a variety of data sources. Existing approaches of bringing all of the data to a single location for query processing are no longer practical. Cirro Data Hub is the industry leading solution for providing scalability of processing for the challenges of total data.

After reading this more than once, I have the distinct impression of the Dilbert management summary that reads: Good.

Optional reading exercise for my topic maps class? Or do graduate students have enough experience reading vacuous vendor prose (VVP)?

BTW, so your time spent reading this post wasn’t a complete waste: Dilbert.

I first saw this at KDNuggets.

Station Maps: Browser-Based 3D Maps of the London Underground

Saturday, June 30th, 2012

Station Maps: Browser-Based 3D Maps of the London Underground

From Information Aesthetics:

Station Maps by programmer Andrew Godwin contains a large collection of browser-based (HTML5) 3D maps depicting different London Underground/DLR stations.

Most of the stations are modelled from memory in combination with a few diagrams found online. This means that the models are not totally accurate, but they should represent the right layout, shape and layering of the stations.

Every map has some underlying structure/ontology onto which other information is added.

Real time merging of train, security camera, security forces, event, etc., information onto such maps is one aspect of merging based on location/interest. Not all information is equally useful to all parties.

Google Compute Engine: Computing without limits

Friday, June 29th, 2012

Google Compute Engine: Computing without limits by Craig McLuckie.

From the post:

Over the years, Google has built some of the most high performing, scalable and efficient data centers in the world by constantly refining our hardware and software. Since 2008, we’ve been working to open up our infrastructure to outside developers and businesses so they can take advantage of our cloud as they build applications and websites and store and analyze data. So far this includes products like Google App Engine, Google Cloud Storage, and Google BigQuery.

Today, in response to many requests from developers and businesses, we’re going a step further. We’re introducing Google Compute Engine, an Infrastructure-as-a-Service product that lets you run Linux Virtual Machines (VMs) on the same infrastructure that powers Google. This goes beyond just giving you greater flexibility and control; access to computing resources at this scale can fundamentally change the way you think about tackling a problem.

Google Compute Engine offers:

  • Scale. At Google we tackle huge computing tasks all the time, like indexing the web, or handling billions of search queries a day. Using Google’s data centers, Google Compute Engine reduces the time to scale up for tasks that require large amounts of computing power. You can launch enormous compute clusters – tens of thousands of cores or more.
  • Performance. Many of you have learned to live with erratic performance in the cloud. We have built our systems to offer strong and consistent performance even at massive scale. For example, we have sophisticated network connections that ensure consistency. Even in a shared cloud you don’t see interruptions; you can tune your app and rely on it not degrading.
  • Value. Computing in the cloud is getting even more appealing from a cost perspective. The economy of scale and efficiency of our data centers allows Google Compute Engine to give you 50% more compute for your money than with other leading cloud providers. You can see pricing details here.

The capabilities of Google Compute Engine include:

  • Compute. Launch Linux VMs on-demand. 1, 2, 4 and 8 virtual core VMs are available with 3.75GB RAM per virtual core.
  • Storage. Store data on local disk, on our new persistent block device, or on our Internet-scale object store, Google Cloud Storage.
  • Network. Connect your VMs together using our high-performance network technology to form powerful compute clusters and manage connectivity to the Internet with configurable firewalls.
  • Tooling. Configure and control your VMs via a scriptable command line tool or web UI. Or you can create your own dynamic management system using our API.
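The Compute bullet above implies a simple memory table. A quick worked sketch, using only the figures quoted in the announcement (1, 2, 4 and 8 virtual cores, 3.75GB RAM per virtual core):

```python
GB_PER_CORE = 3.75          # figure quoted in the announcement
CORE_OPTIONS = (1, 2, 4, 8)  # VM sizes quoted in the announcement

# RAM per VM size is just cores times the per-core allowance.
ram_by_cores = {cores: cores * GB_PER_CORE for cores in CORE_OPTIONS}

for cores, ram in ram_by_cores.items():
    print(f"{cores} virtual core(s): {ram:g} GB RAM")
```

So the largest VM in the preview tops out at 30 GB, which is worth keeping in mind before assuming a data set will fit in memory on a single machine.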

Google Compute Engine Preview – Signup

Wondering how this will impact evaluations of CS papers? And what data sets will be used on a routine basis?

To say nothing of exploration of data/text mining.

Now if we can just get access to the majority of research literature, well, but that’s an issue for another forum.

Binary Search Tree

Friday, June 29th, 2012

Binary Search Tree by Stoimen Popov.

Nothing new but clearly explained and well illustrated, two qualities that make this post merit mentioning.

To say nothing of the related posts at the bottom of this one that cover related material in an equally effective manner.

BTW, if you do use these illustrations in slides or teaching, give credit where credit is due. It will encourage others to contribute as well.
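For readers who want something to type in while reading the post, here is a minimal binary search tree sketch in Python. It is my own illustration, not code from Stoimen's post, and it makes no attempt at balancing:

```python
class Node:
    """One tree node: a key plus optional left and right children."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key and return the (possibly new) root; duplicates are ignored."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def contains(root, key):
    """Walk down from the root, going left for smaller keys and right for larger ones."""
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False

def in_order(root):
    """Return the keys in sorted order (left subtree, node, right subtree)."""
    if root is None:
        return []
    return in_order(root.left) + [root.key] + in_order(root.right)

root = None
for k in [8, 3, 10, 1, 6, 14]:
    root = insert(root, k)
print(in_order(root))
```

Inserting already sorted keys degrades this tree into a linked list, which is the usual motivation for the balanced variants.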


Friday, June 29th, 2012

MuteinDB: the mutein database linking substrates, products and enzymatic reactions directly with genetic variants of enzymes by Andreas Braun, Bettina Halwachs, Martina Geier, Katrin Weinhandl, Michael Guggemos, Jan Marienhagen, Anna J. Ruff, Ulrich Schwaneberg, Vincent Rabin, Daniel E. Torres Pazmiño, Gerhard G. Thallinger, and Anton Glieder.


Mutational events as well as the selection of the optimal variant are essential steps in the evolution of living organisms. The same principle is used in laboratory to extend the natural biodiversity to obtain better catalysts for applications in biomanufacturing or for improved biopharmaceuticals. Furthermore, single mutation in genes of drug-metabolizing enzymes can also result in dramatic changes in pharmacokinetics. These changes are a major cause of patient-specific drug responses and are, therefore, the molecular basis for personalized medicine. MuteinDB systematically links laboratory-generated enzyme variants (muteins) and natural isoforms with their biochemical properties including kinetic data of catalyzed reactions. Detailed information about kinetic characteristics of muteins is available in a systematic way and searchable for known mutations and catalyzed reactions as well as their substrates and known products. MuteinDB is broadly applicable to any known protein and their variants and makes mutagenesis and biochemical data searchable and comparable in a simple and easy-to-use manner. For the import of new mutein data, a simple, standardized, spreadsheet-based data format has been defined. To demonstrate the broad applicability of the MuteinDB, first data sets have been incorporated for selected cytochrome P450 enzymes as well as for nitrilases and peroxidases.

Database URL:

Why is this relevant to topic maps or semantic diversity you ask?

I will let the authors answer:

Information about specific proteins and their muteins are widely spread in the literature. Many studies only describe single mutation and its effects without comparison to already known muteins. Possible additive effects of single amino acid changes are scarcely described or used. Even after a thorough and time-consuming literature search, researchers face the problem of assembling and presenting the data in an easy understandable and comprehensive way. Essential information may be lost such as details about potentially cooperative mutations or reactions one would not expect in certain protein families. Therefore, a web-accessible database combining available knowledge about a specific enzyme and its muteins in a single place are highly desirable. Such a database would allow researchers to access relevant information about their protein of interest in a fast and easy way and accelerate the engineering of new and improved variants. (Third paragraph of the introduction)

I would have never dreamed that gene data would be spread to Hell and back. 😉

The article will give you insight into how gene data is collected, searched, organized, etc. All of which will be valuable to you whether you are designing or using information systems in this area.

I was a bit let down when I read about data formats:

Most of them are XML based, which can be difficult to create and manipulate. Therefore, simpler, spreadsheet-based formats have been introduced which are more accessible for the individual researcher.

I’ve never had any difficulties with XML based formats but will admit that may not be a universal experience. Sounds to me like the XML community should concentrate a bit less on making people write angle-bang syntax and more on long term useful results. (Which I think XML can deliver.)

Asgard for Cloud Management and Deployment

Friday, June 29th, 2012

Asgard for Cloud Management and Deployment

Amazon is tooting the horn of one of its larger customers, Netflix, when it says:

Our friends at Netflix have embraced AWS whole-heartedly. They have shared much of what they have learned about how they use AWS to build, deploy, and host their applications. You can read the Netflix Tech Blog and benefit from what they have learned.

Earlier this week they released Asgard, a web-based cloud management and deployment tool, in open source form on GitHub. According to Norse mythology, Asgard is the home of the god of thunder and lightning, and therefore controls the clouds! This is the same tool that the engineers at Netflix use to control their applications and their deployments.

Asgard layers two additional abstractions on top of AWS — Applications and Clusters.

Even if you are just in the planning (dreaming?) stages of cloud deployment for your topic map application, it would be good to review the Netflix blog. On Asgard and other posts as well.

You know how I hate to complain, ;-), but the Elder Edda does not report “Asgard” as the “home of the god of thunder and lightning.” All the gods resided at Asgard.

Even the link in the quoted part of Jeff’s post gets that much right.

Most of the time old stories told aright are more moving than modern misconceptions.

Detecting Emergent Conflicts with Recorded Future + Ushahidi

Friday, June 29th, 2012

Detecting Emergent Conflicts with Recorded Future + Ushahidi by Ninja Shoes. (?)

From the post:

An ocean of data is available on the web. From this ocean of data, information can in theory be extracted and used by analysts for detecting emergent trends (trend spotting). However, to do this manually is a daunting and nearly impossible task. In this study we describe a semi-automatic system in which data is automatically collected from selected sources, and to which linguistic analysis is applied to extract e.g., entities and events. After combining the extracted information with human intelligence reports, the results are visualized to the user of the system who can interact with it in order to obtain a better awareness of historic as well as emergent trends. A prototype of the proposed system has been implemented and some initial results are presented in the paper.

The paper in question.

A fairly remarkable bit of work that illustrates the current capabilities for mining the web and also its limitations.

The processing of news feeds for protest reports is interesting, but mistakes the result of years of activity as an “emergent” conflict.

If you were going to capture the data that would enable a human analyst to “predict” the Arab Spring, you would have to begin with union organizing activities. Not the sort of thing that is going to make news reports on the WWW.

For that you would need traditional human intelligence. From people who don’t spend their days debating traffic or reports with other non-native staffers. Or meeting with managers from Washington or Stockholm.

Or let me put it this way:

Mining the web doesn’t equal useful results. Just as mining for gold doesn’t mean you will find any.

Fusion and inference from multiple data sources in a commensurate space

Friday, June 29th, 2012

Fusion and inference from multiple data sources in a commensurate space by Zhiliang Ma, David J. Marchette and Carey E. Priebe. (Ma, Z., Marchette, D. J. and Priebe, C. E. (2012), Fusion and inference from multiple data sources in a commensurate space. Statistical Analy Data Mining, 5: 187–193. doi: 10.1002/sam.11142)


Given objects measured under multiple conditions—for example, indoor lighting versus outdoor lighting for face recognition, multiple language translation for document matching, etc.—the challenging task is to perform data fusion and utilize all the available information for inferential purposes. We consider two exploitation tasks: (i) how to determine whether a set of feature vectors represent a single object measured under different conditions; and (ii) how to create a classifier based on training data from one condition in order to classify objects measured under other conditions. The key to both problems is to transform data from multiple conditions into one commensurate space, where the (transformed) feature vectors are comparable and would be treated as if they were collected under the same condition. Toward this end, we studied Procrustes analysis and developed a new approach, which uses the interpoint dissimilarities for each condition. We impute the dissimilarities between measurements of different conditions to create one omnibus dissimilarity matrix, which is then embedded into Euclidean space. We illustrate our methodology on English and French documents collected from Wikipedia, demonstrating superior performance compared to that obtained via standard Procrustes transformation.
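The abstract uses the standard Procrustes transformation as its baseline. As a rough illustration of that baseline only (not the authors' dissimilarity-based method), here is a sketch restricted to two dimensions with rotation only, no scaling or reflection. In 2D the best rotation has a closed form, so plain Python is enough. The point sets are hypothetical:

```python
import math

def center(points):
    """Translate points so that their centroid sits at the origin."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return [(x - cx, y - cy) for x, y in points]

def rotate(points, theta):
    """Rotate points counter-clockwise by theta radians about the origin."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def procrustes_rotation(source, target):
    """Angle of the rotation that best maps centered source onto centered target
    (least squares over matched point pairs)."""
    src, tgt = center(source), center(target)
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(src, tgt))
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(src, tgt))
    return math.atan2(cross, dot)

# Hypothetical "same objects under two conditions": a rotated and shifted copy.
source = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
target = [(x + 5.0, y - 3.0) for x, y in rotate(source, math.pi / 2)]

theta = procrustes_rotation(source, target)
aligned = rotate(center(source), theta)  # now comparable with center(target)
print(round(math.degrees(theta), 6))
```

Once both sets sit in the same coordinates, distances between matched points become meaningful, which is the "commensurate space" idea in miniature.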

An early example of identity issues in topic maps from Steve Newcomb made this paper resonate for me. Steve used the example that his home has a set of geographic coordinates, a street address and a set of directions to arrive at his home, all of which identify the same subject. All the things that can be said using one identifier can be gathered up with statements using the other identifiers.
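The gathering-up step can be sketched in a few lines. This is a toy illustration of merging on shared identifiers, not an implementation of any topic map standard; the identifier strings and statements are hypothetical placeholders:

```python
def merge_by_identifier(statements):
    """Group statements whose identifier sets overlap, directly or through a chain.

    statements: iterable of (identifiers, fact) pairs.
    Returns a list of (set of identifiers, list of facts), one entry per subject.
    """
    subjects = []
    for identifiers, fact in statements:
        ids = set(identifiers)
        facts = []
        untouched = []
        for known_ids, known_facts in subjects:
            if known_ids & ids:
                # Same subject: absorb its identifiers and what was said about it.
                ids |= known_ids
                facts.extend(known_facts)
            else:
                untouched.append((known_ids, known_facts))
        facts.append(fact)
        untouched.append((ids, facts))
        subjects = untouched
    return subjects

# Hypothetical identifiers for one house, used by three different sources.
statements = [
    ({"coordinates:placeholder", "address:placeholder"}, "statement from a map"),
    ({"directions:placeholder"}, "statement from a party invitation"),
    ({"address:placeholder", "directions:placeholder"}, "statement from a letter"),
]
subjects = merge_by_identifier(statements)
print(subjects)
```

Note that the third statement is what ties the first two together: until an identifier is shared, the system has no way to know they describe the same subject.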

While I still have reservations about the use of Euclidean space when dealing with non-Euclidean semantics, one has to admit that it is possible to derive some value from it.

I had to file an ILL for a print copy of the article. More to follow when it arrives.

Bruce: How Well Does Current Legislative Identifier Practice Measure Up?

Friday, June 29th, 2012

Bruce: How Well Does Current Legislative Identifier Practice Measure Up?

From Legal Informatics:

Tom Bruce of the Legal Information Institute at Cornell University Law School (LII) has posted Identifiers, Part 3: How Well Does Current Practice Measure Up?, on LII’s new legislative metadata blog, Making Metasausage.

In this post, Tom surveys legislative identifier systems currently in use. He recommends the use of URIs for legislative identifiers, rather than URLs or URNs.

He cites favorably the URI-based identifier system that John Sheridan and Dr. Jeni Tennison developed for the system. Tom praises Sheridan’s (here) and Tennison’s (here and here) writings on legislative URIs and Linked Data.

Tom also praises the URI system implemented by Dr. Rinke Hoekstra in the Leibniz Center for Law‘s Metalex Document Server for facilitating point-in-time as well as point-in-process identification of legislation.

Tom concludes by making a series of recommendations for a legislative identifier system:

See the post for his recommendations (in case you are working on such a system) and for other links.

I would point out that existing legislation has identifiers from before it receives the “better” identifiers specified here.

And those “old” identifiers will have been incorporated into other texts, legal decisions and the like.


We can’t re-write existing identifiers so it’s a good thing topic maps accept subjects having identifiers, plural.

National Centre for Text Mining (NaCTeM)

Friday, June 29th, 2012

National Centre for Text Mining (NaCTeM)

From the webpage:

The National Centre for Text Mining (NaCTeM) is the first publicly-funded text mining centre in the world. We provide text mining services in response to the requirements of the UK academic community. NaCTeM is operated by the University of Manchester with close collaboration with the University of Tokyo.

On our website, you can find pointers to sources of information about text mining such as links to

  • text mining services provided by NaCTeM
  • software tools, both those developed by the NaCTeM team and by other text mining groups
  • seminars, general events, conferences and workshops
  • tutorials and demonstrations
  • text mining publications

Let us know if you would like to include any of the above in our website.

This is a real treasure trove of software, resources and other materials.

I will be working in reports on “finds” at this site for quite some time.

DDC 23 released as linked data at

Friday, June 29th, 2012

DDC 23 released as linked data at

From the post:

As announced on Monday at the seminar “Global Interoperability and Linked Data in Libraries” in beautiful Florence, an exciting new set of linked data has been added to All assignable classes from DDC 23, the current full edition of the Dewey Decimal Classification, have been released as Dewey linked data. As was the case for the Abridged Edition 14 data, we define “assignable” as including every schedule number that is not a span or a centered entry, bracketed or optional, with the hierarchical relationships adjusted accordingly. In short, these are numbers that you find attached to many WorldCat records as standard Dewey numbers (in 082 fields), as additional Dewey numbers (in 083 fields), or as number components (in 085 fields).

The classes are exposed with full number and caption information and semantic relationships expressed in SKOS, which makes the information easily accessible and parsable by a wide variety of semantic web applications.

This recent addition massively expands the data set by over 38.000 Dewey classes (or, for the linked data geeks out there, by over 1 million triples), increasing the number of classes available almost tenfold. If you like, take some time to explore the hierarchies; you might be surprised to find numbers for Maya calendar or transits of Venus (loyal blog readers will recognize these numbers).

All the old goodies are still there, of course. Depending on which type of user agent is accessing the data (e.g., a browser) a different representation is negotiated (HTML or various flavors of RDF). The HTML pages still include RDFa markup, which can be distilled into RDF by browser plug-ins and other applications without the user ever having to deal with the RDF data directly.
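The negotiation described above comes down to reading the HTTP Accept header and picking a representation. A simplified sketch, assuming a server that can produce three media types (my assumption, the post only says "HTML or various flavors of RDF"). It ignores partial wildcards such as text/* and the specificity rules of the HTTP specification:

```python
def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, highest q first."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q values keep their header order.
    return sorted(prefs, key=lambda p: p[1], reverse=True)

# Representations this hypothetical server can produce.
AVAILABLE = ["text/html", "application/rdf+xml", "text/turtle"]

def negotiate(header, available=AVAILABLE, default="text/html"):
    """Pick the client's most preferred media type that the server offers."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return default
        if media_type in available:
            return media_type
    return default

print(negotiate("text/turtle, text/html;q=0.5"))
```

A browser that leads with text/html gets the HTML page, while a semantic web client asking for an RDF serialization gets that instead, all from the same URI.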

More details follow but that should be enough to capture your interest.
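To give a feel for what "parsable" means here, a minimal sketch in Python using only the standard library. The RDF/XML snippet is hand-written for illustration: the example.org URIs are placeholders, and the notation, caption and broader link are illustrative values, not data copied from the Dewey service. The function name `read_concepts` is my own.

```python
import xml.etree.ElementTree as ET

# Hand-written RDF/XML for illustration only. The example.org URIs are
# placeholders, not the identifiers the Dewey service actually uses.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/class/529/">
    <skos:notation>529</skos:notation>
    <skos:prefLabel xml:lang="en">Chronology</skos:prefLabel>
    <skos:broader rdf:resource="http://example.org/class/52/"/>
  </skos:Concept>
</rdf:RDF>
"""

# ElementTree spells namespaced names as "{namespace-uri}local-name".
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
SKOS = "{http://www.w3.org/2004/02/skos/core#}"


def read_concepts(rdf_xml):
    """Return one dict per skos:Concept element found in an RDF/XML string."""
    root = ET.fromstring(rdf_xml)
    concepts = []
    for node in root.iter(SKOS + "Concept"):
        concepts.append({
            "uri": node.get(RDF + "about"),
            "notation": node.findtext(SKOS + "notation"),
            "caption": node.findtext(SKOS + "prefLabel"),
            "broader": [b.get(RDF + "resource")
                        for b in node.findall(SKOS + "broader")],
        })
    return concepts


concepts = read_concepts(SAMPLE)
for concept in concepts:
    print(concept["notation"], concept["caption"], "->", concept["broader"])
```

RDF/XML allows many equivalent spellings of the same triples, so for real data a dedicated RDF library is likely the safer choice than walking the XML directly as I do here.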

Good thing there is a pointer for the Maya calendar. Would hate for interstellar archaeologists to think we were too slow to invent a classification number for the disaster that is supposed to befall us this coming December.

I have renewed my ACM and various SIG memberships to run beyond December 2012. In the event of an actual disaster refunds will not be an issue. 😉

Clustering high dimensional data

Thursday, June 28th, 2012

Clustering high dimensional data by Ira Assent. (Assent, I. (2012), Clustering high dimensional data. WIREs Data Mining Knowl Discov, 2: 340–350. doi: 10.1002/widm.1062)


High-dimensional data, i.e., data described by a large number of attributes, pose specific challenges to clustering. The so-called ‘curse of dimensionality’, coined originally to describe the general increase in complexity of various computational problems as dimensionality increases, is known to render traditional clustering algorithms ineffective. The curse of dimensionality, among other effects, means that with increasing number of dimensions, a loss of meaningful differentiation between similar and dissimilar objects is observed. As high-dimensional objects appear almost alike, new approaches for clustering are required. Consequently, recent research has focused on developing techniques and clustering algorithms specifically for high-dimensional data. Still, open research issues remain. Clustering is a data mining task devoted to the automatic grouping of data based on mutual similarity. Each cluster groups objects that are similar to one another, whereas dissimilar objects are assigned to different clusters, possibly separating out noise. In this manner, clusters describe the data structure in an unsupervised manner, i.e., without the need for class labels. A number of clustering paradigms exist that provide different cluster models and different algorithmic approaches for cluster detection. Common to all approaches is the fact that they require some underlying assessment of similarity between data objects. In this article, we provide an overview of the effects of high-dimensional spaces, and their implications for different clustering paradigms. We review models and algorithms that address clustering in high dimensions, with pointers to the literature, and sketch open research issues. We conclude with a summary of the state of the art.

The author has a clever example (figure 4) of why adding dimensions can decrease the discernment of distinct groups in data. A problem that worsens as the number of dimensions increases.
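The effect is easy to see for yourself. What follows is my own small sketch, not a reproduction of the paper's figure: it draws uniformly random points, assumes plain Euclidean distance, and reports how much farther the farthest point is than the nearest one, relative to the nearest. The function name `relative_contrast` and the sample sizes are arbitrary choices of mine.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from one random query point."""
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


rng = random.Random(0)  # fixed seed so runs are repeatable
contrast = {}
for dims in (2, 10, 100, 1000):
    # Average over a few query points to smooth out sampling noise.
    trials = [relative_contrast(200, dims, rng) for _ in range(10)]
    contrast[dims] = sum(trials) / len(trials)
    print(f"{dims:5d} dimensions: relative contrast {contrast[dims]:.2f}")
```

As the number of dimensions grows the contrast shrinks toward zero, which is the "objects appear almost alike" effect the abstract describes. Note that every dimension here is independent noise, weighted equally.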

Or does it? Or is it the case that by weighting all dimensions equally we get the result we deserve?

My counter-example would be introducing you to twin sisters. As the number of dimensions increased, so would the similarity that would befoul any clustering algorithm.

But the important dimension, their names, is sufficient to cluster attributes around the appropriate data points.

Is the “curse of dimensionality” rather a “failure to choose dimensions wisely?”
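One way to read the twin example as code, offered as a sketch under my own assumptions: two groups that differ in exactly one dimension (standing in for the name), padded with 100 dimensions of Gaussian noise that say nothing about group membership. The helper names and the nearest-neighbour agreement measure are mine, chosen only to make the point visible.

```python
import random


def make_points(n_per_group, n_noise_dims, rng):
    """Two groups that differ only in dimension 0; the rest is noise."""
    points = []
    for label in (0, 1):
        for _ in range(n_per_group):
            noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise_dims)]
            points.append((label, [float(label)] + noise))
    return points


def weighted_sq_distance(a, b, weights):
    """Squared Euclidean distance with a per-dimension weight."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))


def nearest_neighbour_agreement(points, weights):
    """Fraction of points whose nearest neighbour is in the same group."""
    hits = 0
    for i, (label, features) in enumerate(points):
        best = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: weighted_sq_distance(features, points[j][1], weights),
        )
        hits += points[best][0] == label
    return hits / len(points)


rng = random.Random(1)  # fixed seed so runs are repeatable
n_noise = 100
points = make_points(30, n_noise, rng)

equal_weights = [1.0] * (n_noise + 1)
name_only = [1.0] + [0.0] * n_noise  # trust only the informative dimension

equal_score = nearest_neighbour_agreement(points, equal_weights)
chosen_score = nearest_neighbour_agreement(points, name_only)
print(f"all dimensions weighted equally: {equal_score:.2f}")
print(f"informative dimension only:      {chosen_score:.2f}")
```

With equal weights the noise swamps the one dimension that matters. With the weight placed on that dimension alone, every point's nearest neighbour is in its own group. Of course, the sketch assumes we already know which dimension to trust, which is the hard part in practice.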

R and Data Mining (

Thursday, June 28th, 2012

R and Data Mining (

I have mentioned several resources from this site:

R Reference Card for Data Mining [Annotated TOC?]

An Example of Social Network Analysis with R using Package igraph

Book “R and Data Mining: Examples and Case Studies” on CRAN [blank chapters]

Online resources for handling big data and parallel computing in R

There are others I have yet to cover and new ones will be appearing. If you are using R for data mining, a good site to re-visit on a regular basis.

R Reference Card for Data Mining [Annotated TOC?]

Thursday, June 28th, 2012

R Reference Card for Data Mining

A good reference to have at hand.

For teaching/learning purposes, use this listing as an annotated table of contents and create an entry for each item demonstrating its use.

The result will be a broader and deeper survey of R data mining techniques than you are likely to encounter otherwise.

First seen in Christophe Lalanne’s A bag of tweets / June 2012.

Pig as Teacher

Thursday, June 28th, 2012

Russell Jurney summarizes machine learning using Pig at the Hadoop Summit:

Jimmy Lin’s sold out talk about Large Scale Machine Learning at Twitter (paper available) (slides available) described the use of Pig to train machine learning algorithms at scale using Hadoop. Interestingly, learning was achieved using a Pig UDF StoreFunc (documentation available). Some interesting, related work can be found by Ted Dunning on github (source available).

The emphasis isn’t on innovation per se but on using Pig to create workflows that include machine learning on large data sets.

Read in detail for the Pig techniques (which you can reuse elsewhere) and the machine learning examples.

Flexible Indexing in Hadoop

Thursday, June 28th, 2012

Flexible Indexing in Hadoop by Dmitriy Ryaboy.

Summarized by Russell Jurney as:

There was much excitement about Dmitriy Ryaboy’s talk about Flexible Indexing in Hadoop (slides available). Twitter has created a novel indexing system atop Hadoop to avoid “Looking for needles in haystacks with snowplows,” or – using mapreduce over lots of data to pick out a few records. Twitter Analytics’s new tool, Elephant Twin goes beyond folder/subfolder partitioning schemes used by many, for instance bucketizing data by /year/month/week/day/hour. Elephant Twin is a framework for creating indexes in Hadoop using Lucene. This enables you to push filtering down into Lucene, to return a few records and to dramatically reduce the records streamed and the time spent on jobs that only parse a small subset of your overall data. A huge boon for the Hadoop Community from Twitter!

The slides plus a slide-by-slide transcript of the presentation are available.
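The needle-and-snowplow point can be shown with a toy in Python. This is only a sketch of the general idea of looking records up through an index instead of scanning them all. It does not use Lucene, Hadoop or Elephant Twin, and the records, field names and function names are made up for illustration.

```python
from collections import defaultdict

# Made-up log records; the field names are placeholders for illustration.
records = [
    {"id": 0, "user": "alice", "event": "login"},
    {"id": 1, "user": "bob", "event": "click"},
    {"id": 2, "user": "alice", "event": "click"},
    {"id": 3, "user": "carol", "event": "login"},
    {"id": 4, "user": "bob", "event": "logout"},
]


def full_scan(records, field, value):
    """The 'snowplow': read every record, keep the few that match."""
    examined = 0
    matches = []
    for record in records:
        examined += 1
        if record[field] == value:
            matches.append(record)
    return matches, examined


def build_index(records, field):
    """Map each value of `field` to the positions of records holding it."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[field]].append(position)
    return index


def indexed_lookup(records, index, value):
    """Read only the records the index points at."""
    positions = index.get(value, [])
    return [records[p] for p in positions], len(positions)


user_index = build_index(records, "user")
scan_hits, scan_reads = full_scan(records, "user", "carol")
index_hits, index_reads = indexed_lookup(records, user_index, "carol")
print("records read by scan:", scan_reads)
print("records read via index:", index_reads)
```

Both paths return the same record, but the scan touches all five records while the indexed lookup touches one. The index costs one pass to build, so it pays off when the same data is queried repeatedly for small subsets.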

Going in the opposite direction of some national security efforts, which are creating bigger haystacks for the purpose of having larger haystacks.

There are a number of legitimately large haystacks in medicine, physics, astronomy, chemistry and any number of other disciplines. Grabbing all phone traffic to avoid saying you chose the < 5,000 potential subjects of interest is just bad planning.


Thursday, June 28th, 2012


From the Twitter Ambrose project page:

Twitter Ambrose is a platform for visualization and real-time monitoring of MapReduce data workflows. It presents a global view of all the map-reduce jobs derived from your workflow after planning and optimization. As jobs are submitted for execution on your Hadoop cluster, Ambrose updates its visualization to reflect the latest job status, polled from your process.

Ambrose provides the following in a web UI:

  • A chord diagram to visualize job dependencies and current state
  • A table view of all the associated jobs, along with their current state
  • A highlight view of the currently running jobs
  • An overall script progress bar

One of the items that Russell Jurney reports on in his summary of the Hadoop Summit 2012.

Limited to Pig at the moment but looks quite useful.

A Simple URL Shortener for Legal Materials:, by Ontolawgy

Thursday, June 28th, 2012

A Simple URL Shortener for Legal Materials:, by Ontolawgy

Legal Informatics reports on a URL shortener for legal citations.

Not exactly the usual URL shortener: it produces a “human readable” URL for U.S. Congress, U.S. Public Laws, the U.S. Code, and the Federal Register citations.


U.S. Public Law 111-148


For a plain text version:
or L. 111-148 text

Human readable citation practices existed at the time of the design of URLs. Another missed opportunity that we are still paying for.
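To make "human readable" concrete, here is a rough sketch in Python of turning a citation into a readable path. The path scheme (`pl/…`, `usc/…`, `fr/…`) is invented for this example and is not the format the Ontolawgy service uses. The patterns cover only a few common citation spellings, and the function name is my own.

```python
import re

# Hypothetical path scheme, invented for this sketch. It is NOT the
# format used by the shortener discussed in the post.
PATTERNS = [
    # "U.S. Public Law 111-148" or "Pub. L. 111-148"
    (re.compile(r"^(?:U\.S\. )?Pub(?:lic|\.) L(?:aw|\.) (\d+)-(\d+)$"),
     "pl/{0}/{1}"),
    # "42 U.S.C. § 1983" or "42 U.S.C. 1983"
    (re.compile(r"^(\d+) U\.S\.C\. (?:§ ?)?(\w+)$"),
     "usc/{0}/{1}"),
    # "77 FR 12345" or "77 Fed. Reg. 12345"
    (re.compile(r"^(\d+) (?:FR|Fed\. Reg\.) (\d+)$"),
     "fr/{0}/{1}"),
]


def citation_to_path(citation):
    """Return a readable path for a recognised citation, else None."""
    text = citation.strip()
    for pattern, template in PATTERNS:
        match = pattern.match(text)
        if match:
            return template.format(*match.groups())
    return None


for example in ("U.S. Public Law 111-148", "42 U.S.C. § 1983", "77 FR 12345"):
    print(example, "->", citation_to_path(example))
```

The point of the design is that the path carries the same numbers a reader would cite by hand, so the URL can be guessed from the citation and the citation recovered from the URL.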

My Review of Hadoop Summit 2012

Thursday, June 28th, 2012

My Review of Hadoop Summit 2012 by Russell Jurney.

I wasn’t present but given Russell’s comment, Hadoop Summit 2012 was a very exciting event.

I have been struggling with how to summarize an already concise post so I will just point you to Russell’s review of the conference.

There are a couple of items I will call out for special mention but in the meantime, go read the review.

Heavy use of equations impedes communication among biologists

Thursday, June 28th, 2012

Heavy use of equations impedes communication among biologists by Tim W. Fawcett and Andrew D. Higginson. (Proceedings of the National Academy of Sciences, June 25, 2012 DOI: 10.1073/pnas.1205259109)


Most research in biology is empirical, yet empirical studies rely fundamentally on theoretical work for generating testable predictions and interpreting observations. Despite this interdependence, many empirical studies build largely on other empirical studies with little direct reference to relevant theory, suggesting a failure of communication that may hinder scientific progress. To investigate the extent of this problem, we analyzed how the use of mathematical equations affects the scientific impact of studies in ecology and evolution. The density of equations in an article has a significant negative impact on citation rates, with papers receiving 28% fewer citations overall for each additional equation per page in the main text. Long, equation-dense papers tend to be more frequently cited by other theoretical papers, but this increase is outweighed by a sharp drop in citations from nontheoretical papers (35% fewer citations for each additional equation per page in the main text). In contrast, equations presented in an accompanying appendix do not lessen a paper’s impact. Our analysis suggests possible strategies for enhancing the presentation of mathematical models to facilitate progress in disciplines that rely on the tight integration of theoretical and empirical work.

I first saw this in Scientists Struggle With Mathematical Details, Study by Biologists Finds, where Higginson remarks on one intermediate solution:

Scientists need to think more carefully about how they present the mathematical details of their work. The ideal solution is not to hide the maths away, but to add more explanatory text to take the reader carefully through the assumptions and implications of the theory.

An excellent suggestion, considering that scientists don’t speak to each other in notation but in less precise natural language.

Data Integration Services & Hortonworks Data Platform

Thursday, June 28th, 2012

Data Integration Services & Hortonworks Data Platform by Jim Walker

From the post:

What’s possible with all this data?

Data Integration is a key component of the Hadoop solution architecture. It is the first obstacle encountered once your cluster is up and running. Ok, I have a cluster… now what? Do I write a script to move the data? What is the language? Isn’t this just ETL with HDFS as another target? Well, yes…

Sure you can write custom scripts to perform a load, but that is hardly repeatable and not viable in the long term. You could also use Apache Sqoop (available in HDP today), which is a tool to push bulk data from relational stores into HDFS. While effective and great for basic loads, there is work to be done on the connections and transforms necessary in these types of flows. While custom scripts and Sqoop are both viable alternatives, they won’t cover everything and you still need to be a bit technical to be successful.

For wide scale adoption of Apache Hadoop, tools that abstract integration complexity are necessary for the rest of us. Enter Talend Open Studio for Big Data. We have worked with Talend in order to deeply integrate their graphical data integration tools with HDP as well as extend their offering beyond HDFS, Hive, Pig and HBase into HCatalog (metadata service) and Oozie (workflow and job scheduler).

Jim covers four advantages of using Talend:

  • Bridge the skills gap
  • HCatalog Integration
  • Connect to the entire enterprise
  • Graphic Pig Script Creation

Definitely something to keep in mind.