Archive for September, 2013

Kali Linux: The Ultimate Penetration-Testing Tool?

Monday, September 30th, 2013

Kali Linux: The Ultimate Penetration-Testing Tool? by David Strom.

From the post:

This version of Linux is an advanced penetration testing tool that should be a part of every security professional’s toolbox. Penetration testing involves using a variety of tools and techniques to test the limits of security policies and procedures. What Kali has done is collect just about everything you’ll need in a single CD. It includes more than 300 different tools, all of which are open source and available on GitHub. It’s incredibly well done, especially considering that it’s completely free of charge.

A new version, 1.0.5, was released earlier in September and contains more goodies than ever before, including the ability to install it on just about any Android phone, various improvements to its wireless radio support, near field communications, and tons more. Let’s take a closer look.

David gives a short summary of the latest release of Kali Linux.

A set of thumb drives should be on your short present list for the holiday season!

Lucene now has an in-memory terms dictionary…

Monday, September 30th, 2013

Lucene now has an in-memory terms dictionary, thanks to Google Summer of Code by Mike McCandless.

From the post:

Last year, Han Jiang’s Google Summer of Code project was a big success: he created a new (now, default) postings format for substantially faster searches, along with smaller indices.

This summer, Han was at it again, with a new Google Summer of Code project with Lucene: he created a new terms dictionary holding all terms and their metadata in memory as an FST.

In fact, he created two new terms dictionary implementations. The first, FSTTermsWriter/Reader, hold all terms and metadata in a single in-memory FST, while the second, FSTOrdTermsWriter/Reader, does the same but also supports retrieving the ordinal for a term (TermsEnum.ord()) and looking up a term given its ordinal (TermsEnum.seekExact(long ord)). The second one also uses this ord internally so that the FST is more compact, while all metadata is stored outside of the FST, referenced by ord.
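
The ord-based API can be pictured with a toy sorted-terms dictionary. This is a plain Python sketch of the TermsEnum.ord()/seekExact(ord) contract only; Lucene itself stores the terms far more compactly in an FST:

```python
import bisect

class TermsDict:
    """Toy in-memory terms dictionary exposing the ord()/seekExact(ord)
    contract. Lucene stores terms far more compactly in an FST; a plain
    sorted list gives the same observable behavior."""

    def __init__(self, terms):
        self.terms = sorted(set(terms))  # terms kept in sorted (ord) order

    def ord(self, term):
        """Ordinal of a term (cf. TermsEnum.ord())."""
        i = bisect.bisect_left(self.terms, term)
        if i < len(self.terms) and self.terms[i] == term:
            return i
        raise KeyError(term)

    def seek_exact(self, ord_):
        """Term for a given ordinal (cf. TermsEnum.seekExact(long ord))."""
        return self.terms[ord_]

d = TermsDict(["lucene", "apache", "fst", "apache"])  # sorted: apache, fst, lucene
```

Storing metadata outside the structure and referencing it by ord, as the second implementation does, is exactly what makes the term store itself so compact.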

Lucene continues to improve, rapidly!

How-to: Use HBase Bulk Loading, and Why

Monday, September 30th, 2013

How-to: Use HBase Bulk Loading, and Why by Jean-Daniel (JD) Cryans.

From the post:

Apache HBase is all about giving you random, real-time, read/write access to your Big Data, but how do you efficiently get that data into HBase in the first place? Intuitively, a new user will try to do that via the client APIs or by using a MapReduce job with TableOutputFormat, but those approaches are problematic, as you will learn below. Instead, the HBase bulk loading feature is much easier to use and can insert the same amount of data more quickly.

This blog post will introduce the basic concepts of the bulk loading feature, present two use cases, and propose two examples.

Overview of Bulk Loading

If you have any of these symptoms, bulk loading is probably the right choice for you:

  • You needed to tweak your MemStores to use most of the memory.
  • You needed to either use bigger WALs or bypass them entirely.
  • Your compaction and flush queues are in the hundreds.
  • Your GC is out of control because your inserts range in the MBs.
  • Your latency goes out of your SLA when you import data.

Most of those symptoms are commonly referred to as “growing pains.” Using bulk loading can help you avoid them.
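
The reason bulk loading sidesteps those pains is that it writes HFiles directly: rows are partitioned by region boundary and sorted before being handed to the region servers. A toy Python sketch of that partition-and-sort step (illustrative only; in HBase the real work is done by HFileOutputFormat’s total-order partitioner and the completebulkload tool):

```python
import bisect

def partition_rows(rows, split_keys):
    """Bucket row keys by region (keys at or past a split key fall into
    the following region), then sort within each bucket. This mimics the
    partition-and-sort that happens before sorted HFiles are handed off
    for bulk loading."""
    buckets = [[] for _ in range(len(split_keys) + 1)]
    for row in rows:
        buckets[bisect.bisect_right(split_keys, row)].append(row)
    return [sorted(b) for b in buckets]

# two regions split at "r4": ["r1", "r3"] and ["r5", "r9"]
regions = partition_rows(["r9", "r1", "r5", "r3"], split_keys=["r4"])
```

Because each file arrives pre-sorted for exactly one region, the inserts bypass the MemStore and WAL entirely, which is why the symptoms above disappear.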

Great post!

I would be very leery of database or database-like software that doesn’t offer bulk loading.

OTexts.org is launched

Monday, September 30th, 2013

OTexts.org is launched by Rob J Hyndman.

From the post:

The publishing platform I set up for my forecasting book has now been extended to cover more books and greater functionality. Check it out at www.otexts.org.

So far, we have three complete books:

  1. Forecasting: principles and practice, by Rob J Hyndman and George Athanasopoulos
  2. Statistical foundations of machine learning, by Gianluca Bontempi and Souhaib Ben Taieb
  3. Modal logic of strict necessity and possibility, by Evgeni Latinov

OTexts.org is looking for readers, authors and donors.

Saying you support open access is one thing.

Supporting open access by contributing content or funding is another.

Classifying Non-Patent Literature…

Monday, September 30th, 2013

Classifying Non-Patent Literature To Aid In Prior Art Searches by John Berryman.

From the post:

Before a patent can be granted, it must be proven beyond a reasonable doubt that the innovation outlined by the patent application is indeed novel. Similarly, when defending one’s own intellectual property against a non-practicing entity (NPE – also known as a patent troll) one often attempts to prove that the patent held by the accuser is invalid by showing that relevant prior art already exists and that their patent is actually not that novel.

Finding Prior Art

So where does one get ahold of pertinent prior art? The most obvious place to look is in the text of earlier patents grants. If you can identify a set of reasonably related grants that covers the claims of the patent in question, then the patent may not be valid. In fact, if you are considering the validity of a patent application, then reviewing existing patents is certainly the first approach you should take. However, if you’re using this route to identify prior art for a patent held by an NPE, then you may be fighting an uphill battle. Consider that a very bright patent examiner has already taken this approach, and after an in-depth examination process, having found no relevant prior art, the patent office granted the very patent that you seek to invalidate.

But there is hope. For a patent to be granted, it must not only be novel among the roughly 10 million US patents that currently exist, but it must also be novel among all published media prior to the application date – so-called non-patent literature (NPL). This includes conference proceedings, academic articles, weblogs, or even YouTube videos. And if anyone – including the applicant themselves – publicly discloses information critical to their patent’s claims, then the patent may be rendered invalid. As a corollary, if you are looking to invalidate a patent, then looking for prior art in non-patent literature is a good idea! While tools are available to systematically search through patent grants, it is much more difficult to search through NPL. And if the patent in question truly is not novel, then evidence must surely exist – if only you knew where to look.

More suggestions than solutions but good suggestions, such as these, are hard to come by.

John suggests using existing patents and their classifications as a learning set to classify non-patent literature.

Interesting but patent language is highly stylized and quite unlike the descriptions you encounter in non-patent literature.

It would be an interesting experiment to take some subset of patents and their classifications along with a set of non-patent literature, known to describe the same “inventions” covered by the patents.
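
A toy version of that experiment could start with something as crude as one bag-of-words centroid per patent class, then ask which class a piece of non-patent literature lands nearest to. The class codes and text snippets below are only illustrative stand-ins, not real training data:

```python
import math
from collections import Counter

def tokens(text):
    return [w.lower().strip(".,") for w in text.split()]

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train_centroids(labeled_docs):
    """One bag-of-words centroid per patent class."""
    centroids = {}
    for label, text in labeled_docs:
        centroids.setdefault(label, Counter()).update(tokens(text))
    return centroids

def classify(centroids, text):
    """Assign a document to the class with the nearest centroid."""
    vec = Counter(tokens(text))
    return max(centroids, key=lambda c: cosine(centroids[c], vec))

# illustrative snippets standing in for patent abstracts
patents = [
    ("G06F", "method for indexing electronic documents in a database"),
    ("A61K", "pharmaceutical composition comprising an active compound"),
]
centroids = train_centroids(patents)
label = classify(centroids, "a system for searching documents in an index")
```

The gap between stylized patent language and ordinary prose, noted below, is precisely what would make the real experiment interesting.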

Suggestions for subject areas?

Design Patterns for Distributed…

Sunday, September 29th, 2013

Design Patterns for Distributed Non-Relational Databases by Todd Lipcon.

A bit dated (2009) but true design patterns should find refinement, not retirement.


  • Consistent Hashing
  • Consistency Models
  • Data Models
  • Storage Layouts
  • Log-Structured Merge Trees

Curious if you would suggest substantial changes to these patterns some four (4) years later?
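
As a refresher on the first pattern in the list, here is a minimal consistent-hashing ring in Python. It is a sketch of the general technique, not any particular database’s implementation:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: a key maps to the first node
    clockwise from its hash, so adding or removing one node only
    remaps the keys in that node's arc of the ring."""

    def __init__(self, nodes, vnodes=64):
        self.ring = []  # sorted (hash, node) pairs
        for node in nodes:
            for i in range(vnodes):  # virtual nodes even out the load
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()
        self._hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        """First node clockwise from the key's position on the ring."""
        i = bisect.bisect(self._hashes, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("some-row-key")
```

Virtual nodes are the standard refinement over the naive ring: they smooth out load when the node count is small, which is one reason this pattern has aged so well.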

JavaZone 2013

Sunday, September 29th, 2013

JavaZone 2013 (videos)

The JavaZone tweet I saw earlier today said five (5) lost videos had been found, so all one hundred and forty-nine (149) videos are up for viewing!

I should have saved this one for the holidays but at one or two a day, you may be done by the holidays! 😉

Hands-On Category Theory

Sunday, September 29th, 2013

Hands-On Category Theory by James Earl Douglas.

From the webpage:

James explores category theory concepts in the Scala REPL. This is the first version of the video; we have a separate screen capture and will publish a merged version later, with a reference here.

I don’t often see “hands-on” and “category theory” in the same sentence, much less in a presentation title. 😉

An interesting illustration of category theory being used in day to day programming.

See: Hands-On Category Theory at Github for notes on the presentation.

Perhaps this will pique your interest in category theory!

Notes on Category Theory

Sunday, September 29th, 2013

Notes on Category Theory by Robert L. Knighten.

From the preface:

There are many fine articles, notes, and books on category theory, so what is the excuse for publishing yet another tome on the subject? My initial excuse was altruistic, a student asked for help in learning the subject and none of the available sources was quite appropriate. But ultimately I recognized the personal and selfish desire to produce my own exposition of the subject. Despite that I have some hope that other students of the subject will find these notes useful.

The other generous explanation for “another tome” is to completely master a subject.

Either teach it or write a master tome on it.

I first saw this in a tweet by OGE Search.

Learning From Data

Sunday, September 29th, 2013

Learning From Data by Professor Yaser Abu-Mostafa.

Rather than being broken into smaller segments, these lectures are traditional lecture length.

Personally I prefer the longer lecture style over shorter snippets, such as were used for Learning from Data (an earlier version).


  • Lecture 1 (The Learning Problem)
  • Lecture 2 (Is Learning Feasible?)
  • Lecture 3 (The Linear Model I)
  • Lecture 4 (Error and Noise)
  • Lecture 5 (Training versus Testing)
  • Lecture 6 (Theory of Generalization)
  • Lecture 7 (The VC Dimension)
  • Lecture 8 (Bias-Variance Tradeoff)
  • Lecture 9 (The Linear Model II)
  • Lecture 10 (Neural Networks)
  • Lecture 11 (Overfitting)
  • Lecture 12 (Regularization)
  • Lecture 13 (Validation)
  • Lecture 14 (Support Vector Machines)
  • Lecture 15 (Kernel Methods)
  • Lecture 16 (Radial Basis Functions)
  • Lecture 17 (Three Learning Principles)
  • Lecture 18 (Epilogue)


Foundations of Data Science

Sunday, September 29th, 2013

Foundations of Data Science by John Hopcroft and Ravindran Kannan.

From the introduction:

Computer science as an academic discipline began in the 60’s. Emphasis was on programming languages, compilers, operating systems, and the mathematical theory that supported these areas. Courses in theoretical computer science covered finite automata, regular expressions, context free languages, and computability. In the 70’s, algorithms was added as an important component of theory. The emphasis was on making computers useful. Today, a fundamental change is taking place and the focus is more on applications. There are many reasons for this change. The merging of computing and communications has played an important role. The enhanced ability to observe, collect and store data in the natural sciences, in commerce, and in other fields calls for a change in our understanding of data and how to handle it in the modern setting. The emergence of the web and social networks, which are by far the largest such structures, presents both opportunities and challenges for theory.

While traditional areas of computer science are still important and highly skilled individuals are needed in these areas, the majority of researchers will be involved with using computers to understand and make usable massive data arising in applications, not just how to make computers useful on specific well-defined problems. With this in mind we have written this book to cover the theory likely to be useful in the next 40 years, just as automata theory, algorithms and related topics gave students an advantage in the last 40 years. One of the major changes is the switch from discrete mathematics to more of an emphasis on probability, statistics, and numerical methods.

In draft form but impressive!

Current chapters:

  1. Introduction
  2. High-Dimensional Space
  3. Random Graphs
  4. Singular Value Decomposition (SVD)
  5. Random Walks and Markov Chains
  6. Learning and the VC-dimension
  7. Algorithms for Massive Data Problems
  8. Clustering
  9. Topic Models, Hidden Markov Process, Graphical Models, and Belief Propagation
  10. Other Topics [Rankings, Hare System for Voting, Compressed Sensing and Sparse Vectors]
  11. Appendix

I am certain the authors would appreciate comments and suggestions concerning the text.

I first saw this in a tweet by CompSciFact.

Language support and linguistics

Saturday, September 28th, 2013

Language support and linguistics in Apache Lucene™ and Apache Solr™ and the eco-system by Gaute Lambertsen and Christian Moen.

Slides from Lucene Revolution May, 2013.

Good overview of language support and linguistics in both Lucene and Solr.

Fewer language examples at the beginning would shorten the slide deck from its current one hundred and fifty-one (151) slides without impairing its message.

Still, if you are unfamiliar with language support in Lucene and Solr, the extra examples don’t hurt anything.

Big Data Boot Camp Day 2,3 and 4, Simons Institute, Berkeley

Saturday, September 28th, 2013

Big Data Boot Camp Day 2,3 and 4, Simons Institute, Berkeley by Igor Carron.

Igor has posted links to videos and supplemental materials for:

  • Algorithmic High-Dimensional Geometry II by Alex Andoni, Microsoft Research
  • High-Dimensional Statistics I by Martin Wainwright, UC Berkeley
  • High-Dimensional Statistics II by Martin Wainwright, UC Berkeley
  • Optimization I by Ben Recht, UC Berkeley
  • Optimization II by Ben Recht, UC Berkeley
  • Past, Present and Future of Randomized Numerical Linear Algebra I by Petros Drineas, Rensselaer Polytechnic Institute & Michael Mahoney, Stanford University
  • Past, Present and Future of Randomized Numerical Linear Algebra II by Petros Drineas, Rensselaer Polytechnic Institute & Michael Mahoney, Stanford University
  • Some Geometric Perspectives on Combinatorics: High-Dimensional, Local and Local-to-Global I by Nati Linial, Hebrew University of Jerusalem
  • Some Geometric Perspectives on Combinatorics: High-Dimensional, Local and Local-to-Global II by Nati Linial, Hebrew University of Jerusalem
  • Streaming, Sketching and Sufficient Statistics I by Graham Cormode, University of Warwick
  • Streaming, Sketching and Sufficient Statistics II by Graham Cormode, University of Warwick
  • Theory and Big Data by Ravi Kannan, Microsoft Research India

Just in case you have run out of video material for the weekend.

Referencing a Tweet…

Saturday, September 28th, 2013

Referencing a Tweet in an Academic Paper? Here’s an Automatic Citation Generator by Rebecca Rosen.

I haven’t ever thought of formally referencing a tweet but Rebecca details the required format (MLA and APA) plus points us to a free generator for tweet citations.
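
Assembling such a citation is mostly string formatting once you have the fields. The sketch below follows the MLA guidance of that era, using the widely cited Abbottabad tweet as the example; check a current MLA handbook before relying on the exact punctuation:

```python
def mla_tweet_citation(last, first, handle, text, date, time):
    """Assemble an MLA-style tweet citation from its parts. The field
    order follows the MLA guidance of the era; consult a current
    handbook before relying on the exact punctuation."""
    return f'{last}, {first} ({handle}). "{text}" {date}, {time}. Tweet.'

citation = mla_tweet_citation(
    "Athar", "Sohaib", "ReallyVirtual",
    "Helicopter hovering above Abbottabad at 1AM (is a rare event).",
    "1 May 2011", "3:58 p.m.")
```

Which is exactly the kind of fiddly, easy-to-get-wrong detail an automatic generator is for.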

If I don’t make a note of this article, someone will ask me next week how to cite a tweet.

Or worse yet, someone in a standards body will decide that tweets are appropriate for normative references.

I kid you not. I ran across a citation in a 2013 draft to a ten-year-old version of the XML standard.

Not so bad if in the bibliography for a cattle data paper but this was in a markup vocabulary proposal.

Inside or outside a topic map, proper and accurate citation is a courtesy to your readers.

Building Applications with…

Saturday, September 28th, 2013

Building Applications with a Graph Database by Tobias Lindaaker.

Slides from JavaOne 2013.

Hard to tell without a video how much is lost by having only the slides.

However, the slides alone should make you curious (if not anxious) about trying to build an application based on a graph database.

Neo4j centric in places but not unexpectedly since it is hard to build an application in the abstract.

At least if you want visible results. 😉

Over one hundred (100) slides that merit a close read.

MANTRA: Free, online course on how to manage digital data

Saturday, September 28th, 2013

MANTRA: Free, online course on how to manage digital data by Sarah Dister.

From the post:

Research Data MANTRA is a free, online course with guidelines on how to manage the data you collect throughout your research. The course is particularly appropriate for those who work or are planning to work with digital data.

Once you have completed the course, you will:

  • Be aware of the risk of data loss and data protection requirements.
  • Know how to store and transport your data safely and securely (backup and encryption).
  • Have experience in using data in software packages such as R, SPSS, NVivo, or ArcGIS.
  • Recognise the importance of good research data management practice in your own context.
  • Be able to devise a research data management plan and apply it throughout the project’s life.
  • Be able to organise and document your data efficiently during the course of your project.
  • Understand the benefits of sharing data and how to do it legally and ethically.

Data management may not be as sexy as “big data” but without it, there would be no “big data” to make some of us sexy. 😉

Google Alters Search… [Pushy Suggestions]

Friday, September 27th, 2013

Google Alters Search to Handle More Complex Queries by Claire Cain Miller.

From the post:

Google on Thursday announced one of the biggest changes to its search engine, a rewriting of its algorithm to handle more complex queries that affects 90 percent of all searches.

The change, which represents a new approach to search for Google, required the biggest changes to the company’s search algorithm since 2000. Now, Google, the world’s most popular search engine, will focus more on trying to understand the meanings of and relationships among things, as opposed to its original strategy of matching keywords.

The company made the changes, executives said, because Google users are asking increasingly long and complex questions and are searching Google more often on mobile phones with voice search.

“They said, ‘Let’s go back and basically replace the engine of a 1950s car,’ ” said Danny Sullivan, founding editor of Search Engine Land, an industry blog. “It’s fair to say the general public seemed not to have noticed that Google ripped out its engine while driving down the road and replaced it with something else.”

One of the “other” changes is “pushy suggestions.”

In the last month I have noticed that if my search query is short, I will get Google’s suggested completion rather than my search request.

How short? It just has to be shorter than the completion suggested by Google.

A simple return means Google adopts its suggestion and not your request.

You don’t believe me?

OK, type in:


Note the autocompletion to:

That’s OK if I am searching for the cable company but not if I am searching for “charter” as in a charter for technical work.

I am required to actively avoid the suggestion by Google.

I can avoid Google’s “pushy suggestions” by hitting the space bar.

But like many people, I toss off Google searches without ever looking at the search or URL box. I don’t look up until I have the results. And now sometimes the wrong results.

I would rather have a search engine execute my search by default and its suggestions only when asked.

How about you?

Large Graphs on multi-GPUs

Friday, September 27th, 2013

Large Graphs on multi-GPUs by Enrico Mastrostefano.

From the abstract:

We studied the problem of developing an efficient BFS algorithm to explore large graphs having billions of nodes and edges. The size of the problem requires a parallel computing architecture. We proposed a new algorithm that performs a distributed BFS and the corresponding implementation on multi-GPU clusters. As far as we know, this is the first attempt to implement a distributed graph algorithm on that platform. Our study shows how most straightforward BFS implementations present significant computation and communication overheads. The main reason is that, at each iteration, the number of processed edges is greater than the number actually needed to determine the parent or the distance array (the standard output of the BFS): there is always redundant information at each step.

Reducing as much as possible this redundancy is essential in order to improve performance by minimizing the communication overhead. To this purpose, our algorithm performs, at each BFS level, a pruning procedure on the set of nodes that will be visited (NLFS). This step reduces both the amount of work required to enqueue new vertices and the size of messages exchanged among different tasks. To implement this pruning procedure efficiently is not trivial: none of the earlier works on GPU tackled that problem directly. The main issue is how to employ a sufficiently large number of threads and balance their workload, to fully exploit the GPU computing power.

To that purpose, we developed a new mapping of data elements to CUDA threads that uses a binary search function at its core. This mapping makes it possible to process the entire Next Level Frontier Set by mapping each element of the set to one CUDA thread (perfect load-balancing) so the available parallelism is exploited at its best. This mapping allows for an efficient filling of a global array that, for each BFS level, contains all the neighbors of the vertices in the queue as required by the pruning procedure (based on sort and unique operations) of the array.

This mapping is a substantial contribution of our work: it is quite simple and general and can be used in different contexts. We wish to highlight that it is this operation (and not the sorting) that makes it possible to exploit at its best the computing power of the GPU. To speed up the sort and unique operations we rely on very efficient implementations (like the radix sort) available in the CUDA Thrust library. We have shown that our algorithm has good scaling properties and, with 128 GPUs, it can traverse 3 billion edges per second (3 GTEPS for an input graph with 2^28 vertices). By comparing our results with those obtained on different architectures we have shown that our implementation is better or comparable to state-of-the-art implementations.

Among the operations that are performed during the BFS, the pruning of the NLFS is the most expensive in terms of execution time. Moreover, the overall computational time is greater than the time spent in communications. Our experiments show that the ratio between the time spent in computation and the time spent in communication reduces by increasing the number of tasks. For instance, with 4 GPUs the ratio is 2.125 whereas by using 64 GPUs the value is 1.12. The result can be explained as follows. In order to process the largest possible graph, the memory of each GPU is fully used and thus the subgraph assigned to each processor has a maximum (fixed) size. When the graph size increases we use more GPUs and the number of messages exchanged among nodes increases accordingly.

To maintain a good scalability using thousands of GPUs we need to further improve the communication mechanism that is, in the present implementation, quite simple. To this purpose, many studies employed a 2D partitioning of the graph to reduce the number of processors involved in communication. Such partitioning could be, in principle, implemented in our code and it will be the subject of a future work. (paragraphing was inserted into the abstract for readability)

Without any paragraph breaks the abstract was very difficult to read. Apologies if I have incorrectly inserted paragraph breaks.

If you have access to multiple GPUs, this should be very interesting work.
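
The pruning step at the heart of the algorithm, sort the candidate frontier, drop duplicates, then drop already-visited vertices, reads naturally even in a serial sketch. Python below; the thesis of course does this in parallel with CUDA sort/unique primitives:

```python
def bfs_with_pruning(adj, source):
    """Level-synchronous BFS over an adjacency dict, returning the
    distance array. The candidate next frontier is pruned at each level
    (sort + unique, then drop visited vertices) before being enqueued:
    the serial analogue of the thesis's NLFS pruning step."""
    dist = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        candidates = []
        for v in frontier:
            candidates.extend(adj.get(v, []))  # gather neighbors, duplicates and all
        # pruning: sorted(set(...)) plays the role of sort + unique
        frontier = [u for u in sorted(set(candidates)) if u not in dist]
        for u in frontier:
            dist[u] = level
    return dist

adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
dist = bfs_with_pruning(adj, 0)  # {0: 0, 1: 1, 2: 1, 3: 2}
```

In the distributed setting the payoff is that only the pruned frontier crosses the network, which is where the communication savings come from.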

GraphStream 1.2

Friday, September 27th, 2013

GraphStream 1.2

GraphStream moved to a new website back in June of 2013 and I missed it. Sorry!

Releases of the GraphStream “core,” “algo,” and “ui” modules are available for version 1.2.


Stardog 2.0.0 (26 September 2013)

Friday, September 27th, 2013

Stardog 2.0.0 (26 September 2013)

From the docs page:

Introducing Stardog

Stardog is a graph database—fast, lightweight, pure Java storage for mission-critical apps—that supports:

  • the RDF data model
  • SPARQL 1.1 query language
  • HTTP and SNARL protocols for remote access and control
  • OWL 2 and rules for inference and data analytics
  • Java, JavaScript, Ruby, Python, .Net, Groovy, Spring, etc.

New features in 2.0:

I was amused to read in Stardog Rules Syntax:

Stardog supports two different syntaxes for defining rules. The first is the native Stardog Rules syntax and is based on SPARQL, so you can re-use what you already know about SPARQL to write rules. Unless you have specific requirements otherwise, you should use this syntax for user-defined rules in Stardog. The second is the de facto standard RDF/XML syntax for SWRL. It has the advantage of being supported in many tools; but it’s not fun to read or to write. You probably don’t want to use it. Better: don’t use this syntax! (emphasis in the original)

Install and play with it over the weekend. It’s a good way to experience RDF and SPARQL.

Meet Node-RED…

Friday, September 27th, 2013

Meet Node-RED, an IBM project that fulfills the internet of things’ missing link by Stacey Higginbotham.

From the post:

If you play around with enough connected devices or hang out with enough people thinking about what it means to have 200 connected gizmos in your home, eventually you get to a pretty big elephant in the room: How the heck are you going to connect all this stuff? To a hub? To the internet? To each other?

It’s one thing to set a program to automate your lights/thermostat/whatever to go to a specific setting when you hit a button/lock your door/exit your home’s Wi-Fi network, but it’s quite another to have a truly intuitive and contextual experience in a connected home if you have to manually program it using IFTTT or a series of apps. Imagine if instead of popping a couple Hue Light Bulbs into your bedroom lamp, you bought home 30 or 40 for your entire home. That’s a lot of adding and setting preferences.

Organic programming: Just let it go

If you take this out of the residential setting and into a factory or office it’s magnified and even more daunting because of a variety of potential administrative tasks and permissions required. Luckily, there are several people thinking about this problem. Mike Kuniavsky, a principal in the innovation services group at PARC, first introduced me to this concept back in February and will likely touch on this in a few weeks at our Mobilize conference next month. He likens it to a more organic way of programming.

The basic idea is to program the internet of things much like you play a Sims-style video game — you set things up to perform in a way you think will work and then see what happens. Instead of programming an action, you’re programming behaviors and trends in a device or class of devices. Then you put them together, give them a direction and they figure out how to get there.

Over at IBM, a few engineers are actually building something that might be helpful in implementing such systems. It’s called node-RED and it’s a way to interject a layer of behaviors for devices using a visual interface. It’s built on top of node.js and is available over on github.

If you have ever seen the Eureka episode H.O.U.S.E. Rules, you will have serious doubts about the wisdom of “…then see what happens” with regard to your house. 😉

I wonder if this will be something truly different, like organic computing, or a continuation of well-known trends.

Early computers were programmed using switches but quickly migrated to assembler; few write assembler now and those chores are done by compilers.

Some future compiler may accept examples of the “same” subject and decide on the most effective way to search and collate all the data for a given subject.

That will require a robust understanding of subject identity on the part of the compiler writers.

IVOA Newsletter – September 2013

Friday, September 27th, 2013

IVOA [International Virtual Observatory Alliance] Newsletter – September 2013 by Mark G. Allen, Deborah Baines, Sarah Emery Bunn, Chenzhou Cui, Mark Taylor, & Ivan Zolotukhin.

From the post:

The International Virtual Observatory Alliance (IVOA) was formed in June 2002 with a mission to facilitate the international coordination and collaboration necessary for the development and deployment of the tools, systems and organizational structures necessary to enable the international utilization of astronomical archives as an integrated and interoperating virtual observatory. The IVOA now comprises 20 VO programs from Argentina, Armenia, Australia, Brazil, Canada, China, Europe, France, Germany, Hungary, India, Italy, Japan, Russia, South Africa, Spain, Ukraine, the United Kingdom, and the United States and an inter-governmental organization (ESA). Membership is open to other national and international programs according to the IVOA Guidelines for Participation. You can read more about the IVOA and what we do at the IVOA website.

What is the VO?

The Virtual Observatory (VO) aims to provide a research environment that will open up new possibilities for scientific research based on data discovery, efficient data access, and interoperability. The vision is of global astronomy archives connected via the VO to form a multiwavelength digital sky that can be searched, visualized, and analyzed in new and innovative ways. VO projects worldwide working toward this vision are already providing science capabilities with new tools and services. This newsletter, aimed at astronomers, highlights VO tools and technologies for doing astronomy research, recent papers, and upcoming events.

Astroinformatics has a long history of dealing with “big data,” although it didn’t have a marketing name.

Astronomical “big data” is being shared and accessed around the world.

What about your “big data?”

…Boring HA And Scalability Problems

Friday, September 27th, 2013

Great Open Source Solution For Boring HA And Scalability Problems by Maarten Ectors and Frank Mueller.

From the post:

High-availability and scalability are exciting in general but there are certain problems that experts see over and over again. The list is long but examples are setting up MySQL clustering, sharding Mongo, adding data nodes to a Hadoop cluster, monitoring with Ganglia, building continuous deployment solutions, integrating Memcached / Varnish / Nginx,… Why are we reinventing the wheel?

At Ubuntu we made it our goal to have the community solve these repetitive and often boring tasks. How often have you had to set-up MySQL replication and scale it? What if the next time you just simply do:

  1. juju deploy mysql
  2. juju deploy mysql mysql-slave
  3. juju add-relation mysql:master mysql-slave:slave
  4. juju add-unit -n 10 mysql-slave

It’s easy to see how these four commands work. After deploying a master and a slave MySQL instance, the two are related as master and slave. After this you can simply add more slaves, as is done here with 10 more instances.

Responsible for this easy approach is one of our latest open source solutions, Juju. Juju allows any server software to be packaged inside what is called a Charm. A Charm describes how the software is deployed, integrated and scaled. Once an expert creates the Charm and uploads it to the Charm Store, anybody can use it instantly. Execution environments for Charms are abstracted via providers. The list of supported providers is growing and includes Amazon AWS, HP Cloud, Microsoft Azure, any Openstack private cloud, local LXC containers, bare metal deployments with MaaS, etc. So Juju allows any software to be instantly deployed, integrated and scaled on any cloud.

We all want HA and scalability problems to be “boring.”

When HA and scalability problems are “exciting,” that’s a bad thing!

If you are topic mapping in the cloud, take the time to read about Juju.

R Resources

Friday, September 27th, 2013

The week in stats (Sept. 16th edition) by Matt Asher.

Matt writes of his discovery of several R resources:

This week, we found a number of useful webinars and presentations for statisticians and data scientists on R. Feel free to check out the following opportunities: Online course on forecasting using R by Prof. Hyndman of Monash University, Coursera’s free R courses, Why use R for Data Analysis by Vivek H. Patil of Gonzaga University, and two workshops on R by Bob Muenchen.

Pass on to others if you are already deep into R.

Quantifying the Language of British Politics, 1880-1914

Friday, September 27th, 2013

Quantifying the Language of British Politics, 1880-1914


From the abstract:
This paper explores the power, potential, and challenges of studying historical political speeches using a specially constructed multi-million word corpus via quantitative computer software. The techniques used – inspired particularly by Corpus Linguists – are almost entirely novel in the field of political history, an area where research into language is conducted nearly exclusively qualitatively. The paper argues that a corpus gives us the crucial ability to investigate matters of historical interest (e.g. the political rhetoric of imperialism, Ireland, and class) in a more empirical and systematic manner, giving us the capacity to measure scope, typicality, and power in a massive text like a national general election campaign which it would be impossible to read in entirety.

The paper also discusses some of the main arguments against this approach which are commonly presented by critics, and reflects on the challenges faced by quantitative language analysis in gaining more widespread acceptance and recognition within the field.
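The core measurement behind this kind of corpus work, comparing how often a term appears in corpora of unequal size, is a normalised relative frequency, conventionally quoted per million tokens. A minimal Python sketch of that calculation; the toy “campaign” texts below are invented purely for illustration:

```python
from collections import Counter

def per_million(texts, term):
    """Relative frequency of `term` per million tokens: the standard
    normalisation for comparing corpora of different sizes."""
    tokens = [w.lower().strip('.,;:!?"\'')
              for text in texts for w in text.split()]
    return Counter(tokens)[term.lower()] / len(tokens) * 1_000_000

# Invented stand-ins for two election-campaign subcorpora.
campaign_1880 = ["The empire must be defended", "Ireland and the empire"]
campaign_1900 = ["Wages and the working class", "Class and reform"]

print(per_million(campaign_1880, "empire"))
print(per_million(campaign_1900, "empire"))
```

Trivial at this scale, but the same arithmetic is what lets a corpus linguist say that “empire” is typical of one campaign and marginal in another without reading millions of words.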

Points to a podcast by Luke Blaxill presenting the results of his Ph.D research.

Luke Blaxill’s dissertation: The Language of British Electoral Politics 1880-1910.

Important work that strikes a balance between a “close reading” of the relevant texts and using a one million word corpus (two corpora actually) to trace language usage.

Think of it as the opposite of tools that flatten the meaning of words across centuries.

Semantic Search and Linked Open Data Special Issue

Friday, September 27th, 2013

Semantic Search and Linked Open Data Special Issue

Paper submission: 15 December 2013
Notice of review results: 15 February 2014
Revisions due: 31 March 2014
Publication: Aslib Proceedings, issue 5, 2014.

From the call:

The opportunities and challenges of Semantic Search from theoretical and practical, conceptual and empirical perspectives. We are particularly interested in papers that place carefully conducted studies into the wider framework of current Semantic Search research in the broader context of Linked Open Data. Topics of interest include but are not restricted to:

  • The history of semantic search –  the latest techniques and technology developments in the last 1000 years
  • Technical approaches to semantic search : linguistic/NLP, probabilistic, artificial intelligence, conceptual/ontological
  • Current trends in Semantic Search, including best practice, early adopters, and cultural heritage
  • Usability and user experience; Visualisation; and techniques and technologies in the practice for Semantic Search
  • Quality criteria and impact of norms and standardisation similar to ISO 25964 “Thesauri for information retrieval”
  • Cross-industry collaboration and standardisation
  • Practical problems in brokering consensus and agreement – defining concepts, terms and classes, etc.
  • Curation and management of ontologies
  • Differences between web-scale, enterprise scale, and collection-specific scale techniques
  • Evaluation of Semantic Search solutions, including comparison of data collection approaches
  • User behaviour including evolution of norms and conventions; Information behaviour; and Information literacy
  • User surveys; usage scenarios and case studies

Papers should clearly connect their studies to the wider body of Semantic Search scholarship, and spell out the implications of their findings for future research. In general, only research-based submissions including case studies and best practice will be considered. Viewpoints, literature reviews or general reviews are generally not acceptable.

See the post for submission requirements, etc.

I am encouraged by the inclusion of:

The history of semantic search –  the latest techniques and technology developments in the last 1000 years

Wondering who will take up the gauntlet on that topic?

Big data and the “Big Lie”:…

Thursday, September 26th, 2013

Big data and the “Big Lie”: the challenges facing big brand marketers by Renee DiResta.

We’ve talked about the NSA and others gathering data like it is meaningful. Renee captures another point of inaccuracy in data:

However, the flip side of “social” is what’s come to be called The Big Lie: “the gap between social norm and private reality, between expressed opinions and inner motions.” We ensure that our Facebook and LinkedIn profiles present us in our best light. Our shared audio playlists highlight the artists we’re proud to call ourselves fans of — and conceal the mass-market pop that we actually listen to when we’re alone. We use Instagram to share our most gourmet dining experiences, not our Oreo habit. There’s an important distinction between user-generated data and user-volunteered data. Targeting someone using data they generated but did not volunteer can put a brand squarely into the “creepy” zone.

To emphasize the critical point:

“the gap between social norm and private reality, between expressed opinions and inner motions.”

Although that doesn’t account for self-deception/delusion, which I suspect operates a good deal of the time.

You need only to watch the evening news to see allegedly competent people saying things that are inconsistent with commonly shared views of reality.

I think Renee’s bottom line is that turning the crank on “big data” isn’t going to result in sales. It’s a bit harder than that.

See Renee’s post for more details.

Integrating with Apache Camel

Thursday, September 26th, 2013

Integrating with Apache Camel by Charles Moulliard.

From the post:

Since its creation by the Apache community in 2007, the open source integration framework Apache Camel has become a developer favourite. It is recognised as a key technology to design SOA / Integration projects and address complex enterprise integration use cases. This article, the first part of a series, will reveal how the framework generates, from the Domain Specific Language, routes where exchanges take place, how they are processed according to the patterns chosen, and finally how integration occurs.
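Camel routes are written in a Java DSL, but the underlying enterprise integration pattern is easy to sketch on its own. Below is a minimal Python illustration of a content-based router, one of the patterns the article alludes to; this is not Camel’s actual API, just the idea: an exchange carries headers and a body, and the first matching predicate decides where it goes.

```python
def route(exchange, choices):
    """Content-based router: deliver the exchange to the first
    endpoint whose predicate matches it."""
    for predicate, endpoint in choices:
        if predicate(exchange):
            return endpoint(exchange)
    raise ValueError("no matching route for exchange")

# Two toy endpoints standing in for queues.
def to_orders(ex):
    return ("orders-queue", ex["body"])

def to_dead_letter(ex):
    return ("dead-letter", ex["body"])

exchange = {"headers": {"type": "order"}, "body": "order #42"}
result = route(exchange, [
    (lambda ex: ex["headers"].get("type") == "order", to_orders),
    (lambda ex: True, to_dead_letter),  # catch-all default route
])
print(result)  # ('orders-queue', 'order #42')
```

In Camel the same structure is expressed declaratively (choice/when/otherwise in the DSL), with the framework handling the exchange plumbing.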

This series will be a good basis to continue onto ‘Enterprise Integration Patterns‘ and compare that to topic maps.

How should topic maps be modified (if at all) to fit into enterprise integration patterns?

Similarity maps…

Thursday, September 26th, 2013

Similarity maps – a visualization strategy for molecular fingerprints and machine-learning methods by Sereina Riniker and Gregory A Landrum.


From the abstract:
Fingerprint similarity is a common method for comparing chemical structures. Similarity is an appealing approach because, with many fingerprint types, it provides intuitive results: a chemist looking at two molecules can understand why they have been determined to be similar. This transparency is partially lost with the fuzzier similarity methods that are often used for scaffold hopping and tends to vanish completely when molecular fingerprints are used as inputs to machine-learning (ML) models. Here we present similarity maps, a straightforward and general strategy to visualize the atomic contributions to the similarity between two molecules or the predicted probability of a ML model. We show the application of similarity maps to a set of dopamine D3 receptor ligands using atom-pair and circular fingerprints as well as two popular ML methods: random forests and naïve Bayes. An open-source implementation of the method is provided.
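For readers unfamiliar with the baseline the paper builds on: the standard fingerprint-similarity measure is the Tanimoto coefficient over the “on” bits of two fingerprints. A minimal Python sketch, with invented bit positions; real fingerprints would come from a cheminformatics toolkit such as RDKit:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity |A ∩ B| / |A ∪ B| between two fingerprints,
    each represented as a set of 'on' bit positions."""
    if not fp_a and not fp_b:
        return 1.0  # two empty fingerprints are conventionally identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)

mol_a = {1, 5, 9, 12, 30}        # illustrative on-bits, not real data
mol_b = {1, 5, 9, 14, 30, 41}

print(round(tanimoto(mol_a, mol_b), 3))  # 4 shared bits / 7 total -> 0.571
```

The similarity maps of the paper go a step further, attributing that single number (or an ML model’s prediction) back to individual atoms.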

If you are doing topic maps in areas where molecular fingerprints are relevant, this could be quite useful.

Despite my usual warnings that semantics are continuous versus the discrete structures treated here, this may also be useful in “fuzzier” areas where topic maps are found.

JPASS

Thursday, September 26th, 2013


From the webpage:

JPASS gives you personal access to a library of more than 1,500 academic journals on JSTOR. If you don’t have access to JSTOR through a school or public library, JPASS may be a perfect fit.

With JPASS, a substantial portion of the most influential research and ideas published over centuries is available to you anywhere, anytime. Access includes a vast collection of archival journals in the humanities, social sciences, and sciences. Coverage begins for each journal at the first volume and issue ever published, and extends up to a publication date usually set in the past three to five years. Current issues are not part of the JPASS Collection.

Current rates: $19.95/month or $199/year, with download permission for ten articles per month, or 120 articles per year with an annual subscription.

It’s not much but if you don’t have access to a major academic library, it is better than nothing.

The amazing part of this story is that until quite recently JSTOR had no individual subscriptions.

As though no one outside of a traditional academic setting would want to read substantive academic research.

That may sound like I am not a fan of JSTOR. Truth is I’m not. But like I said, if you have no meaningful access at all, this will be better than nothing.

For CS and related topics, I would spend the money on the ACM Digital Library and/or the IEEE Xplore Digital Library.