August 1st, 2014


From the webpage:

OpenGM is a C++ template library for discrete factor graph models and distributive operations on these models. It includes state-of-the-art optimization and inference algorithms beyond message passing. OpenGM handles large models efficiently, since (i) functions that occur repeatedly need to be stored only once and (ii) when functions require different parametric or non-parametric encodings, multiple encodings can be used alongside each other, in the same model, using included and custom C++ code. No restrictions are imposed on the factor graph or the operations of the model. OpenGM is modular and extendible. Elementary data types can be chosen to maximize efficiency. The graphical model data structure, inference algorithms and different encodings of functions inter-operate through well-defined interfaces. The binary OpenGM file format is based on the HDF5 standard and incorporates user extensions automatically.

Documentation lists algorithms with references.
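For readers new to the terminology, here is my gloss (not taken from the OpenGM docs) of what a discrete factor graph model is. A joint function over variables x_1, …, x_n factors into local functions, each depending only on a small subset of the variables:

```latex
p(x_1,\dots,x_n) \;=\; \frac{1}{Z} \prod_{f \in F} \varphi_f(\mathbf{x}_f),
\qquad
Z \;=\; \sum_{\mathbf{x}} \prod_{f \in F} \varphi_f(\mathbf{x}_f)
```

The "distributive operations" in the description generalize the (sum, product) pair above to other semirings, e.g. (min, +) for MAP inference, which is why the same model structure serves several inference tasks.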

I first saw this in a post by Danny Bickson, OpenGM graphical models toolkit.

DBpedia – Wikipedia Data Extraction

August 1st, 2014

DBpedia – Wikipedia Data Extraction by Gaurav Vaidya.

From the post:

We are happy to announce an experimental RDF dump of the Wikimedia Commons. A complete first draft is now available online at, and will be eventually accessible from A small sample dataset, which may be easier to browse, is available on Github at

Just in case you are looking for some RDF data to experiment with this weekend!

A Very Gentle Introduction to Relational Programming

August 1st, 2014

A Very Gentle Introduction to Relational Programming & Functional Programming by David Nolen.

From the webpage:

This tutorial will guide you through the magic and fun of combining relational programming (also known as logic programming) with functional programming. This tutorial does not assume that you have any knowledge of Lisp, Clojure, Java, or even functional programming. The only thing this tutorial assumes is that you are not afraid of using the command line and you have used at least one programming language before in your life.

A fairly short tutorial, but one where “relational” in the title is likely to cause confusion: it is not about relational databases. Here “relational” is meant in the sense of “logical.”

Another one of those ambiguity problems.
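To make the intended sense concrete: in relational (logic) programming a relation is not a one-way function from inputs to outputs; any argument may be left unknown. This plain-Python sketch (my own toy, nothing to do with Nolen's Clojure/core.logic tutorial) mimics that with a finite relation:

```python
# A relation is just a set of tuples; a "goal" enumerates consistent
# bindings instead of computing a single fixed direction.
parent = {("alice", "bob"), ("alice", "carol"), ("bob", "dana")}

def parent_of(p=None, c=None):
    """Run the parent relation in any direction: fix either argument
    (or neither) and enumerate the pairs consistent with it."""
    for (x, y) in sorted(parent):
        if (p is None or p == x) and (c is None or c == y):
            yield (x, y)

# The same relation answers three different questions:
children_of_alice = list(parent_of(p="alice"))   # run "forward"
parent_of_dana = list(parent_of(c="dana"))       # run "backward"
all_facts = list(parent_of())                    # enumerate everything
```

A real logic system adds unification and fresh logic variables on top of this idea, but the directionlessness is the heart of it.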

USB Security Fundamentally Broken

August 1st, 2014

Why the Security of USB Is Fundamentally Broken by Andy Greenberg.

From the post:

Computer users pass around USB sticks like silicon business cards. Although we know they often carry malware infections, we depend on antivirus scans and the occasional reformatting to keep our thumbdrives from becoming the carrier for the next digital epidemic. But the security problems with USB devices run deeper than you think: Their risk isn’t just in what they carry, it’s built into the core of how they work.

That’s the takeaway from findings security researchers Karsten Nohl and Jakob Lell plan to present next week, demonstrating a collection of proof-of-concept malicious software that highlights how the security of USB devices has long been fundamentally broken. The malware they created, called BadUSB, can be installed on a USB device to completely take over a PC, invisibly alter files installed from the memory stick, or even redirect the user’s internet traffic. Because BadUSB resides not in the flash memory storage of USB devices, but in the firmware that controls their basic functions, the attack code can remain hidden long after the contents of the device’s memory would appear to the average user to be deleted. And the two researchers say there’s no easy fix: The kind of compromise they’re demonstrating is nearly impossible to counter without banning the sharing of USB devices or filling your port with superglue.

“These problems can’t be patched,” says Nohl, who will join Lell in presenting the research at the Black Hat security conference in Las Vegas. “We’re exploiting the very way that USB is designed.”

You can get the gist of this new security issue from Andy’s post or pay late registration fees for Black Hat 2014 next week.

I was surprised when I learned that a sneaker net using USB devices played a part in the Snowden leaks. I had assumed that NSA computers had no USB ports and/or had them glued shut. Apparently not.

Are you going to send the NSA a note about this latest USB issue or should I?

PS: Aside from possible new USB designs, the upside of this issue may be a discussion of how much security do you want at what price? No system is “secure,” but rather “relatively secure under the following assumptions…”

Letter to a Young Haskell Enthusiast [No Haskell Required for Reading]

August 1st, 2014

Letter to a Young Haskell Enthusiast by Gershom Bazerman.

From an introduction before the letter:

The following letter is not about what “old hands” know and newcomers do not. Instead, it is about lessons that we all need to learn more than once, and remind ourselves of. It is about tendencies that are common, and understandable, and come with the flush of excitement of learning any new thing that we understand is important, and about the difficulty, always, in trying to decide how best to convey that excitement and sense of importance to others, in a way that they will listen. It is written more specifically, but only because I have found that if we don’t talk specifics as well as generalities, the generalities make no sense. This holds for algebraic structures, and it holds for other, vaguer concepts no less. It is a letter full of things I want to remember, as well as of advice I want to share. I expect I will want to remind myself of it when I encounter somebody who is wrong on the internet, which, I understand, may occur on rare occasion. (emphasis in original)

Extremely good advice on being a contributing member of a community, on or offline.

Share it and calendar it for regular re-reading.

Interactive Map: First World War: A Global View

August 1st, 2014

Interactive Map: First World War: A Global View by UkNatArchives.

From the pop-up when you visit the map:

A global view

Explore the global impact of the First World War through our interactive map, which highlights key events and figures in countries from Aden to Zanzibar. Drawn directly from our records at The National Archives, the map aims to go beyond the trenches of the Western Front and shows how the war affected different parts of the world.

The First World War: A global view is part of our First World War 100 programme. It currently focuses on the contributions of the countries and territories that made up the British Empire during wartime. We will continue to develop the map over the next four years, to show more countries and territories across Europe, the Middle East, the Americas, Africa and Asia.

About this map

To get started, select a country or territory by clicking on a marker icon on the map, or select a name from the list on the left. Navigate through the tabs to read about battles, life on the Home Front and much more. Each country or territory is illustrated with images, maps and other documents from our collections. Click on the references to find key documents in Discovery, our catalogue, or images in our image library.

To reflect changing borders and names since 1914, we have provided two map views. Switch between the global map as it was during wartime, and as it is today, by using the buttons at the top of the map.

My assumptions about certain phrases do jump up to bite me every now and again. This was one of those cases.

I think I know what is meant by “First World War,” and “A Global View.” And even the language about “changing borders and names since 1914,” makes sense given the rise of so many new nations in the last century.

Hence, my puzzlement when I looked at the Country/Territory list only to see:

Aden
Anglo-Egyptian Sudan
Ascension Island
Australia
Barbados
Bermuda
Britain
British East Africa
British Gold Coast
British Honduras
British New Guinea and German New Guinea
British North Borneo and Sarawak
Burma
Canada
Ceylon
Cocos (Keeling) Islands
Cyprus
Egypt
Falkland Islands
Gibraltar
Hong Kong and Wei-Hai-Wei
India
Jamaica
Leeward Islands
Malaya
Maldives
Malta
Mauritius
New Zealand
Newfoundland
Nigeria
Northern Rhodesia
Nyasaland
Pacific Islands
Seychelles
Sierra Leone
Straits Settlements
Southern Rhodesia
St Helena
The Gambia
Trinidad and Tobago
Uganda
Windward Islands
Zanzibar

In my history lessons, I had learned there were many other countries that were involved in World War I, especially from a “global” view. ;-)

My purpose is not to disagree with the definition of World War I or “global perspective” used by the UK National Archive. It is their map and they are free to use whatever definitions seem appropriate to their purpose.

My point is that even common phrases, such as World War I and “global perspective” can be understood in radically different ways by different readers of the same text.

For an American class, I would re-title this resource as England and its territories during World War I. To which a UK teacher could rightly reply, “That’s what we said.”

More examples of unexpected semantic dissonance welcome!

PS: You should be following The National Archives (UK). Truly a remarkable effort.

Elasticsearch 1.3.1 released

August 1st, 2014

Elasticsearch 1.3.1 released by Clinton Gormley.

From the post:

Today, we are happy to announce the bugfix release of Elasticsearch 1.3.1, based on Lucene 4.9. You can download it and read the full changes list here: Elasticsearch 1.3.1.


GraphLab Conference 2014 (Videos!)

August 1st, 2014

GraphLab Conference 2014 (Videos!)

Videos from the GraphLab Conference 2014 have been posted! Who needs to wait for a new season of Endeavour? ;-)

(I included the duration times so you can squeeze these in between conference calls.)

Presentations, ordered by author’s last name.

Training Sessions on GraphLab Create

I first saw this in a tweet by xamat.

COSMOS: Python library for massively parallel workflows

August 1st, 2014

COSMOS: Python library for massively parallel workflows by Erik Gafni, et al. (Bioinformatics (2014) doi: 10.1093/bioinformatics/btu385 )


Summary: Efficient workflows to shepherd clinically generated genomic data through the multiple stages of a next-generation sequencing pipeline are of critical importance in translational biomedical science. Here we present COSMOS, a Python library for workflow management that allows formal description of pipelines and partitioning of jobs. In addition, it includes a user interface for tracking the progress of jobs, abstraction of the queuing system and fine-grained control over the workflow. Workflows can be created on traditional computing clusters as well as cloud-based services.

Availability and implementation: Source code is available for academic non-commercial research purposes. Links to code and documentation are provided at and

Contact: or

Supplementary information: Supplementary data are available at Bioinformatics online.

A very good abstract but for pitching purposes, I would have chosen the first paragraph of the introduction:

The growing deluge of data from next-generation sequencers leads to analyses lasting hundreds or thousands of compute hours per specimen, requiring massive computing clusters or cloud infrastructure. Existing computational tools like Pegasus (Deelman et al., 2005) and more recent efforts like Galaxy (Goecks et al., 2010) and Bpipe (Sadedin et al., 2012) allow the creation and execution of complex workflows. However, few projects have succeeded in describing complicated workflows in a simple, but powerful, language that generalizes to thousands of input files; fewer still are able to deploy workflows onto distributed resource management systems (DRMs) such as Platform Load Sharing Facility (LSF) or Sun Grid Engine that stitch together clusters of thousands of compute cores. Here we describe COSMOS, a Python library developed to address these and other needs.

That paragraph highlights the bioinformatics aspects of COSMOS but also hints at a language that might be adapted to other “massively parallel workflows.” Workflows may differ in their details, but the need to define them efficiently and effectively is a common problem.
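The core idea, a pipeline as a formally described dependency graph that an engine orders and dispatches, can be sketched in a few lines of stdlib Python. The stage names below are invented for illustration; COSMOS's actual API is different:

```python
from graphlib import TopologicalSorter

# Each task names the tasks it depends on: a toy sequencing pipeline
# described as data, separate from the engine that runs it.
pipeline = {
    "fetch": set(),
    "trim": {"fetch"},
    "align": {"trim"},
    "call_variants": {"align"},
}

def run(pipeline):
    """Execute tasks in dependency order, returning the run log."""
    log = []
    for task in TopologicalSorter(pipeline).static_order():
        log.append(task)  # a real engine would submit the job to a DRM here
    return log

order = run(pipeline)  # fetch, trim, align, call_variants
```

Real workflow managers add job partitioning, progress tracking, and DRM abstraction (LSF, Sun Grid Engine) on top of exactly this kind of dependency-ordered dispatch.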

Toposes, Triples and Theories

August 1st, 2014

Toposes, Triples and Theories by Michael Barr and Charles Wells.

From the preface:

Chapter 1 is an introduction to category theory which develops the basic constructions in categories needed for the rest of the book. All the category theory the reader needs to understand the book is in it, but the reader should be warned that if he has had no prior exposure to categorical reasoning the book might be tough going. More discursive treatments of category theory in general may be found in Borceux [1994], Mac Lane [1998], and Barr and Wells [1999]; the last-mentioned could be suitably called a prequel to this book.

So you won’t have to dig the references out of the bibliography:

M. Barr and C. Wells, Category Theory for Computing Science, 3rd Edition. Les Publications CRM (1999).
Online at:

F. Borceux, Handbook of Categorical Algebra I, II and III. Cambridge University Press (1994).
Cambridge Online Books has these volumes but that requires an institutional subscription.

S. Mac Lane, Categories for the Working Mathematician 2nd Edition. Springer-Verlag, 1998.
Online at:


Security Thunderstorm in the Cloud

July 31st, 2014

Your data security, in the cloud and elsewhere, got weaker today.

U.S. District Judge Loretta Preska ruled today that Microsoft must turn over a customer’s emails that are stored in Ireland. (see: U.S. Judge Rules Microsoft Must Produce Emails Held Abroad)

Whether your data is stored in the U.S. or controlled by a U.S. company, it is subject to seizure by the U.S.

The ruling has been stayed pending an appeal to the 2nd U.S. Circuit Court of Appeals.

The Digital Constitution (MS) has a great set of resources on this issue:

Along with briefs filed by others:

More resources and news will appear at the Digital Constitution so sign up for updates!

The legal dancing in the briefs may not interest you but the bottom line is this:

If data can be seized by any government without regard to the location of the data, the Cloud is effectively dead for anyone concerned about data security.

You may store your data in the Cloud on European servers due to greater privacy protection by the EU. Not a concern for U.S. courts if your data is held by a U.S. company.

You may store your data in the Cloud on U.S. servers but if the Chinese government wants to seize it, Judge Preska appears to think that is ok.

Congress needs to quell this security thunderstorm in the Cloud before it does major economic damage both here and abroad.

PS: Many thanks to Joseph Palazzolo (WSJ) for pointing me to the Digital Constitution site.

How To Create Semantic Confusion

July 31st, 2014

Merge: to cause (two or more things, such as two companies) to come together and become one thing : to join or unite (one thing) with another (

Do you see anything common between that definition of merge and:

  • It ensures that a pattern exists in the graph by creating it if it does not exist already
  • It will not use partially existing (unbound) patterns; it will attempt to match the entire pattern and create the entire pattern if missing
  • When unique constraints are defined, MERGE expects to find at most one node that matches the pattern
  • It also allows you to define what should happen based on whether data was created or matched

The quote is from Cypher MERGE Explained by Luanne Misquitta. Great post if you want to understand the operation of Cypher “merge,” which has nothing in common with the term “merge” in English.
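In dictionary terms, Cypher’s MERGE is a get-or-create (an upsert), not a combination of two things. A toy Python sketch (hypothetical, not Cypher’s implementation) of that semantic:

```python
# MERGE-as-upsert: return the existing node matching `key`, or create
# one. Note that nothing is ever "merged" in the English sense.
def merge_node(graph, key, props):
    if key in graph:
        # ON MATCH: the existing node wins; incoming props are ignored
        return graph[key], "matched"
    graph[key] = dict(props)  # ON CREATE: the whole pattern is created
    return graph[key], "created"

g = {}
merge_node(g, "luanne", {"lang": "cypher"})               # creates
node, how = merge_node(g, "luanne", {"lang": "english"})  # matches
# `how` is "matched" and node["lang"] is still "cypher":
# the second call combined nothing.
```

If MERGE really matched the English word, the second call would leave the node holding some union of both property sets; it does not.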

Want to create semantic confusion?

Choose a well-known term and define new and unrelated semantics for it. Creates a demand for training, tutorials as well as confused users.

I first saw this in a tweet by GraphAware.

Bio-Linux 8 – Released July 2014

July 31st, 2014

Bio-Linux 8 – Released July 2014

About Bio-Linux:

Bio-Linux 8 is a powerful, free bioinformatics workstation platform that can be installed on anything from a laptop to a large server, or run as a virtual machine. Bio-Linux 8 adds more than 250 bioinformatics packages to an Ubuntu Linux 14.04 LTS base, providing around 50 graphical applications and several hundred command line tools. The Galaxy environment for browser-based data analysis and workflow construction is also incorporated in Bio-Linux 8.

Bio-Linux 8 represents the continued commitment of NERC to maintain the platform, and comes with many updated and additional tools and libraries. With this release we support pre-prepared VM images for use with VirtualBox, VMWare or Parallels. Virtualised Bio-Linux will power the EOS Cloud, which is in development for launch in 2015.

You can install Bio-Linux on your machine, either as the only operating system, or as part of a dual-boot set-up which allows you to use your current system and Bio-Linux on the same hardware.

Bio-Linux can also run Live from a DVD or a USB stick. This runs in the memory of your machine and does not involve installing anything. This is a great, no-hassle way to try out Bio-Linux, demonstrate or teach with it, or to work with it when you are on the move.

Bio-Linux is built on open source systems and software, and so is free to install and use. See What’s new on Bio-Linux 8. Also, check out the 2006 paper on Bio-Linux and open source systems for biologists.

Great news if you are handling biological data!

It is also a good example of multiple delivery methods: you can run Bio-Linux 8 as your operating system, in a VM, or live from a DVD or USB stick.

How is your software delivered?

Semantic Investment Needs A Balance Sheet Line Item

July 30th, 2014

The Hidden Shareholder Boost From Information Assets by Douglas Laney.

From the post:

It’s hard today not to see the tangible, economic benefits of information all around us: Walmart uses social media trend data to entice online shoppers to purchase 10 percent to 15 percent more stuff; Kraft spinoff Mondelez grew revenue by $100 million through improved in-store promotion configurations using detailed store, chain, product, stock and pricing data; and UPS saves more than $50 million, delivers 35 percent more packages per year and has doubled driver wages by continually collecting and analyzing more than 200 data points per truck along with GPS data to reduce accidents and miles driven.

Even businesses from small city zoos to mom-and-pop coffee shops to wineries are collecting, crushing and consuming data to yield palpable revenue gains or expense reductions. In addition, some businesses beyond the traditional crop of data brokers monetize their information assets directly by selling or trading them for goods or services.

Yet while as a physical asset, technology is easily given a value attribution and represented on balance sheets; information is treated as an asset also ran or byproduct of the IT department. Your company likely accounts for and manages your office furniture with greater discipline than your information assets. Why? Because accounting standards in place since the beginning of the information age more than 50 years ago continue to be based on 250-year-old Industrial Age realities. (emphasis in original)

Does your accounting system account for your investment in semantics?

Here’s some ways to find out:

  • For any ETL project in the last year, can your accounting department detail how much was spent discovering the semantics of the ETL data?
  • For any data re-used for an ETL project in the last three years, can your accounting department detail how much was spent duplicating the work of the prior ETL?
  • Can your accounting department produce a balance sheet showing your current investment in the semantics of your data?
  • Can your accounting department produce a balance sheet showing the current value of your information?

If the answer is “no,” to any of those questions, is your accounting department meeting your needs in the information age?

Douglas has several tips for getting people’s attention for the $$$ you have invested in information.

Is information an investment or an unknown loss on your books?

Senator John Walsh plagiarism, color-coded

July 30th, 2014

Senator John Walsh plagiarism, color-coded by Nathan Yau.

Nathan points to a New York Times’ visualization that makes a telling case for plagiarism against Senator John Walsh.

Best if you see it at Nathan’s site, his blog formats better than mine does.

Senator Walsh was rather obvious about it but I often wonder how much news copy, print or electronic, is really original?

Some is I am sure but when a story goes out over AP or UPI, how much of it is repeated verbatim in other outlets?

It’s not plagiarism because someone purchased a license to repeat the stories but it certainly isn’t original.

If an AP/UPI story is distributed and re-played in 500 news outlets, it remains one story. With no more credibility than it had at the outset.

Would color coding be as effective against faceless news sources as they have been against Sen. Walsh?

BTW, if you are interested in the sordid details: Pentagon Watchdog to review plagiarism probe of Sen. John Walsh. Incumbents need not worry, Sen. Walsh is an appointed senator and therefore is an easy throw-away in order to look tough on corruption.

Ideals, Varieties, and Algorithms

July 30th, 2014

Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra by David Cox, John Little, and Donal O’Shea.

From the introduction:

We wrote this book to introduce undergraduates to some interesting ideas in algebraic geometry and commutative algebra. Until recently, these topics involved a lot of abstract mathematics and were only taught in graduate school. But in the 1960s, Buchberger and Hironaka discovered new algorithms for manipulating systems of polynomial equations. Fueled by the development of computers fast enough to run these algorithms, the last two decades have seen a minor revolution in commutative algebra. The ability to compute efficiently with polynomial equations has made it possible to investigate complicated examples that would be impossible to do by hand, and has changed the practice of much research in algebraic geometry. This has also enhanced the importance of the subject for computer scientists and engineers, who have begun to use these techniques in a whole range of problems.
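The algorithms Buchberger discovered center on Gröbner bases, and their basic move is the S-polynomial: scale two generators so their leading terms agree, then subtract to cancel them. A small worked example (my own, in lex order with x > y):

```latex
f = x^2 y - 1, \qquad g = x y^2 - x, \qquad
\operatorname{lcm}(x^2 y,\; x y^2) = x^2 y^2
```

```latex
S(f,g) \;=\; y\,f - x\,g
       \;=\; (x^2 y^2 - y) - (x^2 y^2 - x^2)
       \;=\; x^2 - y
```

Buchberger’s criterion says a generating set is a Gröbner basis exactly when every such S-polynomial reduces to zero against the set; here x² − y does not, so it is added as a new generator and the process repeats until it stabilizes.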

The authors do presume students “…have access to a computer algebra system.”

The Wikipedia List of computer algebra systems has links to numerous such systems. A large number of which are free.

That list is headed by Axiom (Wikipedia article), which is an example of literate programming. The Axiom documentation looks like a seriously entertaining time sink! You may want to visit

I haven’t installed Axiom, so take that as a comment on its documentation more than its actual use. Use whatever system you like best and that fits your finances.

I first saw this in a tweet from onepaperperday.


Awesome Machine Learning

July 30th, 2014

Awesome Machine Learning by Joseph Misiti.

From the webpage:

A curated list of awesome machine learning frameworks, libraries and software (by language). Inspired by awesome-php. Other awesome lists can be found in the awesome-awesomeness list.

If you want to contribute to this list (please do), send me a pull request or contact me @josephmisiti

Not strictly limited to “machine learning” as it offers resources on data analysis, visualization, etc.

With a list of 576 resources, I am sure you will find something new!

JudaicaLink released

July 30th, 2014

JudaicaLink released

From the post:

Data extractions from two encyclopediae from the domain of Jewish culture and history have been released as Linked Open Data within our JudaicaLink project.

JudaicaLink now provides access to 22,808 concepts in English (~ 10%) and Russian (~ 90%), mostly locations and persons.

See here for further information:

Next steps in this project include “…the creation of links between the two encyclopedias and links to external sources like DBpedia or Geonames.”

In case you are interested, the two encyclopedias are:

The YIVO Encyclopedia of Jews in Eastern Europe, courtesy of the YIVO Institute of Jewish Research, NY. provides an Internet version of the Encyclopedia of Russian Jewry, which is published in Moscow since 1994, giving a comprehensive, objective picture of the life and activity of the Jews of Russia, the Soviet Union and the CIS.

For more details: Encyclopediae

If you are looking to contribute content or time to a humanities project, this should be on your short list.

Graphs, Databases and Graphlab

July 30th, 2014

Graphs, Databases and Graphlab by Bugra Akyildiz.

From the post:

I will talk about graphs, graph databases and mainly the paper that powers Graphlab. At the end of the post, I will go over briefly basic capabilities of Graphlab as well.

Background coverage of graphs and graph databases, followed by a discussion of GraphLab.

The high point of the post is the set of graphs generated from prior work by Bugra on the Internet Movie Database. (IMDB Top 100K Movies Analysis in Depth (Parts 1-4))


CIS 194: Introduction to Haskell (Spring 2013)

July 30th, 2014

CIS 194: Introduction to Haskell (Spring 2013) by Brent Yorgey.

From the description:

Haskell is a high-level, pure functional programming language with a strong static type system and elegant mathematical underpinnings, and is being increasingly used in industry by organizations such as Facebook, AT&T, and NASA. In the first 3/4 of the course, we will explore the joys of pure, lazy, typed functional programming in Haskell (warning: euphoric, mind-bending epiphanies may result!). The last 1/4 of the course will consist of case studies in fun and interesting applications of Haskell. Potential case study topics include automated randomized testing, software transactional memory, graphics generation, parser combinators, or others as time and interest allow. Evaluation will be based on class participation, weekly programming assignments, and an open-ended final project.

Lectures, homework, and resources on Haskell.

I particularly liked this resource and comment:

The #haskell IRC channel is a great place to get help. Strange as it may seem if you’ve spent time in other IRC channels, #haskell is always full of friendly, helpful people.

Good to know there are places like that left!

I first saw this in a tweet by Kludgy.

Multi-Term Synonyms [Bags of Properties?]

July 30th, 2014

Solution for multi-term synonyms in Lucene/Solr using the Auto Phrasing TokenFilter by Ted Sullivan.

From the post:

In a previous blog post, I introduced the AutoPhrasingTokenFilter. This filter is designed to recognize noun-phrases that represent a single entity or ‘thing’. In this post, I show how the use of this filter combined with a Synonym Filter configured to take advantage of auto phrasing, can help to solve an ongoing problem in Lucene/Solr – how to deal with multi-term synonyms.

The problem with multi-term synonyms in Lucene/Solr is well documented (see Jack Krupansky’s proposal, John Berryman’s excellent summary and Nolan Lawson’s query parser solution). Basically, what it boils down to is a problem with parallel term positions in the synonym-expanded token list – based on the way that the Lucene indexer ingests the analyzed token stream. The indexer pays attention to a token’s start position but does not attend to its position length increment. This causes multi-term tokens to overlap subsequent terms in the token stream rather than maintaining a strictly parallel relation (in terms of both start and end positions) with their synonymous terms. Therefore, rather than getting a clean ‘state-graph’, we get a pattern called “sausagination” that does not accurately reflect the 1-1 mapping of terms to synonymous terms within the flow of the text (see blog post by Mike McCandless on this issue). This problem disappears if all of the synonym pairs are single tokens.

The multi-term synonym problem was described in a Lucene JIRA ticket (LUCENE-1622) which is still marked as “Unresolved”:

Posts like this one are a temptation to sign off Twitter and read the ticket feeds for Lucene/Solr instead. Seriously.

Ted proposes a workaround to the multi-term synonym problem using the auto phrasing tokenfilter. Equally important is his conclusion:

The AutoPhrasingTokenFilter can be an important tool in solving one of the more difficult problems with Lucene/Solr search – how to deal with multi-term synonyms. Simultaneously, we can improve another serious problem that all search engines have – their focus on single tokens and the ambiguities that are present at that level. By shifting the focus more towards phrases that should be treated as semantic entities or units of language (i.e. “things”), the search engine is better able to return results based on ‘what’ the user is looking for rather than documents containing words that match the query. We are moving from searching with a “bag of words” to searching a “bag of things”.

Or more precisely:

…their focus on single tokens and the ambiguities that are present at that level. By shifting the focus more towards phrases that should be treated as semantic entities or units of language (i.e. “things”)…

Ambiguity at the token level remains, even if for particular cases phrases can be treated as semantic entities.

Rather than Ted’s “bag of things,” may I suggest indexing “bags of properties,” where the lowliest token or a higher semantic unit can be indexed as a bag of properties?

Imagine indexing these properties* for a single token:

  • string: value
  • pubYear: value
  • author: value
  • journal: value
  • keywords: value

Would that suffice to distinguish a term in a medical journal from Vanity Fair?

Ambiguity is predicated upon a lack of information.

That should be suggestive of a potential cure.

*(I’m not suggesting that all of those properties or even most of them would literally appear in a bag. Most, if not all, could be defaulted from an indexed source.)
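A minimal sketch of such a property-bag index, in stdlib Python (toy code, not Lucene; the property values are hypothetical): two occurrences of the same string stay distinguishable because their other properties differ.

```python
from collections import defaultdict

# Inverted index keyed by (property, value) pairs rather than bare strings.
index = defaultdict(set)

def index_token(doc_id, props):
    """Index one token occurrence as a bag of properties."""
    for key, value in props.items():
        index[(key, value)].add(doc_id)

index_token("doc1", {"string": "culture", "journal": "The Lancet",
                     "pubYear": 2014})
index_token("doc2", {"string": "culture", "journal": "Vanity Fair",
                     "pubYear": 2014})

# Querying on the string alone is ambiguous...
hits = index[("string", "culture")]            # both documents
# ...but intersecting with one more property resolves it.
medical = hits & index[("journal", "The Lancet")]
```

The extra properties supply exactly the information whose absence creates the ambiguity, which is the point of the post.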

I first saw this in a tweet by SolrLucene.

Scrapy and Elasticsearch

July 30th, 2014

Scrapy and Elasticsearch by Florian Hopf.

From the post:

On 29.07.2014 I gave a talk at Search Meetup Karlsruhe on using Scrapy with Elasticsearch, the slides are here. This post evolved from the talk and introduces you to web scraping and search with Scrapy and Elasticsearch.

Web Crawling

You might think that web crawling and scraping only is for search engines like Google and Bing. But a lot of companies are using it for different purposes: Price comparison, financial risk information and portals all need a way to get the data. And at least sometimes the way is to retrieve it through some public website. Besides these cases where the data is not in your hand it can also make sense if the data is aggregated already. For intranet and portal search engines it can be easier to just scrape the frontend instead of building data import facilities for different, sometimes even old systems.

The Example

In this post we are looking at a rather artificial example: Crawling the page for recent meetups to make them available for search. Why artificial? Because has an API that provides all the data in a more convenient way. But imagine there is no other way and we would like to build a custom search on this information, probably by adding other event sites as well. (emphasis in original)

Not everything you need to know about Scrapy but enough to get you interested.

APIs for data are on the up swing but web scrapers will be relevant to data mining for decades to come.
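The scrape-then-index flow Florian describes can be sketched without Scrapy at all, using only the stdlib: pull event titles out of some (hypothetical) markup, then emit the newline-delimited JSON that Elasticsearch’s _bulk endpoint expects. The CSS class and index name below are made up for illustration.

```python
import json
from html.parser import HTMLParser

class EventTitleParser(HTMLParser):
    """Collect the text of <h3 class="event-title"> elements."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []
    def handle_starttag(self, tag, attrs):
        if tag == "h3" and ("class", "event-title") in attrs:
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

html = '<h3 class="event-title">Search Meetup Karlsruhe</h3>'
parser = EventTitleParser()
parser.feed(html)

# One action line plus one document line per event, ready to POST
# to Elasticsearch's /_bulk endpoint (trailing newline required).
bulk = []
for title in parser.titles:
    bulk.append(json.dumps({"index": {"_index": "meetups"}}))
    bulk.append(json.dumps({"title": title}))
payload = "\n".join(bulk) + "\n"
```

Scrapy replaces the hand-rolled parser with spiders, selectors, scheduling, and politeness controls, but the shape of the pipeline, extract items and hand them to the indexer, is the same.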

Expanded 19th-century Medical Collection

July 30th, 2014

Wellcome Library and Jisc announce partners in 19th-century medical books digitisation project

From the post:

The libraries of six universities have joined the partnership – UCL (University College London), the University of Leeds, the University of Glasgow, the London School of Hygiene & Tropical Medicine, King’s College London and the University of Bristol – along with the libraries of the Royal College of Physicians of London, the Royal College of Physicians of Edinburgh and the Royal College of Surgeons of England.

Approximately 15 million pages of printed books and pamphlets from all ten partners will be digitised over a period of two years and will be made freely available to researchers and the public under an open licence. By pooling their collections the partners will create a comprehensive online library. The content will be available on multiple platforms to broaden access, including the Internet Archive, the Wellcome Library and Jisc Historic Books.

The project’s focus is on books and pamphlets from the 19th century that are on the subject of medicine or its related disciplines. This will include works relating to the medical sciences, consumer health, sport and fitness, as well as different kinds of medical practice, from phrenology to hydrotherapy. Works on food and nutrition will also feature: around 1400 cookery books from the University of Leeds are among those lined up for digitisation. They, along with works from the other partner institutions, will be transported to the Wellcome Library in London where a team from the Internet Archive will undertake the digitisation work. The project will build on the success of the US-based Medical Heritage Library consortium, of which the Wellcome Library is a part, which has already digitised over 50 000 books and pamphlets.

Digital coverage of the 19th century is taking another leap forward!

Given the changes in medical terminology (and practices!) since the 19th century, this should be a gold mine for topic map applications.

Solr’s New AnalyticsQuery API

July 29th, 2014

Solr’s New AnalyticsQuery API by Joel Bernstein.

From the post:

In Solr 4.9 there is a new AnalyticsQuery API that allows developers to plug custom analytic logic into Solr. The AnalyticsQuery class provides a clean and simple API that gives developers access to all the rich functionality in Lucene and is strategically placed within Solr’s distributed search architecture. Using this API you can harness all the power of Lucene and Solr to write custom analytic logic.

Not all the detail you are going to want but a good start towards using the new AnalyticsQuery API in Solr 4.9.

The AnalyticsQuery API is an example of why I wonder about projects with custom search solutions (read: not Lucene-based).

If you have any doubts, default to a Lucene-based search solution.

Using Category Theory to design…

July 29th, 2014

Using Category Theory to design implicit conversions and generic operators by John C. Reynolds.


A generalization of many-sorted algebras, called category-sorted algebras, is defined and applied to the language-design problem of avoiding anomalies in the interaction of implicit conversions and generic operators. The definition of a simple imperative language (without any binding mechanisms) is used as an example.

The greatest exposure most people have to implicit conversions is that they are handled properly, and so go unnoticed.

This paper dates from 1980 so some of the category theory jargon will seem odd but consider it a “practical” application of category theory.

That should hold your interest. ;-)

I first saw this in a tweet by scottfleischman.

Hello World! – Hadoop, Hive, Pig

July 29th, 2014

Hello World! – An introduction to Hadoop with Hive and Pig

A set of tutorials to be run on Sandbox v2.0.

From the post:

This Hadoop tutorial is from the Hortonworks Sandbox – a single-node Hadoop cluster running in a virtual machine. Download to run this and other tutorials in the series. The tutorials presented here are for Sandbox v2.0

The tutorials are presented in sections as listed below.

Maybe I have seen too many “Hello World!” examples but I was expecting the tutorials to go through the use of Hadoop, HCatalog, Hive and Pig to say “Hello World!”

You can imagine my disappointment when that wasn’t the case. ;-)

A lot of work to say “Hello World!” but on the other hand, tradition is tradition.

Kite SDK 0.15.0

July 29th, 2014

What’s New in Kite SDK 0.15.0? by Ryan Blue.

From the post:

Recently, Kite SDK, the open source toolset that helps developers build systems on the Apache Hadoop ecosystem, reached release 0.15.0. In this post, you’ll get an overview of several new features and bug fixes.

Covered by this quick recap:

Working with Datasets by URI

Improved Configuration for MR and Apache Crunch Jobs

Parent POM for Kite Applications

Java Class Hints [more informative error messages]

More Docs and Tutorials

The last addition this release is a new user guide, where we’re adding new tutorials and background articles. We’ve also updated the examples for the new features, which is a great place to learn more about Kite.

Also, watch this technical webinar on-demand to learn more about working with datasets in Kite.

I think you are going to like this.

HST V1.0 mosaics

July 29th, 2014

HST V1.0 mosaics released for Epoch 2 of Abell 2744

From the webpage:

We are pleased to announce the Version 1.0 release of Epoch 2 of Abell 2744, after the completion of all the ACS and WFC3/IR imaging on the main cluster and parallel field from our Frontier Fields program (13495, PI: J. Lotz), in addition to imaging from programs 11689 (PI: R. Dupke), 13386 (PI: S. Rodney), and 13389 (PI: B. Siana). These v1.0 mosaics have been fully recalibrated relative to the v0.5 mosaics that we have been releasing regularly throughout the course of this epoch during May, June and July 2014. For ACS, the v1.0 mosaics incorporate new bias and dark current reference files, along with CTE correction and bias destriping, and also include a set of mosaics that have been processed with the new selfcal approach to better account for the low-level dark current structure. The WFC3/IR v1.0 mosaics have improved masking for persistence and bad pixels, and in addition include a set of mosaics that have been corrected for time-variable sky emission that can occur during the orbit and can otherwise impact the up-the-ramp count-rate fitting if not properly corrected. Further details are provided in the readme file, which can be obtained along with all the mosaics at the following location:

From Wikipedia on Abell 2744:

Abell 2744, nicknamed Pandora’s Cluster, is a giant galaxy cluster resulting from the simultaneous pile-up of at least four separate, smaller galaxy clusters that took place over a span of 350 million years.[1] The galaxies in the cluster make up less than five percent of its mass.[1] The gas (around 20 percent) is so hot that it shines only in X-rays.[1] Dark matter makes up around 75 percent of the cluster’s mass.[1]

Admittedly the data is over 350 million years out of date but it is the latest data that is currently available. ;-)


Collection Pipeline

July 29th, 2014

Collection Pipeline by Martin Fowler.

From the post:

Collection pipelines are a programming pattern where you organize some computation as a sequence of operations which compose by taking a collection as output of one operation and feeding it into the next. (Common operations are filter, map, and reduce.) This pattern is common in functional programming, and also in object-oriented languages which have lambdas. This article describes the pattern with several examples of how to form pipelines, both to introduce the pattern to those unfamiliar with it, and to help people understand the core concepts so they can more easily take ideas from one language to another.

With a bit of Smalltalk history thrown in, this is a great read. Be mindful of software stories: you will learn that useful features have been dropped, ignored, and re-invented throughout software history.
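The filter → map → reduce chain Fowler describes can be rendered in a few lines of Python, each stage feeding a collection to the next:

```python
from functools import reduce

words = ["collection", "pipeline", "filter", "map", "reduce", "pipeline"]

p_words = filter(lambda w: w.startswith("p"), words)  # filter: keep 'p' words
lengths = map(len, p_words)                           # map: transform to lengths
total = reduce(lambda a, b: a + b, lengths, 0)        # reduce: fold to one value
```

Here `total` is 16 (two occurrences of "pipeline", eight characters each). Languages with method chaining or threading macros express the same composition left to right, which is part of the pattern's appeal.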


Cooper Hewitt, Color Interface

July 29th, 2014

From the about page:

Cooper Hewitt, Smithsonian Design Museum is the only museum in the nation devoted exclusively to historic and contemporary design. The Museum presents compelling perspectives on the impact of design on daily life through active educational and curatorial programming.

It is the mission of Cooper Hewitt’s staff and Board of Trustees to advance the public understanding of design across the thirty centuries of human creativity represented by the Museum’s collection. The Museum was founded in 1897 by Amy, Eleanor, and Sarah Hewitt—granddaughters of industrialist Peter Cooper—as part of The Cooper Union for the Advancement of Science and Art. A branch of the Smithsonian since 1967, Cooper-Hewitt is housed in the landmark Andrew Carnegie Mansion on Fifth Avenue in New York City.

I thought some background might be helpful because the Cooper Hewitt has a new interface:


Color, or colour, is one of the attributes we’re interested in exploring for collection browsing. Bearing in mind that only a fraction of our collection currently has images, here’s a first pass.

Objects with images now have up to five representative colors attached to them. The colors have been selected by our robotic eye machines who scour each image in small chunks to create color averages. These have then been harvested and “snapped” to the grid of 120 different colors — derived from the CSS3 palette and naming conventions — below to make navigation a little easier.
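One plausible reading of that "snapping" is a nearest-color lookup by squared Euclidean distance in RGB space. The three-entry palette below is a tiny stand-in for the museum's 120 CSS3-derived colors, and the distance metric is an assumption on my part:

```python
def snap_to_palette(rgb, palette):
    """Return the (name, rgb) palette entry closest to rgb by
    squared Euclidean distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(palette, key=lambda entry: dist2(rgb, entry[1]))

palette = [("red", (255, 0, 0)), ("green", (0, 128, 0)), ("blue", (0, 0, 255))]
name, _ = snap_to_palette((200, 30, 40), palette)  # a dark-red image average snaps to "red"
```

Production systems often do this matching in a perceptual color space rather than raw RGB, but the grid-snapping idea is the same.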

My initial reaction was to recall the old library joke where a patron comes to the circulation desk and doesn’t know a book’s title or author, but does remember it had a blue cover. ;-) At which point you wish Basil from Fawlty Towers was manning the circulation desk. ;-)

It may be a good idea with physical artifacts because color/colour is a fixed attribute that may be associated with a particular artifact.

If you know the collection, you can amuse yourself by trying to guess what objects will be returned for particular colors.

BTW, the collection is interlinked by people, roles, periods, types, countries. Very impressive!

Don’t miss the resources for developers and their GitHub account.

I first saw this in a tweet by Lyn Marie B.

PS: The use of people, roles, objects, etc. for browsing has a topic map-like feel. Since their data and other resources are downloadable, more investigation will follow.