COSMOS: Python library for massively parallel workflows

August 1st, 2014

COSMOS: Python library for massively parallel workflows by Erik Gafni, et al. (Bioinformatics (2014) doi: 10.1093/bioinformatics/btu385 )

Abstract:

Summary: Efficient workflows to shepherd clinically generated genomic data through the multiple stages of a next-generation sequencing pipeline are of critical importance in translational biomedical science. Here we present COSMOS, a Python library for workflow management that allows formal description of pipelines and partitioning of jobs. In addition, it includes a user interface for tracking the progress of jobs, abstraction of the queuing system and fine-grained control over the workflow. Workflows can be created on traditional computing clusters as well as cloud-based services.

Availability and implementation: Source code is available for academic non-commercial research purposes. Links to code and documentation are provided at http://lpm.hms.harvard.edu and http://wall-lab.stanford.edu.

Contact: dpwall@stanford.edu or peter_tonellato@hms.harvard.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

A very good abstract, but for pitching purposes I would have chosen the first paragraph of the introduction:

The growing deluge of data from next-generation sequencers leads to analyses lasting hundreds or thousands of compute hours per specimen, requiring massive computing clusters or cloud infrastructure. Existing computational tools like Pegasus (Deelman et al., 2005) and more recent efforts like Galaxy (Goecks et al., 2010) and Bpipe (Sadedin et al., 2012) allow the creation and execution of complex workflows. However, few projects have succeeded in describing complicated workflows in a simple, but powerful, language that generalizes to thousands of input files; fewer still are able to deploy workflows onto distributed resource management systems (DRMs) such as Platform Load Sharing Facility (LSF) or Sun Grid Engine that stitch together clusters of thousands of compute cores. Here we describe COSMOS, a Python library developed to address these and other needs.

That paragraph highlights the bioinformatics aspects of COSMOS but also hints at a language that might be adapted to other “massively parallel workflows.” Workflows may differ in their details but the need to define them efficiently and effectively is a common problem.
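To make “formal description of pipelines and partitioning of jobs” a little more concrete, here is a minimal sketch in plain Python. It is not the COSMOS API. The stage names (align, call_variants, merge), the sample file names and both helper functions are hypothetical, chosen only to show how a short description can fan out over many input files and still yield a valid execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_workflow(input_files):
    """Return a dict mapping each job to the set of jobs it depends on.

    Hypothetical three-stage pipeline: one `align` job and one
    `call_variants` job per input file, then a single `merge` job.
    """
    deps = {}
    for path in input_files:
        align = ("align", path)
        call = ("call_variants", path)
        deps[align] = set()      # no prerequisites
        deps[call] = {align}     # needs the alignment of the same file
        deps.setdefault(("merge", "all"), set()).add(call)
    return deps


def execution_batches(deps):
    """Group jobs into batches; jobs within a batch could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    files = ["sample_%03d.fastq" % i for i in range(3)]
    for number, batch in enumerate(execution_batches(build_workflow(files)), 1):
        print(number, batch)
```

The same few lines of description work whether the list holds three files or three thousand. Handing each batch to a cluster scheduler is the part a real workflow library has to supply.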

Toposes, Triples and Theories

August 1st, 2014

Toposes, Triples and Theories by Michael Barr and Charles Wells.

From the preface:

Chapter 1 is an introduction to category theory which develops the basic constructions in categories needed for the rest of the book. All the category theory the reader needs to understand the book is in it, but the reader should be warned that if he has had no prior exposure to categorical reasoning the book might be tough going. More discursive treatments of category theory in general may be found in Borceux [1994], Mac Lane [1998], and Barr and Wells [1999]; the last-mentioned could be suitably called a prequel to this book.

So you won’t have to dig the references out of the bibliography:

M. Barr and C. Wells, Category Theory for Computing Science, 3rd Edition. Les Publications CRM (1999).
Online at: http://www.math.mcgill.ca/triples/Barr-Wells-ctcs.pdf

F. Borceux, Handbook of Categorical Algebra I, II and III. Cambridge University Press (1994).
Cambridge Online Books has these volumes but that requires an institutional subscription.

S. Mac Lane, Categories for the Working Mathematician 2nd Edition. Springer-Verlag, 1998.
Online at: http://www.maths.ed.ac.uk/~aar/papers/maclanecat.pdf

Enjoy!

Security Thunderstorm in the Cloud

July 31st, 2014

Your data security, in the cloud and elsewhere, got weaker today.

U.S. District Judge Loretta Preska ruled today that Microsoft must turn over a customer’s emails that are stored in Ireland. (see: U.S. Judge Rules Microsoft Must Produce Emails Held Abroad)

Whether your data is stored in the U.S. or controlled by a U.S. company, it is subject to seizure by the U.S.

The ruling has been stayed pending an appeal to the 2nd U.S. Circuit Court of Appeals.

The Digital Constitution (MS) has a great set of resources on this issue:

Along with briefs filed by others:

More resources and news will appear at the Digital Constitution so sign up for updates!

The legal dancing in the briefs may not interest you but the bottom line is this:

If data can be seized by any government without regard to the location of the data, the Cloud is effectively dead for anyone concerned about data security.

You may store your data in the Cloud on European servers due to greater privacy protection by the EU. Not a concern for U.S. courts if your data is held by a U.S. company.

You may store your data in the Cloud on U.S. servers but if the Chinese government wants to seize it, Judge Preska appears to think that is ok.

Congress needs to quell this security thunderstorm in the Cloud before it does major economic damage both here and abroad.

PS: Many thanks to Joseph Palazzolo (WSJ) for pointing me to the Digital Constitution site.

How To Create Semantic Confusion

July 31st, 2014

Merge: to cause (two or more things, such as two companies) to come together and become one thing : to join or unite (one thing) with another (http://www.merriam-webster.com/dictionary/merge)

Do you see anything common between that definition of merge and:

  • It ensures that a pattern exists in the graph by creating it if it does not exist already
  • It will not use partially existing (unbound) patterns- it will attempt to match the entire pattern and create the entire pattern if missing
  • When unique constraints are defined, MERGE expects to find at most one node that matches the pattern
  • It also allows you to define what should happen based on whether data was created or matched

The quote is from Cypher MERGE Explained by Luanne Misquitta. Great post if you want to understand the operation of Cypher “merge,” which has nothing in common with the term “merge” in English.
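The bullets describe what most programmers would call “match or create.” A minimal sketch in Python, using a toy in-memory node store rather than Neo4j, shows the behavior. The ToyGraph class, its merge method and the on_create / on_match arguments are all hypothetical names of mine that loosely mirror the bullets; none of this is the Cypher implementation.

```python
class ToyGraph:
    """A toy node store; NOT Neo4j, only an illustration of the semantics."""

    def __init__(self):
        self.nodes = []  # each node is a dict with "label" and "props"

    def merge(self, label, props, on_create=None, on_match=None):
        """Match a node on the whole pattern, or create the whole pattern."""
        for node in self.nodes:
            if node["label"] == label and all(
                node["props"].get(k) == v for k, v in props.items()
            ):
                if on_match:
                    node["props"].update(on_match)
                return node, False  # matched, nothing created
        node = {"label": label, "props": dict(props)}
        if on_create:
            node["props"].update(on_create)
        self.nodes.append(node)
        return node, True  # created


graph = ToyGraph()
_, created_first = graph.merge("Person", {"name": "Ann"}, on_create={"visits": 1})
node, created_second = graph.merge("Person", {"name": "Ann"}, on_match={"visits": 2})

# A pattern that only partially exists is not reused: a new node is created.
_, created_third = graph.merge("Person", {"name": "Ann", "city": "Oslo"})
```

Notice that at no point are two things joined into one, which is the point of the complaint.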

Want to create semantic confusion?

Choose a well-known term and define new and unrelated semantics for it. That creates a demand for training and tutorials, as well as confused users.

I first saw this in a tweet by GraphAware.

Bio-Linux 8 – Released July 2014

July 31st, 2014

Bio-Linux 8 – Released July 2014

About Bio-Linux:

Bio-Linux 8 is a powerful, free bioinformatics workstation platform that can be installed on anything from a laptop to a large server, or run as a virtual machine. Bio-Linux 8 adds more than 250 bioinformatics packages to an Ubuntu Linux 14.04 LTS base, providing around 50 graphical applications and several hundred command line tools. The Galaxy environment for browser-based data analysis and workflow construction is also incorporated in Bio-Linux 8.

Bio-Linux 8 represents the continued commitment of NERC to maintain the platform, and comes with many updated and additional tools and libraries. With this release we support pre-prepared VM images for use with VirtualBox, VMWare or Parallels. Virtualised Bio-Linux will power the EOS Cloud, which is in development for launch in 2015.

You can install Bio-Linux on your machine, either as the only operating system, or as part of a dual-boot set-up which allows you to use your current system and Bio-Linux on the same hardware.

Bio-Linux can also run Live from a DVD or a USB stick. This runs in the memory of your machine and does not involve installing anything. This is a great, no-hassle way to try out Bio-Linux, demonstrate or teach with it, or to work with it when you are on the move.

Bio-Linux is built on open source systems and software, and so is free to install and use. See What’s new on Bio-Linux 8. Also, check out the 2006 paper on Bio-Linux and open source systems for biologists.

Great news if you are handling biological data!

Not to mention being a good example of multiple delivery methods: you can use Bio-Linux 8 as your OS or run it from a VM, DVD or USB stick.

How is your software delivered?

Semantic Investment Needs A Balance Sheet Line Item

July 30th, 2014

The Hidden Shareholder Boost From Information Assets by Douglas Laney.

From the post:

It’s hard today not to see the tangible, economic benefits of information all around us: Walmart uses social media trend data to entice online shoppers to purchase 10 percent to 15 percent more stuff; Kraft spinoff Mondelez grew revenue by $100 million through improved in-store promotion configurations using detailed store, chain, product, stock and pricing data; and UPS saves more than $50 million, delivers 35 percent more packages per year and has doubled driver wages by continually collecting and analyzing more than 200 data points per truck along with GPS data to reduce accidents and miles driven.

Even businesses from small city zoos to mom-and-pop coffee shops to wineries are collecting, crushing and consuming data to yield palpable revenue gains or expense reductions. In addition, some businesses beyond the traditional crop of data brokers monetize their information assets directly by selling or trading them for goods or services.

Yet while as a physical asset, technology is easily given a value attribution and represented on balance sheets; information is treated as an asset also ran or byproduct of the IT department. Your company likely accounts for and manages your office furniture with greater discipline than your information assets. Why? Because accounting standards in place since the beginning of the information age more than 50 years ago continue to be based on 250-year-old Industrial Age realities. (emphasis in original)

Does your accounting system account for your investment in semantics?

Here are some ways to find out:

  • For any ETL project in the last year, can your accounting department detail how much was spent discovering the semantics of the ETL data?
  • For any data re-used for an ETL project in the last three years, can your accounting department detail how much was spent duplicating the work of the prior ETL?
  • Can your accounting department produce a balance sheet showing your current investment in the semantics of your data?
  • Can your accounting department produce a balance sheet showing the current value of your information?

If the answer is “no” to any of those questions, is your accounting department meeting your needs in the information age?

Douglas has several tips for getting people’s attention for the $$$ you have invested in information.

Is information an investment or an unknown loss on your books?

Senator John Walsh plagiarism, color-coded

July 30th, 2014

Senator John Walsh plagiarism, color-coded by Nathan Yau.

Nathan points to a New York Times’ visualization that makes a telling case for plagiarism against Senator John Walsh.

Best if you see it at Nathan’s site; his blog formats better than mine does.

Senator Walsh was rather obvious about it but I often wonder how much news copy, print or electronic, is really original?

Some is, I am sure, but when a story goes out over AP or UPI, how much of it is repeated verbatim in other outlets?

It’s not plagiarism because someone purchased a license to repeat the stories but it certainly isn’t original.

If an AP/UPI story is distributed and re-played in 500 news outlets, it remains one story, with no more credibility than it had at the outset.

Would color coding be as effective against faceless news sources as it has been against Sen. Walsh?
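Before anything can be colored, the repeated passages have to be found. Here is a minimal sketch of that step using difflib from the Python standard library. The two helper functions and the four-word threshold are my assumptions, and this is not the method used for the New York Times graphic.

```python
from difflib import SequenceMatcher


def verbatim_runs(source, copy, min_words=4):
    """Return word runs of at least `min_words` shared verbatim by both texts."""
    a, b = source.split(), copy.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        " ".join(a[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_words
    ]


def copied_fraction(source, copy, min_words=4):
    """Fraction of the words in `copy` that fall inside a shared run."""
    words = copy.split()
    shared = sum(len(run.split()) for run in verbatim_runs(source, copy, min_words))
    return shared / len(words) if words else 0.0
```

Run against a wire story and each outlet’s version of it, copied_fraction would give a rough “how original is this?” number per outlet, and the runs themselves are the spans to color.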

BTW, if you are interested in the sordid details: Pentagon Watchdog to review plagiarism probe of Sen. John Walsh. Incumbents need not worry: Sen. Walsh is an appointed senator and therefore an easy throw-away for anyone who wants to look tough on corruption.

Ideals, Varieties, and Algorithms

July 30th, 2014

Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra by David Cox, John Little, and Donal O’Shea.

From the introduction:

We wrote this book to introduce undergraduates to some interesting ideas in algebraic geometry and commutative algebra. Until recently, these topics involved a lot of abstract mathematics and were only taught in graduate school. But in the 1960s, Buchberger and Hironaka discovered new algorithms for manipulating systems of polynomial equations. Fueled by the development of computers fast enough to run these algorithms, the last two decades have seen a minor revolution in commutative algebra. The ability to compute efficiently with polynomial equations has made it possible to investigate complicated examples that would be impossible to do by hand, and has changed the practice of much research in algebraic geometry. This has also enhanced the importance of the subject for computer scientists and engineers, who have begun to use these techniques in a whole range of problems.

The authors do presume students “…have access to a computer algebra system.”

The Wikipedia List of computer algebra systems has links to numerous such systems, a large number of which are free.

That list is headed by Axiom (Wikipedia article), which is an example of literate programming. The Axiom documentation looks like a seriously entertaining time sink! You may want to visit http://axiom-developer.org/

I haven’t installed Axiom so take that as a comment on its documentation more than its actual use. Use whatever system you like best and that fits your finances.

I first saw this in a tweet from onepaperperday.

Enjoy!

Awesome Machine Learning

July 30th, 2014

Awesome Machine Learning by Joseph Misiti.

From the webpage:

A curated list of awesome machine learning frameworks, libraries and software (by language). Inspired by awesome-php. Other awesome lists can be found in the awesome-awesomeness list.

If you want to contribute to this list (please do), send me a pull request or contact me @josephmisiti

Not strictly limited to “machine learning” as it offers resources on data analysis, visualization, etc.

With a list of 576 resources, I am sure you will find something new!

JudaicaLink released

July 30th, 2014

JudaicaLink released

From the post:

Data extractions from two encyclopediae from the domain of Jewish culture and history have been released as Linked Open Data within our JudaicaLink project.

JudaicaLink now provides access to 22,808 concepts in English (~ 10%) and Russian (~ 90%), mostly locations and persons.

See here for further information: http://www.judaicalink.org/blog/kai-eckert/encyclopedia-russian-jewry-released-updates-yivo-encyclopedia

Next steps in this project include “…the creation of links between the two encyclopedias and links to external sources like DBpedia or Geonames.”

In case you are interested, the two encyclopedias are:

The YIVO Encyclopedia of Jews in Eastern Europe, courtesy of the YIVO Institute of Jewish Research, NY.

Rujen.ru provides an Internet version of the Encyclopedia of Russian Jewry, which is published in Moscow since 1994, giving a comprehensive, objective picture of the life and activity of the Jews of Russia, the Soviet Union and the CIS.

For more details: Encyclopediae

If you are looking to contribute content or time to a humanities project, this should be on your short list.

Graphs, Databases and Graphlab

July 30th, 2014

Graphs, Databases and Graphlab by Bugra Akyildiz.

From the post:

I will talk about graphs, graph databases and mainly the paper that powers Graphlab. At the end of the post, I will go over briefly basic capabilities of Graphlab as well.

Background coverage of graphs and graph databases, followed by a discussion of GraphLab.

The high point of the post is the set of graphs generated from prior work by Bugra on the Internet Movie Database. (IMDB Top 100K Movies Analysis in Depth (Parts 1- 4))

Enjoy!

CIS 194: Introduction to Haskell (Spring 2013)

July 30th, 2014

CIS 194: Introduction to Haskell (Spring 2013) by Brent Yorgey.

From the description:

Haskell is a high-level, pure functional programming language with a strong static type system and elegant mathematical underpinnings, and is being increasingly used in industry by organizations such as Facebook, AT&T, and NASA. In the first 3/4 of the course, we will explore the joys of pure, lazy, typed functional programming in Haskell (warning: euphoric, mind-bending epiphanies may result!). The last 1/4 of the course will consist of case studies in fun and interesting applications of Haskell. Potential case study topics include automated randomized testing, software transactional memory, graphics generation, parser combinators, or others as time and interest allow. Evaluation will be based on class participation, weekly programming assignments, and an open-ended final project.

Lectures, homework, and resources on Haskell.

I particularly liked this resource and comment:

The #haskell IRC channel is a great place to get help. Strange as it may seem if you’ve spent time in other IRC channels, #haskell is always full of friendly, helpful people.

Good to know there are places like that left!

I first saw this in a tweet by Kludgy.

Multi-Term Synonyms [Bags of Properties?]

July 30th, 2014

Solution for multi-term synonyms in Lucene/Solr using the Auto Phrasing TokenFilter by Ted Sullivan.

From the post:

In a previous blog post, I introduced the AutoPhrasingTokenFilter. This filter is designed to recognize noun-phrases that represent a single entity or ‘thing’. In this post, I show how the use of this filter combined with a Synonym Filter configured to take advantage of auto phrasing, can help to solve an ongoing problem in Lucene/Solr – how to deal with multi-term synonyms.

The problem with multi-term synonyms in Lucene/Solr is well documented (see Jack Krupansky’s proposal, John Berryman’s excellent summary and Nolan Lawson’s query parser solution). Basically, what it boils down to is a problem with parallel term positions in the synonym-expanded token list – based on the way that the Lucene indexer ingests the analyzed token stream. The indexer pays attention to a token’s start position but does not attend to its position length increment. This causes multi-term tokens to overlap subsequent terms in the token stream rather than maintaining a strictly parallel relation (in terms of both start and end positions) with their synonymous terms. Therefore, rather than getting a clean ‘state-graph’, we get a pattern called “sausagination” that does not accurately reflect the 1-1 mapping of terms to synonymous terms within the flow of the text (see blog post by Mike McCandless on this issue). This problem disappears if all of the synonym pairs are single tokens.

The multi-term synonym problem was described in a Lucene JIRA ticket (LUCENE-1622) which is still marked as “Unresolved”:

Posts like this one are a temptation to sign off Twitter and read the ticket feeds for Lucene/Solr instead. Seriously.

Ted proposes a workaround to the multi-term synonym problem using the auto phrasing tokenfilter. Equally important is his conclusion:

The AutoPhrasingTokenFilter can be an important tool in solving one of the more difficult problems with Lucene/Solr search – how to deal with multi-term synonyms. Simultaneously, we can improve another serious problem that all search engines have – their focus on single tokens and the ambiguities that are present at that level. By shifting the focus more towards phrases that should be treated as semantic entities or units of language (i.e. “things”), the search engine is better able to return results based on ‘what’ the user is looking for rather than documents containing words that match the query. We are moving from searching with a “bag of words” to searching a “bag of things”.

Or more precisely:

…their focus on single tokens and the ambiguities that are present at that level. By shifting the focus more towards phrases that should be treated as semantic entities or units of language (i.e. “things”)…

Ambiguity at the token level remains, even if for particular cases phrases can be treated as semantic entities.
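To see both the gain and the remaining ambiguity, here is a minimal sketch of the auto phrasing idea in Python. It is not the AutoPhrasingTokenFilter. The phrase list, the synonym map, the underscore joining and the greedy longest-match strategy are all assumptions of mine for illustration.

```python
def auto_phrase(tokens, phrases):
    """Greedily join the longest known phrase starting at each position."""
    longest = max((len(p) for p in phrases), default=1)
    out, i = [], 0
    while i < len(tokens):
        for size in range(min(longest, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in phrases:
                out.append("_".join(candidate))
                i += size
                break
        else:
            out.append(tokens[i])  # no phrase starts here; keep the token
            i += 1
    return out


def expand_synonyms(tokens, synonyms):
    """Map each (possibly phrased) token to the set of terms at its position."""
    return [{tok} | synonyms.get(tok, set()) for tok in tokens]


PHRASES = {("big", "apple"), ("new", "york", "city")}
SYNONYMS = {"big_apple": {"new_york_city"}, "new_york_city": {"big_apple"}}

phrased = auto_phrase("hotels in the big apple".split(), PHRASES)
expanded = expand_synonyms(phrased, SYNONYMS)
```

Once “big apple” is a single token, its synonym occupies exactly one position and the overlap problem goes away. A bare “apple” elsewhere in the text is still just a string, which is the ambiguity that remains.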

Rather than Ted’s “bag of things,” may I suggest indexing “bags of properties?” Where the lowliest token or a higher semantic unit can be indexed as a bag of properties.

Imagine indexing these properties* for a single token:

  • string: value
  • pubYear: value
  • author: value
  • journal: value
  • keywords: value

Would that suffice to distinguish a term in a medical journal from Vanity Fair?

Ambiguity is predicated upon a lack of information.

That should be suggestive of a potential cure.

*(I’m not suggesting that all of those properties or even most of them would literally appear in a bag. Most, if not all, could be defaulted from an indexed source.)
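A minimal sketch of what I have in mind, in Python. The property names come from the list above; every value and the matches helper are invented for illustration, and a real index would default most of these properties from the source document rather than store them per token.

```python
def matches(bag, **wanted):
    """True when every requested property has the requested value."""
    return all(bag.get(key) == value for key, value in wanted.items())


# Two occurrences of the same string, each indexed as a bag of properties.
# Every value here is invented for illustration.
index = [
    {"string": "culture", "journal": "medical journal",
     "pubYear": 2014, "keywords": ("microbiology",)},
    {"string": "culture", "journal": "Vanity Fair",
     "pubYear": 2014, "keywords": ("celebrity",)},
]

by_string = [bag for bag in index if matches(bag, string="culture")]
by_string_and_journal = [
    bag for bag in index
    if matches(bag, string="culture", journal="medical journal")
]
```

The string alone returns both occurrences. Adding one more property returns one.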

I first saw this in a tweet by SolrLucene.

Scrapy and Elasticsearch

July 30th, 2014

Scrapy and Elasticsearch by Florian Hopf.

From the post:

On 29.07.2014 I gave a talk at Search Meetup Karlsruhe on using Scrapy with Elasticsearch, the slides are here. This post evolved from the talk and introduces you to web scraping and search with Scrapy and Elasticsearch.

Web Crawling

You might think that web crawling and scraping only is for search engines like Google and Bing. But a lot of companies are using it for different purposes: Price comparison, financial risk information and portals all need a way to get the data. And at least sometimes the way is to retrieve it through some public website. Besides these cases where the data is not in your hand it can also make sense if the data is aggregated already. For intranet and portal search engines it can be easier to just scrape the frontend instead of building data import facilities for different, sometimes even old systems.

The Example

In this post we are looking at a rather artificial example: Crawling the meetup.com page for recent meetups to make them available for search. Why artificial? Because meetup.com has an API that provides all the data in a more convenient way. But imagine there is no other way and we would like to build a custom search on this information, probably by adding other event sites as well. (emphasis in original)

Not everything you need to know about Scrapy but enough to get you interested.

APIs for data are on the upswing but web scrapers will be relevant to data mining for decades to come.
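The scrape-then-index flow can be sketched without installing anything. The following uses html.parser from the Python standard library as a stand-in for Scrapy and only builds the text of a bulk indexing request; nothing is sent anywhere. The HTML snippet, the event-title class, the meetups index name and both helpers are hypothetical. The request body follows the Elasticsearch bulk format of an action line followed by a document line, though depending on the Elasticsearch version the action line may need further fields.

```python
import json
from html.parser import HTMLParser


class TitleScraper(HTMLParser):
    """Collect the text of every <h3 class="event-title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "h3" and ("class", "event-title") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_title = False


def bulk_body(titles, index="meetups"):
    """Build a newline-delimited bulk body: action line, then document line."""
    lines = []
    for doc_id, title in enumerate(titles, 1):
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        lines.append(json.dumps({"title": title.strip()}))
    return "\n".join(lines) + "\n"


PAGE = """
<div><h3 class="event-title">Search Meetup Karlsruhe</h3></div>
<div><h3 class="event-title">Scraping for Search</h3></div>
"""

scraper = TitleScraper()
scraper.feed(PAGE)
print(bulk_body(scraper.titles))
```

Scrapy replaces the hand-written parser with spiders, selectors and item pipelines, which is where Florian’s post picks up.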

Expanded 19th-century Medical Collection

July 30th, 2014

Wellcome Library and Jisc announce partners in 19th-century medical books digitisation project

From the post:

The libraries of six universities have joined the partnership – UCL (University College London), the University of Leeds, the University of Glasgow, the London School of Hygiene & Tropical Medicine, King’s College London and the University of Bristol – along with the libraries of the Royal College of Physicians of London, the Royal College of Physicians of Edinburgh and the Royal College of Surgeons of England.

Approximately 15 million pages of printed books and pamphlets from all ten partners will be digitised over a period of two years and will be made freely available to researchers and the public under an open licence. By pooling their collections the partners will create a comprehensive online library. The content will be available on multiple platforms to broaden access, including the Internet Archive, the Wellcome Library and Jisc Historic Books.

The project’s focus is on books and pamphlets from the 19th century that are on the subject of medicine or its related disciplines. This will include works relating to the medical sciences, consumer health, sport and fitness, as well as different kinds of medical practice, from phrenology to hydrotherapy. Works on food and nutrition will also feature: around 1400 cookery books from the University of Leeds are among those lined up for digitisation. They, along with works from the other partner institutions, will be transported to the Wellcome Library in London where a team from the Internet Archive will undertake the digitisation work. The project will build on the success of the US-based Medical Heritage Library consortium, of which the Wellcome Library is a part, which has already digitised over 50 000 books and pamphlets.

Digital coverage of the 19th century is taking another leap forward!

Given the changes in medical terminology (and practices!) since the 19th century, this should be a gold mine for topic map applications.

Solr’s New AnalyticsQuery API

July 29th, 2014

Solr’s New AnalyticsQuery API by Joel Bernstein.

From the post:

In Solr 4.9 there is a new AnalyticsQuery API that allows developers to plug custom analytic logic into Solr. The AnalyticsQuery class provides a clean and simple API that gives developers access to all the rich functionality in Lucene and is strategically placed within Solr’s distributed search architecture. Using this API you can harness all the power of Lucene and Solr to write custom analytic logic.

Not all the detail you are going to want but a good start towards using the new AnalyticsQuery API in Solr 4.9.

The AnalyticsQuery API is an example of why I wonder about projects with custom search solutions (read: not Lucene-based).

If you have any doubts, default to a Lucene-based search solution.

Using Category Theory to design…

July 29th, 2014

Using Category Theory to design implicit conversions and generic operators by John C. Reynolds.

Abstract:

A generalization of many-sorted algebras, called category-sorted algebras, is defined and applied to the language-design problem of avoiding anomalies in the interaction of implicit conversions and generic operators. The definition of a simple imperative language (without any binding mechanisms) is used as an example.

The greatest exposure most people have to implicit conversions is assuming that they are handled properly.

This paper dates from 1980 so some of the category theory jargon will seem odd but consider it a “practical” application of category theory.

That should hold your interest. ;-)
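One way to read “anomalies” here: a generic operator behaves consistently only if converting first and then operating gives the same answer as operating first and then converting. The Python sketch below checks that condition for an integer-to-real conversion. The function names are mine, and this is my paraphrase of the idea, not a reproduction of the paper’s formalism.

```python
def to_real(n):
    """The implicit conversion under test: integer to real."""
    return float(n)


def commutes(int_op, real_op, a, b):
    """Check to_real(int_op(a, b)) == real_op(to_real(a), to_real(b))."""
    return to_real(int_op(a, b)) == real_op(to_real(a), to_real(b))


def add(x, y):
    return x + y


def int_div(x, y):
    return x // y  # integer (floor) division


def real_div(x, y):
    return x / y


samples = [(7, 2), (8, 4), (-3, 5)]
addition_ok = all(commutes(add, add, a, b) for a, b in samples)
division_ok = all(commutes(int_div, real_div, a, b) for a, b in samples)
```

Here addition_ok is true and division_ok is false. If one symbol stood for both kinds of division, the value of 7 / 2 would depend on where the conversion happened to be inserted, which is exactly the sort of surprise a language designer wants to rule out.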

I first saw this in a tweet by scottfleischman.

Hello World! – Hadoop, Hive, Pig

July 29th, 2014

Hello World! – An introduction to Hadoop with Hive and Pig

A set of tutorials to be run on Sandbox v2.0.

From the post:

This Hadoop tutorial is from the Hortonworks Sandbox – a single-node Hadoop cluster running in a virtual machine. Download to run this and other tutorials in the series. The tutorials presented here are for Sandbox v2.0

The tutorials are presented in sections as listed below.

Maybe I have seen too many “Hello World!” examples but I was expecting the tutorials to go through the use of Hadoop, HCatalog, Hive and Pig to say “Hello World!”

You can imagine my disappointment when that wasn’t the case. ;-)

A lot of work to say “Hello World!” but on the other hand, tradition is tradition.

Kite SDK 0.15.0

July 29th, 2014

What’s New in Kite SDK 0.15.0? by Ryan Blue.

From the post:

Recently, Kite SDK, the open source toolset that helps developers build systems on the Apache Hadoop ecosystem, became a 0.15.0. In this post, you’ll get an overview of several new features and bug fixes.

Covered by this quick recap:

Working with Datasets by URI

Improved Configuration for MR and Apache Crunch Jobs

Parent POM for Kite Applications

Java Class Hints [more informative error messages]

More Docs and Tutorials

The last addition this release is a new user guide on kitesdk.org, where we’re adding new tutorials and background articles. We’ve also updated the examples for the new features, which is a great place to learn more about Kite.

Also, watch this technical webinar on-demand to learn more about working with datasets in Kite.

I think you are going to like this.

HST V1.0 mosaics

July 29th, 2014

HST V1.0 mosaics released for Epoch 2 of Abell 2744

From the webpage:

We are pleased to announce the Version 1.0 release of Epoch 2 of Abell 2744, after the completion of all the ACS and WFC3/IR imaging on the main cluster and parallel field from our Frontier Fields program (13495, PI: J. Lotz), in addition to imaging from programs 11689 (PI: R. Dupke), 13386 (PI: S. Rodney), and 13389 (PI: B. Siana). These v1.0 mosaics have been fully recalibrated relative to the v0.5 mosaics that we have been releasing regularly throughout the course of this epoch during May, June and July 2014. For ACS, the v1.0 mosaics incorporate new bias and dark current reference files, along with CTE correction and bias destriping, and also include a set of mosaics that have been processed with the new selfcal approach to better account for the low-level dark current structure. The WFC3/IR v1.0 mosaics have improved masking for persistence and bad pixels, and in addition include a set of mosaics that have been corrected for time-variable sky emission that can occur during the orbit and can otherwise impact the up-the-ramp count-rate fitting if not properly corrected. Further details are provided in the readme file, which can be obtained along with all the mosaics at the following location:

From Wikipedia on Abell 2744:

Abell 2744, nicknamed Pandora’s Cluster, is a giant galaxy cluster resulting from the simultaneous pile-up of at least four separate, smaller galaxy clusters that took place over a span of 350 million years.[1] The galaxies in the cluster make up less than five percent of its mass.[1] The gas (around 20 percent) is so hot that it shines only in X-rays.[1] Dark matter makes up around 75 percent of the cluster’s mass.[1]

Admittedly the data is over 350 million years out of date but it is the latest data that is currently available. ;-)

Enjoy!

Collection Pipeline

July 29th, 2014

Collection Pipeline by Martin Fowler.

From the post:

Collection pipelines are a programming pattern where you organize some computation as a sequence of operations which compose by taking a collection as output of one operation and feeding it into the next. (Common operations are filter, map, and reduce.) This pattern is common in functional programming, and also in object-oriented languages which have lambdas. This article describes the pattern with several examples of how to form pipelines, both to introduce the pattern to those unfamiliar with it, and to help people understand the core concepts so they can more easily take ideas from one language to another.

With a bit of Smalltalk history thrown in, this is a great read. Be mindful of software stories. You will learn that useful features have been dropped, ignored and re-invented throughout software history.
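If you have not met the pattern before, here is a minimal sketch in Python (my example, not one of Fowler’s). The records are hypothetical; the point is only that each operation takes the collection produced by the one before it.

```python
from functools import reduce

# Hypothetical records: (title, tags, word_count)
articles = [
    ("Collection Pipeline", ("patterns", "functional"), 9000),
    ("Bio-Linux 8", ("bioinformatics",), 400),
    ("Auto Phrasing", ("search", "functional"), 2500),
]

# filter -> map -> reduce, each step consuming the previous step's output
functional = filter(lambda a: "functional" in a[1], articles)
word_counts = map(lambda a: a[2], functional)
total_words = reduce(lambda total, n: total + n, word_counts, 0)

# The same pipeline written as a generator expression
total_again = sum(count for _, tags, count in articles if "functional" in tags)
```

Both forms compute the same total. Which one reads better is largely a matter of the language’s idiom, which is one of the things the article compares.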

Enjoy!

Cooper Hewitt, Color Interface

July 29th, 2014

From the about page:

Cooper Hewitt, Smithsonian Design Museum is the only museum in the nation devoted exclusively to historic and contemporary design. The Museum presents compelling perspectives on the impact of design on daily life through active educational and curatorial programming.

It is the mission of Cooper Hewitt’s staff and Board of Trustees to advance the public understanding of design across the thirty centuries of human creativity represented by the Museum’s collection. The Museum was founded in 1897 by Amy, Eleanor, and Sarah Hewitt—granddaughters of industrialist Peter Cooper—as part of The Cooper Union for the Advancement of Science and Art. A branch of the Smithsonian since 1967, Cooper-Hewitt is housed in the landmark Andrew Carnegie Mansion on Fifth Avenue in New York City.

I thought some background might be helpful because the Cooper Hewitt has a new interface:

COLORS

Color, or colour, is one of the attributes we’re interested in exploring for collection browsing. Bearing in mind that only a fraction of our collection currently has images, here’s a first pass.

Objects with images now have up to five representative colors attached to them. The colors have been selected by our robotic eye machines who scour each image in small chunks to create color averages. These have then been harvested and “snapped” to the grid of 120 different colors — derived from the CSS3 palette and naming conventions — below to make navigation a little easier.
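The “average, then snap to the grid” step is easy to sketch in Python. This is a guess at the general technique, not the museum’s code: the palette below is a handful of CSS named colors rather than their 120, and using squared distance in RGB space to pick the nearest color is my assumption.

```python
PALETTE = {  # a few CSS named colors; the museum's grid has 120
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "red": (255, 0, 0),
    "navy": (0, 0, 128),
    "teal": (0, 128, 128),
    "orange": (255, 165, 0),
}


def average(pixels):
    """Mean RGB of a chunk of pixels, rounded to integers."""
    count = len(pixels)
    return tuple(round(sum(p[i] for p in pixels) / count) for i in range(3))


def snap(rgb, palette=PALETTE):
    """Name of the palette color closest to `rgb` (squared RGB distance)."""
    return min(
        palette,
        key=lambda name: sum((c - t) ** 2 for c, t in zip(rgb, palette[name])),
    )


chunk = [(250, 160, 10), (255, 170, 0), (245, 150, 20)]
print(snap(average(chunk)))
```

Snapping trades precision for browsability: many slightly different oranges end up under one name, which is what makes a color grid navigable at all.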

My initial reaction was to recall the old library joke where a patron comes to the circulation desk and doesn’t know a book’s title or author, but does remember it had a blue cover. ;-) At which point you wish Basil from Fawlty Towers was manning the circulation desk. ;-)

It may be a good idea with physical artifacts because color/colour is a fixed attribute that may be associated with a particular artifact.

If you know the collection, you can amuse yourself by trying to guess what objects will be returned for particular colors.

BTW, the collection is interlinked by people, roles, periods, types, countries. Very impressive!

Don’t miss the resources for developers at: https://collection.cooperhewitt.org/developers/ and their GitHub account.

I first saw this in a tweet by Lyn Marie B.

PS: The use of people, roles, objects, etc. for browsing has a topic map-like feel. Since their data and other resources are downloadable, more investigation will follow.

MusicGraph

July 29th, 2014

Senzari Unveils MusicGraph.ai At The GraphLab Conference 2014

From the post:

Senzari introduced MusicGraph.ai, the first web-based graph analytics and intelligence engine for the music industry at the GraphLab Conference 2014, the annual gathering of leading data scientists and machine learning experts. MusicGraph.ai will serve as the primary dashboard for MusicGraph, where API clients will be able to view detailed reports on their API usage and manage their account. More importantly, through this dashboard, they will also be able to access a comprehensive library of algorithms to extract even more value from the world’s most extensive repository of music data.

“We believe MusicGraph.ai will forever change the music intelligence industry, as it allows scientists to execute powerful analytics and machine learning algorithms at scale on a huge data-set without the need to write a single-line of code”

Free access to MusicGraph at: http://developer.musicgraph.com

I originally encountered MusicGraph because of its use of the Titan graph database. BTW, GraphLab and GraphX are also available for data analytics.

From the MusicGraph website:

MusicGraph is the world’s first “natural graph” for music, which represents the real-world structure of the musical universe. Information contained within it includes data related to the relationship between millions of artists, albums, and songs. Also included is detailed acoustical and lyrical features, as well as real-time statistics across artists and their music across many sources.

MusicGraph has over 600 million vertices and 1 billion edges, but more importantly it has over 7 billion properties, which allows for deep knowledge extraction through various machine learning approaches.

Sigh, why can’t people say: “…it represents a useful view of the musical universe…,” instead of “…which represents the real-world structure of the musical universe”? All representations are views of some observer. (full stop) If you think otherwise, please return your college and graduate degrees for a refund.

Yes, I know political leaders use “real world” all the time. But they are trying to deceive you into accepting their view as beyond question because it represents the “real world.” Don’t be deceived. Their views are no more “real world” based than yours are. Which is to say, not at all. Defend your view, but know that it is a view.

I first saw this in a tweet by Gregory Piatetsky.

Alphabetical Order

July 29th, 2014

Alphabetical order explained in a mere 27,817 words by David Weinberger.

From the post:

This is one of the most amazing examples I’ve seen of the complexity of even simple organizational schemes. “Unicode Collation Algorithm (Unicode Technical Standard #10)” spells out in precise detail how to sort strings in what we might colloquially call “alphabetical order.” But it’s way, way, way more complex than that.

Unicode is an international standard for how strings of characters get represented within computing systems. For example, in the familiar ASCII encoding, the letter “A” is represented in computers by the number 65. But ASCII is too limited to encode the world’s alphabets. Unicode does the job.

As the paper says, “Collation is the general term for the process and function of determining the sorting order of strings of characters” so that, for example, users can look them up on a list. Alphabetical order is a simple form of collation.

The best part is the summary of Unicode Technical Standard #10:

This document dives resolutely into the brambles and does not give up. It incidentally exposes just how complicated even the simplest of sorting tasks is when looked at in their full context, where that context is history, language, culture, and the ambiguity in which they thrive.

We all learned the meaning of “alphabetical order” in elementary school. But which “alphabetical order” depends upon language, culture, context, etc.
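A small Python sketch shows how quickly "alphabetical order" splinters. The word list and the three orderings are my own illustration. None of them is the Unicode Collation Algorithm, which is multi-level and tailored per language; they only show that the answer depends on the rule you pick.

```python
import unicodedata

# "\u00e9" is a precomposed e with acute accent.
words = ["zebra", "Apple", "\u00e9clair", "apple", "Zoo", "eclair"]

# 1. Raw code point order: all uppercase before lowercase, accented letters after "z".
by_code_point = sorted(words)

# 2. Case-insensitive order: closer to a dictionary, but the accented word still sorts last.
by_casefold = sorted(words, key=str.casefold)


def base_letters(word):
    """Decompose accented letters, drop the combining marks, and ignore case."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).casefold()


# 3. Ignore accents and case: one of many possible "dictionary" orders.
by_base_letter = sorted(words, key=base_letters)

for ordering in (by_code_point, by_casefold, by_base_letter):
    print(ordering)
```

Three rules, three different lists, and that is before asking which language the reader speaks.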

Other terms and phrases have the same problem. But the vast majority of them have no Unicode Technical Standard with all the possible meanings.

For those terms there are topic maps.

I first saw this in a tweet by Computer Science.

Synchronizer Based on Operational Transformation…

July 28th, 2014

Synchronizer Based on Operational Transformation for P2P Environments by Michelle Cart and Jean Ferrié

Abstract:

Reconciling divergent copies is a common problem encountered in distributed or mobile systems, asynchronous collaborative groupware, concurrent engineering, software configuration management, version control systems and personal work involving several mobile computing devices. Synchronizers provide a solution by enabling two divergent copies of the same object to be reconciled. Unfortunately, a master copy is generally required before they can be used for reconciling n copies, otherwise copy convergence will not be achieved. This paper presents the principles and algorithm of a Synchronizer which provides the means to reconcile n copies, without discriminating in favour of any particular copy. Copies can be modified (concurrently or not) on different sites and the Synchronizer we propose enables them to be reconciled pairwise, at any time, regardless of the pair, while achieving convergence of all copies. For this purpose, it uses the history of operations executed on each copy and Operational Transformations. It does not require a centralised or ordering (timestamp, state vector, etc.) mechanism. Its main advantage is thus to enable free and lazy propagation of copy updates while ensuring their convergence – it is particularly suitable for P2P environments in which no copy should be favoured.
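If operational transformation is new to you, the core idea can be sketched with two concurrent inserts. This is the textbook insert/insert transformation, not the Synchronizer from the paper, which works from the history of operations on each copy. The `Insert` type, the function names, and the site-number tie-breaker are all assumptions made for the illustration.

```python
from typing import NamedTuple


class Insert(NamedTuple):
    pos: int    # index at which the text is inserted
    text: str   # text to insert
    site: int   # hypothetical site number, used only to break position ties


def apply_op(doc, op):
    """Apply an insert to a string and return the new string."""
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op, against):
    """Rewrite op so it can be applied after `against` has already been applied."""
    if op.pos < against.pos or (op.pos == against.pos and op.site < against.site):
        return op  # `against` landed to the right, so op is unaffected
    return op._replace(pos=op.pos + len(against.text))  # shift past the other insert


base = "abc"
op_a = Insert(1, "X", site=1)   # made at site 1
op_b = Insert(2, "Y", site=2)   # made concurrently at site 2

# Each site applies its own operation first, then the transformed remote one.
copy_1 = apply_op(apply_op(base, op_a), transform(op_b, op_a))
copy_2 = apply_op(apply_op(base, op_b), transform(op_a, op_b))
print(copy_1, copy_2)
```

Both copies end up identical even though the operations arrived in different orders, which is the convergence property the abstract is about.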

Dating from 2007, this is neither the oldest work on operational transformation nor the most recent.

Certainly of interest for distributed topic maps as well as other change tracking applications.

I first saw this in a tweet by onepaperperday.

A Survey of Graph Theory and Applications in Neo4J

July 28th, 2014

A Survey of Graph Theory and Applications in Neo4J by Geoff Moes.

A great summary of resources on graph theory along with a two-part presentation on the same.

Geoff mentions: Graph Theory, 1736-1936 by Norman L. Biggs, E. Keith Lloyd, and Robin J. Wilson, putting to rest any notion that graphs are a recent invention.

Enjoy!

Oryx 2:…

July 28th, 2014

Oryx 2: Lambda architecture on Spark for real-time large scale machine learning

From the overview:

This is a redesign of the Oryx project as “Oryx 2.0”. The primary design goals are:

1. A more reusable platform for lambda-architecture-style designs, with batch, speed and serving layers

2. Make each layer usable independently

3. Fuller support for common machine learning needs

  • Test/train set split and evaluation
  • Parallel model build
  • Hyper-parameter selection

4. Use newer technologies like Spark and Streaming in order to simplify:

  • Remove separate in-core implementations for scale-down
  • Remove custom data transport implementation in favor of message queues like Apache Kafka
  • Use a ‘real’ streaming framework instead of reimplementing a simple one
  • Remove complex MapReduce-based implementations in favor of Apache Spark-based implementations

5. Support more input (i.e. not just CSV)
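For anyone unfamiliar with the batch, speed and serving layers named in the first goal, here is a toy sketch of the split. It does not use or imitate the Oryx API. The class names are hypothetical and the "model" is nothing more than item counts.

```python
from collections import Counter


class BatchLayer:
    """Rebuilds the model from the full history (slow, but complete)."""

    def build(self, history):
        return Counter(history)


class SpeedLayer:
    """Tracks events seen since the last batch build (fast, incremental)."""

    def __init__(self):
        self.delta = Counter()

    def update(self, event):
        self.delta[event] += 1


class ServingLayer:
    """Answers queries by combining the batch model with the speed-layer delta."""

    def __init__(self, batch_model, speed_layer):
        self.batch_model = batch_model
        self.speed_layer = speed_layer

    def count(self, item):
        return self.batch_model[item] + self.speed_layer.delta[item]


history = ["a", "b", "a"]              # hypothetical events already in storage
batch_model = BatchLayer().build(history)

speed = SpeedLayer()
serving = ServingLayer(batch_model, speed)

speed.update("a")                      # new events arrive after the batch build
speed.update("c")

print(serving.count("a"), serving.count("b"), serving.count("c"))
```

Each class can be used on its own, which is in the spirit of the second goal, though a real system would put a message queue between them.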

Initial import was three days ago if you are interested in being in on the beginning!

Visualizing High-Dimensional Data…

July 28th, 2014

Visualizing High-Dimensional Data in the Browser with SVD, t-SNE and Three.js by Nicolas Kruchten.

From the post:

Data visualization, by definition, involves making a two- or three-dimensional picture of data, so when the data being visualized inherently has many more dimensions than two or three, a big component of data visualization is dimensionality reduction. Dimensionality reduction is also often the first step in a big-data machine-learning pipeline, because most machine-learning algorithms suffer from the Curse of Dimensionality: more dimensions in the input means you need exponentially more training data to create a good model. Datacratic’s products operate on billions of data points (big data) in tens of thousands of dimensions (big problem), and in this post, we show off a proof of concept for interactively visualizing this kind of data in a browser, in 3D (of course, the images on the screen are two-dimensional but we use interactivity, motion and perspective to evoke a third dimension).

Both the post and the demo are very impressive!

For a compelling review, see Dimension Reduction: A Guided Tour by Christopher J.C. Burges.

Christopher captures my concern with dimensional reduction in the first sentence of the introduction:

Dimension reduction is the mapping of data to a lower dimensional space such that uninformative variance in the data is discarded, or such that a subspace in which the data lives is detected.

I understand the need for dimensional reduction and that it can produce useful results. But what is being missed in the “…uninformative variance in the data…” is unknown.

Not an argument against dimensional reduction but a caution to avoid quickly dismissing variation in data as “uninformative.”
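One modest safeguard is to measure how much variance a reduction throws away before deciding it was uninformative. The sketch below is a PCA-style reduction to a single dimension in plain Python, using power iteration on the covariance matrix instead of a library SVD. The data points and function names are invented for the example.

```python
import math


def center(rows):
    """Subtract the per-column mean from every row."""
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, dims = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(dims)]
            for i in range(dims)]


def top_component(cov, iterations=200):
    """Power iteration: approximate the leading eigenvector and its eigenvalue."""
    size = len(cov)
    v = [1.0] * size
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(size)) for i in range(size)]
    eigenvalue = sum(v[i] * cov_v[i] for i in range(size))
    return v, eigenvalue


# Hypothetical 3-D points that lie close to a line.
points = [[1.0, 2.0, 0.1], [2.0, 4.1, -0.2], [3.0, 5.9, 0.0],
          [4.0, 8.2, 0.3], [5.0, 9.8, -0.1]]

cov = covariance(center(points))
direction, kept_variance = top_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
discarded_fraction = 1.0 - kept_variance / total_variance

print(f"discarded fraction of variance: {discarded_fraction:.4f}")
```

The number tells you how much variation was dropped. It does not tell you whether that variation mattered, which is the caution above.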

When did linking begin?

July 28th, 2014

When did linking begin? by Bob DuCharme.

Bob republished an old tilt at a windmill that attempts to claim “linking” as beginning in the 12th century CE.

It’s an interesting read but I disagree with his dismissal of quoting of a work as a form of linking. Bob says:

Quoting of one work by another was certainly around long before the twelfth century, but if an author doesn’t identify an address for his source, his reference can’t be traversed, so it’s not really a link. Before the twelfth century, religious works had a long tradition of quoting and discussing other works, but in many traditions (for example, Islam, Theravada Buddhism, and Vedic Hinduism) memorization of complete religious works was so common that telling someone where to look within a work was unnecessary. If one Muslim scholar said to another “In the words of the Prophet…” he didn’t need to name the sura of the Qur’an that the quoted words came from; he could assume that his listener already knew. Describing such allusions as “links” adds heft to claims that linking is thousands of years old, but a link that doesn’t provide an address for its destination can’t be traversed, and a link that can’t be traversed isn’t much of a link. And, such claims diminish the tremendous achievements of the 12th-century scholars who developed new techniques to navigate the accumulating amounts of recorded information they were studying. (emphasis added)

Bob’s error is too narrow a view of the term “address.” Quoted text of the Hebrew Bible acts as an “address,” assuming you are familiar enough with the text. The same is true for the examples of the Qur’an and Vedic Hinduism. It is as certain and precise as a chapter and verse reference, but it does require a degree of knowledge of the text in question.

That does not take anything away from 12th century scholars who created addresses that did not require as much knowledge of the underlying text. Faced with more and more information, their inventions assisted in navigating texts with a new type of address, one that could be used by anyone.

Taking a broader view of addressing creates a continuum of addressing that encompasses web-based linking. Rather than using separate systems of physical addresses to locate information in books, users now have electronic addresses that can deliver them to particular locations in a work.

Here is my continuum of linking:

Linking          | Requires
Quoting          | Memorized Text
Reference System | Copy of Text
WWW Hyperlink    | Access to Text

The question to ask about Bob’s point about quoting “…his reference can’t be traversed…” is, “Who can’t traverse that link?” Anyone who has memorized the text can quite easily.

Oh, people who have not memorized the text cannot traverse the link. And? If I don’t have access to the WWW, I can’t traverse hyperlinks. Does that make them any less links?

Or does it mean I haven’t met the conditions for exercising the link?

Instead of diminishing the work of 12th century scholars, recognizing prior linking practices allows us to explore what changed and for whom as a result of their “…tremendous achievements….”

Basic Category Theory (Publish With CUP)

July 28th, 2014

Basic Category Theory by Tom Leinster.

From the webpage:

Basic Category Theory is an introductory category theory textbook. Features:

  • It doesn’t assume much, either in terms of background or mathematical maturity.
  • It sticks to the basics.
  • It’s short.

Advanced topics are omitted, leaving more space for careful explanations of the core concepts. I used earlier versions of the text to teach master’s-level courses at the University of Glasgow.

The book is published by Cambridge University Press. You can find all the publication data, and buy it, at the book’s CUP web page.

It was published on 24 July 2014 in hardback and e-book formats. The physical book should be in stock throughout Europe now, and worldwide by mid-September. Wherever you are, you can (pre)order it now from CUP or the usual online stores.

By arrangement with CUP, a free online version will be released in January 2016. This will be not only freely downloadable but also freely editable, under a Creative Commons licence. So, for instance, if parts of the book are unsuitable for the course you’re teaching, or if you don’t like the notation, you can change it. More details will appear here when the time comes.

Freely available as etext (about 18 months after hard copy release) and freely editable?

Show of hands. How many publishers have you seen with those policies?

I keep coming up with one, Cambridge University Press, CUP.

As readers and authors we need to vote with our feet. Purchase from and publish with Cambridge University Press.

It may take a while but other publishers may finally notice.