## Updating OpenStreetMap with the latest US road data

December 8th, 2013

Updating OpenStreetMap with the latest US road data by Eric Fisher.

From the post:

We can now pull the most current US government index of all roads directly into OpenStreetMap for tracing. Just go to OpenStreetMap.org, click Edit, and choose the “New & Misaligned TIGER Roads” option from the layer menu. “TIGER” is the name of the US road database managed by the Census Bureau. The TIGER layer will reveal in yellow any roads that have been corrected in or added to TIGER since 2006 and that have not also been corrected in OpenStreetMap. Zoom in on any yellow road to see how TIGER now maps it, verify it against the aerial imagery, and correct it in OpenStreetMap.

This could be very useful.

For planning protest, retreat, and escape routes and such.

## Advances in Neural Information Processing Systems 26

December 8th, 2013

Advances in Neural Information Processing Systems 26

The NIPS 2013 conference ended today.

All of the NIPS 2013 papers were posted today.

I count three hundred and sixty (360) papers.

From the NIPS Foundation homepage:

The Foundation: The Neural Information Processing Systems (NIPS) Foundation is a non-profit corporation whose purpose is to foster the exchange of research on neural information processing systems in their biological, technological, mathematical, and theoretical aspects. Neural information processing is a field which benefits from a combined view of biological, physical, mathematical, and computational sciences.

The primary focus of the NIPS Foundation is the presentation of a continuing series of professional meetings known as the Neural Information Processing Systems Conference, held over the years at various locations in the United States, Canada and Spain.

Enjoy the proceedings collection!

I first saw this in a tweet by Benoit Maison.

## Mapping the open web using GeoJSON

December 8th, 2013

Mapping the open web using GeoJSON by Sean Gillies.

From the post:

GeoJSON is an open format for encoding information about geographic features using JSON. It has much in common with older GIS formats, but also a few new twists: GeoJSON is a text format, has a flexible schema, and is specified in a single HTML page. The specification is informed by standards such as OGC Simple Features and Web Feature Service and streamlines them to suit the way web developers actually build software today.

Promoted by GitHub and used in the Twitter API, GeoJSON has become a big deal in the open web. We are huge fans of the little format that could. GeoJSON suits the web and suits us very well; it plays a major part in our libraries, services, and products.
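
To make that concrete, here is a minimal GeoJSON Feature round-tripped with Python's standard json module (the coordinates and properties are my own example):

```python
import json

# A minimal GeoJSON Feature: a geometry plus free-form properties.
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-77.0365, 38.8977]},
    "properties": {"name": "White House"},
}

# GeoJSON is just JSON text, so the standard library handles it directly.
encoded = json.dumps(feature, sort_keys=True)
decoded = json.loads(encoded)
print(decoded["geometry"]["type"])  # Point
```

The flexible schema the post mentions is visible here: `properties` can hold any JSON object you like.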

A short but useful review of why GeoJSON is important to MapBox and why it should be important to you.

A must-read if you are interested in placing geo-located data of interest to your users onto maps.

Sean mentions that GitHub promotes GeoJSON, but I’m curious whether the NSA uses or promotes it as well.

## Neo4j GraphGist December Challenge

December 8th, 2013

Neo4j GraphGist December Challenge

The meetup slides say the deadline for entry is January 31st, 2014. I mention that because the webpage still says December 31, 2013.

From the webpage:

This time we want you to look into these 10 categories and provide us with really easy to understand and still insightful Graph Use-Cases: Do not take the example keywords literally, you know your domain much better than we do!

• Education – Schools, Universities, Courses, Planning, Management etc
• Finance – Loans, Risks, Fraud
• Life Science – Biology, Genetics, Drug research, Medicine, Doctors, Referrals
• Manufacturing – production line management, supply chain, parts list, product lines
• Sports – Football, Baseball, Olympics, Public Sports
• Resources – Energy Market, Consumption, Resource exploration, Green Energy, Climate Modeling
• Retail – Recommendations, Product categories, Price Management, Seasons, Collections
• Telecommunication – Infrastructure, Authorization, Planning, Impact
• Transport – Shipping, Logistics, Flights, Cruises, Road/Train optimizations, Schedules
• Advanced Graph Gists – for those of you that run outside of the competition anyway, give your best

Prizes:

We want to offer in each of our 10 categories Amazon gift-cards valued:

1. Winner: 300 USD
2. Second: 150 USD
3. Third: 50 USD
Every participant also gets a special GraphGist t-shirt.

In addition to the resources at the webpage, you may find AsciiDoc Cheatsheet helpful.
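
For reference, a GraphGist is an AsciiDoc file with embedded Cypher blocks that the GraphGist site renders and executes. A skeleton might look like this (my own minimal sketch; check the GraphGist documentation for the exact directives):

```asciidoc
= My Domain GraphGist

A sentence or two describing the use case.

//console

[source,cypher]
----
CREATE (a:Person {name: "Alice"})-[:KNOWS]->(b:Person {name: "Bob"})
----

//graph

== A first query

[source,cypher]
----
MATCH (p:Person)-[:KNOWS]->(friend)
RETURN p.name, friend.name
----

//table
```

The `//console`, `//graph`, and `//table` comments are rendering hints for the GraphGist viewer.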

The meetup video where the GraphGist was announced.

Easy to understand graph use cases should not be too difficult.

Easy to solve graph use cases, that may be another matter.

## BayesDB

December 8th, 2013

BayesDB (Alpha 0.1.0 Release)

From the webpage:

BayesDB, a Bayesian database table, lets users query the probable implications of their data as easily as a SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries.

BayesDB is suitable for analyzing complex, heterogeneous data tables with up to tens of thousands of rows and hundreds of variables. No preprocessing or parameter adjustment is required, though experts can override BayesDB’s default assumptions when appropriate.

BayesDB’s inferences are based in part on CrossCat, a new, nonparametric Bayesian machine learning method that automatically estimates the full joint distribution behind arbitrary data tables.
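
From the description, BQL reads like SQL with inference verbs bolted on. Something along these lines (illustrative only; consult the BayesDB documentation for the exact alpha syntax, and the table and column names here are invented):

```sql
-- Fill in missing values with a confidence threshold.
INFER salary FROM employees WITH CONFIDENCE 0.95;

-- Draw probable synthetic rows from the learned joint distribution.
SIMULATE age, income FROM employees TIMES 10;

-- Ask which columns appear predictively related.
ESTIMATE DEPENDENCE PROBABILITY OF age WITH income FROM employees;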

Now there’s an interesting idea!

Not sure if it is a good idea but it certainly is an interesting one.

## Recommender Systems Course from GroupLens

December 7th, 2013

Recommender Systems Course from GroupLens by Danny Bickson.

From the post:

I got the following course link from my colleague Tim Muss. The GroupLens research group (Univ. of Minnesota) has released a Coursera course about recommender systems. Joseph Konstan and Michael Ekstrand are lecturing. Any reader of my blog who has an elephant memory will recall I wrote about the Lenskit project already 2 years ago, when I interviewed Michael Ekstrand.

Would you agree that recommendation involves subject recognition?

At a minimum recognition of the subject to be recommended and the subject of a particular user’s preference.

I ask because the key to topic map “merging” isn’t ontological correctness but “correctness” in the eyes of a particular user.

What other standard would I use?

## Large-Scale Machine Learning and Graphs

December 7th, 2013

Large-Scale Machine Learning and Graphs by Carlos Guestrin.

The presentation starts with a history of the evolution of GraphLab, which is interesting in and of itself.

Carlos then goes beyond a history lesson and gives a glimpse of a very exciting future.

Such as: installing GraphLab with Python, using Python for local development, running the same Python with GraphLab in the cloud.

Thought that might catch your eye.

Something to remember when people talk about scaling graph analysis.

If you are interested in seeing one possible future of graph processing today, not some day, check out: GraphLab Notebook (Beta).

BTW, Carlos mentions a technique called “think like a vertex,” which involves distributing vertices across machines rather than splitting graphs on edges.

Seems to me that would work to scale the processing of topic maps by splitting topics as well. Once “merging” has occurred on different machines, then “merge” the relevant topics back together across machines.

## The Society of Mind

December 7th, 2013

The Society of Mind by Marvin Minsky.

From the Prologue:

This book tries to explain how minds work. How can intelligence emerge from nonintelligence? To answer that, we’ll show that you can build a mind from many little parts, each mindless by itself.

I’ll call Society of Mind this scheme in which each mind is made of many smaller processes. These we’ll call agents. Each mental agent by itself can only do some simple thing that needs no mind or thought at all. Yet when we join these agents in societies — in certain very special ways — this leads to true intelligence.

There’s nothing very technical in this book. It, too, is a society — of many small ideas. Each by itself is only common sense, yet when we join enough of them we can explain the strangest mysteries of mind. One trouble is that these ideas have lots of cross-connections. My explanations rarely go in neat, straight lines from start to end. I wish I could have lined them up so that you could climb straight to the top, by mental stair-steps, one by one. Instead they’re tied in tangled webs.

Perhaps the fault is actually mine, for failing to find a tidy base of neatly ordered principles. But I’m inclined to lay the blame upon the nature of the mind: much of its power seems to stem from just the messy ways its agents cross-connect. If so, that complication can’t be helped; it’s only what we must expect from evolution’s countless tricks.

What can we do when things are hard to describe? We start by sketching out the roughest shapes to serve as scaffolds for the rest; it doesn’t matter very much if some of those forms turn out partially wrong. Next, draw details to give these skeletons more lifelike flesh. Last, in the final filling-in, discard whichever first ideas no longer fit.

That’s what we do in real life, with puzzles that seem very hard. It’s much the same for shattered pots as for the cogs of great machines. Until you’ve seen some of the rest, you can’t make sense of any part.

All 270 essays in 30 chapters of Minsky’s 1988 book by the same name.

To be read critically.

It is dated but a good representative of a time in artificial intelligence.

I first saw this in Nat Torkington’s Five Short Links for 6 December 2013.

## Free GIS Data

December 7th, 2013

Free GIS Data by Robin Wilson.

Over 300 GIS data sets. As of 7 December 2013, last updated 6 December 2013.

A very wide ranging collection of “free” GIS data.

Robin recommends you check the licenses of individual data sets. The meaning of “free” varies from person to person.

If you discover “free” GIS resources not listed on Robin’s page, drop him a note.

I first saw this in Pete Warden’s Five Short Links for November 30, 2013.

## Think Tank Review

December 7th, 2013

Think Tank Review by Central Library of the General Secretariat of the EU Council.

The title could mean a number of things so when I saw it at Full Text Reports, I followed it.

From the first page:

Welcome to issue 8 of the Think Tank Review compiled by the Council Library.* It references papers published in October 2013. As usual, we provide the link to the full text and a short abstract.

The current Review and past issues can be downloaded from the Intranet of the General Secretariat of the Council or requested from the Library.

A couple of technical points: the Think Tank Review will soon be made available – together with other bibliographic and research products from the Library – on our informal blog at http://www.councillibrary.wordpress.com. A Beta version is already online for you to comment.

More broadly, in the next months we will be looking for ways to disseminate the contents of the Review in a more sophisticated way than the current – admittedly spartan – collection of links cast in a pdf format. We will look at issues such as indexing, full text search, long-term digital preservation, ease of retrieval and readability on various devices. Ideas from our small but faithful community of readers are welcome. You can reach us at central.library@consilium.europa.eu.

I’m not a policy wonk so scanning the titles didn’t excite me but it might you or (more importantly) one of your clients.

It seemed like an odd enough resource that you may not encounter it by chance.

## Analysis of PubMed search results using R

December 6th, 2013

Analysis of PubMed search results using R by Pilar Cacheiro.

From the post:

Looking for information about meta-analysis in R (subject for an upcoming post as it has become a popular practice to analyze data from different Genome Wide Association studies) I came across this tutorial from The R User Conference 2013 – I couldn’t make it this time, even when it was held so close, maybe Los Angeles next year…

Back to the topic at hand, that is how I found out about the RISmed package which is meant to retrieve information from PubMed. It looked really interesting because, as you may imagine, this is one of the most used resources in my daily routine.

Its use is quite straightforward. First, you define the query and download data from the database (be careful about your IP being blocked from accessing NCBI in the case of large jobs!). Then, you might use the information to look for trends on a topic of interest, extract specific information from abstracts, get descriptives, and so on.
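
RISmed works by calling NCBI’s E-utilities under the hood, and the same esearch endpoint can be reached directly. A small Python sketch that only builds the query URL, without making a request (the search term and retmax value are illustrative):

```python
from urllib.parse import urlencode

# NCBI's esearch endpoint, which RISmed-style queries resolve to.
base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "pubmed",                                # search PubMed
    "term": "meta-analysis[Title] AND 2013[PDAT]", # query string
    "retmax": 100,                                 # max IDs to return
}

# Build the URL only; actually fetching it is subject to NCBI's
# rate limits (the IP-blocking caveat above).
url = base + "?" + urlencode(params)
print(url)
```

This is also where the IP-blocking warning bites: NCBI rate-limits the E-utilities, so large jobs should be throttled.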

Pilar does a great job introducing RISmed and pointing to additional sources for more examples and discussion of the package.

Meta-analysis is great but you could also be selling the results of your queries to PubMed.

After all, they would be logging your IP address, not that of your client.

Some people prefer more anonymity than others and are willing to pay for that privilege.

## DARPA’s online games crowdsource software security

December 6th, 2013

DARPA’s online games crowdsource software security by Kevin McCaney.

From the post:

Flaws in commercial software can cause serious problems if cyberattackers take advantage of them with their increasingly sophisticated bag of tricks. The Defense Advanced Research Projects Agency wants to see if it can speed up discovery of those flaws by making a game of it. Several games, in fact.

DARPA’s Crowd Sourced Formal Verification (CSFV) program has just launched its Verigames portal, which hosts five free online games designed to mimic the formal software verification process traditionally used to look for software bugs.

Verification, both dynamic and static, has proved to be the best way to determine if software is free of flaws, but it requires software engineers to perform “mathematical theorem-proving techniques” that can be time-consuming, costly and unable to scale to the size of some of today’s commercial software, according to DARPA. With Verigames, the agency is testing whether untrained (and unpaid) users can verify the integrity of software more quickly and less expensively.

“We’re seeing if we can take really hard math problems and map them onto interesting, attractive puzzle games that online players will solve for fun,” Drew Dean, DARPA program manager, said in announcing the portal launch. “By leveraging players’ intelligence and ingenuity on a broad scale, we hope to reduce security analysts’ workloads and fundamentally improve the availability of formal verification.”

If program verification is possible with online games, I don’t know of any principled reason why topic map authoring should not be possible.

Maybe fill-in-the-blank topic map authoring is just a poor authoring technique for topic maps.

Imagine gamifying data streams to be like Missile Command.

Can you even count the number of hours that you played Missile Command?

Now consider the impact of a topic map authoring interface that addictive.

Particularly if the user didn’t know they were doing useful work.

## A New Source of Revenue for Data Scientists: Selling Data

December 6th, 2013

A New Source of Revenue for Data Scientists: Selling Data by Vincent Granville.

From the post:

What kind of data is salable? How can data scientists independently make money by selling data that is automatically generated: raw data, research data (presented as customized reports), or predictions. In short, using an automated data generation / gathering or prediction system, working from home with no boss and no employee, and possibly no direct interactions with clients. An alternate career path that many of us would enjoy!

Vincent gives a number of examples of companies selling data right now, some possible data sources, startup ideas and pointers to articles on data scientists.

Vincent makes me think there are at least three ways to sell topic maps:

1. Sell people on using topic maps so they can produce high quality data through the use of topic maps.
2. Sell people on hiring you to construct a topic map system so they can produce high quality data.
3. Sell people high quality data because you are using a topic map.

Not everyone who likes filet mignon (#3) wants to raise the cow (#1) and/or butcher the cow (#2).

It is more expensive to buy filet mignon, but it also lowers the odds of stepping in cow manure and/or blood.

What data would you buy?

## Instructions for deploying an Elasticsearch Cluster with Titan

December 6th, 2013

Instructions for deploying an Elasticsearch Cluster with Titan by Benjamin Bengfort.

From the post:

Elasticsearch is an open source distributed real-time search engine for the cloud. It allows you to deploy a scalable, auto-discovered cluster of nodes, and as search capacity grows, you simply need to add more nodes and the cluster will reorganize itself. Titan, a distributed graph engine by Aurelius supports elasticsearch as an option to index your vertices for fast lookup and retrieval. By default, Titan supports elasticsearch running in the same JVM and storing data locally on the client, which is fine for embedded mode. However, once your Titan cluster starts growing, you have to respond by growing an elasticsearch cluster side by side with the graph engine.

This tutorial shows how to quickly get an elasticsearch cluster up and running on EC2, then configure Titan to use it for indexing. It assumes you already have an EC2/Titan cluster deployed. Note that these instructions were for a particular deployment, so please forward any questions about specifics in the comments!
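
For context, pointing Titan at a remote cluster comes down to a few configuration properties. A hypothetical fragment (the hostnames are invented, and the key names follow the Titan documentation of that era, so verify them against your version):

```properties
# Storage backend for the graph data itself.
storage.backend=cassandra
storage.hostname=10.0.0.5

# Index named "search": use a remote elasticsearch cluster
# instead of the embedded in-JVM instance.
storage.index.search.backend=elasticsearch
storage.index.search.hostname=10.0.0.10,10.0.0.11
storage.index.search.client-only=true
```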

A great tutorial. Short, on point and references other resources.

Enjoy!

## Glitch is Dead, Long Live Glitch!

December 6th, 2013

From the website:

The collaborative, web-based, massively multiplayer game Glitch began its initial private testing in 2009, opened to the public in 2010, and was shut down in 2012. It was played by more than 150,000 people and was widely hailed for its original and highly creative visual style.

The entire library of art assets from the game has been made freely available, dedicated to the public domain. Code from the game client is included to help developers work with the assets. All of it can be downloaded and used by anyone, for any purpose. (But: use it for good.)

Tiny Speck, Inc., the game’s developer, has relinquished its ownership of copyright over these 10,000+ assets in the hopes that they help others in their creative endeavours and build on Glitch’s legacy of simple fun, creativity and an appreciation for the preposterous. Go and make beautiful things.

I never played Glitch but the art could be useful.

Or perhaps even the online game code if you are looking to create a topic map gaming site.

Read the release for the details of the licensing.

I first saw this in Nat Torkington’s Four short links: 22 November 2013.

## Whoosh

December 6th, 2013

Whoosh: Python Search Library

From the webpage:

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly.

Some of Whoosh’s features include:

• Pythonic API.
• Pure-Python. No compilation or binary packages needed, no mysterious crashes.
• Fielded indexing and search.
• Fast indexing and retrieval — faster than any other pure-Python search solution I know of. See Benchmarks.
• Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
• Powerful query language.
• Production-quality pure Python spell-checker (as far as I know, the only one).

Whoosh might be useful in the following circumstances:

• Anywhere a pure-Python solution is desirable to avoid having to build/compile native libraries (or force users to build/compile them).
• As a research platform (at least for programmers that find Python easier to read and work with than Java).
• When an easy-to-use Pythonic interface is more important to you than raw speed.
• If your application can make good use of one deeply integrated search/lookup solution you can rely on just being there rather than having two different search solutions (a simple/slow/homegrown one integrated, an indexed/fast/external binary dependency one as an option).

Whoosh was created and is maintained by Matt Chaput. It was originally created for use in the online help system of Side Effects Software’s 3D animation software Houdini. Side Effects Software Inc. graciously agreed to open-source the code.

One of the reasons to use Whoosh made me laugh:

When an easy-to-use Pythonic interface is more important to you than raw speed.

When is raw speed less important than anything?

Seriously, experimentation with search promises to be a fruitful area for the foreseeable future.

I first saw this in Nat Torkington’s Four short links: 21 November 2013.

## TextBlob: Simplified Text Processing

December 5th, 2013

TextBlob: Simplified Text Processing

From the webpage:

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

….

TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.

Features

• Noun phrase extraction
• Part-of-speech tagging
• Sentiment analysis
• Classification (Naive Bayes, Decision Tree)
• Language translation and detection powered by Google Translate
• Tokenization (splitting text into words and sentences)
• Word and phrase frequencies
• Parsing
• n-grams
• Word inflection (pluralization and singularization) and lemmatization
• Spelling correction
• JSON serialization
• Add new models or languages through extensions
• WordNet integration

Knowing that TextBlob plays well with NLTK is a big plus!

## InfluxDB

December 5th, 2013

InfluxDB

From the webpage:

An open-source, distributed, time series, events, and metrics database with no external dependencies.

Time Series

Everything in InfluxDB is a time series that you can perform standard functions on like min, max, sum, count, mean, median, percentiles, and more.

Metrics

Scalable metrics that you can collect on any interval, computing rollups on the fly later. Track 100 metrics or 1 million, InfluxDB scales horizontally.

Events

InfluxDB’s data model supports arbitrary event data. Just write in a hash of associated data and count events, uniques, or grouped columns on the fly later.

The overview page gives some greater detail:

When we built Errplane, we wanted the data model to be flexible enough to store events like exceptions along with more traditional metrics like response times and server stats. At the same time we noticed that other companies were also building custom time series APIs on top of a database for analytics and metrics. Depending on the requirements these APIs would be built on top of a regular SQL database, Redis, HBase, or Cassandra.

We thought the community might benefit from the work we’d already done with our scalable backend. We wanted something that had the HTTP API built in that would scale out to billions of metrics or events. We also wanted something that would make it simple to query for downsampled data, percentiles, and other aggregates at scale. Our hope is that once there’s a standard API, the community will be able to build useful tooling around it for data collection, visualization, and analysis.
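
The HTTP API mentioned above took writes as JSON series. A sketch that only builds such a payload (the series name and values are invented, and the endpoint shape is from the 0.x-era docs, so check your release; no network call is made):

```python
import json

# InfluxDB 0.x accepted writes as a JSON array of series posted to
# /db/<database>/series. Each series names its columns once and then
# lists rows of points.
payload = [{
    "name": "response_times",
    "columns": ["time", "value"],
    "points": [[1386201600, 123], [1386201660, 87]],
}]

body = json.dumps(payload)
print("response_times" in body)  # True
```

The same shape works for the arbitrary event data the overview describes: add columns to the hash and count or group them at query time.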

While phrased as tracking server stats and events, I suspect InfluxDB would be just as happy tracking other types of stats or events.

I don’t know, say like the “I’m alive” messages your cellphone sends to the local towers for instance.

I first saw this in Nat Torkington’s Four short links: 5 November 2013.

## SICP in Clojure – Update

December 5th, 2013

In my post, SICP in Clojure, I incorrectly identified Steve Deobald as the maintainer of this project.

The original maintainer of the project placed a link on the site saying that Steve is the maintainer.

That is not correct.

Apologies to Steve and apologies to my readers who were hopeful this project would be going forward.

Any thoughts on moving this project forward?

I think the idea is a very sound one.

PS: Unlike many media outlets, I think corrections should be as prominent as the original mistakes.

## On Self-Licking Ice Cream Cones

December 5th, 2013

On Self-Licking Ice Cream Cones by Pete Worden (1992).

Ben Brody in The definitive glossary of modern US military slang quotes the following definition for a Self-Licking Ice Cream Cone:

A military doctrine or political process that appears to exist in order to justify its own existence, often producing irrelevant indicators of its own success. For example, continually releasing figures on the amount of Taliban weapons seized, as if there were a finite supply of such weapons. While seizing the weapons, soldiers raid Afghan villages, enraging the residents and legitimizing the Taliban’s cause.

The Wikipedia entry Self-licking ice cream cone reports the phrase was first used by Pete Worden in “On Self-Licking Ice Cream Cones” in 1992 to describe the NASA bureaucracy.

The keywords for the document are: Ice Cream Cones; Pork; NASA; Mafia; Congress.

Birds of a feather I would say.

Worden isolates several problems:

Problems, National, The Budget Process

This unfortunate train of events has resulted in a NASA which, more than any other agency, believes it works only for the appropriations committees. The senior staff of those committees, who have little interest in science or space, effectively run NASA. NASA senior officials’ noses are usually found at waist level near those committee staffers.

Problems, Closer to Home, NASA

“The Self-Licking Ice Cream Cone”

Since NASA effectively works for the most porkish part of Congress, it is not surprising that their programs are designed to maximize and perpetuate jobs programs in key Congressional districts. The Space Shuttle-Space Station is an outrageous example. Almost two-thirds of NASA’s budget is tied up in this self-licking program. The Shuttle is an unbelievably costly way to get to space at $1 billion a pop. The Space Station is a silly design. Yet, this Station is designed so it can only be built by the Shuttle and the Shuttle is the only way to construct the Station….

“Inmates Running the Asylum”

NASA’s vaunted “peer review” process is not a positive factor, but an example of the “pork” mentality within the scientific community. It results in needlessly complex programs whose primary objective is not putting instruments in orbit, but maximizing the number of constituencies and investigators, thereby maximizing the political invulnerability of the program….

“Mafia Tactics”

…The EOS is a case in point. About a year ago, encouraged by criticism from some quarters of Congress and in the press, some scientists and satellite contractors began proposing small, cheap, near-term alternatives to the EOS “battlestars.” Senior NASA officials conducted, with impunity, an unbelievable campaign of threats against these critics. Members of the White House advisory committees were told they would not get NASA funding if they continued to probe the program….

“Shoot the Sick Horses, and their Trainers”

It is outrageous that the Hubble disaster resulted in no repercussions. All we hear is that some un-named technician, no longer working for the contractor, made a mistake in the early 1980s. Even in the Defense Department, current officials would have lost their jobs over allowing such an untested and expensive system to be launched.

Compare Worden’s complaints to the security apparatus represented by the NSA and its kin.

Have you heard of any repercussions for any of the security failures and/or outrages?

Is there any doubt that the security apparatus exists solely to perpetuate the security apparatus?

By definition the NSA is a Self-Licking Ice Cream Cone.

Time to find a trash can.

Hubble: The Hubble Space Telescope Optical Systems Failure Report (pdf). Long before all the dazzling images from Hubble, it was virtually orbiting space junk for several years.

## U.S. Military Slang

December 5th, 2013

The definitive glossary of modern US military slang by Ben Brody.

From the post:

It’s painful for US soldiers to hear discussions and watch movies about modern wars when the dialogue is full of obsolete slang, like “chopper” and “GI.”

Slang changes with the times, and the military’s is no different. Soldiers fighting the wars in Iraq and Afghanistan have developed an expansive new military vocabulary, taking elements from popular culture as well as the doublespeak of the military industrial complex.

The US military drawdown in Afghanistan — which is underway but still awaiting the outcome of a proposed bilateral security agreement — is often referred to by soldiers as “the retrograde,” which is an old military euphemism for retreat. Of course the US military never “retreats” — rather it conducts a “tactical retrograde.”

This list is by no means exhaustive, and some of the terms originated prior to the wars in Afghanistan and Iraq. But these terms are critical to speaking the current language of soldiers, and understanding it when they speak to others. Please leave anything you think should be included in the comments.

Useful for documents that contain U.S. military slang, such as the Afghanistan War Diary.

As Ben notes at the outset, language changes over time so validate any vocabulary against your document/data set.

## Geoff (update)

December 5th, 2013

Geoff

My prior post on Geoff pointed to a page about Geoff that appears to no longer exist. I have updated that page to point to the new location.

The current description reads:

Geoff is a text-based interchange format for Neo4j graph data that should be instantly readable to anyone familiar with Cypher, on which its syntax is based.
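
Based on that description, a Geoff fragment presumably looks something like this Cypher-flavored sketch (hypothetical node names and properties; check the Geoff documentation for the exact syntax):

```
(alice {"name": "Alice"})
(bob {"name": "Bob"})
(alice)-[:KNOWS]->(bob)
```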

## N*SQL Matters @Barcelona, Spain Slides!

December 5th, 2013

N*SQL Matters @Barcelona, Spain Slides!

Slides for today but videos are said to be coming soon!

By Title:

• API Analytics with Redis and Bigquery, Javier Ramirez view the slides
• ArangoDB – a different approach to NoSQL, Lucas Dohmen view the slides
• Big Memory Scale-in vs. Scale-out, Niklas Bjorkman view the slides
• Bringing NoSQL to your mobile!, Patrick Heneise view the slides
• Building information systems using rapid application development methods, Michel Müller view the slides
• A call for sanity in NoSQL, Nathan Marz view the slides
• Cicerone: A Real-Time social venue recommender, Daniel Villatoro view the slides
• Database History from Codd to Brewer and Beyond, Doug Turnbull view the slides
• DynamoDB – on-demand NoSQL scaling as a service, Steffen Krause view the slides
• Getting down and dirty with Elasticsearch, Clinton Gormley view the slides
• Harnessing the Internet of Things with NoSQL, Michael Hausenblas view the slides
• How to survive in a BASE world, Uwe Friedrichsen view the slides
• Introduction to Graph Databases, Stefan Armbruster view the slides
• A Journey through the MongoDB Internals, Christian Kvalheim view the slides
• Killing pigs and saving Danish bacon with Riak, Joel Jacobsen view the slides
• Lambdoop, a framework for easy development of Big Data applications, Rubén Casado view the slides
• NoSQL Infrastructure, David Mytton view the slides
• Realtime visitor analysis with Couchbase and Elasticsearch, Jeroen Reijn view the slides
• SAMOA: A Platform for Mining Big Data Streams, Gianmarco De Francisci Morales view the slides
• Splout SQL: Web-latency SQL View for Hadoop, Iván de Prado view the slides
• Sprayer: low latency, reliable multichannel messaging for Telefonica Digital, Pablo Enfedaque and Javier Arias

By Presenter:

• Armbruster, Stefan – Introduction to Graph Databases view the slides
• Bjorkman, Niklas – Big Memory – Scale-in vs. Scale-out view the slides
• Casado, Rubén – Lambdoop, a framework for easy development of Big Data applications view the slides
• Dohmen, Lucas – ArangoDB – a different approach to NoSQL view the slides
• Enfedaque, Pablo and Javier Arias – Sprayer: low latency, reliable multichannel messaging for Telefonica Digital view the slides
• Friedrichsen, Uwe – How to survive in a BASE world view the slides
• Gormley, Clinton – Getting down and dirty with Elasticsearch view the slides
• Hausenblas, Michael – Harnessing the Internet of Things with NoSQL view the slides
• Heneise, Patrick – Bringing NoSQL to your mobile! view the slides
• Jacobsen, Joel – Killing pigs and saving Danish bacon with Riak view the slides
• Krause, Steffen – DynamoDB – on-demand NoSQL scaling as a service view the slides
• Kvalheim, Christian – A Journey through the MongoDB Internals view the slides
• Marz, Nathan – A call for sanity in NoSQL view the slides
• Morales, Gianmarco De Francisci – SAMOA: A Platform for Mining Big Data Streams view the slides
• Müller, Michel – Building information systems using rapid application development methods view the slides
• Mytton, David – NoSQL Infrastructure view the slides
• Prado, Iván de – Splout SQL: Web-latency SQL View for Hadoop view the slides
• Ramirez, Javier – API Analytics with Redis and Bigquery view the slides
• Reijn, Jeroen – Realtime visitor analysis with Couchbase and Elasticsearch view the slides
• Turnbull, Doug – Database History from Codd to Brewer and Beyond view the slides
• Villatoro, Daniel – Cicerone: A Real-Time social venue recommender view the slides

I will update these with the videos when they are posted.

Enjoy!

## Apache CouchDB Conf Vancouver Videos!

December 5th, 2013

Apache CouchDB Conf Vancouver Videos!

For your viewing pleasure.

By Title:

By Presenter:

Enjoy!

## Latest NSA Fire Storm

December 5th, 2013

Among the many places you can read about the latest Edward Snowden disclosures, “NSA tracking cellphone locations worldwide, Snowden documents show” by Barton Gellman and Ashkan Soltani, Washington Post, December 4, 2013, reads in part:

The National Security Agency is gathering nearly 5 billion records a day on the whereabouts of cellphones around the world, according to top-secret documents and interviews with U.S. intelligence officials, enabling the agency to track the movements of individuals — and map their relationships — in ways that would have been previously unimaginable.

The records feed a vast database that stores information about the locations of at least hundreds of millions of devices, according to the officials and the documents, which were provided by former NSA contractor Edward Snowden. New projects created to analyze that data have provided the intelligence community with what amounts to a mass surveillance tool.

And among the many denunciations of NSA activities, the American Library Association:

Nation’s Libraries Warn of NSA’s ‘Ravenous Hunger’ for Data

“We don’t want [library patrons] being surveilled because that will inhibit learning, and reading, and creativity,” said Alan Inouye of the American Library Association

- Andrea Germanos, staff writer

A quick search on Twitter turned up several hundred tweets, with updates in the double digits every 30 seconds or so.

The general tenor being surprise (which I don’t understand) and outrage (that I do understand).

What is missing from the discussion is what to do to correct the situation?

Quite recently we all learned that Minuteman missiles had their launch codes set to 00000000, despite direct presidential orders to the contrary.

I take that as evidence, along with the history of the NSA, that passing laws to regulate an agency that is without effective supervision is an exercise in futility.

Any assurance from the NSA that they are obeying U.S. laws is incapable of public verification and therefore should be presumed to be false.

The only effective means to limit NSA activities is to limit the NSA.

Let me repeat that: The only effective means to limit NSA activities is to limit the NSA.

We only have the NSA’s word that it has played an important role in protecting the U.S. from terrorists.

How can we test that tale?

My suggestion is that we defund the NSA for a period of not less than five years. No transfer of data, equipment or personnel. None.

If U.S.-based terrorism increases during those five years and proponents have a plausible plan for a new NSA, then we can reconsider.

If there is, as is likely, no increase in U.S.-based terrorism, we can avoid the expense of a rogue agency with its own agenda.

PS: I would not worry about the fates of NSA staff/contractors. There are a number of high tech surveillance opportunities in People’s Republic of China. Plus they have a form of government more suited to current NSA staff.

## Ekisto

December 5th, 2013

From the about:

Ekisto comes from ekistics, the science of human settlements.

Ekisto is an interactive visualization of three online communities: StackOverflow, Github and Friendfeed. Ekisto tries to imagine and map our online habitats using graph algorithms and the city as a metaphor.

A graph layout algorithm arranges users in 2D space based on their similarity. Cosine similarity is computed based on the users’ network (Friendfeed), collaborate, watch, fork and follow relationships (Github), or based on the tags of posts contributed by users (StackOverflow). The height of each user represents the normalized value of the user’s Pagerank (Github, Friendfeed) or their reputation points (StackOverflow).

A project by Alex Dragulescu.
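
The cosine-similarity step described above can be sketched in a few lines. This is a minimal illustration, not Ekisto's actual code, and the toy tag vectors are invented for the example:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors (dicts of feature -> count)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy StackOverflow-style users, each described by tag counts on their posts.
alice = {"python": 10, "pandas": 4, "sql": 1}
bob   = {"python": 8, "pandas": 2, "sql": 3}
carol = {"haskell": 7, "category-theory": 5}

print(cosine_similarity(alice, bob))    # high: overlapping tag profiles
print(cosine_similarity(alice, carol))  # zero: no tags in common
```

A graph layout algorithm can then place users with high pairwise similarity near each other in 2D, which is what produces Ekisto's city-like clusters.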

The three communities modeled are:

• stackoverflow.jul.2013
• github.mar.2012
• friendfeed.feb.2012

StackOverflow can be searched by name, but GitHub and FriendFeed only by userid, which makes following a particular user from one community to the next almost impossible.

I mention that because we all participate in many different communities and our roles and even status may vary widely from community to community.

Any one community view is an incomplete view of that person.

Beyond the need to map across communities, the other takeaway from Ekisto is the question of community formation.

That is, given the present snapshot of these communities, how did they evolve over time? Did particular people joining have a greater impact than others? Did some event trigger a rise in membership?

Deeply interesting work and a reason to learn more about ekistics.

I first saw this in a tweet by Neil Saunders.

## German Digital Library releases API

December 5th, 2013

German Digital Library releases API by Lieke Ploeger.

From the post:

Last month the German Digital Library (Deutsche Digitale Bibliothek – DDB) made a promising step forward toward further opening up their data by releasing its API (Application Programming Interface) to the public. This API provides access to all the metadata of the DDB released under a CC0 license, which is the predominant share. The release of this API opens up a wide range of possibilities for users to build applications, create combinations with other data or include the German digitised cultural heritage on other platforms. In the future, the DDB also plans to organize a programming competition for API applications as well as a series of workshops for developers.

The official press release.

Technical documentation on the API (German).

A good excuse for you to brush up on your German. Besides, not all of it is in German.
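
As a sketch of what a call against such an API might look like: the base URL is the DDB's, but the path and parameter names below are assumptions for illustration, not taken from the official documentation, so check the (German) technical docs for the real endpoint and auth scheme:

```python
from urllib.parse import urlencode

BASE_URL = "https://api.deutsche-digitale-bibliothek.de"

def build_search_url(query, api_key):
    """Construct a hypothetical metadata search URL (illustrative only:
    the /search path and parameter names are assumptions)."""
    params = urlencode({"query": query, "oauth_consumer_key": api_key})
    return f"{BASE_URL}/search?{params}"

print(build_search_url("Goethe", "YOUR_API_KEY"))
```

The point is simply that CC0-licensed metadata behind a plain HTTP interface is easy to combine with other data sources.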

## Free Language Lessons for Computers

December 4th, 2013

Free Language Lessons for Computers by Dave Orr.

From the post:

50,000 relations from Wikipedia. 100,000 feature vectors from YouTube videos. 1.8 million historical infoboxes. 40 million entities derived from webpages. 11 billion Freebase entities in 800 million web documents. 350 billion words’ worth from books analyzed for syntax.

These are all datasets that we’ve shared with researchers around the world over the last year from Google Research.

A great summary of the major data drops by Google Research over the past year. In many cases including pointers to additional information on the datasets.

One that I have seen before and that strikes me as particularly relevant to topic maps is:

Dictionaries for linking Text, Entities, and Ideas

What is it: We created a large database of pairs of 175 million strings associated with 7.5 million concepts, annotated with counts, which were mined from Wikipedia. The concepts in this case are Wikipedia articles, and the strings are anchor text spans that link to the concepts in question.

Where can I find it: http://nlp.stanford.edu/pubs/crosswikis-data.tar.bz2

I want to know more: A description of the data, several examples, and ideas for uses for it can be found in a blog post or in the associated paper.

For most purposes, you would need far less than the full set of 7.5 million concepts. Imagine having the relevant concepts for a domain that was being automatically “tagged” as you composed prose about it.

Certainly less error-prone than marking concepts by hand!
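
As a toy illustration of how such a string-to-concept dictionary could drive automatic tagging (the dictionary entries here are invented for the example; the real dataset pairs 175 million strings with 7.5 million concepts):

```python
# Toy version of a string -> concept dictionary mined from anchor text.
# Counts record how often each string linked to each Wikipedia concept.
anchor_dict = {
    "big apple": {"New York City": 120, "Apple Inc.": 3},
    "nyc": {"New York City": 450},
    "python": {"Python (programming language)": 300, "Pythonidae": 40},
}

def tag_text(text, dictionary):
    """Return the most likely concept for each known string found in the text."""
    tags = {}
    lowered = text.lower()
    for phrase, concepts in dictionary.items():
        if phrase in lowered:
            # Pick the concept this phrase most often linked to.
            tags[phrase] = max(concepts, key=concepts.get)
    return tags

print(tag_text("I write Python code in the Big Apple.", anchor_dict))
```

A domain-specific slice of the dictionary, applied this way as you type, is the "automatic tagging" scenario above.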

## MusicGraph

December 4th, 2013

From the post:

Music data company Senzari has launched MusicGraph, a new service for discovering music by searching through a graph of over a billion music-related data points.

MusicGraph includes a consumer-facing version and an API that can be used for commercial purposes. Senzari built the graph while working on the recommendation engine for its own streaming service, which has been rebranded as Wahwah.

Interestingly, MusicGraph is launching first on Firefox OS before coming to iOS, Android and Windows Phone in “the coming weeks.”

You know how much I try to avoid “practical” applications but when I saw aureliusgraphs tweet this as using the Titan database, I just had to mention it.

I think this announcement underlines something a comment said recently about promoting topic maps for what they do, not because they are topic maps.

Here, graphs are being promoted as the source of a great user experience, not because they are fun, powerful, etc. (all of which is also true).

## Homotopy Type Theory

December 4th, 2013

Homotopy Type Theory by Robert Harper. (Course with video lectures, notes, etc.)

Synopsis:

This is a graduate research seminar on Homotopy Type Theory (HoTT), a recent enrichment of Intuitionistic Type Theory (ITT) to include "higher-dimensional" types. The dimensionality of a type refers to the structure of its paths, the constructive witnesses to the equality of pairs of elements of a type, which themselves form a type, the identity type. In general a type is infinite dimensional in the sense that it exhibits non-trivial structure at all dimensions: it has elements, paths between elements, paths between paths, and so on to all finite levels. Moreover, the paths at each level exhibit the algebraic structure of a (higher) groupoid, meaning that there is always the "null path" witnessing reflexivity, the "inverse" path witnessing symmetry, and the "concatenation" of paths witnessing transitivity such that group-like laws hold "up to higher homotopy". Specifically, there are higher-dimensional paths witnessing the associative, unital, and inverse laws for these operations. Altogether this means that a type is a weak ∞-groupoid.

The significance of the higher-dimensional structure of types lies in the concept of a type-indexed family of types. Such families exhibit the structure of a fibration, which means that a path between two indices "lifts" to a transport mapping between the corresponding instances of the family that is, in fact, an equivalence. Thinking of paths as constructive witnesses for equality, this amounts to saying that equal indices give rise to equivalent types, and hence, by univalence, equal elements of the universe in which the family is valued. Thus, for example, if we think of the interval I as a type with two endpoints connected by a path, then an I-indexed family of types must assign equivalent types to the endpoints. In contrast the type B of booleans consists of two disconnected points, so that a B-indexed family of types may assign unrelated types to the two points of B. Similarly, mappings from I into another type A must assign connected points in A to the endpoints of the interval, whereas mappings from B into A are free to assign arbitrary points of A to the two booleans. These preservation principles are central to the structure of HoTT.

In many cases the path structure of a type becomes trivial beyond a certain dimension, called the level of the type. By convention the levels start at -2 and continue through -1, 0, 1, 2, and so on indefinitely. At the lowest, -2, level, the path structure of a type is degenerate in that there is an element to which all other elements are equal; such a type is said to be contractible, and is essentially a singleton. At the next higher level, -1, the type of paths between any two elements is contractible (level -2), which means that any two elements are equal, if there are any elements at all; such a type is a sub-singleton or h-proposition. At the next level, 0, the type of paths between paths between elements is contractible, so that any two elements are equal "in at most one way"; such a type is a set whose types of paths (equality relations) are all h-prop’s. Continuing in this way, types of level 1 are groupoids, those of level 2 are 2-groupoids, and so on for all finite levels.

ITT is capable of expressing only sets, which are types of level 0. Such types may have elements, and two elements may be considered equal in at most one way. A large swath of (constructive) mathematics may be formulated using only sets, and hence is amenable to representation in ITT. Computing applications, among others, require more than just sets. For example, it is often necessary to suppress distinctions among elements of a type so as to avoid over-specification; this is called proof irrelevance. Traditionally ITT has been enriched with an ad hoc treatment of proof irrelevance by introducing a universe of "propositions" with no computational content. In HoTT such propositions are types of level -1, requiring no special treatment or distinction. Such types arise by propositional truncation of a type to render degenerate the path structure of a type above level -1, ensuring that any two elements are equal in the sense of having a path between them.

Propositional truncation is just one example of a higher inductive type, one that is defined by specifying generators not only for its elements, but also for its higher-dimensional paths. The propositional truncation of a type is one that includes all of the elements of the type, and, in addition, a path between any two elements, rendering them equal. It is a limiting case of a quotient type in which only certain paths between elements are introduced, according to whether they are deemed to be related. Higher inductive types also permit the representation of higher-dimensional objects, such as the spheres of arbitrary dimension, as types, simply by specifying their "connectivity" properties. For example, the topological circle consists of a base point and a path starting and ending at that point, and the topological disk may be thought of as two half circles that are connected by a higher path that "fills in" the interior of the circle. Because of their higher path structure, such types are not sets, and neither are constructions such as the product of two circles.

The univalence axiom implies that an equivalence between types (an "isomorphism up to isomorphism") determines a path in a universe containing such types. Since two types can be equivalent in many ways (for example, there can be distinct bijections between two sets), univalence gives rise to types that are not sets, but rather are of a higher level, or dimension. The univalence axiom is mathematically efficient because it allows us to treat equivalent types as equal, and hence interchangeable in all contexts. In informal settings such identifications are often made by convention; in formal homotopy type theory such identifications are true equations.
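
The groupoid structure of paths described in the synopsis (reflexivity, inverses, concatenation, and transport along a path) can be glimpsed even in the identity type of ordinary dependent type theory. A minimal sketch in Lean 4, where paths between `a` and `b` are simply terms of the equality type `a = b`:

```lean
-- The "null path" witnessing reflexivity:
example (A : Type) (a : A) : a = a := rfl

-- The "inverse" path witnessing symmetry:
example (A : Type) (a b : A) (p : a = b) : b = a := p.symm

-- The "concatenation" of paths witnessing transitivity:
example (A : Type) (a b c : A) (p : a = b) (q : b = c) : a = c :=
  p.trans q

-- Transport: a path between two indices lifts to a mapping between
-- the corresponding instances of a type-indexed family.
example (A : Type) (P : A → Type) (a b : A) (p : a = b) : P a → P b :=
  fun x => p ▸ x
```

What HoTT adds is that these paths carry non-trivial structure at all higher dimensions, rather than collapsing at level 0 as they do for sets.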

If you think data types are semantic primitives with universal meaning/understanding, feel free to ignore this posting.

Data types can be usefully treated “as though” they are semantic primitives, but mistaking convenience for truth can be expensive.

The never ending cycle of enterprise level ETL for example. Even when it ends well it is expensive.

And there are all the cases where ETL or data integration don’t end well.

Homotopy Type Theory may not be the answer to those problems, but our current practices are known not to work.

Why not bet on an uncertain success versus the certainty of expense and near-certainty of failure?