You Are Listening to The New York Times

May 19th, 2013

You Are Listening to The New York Times by Hugh Mandeville.

From the post:

When the San Francisco Giants won the 2010 World Series, the post-victory celebrations got out of control. Revelers smashed windows, got into fistfights and started fires. A Muni bus and the metaverse were both set alight.

To track the chaos, Eric Eberhardt, a techie from the Bay Area, tuned in to a San Francisco police scanner station on soma.fm — while also listening to music. Something about the combination of ambient music and live police chatter clicked for Eberhardt, and youarelistening.to was born.

Eberhardt’s site is a mash-up of three APIs: police scanner audio from RadioReference.com, ambient music from SoundCloud and images from Flickr. The outcome is like a real-time soundtrack to Michael Mann’s movie “Heat.” My colleague Chase Davis, interactive news assistant editor, describes it as “‘Hearts of Space’ meets ‘The Wire.’”

(…)

My explorations inspired me to create a page on youarelistening.to that takes New York Times headlines from the Times Newswire API and reads them aloud using TTS-API.com’s text-to-speech API. I also created a page that reads trending tweets, using Twitter’s Search API.

Definitely has potential to enrich a user experience.

Imagine studying early 21st century history and when George W. Bush or Dick Cheney show up on your ereader, War Pigs plays in the background.

Trivia: Did you know that War Pigs was one of 165 songs that Clear Channel suggested could be inappropriate to play after 9/11? 2001 Clear Channel Memorandum.

Cat Stevens with Peace Train also made the list.

Terrorism we can survive. Those trying to protect us, I’m not so sure.

6 Golden Rules to Successful Dashboard Design

May 19th, 2013

6 Golden Rules to Successful Dashboard Design

From the article:

Dashboards are often created on-the-fly with data being added simply because there is some white space not being used. Different people in the company ask for different data to be displayed and soon the dashboard becomes hard to read and full of meaningless non-related information. When this happens, the dashboard is no longer useful.

This article discusses the steps that need to be taken during the design phase in order to create a useful and actionable dashboard.

Topic maps can be expressed as dashboards as well as other types of interfaces.

Whatever your interface, it needs to be driven by good design principles.

AnormCypher 0.4.1 released!

May 19th, 2013

AnormCypher 0.4.1 released! by Wes Freeman.

From the post:

Thanks to Pieter, AnormCypher 0.4.1 supports versions earlier than Neo4j 1.9 (I didn’t realize this was an issue).

AnormCypher is a Cypher-oriented Scala library for Neo4j Server (REST). The goal is to provide a great API for calling arbitrary Cypher and parsing out results, with an API inspired by Anorm from the Play! Framework.

If you are working with a Neo4j Server this may be of interest.

Scala Resources & Community links for the newcomer

May 19th, 2013

Scala Resources & Community links for the newcomer by Raúl Raja.

From the post:

During the last couple of months I have been asked a few times among colleagues and friends hot to get started with Scala. People come to Scala from diverse backgrounds such as… – Java folks looking for a better Java or just tired of waiting for Java features other modern languages such as C# already offer. – Ruby, PHP, and programmers that come from a scripting background looking for type safety. – People trying to bridge the best of both OOP and Functional paradigms. Scala is a vast language full of features with a very technical community. Don’t let your first step discourage you as you don’t need to know everything about Scala to become productive quickly. People in the mailing list will often talk about some crazy shit you don’t need to know just yet. Monads, Monoids, Combinators, Macros and things you may not even know how to pronounce,… Seriously guys as you start to learn about it it’s gonna blow your mind. It’s gonna take some time to digest all the info but it sure it’s worth it. Here is a few resources / steps may help you get started focused on its community and not so much on the technical details of downloading and running your first scala “hello world”

More than a collection to bookmark for “someday,” this is a collection of resources to start following today.

I haven’t looked at all the references but from the ones I checked, I don’t think you will be disappointed.

Apache Drill

May 19th, 2013

Michael Hausenblas at NoSQL Matters 2013 does a great lecture on Apache Drill.

Slides.

Google’s Dremel Paper

Projects “beta” for Apache Drill by second quarter and GA by end of year.

Apache Drill User.

From the rationale:

There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel.

How do you handle ad hoc exploration of data sets as part of planning a topic map?

Being able to “test” merging against data prior to implementation sounds like a good idea.

Visualizing your LinkedIn graph using Gephi (Parts 1 & 2)

May 19th, 2013

Visualizing your LinkedIn graph using Gephi – Part 1

&

Visualizing your LinkedIn graph using Gephi – Part 2

by Thomas Cabrol.

From part 1:

Graph analysis becomes a key component of data science. A lot of things can be modeled as graphs, but social networks are really one of the most obvious examples.

In this post, I am going to show how one could visualize its own LinkedIn graph, using the LinkedIn API and Gephi, a very nice software for working on this type of data. If you don’t have it yet, just go to http://gephi.org/ and download it now !

My objective is to simply look at my connections (the “nodes” or “vertices” of the graph), see how they relate to each other (the “edges”) and find clusters of strongly connected users (“communities”). This is somewhat emulating what is available already in the InMaps data product, but, hey, this is cool to do it by ourselves, no ?

The first thing to do for running this graph analysis is to be able to query LinkedIn via its API. You really don’t want to get the data by hand… The API uses the oauth authentification protocol, which will let an application make queries on behalf of a user. So go to https://www.linkedin.com/secure/developer and register a new application. Fill the form as required, and in the OAuth part, use this redirect URL for instance:

Great introduction to Gephi!

As a bonus, reinforces the lesson that ETL isn’t required to re-use data.

ETL may be required in some cases but in a world of data APIs those are getting fewer and fewer.

Think of it this way: Non-ETL data access means someone else is paying for maintenance, backups, hardware, etc.

How much of your IT budget is supporting duplicated data?

Google Map Redesign [Brain Buds]

May 19th, 2013

Google Map Redesign by Caitlin Dempsey.

From the post:

Googles Maps is preparing to debut its newly revamped Google Maps. Terming it “smart recommendations” the new functionality of Google Maps is intended to be more interactive and custom tailored to the specific user. The more you use the map to search for locations, favorite items by starring them, and write location reviews, the more unique the map becomes. Clicking a specific business or feature will result in the map features adjusting to show roads and locations related to that place.

(…)

Previewing the new Google Maps is currently only available by invite at the moment. You can request your invite via the Preview page.

Technology could be exposing you to a broader view of the world, perhaps even as other see it.

Instead:

  • Apple brought us ear buds that wall us off from ambient sound and others.
  • Apple also brought us eye buds (iPhones) that wall us off from our visual surroundings.
  • Google is building brain buds to wrap you in a customized cocoon of content.

Ironic if you remember the original MacIntosh commercial:

Timothy Leary today would say:

Turn on, tune in, unplug.

UNESCO Publications and Data (Open Access)

May 19th, 2013

UNESCO to make its publications available free of charge as part of a new Open Access policy

From the post:

The United Nations Education Scientific and Cultural Organisation (UNESCO) has announced that it is making available to the public free of charge its digital publications and data. This comes after UNESCO has adopted an Open Access Policy, becoming the first agency within the United Nations to do so.

The new policy implies that anyone can freely download, translate , adapt, and distribute UNESCO’s publications and data. The policy also states that from July 2013, hundreds of downloadable digital UNESCO publications will be available to users through a new Open Access Repository with a multilingual interface. The policy seeks also to apply retroactively to works that have been published.

There’s a treasure trove of information for mapping, say against the New York Times historical archives.

If presidential libraries weren’t concerned with helping former administration officials avoid accountability, digitizing presidential libraries for complete access, would be another great treasure trove.

Got Balls?

May 19th, 2013

IED Trends: Turning Tennis Balls Into Bombs

From the post:

Terrorists are relentlessly evolving tactics and techniques for IEDs (Improvised Explosive Devices), and analyzing reporting on IEDs can provide insight complementary to HUMINT on emerging militant methods. Preparing for an upcoming webcast with our friends at Terrogence, we found incidents using sports balls, particularly tennis balls and cricket balls, more frequently appearing as a delivery vehicle for explosives.

When we break these incidents from the last four months down by location, the city of Karachi in southern Pakistan stands out as a hotbed. There is also evidence that this tactic is being embraced around the globe as you can see sports balls fashioned into bombs found from Longview, Washington in the United States to Varanasi in India.

We can use Recorded Future’s Web Intelligence platform to plot out the locations where incidents have recently occurred as well as the frequency and timing.

Interesting but the military, by their stated doctrines, should be providing this information in theater specific IED briefings.

See for example: FMI 3-34.119/MCIP 3-17.01 IMPROVISED EXPLOSIVE DEVICE DEFEAT

On boobytraps (the old name) in general, see: FM 5-31 Boobytraps (1965), which includes pressure cookers (pp. 73-74) and rubber balls (p. 87).

Topic maps offer over rapid dissemination of “new” forms and checklists for where they may be found. (As opposed to static publications.)

Interesting that FM 5-31 reports an electric iron as boobytrap, but an electric iron is more likely to show up on Antiques Roadshow than as an IED.

At least in the United States.

Apache Hive 0.11: Stinger Phase 1 Delivered

May 18th, 2013

Apache Hive 0.11: Stinger Phase 1 Delivered by Owen O’Malley.

From the post:

In February, we announced the Stinger Initiative, which outlined an approach to bring interactive SQL-query into Hadoop. Simply put, our choice was to double down on Hive to extend it so that it could address human-time use cases (i.e. queries in the 5-30 second range). So, with input and participation from the broader community we established a fairly audacious goal of 100X performance improvement and SQL compatibility.

Introducing Apache Hive 0.11 – 386 JIRA tickets closed

As representatives of this open, community led effort we are very proud to announce the first release of the new and improved Apache Hive, version 0.11. This substantial release embodies the work of a wide group of people from Microsoft, Facebook , Yahoo, SAP and others. Together we have addressed 386 JIRA tickets, of which there were 28 new features and 276 bug fixes. There were FIFTY-FIVE developers involved in this and I would like to thank every one of them. See below for a full list.

Delivering on the promise of Stinger Phase 1

As promised we have delivered phase 1 of the Stinger Initiative in late spring. This release is another proof point that that the open community can innovate at a rate unequaled by any proprietary vendor. As part of phase 1 we promised windowing, new data types, the optimized RC (ORC) file and base optimizations to the Hive Query engine and the community has delivered these key features.

Stinger

Welcome news for the Hive and SQL communities alike!

Warp: Multi-Key Transactions for Key-Value Stores

May 18th, 2013

Warp: Multi-Key Transactions for Key-Value Stores by Robert Escriva, Bernard Wong and Emin Gün Sirer†.

Abstract:

Implementing ACID transactions has been a longstanding challenge for NoSQL systems. Because these systems are based on a sharded architecture, transactions necessarily require coordination across multiple servers. Past work in this space has relied either on heavyweight protocols such as Paxos or clock synchronization for this coordination.

This paper presents a novel protocol for coordinating distributed transactions with ACID semantics on top of a sharded data store. Called linear transactions, this protocol achieves scalability by distributing the coordination task to only those servers that hold relevant data for each transaction. It achieves high performance by serializing only those transactions whose concurrent execution could potentially yield a violation of ACID semantics. Finally, it naturally integrates chain-replication and can thus tolerate faults of both clients and servers. We have fully implemented linear transactions in a commercially available data store. Experiments show that the throughput of this system achieves 1-9× more throughput than MongoDB, Cassandra and HyperDex on the Yahoo! Cloud Serving Benchmark, even though none of the latter systems provide transactional guarantees.

Warp looks wicked cool!

Of particular interest is the non-ordering of transactions that have no impact on other transactions. That alone would be interesting for a topic map merging situation.

For more details, see the Warp page, or

Download Warp

Warp Tutorial

Warp Performance Benchmarks

I first saw this at High Scalability.

Open Data and Wishful Thinking

May 18th, 2013

BLM Fracking Rule Violates New Executive Order on Open Data by Sofia Plagakis.

From the post:

Today, the U.S. Department of the Interior’s Bureau of Land Management (BLM) released its revised proposed rule for natural gas drilling (commonly referred to as fracking) on federal and tribal lands. The much-anticipated rule violates President Obama’s recently issued executive order that requires new government information to be made available to the public in open, machine-readable formats.

Last week, President Obama signed an executive order requiring that all newly generated public data be pushed out in open, machine-readable formats. Concurrently, the Office of Management and Budget (OMB) and the Office of Science and Technology Policy (OSTP) released an Open Data Policy designed to make previously unavailable government data accessible to entrepreneurs, researchers, and the public.

The executive order and accompanying policy must have been in development for months, and agencies, including BLM, should have been fully aware of the new policy. But instead of establishing a modern example of government information collection and sharing, BLM’s proposed rule would allow drilling companies to report the chemicals used in fracking to a third-party, industry-funded website, called FracFocus.org, which does not provide data in machine-readable formats. FracFocus.org only allows users to download PDF files of reports on fracked wells. Because PDF files are not machine-readable, the site makes it very difficult for the public to use and analyze data on wells and chemicals that the government requires companies to collect and make available.

I wonder if Sofia simply overlooked:

When implementing the Open Data Policy, agencies shall incorporate a full analysis of privacy, confidentiality, and security risks into each stage of the information lifecycle to identify information that should not be released. These review processes should be overseen by the senior agency official for privacy. It is vital that agencies not release information if doing so would violate any law or policy, or jeopardize privacy, confidentiality, or national security. [From “We won’t get fooled again…”]

Or if her “…requires new government information to be made available to the public in open, machine-readable formats” is wishful thinking?

The Obama just released the Benghazi emails in PDF format. So we have an example of the Whitehouse violating its own “open data” policy.

We don’t need more “open data.”

What we need are more leakers. A lot more leakers.

Just be sure you leak or pass on leaks in “open, machine-readable formats.”

The foreign adventures, environmental pollution, failures in drug or food safety, etc., avoided by leaks may save your life, the lives of your children or grandchildren.

Leak today!

Graph Representation – Edge List

May 18th, 2013

Graph Representation – Edge List

From the post:

An Edge List is a form of representation for a graph. It maintains a list of all the edges in the graph. For each edge, it keeps track of the 2 connecting vertices as well as the weight between them.

Followed by C++ code as an example.

A hypergraph would require tracking of 3 or more connected nodes.

Dumb Jocks?

May 18th, 2013

Coaches are highest paid public employees by Nathan Yau.

Coach payments

Nathan makes another amazing find!

The topic map lesson here is effective presentation of information.

For the same data, the Obama Administration would list all U.S. public employees in order by internal department names and separately list positions with salaries, in a PDF file.

I know which strategy I prefer.

You?

A Trillion Triples in Perspective

May 18th, 2013

Mozart Meets MapReduce by Isaac Lopez.

From the post:

Big data has been around since the beginning of time, says Thomas Paulmichl, founder and CEO of Sigmaspecto, who says that what has changed is how we process the information. In a talk during Big Data Week, Paulmichl encouraged people to open up their perspective on what big data is, and how it can be applied.

During the talk, he admonished people to take a human element into big data. Paulmichl demonstrated this by examining the work of musical prodigy, Mozart – who Paulmichl noted is appreciated greatly by both music scientists, as well as the common music listener.

“When Mozart makes choices on writing a piece of work, the number of choices that he has and the kind of neural algorithms that his brain goes through to choose things is infinitesimally higher that what we call big data – it’s really small data in comparison,” he said.

Taking Mozart’s The Magic Flute as an example, Paulmichl, discussed the framework that Mozart used to make his choices by examining a music sheet outlining the number of bars, the time signature, the instrument and singer voicing.

“So from his perspective, he sits down, and starts to make what we as data scientists call quantitative choices,” explained Paulmichl. “Do I put a note here, down here, do I use a different instrument; do I use a parallel voicing for different violins – so these are all metrics that his brain has to decide.”

Exploring the mathematics of the music, Paulmichl concluded that in looking at The Magic Flute, Mozart had 4.72391E+21 creative variations (and then some) that he could have taken with the direction of it over the course of the piece. “We’re not talking about a trillion dataset; we’re talking about a sextillion or more,” he says adding that this is a very limited cut of the quantitative choice that his brain makes at every composition point.

“[A] sextillion or more…” puts the question of processing a trillion triples into perspective.

Another musical analogy?

Triples are the one finger version of Jingle Bells*:

*The gap is greater than the video represents but it is still amusing.

Does your analysis/data have one finger subtlety?

Faunus: Graph Analytics Engine

May 17th, 2013

Faunus: Graph Analytics Engine by Marko Rodriguez.

From the description:

Faunus is a graph analytics engine built atop the Hadoop distributed computing platform. The graph representation is a distributed adjacency list, whereby a vertex and its incident edges are co-located on the same machine. Querying a Faunus graph is possible with a MapReduce-variant of the Gremlin graph traversal language. A Gremlin expression compiles down to a series of MapReduce-steps that are sequence optimized and then executed by Hadoop. Results are stored as transformations to the input graph (graph derivations) or computational side-effects such as aggregates (graph statistics). Beyond querying, a collection of input/output formats are supported which enable Faunus to load/store graphs in the distributed graph database Titan, various graph formats stored in HDFS, and via arbitrary user-defined functions. This presentation will focus primarily on Faunus, but will also review the satellite technologies that enable it.

I saw this slide deck after posting ConceptNet5 [Herein of Hypergraphs] and writing about the “id-less” nodes and edges of ConceptNet5.

So when I see nodes and edges with IDs, I have to wonder why?

What requirement is being met or advantage that is obtained by using IDs and not addressing a node by its content?*

Remembering that we are no longer concerned with shaving bits off of identifiers for storage and/or processing concerns.


* I suspect that addressing by content presumes a level of granularity that may not be appropriate in all cases. Hard to say. But I do want to look at the issue more closely.

Common Locale Data Repository (CLDR) 23.1

May 17th, 2013

Common Locale Data Repository (CLDR) 23.1

From the CLDR project homepage:

What is CLDR?

The Unicode CLDR provides key building blocks for software to support the world’s languages, with the largest and most extensive standard repository of locale data available. This data is used by a wide spectrum of companies for their software internationalization and localization, adapting software to the conventions of different languages for such common software tasks. It includes:

  • Locale-specific patterns for formatting and parsing: dates, times, timezones, numbers and currency values
  • Translations of names: languages, scripts, countries and regions, currencies, eras, months, weekdays, day periods, timezones, cities, and time units
  • Language & script information: characters used; plural cases; gender of lists; capitalization; rules for sorting & searching; writing direction; transliteration rules; rules for spelling out numbers; rules for segmenting text into graphemes, words, and sentences
  • Country information: language usage, currency information, calendar preference and week conventions, postal and telephone codes
  • Other: ISO & BCP 47 code support (cross mappings, etc.), keyboard layouts

CLDR uses the XML format provided by UTS #35: Unicode Locale Data Markup Language (LDML). LDML is a format used not only for CLDR, but also for general interchange of locale data, such as in Microsoft’s .NET.

For a set of slides on the technical contents of CLDR, see Overview.

Great set of widely used mappings between locale data.

Solr 4, the NoSQL Search Server [Webinar]

May 17th, 2013

Solr 4, the NoSQL Search Server by Yonik Seeley

Date: Thursday, May 30, 2013
Time: 10:00am Pacific Time

From the description:

The long awaited Solr 4 release brings a large amount of new functionality that blurs the line between search engines and NoSQL databases. Now you can have your cake and search it too with Atomic updates, Versioning and Optimistic Concurrency, Durability, and Real-time Get!

Learn about new Solr NoSQL features and implementation details of how the distributed indexing of Solr Cloud was designed from the ground up to accommodate them.
Featured Presenter:

Yonik Seeley – Research creator of Apache Solr and the Chief Open Source Architect and Co-Founder at LucidWorks. Mr. Seeley is an Apache Lucene/Solr PMC member and committer and an expert in distributed search systems architecture and performance. His work experience includes CNET Networks, BEA and Telcordia. He earned his M.S. in Computer Science from Stanford University.

This could be a real treat!

Notes on the webinar to follow.

A self-updating road map of The Cancer Genome Atlas

May 17th, 2013

A self-updating road map of The Cancer Genome Atlas by David E. Robbins, Alexander Grüneberg, Helena F. Deus, Murat M. Tanik and Jonas S. Almeida. (Bioinformatics (2013) 29 (10): 1333-1340. doi: 10.1093/bioinformatics/btt141)

Abstract:

Motivation: Since 2011, The Cancer Genome Atlas’ (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months.

Results: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0). Specifically, this engine uses JavaScript in conjunction with the World Wide Web Consortium’s (W3C) Resource Description Framework (RDF), and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory. The resulting index may be queried using SPARQL, and enables file-level provenance annotations as well as discovery of arbitrary subsets of files, based on their metadata, using web standard languages. In turn, these abilities enhance the reproducibility and distribution of novel results delivered as elements of a web-based computational ecosystem. The development of the TCGA Roadmap engine was found to provide specific clues about how biomedical big data initiatives should be exposed as public resources for exploratory analysis, data mining and reproducible research. These specific design elements align with the concept of knowledge reengineering and represent a sharp departure from top-down approaches in grid initiatives such as CaBIG. They also present a much more interoperable and reproducible alternative to the still pervasive use of data portals.

Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at http://bit.ly/TCGARoadmap. A video tutorial is available at http://bit.ly/TCGARoadmapTutorial.

Curious how the granularity of required semantics and the uniformity of the underlying data set impact the choice of semantic approaches?

And does access to data files present different challenges than say access to research publications in the same field?

Organizing Digital Information for Others

May 17th, 2013

Organizing Digital Information for Others by Maish Nichani. (ebook, no registration required)

From the description:

When we interact with web and intranet teams, we find many struggling to move beyond conceptual-level discussions on information organization. Hours on end are spent on discussing the meaning of “metadata”, “controlled vocabulary” and “taxonomy” without any strategic understanding of how everything fits together. Being so bogged down at this level they fail to look beyond to the main reason for their pursuit—organizing information for others (the end users) so that they can find the information easily.

Web and intranet teams are not the only ones facing this challenge. Staff in companies are finding themselves tasked with organizing, say, hundreds of project documents on their collaboration space. And they usually end up organizing it in the only way they know—for themselves. Team members then often struggle to locate the information that they thought should be in “this folder”!

In this short book, we explore how lists, categories, trees and facets can be better used to organize information for others. We also learn how metadata and taxonomies can connect different collections and increase the findability of information across the website or intranet.

But more than that we hope that this book can start a conversation around this important part of our digital lives.

So let the conversation begin!

The theme of delivering information to others cannot be emphasized enough.

Your notes, interface choices, etc., are just that, your notes, interface choices, etc.

Unless you are independently wealthy, that isn’t a very good marketing model.

Nor are users going to be “trained” to work, search, author, the “right way” in your view.

An introduction to be sure but this short (50 odd pages) work is entertaining and has additional references.

Very much worth the time to read.

Properties of Morphisms

May 17th, 2013

Properties of Morphisms by Jeremy Kun.

From the post:

This post is mainly mathematical. We left it out of our introduction to categories for brevity, but we should lay these definitions down and some examples before continuing on to universal properties and doing more computation. The reader should feel free to skip this post and return to it later when the words “isomorphism,” “monomorphism,” and “epimorphism” come up again. Perhaps the most important part of this post is the description of an isomorphism.

Isomorphisms, Monomorphisms, and Epimorphisms

Perhaps the most important paradigm shift in category theory is the focus on morphisms as the main object of study. In particular, category theory stipulates that the only knowledge one can gain about an object is in how it relates to other objects. Indeed, this is true in nearly all fields of mathematics: in groups we consider all isomorphic groups to be the same. In topology, homeomorphic spaces are not distinguished. The list goes on. The only way to do determine if two objects are “the same” is by finding a morphism with special properties. Barry Mazur gives a more colorful explanation by considering the meaning of the number 5 in his essay, “When is one thing equal to some other thing?” The point is that categories, more than existing to be a “foundation” for all mathematics as a formal system (though people are working to make such a formal system), exist primarily to “capture the essence” of mathematical discourse, as Mazur puts it. A category defines objects and morphisms, but literally all of the structure of a category lies in its morphisms. And so we study them.

If you are looking for something challenging to read over the weekend, Jeremy’s latest post on morphisms should fit the bill.

The question of “equality” is easy enough to answer as lexical equivalence in UTF-8, but what if you need something more?

Being mindful that “lexical equivalence in UTF-8″ is a highly unreliable form of “equality.”

Jeremy is easing us down the road where such discussions can happen with a great deal of rigor.

Hadoop SDK and Tutorials for Microsoft .NET Developers

May 17th, 2013

Hadoop SDK and Tutorials for Microsoft .NET Developers by Marc Holmes.

From the post:

Microsoft has begun to treat its developer community to a number of Hadoop-y releases related to its HDInsight (Hadoop in the cloud) service, and it’s worth rounding up the material. It’s all Alpha and Preview so YMMV but looks like fun:

  • Microsoft .NET SDK for Hadoop. This kit provides .NET API access to aspects of HDInsight including HDFS, HCatalag, Oozie and Ambari, and also some Powershell scripts for cluster management. There are also libraries for MapReduce and LINQ to Hive. The latter is really interesting as it builds on the established technology for .NET developers to access most data sources to deliver the capabilities of the de facto standard for Hadoop data query.
  • HDInsight Labs Preview. Up on Github, there is a series of 5 labs covering C#, JavaScript and F# coding for MapReduce jobs, using Hive, and then bringing that data into Excel. It also covers some Mahout use to build a recommendation engine.
  • Microsoft Hive ODBC Driver. The examples above use this preview driver to enable the connection from Hive to Excel.

If all of the above excites you our Hadoop on Windows for Developers training course also similar content in a lot of depth.

Hadoop is coming to an office/data center near you.

Will you be ready?

ConceptNet5 [Herein of Hypergraphs]

May 17th, 2013

ConceptNet5

From the website:

ConceptNet is a semantic network containing lots of things computers should know about the world, especially when understanding text written by people.

It is built from nodes representing concepts, in the form of words or short phrases of natural language, and labeled relationships between them. These are the kinds of things computers need to know to search for information better, answer questions, and understand people’s goals. If you wanted to build your own Watson, this should be a good place to start!

ConceptNet contains everyday basic knowledge:

(…)

ConceptNet 5 is a graph

To be precise, it’s a hypergraph, meaning it has edges about edges. Each statement in ConceptNet has justfications pointing to it, explaining where it comes from and how reliable the information seems to be.

Previous versions of ConceptNet has been distributed as idiosyncratic database structures plus some software to interact with them. ConceptNet 5 is not a piece of software or a database; it is a graph. It’s a set of nodes and edges, which we can represent in multiple formats including JSON. You probably know better than we do what software you want to use to interact with it!

(That said, you can have our idiosyncratic Solr index if you want, but that’s not ConceptNet, it’s just a system for quickly looking things up in ConceptNet.)

Some other interesting properties:

  • The ConceptNet graph is ID-less. Every node and assertion contains all the information necessary to identify it and no more in its URI, and does not rely on arbitrarily-assigned IDs. The advantage of this is that if multiple branches of ConceptNet are developed in multiple places, we can later merge them simply by taking the union of the nodes and edges. (And we hope for this to happen!)
  • ConceptNet supports linked data: you can download a list of links to the greater Semantic Web, via DBPedia and via RDF/OWL WordNet. For example, our concept cat is linked to the DBPedia node at http://dbpedia.org/resource/Cat.

In addition to being a data source, interesting notion of “ID-less” nodes and edges.

Information on the software setup, Solr and Python to deliver ConceptNet5 as a hypergraph is also available.

I first saw this in Max De Marzi’s Knowledge Bases in Neo4j. You will find that Max’s approach involves dumbing down the hypergraph.

Knowledge Bases in Neo4j

May 17th, 2013

Knowledge Bases in Neo4j by Max De Marzi.

From the post:

From the second we are born we are collecting a wealth of knowledge about the world. This knowledge is accumulated and interrelated inside our brains and it represents what we know. If we could export this knowledge and give it to a computer, it would look like ConceptNet. ConceptNet is a semantic network that…

…is built from nodes representing concepts, in the form of words or short phrases of natural language, and labeled relationships between them. These are the kinds of things computers need to know to search for information better, answer questions, and understand people’s goals.

I wrote a little ruby script to import ConceptNet5 into Neo4j and it gives us a nice graph (243MB) to work with. ConceptNet5 as presented in csv files is actually a hypergraph, with a reason for the concept:

Max gives a script to remove the reasons for concepts (the hypergraph part of ConceptNet5) as duplicate content.

It does make the graph smaller, but only at the expense of information loss.

Think of it as the Benghazi emails with all the duplicate prose removed along with who said it.

If that fits your requirements, ok, but I doubt it would fit in any environment that requires information auditing.

Hadoop Toolbox: When to Use What

May 17th, 2013

Hadoop Toolbox: When to Use What by Mohammad Tariq.

From the post:

Eight years ago not even Doug Cutting would have thought that the tool which he’s naming after his kid’s soft toy would so soon become a rage and change the way people and organizations look at their data. Today Hadoop and Big Data have almost become synonyms to each other. But Hadoop is not just Hadoop now. Over time it has evolved into one big herd of various tools, each meant to serve a different purpose. But glued together they give you a powerpacked combo.

Having said that, one must be careful while choosing these tools for their specific use case as one size doesn’t fit all. What is working for someone might not be that productive for you. So, here I will show you which tool should be picked in which scenario. It’s not a big comparative study but a short intro to some very useful tools. And, this is based totally on my experience so there is always some scope of suggestions. Please feel free to comment or suggest if you have any. I would love to hear from you. Let’s get started :

Not shallow enough to be useful for the c-suite types, not deep enough for decision making.

Nice to use in a survey context, where users need an overview of the Hadoop ecosystem.

Automated Archival and Visual Analysis of Tweets…

May 16th, 2013

Automated Archival and Visual Analysis of Tweets Mentioning #bog13, Bioinformatics, #rstats, and Others by Stephen Turner.

From the post:

Ever since Twitter gamed its own API and killed off great services like IFTTT triggers, I’ve been looking for a way to automatically archive tweets containing certain search terms of interest to me. Twitter’s built-in search is limited, and I wanted to archive interesting tweets for future reference and to start playing around with some basic text / trend analysis.

Enter t – the twitter command-line interface. t is a command-line power tool for doing all sorts of powerful Twitter queries using the command line. See t‘s documentation for examples.

I wrote this script that uses the t utility to search Twitter separately for a set of specified keywords, and append those results to a file. The comments at the end of the script also show you how to commit changes to a git repository, push to GitHub, and automate the entire process to run twice a day with a cron job. Here’s the code as of May 14, 2013:

Stephen promises in his post that the script updates automatically and you may find “unsavory” tweets.

I didn’t but that may be a matter of happenstance or sensitivity. ;-)

Linguists Circle the Wagons, or Disagreement != Danger

May 16th, 2013

Pullum’s NLP Lament: More Sleight of Hand Than Fact by Christopher Phipps.

From the post:

My first reading of both of Pullum’s recent NLP posts (one and two) interpreted them to be hostile, an attack on a whole field (see my first response here). Upon closer reading, I see Pullum chooses his words carefully and it is less of an attack and more of a lament. He laments that the high-minded goals of early NLP (to create machines that process language like humans do) has not been reached, and more to the point, that commercial pressures have distracted the field from pursuing those original goals, hence they are now neglected. And he’s right about this to some extent.

But, he’s also taking the commonly used term “natural language processing” and insisting that it NOT refer to what 99% of people who use the term use it for, but rather only a very narrow interpretation consisting of something like “computer systems that mimic human language processing.” This is fundamentally unfair.

In the 1980s I was convinced that computers would soon be able to simulate the basics of what (I hope) you are doing right now: processing sentences and determining their meanings.

I feel Pullum is moving the goal posts on us when he says “there is, to my knowledge, no available system for unaided machine answering of free-form questions via general syntactic and semantic analysis” [my emphasis]. Pullum’s agenda appears to be to create a straw-man NLP world where NLP techniques are only admirable if they mimic human processing. And this is unfair for two reasons.

If there is unfairness in this discussion, it is the insistence by Christopher Phipps (and others) that Pullum has invented “…a straw-man NLP world where NLP techniques are only admirable if they mimic human processing.”

On the contrary, it was 1949 when Warren Weaver first proposed computers as the solution to world-wide translation problems. Weaver’s was not the only optimistic projection of language processing by computers. Those have continued up to and including the Semantic Web.

Yes, NLP practitioners such as Christopher Phipps use NLP in a more precise sense than Pullum. And NLP as defined by Phipps has too many achievements to easily list.

Neither one of those statements takes anything away from Pullum’s point that Google found a “sweet spot” between machine processing and human intelligence for search purposes.

What other insights Pullum has to offer may be obscured by the “…circle the wagons…” attitude from linguists.

Disagreement != Danger.

How-to: Configure Eclipse for Hadoop Contributions

May 16th, 2013

How-to: Configure Eclipse for Hadoop Contributions by Karthik Kambatla.

From the post:

Contributing to Apache Hadoop or writing custom pluggable modules requires modifying Hadoop’s source code. While it is perfectly fine to use a text editor to modify Java source, modern IDEs simplify navigation and debugging of large Java projects like Hadoop significantly. Eclipse is a popular choice thanks to its broad user base and multitude of available plugins.

This post covers configuring Eclipse to modify Hadoop’s source. (Developing applications against CDH using Eclipse is covered in a different post.) Hadoop has changed a great deal since our previous post on configuring Eclipse for Hadoop development; here we’ll revisit configuring Eclipse for the latest “flavors” of Hadoop. Note that trunk and other release branches differ in their directory structure, feature set, and build tools they use. (The EclipseEnvironment Hadoop wiki page is a good starting point for development on trunk.)

A post to ease your way towards contributing to the Hadoop project!

Or if you simply want to know the code you are running cold.

Or something in between!

HAL: a hierarchical format for storing…

May 16th, 2013

HAL: a hierarchical format for storing and analyzing multiple genome alignments by Glenn Hickey, Benedict Paten, Dent Earl, Daniel Zerbino and David Haussler. (Bioinformatics (2013) 29 (10): 1341-1342. doi: 10.1093/bioinformatics/btt128)

Abstract:

Motivation: Large multiple genome alignments and inferred ancestral genomes are ideal resources for comparative studies of molecular evolution, and advances in sequencing and computing technology are making them increasingly obtainable. These structures can provide a rich understanding of the genetic relationships between all subsets of species they contain. Current formats for storing genomic alignments, such as XMFA and MAF, are all indexed or ordered using a single reference genome, however, which limits the information that can be queried with respect to other species and clades. This loss of information grows with the number of species under comparison, as well as their phylogenetic distance.

Results: We present HAL, a compressed, graph-based hierarchical alignment format for storing multiple genome alignments and ancestral reconstructions. HAL graphs are indexed on all genomes they contain. Furthermore, they are organized phylogenetically, which allows for modular and parallel access to arbitrary subclades without fragmentation because of rearrangements that have occurred in other lineages. HAL graphs can be created or read with a comprehensive C++ API. A set of tools is also provided to perform basic operations, such as importing and exporting data, identifying mutations and coordinate mapping (liftover).

Availability: All documentation and source code for the HAL API and tools are freely available at http://github.com/glennhickey/hal.

Important work for bioinformatics and genome alignment as well as specializing graphs for that work.

Graphs are a popular subject these days but successful projects will rely on graphs with particular properties and structures to be useful.

The more examples of graph-based projects, the more we learn about general principles of graphs for particular applications or requirements.