Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 6, 2013

DARPA’s online games crowdsource software security

Filed under: Authoring Topic Maps,Crowd Sourcing,Games,Interface Research/Design — Patrick Durusau @ 8:22 pm

DARPA’s online games crowdsource software security by Kevin McCaney.

From the post:

Flaws in commercial software can cause serious problems if cyberattackers take advantage of them with their increasingly sophisticated bag of tricks. The Defense Advanced Research Projects Agency wants to see if it can speed up discovery of those flaws by making a game of it. Several games, in fact.

DARPA’s Crowd Sourced Formal Verification (CSFV) program has just launched its Verigames portal, which hosts five free online games designed to mimic the formal software verification process traditionally used to look for software bugs.

Verification, both dynamic and static, has proved to be the best way to determine if software is free of flaws, but it requires software engineers to perform “mathematical theorem-proving techniques” that can be time-consuming, costly and unable to scale to the size of some of today’s commercial software, according to DARPA. With Verigames, the agency is testing whether untrained (and unpaid) users can verify the integrity of software more quickly and less expensively.

“We’re seeing if we can take really hard math problems and map them onto interesting, attractive puzzle games that online players will solve for fun,” Drew Dean, DARPA program manager, said in announcing the portal launch. “By leveraging players’ intelligence and ingenuity on a broad scale, we hope to reduce security analysts’ workloads and fundamentally improve the availability of formal verification.”

If program verification is possible with online games, I don’t know of any principled reason why topic map authoring should not be possible.

Maybe fill-in-the-blank forms are just a poor authoring technique for topic maps.

Imagine gamifying data streams so they play like Missile Command. 😉

Can you even count the number of hours that you played Missile Command?

Now consider the impact of a topic map authoring interface that addictive.

Particularly if the user didn’t know they were doing useful work.

A New Source of Revenue for Data Scientists: Selling Data

Filed under: Data,Marketing — Patrick Durusau @ 8:02 pm

A New Source of Revenue for Data Scientists: Selling Data by Vincent Granville.

From the post:

What kind of data is salable? How can data scientists independently make money by selling data that is automatically generated: raw data, research data (presented as customized reports), or predictions. In short, using an automated data generation / gathering or prediction system, working from home with no boss and no employee, and possibly no direct interactions with clients. An alternate career path that many of us would enjoy!

Vincent gives a number of examples of companies selling data right now, some possible data sources, startup ideas and pointers to articles on data scientists.

Vincent makes me think there are at least three ways to sell topic maps:

  1. Sell people on using topic maps themselves so they can produce high quality data.
  2. Sell people on hiring you to construct a topic map system so they can produce high quality data.
  3. Sell people high quality data because you are using a topic map.

Not everyone who likes filet mignon (#3) wants to raise the cow (#1) and/or butcher the cow (#2).

It is more expensive to buy filet mignon, but it also lowers the odds of stepping in cow manure and/or blood.

What data would you buy?

Instructions for deploying an Elasticsearch Cluster with Titan

Filed under: ElasticSearch,Graphs,Titan — Patrick Durusau @ 7:28 pm

Instructions for deploying an Elasticsearch Cluster with Titan by Benjamin Bengfort.

From the post:

Elasticsearch is an open source distributed real-time search engine for the cloud. It allows you to deploy a scalable, auto-discovered cluster of nodes, and as search capacity grows, you simply need to add more nodes and the cluster will reorganize itself. Titan, a distributed graph engine by Aurelius, supports elasticsearch as an option to index your vertices for fast lookup and retrieval. By default, Titan supports elasticsearch running in the same JVM and storing data locally on the client, which is fine for embedded mode. However, once your Titan cluster starts growing, you have to respond by growing an elasticsearch cluster side by side with the graph engine.

This tutorial shows how to quickly get an elasticsearch cluster up and running on EC2, then configure Titan to use it for indexing. It assumes you already have an EC2/Titan cluster deployed. Note that these instructions were for a particular deployment, so please forward any questions about specifics in the comments!

A great tutorial. Short, on point and references other resources.
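If you want a quick sanity check that the cluster is reachable before pointing Titan at it, a couple of lines against Elasticsearch’s cluster health endpoint will do; the host name below is a placeholder for one of your EC2 nodes:

```python
import requests

# Placeholder host: substitute one of your EC2 Elasticsearch nodes.
resp = requests.get("http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:9200/_cluster/health")
health = resp.json()
# Expect "green" (or "yellow" while replicas settle) once all nodes have joined.
print(health["status"], health["number_of_nodes"])
```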

Enjoy!

Glitch is Dead, Long Live Glitch!

Filed under: Graphics,Open Source — Patrick Durusau @ 6:58 pm

Glitch is Dead, Long Live Glitch!: Art & Code from the Game Released into Public Domain by Tiny Speck.

From the website:

The collaborative, web-based, massively multiplayer game Glitch began its initial private testing in 2009, opened to the public in 2010, and was shut down in 2012. It was played by more than 150,000 people and was widely hailed for its original and highly creative visual style.

The entire library of art assets from the game has been made freely available, dedicated to the public domain. Code from the game client is included to help developers work with the assets. All of it can be downloaded and used by anyone, for any purpose. (But: use it for good.)

Tiny Speck, Inc., the game’s developer, has relinquished its ownership of copyright over these 10,000+ assets in the hopes that they help others in their creative endeavours and build on Glitch’s legacy of simple fun, creativity and an appreciation for the preposterous. Go and make beautiful things.

I never played Glitch but the art could be useful.

Or perhaps even the online game code if you are looking to create a topic map gaming site.

Read the release for the details of the licensing.

I first saw this in Nat Torkington’s Four short links: 22 November 2013.

Whoosh

Filed under: Python,Search Engines — Patrick Durusau @ 5:18 pm

Whoosh: Python Search Library

From the webpage:

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly.

Some of Whoosh’s features include:

  • Pythonic API.
  • Pure-Python. No compilation or binary packages needed, no mysterious crashes.
  • Fielded indexing and search.
  • Fast indexing and retrieval — faster than any other pure-Python search solution I know of. See Benchmarks.
  • Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
  • Powerful query language.
  • Production-quality pure Python spell-checker (as far as I know, the only one).

Whoosh might be useful in the following circumstances:

  • Anywhere a pure-Python solution is desirable to avoid having to build/compile native libraries (or force users to build/compile them).
  • As a research platform (at least for programmers that find Python easier to read and work with than Java) 😉
  • When an easy-to-use Pythonic interface is more important to you than raw speed.
  • If your application can make good use of one deeply integrated search/lookup solution you can rely on just being there rather than having two different search solutions (a simple/slow/homegrown one integrated, an indexed/fast/external binary dependency one as an option).

Whoosh was created and is maintained by Matt Chaput. It was originally created for use in the online help system of Side Effects Software’s 3D animation software Houdini. Side Effects Software Inc. graciously agreed to open-source the code.


One of the reasons to use Whoosh made me laugh:

When an easy-to-use Pythonic interface is more important to you than raw speed.

When is raw speed less important than anything? 😉

Seriously, experimentation with search promises to be a fruitful area for the foreseeable future.
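If you want to kick the tires, here is a minimal indexing-and-search sketch using Whoosh’s quick-start API; the index directory and field names are arbitrary placeholders:

```python
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

# Define a schema and create an on-disk index.
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT)
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

# Add a document and commit it.
writer = ix.writer()
writer.add_document(title=u"First document", path=u"/a",
                    content=u"Whoosh is a pure-Python search library.")
writer.commit()

# Parse a query against the "content" field and search.
with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("search")
    for hit in searcher.search(query):
        print(hit["title"])
```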

I first saw this in Nat Torkington’s Four short links: 21 November 2013.

December 5, 2013

TextBlob: Simplified Text Processing

Filed under: Natural Language Processing,Parsing,Text Mining — Patrick Durusau @ 7:31 pm

TextBlob: Simplified Text Processing

From the webpage:

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

….

TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.

Features

  • Noun phrase extraction
  • Part-of-speech tagging
  • Sentiment analysis
  • Classification (Naive Bayes, Decision Tree)
  • Language translation and detection powered by Google Translate
  • Tokenization (splitting text into words and sentences)
  • Word and phrase frequencies
  • Parsing
  • n-grams
  • Word inflection (pluralization and singularization) and lemmatization
  • Spelling correction
  • JSON serialization
  • Add new models or languages through extensions
  • WordNet integration

Knowing that TextBlob plays well with NLTK is a big plus!
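A few lines are enough to get the flavor of the API; this sketch sticks to the core features listed above (tagging, noun phrases, sentiment):

```python
from textblob import TextBlob

blob = TextBlob("TextBlob sits on top of NLTK and pattern. "
                "It makes common NLP tasks pleasantly simple.")

print(blob.tags)          # part-of-speech tags, e.g. [('TextBlob', 'NNP'), ...]
print(blob.noun_phrases)  # noun phrase extraction
print(blob.sentiment)     # Sentiment(polarity=..., subjectivity=...)

# Per-sentence sentiment.
for sentence in blob.sentences:
    print(sentence.sentiment.polarity)
```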

InfluxDB

Filed under: InfluxDB,NoSQL,Time,Time Series — Patrick Durusau @ 7:19 pm

InfluxDB

From the webpage:

An open-source, distributed, time series, events, and metrics database with no external dependencies.

Time Series

Everything in InfluxDB is a time series that you can perform standard functions on like min, max, sum, count, mean, median, percentiles, and more.

Metrics

Scalable metrics that you can collect on any interval, computing rollups on the fly later. Track 100 metrics or 1 million, InfluxDB scales horizontally.

Events

InfluxDB’s data model supports arbitrary event data. Just write in a hash of associated data and count events, uniques, or grouped columns on the fly later.

The overview page gives some greater detail:

When we built Errplane, we wanted the data model to be flexible enough to store events like exceptions along with more traditional metrics like response times and server stats. At the same time we noticed that other companies were also building custom time series APIs on top of a database for analytics and metrics. Depending on the requirements these APIs would be built on top of a regular SQL database, Redis, HBase, or Cassandra.

We thought the community might benefit from the work we’d already done with our scalable backend. We wanted something that had the HTTP API built in that would scale out to billions of metrics or events. We also wanted something that would make it simple to query for downsampled data, percentiles, and other aggregates at scale. Our hope is that once there’s a standard API, the community will be able to build useful tooling around it for data collection, visualization, and analysis.

While phrased as tracking server stats and events, I suspect InfluxDB would be just as happy tracking other types of stats or events.

Say, for instance, the “I’m alive” messages your cellphone sends to the local towers.
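Since everything goes over InfluxDB’s built-in HTTP API, writing such events is a short exercise with requests. The endpoint and payload below follow the early 0.x JSON write API and have changed in later releases, so treat the URL, credentials, and series layout as assumptions rather than a recipe:

```python
import requests

# Assumed 0.x-era endpoint: /db/<database>/series with user/password query params.
url = "http://localhost:8086/db/phone_events/series"
payload = [{
    "name": "keepalive",                      # series name (illustrative)
    "columns": ["tower_id", "phone_id"],
    "points": [["tower-42", "355b9f"]],       # one event row
}]
resp = requests.post(url, params={"u": "root", "p": "root"}, json=payload)
print(resp.status_code)  # 200 on success in the old API
```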

I first saw this in Nat Torkington’s Four short links: 5 November 2013.

SICP in Clojure – Update

Filed under: Clojure — Patrick Durusau @ 5:25 pm

In my post, SICP in Clojure, I incorrectly identified Steve Deobald as the maintainer of this project.

The original maintainer of the project placed a link on the site saying that Steve is the maintainer.

That is not correct.

Apologies to Steve and apologies to my readers who were hopeful this project would be going forward.

Any thoughts on moving this project forward?

I think the idea is a very sound one.

PS: Unlike many media outlets, I think corrections should be as prominent as the original mistakes.

On Self-Licking Ice Cream Cones

Filed under: Funding,NSA,Project Management,Security — Patrick Durusau @ 3:47 pm

On Self-Licking Ice Cream Cones by Pete Worden, 1992.

Ben Brody in The definitive glossary of modern US military slang quotes the following definition for a Self-Licking Ice Cream Cone:

A military doctrine or political process that appears to exist in order to justify its own existence, often producing irrelevant indicators of its own success. For example, continually releasing figures on the amount of Taliban weapons seized, as if there were a finite supply of such weapons. While seizing the weapons, soldiers raid Afghan villages, enraging the residents and legitimizing the Taliban’s cause.

Wikipedia (Self-licking ice cream cone) reports the phrase was first used by Pete Worden in “On Self-Licking Ice Cream Cones” in 1992 to describe the NASA bureaucracy.

The keywords for the document are: Ice Cream Cones; Pork; NASA; Mafia; Congress.

Birds of a feather I would say.

Worden isolates several problems:

Problems, National, The Budget Process


This unfortunate train of events has resulted in a NASA which, more than any other agency, believes it works only for the appropriations committees. The senior staff of those committees, who have little interest in science or space, effectively run NASA. NASA senior officials’ noses are usually found at waist level near those committee staffers.

Problems, Closer to Home, NASA

“The Self-Licking Ice Cream Cone”

Since NASA effectively works for the most porkish part of Congress, it is not surprising that their programs are designed to maximize and perpetuate jobs programs in key Congressional districts. The Space Shuttle-Space Station is an outrageous example. Almost two-thirds of NASA’s budget is tied up in this self-licking program. The Shuttle is an unbelievably costly way to get to space at $1 billion a pop. The Space Station is a silly design. Yet, this Station is designed so it can only be built by the Shuttle and the Shuttle is the only way to construct the Station….

“Inmates Running the Asylum”

NASA’s vaunted “peer review” process is not a positive factor, but an example of the “pork” mentality within the scientific community. It results in needlessly complex programs whose primary objective is not putting instruments in orbit, but maximizing the number of constituencies and investigators, thereby maximizing the political invulnerability of the program….

“Mafia Tactics”

…The EOS is a case in point. About a year ago, encouraged by criticism from some quarters of Congress and in the press, some scientists and satellite contractors began proposing small, cheap, near-term alternatives to the EOS “battlestars.” Senior NASA officials conducted, with impunity, an unbelievable campaign of threats against these critics. Members of the White House advisory committees were told they would not get NASA funding if they continued to probe the program….

“Shoot the Sick Horses, and their Trainers”

It is outrageous that the Hubble disaster resulted in no repercussions. All we hear is that some un-named technician, no longer working for the contractor, made a mistake in the early 1980s. Even in the Defense Department, current officials would have lost their jobs over allowing such an untested and expensive system to be launched.

Compare Worden’s complaints to the security apparatus represented by the NSA and its kin.

Have you heard of any repercussions for any of the security failures and/or outrages?

Is there any doubt that the security apparatus exists solely to perpetuate the security apparatus?

By definition the NSA is a Self-Licking Ice Cream Cone.

Time to find a trash can.


EOS: Earth Observing System

Hubble: The Hubble Space Telescope Optical Systems Failure Report (pdf). Long before all the dazzling images from Hubble, it was virtually orbiting space junk for several years.

U.S. Military Slang

Filed under: Language,Vocabularies — Patrick Durusau @ 2:35 pm

The definitive glossary of modern US military slang by Ben Brody.

From the post:

It’s painful for US soldiers to hear discussions and watch movies about modern wars when the dialogue is full of obsolete slang, like “chopper” and “GI.”

Slang changes with the times, and the military’s is no different. Soldiers fighting the wars in Iraq and Afghanistan have developed an expansive new military vocabulary, taking elements from popular culture as well as the doublespeak of the military industrial complex.

The US military drawdown in Afghanistan — which is underway but still awaiting the outcome of a proposed bilateral security agreement — is often referred to by soldiers as “the retrograde,” which is an old military euphemism for retreat. Of course the US military never “retreats” — rather it conducts a “tactical retrograde.”

This list is by no means exhaustive, and some of the terms originated prior to the wars in Afghanistan and Iraq. But these terms are critical to speaking the current language of soldiers, and understanding it when they speak to others. Please leave anything you think should be included in the comments.

Useful for documents that contain U.S. military slang, such as the Afghanistan War Diary.

As Ben notes at the outset, language changes over time so validate any vocabulary against your document/data set.

Geoff (update)

Filed under: Cypher,Geoff,Neo4j — Patrick Durusau @ 1:20 pm

Geoff

My prior post on Geoff pointed to a page about Geoff that appears to no longer exist. I have updated that page to point to the new location.

The current description reads:

Geoff is a text-based interchange format for Neo4j graph data that should be instantly readable to anyone familiar with Cypher, on which its syntax is based.

N*SQL Matters @Barcelona, Spain Slides!

Filed under: Conferences,NoSQL — Patrick Durusau @ 12:51 pm

N*SQL Matters @Barcelona, Spain Slides!

Slides for now, but videos are said to be coming soon!

By Title:

  • API Analytics with Redis and Bigquery, Javier Ramirez view the slides
  • ArangoDB – a different approach to NoSQL, Lucas Dohmen view the slides
  • Big Memory Scale-in vs. Scale-out, Niklas Bjorkman view the slides
  • Bringing NoSQL to your mobile!, Patrick Heneise view the slides
  • Building information systems using rapid application development methods, Michel Müller view the slides
  • A call for sanity in NoSQL, Nathan Marz view the slides
  • Cicerone: A Real-Time social venue recommender, Daniel Villatoro view the slides
  • Database History from Codd to Brewer and Beyond, Doug Turnbull view the slides
  • DynamoDB – on-demand NoSQL scaling as a service, Steffen Krause view the slides
  • Getting down and dirty with Elasticsearch, Clinton Gormley view the slides
  • Harnessing the Internet of Things with NoSQL, Michael Hausenblas view the slides
  • How to survive in a BASE world, Uwe Friedrichsen view the slides
  • Introduction to Graph Databases, Stefan Armbruster view the slides
  • A Journey through the MongoDB Internals, Christian Kvalheim view the slides
  • Killing pigs and saving Danish bacon with Riak, Joel Jacobsen view the slides
  • Lambdoop, a framework for easy development of Big Data applications, Rubén Casado view the slides
  • NoSQL Infrastructure, David Mytton view the slides
  • Realtime visitor analysis with Couchbase and Elasticsearch, Jeroen Reijn view the slides
  • SAMOA: A Platform for Mining Big Data Streams, Gianmarco De Francisci Morales view the slides
  • Splout SQL: Web-latency SQL View for Hadoop, Iván de Prado view the slides
  • Sprayer: low latency, reliable multichannel messaging for Telefonica Digital, Pablo Enfedaque and Javier Arias view the slides

By Presenter:

    • Armbruster, Stefan – Introduction to Graph Databases view the slides
    • Bjorkman, Niklas – Big Memory – Scale-in vs. Scale-out view the slides
    • Casado, Rubén – Lambdoop, a framework for easy development of Big Data applications view the slides
    • Dohmen, Lucas – ArangoDB – a different approach to NoSQL view the slides
    • Enfedaque, Pablo and Javier Arias – Sprayer: low latency, reliable multichannel messaging for Telefonica Digital view the slides
    • Friedrichsen, Uwe – How to survive in a BASE world view the slides
    • Gormley, Clinton – Getting down and dirty with Elasticsearch view the slides
    • Hausenblas, Michael – Harnessing the Internet of Things with NoSQL view the slides
    • Heneise, Patrick – Bringing NoSQL to your mobile! view the slides
    • Jacobsen, Joel – Killing pigs and saving Danish bacon with Riak view the slides
    • Krause, Steffen – DynamoDB – on-demand NoSQL scaling as a service view the slides
    • Kvalheim, Christian – A Journey through the MongoDB Internals view the slides
    • Marz, Nathan – A call for sanity in NoSQL view the slides
    • Morales, Gianmarco De Francisci – SAMOA: A Platform for Mining Big Data Streams view the slides
    • Müller, Michel – Building information systems using rapid application development methods view the slides
    • Mytton, David – NoSQL Infrastructure view the slides
    • Prado, Iván de – Splout SQL: Web-latency SQL View for Hadoop view the slides
    • Ramirez, Javier – API Analytics with Redis and Bigquery view the slides
    • Reijn, Jeroen – Realtime visitor analysis with Couchbase and Elasticsearch view the slides
    • Turnbull, Doug – Database History from Codd to Brewer and Beyond view the slides
    • Villatoro, Daniel – Cicerone: A Real-Time social venue recommender view the slides

    I will update these with the videos when they are posted.

    Enjoy!

    Apache CouchDB Conf Vancouver Videos!

    Filed under: Conferences,CouchDB — Patrick Durusau @ 11:09 am

    Apache CouchDB Conf Vancouver Videos!

    For your viewing pleasure.

    By Title:

    By Presenter:

    Enjoy!

    Latest NSA Fire Storm

    Filed under: Cybersecurity,NSA,Security — Patrick Durusau @ 10:18 am

Among the many places you can read about the latest Edward Snowden disclosures, “NSA tracking cellphone locations worldwide, Snowden documents show” by Barton Gellman and Ashkan Soltani (Washington Post, December 4, 2013) reads in part:

    The National Security Agency is gathering nearly 5 billion records a day on the whereabouts of cellphones around the world, according to top-secret documents and interviews with U.S. intelligence officials, enabling the agency to track the movements of individuals — and map their relationships — in ways that would have been previously unimaginable.

    The records feed a vast database that stores information about the locations of at least hundreds of millions of devices, according to the officials and the documents, which were provided by former NSA contractor Edward Snowden. New projects created to analyze that data have provided the intelligence community with what amounts to a mass surveillance tool.

    And among the many denunciations of NSA activities, the American Library Association:

    Nation’s Libraries Warn of NSA’s ‘Ravenous Hunger’ for Data

    “We don’t want [library patrons] being surveilled because that will inhibit learning, and reading, and creativity,” said Alan Inouye of the American Library Association

    – Andrea Germanos, staff writer

A quick search on Twitter turned up several hundred tweets, with updates arriving in the double digits every 30 seconds or so.

The general tenor was surprise (which I don’t understand) and outrage (which I do).

What is missing from the discussion is what to do to correct the situation.

Quite recently we all learned that Minuteman missiles had their launch codes set to 00000000, despite direct presidential orders to the contrary.

    I take that as evidence, along with the history of the NSA, that passing laws to regulate an agency that is without effective supervision is an exercise in futility.

    Any assurance from the NSA that they are obeying U.S. laws is incapable of public verification and therefore should be presumed to be false.

    The only effective means to limit NSA activities is to limit the NSA.

    Let me repeat that: The only effective means to limit NSA activities is to limit the NSA.

    We only have the NSA’s word that it has played an important role in protecting the U.S. from terrorists.

    How can we test that tale?

    My suggestion is that we defund the NSA for a period of not less than five years. No transfer of data, equipment or personnel. None.

If, during the next five years, U.S.-based terrorism increases and proponents have a plausible plan for a new NSA, then we can reconsider it.

If there is, as is likely, no increase in U.S.-based terrorism, we can avoid the expense of a rogue agency with its own agenda.

PS: I would not worry about the fates of NSA staff/contractors. There are a number of high-tech surveillance opportunities in the People’s Republic of China. Plus it has a form of government more suited to current NSA staff.

    Ekisto

    Filed under: Graphs,Visualization — Patrick Durusau @ 9:15 am

    Ekisto

    From the about:

    Ekisto comes from ekistics, the science of human settlements.

    Ekisto is an interactive visualization of three online communities: StackOverflow, Github and Friendfeed. Ekisto tries to imagine and map our online habitats using graph algorithms and the city as a metaphor.

    A graph layout algorithm arranges users in 2D space based on their similarity. Cosine similarity is computed based on the users’ network (Friendfeed), collaborate, watch, fork and follow relationships (Github), or based on the tags of posts contributed by users (StackOverflow). The height of each user represents the normalized value of the user’s Pagerank (Github, Friendfeed) or their reputation points (StackOverflow).

    A project by Alex Dragulescu.
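The similarity measure behind the layout is nothing exotic. Here is a small illustration of cosine similarity over per-user tag-count vectors of the kind described above; the tags and counts are made up:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical per-user counts over the same tag vocabulary,
# e.g. [python, haskell, sql, graphs].
user_a = np.array([12, 0, 3, 7])
user_b = np.array([10, 1, 0, 9])
print(cosine_similarity(user_a, user_b))  # close to 1.0 means similar tagging behavior
```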

    The three communities modeled are:

    • stackoverflow.jul.2013
    • github.mar.2012
    • friendfeed.feb.2012

StackOverflow can be searched by name, but Github and FriendFeed only by userid, which makes following a particular user from one community to the next almost impossible.

    I mention that because we all participate in many different communities and our roles and even status may vary widely from community to community.

    Any one community view is an incomplete view of that person.

Beyond the need to map across communities, the other takeaway from Ekisto is the question of community formation.

    That is, given the present snapshot of these communities, how did they evolve over time? Did particular people joining have a greater impact than others? Did some event trigger a rise in membership?

    Deeply interesting work and a reason to learn more about ekistics.

    I first saw this in a tweet by Neil Saunders.

    German Digital Library releases API

    Filed under: Digital Library,Library — Patrick Durusau @ 8:11 am

    German Digital Library releases API by Lieke Ploeger.

    From the post:

    Last month the German Digital Library (Deutsche Digitale Bibliothek – DDB) made a promising step forward toward further opening up their data by releasing its API (Application Programming Interface) to the public. This API provides access to all the metadata of the DDB released under a CC0 license, which is the predominant share. The release of this API opens up a wide range of possibilities for users to build applications, create combinations with other data or include the German digitised cultural heritage on other platforms. In the future, the DDB also plans to organize a programming competition for API applications as well as a series of workshops for developers.

    The official press release.

    Technical documentation on the API (German).

    A good excuse for you to brush up on your German. Besides, not all of it is in German.

    December 4, 2013

    Free Language Lessons for Computers

    Filed under: Data,Language,Natural Language Processing — Patrick Durusau @ 4:58 pm

    Free Language Lessons for Computers by Dave Orr.

    From the post:

    50,000 relations from Wikipedia. 100,000 feature vectors from YouTube videos. 1.8 million historical infoboxes. 40 million entities derived from webpages. 11 billion Freebase entities in 800 million web documents. 350 billion words’ worth from books analyzed for syntax.

    These are all datasets that we’ve shared with researchers around the world over the last year from Google Research.

A great summary of the major data drops by Google Research over the past year, in many cases including pointers to additional information on the datasets.

    One that I have seen before and that strikes me as particularly relevant to topic maps is:

    Dictionaries for linking Text, Entities, and Ideas

    What is it: We created a large database of pairs of 175 million strings associated with 7.5 million concepts, annotated with counts, which were mined from Wikipedia. The concepts in this case are Wikipedia articles, and the strings are anchor text spans that link to the concepts in question.

    Where can I find it: http://nlp.stanford.edu/pubs/crosswikis-data.tar.bz2

    I want to know more: A description of the data, several examples, and ideas for uses for it can be found in a blog post or in the associated paper.

For most purposes, you would need far less than the full set of 7.5 million concepts. Imagine having the relevant concepts for a domain automatically “tagged” as you composed prose about it.

    Certainly less error-prone than marking concepts by hand!
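As a rough illustration of the “tag as you type” idea, here is a sketch that loads a trimmed string-to-concept dictionary and marks up prose with the most frequent concept per string. The file layout here (tab-separated string, concept, count) is a simplification of the released data, so adjust the parsing to the real format:

```python
from collections import defaultdict

def load_dictionary(path):
    """Keep only the highest-count concept for each anchor string."""
    best = {}
    counts = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            string, concept, count = line.rstrip("\n").split("\t")
            count = int(count)
            if count > counts[string]:
                counts[string] = count
                best[string] = concept
    return best

def tag_text(text, dictionary):
    """Naive single-word lookup; real use would match longer spans first."""
    return [(word, dictionary.get(word.lower())) for word in text.split()]

# dictionary = load_dictionary("crosswikis_trimmed.tsv")  # hypothetical trimmed file
# print(tag_text("Topic maps and semantic diversity", dictionary))
```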

    MusicGraph

    Filed under: Graphs,Marketing,Music,Titan — Patrick Durusau @ 4:30 pm

    Senzari releases a searchable MusicGraph service for making musical connections by Josh Ong.

    From the post:

Music data company Senzari has launched MusicGraph, a new service for discovering music by searching through a graph of over a billion music-related data points.

    MusicGraph includes a consumer-facing version and an API that can be used for commercial purposes. Senzari built the graph while working on the recommendation engine for its own streaming service, which has been rebranded as Wahwah.

    Interestingly, MusicGraph is launching first on Firefox OS before coming to iOS, Android and Windows Phone in “the coming weeks.”

    You know how much I try to avoid “practical” applications but when I saw aureliusgraphs tweet this as using the Titan database, I just had to mention it. 😉

I think this announcement underscores something a commenter said recently about promoting topic maps for what they do, not because they are topic maps.

    Here, graphs are being promoted as the source of a great user experience, not because they are fun, powerful, etc. (all of which is also true).

    Homotopy Type Theory

    Filed under: Homology,Types — Patrick Durusau @ 4:10 pm

    Homotopy Type Theory by Robert Harper. (Course with video lectures, notes, etc.)

    Synopsis:

    This is a graduate research seminar on Homotopy Type Theory (HoTT), a recent enrichment of Intuitionistic Type Theory (ITT) to include "higher-dimensional" types. The dimensionality of a type refers to the structure of its paths, the constructive witnesses to the equality of pairs of elements of a type, which themselves form a type, the identity type. In general a type is infinite dimensional in the sense that it exhibits non-trivial structure at all dimensions: it has elements, paths between elements, paths between paths, and so on to all finite levels. Moreover, the paths at each level exhibit the algebraic structure of a (higher) groupoid, meaning that there is always the "null path" witnessing reflexivity, the "inverse" path witnessing symmetry, and the "concatenation" of paths witnessing transitivity such that group-like laws hold "up to higher homotopy". Specifically, there are higher-dimensional paths witnessing the associative, unital, and inverse laws for these operations. Altogether this means that a type is a weak ∞-groupoid.

    The significance of the higher-dimensional structure of types lies in the concept of a type-indexed family of types. Such families exhibit the structure of a fibration, which means that a path between two indices "lifts" to a transport mapping between the corresponding instances of the family that is, in fact, an equivalence. Thinking of paths as constructive witnesses for equality, this amounts to saying that equal indices give rise to equivalent types, and hence, by univalence, equal elements of the universe in which the family is valued. Thus, for example, if we think of the interval I as a type with two endpoints connected by a path, then an I-indexed family of types must assign equivalent types to the endpoints. In contrast the type B of booleans consists of two disconnected points, so that a B-indexed family of types may assign unrelated types to the two points of B. Similarly, mappings from I into another type A must assign connected points in A to the endpoints of the interval, whereas mappings from B into A are free to assign arbitrary points of A to the two booleans. These preservation principles are central to the structure of HoTT.

In many cases the path structure of a type becomes trivial beyond a certain dimension, called the level of the type. By convention the levels start at -2 and continue through -1, 0, 1, 2, and so on indefinitely. At the lowest, -2, level, the path structure of a type is degenerate in that there is an element to which all other elements are equal; such a type is said to be contractible, and is essentially a singleton. At the next higher level, -1, the type of paths between any two elements is contractible (level -2), which means that any two elements are equal, if there are any elements at all; such a type is a sub-singleton or h-proposition. At the next level, 0, the type of paths between paths between elements is contractible, so that any two elements are equal "in at most one way"; such a type is a set whose types of paths (equality relations) are all h-prop’s. Continuing in this way, types of level 1 are groupoids, those of level 2 are 2-groupoids, and so on for all finite levels.

    ITT is capable of expressing only sets, which are types of level 0. Such types may have elements, and two elements may be considered equal in at most one way. A large swath of (constructive) mathematics may be formulated using only sets, and hence is amenable to representation in ITT. Computing applications, among others, require more than just sets. For example, it is often necessary to suppress distinctions among elements of a type so as to avoid over-specification; this is called proof irrelevance. Traditionally ITT has been enriched with an ad hoc treatment of proof irrelevance by introducing a universe of "propositions" with no computational content. In HoTT such propositions are types of level -1, requiring no special treatment or distinction. Such types arise by propositional truncation of a type to render degenerate the path structure of a type above level -1, ensuring that any two elements are equal in the sense of having a path between them.

    Propositional truncation is just one example of a higher inductive type, one that is defined by specifying generators not only for its elements, but also for its higher-dimensional paths. The propositional truncation of a type is one that includes all of the elements of the type, and, in addition, a path between any two elements, rendering them equal. It is a limiting case of a quotient type in which only certain paths between elements are introduced, according to whether they are deemed to be related. Higher inductive types also permit the representation of higher-dimensional objects, such as the spheres of arbitrary dimension, as types, simply by specifying their "connectivity" properties. For example, the topological circle consists of a base point and a path starting and ending at that point, and the topological disk may be thought of as two half circles that are connected by a higher path that "fills in" the interior of the circle. Because of their higher path structure, such types are not sets, and neither are constructions such as the product of two circles.

    The univalence axiom implies that an equivalence between types (an "isomorphism up to isomorphism") determines a path in a universe containing such types. Since two types can be equivalent in many ways (for example, there can be distinct bijections between two sets), univalence gives rise to types that are not sets, but rather are of a higher level, or dimension. The univalence axiom is mathematically efficient because it allows us to treat equivalent types as equal, and hence interchangeable in all contexts. In informal settings such identifications are often made by convention; in formal homotopy type theory such identifications are true equations.
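For reference, the univalence axiom mentioned above has a compact statement: for types A and B in a universe U, the canonical map from paths in U to equivalences is itself an equivalence, so identity of types coincides with equivalence of types.

```latex
% Univalence (informally): for A, B : U, the canonical map
%   idtoeqv : (A =_{U} B) \to (A \simeq B)
% is an equivalence, hence
(A =_{U} B) \;\simeq\; (A \simeq B)
```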

    If you think data types are semantic primitives with universal meaning/understanding, feel free to ignore this posting.

    Data types can be usefully treated “as though” they are semantic primitives, but mistaking convenience for truth can be expensive.

    The never ending cycle of enterprise level ETL for example. Even when it ends well it is expensive.

    And there are all the cases where ETL or data integration don’t end well.

    Homotopy Type Theory may not be the answer to those problems, but our current practices are known to not work.

    Why not bet on an uncertain success versus the certainty of expense and near-certainty of failure?

    To fairly compare…

    Filed under: Benchmarks,Graphs,Linked Data — Patrick Durusau @ 3:27 pm

LDBC D3.3.1: Use case analysis and choke point analysis. Coordinators: Alex Averbuch and Norbert Martinez.

    From the introduction:

    Due largely to the Web, an exponentially increasing amount of data is generated each year. Moreover, a significant fraction of this data is unstructured, or semi-structured at best. This has meant that traditional data models are becoming increasingly restrictive and unsuitable for many application domains – the relational model in particular has been criticized for its lack of semantics. These trends have driven development of alternative database technologies, including graph databases.

    The proliferation of applications dealing with complex networks has resulted in an increasing number of graph database deployments. This, in turn, has created demand for a means by which to compare the characteristics of different graph database technologies, such as: performance, data model, query expressiveness, as well as general functional and non-functional capabilities.

    To fairly compare these technologies it is essential to first have a thorough understanding of graph data models, graph operations, graph datasets, graph workloads, and the interactions between all of these. (emphasis added)

In this rather brief report, the LDBC (Linked Data Benchmark Council) gives a thumbnail sketch of the varieties of graphs, graph databases, and graph query languages, along with some summary use cases. To their credit, unlike some graph vendors, they do understand what is meant by a hyperedge (see p. 8).

    On the other hand, they retreat from the full generality of graph models to “directed attributed multigraphs,” before evaluating any of the graph alternatives. (also at p.8)

    It may be a personal prejudice but I would prefer to see fuller development of use cases and requirements before restricting the solution space.

    Particularly since new developments in graph theory and/or technology are a weekly if not daily occurrence.

    Premature focus on “unsettled” technology could result in a benchmark for yesterday’s version of graph technology.

    Interesting I suppose but not terribly useful.

    December 3, 2013

    Benchmarking Honesty

    Filed under: Benchmarks,FoundationDB — Patrick Durusau @ 6:43 pm

    Benchmarking Honesty by David Rosenthal.

    From the post:

    Recently, someone brought to my attention a blog post that benchmarks FoundationDB and another responding to the benchmark itself. I’ll weigh in: I think this benchmark is unfair because it gives people too good an impression of FoundationDB’s performance. In the benchmark, 100,000 items are loaded into each database/storage engine in both sequential and random patterns. In the case of FoundationDB and other sophisticated systems like SQL Server, you can see that the performance of random and sequential writes are virtually the same; this points to the problem. In the case of FoundationDB, an “absorption” mechanism is able to cope with bursts of writes (on the order of a minute or two, usually) without actually updating the real data structures holding the data (i.e. only persisting a log to disk, and making changes available to read from RAM). Hence, the published test results are giving FoundationDB an unfair advantage. I think that you will find that if you sustain this workload for a longer time, like in real-world usages, FoundationDB might be significantly slower.

    If you don’t recognize the name, David Rosenthal is the co-founder and CEO of FoundationDB.

    What?

    A CEO saying a benchmark favorable to his product is “unfair?”

    Odd as it may sound, I think there is an honest CEO on the loose.

    Statistically speaking, it had to happen eventually. 😉

    Seriously, high marks to David Rosenthal. We need more CEOs, engineers and presenters with a sense of honesty.

    Annual Christmas Tree Lecture (Knuth)

    Filed under: CS Lectures,Graphs,Mathematics,Trees — Patrick Durusau @ 6:23 pm

    Computer Musing by Professor Donald E. Knuth.

    From the webpage:

    Professor Knuth will present his 19th Annual Christmas Tree Lecture on Monday, December 9, 2013 at 7:00 pm in NVIDIA Auditorium in the new Huang Engineering Center, 475 Via Ortega, Stanford University (map). The topic will be Planar Graphs and Ternary Trees. There is no admission charge or registration required. For those unable to come to Stanford, register for the live webinar broadcast.

    No doubt heavy sledding but what better way to prepare for the holiday season?

    Date: Monday, December 9, 2013

    Time:
    7 p.m. – 8 p.m. Pacific
    10 p.m. – 11 p.m. Eastern

    Scout [NLP, Move up from Twitter Feeds to Court Opinions]

    Filed under: Government,Government Data,Law,Law - Sources — Patrick Durusau @ 5:01 pm

    Scout

    From the about page:

Scout is a free service that provides daily insight into how our laws and regulations are shaped in Washington, DC and our state capitols.

    These days, you can receive electronic alerts to know when a company is in the news, when a TV show is scheduled to air or when a sports team wins. Now, you can also be alerted when our elected officials take action on an issue you care about.

    Scout allows anyone to subscribe to customized email or text alerts on what Congress is doing around an issue or a specific bill, as well as bills in the state legislature and federal regulations. You can also add external RSS feeds to complement a Scout subscription, such as press releases from a member of Congress or an issue-based blog.

    Anyone can create a collection of Scout alerts around a topic, for personal organization or to make it easy for others to easily follow a whole topic at once.

    Researchers can use Scout to see when Congress talks about an issue over time. Members of the media can use Scout to track when legislation important to their beat moves ahead in Congress or in state houses. Non-profits can use Scout as a tool to keep tabs on how federal and state lawmakers are making policy around a specific issue.

    Early testing of Scout during its open beta phase alerted Sunlight and allies in time to successfully stop an overly broad exemption to the Freedom of Information Act from being applied to legislation that was moving quickly in Congress. Read more about that here.

    Thank you to the Stanton Foundation, who contributed generous support to Scout’s development.

    What kind of alerts?

If your manager suggests a Twitter feed for testing NLP, classification, sentiment, etc. code, ask to use a federal court (U.S.) opinion feed instead.

    Not all data is written in one hundred and forty (140) character chunks. 😉

    PS: Be sure to support/promote the Sunlight Foundation for making this data available.

    Project Tycho:… [125 Years of Disease Records]

    Filed under: Health care,Medical Informatics — Patrick Durusau @ 4:33 pm

    Project Tycho: Data for Health

    From the webpage:

After four years of data digitization and processing, the Project Tycho™ Web site provides open access to newly digitized and integrated data from the entire 125-year history of United States weekly nationally notifiable disease surveillance data since 1888. These data can now be used by scientists, decision makers, investors, and the general public for any purpose. The Project Tycho™ aim is to advance the availability and use of public health data for science and decision making in public health, leading to better programs and more efficient control of diseases.

Three levels of data have been made available: Level 1 data include data that have been standardized for specific analyses, Level 2 data include standardized data that can be used immediately for analysis, and Level 3 data are raw data that cannot be used for analysis without extensive data management. See the video tutorial.

An interesting factoid concerning disease reporting in the United States, circa 1917: influenza was not a reportable disease. (The Great Influenza by John Barry.)

    I am curious about the Level 3 data.

Mostly in terms of how much “data management” would be needed to make it useful.

    Could be a window into the data management required to unify medical records in the United States.

    Or simply a way to practice your data management skills.
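If you do grab a Level 3 file, a first pass usually looks something like the sketch below: normalize names, fix types, and see how much is missing. The column names are hypothetical; match them to whatever the raw export actually contains:

```python
import pandas as pd

# Hypothetical raw export; real Level 3 files will need their own column mapping.
raw = pd.read_csv("tycho_level3_raw.csv", dtype=str)

# Normalize column names, then coerce the obvious types.
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]
raw["cases"] = pd.to_numeric(raw["cases"], errors="coerce")
raw["week_start"] = pd.to_datetime(raw["week_start"], errors="coerce")

# How bad is it? Share of unusable values per column.
print(raw.isna().mean().sort_values(ascending=False))
```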

    Using Hive to interact with HBase, Part 2

    Filed under: HBase,Hive — Patrick Durusau @ 4:09 pm

    Using Hive to interact with HBase, Part 2 by Nick Dimiduk.

    From the post:

    This is the second of two posts examining the use of Hive for interaction with HBase tables. This is a hands-on exploration so the first post isn’t required reading for consuming this one. Still, it might be good context.

    “Nick!” you exclaim, “that first post had too many words and I don’t care about JIRA tickets. Show me how I use this thing!”

This post is exactly that: a concrete, end-to-end example of consuming HBase over Hive. The whole mess was tested to work on a tiny little 5-node cluster running HDP-1.3.2, which means Hive 0.11.0 and HBase 0.94.6.1.

    If you learn from concrete examples and then feel your way further out, you will love this post!

    ISWC, Sydney 2013 (videos)

    Filed under: Conferences,Semantic Web — Patrick Durusau @ 3:55 pm

    12th International Semantic Web Conference (ISWC), Sydney 2013

    From the webpage:

ISWC 2013 is the premier international forum for the Semantic Web / Linked Data Community. Here, scientists, industry specialists, and practitioners meet to discuss the future of practical, scalable, user-friendly, and game changing solutions.

    Detailed information can be found at the ISWC 2013 website.

    I count thirty-six (36) videos (including two tutorials).

    Some of them are fairly short so suitable for watching while standing in checkout lines. 😉

    Bokeh

    Filed under: Graphics,Python,Visualization — Patrick Durusau @ 3:42 pm

    Bokeh

    From the webpage:

    Bokeh is a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data to thin clients.

    For more information about the goals and direction of the project, please see the Technical Vision.

    To get started quickly, follow the Quickstart.

    Visit the source repository: https://github.com/ContinuumIO/bokeh

    Be sure to follow us on Twitter @bokehplots!

    The technical vision makes the case for Bokeh quite well:

    Photographers use the Japanese word “bokeh” to describe the blurring of the out-of-focus parts of an image. Its aesthetic quality can greatly enhance a photograph, and photographers artfully use focus to draw attention to subjects of interest. “Good bokeh” contributes visual interest to a photograph and places its subjects in context.

    In this vein of focusing on high-impact subjects while always maintaining a relationship to the data background, the Bokeh project attempts to address fundamental challenges of large dataset visualization:

    • How do we look at all the data?
      • What are the best perceptual approaches to honestly and accurately represent the data to domain experts and SMEs so they can apply their intuition to the data?
      • Are there automated approaches to accurately reduce large datasets so that outliers and anomalies are still visible, while we meaningfully represent baselines and backgrounds? How can we do this without “washing away” all the interesting bits during a naive downsampling?
      • If we treat the pixels and topology of pixels on a screen as a bottleneck in the I/O channel between hard drives and an analyst’s visual cortex, what are the best compression techniques at all levels of the data transformation pipeline?
    • How can scientists and data analysts be empowered to use visualization fluidly, not merely as an output facility or one stage of a pipeline, but as an entire mode of engagement with data and models?
      • Are language-based approaches for expressing mathematical modeling and data transformations the best way to compose novel interactive graphics?
      • What data-oriented interactions (besides mere linked brushing/selection) are useful for fluid, visually-enable analysis?

Not likely any time soon, but posting data for scientific research in ways that enable interactive analysis by readers (and snapshotting their results) could take debates over data and analysis to a whole new level.

    As opposed to debating dots on a graph not of your own making and where alternative analyses are not available.
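To get a feel for the library itself, here is a minimal sketch using Bokeh’s plotting interface; the API has shifted between releases, so check the Quickstart for the version you install:

```python
from bokeh.plotting import figure, output_file, show

x = list(range(200))
y = [i ** 0.5 for i in x]

output_file("sqrt.html")  # standalone HTML output, viewable in any browser
p = figure(title="Square roots", x_axis_label="x", y_axis_label="sqrt(x)")
p.line(x, y, line_width=2)
show(p)  # opens the page in a browser
```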

    Of Algebirds, Monoids, Monads, …

    Filed under: BigData,Data Analysis,Functional Programming,Hadoop,Scala,Storm — Patrick Durusau @ 2:50 pm

    Of Algebirds, Monoids, Monads, and Other Bestiary for Large-Scale Data Analytics by Michael G. Noll.

    From the post:

    Have you ever asked yourself what monoids and monads are, and particularly why they seem to be so attractive in the field of large-scale data processing? Twitter recently open-sourced Algebird, which provides you with a JVM library to work with such algebraic data structures. Algebird is already being used in Big Data tools such as Scalding and SummingBird, which means you can use Algebird as a mechanism to plug your own data structures – e.g. Bloom filters, HyperLogLog – directly into large-scale data processing platforms such as Hadoop and Storm. In this post I will show you how to get started with Algebird, introduce you to monoids and monads, and address the question why you get interested in those in the first place.

    Goal of this article

The main goal of this article is to spark your curiosity and motivation for Algebird and the concepts of monoids, monads, and category theory in general. In other words, I want to address the questions “What’s the big deal? Why should I care? And how can these theoretical concepts help me in my daily work?”

    You can call this a “blog post” but I rarely see blog posts with a table of contents! 😉

    The post should come with a warning: May require substantial time to read, digest, understand.

    Just so you know, I was hooked by this paragraph early on:

    So let me use a different example because adding Int values is indeed trivial. Imagine that you are working on large-scale data analytics that make heavy use of Bloom filters. Your applications are based on highly-parallel tools such as Hadoop or Storm, and they create and work with many such Bloom filters in parallel. Now the money question is: How do you combine or add two Bloom filters in an easy way?

    Are you motivated?
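The Bloom filter question has a satisfying answer precisely because a Bloom filter’s bit set under bitwise OR forms a monoid: combining is associative and the empty filter is the identity, so partial filters built on different machines can be merged in any order. A toy Python sketch of that structure (not Algebird, and deliberately simplified to fixed hash functions):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter whose union (bitwise OR) forms a monoid."""
    def __init__(self, size=1024, num_hashes=3, bits=0):
        self.size, self.num_hashes, self.bits = size, num_hashes, bits

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits & (1 << pos) for pos in self._positions(item))

    def plus(self, other):
        """Monoid 'plus': associative, with the empty filter as identity."""
        return BloomFilter(self.size, self.num_hashes, self.bits | other.bits)

# Filters built independently (say, on different Hadoop workers) combine cleanly.
left, right = BloomFilter(), BloomFilter()
left.add("alice"); right.add("bob")
merged = left.plus(right)
print("alice" in merged, "bob" in merged)  # True True (modulo false positives)
```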

    I first saw this in a tweet by CompSciFact.

    Five Stages of Data Grief

    Filed under: Data,Data Quality — Patrick Durusau @ 2:15 pm

    Five Stages of Data Grief by Jeni Tennison.

    From the post:

    As organisations come to recognise how important and useful data could be, they start to think about using the data that they have been collecting in new ways. Often data has been collected over many years as a matter of routine, to drive specific processes or sometimes just for the sake of it. Suddenly that data is repurposed. It is probed, analysed and visualised in ways that haven’t been tried before.

    Data analysts have a maxim:

    If you don’t think you have a quality problem with your data, you haven’t looked at it yet.

    Every dataset has its quirks, whether it’s data that has been wrongly entered in the first place, automated processing that has introduced errors, irregularities that come from combining datasets into a consistent structure or simply missing information. Anyone who works with data knows that far more time is needed to clean data into something that can be analysed, and to understand what to leave out, than in actually performing the analysis itself. They also know that analysis and visualisation of data will often reveal bugs that you simply can’t see by staring at a spreadsheet.

But for the people who have collected and maintained such data — or more frequently their managers, who don’t work with the data directly — this realisation can be a bit of a shock. In our last ODI Board meeting, Sir Tim Berners-Lee suggested that what data curators need to go through is something like the five stages of grief described by the Kübler-Ross model.

    Jeni covers the five stages of grief from a data quality standpoint and offers a sixth stage. (No spoilers follow, read her post.)

    Correcting input/transformation errors is one level of data cleaning.

    But the near-collapse of HealthCare.gov shows how streams of “clean” data can combine into a large pool of “dirty” data.

Every contributor supplied “clean” data, but when it was combined with other “clean” data, confusion was the result.

Keeping data “clean” is an ongoing process at two separate levels:

    Level 1: Traditional correction of input/transformation errors (as per Jeni).

    Level 2: Preparation of data for transformation into “clean” data for new purposes.

    The first level is familiar.

    The second we all know as ad-hoc ETL.

Enough knowledge is gained to make a transformation work, but that knowledge isn’t passed on with the data or shared more generally.

    Or as we all learned from television: “Lather, rinse, repeat.”

    A good slogan if you are trying to maximize sales of shampoo, but a wasteful one when describing ETL for data.

    What if data curators captured the knowledge required for ETL, making every subsequent ETL less resource intensive and less error prone?

    I think that would qualify as data cleaning.

    You?
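One low-tech way to capture that knowledge is to make the mapping itself a data artifact rather than logic buried in a throwaway script, so the next transformation starts from a record instead of from scratch. A sketch of what such a reusable mapping record might look like; the fields are illustrative, not a standard:

```python
import json

# Illustrative mapping record: what was transformed, how, and why.
mapping = {
    "source_field": "cust_dob",
    "target_field": "date_of_birth",
    "transform": "parse %d/%m/%Y, reject dates after today",
    "rationale": "source system stores UK-style dates as free text",
    "author": "data curator",        # placeholder
    "last_verified": "2013-12-03",
}

# Append to a running log of mappings that travels with the data.
with open("mappings.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(mapping) + "\n")
```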

    Announcing Open LEIs:…

    Filed under: Business Intelligence,Identifiers,Open Data — Patrick Durusau @ 11:04 am

    Announcing Open LEIs: a user-friendly interface to the Legal Entity Identifier system

    From the post:

    Today, OpenCorporates announces a new sister website, Open LEIs, a user-friendly interface on the emerging Global Legal Entity Identifier System.

    At this point many, possibly most, of you will be wondering: what on earth is the Global Legal Entity Identifier System? And that’s one of the reasons why we built Open LEIs.

    The Global Legal Entity Identifier System (aka the LEI system, or GLEIS) is a G20/Financial Stability Board-driven initiative to solve the issues of identifiers in the financial markets. As we’ve explained in the past, there are a number of identifiers out there, nearly all of them proprietary, and all of them with quality issues (specifically not mapping one-to-one with legal entities). Sometimes just company names are used, which are particularly bad identifiers, as not only can they be represented in many ways, they frequently change, and are even reused between different entities.

This problem is particularly acute in the financial markets, meaning that regulators, banks, and market participants often don’t know who they are dealing with, affecting everything from the ability to process trades automatically to performing credit calculations to understanding systematic risk.

    The LEI system aims to solve this problem, by providing permanent, IP-free, unique identifiers for all entities participating in the financial markets (not just companies but also municipalities who issue bonds, for example, and mutual funds whose legal status is a little greyer than companies).

    The post cites five key features for Open LEIs:

    1. Search on names (despite slight misspellings) and addresses
    2. Browse the entire (100,000 record) database and/or filter by country, legal form, or the registering body
    3. A permanent URL for each LEI
    4. Links to OpenCorporate for additional data
    5. Data is available as XML or JSON

    As the post points out, the data isn’t complete but dragging legal entities out into the light is never easy.

    Use this resource and support it if you are interested in more and not less financial transparency.

