Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 7, 2013

Learning 30 Technologies in 30 Days…

Filed under: Design,Javascript,OpenShift,Programming — Patrick Durusau @ 9:32 am

Learning 30 Technologies in 30 Days: A Developer Challenge by Shekhar Gulati.

From the post:

I have taken a challenge wherein I will learn a new technology every day for a month. The challenge started on October 29, 2013. Below is the list of technologies I've started learning and blogging about. After my usual work day, I will spend a couple of hours learning a new technology and one hour writing about it. The goal of this activity is to get familiar with many of the new technologies being used in the developer community. My main focus is on JavaScript and related technologies. I'll also explore other technologies that interest me like Java, for example. I may spend multiple days on the same technology, but I will pick a new topic each time within that technology. Wherever it makes sense, I will try to show how it can work with OpenShift. I am expecting it to be fun and a great learning experience.

The homepage of the challenge currently points to:

  1. October 29, 2013 – Day 1: Bower—Manage Your Client Side Dependencies. The first day talks about Bower and how you can use it.

  2. October 30, 2013 – Day 2: AngularJS—Getting My Head Around AngularJS. This blog talks about how you can get started with AngularJS. It is a very basic blog and talks about how to build a simple bookshop application.

  3. October 31, 2013 – Day 3: Flask—Instant Python Web Development with Python and OpenShift. This blog introduces Flask, a micro framework for web development in Python. It also reviews the “Instant Flask Web Development” book and ports the sample application to OpenShift.

  4. November 1, 2013 – Day 4: PredictionIO—How to Build a Blog Recommender. This blog talks about how you can use PredictionIO to build a blog recommender.

  5. November 2, 2013 — Day 5: GruntJS—Let Someone Else Do My Tedious Repetitive Tasks. This blog talks about how we can let GruntJS perform tedious tasks on our behalf. It also covers how we can use grunt-markdown plugin to convert Markdown to HTML5.

  6. November 3, 2013 — Day 6: Grails—Rapid JVM Web Development with Grails and OpenShift. This blog talks about how we can use Grails to build a web application and then deploy it to OpenShift.

  7. November 4, 2013 – Day 7: GruntJS LiveReload—Take Productivity to Another Level. This blog talks about how we can use the GruntJS watch plugin and its live reload functionality to achieve extreme productivity.

  8. November 5, 2013 – Day 8: Harp—The Modern Static Web Server. This blog post discusses the Harp web server and how to install and use it.

  9. November 6, 2013 – Day 9: TextBlob—Finding Sentiments in Text.

I encountered the challenge via the Day 4: PredictionIO—How to Build a Blog Recommender post.

The more technologies you know, the broader your options for creating and delivering topic map content to users.
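The Day 3 Flask entry is a good reminder of how little code these micro frameworks demand. Here is a minimal, self-contained Flask sketch of my own; the route and the response are illustrative, not code from Shekhar's post:

```python
# Minimal Flask application sketch; assumes Flask is installed (pip install flask).
# The route and response below are illustrative only, not taken from the Day 3 post.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/")
def index():
    # Return a small JSON payload so the endpoint is easy to check with curl or a browser.
    return jsonify(message="Hello from a minimal Flask app")

if __name__ == "__main__":
    # Flask's built-in development server; fine for experiments, not for production.
    app.run(host="127.0.0.1", port=5000, debug=True)
```

Run it and visit http://127.0.0.1:5000/ to see the JSON response.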

Musicbrainz in Neo4j – Part 1

Filed under: Cypher,Graphs,Music,Music Retrieval,Neo4j — Patrick Durusau @ 9:06 am

Musicbrainz in Neo4j – Part 1 by Paul Tremberth.

From the post:

What is MusicBrainz?

Quoting Wikipedia, MusicBrainz is an “open content music database [that] was founded in response to the restrictions placed on the CDDB.(…) MusicBrainz captures information about artists, their recorded works, and the relationships between them.”

Anyone can browse the database at http://musicbrainz.org/. If you create an account with them you can contribute new data or fix existing records' details, track lengths, send in cover art scans of your favorite albums, etc. Edits are peer reviewed, and any member can vote up or down. There are a lot of similarities with Wikipedia.

With this first post, we want to show you how to import the Musicbrainz data into Neo4j for some further analysis with Cypher in the second post. See below for what we will end up with:

MusicBrainz data

MusicBrainz currently has around 1000 active users, nearly 800,000 artists, 75,000 record labels, around 1,200,000 releases, more than 12,000,000 tracks, and just under 2,000,000 URLs for these entities (Wikipedia pages, official homepages, YouTube channels etc.). Daily fixes by the community make their data probably the freshest and most accurate on the web.
You can check the current numbers here and here.

This rocks!

Interesting data, a walk-through of how to load the data into Neo4j, and the promise of more interesting activities to follow.
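Once the import described in the post is in place, queries become the fun part. Here is a hedged sketch of running Cypher from Python against the Neo4j 2.x transactional HTTP endpoint; the Artist/Release labels and the RELEASED relationship type are my assumptions, not necessarily the names the import uses:

```python
# Hedged sketch: running a Cypher query over a MusicBrainz import in Neo4j 2.x
# via the transactional HTTP endpoint. The Artist/Release labels and the RELEASED
# relationship type are assumptions for illustration; the post's import may differ.
import json
import requests

CYPHER_ENDPOINT = "http://localhost:7474/db/data/transaction/commit"

payload = {
    "statements": [{
        "statement": (
            "MATCH (a:Artist {name: {name}})-[:RELEASED]->(r:Release) "
            "RETURN r.title AS title LIMIT 10"
        ),
        "parameters": {"name": "Radiohead"},
    }]
}

resp = requests.post(CYPHER_ENDPOINT,
                     data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()

result = resp.json()["results"][0]
for row in result["data"]:
    print(row["row"][0])
```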

However, I urge caution on showing this to family members. 😉

You may wind up scripting daily data updates and teaching Cypher to family members and no doubt their friends.

Up to you.

I first saw this in a tweet by Peter Neubauer.

November 6, 2013

Scalable Property and Hypergraphs in RDF

Filed under: AllegroGraph,Hypergraphs,RDF,Triplestore — Patrick Durusau @ 8:28 pm

From the description:

There is a misconception that Triple Stores are not ‘true’ graph databases because they supposedly do not support Property Graphs and Hypergraphs.

We will demonstrate that Property and Hypergraphs are not only natural to Triple Stores and RDF but allow for potentially even more powerful graph models than non-RDF approaches.

AllegroGraph defends their implementation of Triple Stores as both property and hypergraphs.
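To see why the claim is plausible, here is a minimal rdflib sketch of one common way to express a property-graph edge in plain RDF: give the edge its own node so it can carry properties. The namespace and property names are made up for illustration:

```python
# Hedged sketch: expressing a property-graph edge ("alice KNOWS bob, since 2010,
# weight 0.8") in plain RDF with rdflib, by giving the edge its own node. The
# namespace and property names are invented; AllegroGraph may model this differently.
from rdflib import Graph, Namespace, Literal, BNode
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/")
g = Graph()

edge = BNode()  # the edge gets its own identity, so it can carry properties
g.add((edge, RDF.type, EX.Knows))
g.add((edge, EX.source, EX.alice))
g.add((edge, EX.target, EX.bob))
g.add((edge, EX.since, Literal(2010, datatype=XSD.integer)))
g.add((edge, EX.weight, Literal(0.8, datatype=XSD.double)))

print(g.serialize(format="turtle"))
```

The same move extends to hyperedges: an edge node can point at three or more participants.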

This is the second story (see also A Letter Regarding Native Graph Databases) I have heard in two days based on an unnamed vendor trash-talking other graph databases.

Are graph databases catching on enough for that kind of marketing effort?

BTW, AllegroGraph does have a Free Server Edition Download.

Limited to 5 million triples but that should capture your baseball card collection or home recipe book. 😉

elasticsearch 1.0.0.beta1 released

Filed under: ElasticSearch,Lucene,Search Engines,Searching — Patrick Durusau @ 8:04 pm

elasticsearch 1.0.0.beta1 released by Clinton Gormley.

From the post:

Today we are delighted to announce the release of elasticsearch 1.0.0.Beta1, the first public release on the road to 1.0.0. The countdown has begun!

You can download Elasticsearch 1.0.0.Beta1 here.

In each beta release we will add one major new feature, giving you the chance to try it out, to break it, to figure out what is missing and to tell us about it. Your use cases, ideas and feedback are essential to making Elasticsearch awesome.

The main feature we are showcasing in this first beta is Distributed Percolation.

WARNING: This is a beta release – it is not production ready, features are not set in stone and may well change in the next version, and once you have made any changes to your data with this release, it will no longer be readable by older versions!

distributed percolation

For those of you who aren’t familiar with percolation, it is “search reversed”. Instead of running a query to find matching docs, percolation allows you to find queries which match a doc. Think of people registering alerts like: tell me when a newspaper publishes an article mentioning “Elasticsearch”.

Percolation has been supported by Elasticsearch for a long time. In the current implementation, queries are stored in a special _percolator index which is replicated to all nodes, meaning that all queries exist on all nodes. The idea was to have the queries alongside the data.

But users are using it at a scale that we never expected, with hundreds of thousands of registered queries and high indexing rates. Having all queries on every node just doesn’t scale.

Enter Distributed Percolation.

In the new implementation, queries are registered under the special .percolator type within the same index as the data. This means that queries are distributed along with the data, and percolation can happen in a distributed manner across potentially all nodes in the cluster. It also means that an index can be made as big or small as required. The more nodes you have the more percolation you can do.

After reading the news release I understand why Twitter traffic on the elasticsearch release surged today. 😉

A new major feature with each beta release? That should attract some attention.

Not to mention “distributed percolation.”

Getting closer to a result being the “result” at X time on the system clock.
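If you want to try distributed percolation against a local 1.0.0.Beta1 node, here is a hedged sketch using Python's requests library; the index, type, and field names are illustrative:

```python
# Hedged sketch of percolation against an Elasticsearch 1.0 beta node on localhost:9200.
# The index, type, and field names are illustrative; adjust them to your own mapping.
import json
import requests

BASE = "http://localhost:9200"

# Register a query under the special .percolator type of a normal data index.
alert = {"query": {"match": {"body": "elasticsearch"}}}
requests.put(BASE + "/articles/.percolator/elasticsearch-alert",
             data=json.dumps(alert))

# Percolate a document: ask which registered queries match it.
doc = {"doc": {"body": "A newspaper article mentioning Elasticsearch and Lucene"}}
resp = requests.get(BASE + "/articles/article/_percolate",
                    data=json.dumps(doc))

print(resp.json().get("matches"))   # expect the 'elasticsearch-alert' query id here
```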

Topic Map Conference?

Filed under: Conferences,Topic Maps — Patrick Durusau @ 7:51 pm

How To Host a Conference on Google Hangouts on Air by Roger Peng.

Aki Kivelä recently asked about where topic mappers could get together since there are no topic map specific conferences at the moment.

I haven’t used Google Hangouts for a conference but from Roger’s description, it has potential.

Any experience positive or negative with Google Hangouts?

Not like a face to face conference but it is certainly cheaper for all concerned.

Thoughts, comments?

Introduction to Information Retrieval

Filed under: Classification,Indexing,Information Retrieval,Probabilistic Models,Searching — Patrick Durusau @ 5:10 pm

Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.

A bit dated now (2008) but the underlying principles of information retrieval remain the same.

I have a hard copy but the additional materials and ability to cut-n-paste will make this a welcome resource!

We’d be pleased to get feedback about how this book works out as a textbook, what is missing, or covered in too much detail, or what is simply wrong. Please send any feedback or comments to: informationretrieval (at) yahoogroups (dot) com

Online resources

Apart from small differences (mainly concerning copy editing and figures), the online editions should have the same content as the print edition.

The following materials are available online.

Information retrieval resources

A list of information retrieval resources is also available.

Introduction to Information Retrieval: Table of Contents

Front matter (incl. table of notations)

01 Boolean retrieval
02 The term vocabulary & postings lists
03 Dictionaries and tolerant retrieval
04 Index construction
05 Index compression
06 Scoring, term weighting & the vector space model
07 Computing scores in a complete search system
08 Evaluation in information retrieval
09 Relevance feedback & query expansion
10 XML retrieval
11 Probabilistic information retrieval
12 Language models for information retrieval
13 Text classification & Naive Bayes
14 Vector space classification
15 Support vector machines & machine learning on documents
16 Flat clustering
17 Hierarchical clustering
18 Matrix decompositions & latent semantic indexing
19 Web search basics
20 Web crawling and indexes
21 Link analysis

Bibliography & Index

Each chapter is available online in both PDF and HTML, and a BibTeX file of the bibliography can be downloaded from the book site.

Taming Galactus [Entity Fluidity, Complex Bibliography, Hyperedges]

Filed under: Biography,Graphs,Neo4j,Time — Patrick Durusau @ 2:57 pm

Taming Galactus by Peter Olson.

From the description:

Marvel Entertainment's Peter Olson talks about how Marvel uses graph theory and the emerging NoSQL space to understand, model and ultimately represent the uncanny Marvel Universe.

Marvel Comics by any other name. 😉

From the slides:

  • 70+ Years of Stories
  • 30,000+ Comic Issues
  • 5,000+ Creators
  • 8,000+ Named Characters
  • 32 Movies (Marvel Studios and Licensed Movies)
  • 30+ Television Series
  • 100+ Video Games

Peter’s question: “How do you model a world where anything can happen?”

Main problems addressed are:

  • Entity fluidity, that is entities changing over time (sort of like people tracked by the NSA).
  • Complex bibliography, that is publication order isn’t story order. Not to mention that characters “reboot.”

Marvel uses graph databases.

Using hyperedges for modeling.

For example, the relationship between a character and the person who plays the character is represented by a hyperedge that includes a node for the moment when that relationship is true.

Very good illustration of why hyperedges are useful.
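For the Neo4j experiments mentioned below, where true hyperedges are not available, the usual workaround is to reify the hyperedge as a node of its own. A hedged sketch; labels, properties, and values are my own illustration, not Marvel's schema:

```python
# Hedged sketch: emulating the character/actor/moment hyperedge in Neo4j (which only
# has binary edges) by reifying the hyperedge as a Portrayal node. Labels, properties,
# and the endpoint are illustrative, not Marvel's actual schema.
import json
import requests

statement = """
MERGE (c:Character {name: 'Iron Man'})
MERGE (p:Person {name: 'Robert Downey Jr.'})
MERGE (m:Moment {year: 2008})
CREATE (h:Portrayal)
CREATE (h)-[:CHARACTER]->(c)
CREATE (h)-[:PLAYED_BY]->(p)
CREATE (h)-[:DURING]->(m)
"""

requests.post("http://localhost:7474/db/data/transaction/commit",
              data=json.dumps({"statements": [{"statement": statement}]}),
              headers={"Content-Type": "application/json"})
```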

Makes you wonder.

If a comic book company is using hypergraph techniques with its data, why are governments sharing data with data dumpster methods?

Like the data dumpster where Snowden obtained his supply of documents.

BTW, for experiments with graphs, sans the hyperedges, Marvel is using Neo4j.

November 5, 2013

[A]ggregation is really hard

Filed under: Aggregation,Merging,Topic Maps — Patrick Durusau @ 7:07 pm

The quote in full reads:

The reason that big data hasn’t taken economics by storm is that aggregation is really hard.

As near as I can tell, the quote originates from Klint Finley.

Let’s assume aggregation is “really hard,” which of necessity means it is also expensive.

It might be nice, useful, interesting, etc., to have “all” the available data for a subject organized together, but what is the financial (or other) incentive to make it so?

And that incentive has to outweigh the expense of reaching that goal, by some stated margin.

Suggestions on how to develop such a metric?

Email Indexing Using Cloudera Search and HBase

Filed under: Cloudera,HBase,Solr — Patrick Durusau @ 6:38 pm

Email Indexing Using Cloudera Search and HBase by Jeff Shmain.

From the post:

In my previous post you learned how to index email messages in batch mode, and in near real time, using Apache Flume with MorphlineSolrSink. In this post, you will learn how to index emails using Cloudera Search with Apache HBase and Lily HBase Indexer, maintained by NGDATA and Cloudera. (If you have not read the previous post, I recommend you do so for background before reading on.)

Which near-real-time method to choose, HBase Indexer or Flume MorphlineSolrSink, will depend entirely on your use case, but below are some things to consider when making that decision:

  • Is HBase an optimal storage medium for the given use case?
  • Is the data already ingested into HBase?
  • Is there any access pattern that will require the files to be stored in a format other than HFiles?
  • If HBase is not currently running, will there be enough hardware resources to bring it up?

There are two ways to configure Cloudera Search to index documents stored in HBase: to alter the configuration files directly and start Lily HBase Indexer manually or as a service, or to configure everything using Cloudera Manager. This post will focus on the latter, because it is by far the easiest way to enable Search on HBase — or any other service on CDH, for that matter.

This rocks!

Including the reminder to fit the solution to your requirements, not the other way around.

The phrase “…near real time…” reminds me that HBase can operate in “…near real time…” but no analyst using HBase can.

Think about it. A search result comes back, the analyst reads it, perhaps compares it to their memory of other results and/or looks for other results to make the comparison. Then the analyst has to decide what if anything the results mean in a particular context and then communicate those results to others or take action based on those results.

That doesn’t sound even close to “…near real time…” to me.

You?
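For readers who want to poke at the result, Cloudera Search exposes a standard Solr HTTP API, so a quick check from Python looks something like this hedged sketch; the collection and field names are assumptions, not taken from the post:

```python
# Hedged sketch: once the Lily HBase Indexer has Solr documents flowing, Cloudera Search
# is queried like any Solr collection. The collection name and field names are assumptions.
import requests

SOLR = "http://localhost:8983/solr/email_collection/select"

params = {
    "q": "body:meeting",        # hypothetical indexed field
    "fl": "id,subject,date",    # hypothetical stored fields
    "rows": 10,
    "wt": "json",
}

docs = requests.get(SOLR, params=params).json()["response"]["docs"]
for d in docs:
    print(d.get("date"), d.get("subject"))
```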

Implementations of Data Catalog Vocabulary

Filed under: DCAT,Government Data,Linked Data — Patrick Durusau @ 5:45 pm

Implementations of Data Catalog Vocabulary

From the post:

The Government Linked Data (GLD) Working Group today published the Data Catalog Vocabulary (DCAT) as a Candidate Recommendation. DCAT allows governmental and non-governmental data catalogs to publish their entries in a standard machine-readable format so they can be managed, aggregated, and presented in other catalogs.

Originally developed at DERI, DCAT has evolved with input from a variety of stakeholders and is now stable and ready for widespread use. If you have a collection of data sources, please consider publishing DCAT metadata for it, and if you run a data catalog or portal, please consider making use of DCAT metadata you find. The Working Group is eager to receive comments and reports of use at public-gld-comments@w3.org and is maintaining an Implementation Report.
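To make “standard machine-readable format” concrete, here is a minimal DCAT sketch using rdflib; the dataset title, URLs, and file are hypothetical placeholders:

```python
# Minimal DCAT sketch with rdflib: describing one dataset and one CSV distribution.
# The dataset URI, title, and download link are hypothetical placeholders.
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, DCTERMS

DCAT = Namespace("http://www.w3.org/ns/dcat#")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dct", DCTERMS)

dataset = URIRef("http://data.example.gov/dataset/spending-2013")
dist = URIRef("http://data.example.gov/dataset/spending-2013/csv")

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Agency spending, fiscal year 2013")))
g.add((dataset, DCAT.distribution, dist))

g.add((dist, RDF.type, DCAT.Distribution))
g.add((dist, DCAT.downloadURL, URIRef("http://data.example.gov/files/spending-2013.csv")))
g.add((dist, DCTERMS.format, Literal("text/csv")))

print(g.serialize(format="turtle"))
```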

If you know anyone in the United States government, please suggest this to them.

The more time the U.S. government spends on innocuous data, the less time it has to spy on its citizens and the citizens and governments of other countries.

I say innocuous data because I have yet to see any government release information that would discredit the current regime.

Wasn’t true for the Pentagon Papers, the Watergate tapes or the Snowden releases.

Can you think of any voluntary release of data by any government that discredited a current regime?

The reason for secrecy isn’t to protect techniques or sources.

Guess whose incompetence would be exposed by transparency?

Implementations of RDF 1.1?

Filed under: RDF,Semantic Web — Patrick Durusau @ 5:31 pm

W3C Invites Implementations of five Candidate Recommendations for version 1.1 of the Resource Description Framework (RDF)

From the post:

The RDF Working Group today published five Candidate Recommendations for version 1.1 of the Resource Description Framework (RDF), a widespread and stable technology for data interoperability:

  • RDF 1.1 Concepts and Abstract Syntax defines the basics which underlie all RDF syntaxes and systems. It provides for general data interoperability.
  • RDF 1.1 Semantics defines the precise semantics of RDF data, supporting use with a wide range of “semantic” or “knowledge” technologies.
  • RDF 1.1 N-Triples defines a simple line-oriented syntax for serializing RDF data. N-Triples is a minimalist subset of Turtle.
  • RDF 1.1 TriG defines an extension to Turtle (aligned with SPARQL) for handling multiple RDF Graphs in a single document.
  • RDF 1.1 N-Quads defines an extension to N-Triples for handling multiple RDF Graphs in a single document.

All of these technologies are now stable and ready to be widely implemented. Each specification (except Concepts) has an associated Test Suite and includes a link to an Implementation Report showing how various software currently fares on the tests. If you maintain RDF software, please review these specifications, update your software if necessary, and (if relevant) send in test results as explained in the Implementation Report.

RDF 1.1 is a refinement to the 2004 RDF specifications, designed to simplify and improve RDF without breaking existing deployments.
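As a small illustration of what N-Quads adds over N-Triples, here is a hedged rdflib sketch that parses two quads into two named graphs; the data is made up:

```python
# Hedged sketch: parsing RDF 1.1 N-Quads with rdflib. N-Quads adds a fourth term (the
# graph name) to each N-Triples line, which is how multiple graphs live in one document.
# The data below is invented for illustration.
from rdflib import ConjunctiveGraph

nquads = """
<http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob" <http://example.org/graph1> .
<http://example.org/bob> <http://xmlns.com/foaf/0.1/knows> <http://example.org/alice> <http://example.org/graph2> .
"""

ds = ConjunctiveGraph()
ds.parse(data=nquads, format="nquads")

for graph in ds.contexts():            # one context per named graph
    print(graph.identifier, len(graph))
```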

In case you are curious about where this lies in the W3C standards process, see: W3C Technical Report Development Process.

I never had a reason to ferret it out before but now that I did, I wanted to write it down.

Has it been ten (10) years since RDF 1.0?

I think it has been fifteen (15) since the Semantic Web was going to teach the web to sing or whatever the slogan was.

Like all legacy data, RDF will never go away; later systems will have to account for it.

Like COBOL I suppose.

A Letter Regarding Native Graph Databases

Filed under: Graphs,Marketing — Patrick Durusau @ 5:17 pm

A Letter Regarding Native Graph Databases by Matthias Broecheler.

From the post:

It’s fun to watch marketers create artificial distinctions between products that grab consumer attention. One of my favorite examples is Diamond Shreddies. Shreddies, a whole wheat cereal, has a square shape and was always displayed as such. So an ingenious advertiser at Kraft foods thought to advertise a new and better Diamond Shreddies. It’s a fun twist that got people’s attention and some consumers even proclaimed that Diamond Shreddies tasted better though they obviously ate the same old product.

Such marketing techniques are also used in the technology sector — unfortunately, at a detriment to consumers. Unlike Kraft’s playful approach, there are technical companies that attempt to “educate” engineers on artificial distinctions as if they were real and factual. An example from my domain is the use of the term native graph database. I recently learned that one graph database vendor decided to divide the graph database space into non-native (i.e. square) and native (i.e. diamond) graph databases. Obviously, non-native is boring, or slow, or simply bad and native is exciting, or fast, or simply good.

Excellent push back against vendor hype on graph databases.

As well written as it is, people influenced by graph database hype are unlikely to read it.

I suggest you read it so you can double down if you encounter a graph fraudster.

False claims about graph databases benefit the fraudster at the expense of the paradigm.

That’s not a good outcome.

Apache CouchDB 1.5.0

Filed under: CouchDB — Patrick Durusau @ 5:00 pm

Apache CouchDB 1.5.0 released by Dirkjan Ochtman.

From the post:

CouchDB is a database that completely embraces the web. Store your data with JSON documents. Access your documents with your web browser, via HTTP. Query, combine, and transform your documents with JavaScript. CouchDB works well with modern web and mobile apps. You can even serve web apps directly out of CouchDB. And you can distribute your data, or your apps, efficiently using CouchDB’s incremental replication. CouchDB supports master-master setups with automatic conflict detection.

Grab your copy here:

http://couchdb.apache.org/

Pre-built packages for Windows and OS X are available.

CouchDB 1.5.0 is a feature release, and was originally published on 2013-11-05.
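Since the description above amounts to “it is all HTTP and JSON,” a hedged Python sketch of the basic workflow looks like this; database and document names are illustrative:

```python
# Hedged sketch of CouchDB's HTTP interface with the requests library: create a database,
# store a JSON document, and read it back. Database and document names are illustrative.
import json
import requests

BASE = "http://localhost:5984"

requests.put(BASE + "/recipes")                      # create the database (409/412 if it exists)

doc = {"title": "Chili", "ingredients": ["beans", "tomatoes", "chili powder"]}
resp = requests.put(BASE + "/recipes/chili", data=json.dumps(doc))
print(resp.json())                                   # on first run: {"ok": true, "id": ..., "rev": ...}

print(requests.get(BASE + "/recipes/chili").json())  # fetch the stored document
```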

Is it just me or has the fall season of software releases been heavier this year? 😉

Exoplanets.org

Filed under: Astroinformatics,Data — Patrick Durusau @ 4:45 pm

Exoplanets.org

From the homepage:

The Exoplanet Data Explorer is an interactive table and plotter for exploring and displaying data from the Exoplanet Orbit Database. The Exoplanet Orbit Database is a carefully constructed compilation of quality, spectroscopic orbital parameters of exoplanets orbiting normal stars from the peer-reviewed literature, and updates the Catalog of nearby exoplanets.

A detailed description of the Exoplanet Orbit Database and Explorers is published here and is available on astro-ph.

In addition to the Exoplanet Data Explorer, we have also provided the entire Exoplanet Orbit Database in CSV format for a quick and convenient download here. A list of all archived CSVs is available here.

Help and documentation for the Exoplanet Data Explorer is available here. A FAQ and overview of our methodology is here, including answers to the questions “Why isn’t my favorite planet/datum in the EOD?” and “Why does site X list more planets than this one?”.

A small data set but an important one nonetheless.
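A hedged sketch of a first pass over the CSV download in pandas; the URL and the column names are my guesses and may differ from the live file:

```python
# Hedged sketch: pulling the Exoplanet Orbit Database CSV into pandas. The URL and the
# column names (NAME, PER, MSINI) are guesses and may differ from the live file.
import pandas as pd

URL = "http://exoplanets.org/csv-files/exoplanets.csv"   # hypothetical path

planets = pd.read_csv(URL, low_memory=False)
print(planets.shape)                                     # rows x columns

# Example question: the ten shortest orbital periods with a measured minimum mass.
subset = planets.dropna(subset=["PER", "MSINI"])
print(subset.nsmallest(10, "PER")[["NAME", "PER", "MSINI"]])
```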

I would point out that the term “here” occurs five (5) times with completely different meanings.

It’s a small thing but had:

Help and documentation for the Exoplanet Data Explorer is available <a href=”http://exoplanets.org/help/common/data”>here</a>

been:

<a href=”http://exoplanets.org/help/common/data”>Exoplanet Data Explorer help and documentation</a>

Even a not-very-bright search engine might do a better job searching the page.

Please avoid labeling links with “here.”

Design Patterns in Dynamic Programming

Filed under: Design Patterns,Programming — Patrick Durusau @ 4:27 pm

Design Patterns in Dynamic Programming by Peter Norvig.

From the slides (dated May 5, 1996):

What Are Design Patterns?

  • Descriptions of what experienced designers know (that isn’t written down in the Language Manual)
  • Hints/reminders for choosing classes and methods
  • Higher-order abstractions for program organization
  • To discuss, weigh and record design tradeoffs
  • To avoid limitations of implementation language

(Design Strategies, on the other hand, are what guide you to certain patterns, and certain implementations. They are more like proverbs and like templates.)

Seventy-six slides that are packed with prose, each and every one.

Great for reading separate from the presentation.

Occurs to me that topic maps should be mature enough for a collection of patterns.

Patterns for modeling relationships. Patterns for modeling subject identity.

What patterns would you suggest?
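As a starting point for discussion, here is a minimal sketch of one subject identity pattern: topics merge when they share a subject identifier. It is my own illustration, not a proposed standard:

```python
# Hedged sketch of one "subject identity" pattern: topics merge when they share a
# subject identifier. A minimal illustration only, not a proposed standard pattern.
class Topic:
    def __init__(self, name, identifiers):
        self.names = {name}
        self.identifiers = set(identifiers)

    def merge(self, other):
        # Merging unions names and identifiers; occurrences/associations would follow suit.
        self.names |= other.names
        self.identifiers |= other.identifiers

def merge_on_shared_identifier(topics):
    merged = []
    for t in topics:
        target = next((m for m in merged if m.identifiers & t.identifiers), None)
        if target:
            target.merge(t)
        else:
            merged.append(t)
    return merged

a = Topic("W. Godwin", {"http://example.org/id/william-godwin"})
b = Topic("William Godwin", {"http://example.org/id/william-godwin",
                             "http://example.org/id/godwin-philosopher"})
print(len(merge_on_shared_identifier([a, b])))   # 1: the two topics merged
```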

The Shelley-Godwin Archive

The Shelley-Godwin Archive

From the homepage:

The Shelley-Godwin Archive will provide the digitized manuscripts of Percy Bysshe Shelley, Mary Wollstonecraft Shelley, William Godwin, and Mary Wollstonecraft, bringing together online for the first time ever the widely dispersed handwritten legacy of this uniquely gifted family of writers. The result of a partnership between the New York Public Library and the Maryland Institute for Technology in the Humanities, in cooperation with Oxford’s Bodleian Library, the S-GA also includes key contributions from the Huntington Library, the British Library, and the Houghton Library. In total, these partner libraries contain over 90% of all known relevant manuscripts.

In case you don’t recognize the name, Mary Shelley wrote Frankenstein; or, The Modern Prometheus; William Godwin, philosopher, early modern (unfortunately theoretical) anarchist; Percy Bysshe Shelley, English Romantic poet; Mary Wollstonecraft, writer, feminist. Quite a group for the time or even now.

From the About page on Technological Infrastructure:

The technical infrastructure of the Shelley-Godwin Archive builds on linked data principles and emerging standards such as the Shared Canvas data model and the Text Encoding Initiative’s Genetic Editions vocabulary. It is designed to support a participatory platform where scholars, students, and the general public will be able to engage in the curation and annotation of the Archive’s contents.

The Archive’s transcriptions and software applications and libraries are currently published on GitHub, a popular commercial host for projects that use the Git version control system.

  • TEI transcriptions and other data
  • Shared Canvas viewer and search service
  • Shared Canvas manifest generation

All content and code in these repositories is available under open licenses (the Apache License, Version 2.0 and the Creative Commons Attribution license). Please see the licensing information in each individual repository for additional details.

Shared Canvas and Linked Open Data

Shared Canvas is a new data model designed to facilitate the description and presentation of physical artifacts—usually textual—in the emerging linked open data ecosystem. The model is based on the concept of annotation, which it uses both to associate media files with an abstract canvas representing an artifact, and to enable anyone on the web to describe, discuss, and reuse suitably licensed archival materials and digital facsimile editions. By allowing visitors to create connections to secondary scholarship, social media, or even scenes in movies, projects built on Shared Canvas attempt to break down the walls that have traditionally enclosed digital archives and editions.

Linked open data or content is published and licensed so that “anyone is free to use, reuse, and redistribute it—subject only, at most, to the requirement to attribute and/or share-alike,” (from http://opendefinition.org/) with the additional requirement that when an entity such as a person, a place, or thing that has a recognizable identity is referenced in the data, the reference is made using a well-known identifier—called a universal resource identifier, or “URI”—that can be shared between projects. Together, the linking and openness allow conformant sets of data to be combined into new data sets that work together, allowing anyone to publish their own data as an augmentation of an existing published data set without requiring extensive reformulation of the information before it can be used by anyone else.

The Shared Canvas data model was developed within the context of the study of medieval manuscripts to provide a way for all of the representations of a manuscript to co-exist in an openly addressable and shareable form. A relatively well-known example of this is the tenth-century Archimedes Palimpsest. Each of the pages in the palimpsest was imaged using a number of different wavelengths of light to bring out different characteristics of the parchment and ink. For example, some inks are visible under one set of wavelengths while other inks are visible under a different set. Because the original writing and the newer writing in the palimpsest used different inks, the images made using different wavelengths allow the scholar to see each ink without having to consciously ignore the other ink. In some cases, the ink has faded so much that it is no longer visible to the naked eye. The Shared Canvas data model brings together all of these different images of a single page by considering each image to be an annotation about the page instead of a surrogate for the page. The Shared Canvas website has a viewer that demonstrates how the imaging wavelengths can be selected for a page.

One important bit, at least for topic maps, is the view of the Shared Canvas data model that:

each image [is considered] to be an annotation about the page instead of a surrogate for the page.

If I tried to say that or even re-say it, it would be much more obscure. 😉

Whether “annotation about” versus “surrogate for” will catch on beyond manuscript studies is hard to say.

Not the way it is usually said in topic maps but if other terminology is better understood, why not?

Hadoop for Data Science: A Data Science MD Recap

Filed under: Data Science,Hadoop — Patrick Durusau @ 2:02 pm

Hadoop for Data Science: A Data Science MD Recap by Matt Motyka.

From the post:

On October 9th, Data Science MD welcomed Dr. Donald Miner as its speaker to talk about doing data science work and how the hadoop framework can help. To start the presentation, Don was very clear about one thing: hadoop is bad at a lot of things. It is not meant to be a panacea for every problem a data scientist will face.

With that in mind, Don spoke about the benefits that hadoop offers data scientists. Hadoop is a great tool for data exploration. It can easily handle filtering, sampling and anti-filtering (summarization) tasks. When speaking about these concepts, Don expressed the benefits of each and included some anecdotes that helped to show real world value. He also spoke about data cleanliness in a very Baz Luhrmann Wear Sunscreen sort of way, offering that as his biggest piece of advice.

What?

Hadoop is not a panacea for every data problem????

😉

Don’t panic when you start the video. The ads, etc., take almost seven (7) minutes but Dr. Miner is on the way.


Update: Slides for Hadoop for Data Science. Enjoy!

Statistics + Journalism = Data Journalism ?

Filed under: Journalism,News — Patrick Durusau @ 1:37 pm

Statistics + Journalism = Data Journalism ? by Armin Grossenbacher.

From the post:

Statistics + journalism = data journalism is not the full truth. The equation may make sense because statistics are the most important source for data journalism. But data journalism needs more than statistics and classic journalism: finding the story behind the data, coupled with know-how in specific tools (analysis, visualising), leads to the storytelling data journalism needs.

To get an idea of what data journalism means and what skills are needed, a free MOOC with well-known experts will be offered starting next year.

The MOOC is: Doing Journalism with Data: First Steps, Skills and Tools

No dates but said to be coming in early 2014.

Merging data on deadline?

Interesting both for the content and the tools that journalists use to explore data.

November 4, 2013

Akka at Conspire

Filed under: Akka,Scala — Patrick Durusau @ 10:25 pm

Akka at Conspire

From the post:

Ryan Tanner has posted a really good series of blogs on how and why they are using Akka, and especially how to design your application to make good use of clustering and routers. Akka provides solid tools but you still need to think where to point that shiny hammer, and Ryan has a solid story to tell:

  1. How We Built Our Backend on Akka and Scala
  2. Why We Like Actors
  3. Making Your Akka Life Easier
  4. Don’t Fall Into Our Anti-Pattern Traps
  5. The Importance of Pulling

PS: And no, we don’t mind anyone using our code, not even if it was contributed by Derek Wyatt (honorary team member) 🙂

Unlike the Peyton Place IT tragedies in Washington, this is a software tale that ends well.

Enjoy!

The 3rd GraphLab Conference is coming!

Filed under: GraphLab,Graphs — Patrick Durusau @ 10:16 pm

The 3rd GraphLab Conference is coming! by Danny Bickson.

From the post:

We have just started to organize our 3rd user conference on Monday July 21 in SF. This is a very preliminary notice to attract companies and universities who would like to be involved. We are planning a mega event this year with around 800-900 data scientists attending, with the topic of graph analytics and large scale machine learning.

The conference is a non-profit event held by GraphLab.org to promote applications of large scale graph analytics in industry. We invite talks from all major state-of-the-art solutions for graph processing, graph databases and large scale data analytics and machine learning. We are looking for sponsors who would like to contribute to the event organization.

The best recommendation I can make for the 3rd GraphLab Conference is to point to the videos from the 2nd GraphLab Conference.

There you will find videos and slides for:

  • Molham Aref, LogicBlox – Datalog as a foundation for probabilistic programming
  • Dr. Avery Ching, Facebook – Graph Processing at Facebook Scale
  • Prof. Carlos Guestrin, GraphLab Inc. & University of Washington: Graphs at Scale with GraphLab
  • Dr. Pankaj Gupta, Twitter – WTF: The Who to Follow Service at Twitter
  • Prof. Joe Hellerstein – Professor, UC Berkeley and Co-Founder/CEO, Trifacta – Productivity for Data Analysts: Visualization, Intelligence and Scale
  • Aapo Kyrola, CMU – What can you do with GraphChi – what’s new?
  • Prof. Michael Mahoney, Stanford – Randomized regression in parallel and distributed environments
  • Prof. Vahab Mirrokni, Google – Large-scale Graph Clustering in MapReduce and Beyond
  • Dr. Derek Murray, Microsoft Research – Incremental, iterative and interactive data analysis with Naiad
  • Prof. Mark Oskin, University of Washington – Grappa graph engine
  • Dr. Lei Tang, Walmart Labs – Adaptive User Segmentation for Recommendation
  • Prof. S V N Vishwanathan, Purdue – NOMAD: Non-locking stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix factorization
  • Dr. Theodore Willke, Intel Labs – Intel GraphBuilder 2.0

Spread the word!

Open Data Index

Filed under: Government,Government Data,Open Data — Patrick Durusau @ 7:48 pm

Open Data Index by Armin Grossenbacher.

From the post:

There are lots of indexes.

The most famous one may be the Index Librorum Prohibitorum, listing books prohibited by the Catholic Church. It contained eminent scientists and intellectuals (see the list in Wikipedia) and was abolished only in 1966, after more than 400 years.

Open Data Index

One index everybody would like to be registered in, and with a high rank, is the Open Data Index.

‘An increasing number of governments have committed to open up data, but how much key information is actually being released? …. Which countries are the most advanced and which are lagging in relation to open data? The Open Data Index has been developed to help answer such questions by collecting and presenting information on the state of open data around the world – to ignite discussions between citizens and governments.’

I haven’t seen the movie review guide that appeared in Our Sunday Visitor in years but when I was in high school it was the best movie guide around. Just pick the ones rated as morally condemned. 😉

There are two criteria I don’t see mentioned for rating open data:

  1. How easy/hard is it to integrate a particular data set with other data from the same source or organization?
  2. Is the data supportive, neutral or negative with regard to established government policies?

Do you know of any open data sets where those questions are used to rate them?

Integrating the Biological Universe

Filed under: Bioinformatics,Integration — Patrick Durusau @ 7:37 pm

Integrating the Biological Universe by Yasset Perez-Riverol & Roberto Vera.

From the post:

Integrating biological data is perhaps one of the most daunting tasks any bioinformatician has to face. From a cursory look, it is easy to see two major obstacles standing in the way: (i) the sheer amount of existing data, and (ii) the staggering variety of resources and data types used by the different groups working in the field (reviewed at [1]). In fact, the topic of data integration has a long-standing history in computational biology and bioinformatics. A comprehensive picture of this problem can be found in recent papers [2], but this short comment will serve to illustrate some of the hurdles of data integration and as a not-so-shameless plug for our contribution towards a solution.

“Reflecting the data-driven nature of modern biology, databases have grown considerably both in size and number during the last decade. The exact number of databases is difficult to ascertain. While not exhaustive, the 2011 Nucleic Acids Research (NAR) online database collection lists 1330 published biodatabases (1), and estimates derived from the ELIXIR database provider survey suggest an approximate annual growth rate of ∼12% (2). Globally, the numbers are likely to be significantly higher than those mentioned in the online collection, not least because many are unpublished, or not published in the NAR database issue.” [1]

Which lead me to:

JBioWH: an open-source Java framework for bioinformatics data integration:

Abstract:

The Java BioWareHouse (JBioWH) project is an open-source platform-independent programming framework that allows a user to build his/her own integrated database from the most popular data sources. JBioWH can be used for intensive querying of multiple data sources and the creation of streamlined task-specific data sets on local PCs. JBioWH is based on a MySQL relational database scheme and includes JAVA API parser functions for retrieving data from 20 public databases (e.g. NCBI, KEGG, etc.). It also includes a client desktop application for (non-programmer) users to query data. In addition, JBioWH can be tailored for use in specific circumstances, including the handling of massive queries for high-throughput analyses or CPU intensive calculations. The framework is provided with complete documentation and application examples and it can be downloaded from the Project Web site at http://code.google.com/p/jbiowh. A MySQL server is available for demonstration purposes at hydrax.icgeb.trieste.it:3307.

Database URL: http://code.google.com/p/jbiowh

Thoughts?

Comments?

Fake femme fatale dupes IT guys…

Filed under: Cybersecurity,Security — Patrick Durusau @ 7:18 pm

Fake femme fatale dupes IT guys at US government agency by Lisa Vaas.

From the post:

It was the birthday of the head of information security at a US government agency that isn’t normally stupid about cyber security.

He didn’t have any accounts on social media websites, but two of his employees were talking about his special day on Facebook.

A penetration testing team sent the infosec head an email with a birthday card, spoofing it to look like the card came from one of his employees.

The recipient opened it and clicked on the link inside.

After the head of information security opened what was, of course, a malicious birthday card link, his computer was compromised.

That gave his attackers the front-door keys, according to Aamir Lakhani, who works for World Wide Technology, the company that performed the penetration test:

It gets better, way better.

After you read the rest of Lisa’s post, ask yourself:

Would you take their word for anything?

I first saw this in Nat Torkington’s Four short links: 4 November 2013.

Crowdsourcing Multi-Label Classification for Taxonomy Creation

Filed under: Crowd Sourcing,Decision Making,Machine Learning,Taxonomy — Patrick Durusau @ 5:19 pm

Crowdsourcing Multi-Label Classification for Taxonomy Creation by Jonathan Bragg, Mausam and Daniel S. Weld.

Abstract:

Recent work has introduced CASCADE, an algorithm for creating a globally-consistent taxonomy by crowdsourcing microwork from many individuals, each of whom may see only a tiny fraction of the data (Chilton et al. 2013). While CASCADE needs only unskilled labor and produces taxonomies whose quality approaches that of human experts, it uses significantly more labor than experts. This paper presents DELUGE, an improved workflow that produces taxonomies with comparable quality using significantly less crowd labor. Specifically, our method for crowdsourcing multi-label classification optimizes CASCADE’s most costly step (categorization) using less than 10% of the labor required by the original approach. DELUGE’s savings come from the use of decision theory and machine learning, which allow it to pose microtasks that aim to maximize information gain.

An extension of work reported at Cascade: Crowdsourcing Taxonomy Creation.

While the reduction in required work is interesting, the ability to sustain more complex workflows looks like the more important result.

That will require the development of workflows to be optimized, at least for subject identification.

Or should I say validation of subject identification?

What workflow do you use for subject identification and/or validation of subject identification?

November 3, 2013

Penguins in Sweaters…

Filed under: Searching,Semantics,Serendipity — Patrick Durusau @ 8:38 pm

Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content by Ilaria Bordino, Yelena Mejova and Mounia Lalmas.

Abstract:

In many cases, when browsing the Web users are searching for specific information or answers to concrete questions. Sometimes, though, users find unexpected, yet interesting and useful results, and are encouraged to explore further. What makes a result serendipitous? We propose to answer this question by exploring the potential of entities extracted from two sources of user-generated content – Wikipedia, a user-curated online encyclopedia, and Yahoo! Answers, a more unconstrained question/answering forum – in promoting serendipitous search. In this work, the content of each data source is represented as an entity network, which is further enriched with metadata about sentiment, writing quality, and topical category. We devise an algorithm based on lazy random walk with restart to retrieve entity recommendations from the networks. We show that our method provides novel results from both datasets, compared to standard web search engines. However, unlike previous research, we find that choosing highly emotional entities does not increase user interest for many categories of entities, suggesting a more complex relationship between topic matter and the desirable metadata attributes in serendipitous search.

From the introduction:

A system supporting serendipity must provide results that are surprising, semantically cohesive, i.e., relevant to some information need of the user, or just interesting. In this paper, we tackle the question of what makes a result serendipitous.

Serendipity, now that would make a very interesting product demonstration!

In particular if the search results were interesting to the client.

I must admit when I saw the first part of the title I was expecting an article on Linux. 😉
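For readers curious about the mechanic behind the recommendations, here is a hedged sketch of a lazy random walk with restart on a toy entity graph; the graph and parameters are made up, while the paper's actual networks come from Wikipedia and Yahoo! Answers entities:

```python
# Hedged sketch of (lazy) random walk with restart on a toy entity graph. The graph and
# parameters are invented; the paper builds its entity networks from Wikipedia and
# Yahoo! Answers. Every node must appear as a key in the adjacency dict.
def random_walk_with_restart(adj, seed, restart=0.15, laziness=0.5, iters=100):
    nodes = list(adj)
    score = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            mass = score[n]
            nxt[n] += laziness * mass                # lazy: stay put with probability `laziness`
            out = adj[n]
            for m in out:                            # otherwise spread mass over neighbours
                nxt[m] += (1.0 - laziness) * mass / len(out)
        # restart: teleport a fixed fraction of all mass back to the seed entity
        score = {n: (1.0 - restart) * nxt[n] for n in nodes}
        score[seed] += restart
    return sorted(score.items(), key=lambda kv: -kv[1])

toy = {
    "penguin": ["antarctica", "sweater"],
    "sweater": ["penguin", "wool"],
    "antarctica": ["penguin"],
    "wool": ["sweater"],
}
print(random_walk_with_restart(toy, "penguin")[:3])
```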


Interactive tool for creating directed graphs…

Filed under: D3,Graphs — Patrick Durusau @ 5:54 pm

Interactive tool for creating directed graphs using d3.js.

From the webpage:

directed-graph-creator

Interactive tool for creating directed graphs, created using d3.js.

Operation:

  • drag/scroll to translate/zoom the graph
  • shift-click on graph to create a node
  • shift-click on a node and then drag to another node to connect them with a directed edge
  • shift-click on a node to change its title
  • click on node or edge and press backspace/delete to delete

Run:

  • python -m SimpleHTTPServer 8000
  • navigate to http://127.0.0.1:8000

Github repo is at https://github.com/metacademy/directed-graph-creator

Not every useful tool is heavy weight.

Google’s Python Lessons are Awesome

Filed under: NLTK,Python — Patrick Durusau @ 5:46 pm

Google’s Python Lessons are Awesome by Hartley Brody.

From the post:

Whether you’re just starting to learn Python, or you’ve been working with it for awhile, take note.

The lovably geeky Nick Parlante — a Google employee and CS lecturer at Stanford — has written some awesomely succinct tutorials that not only tell you how you can use Python, but also how you should use Python. This makes them a fantastic resource, regardless of whether you’re just starting, or you’ve been working with Python for awhile.

The course also features six YouTube videos of Nick giving a lesson in front of some new Google employees. These make it feel like he’s actually there teaching you every feature and trick, and I’d highly recommend watching all of them as you go through the lessons. Some of the videos are longish (~50m) so this is something you want to do when you’re sitting down and focused.

And to really get your feet wet, there are also downloadable samples puzzles and challenges that go along with the lessons, so you can actually practice coding along with the Googlers in his class. They’re all pretty basic — most took me less than 5m — but they’re a great chance to practice what you’ve learned. Plus you get the satisfaction that comes with solving puzzles and successfully moving through the class.

I am studying the NLTK to get ready for a text analysis project, at least enough to be able to read along. This looks like a great resource to know about.
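For the record, the kind of first NLTK experiment these Python lessons prepare you for is only a few lines; this sketch assumes nltk is installed and downloads the punkt tokenizer data on first run:

```python
# A minimal first NLTK experiment: tokenize a sentence and count word frequencies.
# Assumes nltk is installed; downloads the punkt tokenizer data on first run.
import nltk
nltk.download("punkt", quiet=True)

from nltk.tokenize import word_tokenize
from nltk import FreqDist

text = "Topic maps merge subjects. Subjects, not strings, are what topic maps are about."
tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]

print(FreqDist(tokens).most_common(5))   # the five most frequent word tokens
```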

I also like the idea of samples, puzzles and challenges.

Not that samples, puzzles and challenges would put topic maps over the top but it would make instruction/self-learning more enjoyable.

Download Cooper-Hewitt Collections Data

Filed under: Data,Museums — Patrick Durusau @ 5:32 pm

Download Cooper-Hewitt Collections Data

From the post:

Cooper-Hewitt is committed to making its collection data available for public access. To date, we have made approximately 60% of the documented collection available online. Whilst we have a web interface for searching the collection, we are now also making the dataset available for free public download. By being able to see “everything” at once, new connections and understandings may emerge.

What is it?

The download contains only text metadata, or “tombstone” information—a brief object description that includes temporal, geographic, and provenance information—for over 120,000 objects.

Is it complete?

No. The data is only tombstone information. Tombstone information is the raw data that is created by museum staff at the time of acquisition for recording the basic ‘facts’ about an object. As such, it is unedited. Historically, museum staff have used this data only for identifying the object, tracking its whereabouts in storage or exhibition, and for internal report and label creation. Like most museums, Cooper-Hewitt had never predicted that the public might use technologies, such as the web, to explore museum collections in the way that they do now. As such, this data has not been created with a “public audience” in mind. Not every field is complete for each record, nor is there any consistency in the way in which data has been entered over the many years of its accumulation. Considerable additional information is available in research files that have not yet been digitized and, as the research work of the museum is ongoing, the records will continue to be updated and change over time.

Which all sounds great, if you know what the Cooper-Hewitt collection houses.

From the about page:

Smithsonian’s Cooper-Hewitt, National Design Museum is the only museum in the nation devoted exclusively to historic and contemporary design. The Museum presents compelling perspectives on the impact of design on daily life through active educational and curatorial programming.

It is the mission of Cooper-Hewitt’s staff and Board of Trustees to advance the public understanding of design across the thirty centuries of human creativity represented by the Museum’s collection. The Museum was founded in 1897 by Amy, Eleanor, and Sarah Hewitt—granddaughters of industrialist Peter Cooper—as part of The Cooper Union for the Advancement of Science and Art. A branch of the Smithsonian since 1967, Cooper-Hewitt is housed in the landmark Andrew Carnegie Mansion on Fifth Avenue in New York City.

The campus also includes two historic townhouses renovated with state-of-the-art conservation technology and a unique terrace and garden. Cooper-Hewitt’s collections include more than 217,000 design objects and a world-class design library. Its exhibitions, in-depth educational programs, and on-site, degree-granting master’s program explore the process of design, both historic and contemporary. As part of its mission, Cooper-Hewitt annually sponsors the National Design Awards, a prestigious program which honors innovation and excellence in American design. Together, these resources and programs reinforce Cooper-Hewitt’s position as the preeminent museum and educational authority for the study of design in the United States.

Even without images, I can imagine enhancing library catalog holdings with annotations about particular artifacts being located at the Cooper-Hewitt.

Learn X in Y minutes Where X=clojure

Filed under: Clojure,Programming — Patrick Durusau @ 5:11 pm

Learn X in Y minutes Where X=clojure

From the post:

Get the code: learnclojure.clj

Clojure is a Lisp family language developed for the Java Virtual Machine. It has a much stronger emphasis on pure functional programming than Common Lisp, but includes several STM utilities to handle state as it comes up.

This combination allows it to handle concurrent processing very simply, and often automatically.

The post concedes this is just enough to get you started but also has good references to more materials.

Useful for the holidays. Like playing chess without a board, you can imagine coding without a computer. Could improve your memory. 😉

Shogun… 3.0.0

Filed under: Machine Learning — Patrick Durusau @ 5:04 pm

Shogun – A Large Scale Machine Learning Toolbox (3.0.0 release)

Highlights of the Shogun 3.0.0 release:

This release features 8 successful Google Summer of Code projects and it is the result of an incredible effort by our students. All projects come with very cool ipython-notebooks that contain background, code examples and visualizations. These can be found on our webpage!

    The projects are:

  • Gaussian Processes for binary classification [Roman Votjakov]
  • Sampling log-determinants for large sparse matrices [Soumyajit De]
  • Metric Learning via LMNN [Fernando Iglesias]
  • Independent Component Analysis (ICA) [Kevin Hughes]
  • Hashing Feature Framework [Evangelos Anagnostopoulos]
  • Structured Output Learning [Hu Shell]
  • A web-demo framework [Liu Zhengyang]

Other important changes are the change of our build system to CMake and the addition of clone/equals methods to our base class. In addition, you get the usual ton of bugfixes, new unit tests, and new mini-features.

The following features have been added:
    • Added method to importance sample the (true) marginal likelihood of a Gaussian Process using a posterior approximation.
    • Added a new class for classical probability distribution that can be sampled and whose log-pdf can be evaluated. Added the multivariate Gaussian with various numerical flavours.
    • Cross-validation framework works now with Gaussian Processes
    • Added nu-SVR for LibSVR class
    • Modelselection is now supported for parameters of sub-kernels of combined kernels in the MKL context. Thanks to Evangelos Anagnostopoulos
    • Probability output for multi-class SVMs is now supported using various heuristics. Thanks to Shell Xu Hu.
    • Added an “equals” method to all Shogun objects that recursively compares all registered parameters with those of another instance — up to a specified accuracy.
    • Added a “clone” method to all Shogun objects that creates a deep copy
    • Multiclass LDA. Thanks to Kevin Hughes.
    • Added a new datatype, complex128_t, for complex numbers. Math functions, support for SGVector/Matrix, SGSparseVector/Matrix, and serialization with Ascii and Xml files added. [Soumyajit De].
    • Added mini-framework for numerical integration in one variable. Implemented Gauss-Kronrod and Gauss-Hermite quadrature formulas.
    • Changed from configure script to CMake by Viktor Gal.
    • Add C++0x and C++11 cmake detection scripts
    • ND-Array typmap support for python and octave modular.

Toolbox machine learning lacks the bells and whistles of custom code but it is a great way to experiment with data and machine learning techniques.

Experimenting with data and techniques will help immunize you against the common frauds and deceptions that rely on machine learning techniques.

Darrell Huff wrote How to Lie with Statistics in the 1950s.

Is there anything equivalent to that for machine learning? Given the technical nature of many of the techniques, a guide to what questions to ask, etc., could be a real boon, at least to one side of machine-learning-based discussions.
