Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 18, 2012

Waste Book 2012 [ > 1,000 Footnote Islands ]

Filed under: Government,Government Data,Marketing,Topic Maps — Patrick Durusau @ 10:43 am

Waste Book 2012 by Sen. Tom Coburn, M.D. (PDF file)

Senator Coburn is a government pork/waste gadfly in the United States Senate.

Its often humorous descriptions call attention to many programs and policies that appear to be pure waste.

I say “appear to be pure waste” because Senator Coburn’s reports are islands of commentary, in a sea crowded with such islands.

There is no opportunity to “connect the dots” with additional information, such as rebuttals, changes in agency policy or practices, or even the personnel responsible for the alleged waste.

Imagine a football (U.S. or European) stadium where every fan has a bull horn and is shouting their description of each play. That is the current status of reporting about issues in the U.S. federal government.

Senator Coburn’s latest report may be described in several thousand news publications, but beyond announcing that it was issued, that group of shouts could be reduced to one. The rest are just duplicative noise.

The Waste Book tries to do better than conservative talk radio or its imagined “liberal” press foe. The Waste Book cites sources for the claims that it makes. Over 1,000 footnote islands.

“Islands” because like the Waste Book, it isn’t easy to connect them with other information. Or to debate those connections.

Every increase in connection difficulty increases the likelihood of non-verification/validation. That is, you will just take their word for it.

The people who possess information realize that.

Why do you think government reports appear as nearly useless PDF files? Or why media stories, even online, are leaden lumps of information that quickly sink in the sea of media shouting?

Identifiable someones want you to “take their word” for any number of things.

They are counting on your job, family, and life in general leaving too little time for any other answer.

How would you like to disappoint them?

(More to follow on capturing information traffic between footnote “islands” and how to leverage it for yourself and others.)

Do Presidential Debates Approach Semantic Zero?

ReConstitution recreates debates through transcripts and language processing by Nathan Yau.

From Nathan’s post:

Part data visualization, part experimental typography, ReConstitution 2012 is a live web app linked to the US Presidential Debates. During and after the three debates, language used by the candidates generates a live graphical map of the events. Algorithms track the psychological states of Romney and Obama and compare them to past candidates. The app allows the user to get beyond the punditry and discover the hidden meaning in the words chosen by the candidates.

The visualization does not answer the thorny experimental question: Do presidential debates approach semantic zero?

Well, maybe the technique will improve by the next presidential election.

In the meantime, it was an impressive display of real-time processing and analysis of text.

Imagine such an interface streaming text while you choose subjects, associations between subjects, and the like.

Not trying to perfectly code any particular stretch of text but interacting with the flow of the text.

There are goals other than approaching semantic zero.

Designing for Consumer Search Behaviour (slideshow)

Filed under: Interface Research/Design,Search Behavior,Users — Patrick Durusau @ 10:40 am

Designing for Consumer Search Behaviour (slideshow) by Tony Russell-Rose.

From the post:

Here are the slides from the talk I gave recently at HCIR 2012 on Designing for Consumer Search Behaviour. This presentation is the counterpart to the previous one: while A Model of Consumer Search Behaviour introduced the model and described the analytic work that led to it, this talk looks at the practical design implications. In particular, it addresses the observation that although the information retrieval community is blessed with an abundance of analytic models, only a tiny fraction of these make any impression at all on mainstream UX design practice.

Why is this? In part, this may be simply a reflection of imperfect channels of communication between the respective communities. However, I suspect it may also be a by-product of the way researchers are incentivized: with career progression based almost exclusively on citations in peer-reviewed academic journals, it is hard to see what motivation may be left to encourage adoption by other communities such as design practitioners. Yet from a wider perspective, it is precisely this cross-fertilisation that can make the difference between an idea gathering the dust of citations within a closed community and actually having an impact on the mainstream search experiences that we as consumers all encounter.

I have encountered the “cross-community” question before. A major academic organization where I was employed and a non-profit in the field shared members for more than a century.

They had no projects in common in all that time. Each knew about the other, but kept waiting for the “other” one to call first. They eventually did have a project or two together, but members of communities tend to stay in those communities.

It is a question of a member’s “comfort” zone. How will members of the other community react? Will they be accepting? Judgemental? Once you know, it is hard to go back to ignorance. Better just to stay at home and imagine what it would be like “over there.” Less risky.

You might find members of other communities have the same hopes, fears, and dreams that you do. Then what? Hard to diss others when it means dissing yourself.

Wouldn’t a cross-over UX design practitioner/researcher poster day, with lots of finger food and tables for ad hoc conversations/demos, be a nice way to break the ice between the two communities?

Axemblr’s Java Client for the Cloudera Manager API

Filed under: Cloud Computing,Cloudera,Hadoop — Patrick Durusau @ 10:38 am

Axemblr’s Java Client for the Cloudera Manager API by Justin Kestelyn.

From the post:

Axemblr, purveyors of a cloud-agnostic MapReduce Web Service, have recently announced the availability of an Apache-licensed Java Client for the Cloudera Manager API.

The task at hand, according to Axemblr, is to “deploy Hadoop on Cloud with as little user interaction as possible. We have the code to provision the hosts but we still need to install and configure Hadoop on all nodes and make it so the user has a nice experience doing it.” And voila, the answer is Cloudera Manager, with the process made easy via the REST API introduced in Release 4.0.

Thus, says Axemblr: “In the pursuit of our greatest desire (second only to coffee early in the morning), we ended up writing a Java client for Cloudera Manager’s API. Thus we achieved to automate a CDH3 Hadoop installation on Amazon EC2 and Rackspace Cloud. We also decided to open source the client so other people can play along.”

Another goodie to ease your way to Hadoop deployment on your favorite cloud.

Do you remember the lights at radio stations that would show “On Air?”

I need an “On Cloud” that lights up. More realistic than the data appliance.

Bigger Than A Bread Box

Filed under: Analytics,BigData,Hortonworks — Patrick Durusau @ 10:38 am

Hortonworks & Teradata: More Than Just an Elephant in a Box by Jim Walker.

I’m not going to wake up Christmas morning to find:

a Teradata Aster Big Analytics Appliance.

But in case you are in the market for a big analytics hardware/software appliance, Jim writes:

Today our partner, Teradata, announced availability of the Teradata Aster Big Analytics Appliance, which packages our Hortonworks Data Platform (HDP) with Teradata Aster on machine that is ready to plug-in and bring big data value in hours.

There is more to this appliance than meets the eye… it is not just a simple packaging of software on hardware. Teradata and Hortonworks engineers have been working together for months tying our solutions together and optimizing them for an appliance. This solution gives an analyst the ability to leverage big data (social media, Web clickstream, call center, and other types of customer interaction data) in their analysis and all the while use the tools they are already familiar with. It is analytics and data discovery/exploration with big data (or HDP) inside… all on an appliance that can be operational in hours.

Not just anyone can do this

This is an engineered solution. Many analytics tools are building their solutions on top of Hadoop using Hive and HiveQL. This is a great approach but it lacks integration of metadata and metadata exchange. With the appliance we have extended a new approach using HCatalog and the Teradata SQL-H product. SQL-H is a conduit that allows new analysis to be created and schema changes to be adopted within Hadoop from Teradata. Analysts are abstracted completely from the Hadoop environment so they can focus on what they do best… analyze. All of this is enabled by an innovation provided by HCatalog, which enables this metadata exchange.

Shortcut to Big Data Exploration

In the appliance, Aster provides over 50 pre-built functions that allow analysts to perform segmentation, transformations and even pre-packaged marketing analytics. With this package, these valuable functions can now be applied to big data in Hadoop. This shortens the time it takes for an analyst to explore and discover value in big data. And if the pre-packaged functions aren’t explicit enough, Teradata Aster also provides an environment to create MapReduce functions that can be executed in HDP.

Just as well.

Red doesn’t really go with my office decor. Runs more towards the hulking black server tower, except for the artificial pink tree in the corner. 😉

Cross-Community? Try Japan, 1980’s, For Success, Today!

Filed under: Interface Research/Design,User Targeting,Users — Patrick Durusau @ 10:37 am

Leveraging the Kano Model for Optimal Results by Jan Moorman.

Jan’s post outlines what you need to know to understand and use a UX model known as the “Kano Model.”

In short, the Kano Model is a way to evaluate how customers (the folks who buy products, not your engineers) feel about product features.

You are ahead of me if you guessed that positive reactions to product features are the goal.

Jan and company returned to the original research. An important point because applying research mechanically will get you mechanical results.

From the post:

You are looking at a list of 18 proposed features for your product. Flat out, 18 are too many to include in the initial release given your deadlines, and you want identify the optimal subset of these features.

You suspect an executive’s teenager suggested a few. Others you recognize from competitor products. Your gut instinct tells you that none of the 18 features are game changers and you’re getting pushback on investing in upfront generative research.

It’s a problem. What do you do?

You might try what many agile teams and UX professionals are doing: applying a method that first emerged in Japan during the 1980’s called the ‘Kano Model’ used to measures customer emotional reaction to individual features. At projekt202, we’ve had great success in doing just that. Our success emerged from revisiting Kano’s original research and through trial and error. What we discovered is that it really matters how you design and perform a Kano study. It matters how you analyze and visualize the results.

We have also seen how the Kano Model is a powerful tool for communicating the ROI of upfront generative research, and how results from Kano studies inform product roadmap decisions. Overall, Kano studies are a very useful to have in our research toolkit.

Definitely an approach to incorporate in UX evaluation.
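
If you want to see how little machinery a Kano study actually needs, here is a rough sketch of the commonly published Kano evaluation table: each feature gets a paired functional/dysfunctional question, and the answer pair is looked up in the table. This is my illustration of the general technique, not projekt202’s method; their scoring and visualization may well differ.

```python
# Illustrative sketch of the commonly published Kano evaluation table.
# Answer scale for the functional ("How do you feel if the feature is present?")
# and dysfunctional ("...if it is absent?") questions.
ANSWERS = ["like", "must-be", "neutral", "live-with", "dislike"]

# Rows: functional answer; columns: dysfunctional answer.
# A=Attractive, O=One-dimensional, M=Must-be, I=Indifferent,
# R=Reverse, Q=Questionable.
KANO_TABLE = [
    # like  must-be neutral live-with dislike
    ["Q",   "A",    "A",    "A",      "O"],  # like
    ["R",   "I",    "I",    "I",      "M"],  # must-be
    ["R",   "I",    "I",    "I",      "M"],  # neutral
    ["R",   "I",    "I",    "I",      "M"],  # live-with
    ["R",   "R",    "R",    "R",      "Q"],  # dislike
]

def classify(functional: str, dysfunctional: str) -> str:
    """Map one respondent's answer pair to a Kano category."""
    return KANO_TABLE[ANSWERS.index(functional)][ANSWERS.index(dysfunctional)]

if __name__ == "__main__":
    # Likes having the feature, dislikes its absence: one-dimensional (performance).
    print(classify("like", "dislike"))      # O
    # Neutral if present, dislikes its absence: a must-be feature.
    print(classify("neutral", "dislike"))   # M
```

The hard parts Jan describes, designing the study and analyzing and visualizing the results, sit on top of this simple lookup.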

Open-Sankoré

Filed under: Graphics,Interface Research/Design,Visualization — Patrick Durusau @ 10:37 am

Open-Sankoré

From the website:

Open-Sankoré is a multiplatform, open-source program that is compatible with every type of interactive hardware. It is also translated into many different languages. Its range of tools is adapted to all users: from beginners to experts.

I first saw this at H Open, which noted:

Open Sankoré is open source whiteboard software that offers a drop-in replacement for proprietary alternatives and adds a couple of interesting features such as support for the W3C Widgets specification. This means that users can, for example, embed interactive content which has been developed for Apache Wookie directly on the whiteboard.

Looks useful for instruction and perhaps WG3 meetings.

Any experience with it?

Anyone using one of the Wacom Bamboo tablets with it?

10gen: Growing the MongoDB world

Filed under: MongoDB — Patrick Durusau @ 10:36 am

10gen: Growing the MongoDB world by Dj Walker-Morgan.

From the post:

10gen, the company set up by the creators of the open source NoSQL database MongoDB, has been on a roll recently, creating business partnerships with numerous companies, making it a hot commercial proposition without creating any apparent friction with its open source community. So what has brought MongoDB to the fore?

One factor has been how easy it is to get up and running with the database, a feature that the company wants to actively maintain. 10gen president Max Schireson explained: “I think that it’s honestly a combination of the functionality of MongoDB itself, but also the effort that we’ve invested in packaging for the open source community. I see some open source companies taking the approach of ‘oh yeah the code’s open source but you’ll need a PhD to actually get a working build of it unless you are a subscriber’. While that might help monetisation, that’s not a way to build a big community”.

Schireson says the company isn’t going stand still though: although it’s easy to get a single node up and running, over time they want to make it easier to get more complex, sharded, implementations configured and deployed. “As people use more and more functionality, that of necessity brings in more complexity, we’re looking for ways to make that easier,” he says, pointing to the cluster manager being developed as a native part of MongoDB, which should make it easier to manage and upgrade clusters.

Always appreciate a plug for good documentation.

May not work for you but it certainly worked here.

Apache Hadoop YARN Meetup at Hortonworks

Filed under: Hadoop YARN,Hortonworks — Patrick Durusau @ 10:36 am

Apache Hadoop YARN Meetup at Hortonworks – Recap by Vinod Kumar Vavilapalli.

Just in case you missed the Apache Hadoop YARN meetup, summaries and slides are available for:

  • Chris Riccomini’s talk on “Building Applications on YARN”
  • YARN API Discussion
  • Efforts Underway

Enjoy!

OneZoom Tree of Life Explorer

Filed under: Graphics,Interface Research/Design,Visualization — Patrick Durusau @ 4:46 am

OneZoom Tree of Life Explorer

From the website:

OneZoom is committed to heightening awareness about the diversity of life on earth, its evolutionary history and the threats of extinction. This website allows you to explore the tree of life in a completely new way: it’s like a map, everything is on one page, all you have to do is zoom in and out. OneZoom also provides free, open source, data visulation tools for science and education, currently focusing on the tree of life.

This is wicked cool! Be sure to watch the video introduction. You will be able to navigate without it but there are hidden bells and whistles.

Should provoke all sorts of ideas about visualizing and exploring data.

See also: OneZoom: A Fractal Explorer for the Tree of Life (Rosindell J, Harmon LJ (2012) OneZoom: A Fractal Explorer for the Tree of Life. PLoS Biol 10(10): e1001406. doi:10.1371/journal.pbio.1001406) for more details on the project.

I first saw this at Science Daily: Tree of Life Branches out Online.

October 17, 2012

Calligra 2.6 Alpha Released [Entity/Association Recognition Writ Small?]

Filed under: Authoring Topic Maps,Word Processing,Writing — Patrick Durusau @ 9:21 am

Calligra 2.6 Alpha Released

The final version of Calligra 2.6 is due out in December of 2012. Too late to think about topic map features for that release.

But what about the release after that?

In 2.6 we will see:

Calligra Author is a new member of the growing Calligra application family. The application was announced just after the release of Calligra 2.5 with the following description:

The application will support a writer in the process of creating an eBook from concept to publication. We have two user categories in particular in mind:

  • Novelists who produce long texts with complicated plots involving many characters and scenes but with limited formatting.
  • Textbook authors who want to take advantage of the added possibilities in eBooks compared to paper-based textbooks.

Novelists and textbook authors are prime candidates for topic maps, especially if integrated into a word processor.

Novelists track many relationships between people, places, things. What if entities were recognized and associations suggested, much like spell checking?

Not solving entity/association recognition writ large, but entity/association recognition writ small. Entity/association recognition for a single author.

Textbook authors as well, because they are creating instructional maps of a field of study. Instructional maps that have to be updated with new information and references.

Separate indexes could be merged, to create meaningful indexes to entire series of works.
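
To make the “much like spell checking” idea concrete, here is a minimal sketch of entity/association recognition writ small: a per-author dictionary of known entities, with co-occurrence in a sentence flagged as a candidate association for the author to accept or reject. The names and the sentence-level heuristic are my illustrative assumptions, not a Calligra feature.

```python
import itertools
import re

# A single author's "personal dictionary" of entities (hypothetical names).
KNOWN_ENTITIES = {"Anna", "Count Vronsky", "Moscow"}

def suggest_associations(text: str):
    """Flag pairs of known entities that co-occur in a sentence,
    as candidate associations for the author to confirm or reject."""
    suggestions = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        found = [e for e in KNOWN_ENTITIES if e in sentence]
        for a, b in itertools.combinations(sorted(found), 2):
            suggestions.append((a, b, sentence.strip()))
    return suggestions

draft = "Anna met Count Vronsky at the station. She then left for Moscow."
for a, b, where in suggest_associations(draft):
    print(f"Associate '{a}' with '{b}'?  ({where})")
```

A word processor could surface these suggestions the way a spell checker underlines unknown words, leaving the decision to the author.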

PS: In the interest of full disclosure, I am the editor of ODF, the default format for Calligra.

Google open sources Supersonic query engine

Filed under: Column-Oriented,Query Engine — Patrick Durusau @ 9:20 am

Google open sources Supersonic query engine

From the post:

Google has released Supersonic, a query engine designed to work efficiently with column-oriented databases. The announcement suggests that Supersonic would be “extremely useful for creating a column oriented database back-end”, and that it aims to offer “second-to-none execution times”. As part of achieving that design goal, the C++ library uses many low-level, cache-aware optimisations, SIMD instructions and vectorised execution so that it can make the best use of modern pipelined CPUs, while still working as a single process.

Supersonic can perform “Operations” on columnar data such as Compute, Filter, Sort, HashJoin, and more; on views these operations can be chained together to produce a final result. Data for these operations is currently held in memory; there is no current built-in data storage format, but the developers say that there is “a strong intention of developing one”. Other work in progress includes the provision of wide test coverage for the library. A tarball archive of the code is available to download, while the source can be git cloned from the Google Code project pages.

Do you ever wonder what “secret” software must be like to have packages like this in open source?
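
For readers who have not worked with columnar engines, here is a generic sketch of what chaining operations over in-memory columns looks like. It is plain Python for illustration only, not the Supersonic C++ API, and the data is made up.

```python
# A generic sketch of chaining operations over in-memory columnar data.
# This is NOT the Supersonic API; it only illustrates the Filter -> Sort
# style of composition described in the announcement.
columns = {
    "user_id": [7, 3, 9, 3, 1],
    "spend":   [10.0, 42.5, 7.2, 99.9, 15.0],
}

def filter_rows(cols, column, predicate):
    """Keep only the rows whose value in `column` satisfies `predicate`."""
    keep = [i for i, v in enumerate(cols[column]) if predicate(v)]
    return {name: [vals[i] for i in keep] for name, vals in cols.items()}

def sort_rows(cols, key_column, reverse=False):
    """Reorder every column by the values in `key_column`."""
    order = sorted(range(len(cols[key_column])),
                   key=lambda i: cols[key_column][i], reverse=reverse)
    return {name: [vals[i] for i in order] for name, vals in cols.items()}

# Chain the operations: keep spend >= 10, then sort by spend, descending.
result = sort_rows(filter_rows(columns, "spend", lambda s: s >= 10.0),
                   "spend", reverse=True)
print(result)
# {'user_id': [3, 3, 1, 7], 'spend': [99.9, 42.5, 15.0, 10.0]}
```

Supersonic’s contribution is doing this kind of composition with cache-aware, vectorised execution in C++, which the toy version above makes no attempt at.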

R at 12,000 Cores

Filed under: BigData,MPI,Parallel Programming,R — Patrick Durusau @ 9:19 am

R at 12,000 Cores

From the post:

I am very happy to introduce a new set of packages that has just hit the CRAN. We are calling it the Programming with Big Data in R Project, or pbdR for short (or as I like to jokingly refer to it, ‘pretty bad for dyslexics’). You can find out more about the pbdR project at http://r-pbd.org/

The packages are a natural programming framework that are, from the user’s point of view, a very simple extension of R’s natural syntax, but running in parallel over MPI and handling big data sets with ease. Much of the parallelism we offer is implicit, meaning that you can use code you are already using while achieving massive performance gains.

The packages are free as in beer, and free as in speech. You could call them “free and open source”, or libre software. The source code is free for everyone to look at, extend, re-use, whatever, forever.

At present, the project consists of 4 packages: pbdMPI, pbdSLAP, pbdBASE, and pbdDMAT. The pbdMPI package offers simplified hooks into MPI, making explicit parallel programming over much simpler, and sometimes much faster than with Rmpi. Next up the chain is pbdSLAP, which is a set of libraries pre-bundled for the R user, to greatly simplify complicated installations. The last two packages, pbdBASE and pbdDMAT, offer high-level R syntax for computing with distributed matrix objects at low-level programming speed. The only system requirements are that you have R and an MPI installation.

We have attempted to extensively document the project in a collection of package vignettes; but really, if you are already using R, then much of the work is already familiar to you. Want to take the svd of a matrix? Just use svd(x) or La.svd(x), only “x” is now a distributed matrix object.

One MPI source: OpenMPI. Interested to hear of experiences with other MPI installations.

If you can’t run MPI or don’t want to, be sure to also check out the RHadoop project.

I first saw this at R-Bloggers.

Improving the integration between R and Hadoop: rmr 2.0 released

Filed under: Hadoop,R,RHadoop — Patrick Durusau @ 9:14 am

Improving the integration between R and Hadoop: rmr 2.0 released

David Smith reports:

The RHadoop project, the open-source project supported by Revolution Analytics to integrate R and Hadoop, continues to evolve. Now available is version 2 of the rmr package, which makes it possible for R programmers to write map-reduce tasks in the R language, and have them run within the Hadoop cluster. This update is the "simplest and fastest rmr yet", according to lead developer Antonio Piccolboni. While previous releases added performance-improving vectorization capabilities to the interface, this release simplifies the API while still improving performance (for example, by using native serialization where appropriate). This release also adds some conveniance functions, for example for taking random samples from Big Data stored in Hadoop. You can find further details of the changes here, and download RHadoop here

RHadoop Project: Changelog

As you know, I’m not one to complain, ;-), but I read from above:

…this release simplifies the API while still improving performance [a good thing]

as contradicting the release notes that read in part:

…At the same time, we increased the complexity of the API. With this version we tried to define a synthesis between all the modes (record-at-a-time, vectorized and structured) present in 1.3, with the following goals:

  • bring the footprint of the API back to 1.2 levels.
  • make sure that no matter what the corner of the API one is exercising, he or she can rely on simple properties and invariants; writing an identity mapreduce should be trivial.
  • encourage writing the most efficient and idiomatic R code from the start, as opposed to writing against a simple API first and then developing a vectorized version for speed.

After reading the change notes, I’m going with the “simplifies the API” riff.

Take a close look and see what you think.

The Titan Informatics Toolkit

Filed under: BigData,Distributed RAM,Supercomputing — Patrick Durusau @ 9:14 am

The Titan Informatics Toolkit

From the webpage:

A collaborative effort between Sandia National Laboratories and Kitware Inc., the Titan™ Informatics Toolkit is a collection of scalable algorithms for data ingestion and analysis that share a common set of data structures and a flexible, component-based pipeline architecture. The algorithms in Titan span a broad range of structured and unstructured analysis techniques, and are particularly suited to parallel computation on distributed memory supercomputers.

Titan components may be used by application developers using their native C++ API on all popular platforms, or using a broad set of language bindings that include Python, Java, TCL, and more. Developers will combine Titan components with their own application-specific business logic and user interface code to address problems in a specific domain. Titan is used in applications varying from command-line utilities and straightforward graphical user interface tools to sophisticated client-server applications and web services, on platforms ranging from individual workstations to some of the most powerful supercomputers in the world.

I stumbled across this while searching for the Titan (as in graph database) project.

The Parallel Latent Semantic Analysis component is available now. I did not see release dates on other modules, such as Advanced Graph Algorithms.

Source (C++) for the Titan Informatics Toolkit is available.

Count unique items in a text file using Erlang

Filed under: Erlang,Sets — Patrick Durusau @ 9:08 am

Count unique items in a text file using Erlang by Paolo D’Incau.

From the post:

Many times during our programming daily routine, we have to deal with log files. Most of the log files I have seen so far are just text files where the useful information are stored line by line.

Let’s say you are implementing a super cool game backend in Erlang, probably you would end up with a bunch of servers implementing several actions (e.g. authentication, chat, store character progress etc etc); well I am pretty sure you would not store the characters info in a text file, but maybe (and I said maybe) you could find useful to store in a text file some of the information that comes from the authentication server.

Unique in the sense you are thinking.

But that happens, even in topic maps.

Parsing with Pictures

Filed under: Compilers,Graphs,Parsers,Parsing — Patrick Durusau @ 9:07 am

Parsing with Pictures by Keshav Pingali and Gianfranco Bilardi. (PDF file)

From an email that Keshav sent to the compilers@iecc.com email list:

Gianfranco Bilardi and I have developed a new approach to parsing context-free languages that we call “Parsing with pictures”. It provides an alternative (and, we believe, easier to understand) approach to context-free language parsing than the standard presentations using derivations or pushdown automata. It also unifies Earley, SLL, LL, SLR, and LR parsers among others.

Parsing problems are formulated as path problems in a graph called the grammar flow graph (GFG) that is easily constructed from a given grammar. Intuitively, the GFG is to context-free grammars what NFAs are to regular languages. Among other things, the paper has :

(i) an elementary derivation of Earley’s algorithm for parsing general context-free grammars, showing that it is an easy generalization of the well-known reachability-based NFA simulation algorithm,

(ii) a presentation of look-ahead that is independent of particular parsing strategies, and is based on a simple inter-procedural dataflow analysis,

(iii) GFG structural characterizations of LL and LR grammars that are simpler to understand than the standard definitions, and bring out a symmetry between these grammar classes,

(iv) derivations of recursive-descent and shift-reduce parsers for LL and LR grammars by optimizing the Earley parser to exploit this structure, and

(v) a connection between GFGs and NFAs for regular grammars based on the continuation-passing style (CPS) optimization.

Or if you prefer the more formal abstract:

The development of elegant and practical algorithms for parsing context-free languages is one of the major accomplishments of 20th century Computer Science. These algorithms are presented in the literature using string rewriting systems or abstract machines like pushdown automata, but the resulting descriptions are unsatisfactory for several reasons. First, even a basic understanding of parsing algorithms for some grammar classes such as LR(k) grammars requires mastering a formidable number of difficult concepts and terminology. Second, parsing algorithms for different grammar classes are often presented using entirely different formalisms, so the relationships between these grammar classes are obscured. Finally, these algorithms seem unrelated to algorithms for regular language recognition even though regular languages are a subset of context-free languages.

In this paper, we show that these problems are avoided if parsing is reformulated as the problem of finding certain kinds of paths in a graph called the Grammar Flow Graph (GFG) that is easily constructed from a context-free grammar. Intuitively, GFG’s permit parsing problems for context-free grammars to be formulated as path problems in graphs in the same way that non-deterministic finite-state automata do for regular grammars. We show that the GFG enables a unified treatment of Earley’s parser for general context-free grammars, recursive-descent parsers for LL(k) and SLL(k) grammars, and shift-reduce parsers for LR(k) and SLR(k) grammars. Computation of look-ahead sets becomes a simple interprocedural dataflow analysis. These results suggest that the GFG can be a new foundation for the study of context-free languages.

Odd as it may sound, some people want to be understood.

If you think being understood isn’t all that weird, do a slow read on this paper and provide feedback to the authors.

Apache Hadoop 2.0.2-alpha Released!

Filed under: Hadoop YARN — Patrick Durusau @ 9:06 am

Apache Hadoop 2.0.2-alpha Released! by Arun Murthy.

From the post:

It gives me great pleasure to announce that the Apache Hadoop community has voted to release Apache Hadoop 2.0.2-alpha.

This is the second (alpha) release of the next generation release of Apache Hadoop 2.x and comes with significant enhancements to both the major components of Hadoop:

  • HDFS HA has undergone significant enhancements since the previous release for NameNode High Availability
  • YARN has undergone significant testing, stabilization and validation, as it has been heavily battle-tested since the previous release.

These are exciting times indeed for the Apache Hadoop community – personally, this is very reminiscent of the period in 2009 when we finally saw the light at the end of the tunnel during the stabilization of Apache Hadoop 1.x (then called Apache Hadoop 0.20.x). A déjà vu, if you will – albeit of the pleasant kind! Yes, we have a few miles to clock, but it feels like the hardest part is already behind us. At the time of release, YARN has already been deployed on super-sized clusters with 2,000 nodes and 3,600 nodes (totaling to nearly 6,000 nodes) at Yahoo alone*.

Exciting times indeed!

Not unlike a star ship fast enough for time dilation to kick in.

Great!

But which way do you go first?

Hadoop 2.0 offers more efficient crunching of data. But efficient crunching of data is a means, not an end.

Which way will you go with Hadoop 2.0?

What questions will you ask that you can’t ask now?

How will you evaluate the answers?

Neo4J, RDF and Kevin Bacon

Filed under: Graphs,Neo4j,RDF — Patrick Durusau @ 9:01 am

Neo4J, RDF and Kevin Bacon by Tom Morris.

From the post:

Today, I managed to wangle my way into Off the Rails, a train hack day. I was helping friends with data mangling: OpenStreetMap, Dbpedia, RDF and Neo4J.

It’s funny actually. Way back when, if I said to people that there is some data that fits quite well into graph models, they’d look at me like some kind of dangerous looney. Graphs? Why? Doesn’t MySQL and JSON do everything I need?

Actually, no.

If you are trying to model a system where there are trains that travel on tracks between stations, that maps quite nicely to graphs, nodes and edges. If only there were databases and data models for that stuff, right?

Oh, yeah, there is. There’s Neo4J and there’s our old friend RDF, and the various triple store databases. I finally had a chance to play with Neo4J today. It’s pretty cool. And it shows us one of the primary issues with the RDF toolchain: it usually fails to implement the one thing any reasonable person wants from a graph store.

Kevin Bacon. Finding shortest path from one node to another with some kind of predicate filter. If you ask people what the one thing they want to do with a graph is, they’ll say: shortest path.

This is what Neo4J makes easy. I can download Neo4J in a Java (or JRuby, Scala, whatever) project, instantiate a database in the form of an embedded database, kinda like SQLite in Rails, parse a load of nodes and relations into it, then in two damn lines of Java find the shortest path between nodes.

The proper starting point for any project is: What questions do you want to ask?

Discovering the answer to that question will point you toward an appropriate technology.

See also: Shortest path problem (and improve the answer while you are there).
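
If “shortest path with a predicate filter” sounds abstract, here is a plain Python sketch, not the Neo4j API, of a breadth-first search that only follows edges passing a predicate. The toy graph is just for illustration.

```python
from collections import deque

# A tiny labeled graph: adjacency list of (neighbor, relationship) pairs.
GRAPH = {
    "Kevin Bacon": [("Apollo 13", "ACTED_IN")],
    "Apollo 13":   [("Kevin Bacon", "ACTED_IN"), ("Tom Hanks", "ACTED_IN")],
    "Tom Hanks":   [("Apollo 13", "ACTED_IN"), ("Big", "ACTED_IN")],
    "Big":         [("Tom Hanks", "ACTED_IN")],
}

def shortest_path(start, goal, edge_ok=lambda rel: True):
    """Breadth-first search that only traverses edges passing `edge_ok`."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor, rel in GRAPH.get(path[-1], []):
            if edge_ok(rel) and neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(shortest_path("Kevin Bacon", "Big", edge_ok=lambda r: r == "ACTED_IN"))
# ['Kevin Bacon', 'Apollo 13', 'Tom Hanks', 'Big']
```

Tom’s point is that a graph store should make this one-liner territory, which is exactly what Neo4j’s built-in path finding does.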

FEMA Acronyms, Abbreviations and Terms

Filed under: Government,Government Data,Vocabularies — Patrick Durusau @ 9:00 am

FEMA Acronyms, Abbreviations and Terms (PDF)

From the webpage:

The FAAT List is a handy reference for the myriad of acronyms and abbreviations used within the federal government, emergency management and first response communities. This year’s new edition, which continues to reflect the evolving U.S. Department of Homeland Security, contains an approximately 50 percent increase in the number of entries and definitions bringing the total to over 6,200 acronyms, abbreviations and terms. Some items listed are obsolete, but they are included because they may still appear in publications and correspondence. Obsolete items can be found at the end of this document.

This may be handy for reading FEMA or related government documents.

Hasn’t been updated since 2009.

If you know of a more recent resource, please give a shout.

October 16, 2012

The “O” Word (Ontology) Isn’t Enough

Filed under: Bioinformatics,Biomedical,Gene Ontology,Genome,Medical Informatics,Ontology — Patrick Durusau @ 10:36 am

The Units Ontology makes reference to the Gene Ontology as an example of a successful web ontology effort.

As it should. The Gene Ontology (GO) is the only successful web ontology effort. A universe with one (1) inhabitant.

The GO has a number of differences from wannabe successful ontology candidates. (see the article below)

The first difference echoes loudly across the semantic engineering universe:

One of the factors that account for GO’s success is that it originated from within the biological community rather than being created and subsequently imposed by external knowledge engineers. Terms were created by those who had expertise in the domain, thus avoiding the huge effort that would have been required for a computer scientist to learn and organize large amounts of biological functional information. This also led to general acceptance of the terminology and its organization within the community. This is not to say that there have been no disagreements among biologists over the conceptualization, and there is of course a protocol for arriving at a consensus when there is such a disagreement. However, a model of a domain is more likely to conform to the shared view of a community if the modelers are within or at least consult to a large degree with members of that community.

Did you catch that first line?

One of the factors that account for GO’s success is that it originated from within the biological community rather than being created and subsequently imposed by external knowledge engineers.

Saying the “O” word, ontology, and promising it will benefit everyone if they will just listen to you, isn’t enough.

There are other factors to consider:

A Short Study on the Success of the Gene Ontology by Michael Bada, Robert Stevens, Carole Goble, Yolanda Gil, Michael Ashburner, Judith A. Blake, J. Michael Cherry, Midori Harris, Suzanna Lewis.

Abstract:

While most ontologies have been used only by the groups who created them and for their initially defined purposes, the Gene Ontology (GO), an evolving structured controlled vocabulary of nearly 16,000 terms in the domain of biological functionality, has been widely used for annotation of biological-database entries and in biomedical research. As a set of learned lessons offered to other ontology developers, we list and briefly discuss the characteristics of GO that we believe are most responsible for its success: community involvement; clear goals; limited scope; simple, intuitive structure; continuous evolution; active curation; and early use.

Objectivity

Filed under: Graphs,InfiniteGraph — Patrick Durusau @ 10:14 am

Objectivity by Danny Bickson.

Danny has located one of the funniest “connect the dot” videos and a more serious one on InfiniteGraph, a distributed graph database.

Both videos are from Objectivity, maker of InfiniteGraph.

Danny mentions the full version of InfiniteGraph is “…rather expensive.”

Danny must not get out much.

A winning sports team (baseball, football, soccer), a successful business or effective government agency are expensive.

If you want to brag to the server at McDonald’s how cheap your IT costs are, that’s your choice as well.

Sometimes cheapness is its own reward.

Mortar Takes Aim at Hadoop Usability [girls keep out]

Filed under: Hadoop,Usability — Patrick Durusau @ 9:48 am

Maybe I am being overly sensitive but I don’t see a problem with:

…a phalanx of admins to oversee … a [Hadoop] operation

I mean, that’s why they have software/hardware: to provide places for admins to gather and play. Right? 😉

Or NOT!

Maybe Ian Armas Foster in Mortar Takes Aim at Hadoop Usability has some good points:

“Have a pile of under-utilized data? Want to use Hadoop but can’t spend weeks or months getting started?” According to fresh startup Mortar, these are questions that should appeal to potential Hadoop users, who are looking to wrap their arms around the elephant without hiring a phalanx of admins to oversee the operation.

Mortar claims to make Hadoop more accessible to the people most responsible for garnering insight from big data: data scientists and engineers. The young startup took flight when a couple of architects at Wireless Generation decided that big data tools and approaches were complex enough to warrant a new breed of offering–one that could take the hardware element out of Hadoop use.

(video omitted)

Hadoop is a terrific open-source data tool that can process and perform analytics (sometimes predictive) on big data and large datasets. An unfortunate property of Hadoop is its difficult utility. Many companies looking to get into big data simply invest in Hadoop clusters without a vision as to how to use the cluster or without the resources, human on monetary, to execute said vision.

“Hadoop is an amazing technology but for most companies it was out of reach,” said Young in a presentation at the New York City Data Business Meetup in September.

To combat this, Mortar is building a web based product-as-a-service in which someone need simply need log on to the Mortar website and then they can start writing the code allowing their pile of data to do what it wants. “We wanted to make operation very easy,” said Young “because it’s very hard to hire people with Hadoop expertise and because Hadoop is sort of famously hard to operate.”

A bit further in the article, it is claimed that a “data scientist” can be up and using Hadoop in one (1) hour.

Can you name another technology that is “…famously hard to operate?”

Do data integration, semantics, semantic web, RDF, master data management, topic maps come to mind?

If they do, what do you think can be done to make them easier to operate?

Having a hard to operate approach, technology or tool may be thrilling, in a “girls keep out” clubhouse sort of way, but it isn’t the road to success, commercial or otherwise.

Analyzing Twitter Data with Hadoop, Part 2: Gathering Data with Flume

Filed under: Cloudera,Flume,Hadoop,Tweets — Patrick Durusau @ 9:15 am

Analyzing Twitter Data with Hadoop, Part 2: Gathering Data with Flume by Jon Natkins.

From the post:

This is the second article in a series about analyzing Twitter data using some of the components of the Hadoop ecosystem available in CDH, Cloudera’s open-source distribution of Hadoop and related projects. In the first article, you learned how to pull CDH components together into a single cohesive application, but to really appreciate the flexibility of each of these components, we need to dive deeper.

Every story has a beginning, and every data pipeline has a source. So, to build Hadoop applications, we need to get data from a source into HDFS.

Apache Flume is one way to bring data into HDFS using CDH. The Apache Flume website describes Flume as “a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.” At the most basic level, Flume enables applications to collect data from its origin and send it to a resting location, such as HDFS. At a slightly more detailed level, Flume achieves this goal by defining dataflows consisting of three primary structures: sources, channels and sinks. The pieces of data that flow through Flume are called events, and the processes that run the dataflow are called agents.

A very good introduction to the use of Flume!
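
For a sense of how sources, channels and sinks are wired together, here is a minimal Flume agent configuration sketch. The agent and component names are placeholders, and the Twitter source class is my recollection of the custom one used in the Cloudera series, so treat the details as assumptions and check Jon’s post for the real file.

```
# Minimal sketch of a Flume agent: one source, one memory channel, one HDFS sink.
# Names (TwitterAgent, TwitterSrc, MemChannel, HdfsSink) and paths are placeholders.
TwitterAgent.sources  = TwitterSrc
TwitterAgent.channels = MemChannel
TwitterAgent.sinks    = HdfsSink

# The series uses a custom Twitter source; this class name is assumed from that series.
TwitterAgent.sources.TwitterSrc.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.TwitterSrc.channels = MemChannel

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000

TwitterAgent.sinks.HdfsSink.type = hdfs
TwitterAgent.sinks.HdfsSink.channel = MemChannel
TwitterAgent.sinks.HdfsSink.hdfs.path = hdfs://namenode:8020/user/flume/tweets/
```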

Does it seem to you that the number of examples using Twitter, not just for "big data" but in general, is on the rise?

Just a personal observation, subject to all the flaws ("all the buses were going the other way") of such observations.

Judging from the state of my inbox, some people are still writing more than 140 characters at a time.

Will it make a difference in our tools/thinking if we focus on shorter strings as opposed to longer ones?

Core & Peel Algorithm

Filed under: Graphs,Networks,Subgraphs — Patrick Durusau @ 5:04 am

Detecting dense communities in large social and information networks with the Core & Peel algorithm by Marco Pellegrini, Filippo Geraci, Miriam Baglioni.

Abstract:

Detecting and characterizing dense subgraphs (tight communities) in social and information networks is an important exploratory tool in social network analysis. Several approaches have been proposed that either (i) partition the whole network into clusters, even in low density region, or (ii) are aimed at finding a single densest community (and need to be iterated to find the next one). As social networks grow larger both approaches (i) and (ii) result in algorithms too slow to be practical, in particular when speed in analyzing the data is required. In this paper we propose an approach that aims at balancing efficiency of computation and expressiveness and manageability of the output community representation. We define the notion of a partial dense cover (PDC) of a graph. Intuitively a PDC of a graph is a collection of sets of nodes that (a) each set forms a disjoint dense induced subgraphs and (b) its removal leaves the residual graph without dense regions. Exact computation of PDC is an NP-complete problem, thus, we propose an efficient heuristic algorithms for computing a PDC which we christen Core and Peel. Moreover we propose a novel benchmarking technique that allows us to evaluate algorithms for computing PDC using the classical IR concepts of precision and recall even without a golden standard. Tests on 25 social and technological networks from the Stanford Large Network Dataset Collection confirm that Core and Peel is efficient and attains very high precison and recall.

Great name for an algorithm, marred somewhat by the long paper title.

If subgraphs or small groups in a network are among your subjects, take the time to review this new graph exploration technique.
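
For readers new to dense-subgraph mining, here is the classic k-core “peeling” step that the algorithm’s name evokes: repeatedly remove nodes whose degree falls below k and keep what survives. This is only an illustration of the building block, not the Core and Peel algorithm from the paper.

```python
# Classic k-core "peeling": repeatedly remove nodes of degree < k.
# An illustration of the building block, not the paper's Core & Peel algorithm.
def k_core(adjacency, k):
    """Return the set of nodes surviving in the k-core of an undirected graph."""
    adj = {u: set(vs) for u, vs in adjacency.items()}
    while True:
        low = [u for u, vs in adj.items() if len(vs) < k]
        if not low:
            return set(adj)
        for u in low:
            for v in adj.pop(u):       # peel u away...
                if v in adj:
                    adj[v].discard(u)  # ...and drop it from its neighbors

graph = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"},
    "c": {"a", "b", "d"}, "d": {"a", "b", "c", "e"},
    "e": {"d"},  # low-degree fringe node, peeled away for k = 3
}
print(k_core(graph, 3))   # {'a', 'b', 'c', 'd'} (set order may vary)
```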

Newsfeed feature powered by Neo4j Graph Database

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 4:53 am

Newsfeed feature powered by Neo4j Graph Database

From the post:

Implementation of newsfeed or timeline feature became a must requirement for every social application. Achieving the same result through traditional relational databases would have been cumbersome as well as inefficient due to number of joins in SQL query. Would post an another article on why graph databases especially NEO4J is an apt choice for social application. For now, this post would directly jump into the implementation of simple newsfeed feature using powerful graph database NEO4J. (This article assumes that reader already understands basic concepts of graph databases representation, Neo4j and Cypher query language).

Neo4j Documentation has already given us the data model to represent social network in graph database and queries to retrieve data. Instead of repeating the whole story, this post would compliment the existing doc by giving cypher queries to create and retrieve data for below data model.

Would-be Facebook killers will appreciate the example, but tying Neo4j to a standard news feed would be more useful.

Programming Languages Influence Network

Filed under: Graphics,Graphs,Networks,Visualization — Patrick Durusau @ 4:41 am

Programming Languages Influence Network by Ramiro Gómez.

From the about tab:

This interactive visualization shows a network graph of programming language influences. The graph consists of 1169 programming language nodes and 908 edges that signify an influence relation.

The size of a node is determined by its out degree. The more influential a language is across all languages in the network, the bigger is the corresponding node in the network.

There are several ways of interaction: you can restrict the network to languages within a programming paradigm; you can choose between a Force Atlas 2 and a random graph layout; and you can highlight the connections of a language by moving the mouse over its node. For more details click on the help link in the top menu.

Impressive graphics/visualization!

Suggestive of techniques for other networks of “influence.”

Data Curation in the Networked Humanities [Semantic Curation?]

Filed under: Curation,Humanities,Literature — Patrick Durusau @ 4:29 am

Data Curation in the Networked Humanities by Michael Ullyot.

From the post:

These talks are the first phase of Encoding Shakespeare: my SSHRC-funded project for the next three years. Between now and 2015, I’m working to improve the automated encoding of early modern English texts, to enable text analysis.

This post’s three parts are brought to you by the letter p. First I outline the potential of algorithmic text analysis; then the problem of messy data; and finally the protocols for a networked-humanities data curation system.

This third part is the most tentative, as of this writing; Fall 2012 is about defining my protocols and identifying which tags the most text-analysis engines require for the best results — whatever that entails. (So I welcome your comments and resource links.)

A project that promises to touch on many of the issues in modern digital humanities. Do review and contribute if possible.

I have a lingering uneasiness with the notion of “data curation.” With the data and not curation part.

To say “data curation” implies we can identify the “data” that merits curation.

I don’t doubt we can identify some data that needs curation. The question is whether it is the only data that merits curation.

We know from the early textual history of the Bible that the text was curated and in that process, variant traditions and entire works were lost.

Just my take on it but rather than “data curation,” with the implication of a “correct” text, we need semantic curation.

Semantic curation attempts to preserve the semantics we see in a text, without attempting to find the correct semantics.

Ready to Contribute to Apache Hadoop 2.0?

Filed under: Hadoop,Hadoop YARN,Hortonworks — Patrick Durusau @ 4:08 am

User feedback is a contribution to a software project.

Software can only mature with feedback, your feedback.

Otherwise the final deliverable has a “works on my machine” outcome.

Don’t let Apache Hadoop 2.0 have a “works on my machine” outcome.

Download the preview and contribute your experiences back to the community.

We will all be glad you did!

Details:

Hortonworks Data Platform 2.0 Alpha is Now Available for Preview! by Jeff Sposetti.

From the post:

We are very excited to announce the Alpha release of the Hortonworks Data Platform 2.0 (HDP 2.0 Alpha).

HDP 2.0 Alpha is built around Apache Hadoop 2.0, which improves availability of HDFS with High Availability for the NameNode along with several performance and reliability enhancements. Apache Hadoop 2.0 also significantly advances data processing in the Hadoop ecosystem with the introduction of YARN, a generic resource-management and application framework to support MapReduce and other paradigms such as real-time processing and graph processing.

In addition to Apache Hadoop 2.0, this release includes the essential Hadoop ecosystem projects such as Apache HBase, Apache Pig, Apache Hive, Apache HCatalog, Apache ZooKeeper and Apache Oozie to provide a fully integrated and verified Apache Hadoop 2.0 stack

Apache Hadoop 2.0 is well on the path to General Availability, and is already deployed at scale in several organizations; but it won’t get to the current maturity levels of the Hadoop 1.0 stack (available in Hortonworks Data Platform 1.x) without feedback and contributions from the community.

Hortonworks strongly believes that for open source technologies to mature and become widely adopted in the enterprise, you must balance innovation with stability. With HDP 2.0 Alpha, Hortonworks provides organizations an easy way to evaluate and gain experience with the Apache Hadoop 2.0 technology stack, and it presents the perfect opportunity to help bring stability to the platform and influence the future of the technology.

Report on XLDB Tutorial on Data Structures and Algorithms

Filed under: Algorithms,Data Structures,Fractal Trees,TokuDB,Tokutek — Patrick Durusau @ 3:55 am

Report on XLDB Tutorial on Data Structures and Algorithms by Michael Bender.

From the post:

The tutorial was organized as follows:

  • Module 0: Tutorial overview and introductions. We describe an observed (but not necessary) tradeoff in ingestion, querying, and freshness in traditional database.
  • Module 1: I/O model and cache-oblivious analysis.
  • Module 2: Write-optimized data structures. We give the optimal trade-off between inserts and point queries. We show how to build data structures that lie on this tradeoff curve.
  • Module 2 continued: Write-optimized data structures perform writes much faster than point queries; this asymmetry affects the design of an ACID compliant database.
  • Module 3: Case study – TokuFS. How to design and build a write-optimized file systems.
  • Module 4: Page-replacement algorithms. We give relevant theorems on the performance of page-replacement strategies such as LRU.
  • Module 5: Index design, including covering indexes.
  • Module 6: Log-structured merge trees and fractional cascading.
  • Module 7: Bloom filters.

These algorithms and data structures are used both in NoSQL implementations such as MongoDB, HBase and in SQL-oriented implementations such as MySQL and TokuDB.
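
Module 7’s Bloom filter is small enough to sketch in a few lines. A minimal, illustrative version (not the tutorial’s code), using salted hashes for the k probe positions:

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter sketch: m bits, k hash probes via salted MD5."""
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions by salting the item with the probe index.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("key-42")
print("key-42" in bf)   # True
print("key-43" in bf)   # almost certainly False
```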

The slides are available here.

A tutorial offered by Michael and Bradley C. Kuszmaul at the 6th XLDB conference.

If you are committed to defending your current implementation choices against all comers, don’t bother with the slides.

If you want a peek at one future path in data structures, get the slides. You won’t be disappointed.
