Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 5, 2012

Parallelizing Machine Learning– Functionally: A Framework and Abstractions for Parallel Graph Processing

Filed under: Parallel Programming,Scala — Patrick Durusau @ 7:57 pm

Parallelizing Machine Learning– Functionally: A Framework and Abstractions for Parallel Graph Processing by Heather Miller and Philipp Haller.

Abstract:

Implementing machine learning algorithms for large data, such as the Web graph and social networks, is challenging. Even though much research has focused on making sequential algorithms more scalable, their running times continue to be prohibitively long. Meanwhile, parallelization remains a formidable challenge for this class of problems, despite frameworks like MapReduce which hide much of the associated complexity. We present a framework for implementing parallel and distributed machine learning algorithms on large graphs, flexibly, through the use of functional programming abstractions. Our aim is a system that allows researchers and practitioners to quickly and easily implement (and experiment with) their algorithms in a parallel or distributed setting. We introduce functional combinators for the flexible composition of parallel, aggregation, and sequential steps. To the best of our knowledge, our system is the first to avoid inversion of control in a (bulk) synchronous parallel model.

An area of research that appears to have a great deal of promise. Very much worth your attention.

Highly Connected Data Models in NOSQL Stores

Filed under: Neo4j,NoSQL — Patrick Durusau @ 7:55 pm

Highly Connected Data Models in NOSQL Stores by Jim Webber.

From the description:

In this talk, we’ll talk about the key ideas of NOSQL databases, including motivating similarities and more importantly their different strengths and weaknesses. In more depth, we’ll focus on the characteristics of graph stores for connected data and the kinds of problems for which they are best suited. To reinforce how useful graph stores are, we provide a rapid, code-focussed example using Neo4j covering the basics of graph stores, and the APIs for manipulating and traversing graphs. We’ll then use this knowledge to explore the Doctor Who universe, using graph databases to infer useful knowledge from connected, semi-structured data. We conclude with a discussion of when different kinds of NOSQL stores are most appropriate in the enterprise.

Deeply amusing and informative presentation.

Perhaps the most telling point was how to decide between relational and graph database usage, based on the sparseness of the relational table. If the table is sparse, you don’t really have “square data” and are probably better off with a graph database.

Saw this in a tweet by Savas Parastatidis.

February 4, 2012

ADMS Public Review is launched

Filed under: ADMS,Ontology,RDF — Patrick Durusau @ 3:40 pm

ADMS Public Review is launched

Public Review ends: 6 February 2012

From the post:

The ISA programme of the European Commission launched the public review of the Asset Description Metadata Schema (ADMS) on 6 January 2012; it will end on 6 February 2012 (inclusive).

From mid 2012, the Joinup platform, of the ISA programme, will make available a large number of semantic interoperability assets, described using ADMS, through a federation of asset repositories of Member States, standardisation bodies and other relevant stakeholders.

Apologies for the late notice but this item just came to my attention.

This is version 0.8, so unless the EC follows Hadoop-style version numbering (jumping from 0.22 to 1.0), I suspect there will be additional opportunities to comment.

ADMS 0.8 has the following files:

At least as of today, 4 February 2012, the following two files don’t require you to answer whether you are willing to participate in a post-download survey. I know every marketing department thinks their in-house and amateurish surveys are meaningful. Not. Ask a professional survey group if you really want to do surveys. Expensive, but at least the results will be meaningful.

These five (5) files require you to register and accept the post-download survey, or to answer “No, I prefer to remain anonymous – start the download immediately” five (5) times.

The ADMS_Specification-v0.8.zip file contains ADMS_Specification-v0.8.pdf (which is listed above).

The specification document is thirty-five (35) pages long so it won’t take you long to read.

I was puzzled by the documentation note (dcterms:abstract) in the adms08.rdf file that reads:

ADMS is intended as a model that facilitates federation and co-operation. It is not the primary intention that repository owners redesign or convert their current systems and data to conform to ADMS, but rather that ADMS can act as a common layer among repositories that want to exchange data.

But the examples found in ADMS_Examples-v0.8.zip are dated variously:

  • 2011 – ADMS_Examples_Digitaliser_v0.03.pdf
  • 2010 – ADMS_Examples_ADMS_v0.03.pdf, ADMS_Examples_DCMES_v0.03.pdf
  • 2009 – ADMS_Examples_SKOS_v0.04.pdf

with version numbers (v0.03 and v0.04) that leave doubt about whether the examples are current with the specification draft.

Moreover, the examples are contrary to the goal of ADMS in that they present data in ADMS rather than using ADMS as a target vocabulary. In other words, if ADMS is a target vocabulary, give target vocabulary examples.

Do you have a feeling of deja vu reading these documents? Been here, done that? Which projects would you name off the top of your head that cover some, all or more than the ground covered here? (Extra points if you look up citations/URLs.)


Shameless self-promotion follows if you want to stop reading here.

It doesn’t look like my editing schedule is full for this year. Ghost or public editing of documentation or standards is available. ODF 1.2 is an example of what is possible with a dedicated technical team, like Sun had at Hamburg, backing me as an editor. It is undergoing revision, but no standard or document is ever perfect. Anyone who says differently is misinformed or lying.

Functional Relational Programming with Cascalog

Filed under: Cascalog,Clojure,Functional Programming — Patrick Durusau @ 3:39 pm

Functional Relational Programming with Cascalog by Stuart Sierra.

From the post:

In 2006, Ben Moseley and Peter Marks published a paper, Out of the Tar Pit, in which they coined the term Functional Relational Programming. “Out of the Tar Pit” was influential on Clojure’s design, particularly its emphasis on immutability and the separation of state from behavior. Moseley and Marks went further, however, in recommending that data be manipulated as relations. Relations are the abstract concept behind tables in a relational database or “facts” in some logic programming systems. Clojure does not enforce a relational model, but Clojure can be used for relational programming. For example, the clojure.set namespace defines relational algebra operations such as project and join.

In the early aughts, Jeffrey Dean and Sanjay Ghemawat developed the MapReduce programming model at Google to optimize the process of ranking web pages. MapReduce works well for I/O-bound problems where the computation on each record is small but the number of records is large. It specifically addresses the performance characteristics of modern commodity hardware, especially “disk is the new tape.”

Stuart briefly traces the development of Cascalog and says it is an implementation of Functional Relational Programming.
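To make the relational part concrete, here is a minimal Python sketch (not Cascalog or Clojure) of the project and join operations Stuart refers to, over small in-memory relations; the example data is invented for illustration.

```python
# Minimal sketch of relational "project" and "join" over in-memory
# relations (lists of records), mirroring the clojure.set operations
# mentioned above. Example data is invented.

def project(relation, keys):
    """Keep only the named attributes of each record."""
    return [{k: row[k] for k in keys} for row in relation]

def join(left, right, key):
    """Natural join on a single shared attribute."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

people = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Alan"}]
cities = [{"id": 1, "city": "London"}, {"id": 2, "city": "Manchester"}]

print(project(people, ["name"]))   # [{'name': 'Ada'}, {'name': 'Alan'}]
print(join(people, cities, "id"))  # records combining name and city
```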

What do you think?

Clojure and XNAT: Introduction

Filed under: Bioinformatics,Clojure,Neuroinformatics,Regexes,XNAT — Patrick Durusau @ 3:38 pm

Clojure and XNAT: Introduction

Over the last two years, I’ve been using Clojure quite a bit for managing, testing, and exploratory development in XNAT. Clojure is a new member of the Lisp family of languages that runs in the Java Virtual Machine. Two features of Clojure that I’ve found particularly useful are seamless Java interoperability and good support for interactive development.

“Interactive development” is a term that may need some explanation: With many languages — Java, C, and C++ come to mind — you write your code, compile it, and then run your program to test. Most Lisps, including Clojure, have a different model: you start the environment, write some code, test a function, make changes, and rerun your test with the new code. Any state necessary for the test stays in memory, so each write/compile/test iteration is fast. Developing in Clojure feels a lot like running an interpreted environment like Matlab, Mathematica, or R, but Clojure is a general-purpose language that compiles to JVM bytecode, with performance comparable to plain old Java.

One problem that comes up again and again on the XNAT discussion group and in our local XNAT support is that received DICOM files land in the unassigned prearchive rather than the intended project. Usually when this happens, there’s a custom rule for project identification where the regular expression doesn’t quite match what’s in the DICOM headers. Regular expressions are a wonderfully concise way of representing text patterns, but this sentence is equally true if you replace “wonderfully concise” with “maddeningly cryptic.”

Interesting “introduction” that focuses on regular expressions.
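As a hedged illustration of the mismatch problem (the project name, header values, and pattern below are hypothetical, not taken from XNAT), here is a routing regex that is slightly stricter than the data it is supposed to match:

```python
import re

# Hypothetical example: a project-routing rule that expects the project
# name at the start of a DICOM header value, and a value that does not
# quite match it (extra site prefix, different separator).
rule = re.compile(r"^MYPROJECT[_-]\d{3}$")

headers = [
    "MYPROJECT_001",        # matches: lands in the intended project
    "SITE2 MYPROJECT-001",  # no match: falls into the unassigned prearchive
]

for value in headers:
    status = "routed" if rule.match(value) else "unassigned"
    print(f"{value!r}: {status}")
```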

If you don’t know XNAT (I didn’t):

XNAT is an open source imaging informatics platform, developed by the Neuroinformatics Research Group at Washington University. It facilitates common management, productivity, and quality assurance tasks for imaging and associated data. Thanks to its extensibility, XNAT can be used to support a wide range of imaging-based projects.

Important neuroinformatics project based at Washington University, which has a history of very successful public technology projects.

Never hurts to learn more about any informatics project, particularly one in the medical sciences. With an introduction to Clojure as well, what more could you want?

Elasticsearch Using index templates & dynamic mappings

Filed under: Dynamic Mapping,logstash — Patrick Durusau @ 3:37 pm

Elasticsearch Using index templates & dynamic mappings

Enables faceted searches of logs using logstash.

If you don’t know logstash, you might want to take a quick tour.

I found it interesting that you can now parse events on a TCP socket.

What you want to add to logs, events, etc., for mapping purposes is entirely up to you.
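If you want to experiment, here is a rough Python sketch of registering an index template so that every logstash-* index picks up the same dynamic mapping. It uses the older /_template API of that era; the template name, settings, and dynamic template are assumptions for illustration, not taken from the post.

```python
import json
import urllib.request

# Rough sketch: register an index template so any index matching
# "logstash-*" gets the same settings and dynamic mappings. Uses the
# older /_template API; names and settings here are assumptions.
template = {
    "template": "logstash-*",
    "settings": {"number_of_shards": 1},
    "mappings": {
        "_default_": {
            "dynamic_templates": [{
                "strings_not_analyzed": {
                    "match": "*",
                    "match_mapping_type": "string",
                    "mapping": {"type": "string", "index": "not_analyzed"},
                }
            }]
        }
    },
}

req = urllib.request.Request(
    "http://localhost:9200/_template/logstash_template",
    data=json.dumps(template).encode("utf-8"),
    method="PUT",
)
print(urllib.request.urlopen(req).read())
```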

Hacking Chess: Data Munging

Filed under: Data Mining,MongoDB — Patrick Durusau @ 3:36 pm

Hacking Chess: Data Munging

Kristina Chodorow specifies a conversion from portable game notation (PGN) to JSON, for loading chess games into MongoDB.

Useful for her Hacking Chess with the MongoDB Pipeline post.

Addressing data in situ would be more robust but conversion is far more common.
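For a sense of what such a conversion involves, here is a minimal Python sketch of PGN-to-JSON (not Kristina's code); real PGN has comments, variations, and multi-game files that this ignores.

```python
import json
import re

# Minimal sketch of a PGN-to-JSON conversion: parse the bracketed tag
# pairs, then split the movetext into moves, dropping move numbers.
# Only handles a single, simple game.
def pgn_to_json(pgn_text):
    tags = dict(re.findall(r'\[(\w+) "([^"]*)"\]', pgn_text))
    movetext = re.sub(r'\[.*?\]', '', pgn_text).strip()
    tokens = [t for t in movetext.split() if not re.match(r'^\d+\.$', t)]
    doc = dict(tags, moves=tokens[:-1], result=tokens[-1])
    return json.dumps(doc)

game = '''[Event "Casual"]
[White "Alice"]
[Black "Bob"]
[Result "1-0"]

1. e4 e5 2. Qh5 Nc6 3. Bc4 Nf6 4. Qxf7# 1-0'''

print(pgn_to_json(game))
```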

When I get around to outlining a topic map book, I will have to include a chapter on data conversion techniques.

Cry Me A River, But First Let’s Agree About What A River Is

Filed under: Ontology,Semantic Diversity,Semantic Web — Patrick Durusau @ 3:34 pm

Cry Me A River, But First Let’s Agree About What A River Is

The post starts off well enough:

How do you define a forest? How about deforestation? It sounds like it would be fairly easy to get agreement on those terms. But beyond the basics – that a definition for the first would reflect that a forest is a place with lots of trees and the second would reflect that it’s a place where there used to be lots of trees – it’s not so simple.

And that has consequences for everything from academic and scientific research to government programs. As explained by Krzysztof Janowicz, perfectly valid definitions for these and other geographic terms exist by the hundreds, in legal texts and government documents and elsewhere, and most of them don’t agree with each other. So, how can one draw good conclusions or make important decisions when the data informing those is all over the map, so to speak.

….

Having enough data isn’t the problem – there’s official data from the government, volunteer data, private organization data, and so on – but if you want to do a SPARQL query of it to discover all towns in the U.S., you’re going to wind up with results that include the places in Utah with populations of less than 5,000, and Los Angeles too – since California legally defines cities and towns as the same thing.

“So this clearly blows up your data, because your analysis is you thinking that you are looking at small rural places,” he says.

This Big Data challenge is not a new problem for the geographic-information sciences community. But it is one that’s getting even more complicated, given the tremendous influx of more and more data from more and more sources: Satellite data, rich data in the form of audio and video, smart sensor network data, volunteer location data from efforts like the Citizen Science Project and services like Facebook Places and Foursquare. “The heterogeneity of data is still increasing. Semantic web tools would help you if you had the ontologies but we don’t have them,” he says. People have been trying to build top-level global ontologies for a couple of decades, but that approach hasn’t yet paid off, he thinks. There needs to be more of a bottom-up take: “The biggest challenge from my perspective is coming up with the rules systems and ontologies from the data.”

All true, and much of it is what objectors to the current Semantic Web approach have been saying for a very long time.

I am not sure about the line: “The heterogeneity of data is still increasing.”

In part because I don’t know of any reliable measure of heterogeneity by which a comparison could be made. True there is more data now than at some X point in the past, but that isn’t necessarily an indication of increased heterogeneity. But that is a minor point.

More serious is the “then a miracle occurs” statement that follows:

How to do it, he thinks, is to make very small and really local ontologies directly mined with the help of data mining or machine learning techniques, and then interlink them and use new kinds of reasoning to see how to reason in the presence of inconsistencies. “That approach is local ontologies that arrive from real application needs,” he says. “So we need ontologies and semantic web reasoning to have neater data that is human and also machine readable. And more effective querying based on analogy or similarity reasoning to find data sets that are relevant to our work and exclude data that may use the same terms but has different ontological assumptions underlying it.”

Doesn’t that have the same feel as the original Semantic Web proposals that were going to eliminate semantic ambiguity from the top down? The very approach that is panned in this article?

And “new kinds of reasoning,” ones I assume have not been invented yet, are going “to reason in the presence of inconsistencies.” And excluding data that “…has different ontological assumptions underlying it.”

Since we are the source of the ontological assumptions that underlie the use of terms, I am really curious how those assumptions are going to become available to these yet-to-be-invented reasoning techniques.

Oh, that’s right, we are all going to specify our ontological assumptions at the bottom to percolate up. Except that to be useful for machine reasoning, they will have to be as crude as the ones that were going to be imposed from the top down.

I wonder why the indeterminate nature of semantics continues to elude Semantic Web researchers. A snapshot of semantics today may be slightly incorrect tomorrow, probably incorrect in some respect in a month and almost surely incorrect in a year or more.

Take Saddam Hussein for example. One-time friend and confidant of Donald Rumsfeld (there are pictures). But over time those semantics changed, largely because Hussein slipped the leash and was no longer a proper vassal to the US. Suddenly, the weapons of mass destruction, in part nerve gas we caused to be sold to him, became a concern. And so Hussein became an enemy of the US. Same person, same facts. Different semantics.

There are less dramatic examples but you get the idea.

We can capture even changing semantics, but we need to decide what semantics we want to capture and at what cost. Perhaps that is a better way to frame my objection to most Semantic Web activities: they are not properly scoped. Yes?

You Only Wish MongoDB Wasn’t Relational

Filed under: MongoDB — Patrick Durusau @ 3:32 pm

You Only Wish MongoDB Wasn’t Relational.

From the post:

When choosing the stack for our TV guide service, we became interested in NoSQL dbs because we anticipated needing to scale horizontally. We evaluated several and settled on MongoDB. The main reason was that MongoDB got out of the way and let us get work done. You can read a little more about our production setup here.

So when you read that MongoDB is a document store, you might get the wonderful idea to store your relationships in a big document. Since mongo lets you reach into objects, you can query against them, right?

Several times, we’ve excitedly begun a schema this way, only to be forced to pull the nested documents out into their own collection. I’ll show you why, and why it’s not a big deal.

Perhaps a better title would have been: MongoDB: Relationships Optional. 😉

That is, you can specify relationships, but only to the extent necessary.
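A quick pymongo sketch of the pattern the post lands on: comments in their own collection, referencing the article by id. The database, collection, and field names below are my own, and a local mongod is assumed.

```python
from pymongo import MongoClient

# Sketch of the pattern the post ends up recommending: keep comments in
# their own collection and reference the article by id, rather than
# nesting everything in one big document. Names are made up; assumes a
# local mongod.
db = MongoClient()["tvguide"]

article_id = db.articles.insert_one({"title": "Pilot episode"}).inserted_id
db.comments.insert_one({"article_id": article_id, "text": "Loved it"})

# "Join" at query time: fetch the article, then its comments.
article = db.articles.find_one({"_id": article_id})
comments = list(db.comments.find({"article_id": article_id}))
print(article["title"], len(comments))
```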

Worth your time to read.

MapReduce Patterns, Algorithms and Use Cases

Filed under: Hadoop,MapReduce — Patrick Durusau @ 3:32 pm

MapReduce Patterns, Algorithms and Use Cases

Ilya Katsov writes:

In this article I digested a number of MapReduce patterns and algorithms to give a systematic view of the different techniques that can be found on the web or in scientific articles. Several practical case studies are also provided. All descriptions and code snippets use the standard Hadoop MapReduce model with Mappers, Reducers, Combiners, Partitioners, and sorting. This framework is depicted in the figure below.

An extensive list of MapReduce patterns and algorithms, complete with references at the end!
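If the Mapper/Combiner/Reducer vocabulary is new to you, here is a toy, in-memory Python walk-through of the basic shape (word count), which is the pattern most of the article's techniques build on.

```python
from collections import defaultdict

# Toy, in-memory walk-through of the basic MapReduce shape: map emits
# (key, value) pairs, the framework groups them by key (shuffle/sort),
# and reduce folds each group. Word count is the standard illustration.
def map_phase(records):
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["MapReduce patterns and algorithms", "patterns of MapReduce"]
print(reduce_phase(shuffle(map_phase(lines))))
```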

Forward to anyone interested in MapReduce.

Heroku & Neo4j Go To The Movies

Filed under: Heroku,Neo4j — Patrick Durusau @ 3:31 pm

Well, maybe the real title is: Neoflix Movie Recommender; it’s hard to keep track when you look at a number of websites/blogs every day. 😉

Seriously, you should check out this example of Neo4j on Heroku.

Where will your imagination take you with Neo4j and Heroku?

The NoSQL Landscape, Graph Db’s, and a Look at Neo4J

Filed under: Graphs,Neo4j — Patrick Durusau @ 3:29 pm

The NoSQL Landscape, Graph Db’s, and a Look at Neo4J

Jason Daniels has compiled a summary of the major graph database offerings.

A good starting point if you need to develop a more detailed comparison.

February 3, 2012

Great Maps with ggplot2

Filed under: Ggplot2,Graphics,Mapping,Maps — Patrick Durusau @ 5:03 pm

Great Maps with ggplot2

I have mentioned ggplot2 before but this item caught my eye because of its skillful use with a map of cycle tours of London.

Not that I intend to take a cycle tour of London any time soon, but it occurs to me that creating maps to restaurants, entertainment, etc., from conference sites would be a good use of it. Coupled with a topic map, as the conference progresses, reviews/tweets about those locations could become available to other participants.

Other geographic locations/information could be plotted as well.

MINE: Maximal Information-based NonParametric Exploration

Filed under: Data Mining,R — Patrick Durusau @ 5:03 pm

MINE: Maximal Information-based NonParametric Exploration

From the post:

There was a lot of buzz in the blogosphere as well as the science community about a new family of algorithms that are able to find non-linear relationships over extremely large fields of data. What makes it particularly useful is that the measure(s) it uses are based upon mutual information rather than standard Pearson-correlation-type measures, which do not capture non-linear relationships well.

The (java based) software can be downloaded here: http://www.exploredata.net/ In addition, there is the capability to directly run the software from R.
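To see why correlation-type measures miss non-linear relationships, here is a back-of-the-envelope Python comparison (not the MINE statistic itself): Pearson correlation is near zero on y = x², while a crude binned mutual-information estimate is clearly positive.

```python
import numpy as np

# A strong but non-linear relationship (y = x^2 on a symmetric range)
# has Pearson correlation near zero, while a crude binned mutual
# information estimate is clearly positive. Back-of-the-envelope only;
# this is not the MINE statistic.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 10_000)
y = x ** 2 + rng.normal(0, 0.01, 10_000)

print("Pearson r:", np.corrcoef(x, y)[0, 1])

# Binned mutual information in nats.
joint, _, _ = np.histogram2d(x, y, bins=20)
p_xy = joint / joint.sum()
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)
nonzero = p_xy > 0
mi = np.sum(p_xy[nonzero] * np.log(p_xy[nonzero] / (p_x @ p_y)[nonzero]))
print("Mutual information (nats):", mi)
```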

A data exploration tool for your weekend enjoyment!

Java, Python, Ruby, Linux, Windows, are all doomed

Filed under: Java,Linux OS,Parallelism,Python,Ruby — Patrick Durusau @ 5:02 pm

Java, Python, Ruby, Linux, Windows, are all doomed by Russell Winder.

From the description:

The Multicore Revolution gathers pace. Moore’s Law remains in force — chips are getting more and more transistors on a quarterly basis. Intel are now out and about touting the “many core chip”. The 80-core chip continues its role as research tool. The 48-core chip is now actively driving production engineering. Heterogeneity not homogeneity is the new “in” architecture.

Where Intel research leads, AMD and others cannot be far behind.

The virtual machine based architectures of the 1990s, Python, Ruby and Java, currently cannot cope with the new hardware architectures. Indeed Linux and Windows cannot cope with the new hardware architectures either. So either we will have lots of hardware which the software cannot cope with, or . . . . . . well you’ll just have to come to the session.

The slides are very hard to see so grab a copy at: http://www.russel.org.uk/Presentations/accu_london_2010-11-18.pdf

From the description: Heterogeneity not homogeneity is the new “in” architecture.

Is greater heterogeneity in programming languages coming?

Run your own Graph Database (neo4j) on ArchLinux PogoPlug

Filed under: Neo4j,Plug Computers — Patrick Durusau @ 5:01 pm

Run your own Graph Database (neo4j) on ArchLinux PogoPlug

Great guide to setting up Neo4j on ArchLinux PogoPlug.

I must confess I had to search for “ArchLinux PogoPlug” to understand what hardware was being described. 😉

I don’t run the heaters in my office very often during the winter (which is usually mild even in the northern part of Georgia, US) because of the heat output from the computers and monitors. I suppose if it gets really cold I could set up some of the older equipment, which are real heat generators.

Plug computers could be an important platform for Neo4j instances + data so I am creating a category for them.

Please forward Neo4j relevant work on plug computers to my attention.

Graph Visualization and Neo4j (Parts 1, 2, 3 (so far))

Filed under: Graphs,Neo4j,Neography,Processing.js,Visualization — Patrick Durusau @ 4:57 pm

Max De Marzi has a truly amazing series of posts on graph visualization and Neo4j!

Here is a quick list:

Graph Visualization and Neo4j

Highlights (besides Max’s code): Processing.js, radial navigation (Donut).

Graph Visualization and Neo4j – Part Two

Highlights: Continues with Processing.js and the canvas element from HTML 5. If you don’t know the canvas element, see the Mozilla Developer page Canvas tutorial.

Graph Visualization and Neo4j – Part Three

Highlights: D3.js, chord flare visualization, and to answer Max’s question about the resulting graphic: Yes, yes it is pretty!

Vector Clocks – Easy/Hard?

Filed under: Erlang,Riak — Patrick Durusau @ 4:55 pm

The Basho blog has a couple of very good posts on vector clocks:

Why Vector Clocks are Easy

Why Vector Clocks are Hard

The problem statement was as follows:

Alice, Ben, Cathy, and Dave are planning to meet next week for dinner. The planning starts with Alice suggesting they meet on Wednesday. Later, Dave discusses alternatives with Cathy, and they decide on Thursday instead. Dave also exchanges email with Ben, and they decide on Tuesday. When Alice pings everyone again to find out whether they still agree with her Wednesday suggestion, she gets mixed messages: Cathy claims to have settled on Thursday with Dave, and Ben claims to have settled on Tuesday with Dave. Dave can’t be reached, and so no one is able to determine the order in which these communications happened, and so none of Alice, Ben, and Cathy know whether Tuesday or Thursday is the correct choice.

Vector clocks are used to keep the order of communications clear. Something you will need in distributed systems, including those for topic maps.
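A minimal Python sketch of the mechanics, simplified from the dinner example above: each actor increments its own counter on an update, clocks are merged on receipt, and a conflict is two clocks where neither dominates the other.

```python
# Minimal vector clock sketch: each actor increments its own counter on
# an update and merges clocks it receives. Two clocks conflict when
# neither dominates the other -- the Tuesday/Thursday situation above.
def increment(clock, actor):
    clock = dict(clock)
    clock[actor] = clock.get(actor, 0) + 1
    return clock

def merge(a, b):
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def descends(a, b):
    """True if clock a has seen everything clock b has."""
    return all(a.get(k, 0) >= v for k, v in b.items())

wed = increment({}, "alice")                      # Alice: Wednesday
thu = increment(increment(wed, "dave"), "cathy")  # Dave + Cathy: Thursday
tue = increment(increment(wed, "dave"), "ben")    # Dave + Ben: Tuesday

print(descends(thu, tue), descends(tue, thu))     # False False -> conflict
print(merge(thu, tue))  # a merged clock that dominates both, once resolved
```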

Building Distributed Systems with Riak Core

Filed under: Distributed Systems,Riak — Patrick Durusau @ 4:54 pm

Building Distributed Systems with Riak Core by Steve Vinoski (Basho).

From the description:

Riak Core is the distributed systems foundation for the Riak distributed database and the Riak Search full-text indexing system. Riak Core provides a proven architecture and key functionality required to quickly build scalable, distributed applications. This talk will cover the origins of Riak Core, the abstractions and functionality it provides, and some guidance on building distributed systems.

Rest assured or be forewarned that there is no Erlang code in this presentation.

For all that, it is still a very informative presentation on building scalable, distributed applications.

Seal

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 4:53 pm

Seal

From the site:

Seal is a Hadoop-based distributed short read alignment and analysis toolkit. Currently Seal includes tools for: read demultiplexing, read alignment, duplicate read removal, sorting read mappings, and calculating statistics for empirical base quality recalibration. Seal scales, easily handling TB of data.

Features:

  • short read alignment (based on BWA)
  • duplicate read identification
  • sort read mappings
  • calculate empirical base quality recalibration tables
  • fast, scalable, reliable (runs on Hadoop)

Seal website with extensive documentation.

Karmasphere Studio Community Edition

Filed under: Hadoop,Hive,Karmasphere — Patrick Durusau @ 4:52 pm

Karmasphere Studio Community Edition

From the webpage:

Karmasphere Studio Community Edition is the free edition of our graphical development environment that facilitates learning Hadoop MapReduce jobs. It supports the prototyping, developing, and testing phases of the Hadoop development lifecycle.

The parallel and parameterized queries features in their Analyst product attracted me to the site:

From the webpage:

According to Karmasphere, the updated version of Analyst offers a parallel query capability that they say will make it faster for data analysts to iteratively query their data and create visualizations. The company claims that the new update allows data analysts to submit queries, view results, submit a new set and then compare those results across the previous outputs. In essence, this means users can run an unlimited number of queries concurrently on Hadoop so that one or more data sets can be viewed while the others are being generated.

Karmasphere also says that the introduction of parameterized queries allows users to submit their queries as they go, while offering them output in easy-to-read graphical representations of the findings, in Excel spreadsheets, or across a number of other outside reporting tools.

Hey, it says “…in Excel spreadsheets,” do you think they are reading my blog? (Spreadsheet -> Topic Maps: Wrong Direction? 😉 I didn’t really think so either.) I do take that as validation of the idea that offering users a familiar interface is more likely to be successful than an unfamiliar one.

DIMACS Implementation Challenges

Filed under: Graphs — Patrick Durusau @ 4:51 pm

DIMACS Implementation Challenges

Here you will find The Famous DIMACS Graph Format, which is also accepted by GraphInsight.

From the webpage:

Quite a few research papers have been referring to the DIMACS graph format. The first Challenge used networks (directed graphs with edge and node capacities) and undirected graphs (for matching), and the second Challenge used undirected graphs. Extending these formats to directed graphs should be straightforward. Specifications for the Challenge 1 formats are available by anonymous ftp (or through the DIMACS web page Previous Challenges).
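For reference, here is a tiny Python reader for the DIMACS edge format (comment lines, a problem line, then edge lines); the sample graph is made up.

```python
# Tiny reader for the DIMACS edge format: "c" lines are comments, the
# "p edge <nodes> <edges>" line gives the sizes, and each "e u v" line
# is an undirected edge (1-based vertex numbers).
def read_dimacs(lines):
    n_nodes, edges = 0, []
    for line in lines:
        parts = line.split()
        if not parts or parts[0] == "c":
            continue
        if parts[0] == "p":            # e.g. "p edge 4 3"
            n_nodes = int(parts[2])
        elif parts[0] == "e":          # e.g. "e 1 2"
            edges.append((int(parts[1]), int(parts[2])))
    return n_nodes, edges

sample = ["c toy graph", "p edge 4 3", "e 1 2", "e 2 3", "e 3 4"]
print(read_dimacs(sample))  # (4, [(1, 2), (2, 3), (3, 4)])
```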

The implementation challenges page lists several other challenges with a rich supply of materials, including “other” graph formats.

The challenges page is just a sampling of the resources located at DIMACS (homepage).

February 2, 2012

Pajek – Program for Large Network Analysis

Filed under: Graphs,Pajek,Visualization — Patrick Durusau @ 3:50 pm

Pajek – Program for Large Network Analysis

From the webpage:

Pajek (Slovene word for Spider) is a program, for Windows, for analysis and visualization of large networks. It is freely available, for noncommercial use, at its download page. See also a reference manual for Pajek (in PDF). The development of Pajek is traced in its History.

I was looking for input formats for GraphInsight and the “Pajek” format was one of the formats mentioned.
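The Pajek .net format itself is simple: a *Vertices section of numbered, quoted labels, followed by *Edges (or *Arcs) pairs. A small Python writer, with invented data, as a sketch:

```python
# Quick sketch of the Pajek .net format: a "*Vertices n" section with
# numbered, quoted labels, followed by an "*Edges" (or "*Arcs") section
# of vertex-number pairs. Example graph is invented.
def write_pajek(labels, edges):
    ids = {label: i + 1 for i, label in enumerate(labels)}
    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{i} "{label}"' for label, i in ids.items()]
    lines.append("*Edges")
    lines += [f"{ids[a]} {ids[b]}" for a, b in edges]
    return "\n".join(lines)

print(write_pajek(["alice", "bob", "carol"],
                  [("alice", "bob"), ("bob", "carol")]))
```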

The reference manual runs almost one hundred (100) pages and the software has a large number of interesting looking features.

Question: Do you think Pajek (or any other graph program) can usefully visualize inconsistent data? That is, two sets of inconsistently labeled graph data.

(Seeing Kristina’s Hacking Chess with the MongoDB Pipeline and the Pajek format made me realize something everyone else has probably known forever. More on that tomorrow.)

Introducing Hypernotation, an alternative to Linked Data

Filed under: Hypernotation,Linked Data,Semantic Web — Patrick Durusau @ 3:49 pm

Introducing Hypernotation, an alternative to Linked Data

A competing notation to Linked Data:

From the post:

URL, URI, IRI, URIref, CURIE, QName, slash URIs, hash URIs, bnodes, information resources, non-information resources, dereferencability, HTTP 303, redirection, content-negotiation, RDF model, RDF syntax, RDFa core, RDFa lite, Microdata, Turtle, N3, RDF/XML, JSON-LD, RDF/JSON…

Want to publish some data? Well, these are some of the things you will have to learn and understand to do so. Is the concept of data really so hard that you can’t publish it without understanding the concepts of information and non-information resources? Do you really need to deal with the HTTP 303 redirection and a number of different syntaxes? It’s just data, damn it!

Really, how have we got to this?

I did a detailed analysis on the problems of Linked Data, but it seems that I missed the most important thing. It’s not about the Web technologies but about economics. The key Linked Data problem is that it holds a monopoly in the market. One can’t compare it to anything else, and thus one can’t be objective about it. There is no competition, and without competition, there is no real progress. Without competition, it’s possible for many odd ideas to survive, such as requiring people to implement HTTP 303 redirection.

As a competitor to Linked Data, this proposal should lead to a re-examination of many of the decisions that have led to and sustain Linked Data. I say “should,” not that it will lead to such a re-examination. At least not now. Perhaps when the next “universal” semantic syntax comes along.

You may find An example of Hypernotation useful in reading the Hypernotation post.

Apache Solr crash course

Filed under: Solr — Patrick Durusau @ 3:48 pm

Apache Solr crash course by Tommaso Teofili.

While I was looking for aggregation material on Solr, this slide deck came up.

Since the speaker isn’t present to fill in the gaps, I use presentation decks as outlines of the essential points.

Look those up in the documentation and spread out from there.

I think this one is particularly useful.

Ontopia 5.2.0

Filed under: Ontopia,Topic Map Software — Patrick Durusau @ 3:47 pm

Ontopia 5.2.0

A new release from the Ontopia project has hit the street! Ontopia 5.2.0!

From the “What’s New” document in the distribution:

This is the first release in the new Maven structure. It includes the modularization of Ontopia, along with bug fixes and some new functionality.

The following changes have been made:

  • Ontopia is now divided into Maven modules based on functionality. For developers working with Ontopia as a dependency, this means there is a more controlled way of including parts of Ontopia as a dependency. This change does not affect Ontopia distribution users.
  • The distribution has been updated to include Tomcat version 6.
  • The DB2TM functionality has been extended and improved.
  • Ontopoly had several outstanding bugs. Support for exporting TM/XML and schema without data was added.
  • Tolog now supports negative integer values and some basic numeric operations through the numbers module.
  • Ontopia now uses Lucene 2.9.4 (up from 2.2.0).

Thirty-seven (37) bugs were squashed but you will need to consult the “What’s New” file for the details.

Please send notes of congratulation to the team for the new release. They know you are grateful but a little active encouragement can go a long way.

Hacking Chess with the MongoDB Pipeline

Filed under: Aggregation,MongoDB — Patrick Durusau @ 3:46 pm

Hacking Chess with the MongoDB Pipeline

Kristina Chodorow* writes:

MongoDB’s new aggregation framework is now available in the nightly build! This post demonstrates some of its capabilities by using it to analyze chess games.

Make sure you have the “Development Release (Unstable)” nightly running before trying out the stuff in this post. The aggregation framework will be in 2.1.0, but as of this writing it’s only in the nightly build.

First, we need some chess games to analyze. Download games.json, which contains 1132 games that were won in 10 moves or less (crush their soul and do it quick).

You can use mongoimport to import games.json into MongoDB:

If you think of this example of “aggregation” as merging where the subjects have a uniform identifier (chess piece/move), you will understand why I find it interesting.

Aggregation, as is shown by Kristina’s post, can form the basis for analysis of data.

Analysis that isn’t possible in the absence of aggregation (read merging).
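As a sketch of what pipelines like Kristina's look like from pymongo (the field names below are assumptions, and a local mongod with the imported games collection is assumed):

```python
from pymongo import MongoClient

# Sketch of an aggregation pipeline over the games collection; assumes
# a local mongod and that each game document has a "moves" array
# (field names are assumptions here, not taken from the post).
games = MongoClient()["chess"]["games"]

pipeline = [
    {"$unwind": "$moves"},                        # one document per move
    {"$group": {"_id": "$moves", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 5},                                # five most common moves
]
for doc in games.aggregate(pipeline):
    print(doc["_id"], doc["count"])
```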

I am looking forward to additional posts on the aggregation framework and need to drop by the MongoDB project to see what the future holds on aggregation/merging.

*Kristina is the author of two O’Reilly titles, MongoDB: The Definitive Guide and Scaling MongoDB.

Query time joining in Lucene

Filed under: Joins,Lucene — Patrick Durusau @ 3:40 pm

Query time joining in Lucene

From the post:

Recently query time joining has been added to the Lucene join module in the Lucene svn trunk. The query time joining will be included in the Lucene 4.0 release and there is a possibility that it will also be included in Lucene 3.6.

Let’s say we have articles and comments. With the query time join you can store these entities as separate documents. Each comment and article can be updated without re-indexing large parts of your index. Even better would be to store articles in an article index and comments in a comment index! In both cases a comment would have a field containing the article identifier.

Joins based upon matching terms in different indexes.
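Conceptually (this is not the Lucene API, just a Python sketch of the idea): run a query against the comments, collect the join terms it yields, then filter the articles by those terms.

```python
# Conceptual sketch (not the Lucene API) of a query-time join: query the
# "from" side (comments), collect its join terms, then use those terms
# as a filter on the "to" side (articles). Data is invented.
articles = [
    {"id": "a1", "title": "Graph databases"},
    {"id": "a2", "title": "Vector clocks"},
]
comments = [
    {"article_id": "a1", "text": "great intro"},
    {"article_id": "a1", "text": "needs examples"},
    {"article_id": "a2", "text": "too short"},
]

def join_query(from_docs, from_query, from_field, to_docs, to_field):
    join_terms = {d[from_field] for d in from_docs if from_query(d)}
    return [d for d in to_docs if d[to_field] in join_terms]

# Articles having at least one comment that mentions "examples".
hits = join_query(comments, lambda c: "examples" in c["text"],
                  "article_id", articles, "id")
print(hits)  # [{'id': 'a1', 'title': 'Graph databases'}]
```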

Work is not finished yet, so now would be the time to contribute your experiences or opinions.

Spot

Filed under: Tweets,Visualization — Patrick Durusau @ 3:39 pm

Spot by Jeff Clark.

From the post:

Spot is an interactive real-time Twitter visualization that uses a particle metaphor to represent tweets. The tweet particles are called spots and get organized in various configurations to illustrate information about the topic of interest.

Spot has an entry field at the lower-left corner where you can type any valid Twitter search query. The latest 200 tweets will be gathered and used for the visualization. Note that Twitter search results only go back about a week so a search for a rare topic may only return a few. When you enter a query the URL is changed so you can easily bookmark it or send it to someone…

The Different Views

Here is a complete list of the views and what they show:

  1. Group View (speech bubble icon) places tweets that share common words inside large circles
  2. Timeline View (watch icon) places tweets along a timeline based on when they were sent
  3. User View (person icon) shows a bar chart with the people sending the most tweets in the set
  4. Word View (Word Circle icon) directly shows word bubbles with tweets attracted to the words they contain
  5. Source View (Megaphone icon) a bar chart showing the tool used to send the tweets (or sometimes the news source)

What do you like/dislike about the visualization? Is it specific to Twitter or do you see adaptations that could be made for other data sets?

IMDb Alternative Interfaces

Filed under: Data,Dataset,IMDb — Patrick Durusau @ 3:39 pm

IMDb Alternative Interfaces.

From the webpage:

This page describes various alternate ways to access The Internet Movie Database locally by holding copies of the data directly on your system. See more about using our data on the Non-Commercial Licensing page.

It’s an interesting data set and I am sure its owners would not mind your sending them a screencast of some improved access you have created to their data.

That might actually be an interesting model for developing better interfaces to data served up to the public anyway. Release it for strictly personal use and see who does the best job with it. A screencast would not disclose any of your source code or processes, protecting the interest of the software author.

Just a thought.

First noticed this on PeteSearch.
