Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

May 17, 2011

TunedIT

Filed under: Algorithms,Data Mining,Machine Learning — Patrick Durusau @ 2:52 pm

TunedIT – Machine Learning & Data Mining Algorithms: Automated Tests, Repeatable Experiments, Meaningful Results

There are two parts to the TunedIT site:

TunedIT Research

TunedIT Research is an open platform for reproducible evaluation of machine learning and data mining algorithms. Everyone may use TunedIT tools to launch reproducible experiments and share results with others. Reproducibility is achieved through automation. Datasets and algorithms, as well as experimental results, are collected in central databases: the Repository and the Knowledge Base, to enable comparison of a wide range of algorithms and to facilitate dissemination of research findings and cooperation between researchers. Everyone may access the contents of TunedIT and contribute new resources and results.

TunedIT Challenge

The TunedIT project was established in 2008 as a free and open experimentation platform for data mining scientists, specialists and programmers. It was extended in 2009 with a framework for online data mining competitions, used initially for laboratory classes at universities. Today, we provide a diverse range of competition types – for didactic, scientific and business purposes.

  • Student Challenge — For closed member groups. Perfectly suited for organizing assignments for students attending laboratory classes. Restricted access and visibility, for members of the group only. FREE of charge
  • Scientific Challenge — Open contest for non-commercial purposes. Typically associated with a conference, journal or scientific organization. Concludes with public dissemination of results and winning algorithms. May feature prizes. Fee: FREE or 20%
  • Industrial Challenge — Open contest with a commercial purpose. Intellectual Property can be transferred at the end. No requirement for dissemination of solutions. Fee: 30%

This looks like a possible way to generate some publicity about and interest in topic maps.

Suggestions of existing public data sets that would be of interest to a fairly broad audience?

I suspect we would model some common things the same and other common things differently.

It would be interesting to compare the results.

sones GraphDB 2.0

Filed under: GraphDB,NoSQL — Patrick Durusau @ 2:51 pm

sones GraphDB 2.0

From the press release on GraphDB 2.0:

  • High-performance graph database based on a property hypergraph
  • Optimized for multiprocessor/multicore systems
  • Platform-independent (Linux, Windows, OSX)
  • Modular architecture
  • OpenSource (AGPLv3) and proprietary enterprise license
  • Intuitive, easy-to-learn query language: GQL (Graph Query Language)
  • Powerful API and traverse API
  • Integrated REST interfaces and administration tools
  • Optional persistence plug-ins
  • Client libraries in many popular programming languages (Java, C#, Javascript, PHP, …)
  • Integrated Javascript UI

Another open source, high-performance graph database.

Graph databases seem to be on the rise in popularity.

Graph databases can model the relational model and more.

But is that like markup trees being subsets of graphs?

That we find it easier to use subsets of the capabilities of graphs?
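The relational-modeling point above can be illustrated with nothing more than plain data structures. This is a toy sketch, not any particular graph database's API; all names (people, orders, the PLACED label) are hypothetical:

```python
# Table rows become nodes; a foreign key column becomes a typed edge.
nodes = {
    "p1": {"type": "Person", "name": "Alice"},
    "p2": {"type": "Person", "name": "Bob"},
    "o1": {"type": "Order", "item": "book"},
    "o2": {"type": "Order", "item": "lamp"},
}

# Each edge is (source, label, target); PLACED plays the role of the
# foreign key in a relational orders table.
edges = [
    ("p1", "PLACED", "o1"),
    ("p1", "PLACED", "o2"),
]

def orders_for(person_id):
    """The graph equivalent of a relational join on the foreign key."""
    return [nodes[target]["item"]
            for (source, label, target) in edges
            if source == person_id and label == "PLACED"]

print(orders_for("p1"))  # → ['book', 'lamp']
print(orders_for("p2"))  # → []
```

The "and more" comes from edges being first-class: you can traverse them in either direction, label them freely, and add new edge types without a schema migration.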

AllegroGraph – A General Purpose Graph DB

Filed under: Graphs — Patrick Durusau @ 2:50 pm

AllegroGraph – A General Purpose Graph DB

From the webpage:

Recently, there has been significant interest in the application of graphs in different domains and while Franz’s focus has been primarily in the Semantic Web domain, AllegroGraph is a general purpose graph database designed to store more than standard RDF.

In April we presented an invited talk at GDM 2011 (International Workshop on Graph Data Management: Techniques and Applications) and discussed the capabilities of AllegroGraph as a general purpose graph engine. During this webcast we will present our GDM 2011 material covering tips and techniques for using AllegroGraph as a general graph database. We will also present some comparisons vs other graph databases and a Gremlin inspired graph traversal language.

Video and slides are available.

Riak Search Explained

Filed under: Erlang,Riak — Patrick Durusau @ 2:49 pm

Riak Search Explained

From Alex Popescu's myNoSQL, a pointer to an explanation of Riak Search.

Covers:

  • Full-text search built on Riak Core
  • Easy to use (start, join, done)
  • Solr compatible interface (just mentioned)
  • Riak KV integration (bulk of the presentation)

The focus of the presentation is integrating full-text search, built on Riak Core, into another application.

Riak Search is a superset of Riak KV (install only one).

Riak Search source code.

Eventually Consistent?

Filed under: Erlang,Riak — Patrick Durusau @ 2:48 pm

statebox, an eventually consistent data model for Erlang (and Riak)

From the post:

When you choose an eventually consistent data store you’re prioritizing availability and partition tolerance over consistency, but this doesn’t mean your application has to be inconsistent. What it does mean is that you have to move your conflict resolution from writes to reads. Riak does almost all of the hard work for you [2], but if it’s not acceptable to discard some writes then you will have to set allow_mult to true on your bucket(s) and handle siblings [3] from your application. In some cases, this might be trivial. For example, if you have a set and only support adding to that set, then a merge operation is just the union of those two sets.

statebox is my solution to this problem. It bundles the value with repeatable operations [4] and provides a means to automatically resolve conflicts. Usage of statebox feels much more declarative than imperative. Instead of modifying the values yourself, you provide statebox with a list of operations and it will apply them to create a new statebox. This is necessary because it may apply this operation again at a later time when resolving a conflict between siblings on read.

I like that, “move conflict resolution from writes to reads.”

Sounds like where ISO/IEC 13250 points out that two or more topic links may be merged, and/or applications may process and/or render them as if they had been merged. (5.2.1 Topic Link Architectural Form)

Which fits your topic maps use case better? Consistency (one representative per subject) on write or read?
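The grow-only set example from the quote is easy to sketch. This is a toy illustration of resolving siblings on read, not the statebox API itself:

```python
# When a Riak bucket has allow_mult=true, a read after a network
# partition can return several sibling values. If the value is a set
# that only ever grows, the merge on read is just set union.

def merge_siblings(siblings):
    """Deterministically merge conflicting replica values on read."""
    merged = set()
    for value in siblings:
        merged |= value
    return merged

# Two replicas each accepted a write during a partition:
replica_a = {"alice", "bob"}
replica_b = {"alice", "carol"}

print(sorted(merge_siblings([replica_a, replica_b])))
# → ['alice', 'bob', 'carol']
```

Union is commutative, associative, and idempotent, which is exactly why it can be applied again later without harm — the property statebox generalizes with its repeatable operations.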

Data Serialization

Filed under: Data,Data Streams — Patrick Durusau @ 2:47 pm

Three Reasons Why Apache Avro Data Serialization is a Good Choice for OpenRTB

From the post:

I recently evaluated several serialization frameworks including Thrift, Protocol Buffers, and Avro for a solution to address our needs as a demand side platform, but also for a protocol framework to use for the OpenRTB marketplace as well. The working draft of OpenRTB 2.0 uses simple JSON encoding, which has many advantages including simplicity and ubiquity of support. Many OpenRTB contributors requested we support at least one binary standard as well, to improve bandwidth usage and CPU processing time for real-time bidding at scale.

If you are in need of a data serialization framework this is a good place to start reading.
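The bandwidth argument for a binary standard is easy to see in miniature. This stdlib sketch is not the actual Avro wire format — just an illustration of why schema-based binary encoding beats JSON when both sides know the schema, since field names never travel on the wire (the bid fields here are hypothetical):

```python
import json
import struct

bid = {"price": 1.25, "width": 300, "height": 250}

# JSON repeats every field name in every message.
json_bytes = json.dumps(bid).encode("utf-8")

# With an agreed schema (a little-endian float and two unsigned
# shorts), only the values are packed, in schema order.
binary_bytes = struct.pack("<fHH", bid["price"], bid["width"], bid["height"])

print(len(json_bytes), len(binary_bytes))  # JSON is several times larger
```

At real-time bidding volumes, that per-message difference multiplies into serious bandwidth and CPU savings, which is what the OpenRTB contributors were asking for.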

May 16, 2011

The Filter Bubble: Algorithm vs. Curator & the Value of Serendipity

Filed under: Data Silos,Filters,Personalization — Patrick Durusau @ 3:33 pm

The Filter Bubble: Algorithm vs. Curator & the Value of Serendipity by Maria Popova.

Covers the same TED presentation that I mention at On the dangers of personalization but with the value-add that Maria both interviews Eli Pariser and talks about his new book, The Filter Bubble.

I remain untroubled by filtering.

We filter the information we give others around us.

Advertisers filter the information they present in commercials.

For example, I don’t recall any Toyota ads that end with: Buy a Toyota ****, your odds of being in a recall are 1 in ***. That’s filtering.

Two things would increase my appreciation for Google filtering:

First, much better filtering, where I can choose narrow-band filter(s) based on my interests.

Second, the ability to turn the filters off at my option.

You see, I don’t agree that there is information I need to know as determined by someone else.

Here’s an interesting question: What information would you filter from: www.cnn.com?

Semagia Oomap Loomap

Filed under: Oomap Loomap,TMQL — Patrick Durusau @ 3:32 pm

Semagia Oomap Loomap

From Lars Heuer:

GUI for Topic Maps query languages.

Supported languages:

Similar projects:
* TMQL Console <https://github.com/mhoyer/tmql-console>
* Tamana <https://code.google.com/a/eclipselabs.org/p/tamana/>

Can robots create their own language?

Filed under: Artificial Intelligence,Language — Patrick Durusau @ 3:25 pm

Can robots create their own language?

Another Luc Steels item from Robert Cerny:

Note

An overview of Luc Steels' work, including video demonstrations of robots playing the Naming Game.

Quote

What do we need to put in [the robots] so that they would self-organize a symbolic communication system?

Video of robots playing naming game.

The thought occurs to me: what if we had a video of people playing the naming game?

Or better yet, the subject identification game?

Or better still, the two instances are the same subject game?

Or are all three of those the same games?

The Recruitment Theory of Language Origins

Filed under: Artificial Intelligence,Language — Patrick Durusau @ 3:20 pm

The Recruitment Theory of Language Origins

An entry by Robert Cerny on Luc Steels' language acquisition research:

Note

Section 5, titled “The Naming Challenge”, describes a game in the field of robotics where agents need to find a way to communicate about a set of objects. This game is known as the “Naming Game”. It is interesting to look at these insights with a Topic Mappish mindset. It also confirms my point that subject descriptions decay in space and time.


Quote

Clearly every human language has a way to name individual objects or more generally categories to identify classes of objects. Computer simulations have already been carried out to determine what strategy for tackling this naming challenge could have become innate through natural selection or how a shared lexicon could arise through a simulation of genetic evolution.


The recruitment theory argues instead that each agent should autonomously discover strategies that allow him to successfully build up and negotiate a shared lexicon in peer-to-peer interaction and that the emerging lexicon is a temporal cultural consensus which is culturally transmitted.

Follow the link at Robert's post to read Steels' paper in full. It's important.

Emerging multidisciplinary research across database management systems

Filed under: Conferences,Database — Patrick Durusau @ 3:19 pm

Emerging multidisciplinary research across database management systems by Anisoara Nica, Fabian Suchanek (INRIA Saclay – Ile de France), Aparna Varde.

Abstract:

The database community is exploring more and more multidisciplinary avenues: Data semantics overlaps with ontology management; reasoning tasks venture into the domain of artificial intelligence; and data stream management and information retrieval shake hands, e.g., when processing Web click-streams. These new research avenues become evident, for example, in the topics that doctoral students choose for their dissertations. This paper surveys the emerging multidisciplinary research by doctoral students in database systems and related areas. It is based on the PIKM 2010, which is the 3rd Ph.D. workshop at the International Conference on Information and Knowledge Management (CIKM). The topics addressed include ontology development, data streams, natural language processing, medical databases, green energy, cloud computing, and exploratory search. In addition to core ideas from the workshop, we list some open research questions in these multidisciplinary areas.

Good overview of papers from PIKM 2010, a number of which will be of interest to topic mappers.

PIKM 2010 (You will need to use the Table of Contents tab.)

Annotations: dynamic semantics in stream processing

Filed under: Data Streams,Semantics — Patrick Durusau @ 3:15 pm

Annotations: dynamic semantics in stream processing by Juan Amiguet, Andreas Wombacher, and Tim E. Klifman.

Abstract:

In the field of e-science, stream data processing is commonplace, facilitating sensor networks, in particular for prediction and supporting decision making. However, sensor data may be erroneous, e.g. due to measurement errors (outliers) or changes of the environment. While it can be foreseen that there will be outliers, there are a lot of environmental changes which are not foreseen by scientists and therefore are not considered in the data processing. However, these unforeseen semantic changes – represented as annotations – have to be propagated through the processing. Since the annotations represent an unforeseen, hence un-understandable, change, the propagation has to be independent of the annotation semantics. It nevertheless has to preserve the significance of the annotation on the data despite structural and temporal transformations, and should remain meaningful for a user at the end of the data processing. In this paper, we identify the relevant research questions. In particular, the propagation of annotations is based on structural, temporal, and significance contribution, while the consumption of the annotation by the user focuses on clustering information to ease accessibility.

Examines the narrow case of temperature sensors but I don’t know of any reason why semantic change could not occur in a stream of data reported by a person.

May 15, 2011

Breaking Bin Laden: A Closer Look

Filed under: Authoring Topic Maps,Graphs — Patrick Durusau @ 5:57 pm

Breaking Bin Laden: A Closer Look

A post from the SocialFlow blog:

Since last Friday, when we first published Breaking Bin Laden: Visualizing the Power of a Single Tweet, our analysis and data visualization of the way news filtered out around the Bin Laden raid via the Twitter, we’ve been overwhelmed by the response. Thousands of Tweets, many in Spanish, French, German and Japanese.

There have been quite a few interesting articles written about our post as well. The Guardian asked important questions about how journalists can respond to the tremendous velocity of the real-time web. Over at Fast Company, Brian Solis used our visualization as a jumping off point for a discussion of who matters in “the information economy.”

There have been plenty of inquiries about the graph itself, so we wanted to provide you with the opportunity to explore it in greater depth. Click on the image below or download it, and zoom in to get a closer look at all of the intersecting forces that propelled a single tweet to its eventual astonishing spread.

Truly unusual work.

Makes me wonder about several things:

  1. What would it take to trace the addition of information to a topic map in a similar way?
  2. What would it look like to add information to the nodes in these graphs using a topic map?
  3. For that matter, what information would you want to add and why?

Data First

Filed under: Dataset,Topic Maps — Patrick Durusau @ 5:56 pm

The Death of Open Data?

Not recent (April 2011) but an interesting article from Technology Review (MIT) on the impact of budget cuts on Obama's Open Data initiatives.

Speaking of data.gov, Rufus Pollock (director of the Open Knowledge Foundation) says:

Pollock says what’s most concerning about cutting the Electronic Government Fund is that it represents a turn away from Obama’s open-government policies. “The website is really great, but the crucial thing is the actual data,” he says. Though data.gov is a symbol whose loss would be painful, the real question is whether the U.S. government will continue to make its data more accessible and useful, with or without it.

In an era of budget reductions, it needs to be data first (including documentation) and agency website search/presentations later, if at all.

If you know anyone with an effective voice in this debate, suggest to them data first, agency presentation/spin only after the data has been posted for download.

Luke 3.1

Filed under: Lucene,Luke — Patrick Durusau @ 5:55 pm

Luke 3.1

Luke is a development and diagnostic tool for use with Lucene.

Luke is now being numbered consistently with Lucene.

See my prior blog post on Luke.

May 14, 2011

Neo4j 1.4.M02 – Index-ability

Filed under: Neo4j — Patrick Durusau @ 6:25 pm

Neo4j 1.4.M02 – Index-ability

From the blog post:

One of the most striking features of this milestone release is the new index component based on Lucene 3.1. Besides the cleaner API, this upgrade brings impressive improvements in index lookup speed, with index write operations enjoying measurable speedup as well. So, depending on your use case, you might benefit from indexing operations that are twice as fast!

Read for coverage of the REST interface and other improvements.

XMLSH

Filed under: Authoring Topic Maps,Data Mining — Patrick Durusau @ 6:25 pm

XMLSH – Command line shell for processing XML.

Another tool for your data mining/manipulation tool-kit!

Cheat Sheet: Algorithms for Supervised and Unsupervised Learning

Filed under: Machine Learning — Patrick Durusau @ 6:25 pm

Cheat Sheet: Algorithms for Supervised and Unsupervised Learning

A nice “cheat sheet,” more of a summary of key information on algorithms for supervised and unsupervised learning.

Data Visualization with ElasticSearch and Protovis

Filed under: ElasticSearch,Visualization — Patrick Durusau @ 6:24 pm

Data Visualization with ElasticSearch and Protovis

This is a great article on ElasticSearch and visualization with Protovis but I mention it because of the following:

Nevertheless, a modern full-text search engine can do much more than that. At its core lies the inverted index, a highly optimized data structure for efficient lookup of documents matching the query. But it also allows to compute complex aggregations of our data, called facets. (Emphasis and links in original)

Do you think of facets as aggregations of data?

Is merging an aggregation of data?
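In miniature, a facet really is just an aggregation over the documents matching a query. A toy sketch (hypothetical documents, not the ElasticSearch API):

```python
from collections import Counter

# The documents a query matched:
matching_docs = [
    {"title": "Graph databases", "tags": ["graphs", "nosql"]},
    {"title": "Riak Search", "tags": ["nosql", "search"]},
    {"title": "Solr internals", "tags": ["search"]},
]

# A terms facet on the "tags" field is a count per distinct term,
# computed only over the matching documents.
facet = Counter(tag for doc in matching_docs for tag in doc["tags"])

print(facet.most_common())
# → [('nosql', 2), ('search', 2), ('graphs', 1)]
```

The inverted index makes this cheap at scale: the engine already knows which documents contain each term, so the counts fall out of the lookup rather than a scan.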

May 13, 2011

Neo4j graph database server image in Amazon EC2

Filed under: Neo4j — Patrick Durusau @ 7:23 pm

Neo4j graph database server image in Amazon EC2

From the blog post:

Neo4j graph database server image is available in Amazon EC2. The purpose of the AMI is to offer instant and on-demand access to a Neo4j Server environment to help the rapidly growing Neo4j developer community to test and deploy Neo4j-enabled applications.

Guess I will have to check on an Amazon EC2 account.

Anyone tried this?

3 Free E-Books and a Tutorial on Erlang

Filed under: Erlang — Patrick Durusau @ 7:23 pm

3 Free E-Books and a Tutorial on Erlang.

From ReadWrite Hack a collection of resources on Erlang.

You really need to see the “classic” movie promoting Erlang for telephony.

Maybe a contest for the “classic” movie promoting topic maps? Suggestions?

Solr Spellchecker internals (now with tests!)

Filed under: Solr,Topic Map Systems — Patrick Durusau @ 7:21 pm

Solr Spellchecker internals (now with tests!)

Emmanuel Espina says:

But today I'm going to talk about Solr SpellChecker. In contrast with Google, Solr's spellchecker isn't much more than a pattern similarity algorithm. You give it a word and it will find similar words. But what is interpreted as "similar" by Solr? The words are interpreted just as arrays of characters, so two words are similar if they have many coincidences in their character sequences. That may sound obvious, but in natural languages the bytes (letters) have little meaning. It is the entire word that has a meaning. So, Solr's algorithms won't even know that you are giving them words. Those byte sequences could be sequences of numbers, or sequences of colors. Solr will find the sequences of numbers that have small differences with the input, or the sequences of colors, etc. By the way, this is not the approach that Google follows. Google knows the frequent words, the frequent misspelled words, and the frequent ways humans make mistakes. It is my intention to talk about these interesting topics in a future post, but for now let's study how the Solr spellchecker works in detail, and then make some tests.

Looks like a good series on the details of spellcheckers.

Useful if you want to incorporate spell-check in a topic map application.

And for a deep glimpse into how computers are different from us.
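Espina's point that the byte sequences "could be sequences of numbers, or sequences of colors" is easy to demonstrate. The stdlib's difflib (not Solr's own algorithm, just a stand-in similarity measure) scores any sequences of hashable elements the same way, with no idea what they mean:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Pure sequence comparison: works on strings, digit strings,
    # or lists of "colors" alike. No notion of words or meaning.
    return SequenceMatcher(None, a, b).ratio()

print(similarity("recieve", "receive"))              # letters
print(similarity("1234567", "1234657"))              # digits
print(similarity(["red", "green", "blue"],
                 ["red", "blue", "green"]))          # "colors"
```

A high score means the character (or element) sequences overlap heavily — nothing more. That is the gap between pattern similarity and Google's model of how humans actually misspell words.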

SPARQL 1.1 Drafts – Last Call

Filed under: Query Language,RDF,SPARQL — Patrick Durusau @ 7:19 pm

SPARQL 1.1 Drafts – Last Call

From the W3C News:

May 12, 2011

Lessons of History? Crowdsourcing

Filed under: Crowd Sourcing — Patrick Durusau @ 8:00 am

The post by Panos Ipeirotis, Crowdsourcing: Lessons from Henry Ford, on his presentation (and slides), reminded me of Will and Ariel Durant's The Lessons of History observation (paraphrasing):

If you could select them, 10% of the population produces as much as the other 90% combined. History does exactly that.

So Panos saying that “A few workers contribute the majority of the work…” is no surprise.

If you don’t think asking people for their opinions is all that weird, you may enjoy his presentation.*

His summary:

The main points that I wanted to make:

  • It is common to consider crowdsourcing as the “assembly line for knowledge work” and think of the workers as simple cogs in a big machine. It is almost a knee-jerk reaction to think negatively about the concept. However, it was the proper use of the assembly line (together with the proper automation) by Henry Ford that led to the first significant improvement in the level of living for the masses.
  • Crowdsourcing suffers a lot due to significant worker turnover: Everyone who experimented with large tasks on MTurk knows that the participation distribution is very skewed. A few workers contribute the majority of the work, while a large number of workers contribute only minimally. Dealing with these hit-and-run workers is a pain, as we cannot apply any statistically meaningful mechanism for quality control.
  • We ignore the fact that workers give back what they are given. Pay peanuts, get monkeys. Pay well, and get good workers. Needless to say, reputation and other quality signaling mechanisms are of fundamental importance for this task.
  • Keeping the same workers around can give significant improvements in quality. Today on MTurk we have a tremendous turnover of workers, wasting significant effort and efficiencies. Whomever builds a strong base of a few good workers can pay the workers much better and, at the same time, generate a better product for lower cost than relying on an army of inexperienced, noisy workers.

Yes, at the end, crowdsourcing is not about the crowd. It is about the individuals in the crowd. And we can now search for these valuable individuals very effectively. Crowdsourcing is crowdsearching.


*It isn’t that people are the best judges of semantics. They are the only judges of semantics.

Automated systems for searching, indexing, sorting, etc., are critical to modern information infrastructures. What they are not doing, appearances to the contrary notwithstanding, is judging semantics.

Intuition & Data-Driven Machine Learning

Filed under: Clustering,Machine Learning — Patrick Durusau @ 7:59 am

Intuition & Data-Driven Machine Learning

Ilya Grigorik includes his Intelligent Ruby: Getting Started with Machine Learning presentation and asks the following question:

… to perform a clustering, we need to define a function to measure the “pairwise distance” between all pairs of objects. Can you think of a generic way to do so?

Think about it and then go to his blog post to see the answer.
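For comparison once you have thought about it: one common generic answer (a hedged sketch, not necessarily the one Grigorik gives) is to turn each object into a set of features and use a set-based distance such as Jaccard, which is defined for anything you can featurize:

```python
def bigrams(s):
    """Featurize a string as its set of character bigrams.
    Any featurizer returning a set would work here."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard_distance(a, b):
    """Generic pairwise distance: 1 - |A ∩ B| / |A ∪ B|."""
    fa, fb = bigrams(a), bigrams(b)
    if not fa and not fb:
        return 0.0
    return 1.0 - len(fa & fb) / len(fa | fb)

print(jaccard_distance("clustering", "clusters"))  # small: shared bigrams
print(jaccard_distance("clustering", "banana"))    # 1.0: nothing shared
```

Swap the featurizer (words of a document, tags of an item, neighbors of a node) and the same distance drives clustering over completely different kinds of objects.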

Machine Learning and Probabilistic Graphical Models

Filed under: Machine Learning,Probalistic Models — Patrick Durusau @ 7:59 am

Machine Learning and Probabilistic Graphical Models

From the website:

Instructor: Sargur Srihari Department of Computer Science and Engineering, University at Buffalo

Machine learning is an exciting topic about designing machines that can learn from examples. The course covers the necessary theory, principles and algorithms for machine learning. The methods are based on statistics and probability – which have now become essential to designing systems exhibiting artificial intelligence. The course emphasizes Bayesian techniques and probabilistic graphical models (PGMs). The material is complementary to a course on Data Mining where statistical concepts are used to analyze data for human, rather than machine, use.

The textbooks for different parts of the course are “Pattern Recognition and Machine Learning” by Chris Bishop (Springer 2006) and “Probabilistic Graphical Models” by Daphne Koller and Nir Friedman (MIT Press 2009).

Lecture slides and some videos of lectures.

KFTF – Keeping Found Things Found™

Filed under: Information Retrieval,Information Reuse — Patrick Durusau @ 7:58 am

KFTF – Keeping Found Things Found™

From the website:

Much of our lives is spent in the finding of things. Find a house or a car that’s just right for you. Find your dream job. Or your dream mate. But, once found, what then?

As with other things, so it is with our information. Finding is just the first step. How do we keep this information so that it’s there later when we need it? How do we organize it in ways that make sense for us in the lives we want to lead? Information found does us little good if we misplace it or forget to use it. And just as we must maintain a house or a car, we need to maintain our information – backing it up, archiving or deleting old information, updating information that is no longer accurate. In our digital world, advances in technologies of search and storage have far outpaced balancing advances in tools and techniques that help us to manage and make sense of our information. This project combines fieldwork with selective prototyping in an effort to understand what is needed for us to “keep found things found.”

There is also software, called Planz™, that has been open sourced by the project.

Take control of the information in your life through one consolidated interface. Plan by typing your thoughts freehand. Link your thoughts to files, Web pages, and email messages. Organize everything into a single, integrated document that helps you manage all the projects you want to get done. Planz™ is an overlay to your file system so your information stays under your control.

Is anyone familiar with this software? Thanks!

Topic Map Metrics II

Filed under: Authoring Topic Maps,Topic Maps — Patrick Durusau @ 7:57 am

No real insight on how to construct a topic map metric, even by contract, but I do have a couple more examples for a discussion of metrics:

Example 1:

Robert Cerny wants a topic map of concrete things, like members of a sports team.

OK, if I add their wives to the topic map does that make it “more complete?”

What if I add their mistresses (current/former) and illegitimate children?

Does your answer change if the wives don’t (already) know about the mistresses?

That is, how would I set boundaries on the associations that are included in a topic map?

Example 2:

What if I am building a topic map for a journal archive?

There is a traditional index which does author, title indices, maybe even a subject index.

Assume that I convert that into a topic map and that is the “base” for the topic map.

That is, the topic map has to contain every entry found in the printed index.

At least now we can measure when the topic map falls short.

I think the second example is important because without a specified base of information, you are just waving your hands to talk about making a topic map more complete.

Well, maybe not a specified base but you do have to say “what subjects you want to talk about” as Steve Newcomb would say.

Not a lot but perhaps a useful place to start the discussion.
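The second example's appeal is that "falling short" becomes computable: coverage is just the fraction of the base index present in the topic map. A sketch with hypothetical index entries:

```python
# The printed index of the journal archive is the specified base.
printed_index = {"Smith, J.", "Jones, A.", "Topic Maps", "Merging"}

# Subjects actually represented in the topic map so far.
topic_map_subjects = {"Smith, J.", "Topic Maps", "Merging", "Graphs"}

missing = printed_index - topic_map_subjects
coverage = 1 - len(missing) / len(printed_index)

print(f"coverage: {coverage:.0%}, missing: {sorted(missing)}")
# → coverage: 75%, missing: ['Jones, A.']
```

Note the metric is silent about the first example's problem: "Graphs" (and any number of mistresses) can be added beyond the base without the score ever saying when to stop.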

Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison

Filed under: Cassandra,CouchDB,HBase,MongoDB,NoSQL,Redis,Riak — Patrick Durusau @ 7:56 am

Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison

Good thumb-nail comparison of the major features of all six (6) NoSQL databases by Kristóf Kovács.

Sorry to see that Neo4j didn't make the comparison.

