Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

May 19, 2012

From the Bin Laden Letters: Reactions in the Islamist Blogosphere

Filed under: Intelligence,Text Analytics — Patrick Durusau @ 4:41 pm

From the Bin Laden Letters: Reactions in the Islamist Blogosphere

From the post:

Following our initial analysis of the Osama bin Laden letters released by the Combating Terrorism Center (CTC) at West Point, we’ll more closely examine interesting moments from the letters and size them up against what was publicly reported as happening in the world in order to gain a deeper perspective on what was known or unknown at the time.

There was a frenzy of summarization and highlight reel reporting in the wake of the Abbottabad documents being publicly released. Some focused on the idea that Osama bin Laden was ostracized, some pointed to the seeming obsession with image in the media, and others simply took a chance to jab at Joe Biden for the suggestions made about his lack of preparedness for the presidency.

What we’ll do in this post is take a different approach, and rather than focus on analyst viewpoints we’ll compare reactions to the Abbottabad documents from a unique source – Islamist discussion forums.

There we find rebukes over the veracity of the documents released, support for the efforts of operatives such as Faisal Shahzad, and a little interest in the Arab Spring.

Interesting visualizations as always.

The question I would ask as a consumer of such information services is: How do I integrate this analysis with in-house analysis tools?

Or perhaps better: How do I evaluate indirect references to particular persons or places? That is, a person or place is implied but not named. What do I know about the basis for such an identification?

New Mechanical Turk Categorization App

Filed under: Amazon Web Services AWS,Classification,Mechanical Turk — Patrick Durusau @ 10:52 am

New Mechanical Turk Categorization App

Categorization is one of the more popular use cases for the Amazon Mechanical Turk. A categorization HIT (Human Intelligence Task) asks the Worker to select from a list of options. Our customers use HITs of this type to assign product categories, match URLs to business listings, and to discriminate between line art and photographs.

Using our new Categorization App, you can start categorizing your own items or data in minutes, eliminating the learning curve that has traditionally accompanied this type of activity. The app includes everything that you need to be successful including:

  1. Predefined HITs (no HTML editing required).
  2. Pre-qualified Master Workers (see Jinesh’s previous blog post on Mechanical Turk Masters).
  3. Price recommendations based on complexity and comparable HITs.
  4. Analysis tools.

The Categorization App guides you through the four simple steps that are needed to create your categorization project.
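The app itself requires no code, but if you are curious what a categorization HIT looks like under the hood, here is a hypothetical sketch using the present-day boto3 MTurk client. The title, reward and question form are made-up values, and the sandbox endpoint keeps it from spending real money.

```python
# A hypothetical sketch, not the Categorization App itself: creating one
# categorization HIT programmatically. All values below are made up.
import boto3

SANDBOX = "https://mturk-requester-sandbox.us-east-1.amazonaws.com"
mturk = boto3.client("mturk", endpoint_url=SANDBOX, region_name="us-east-1")

question_xml = """<QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
  <Question>
    <QuestionIdentifier>category</QuestionIdentifier>
    <QuestionContent><Text>Is the image at http://example.com/item-123.png line art or a photograph?</Text></QuestionContent>
    <AnswerSpecification>
      <SelectionAnswer>
        <Selections>
          <Selection><SelectionIdentifier>lineart</SelectionIdentifier><Text>Line art</Text></Selection>
          <Selection><SelectionIdentifier>photo</SelectionIdentifier><Text>Photograph</Text></Selection>
        </Selections>
      </SelectionAnswer>
    </AnswerSpecification>
  </Question>
</QuestionForm>"""

hit = mturk.create_hit(
    Title="Categorize an image: line art or photograph?",
    Description="Pick the category that best describes the linked image.",
    Keywords="categorization, image, classification",
    Reward="0.05",                      # dollars, as a string
    MaxAssignments=3,                   # ask three Workers and take the majority vote
    AssignmentDurationInSeconds=300,
    LifetimeInSeconds=86400,
    Question=question_xml,
)
print("Created HIT:", hit["HIT"]["HITId"])
```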

I thought the contrast between gamers (the GPU post) and MTurkers would be a nice way to close the day. 😉

Although, there are efforts to create games where useful activity happens, whether intended or not. (Would that take some of the joy out of a game?)

If you use this particular app, please blog or post a note about your experience.

Thanks!

May 18, 2012

Cloud-Hosted GPUs And Gaming-As-A-Service

Filed under: Games,GPU,NVIDIA — Patrick Durusau @ 4:24 pm

Cloud-Hosted GPUs And Gaming-As-A-Service by Humayun

From the post:

NVIDIA is all buckled up to redefine the dynamics of gaming. The company has spilled the beans over three novel cloud technologies aimed at accelerating the available remote computational power by endorsing the number-crunching potential of its very own (and redesigned) graphical processing units.

At the heart of each of the three technologies lies the latest Kepler GPU architecture, custom-tailored for utility in volumetric datacenters. Through virtualization software, a number of users achieve access through the cutting-edge computational capability of the GPUs.

Jen-Hsun Huang, NVIDIA’s president and CEO, firmly believes that the Kepler cloud GPU technology is bound to take cloud computing to an entirely new level. He advocates that the GPU has become a significant constituent of contemporary computing devices. Digital artists are essentially dependent upon the GPU for conceptualizing their thoughts. Touch devices owe a great deal to the GPU for delivering a streamlined graphical experience.

With the introduction of the cloud GPU, NVIDIA is all set to change the game—literally. NVIDIA’s cloud-based GPU will bring an amazingly pleasant experience to gamers on a hunt to play in an untethered manner from a console or personal computer.

First in line is the NVIDIA VGX platform, an enterprise-level execution of the Kepler cloud technologies, primarily targeting virtualized desktop performance boosts. The company is hopeful that ventures will make use of this particular platform to ensure flawless remote computing and cater to the most computationally starved applications to be streamed directly to a notebook, tablet or any other mobile device variant. Jeff Brown, GM at NVIDIA’s Professional Solutions Group, is reported to have marked the VGX as the starting point for a “new era in desktop virtualization” that promises a cost-effective virtualization solution offering “an experience almost indistinguishable from a full desktop”.

Results with GPUs have been encouraging, and making them available as cloud-based GPUs should lead to a wider variety of experiences.

The emphasis here is on making the lives of gamers more pleasant, but one expects serious uses, such as graph processing, not to be all that far behind.

Lavastorm Desktop Public

Filed under: Analytics,Lavastorm Desktop Public — Patrick Durusau @ 4:09 pm

Lavastorm Desktop Public

Lavastorm Desktop Public is a powerful, visual and easy-to-use tool for anyone combining and analyzing data. A free version of our award winning Lavastorm Desktop software, the Public edition allows you to harness the power of our enterprise-class analytics engine right on your desktop. You’ll love Lavastorm Desktop Public if you want to:

  • Get more productive by reducing the time to create analytics by 90% or more compared to underpowered analytic tools, such as Excel or Access
  • Stop flying blind by unifying data locked away in silos or scattered on your desktop
  • Eliminate time spent waiting for others to integrate data or implement new analytics
  • Gain greater control for analyzing data against complex business logic and for manipulating data from Excel, CSV or ASCII files

First time I have encountered it.

Suggestions/comments?

Notes on the analysis of large graphs

Filed under: Graph Traversal,Graphs — Patrick Durusau @ 3:57 pm

Notes on the analysis of large graphs

From the post:

This post is part of a series on managing and analyzing graph data. Posts to date include:

My series on graph data management and analytics got knocked off-stride by our website difficulties. Still, I want to return to one interesting set of issues — analyzing large graphs, specifically ones that don’t fit comfortably into RAM on a single server. By no means do I have the subject figured out. But here are a few notes on the matter.

How big can a graph be? That of course depends on:

  • The number of nodes. If the nodes of a graph are people, there’s an obvious upper bound on the node count. Even if you include their houses, cars, and so on, you’re probably capped in the range of 10 billion.
  • The number of edges. (Even more important than the number of nodes.) If every phone call, email, or text message in the world is an edge, that’s a lot of edges.
  • The typical size of a (node, edge, node) triple. I don’t know why you’d have to go much over 100 bytes post-compression*, but maybe I’m overlooking something.

*Even if your graph has 10 billion nodes, those can be tokenized in 34 bits, so the main concern is edges. Edges can include weights, timestamps, and so on, but how many specifics do you really need? At some point you can surely rely on a pointer to full detail stored elsewhere.

I would think the specifics, for nodes and/or edges, are going to depend upon the data set and your requirements for it, neither of which can be judged in the abstract or in advance.
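Here is a quick back-of-the-envelope check on those numbers (my arithmetic, not the post's; the edge count is an assumption):

```python
import math

nodes = 10_000_000_000          # the post's ~10 billion node upper bound
edges = 1_000_000_000_000       # suppose a trillion call/email/text edges
bytes_per_triple = 100          # the post's ~100 bytes per (node, edge, node) triple

print(math.ceil(math.log2(nodes)), "bits to tokenize a node id")         # 34, as the footnote says
print(edges * bytes_per_triple / 1e12, "TB for the edge triples alone")  # 100.0 TB
# ...which is exactly the kind of graph that will not fit comfortably into RAM on one server.
```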

Comments?

Using BerkeleyDB to Create a Large N-gram Table

Filed under: BerkeleyDB,N-Gram,Natural Language Processing,Wikipedia — Patrick Durusau @ 3:16 pm

Using BerkeleyDB to Create a Large N-gram Table by Richard Marsden.

From the post:

Previously, I showed you how to create N-Gram frequency tables from large text datasets. Unfortunately, when used on very large datasets such as the English language Wikipedia and Gutenberg corpora, memory limitations limited these scripts to unigrams. Here, I show you how to use the BerkeleyDB database to create N-gram tables of these large datasets.

Large datasets such as the Wikipedia and Gutenberg English language corpora cannot be used to create N-gram frequency tables using the previous script due to the script’s large in-memory requirements. The solution is to create the frequency table as a disk-based dataset. For this, the BerkeleyDB database in key-value mode is ideal. This is an open source “NoSQL” library which supports a disk based database and in-memory caching. BerkeleyDB can be downloaded from the Oracle website, and also ships with a number of Linux distributions, including Ubuntu. To use BerkeleyDB from Python, you will need the bsddb3 package. This is included with Python 2.* but is an additional download for Python 3 installations.

Richard promises to make the resulting data sets available as an Azure service. Sample code, etc, will be posted to his blog.

Another Wikipedia-based analysis.
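Not Richard's script, but a minimal sketch of the same disk-backed idea, assuming the bsddb3 package (and the Berkeley DB C library it wraps) is installed:

```python
# Accumulate bigram counts in a disk-backed B-tree instead of an in-memory dict.
import bsddb3

def count_bigrams(tokens, db_path="bigrams.db"):
    db = bsddb3.btopen(db_path, "c")          # 'c' = create the file if it does not exist
    for first, second in zip(tokens, tokens[1:]):
        key = ("%s %s" % (first, second)).encode("utf-8")
        try:
            count = int(db[key])              # keys and values are bytes
        except KeyError:
            count = 0
        db[key] = str(count + 1).encode("utf-8")
    db.sync()                                 # flush to disk
    return db

if __name__ == "__main__":
    text = "the quick brown fox jumps over the quick brown dog"
    db = count_bigrams(text.split())
    for key in db.keys():
        print(key.decode("utf-8"), int(db[key]))
    db.close()
```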

From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications

Filed under: Data,Knowledge,Machine Learning,Stream Analytics — Patrick Durusau @ 3:06 pm

From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications

From the post:

Here is the first series of videos from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012). Congratulations to the local Organizing committee: Joshua Bloom, Damian Eads, Berian James, Peter Nugent, John Rice, Joseph Richards and Dan Starr for making the meeting happen and putting it all on videos for others to learn from (in near real time!). The titles of the talks are linked to the presentation slides. The full program, which ends tomorrow, is here. All the videos are here: Part 1, Part 2, Part 3, Part 4, Part 5.

Posted by Igor Carron at Nuit Blanche.

Finding enough hours to watch all of these is going to be a problem!

Which ones do you like best?

Predictive Analytics: Data Preparation [part 2]

Filed under: Predictive Analytics — Patrick Durusau @ 2:49 pm

Predictive Analytics: Data Preparation by Ricky Ho.

From the post:

As a continuation of my last post on predictive analytics, in this post I will focus in describing how to prepare data for training the predictive model. I will cover how to perform necessary sampling to ensure the training data is representative and fit into the machine processing capacity. Then we validate the input data and perform necessary cleanup on format error, fill-in missing values and finally transform the collected data into our defined set of input features.

Different machine learning model will have its unique requirement in its input and output data type. Therefore, we may need to perform additional transformation to fit the model requirement

Part 2 of Ricky’s posts on predictive analytics.
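Ricky works through these steps in R. Purely as an illustration, the same three steps (sampling, missing-value cleanup, feature transformation) might look like this in Python/pandas, on made-up data:

```python
import pandas as pd
import numpy as np

raw = pd.DataFrame({
    "age":     [34, np.nan, 51, 29, 62, np.nan, 45],
    "income":  [52000, 61000, np.nan, 39000, 88000, 47000, 55000],
    "plan":    ["basic", "pro", "pro", "basic", "enterprise", "basic", "pro"],
    "churned": [0, 0, 1, 0, 1, 0, 1],
})

# 1. Sample so the training data fits the machine (here: 80% of rows).
train = raw.sample(frac=0.8, random_state=42)

# 2. Clean up: fill missing numeric values with the column median.
for col in ("age", "income"):
    train[col] = train[col].fillna(train[col].median())

# 3. Transform: one-hot encode the categorical feature for models that need numbers.
features = pd.get_dummies(train.drop(columns="churned"), columns=["plan"])
labels = train["churned"]

print(features.head())
```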

Predictive Analytics: Overview and Data visualization

Filed under: Predictive Analytics — Patrick Durusau @ 2:45 pm

Predictive Analytics: Overview and Data visualization by Ricky Ho.

From the post:

I plan to start a series of blog post on predictive analytics as there is an increasing demand on applying machine learning technique to analyze large amount of raw data. This set of technique is very useful to me and I think they should be useful to other people as well. I will also going through some coding example in R. R is a statistical programming language that is very useful for performing predictive analytic tasks. In case you are not familiar with R, here is a very useful link to get some familiarity in R.

Predictive Analytics is a specialize data processing techniques focusing in solving the problem of predicting future outcome based on analyzing previous collected data. The processing cycle typically involves two phases of processing:

  1. Training phase: Learn a model from training data
  2. Predicting phase: Deploy the model to production and use that to predict the unknown or future outcome

The whole lifecycle of training involve the following steps.

Ricky has already posted part 2 but I am going to create separate entries for them. Mostly to make sure I don’t miss any of his posts.
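As a bare-bones illustration of those two phases (in Python/scikit-learn rather than the R Ricky uses, with toy data):

```python
from sklearn.linear_model import LogisticRegression

# Training phase: learn a model from labelled training data.
X_train = [[25, 1], [47, 0], [35, 1], [52, 0], [23, 1], [61, 0]]
y_train = [0, 1, 0, 1, 0, 1]
model = LogisticRegression().fit(X_train, y_train)

# Predicting phase: apply the trained model to new, unlabelled records.
X_new = [[30, 1], [58, 0]]
print(model.predict(X_new))
```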

Enjoy!

Interannotator Agreement for Chunking Tasks like Named Entities and Phrases

Filed under: Annotation,LingPipe,Natural Language Processing — Patrick Durusau @ 2:40 pm

Interannotator Agreement for Chunking Tasks like Named Entities and Phrases

Bob Carpenter writes:

Krishna writes,

I have a question about using the chunking evaluation class for inter annotation agreement : how can you use it when the annotators might have missing chunks I.e., if one of the files contains more chunks than the other.

The answer’s not immediately obvious because the usual application of interannotator agreement statistics is to classification tasks (including things like part-of-speech tagging) that have a fixed number of items being annotated.

An issue that is likely to come up in crowd sourcing analysis/annotation of text as well.
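One common workaround (a sketch of the general idea, not the LingPipe evaluation class itself) is to treat one annotator as the reference and the other as the response, and score exact (start, end, type) chunk matches with precision, recall and F1; missing or extra chunks then simply lower recall or precision:

```python
def chunk_agreement(chunks_a, chunks_b):
    """Score annotator B against annotator A on exact (start, end, type) matches."""
    a, b = set(chunks_a), set(chunks_b)
    matched = len(a & b)
    precision = matched / len(b) if b else 0.0
    recall = matched / len(a) if a else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Annotator B missed one entity and added a spurious one.
annotator_a = [(0, 12, "PERSON"), (20, 33, "ORG"), (40, 47, "LOC")]
annotator_b = [(0, 12, "PERSON"), (20, 33, "ORG"), (50, 58, "PERSON")]
print(chunk_agreement(annotator_a, annotator_b))   # (0.667, 0.667, 0.667)
```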

From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas

Filed under: Concept Detection,Dictionary,Entities,Wikipedia,Word Meaning — Patrick Durusau @ 2:12 pm

From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas by Valentin Spitkovsky and Peter Norvig (Google Research Team).

From the post:

Human language is both rich and ambiguous. When we hear or read words, we resolve meanings to mental representations, for example recognizing and linking names to the intended persons, locations or organizations. Bridging words and meaning — from turning search queries into relevant results to suggesting targeted keywords for advertisers — is also Google’s core competency, and important for many other tasks in information retrieval and natural language processing. We are happy to release a resource, spanning 7,560,141 concepts and 175,100,788 unique text strings, that we hope will help everyone working in these areas.

How do we represent concepts? Our approach piggybacks on the unique titles of entries from an encyclopedia, which are mostly proper and common noun phrases. We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages. Our dictionaries are cross-lingual, and any concept deemed too fine can be broadened to a desired level of generality using Wikipedia’s groupings of articles into hierarchical categories.

(examples omitted)

The database that we are providing was designed for recall. It is large and noisy, incorporating 297,073,139 distinct string-concept pairs, aggregated over 3,152,091,432 individual links, many of them referencing non-existent articles. For technical details, see our paper (to be presented at LREC 2012) and the README file accompanying the data. (emphasis added)

Did you catch those numbers?

Now there is a truly remarkable resource.

What will you make out of it?
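A toy sketch of the underlying idea (made-up links, nothing like Google's scale): every hyperlink whose target is a Wikipedia article votes for a (string, concept) pair, and aggregating the votes gives a noisy, recall-oriented dictionary from surface strings to concepts.

```python
from collections import defaultdict

# (anchor text, target article URL) pairs harvested from links -- invented examples.
links = [
    ("Big Apple", "https://en.wikipedia.org/wiki/New_York_City"),
    ("NYC", "https://en.wikipedia.org/wiki/New_York_City"),
    ("New York", "https://en.wikipedia.org/wiki/New_York_City"),
    ("New York", "https://en.wikipedia.org/wiki/New_York_(state)"),
    ("Taj Mahal", "https://en.wikipedia.org/wiki/Taj_Mahal"),
    ("Taj Mahal", "https://en.wikipedia.org/wiki/Taj_Mahal_(musician)"),
]

dictionary = defaultdict(lambda: defaultdict(int))
for anchor, concept in links:
    dictionary[anchor.lower()][concept] += 1

# For a given string, rank candidate concepts by how often they were the link target.
for concept, count in sorted(dictionary["new york"].items(), key=lambda kv: -kv[1]):
    print(concept, count)
```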

May 17, 2012

“…Things, Not Strings”

Filed under: Google Knowledge Graph,Marketing,RDF,RDFa,Semantic Web,Topic Maps — Patrick Durusau @ 6:30 pm

The brilliance at Google spreads beyond technical chops and into their marketing department.

Effective marketing can be a matter of what you don’t do as well as what you do.

What did Google not do with the Google Knowledge Graph?

Google Knowledge Graph does not require users to:

  • learn RDF/RDFa
  • learn OWL
  • learn various syntaxes
  • build/choose ontologies
  • use SW software
  • wait for authoritative instructions from Mount W3C

What does Google Knowledge Graph do?

It gives users information about things, things that are of interest to users. Using their web browsers.

Let’s see, we can require users to do what we want, or, we can give users what they want.

Which one do you think is the most likely to succeed? (No peeking!)

Google and Going Beyond Search

Filed under: Google Knowledge Graph,Searching — Patrick Durusau @ 6:12 pm

Google and Going Beyond Search

Stephen Arnold writes:

The idea for this blog began when I worked through selected Ramanathan Guha patent documents. I have analyzed these in my 2007 Google Version 2. If you are not familiar with them, you may want to take a moment, download these items, and read the “background” and “claims” sections of each. Here are several filings I found interesting:

US2007 003 8600
US2007 003 8601
US2007 003 8603
US2007 003 8614
US2007 003 8616

The utility of Dr. Guha’s invention is roughly similar to the type of question answering supported by WolframAlpha. However, there are a number of significant differences. I have explored these in the chapter in The Google Legacy “Google and the Programmable Search Engine.”

I read with interest the different explanations of Google’s most recent enhancement to its search results page. I am not too eager to highlight “Introducing the Knowledge Graph: Things, Not Strings” because it introduces terminology which is more poetic and metaphorical than descriptive. Nevertheless, you will want to take a look at how Google explains its “new” approach. Keep in mind that some of the functions appear in patent documents and technical papers which date from 2006 or earlier. The question this begs is, “Why the delay?” Is the roll out strategic in that it will have an impact on Facebook at a critical point in the company’s timeline or is it evidence that Google experiences “big company friction” when it attempts to move from demonstration to production implementation of a mash up variant.

First, we have hyperlinks for a reason: to make it easier for readers to follow references (among other things).

So, the patents that Stephen cites above:

  • US2007 003 8600 Missing. Cited in numerous patents with a hyperlink but the USPTO returns no patent.
  • US2007 003 8601 Aggregating context data for programmable search engines
  • US2007 003 8603 Sharing context data across programmable search engines
  • US2007 003 8614 Generating and presenting advertisements based on context data for programmable search engines
  • US2007 003 8616 Missing. Cited in numerous patents with a hyperlink but the USPTO returns no patent.

Three out of five? I wonder what Stephen was reading for the two that are missing?

BTW, Stephen concludes:

So Google has gone beyond search. The problem is that I don’t want to go there via the Google, Bing, or any other intermediary’s intellectual training wheels. I want to read, think, decide, and formulate my view. In short, I like the dirty, painful research process.

I fully understand running materials “back to the sources” as it were. As a student, lawyer, bible scholar, standards editor, bystander to semantic drive-bys, etc.

But one goal of research is to blaze trails that others can follow, so they can dig deeper than we could.

Google has hardly eliminated the need for research, unless it is a very superficial type of research. And that hardly merits the name research.

Exploring The Universe with Machine Learning

Filed under: Astroinformatics,BigData,Machine Learning — Patrick Durusau @ 3:49 pm

Exploring The Universe with Machine Learning

Webinar: Wednesday, May 30, 2012, 9:00 AM – 10:00 AM Pacific Daylight Time (4:00 PM GMT)

From the post:

WHAT IT’S ABOUT:

There is much to discover in the big, actually astronomically big, datasets that are (and will be) available. The challenge is how to effectively mine these massive datasets.

In this webinar attendees will learn how CANFAR (the Canadian Advanced Network for Astronomical Research) is using Skytree’s high performance and scalable machine learning system in the cloud. The combination enables astronomers to focus on their analyses rather than having to waste time implementing scalable complex algorithms and architecting the infrastructure to handle the massive datasets involved.

CANFAR is designed with usability in mind. Implemented as a virtual machine (VM), users can deploy their existing desktop code to the CANFAR cloud – delivering instant scalability (replication of the VM as required), without additional development.

WHO SHOULD ATTEND:

Anyone interested in performing machine learning or advanced analytics on big (astronomical) data sets.

Well, I qualify on two counts. How about you? 😉

From Skytree Big Data Analytics. They have a free server version that I haven’t looked at, yet.

Designing Search (part 4): Displaying results

Filed under: Interface Research/Design,Search Behavior,Search Interface,Searching — Patrick Durusau @ 3:41 pm

Designing Search (part 4): Displaying results

Tony Russell-Rose writes:

In an earlier post we reviewed the various ways in which an information need may be articulated, focusing on its expression via some form of query. In this post we consider ways in which the response can be articulated, focusing on its expression as a set of search results. Together, these two elements lie at the heart of the search experience, defining and shaping much of the information seeking dialogue. We begin therefore by examining the most universal of elements within that response: the search result.

As usual, Tony does a great job of illustrating your choices and trade-offs in presentation of search results. Highly recommended.

I am curious: since Tony refers to it as an “information seeking dialogue,” has anyone mapped reference interview approaches to search interfaces? I suspect that is just my ignorance of the literature on the subject, so I would appreciate any pointers you can throw my way.

I would update Tony’s bibliography:

Marti Hearst (2009) Search User Interfaces. Cambridge University Press

Online as full text: http://searchuserinterfaces.com/

All Presentation Software is Broken

Filed under: Communication,Presentation,Web Analytics,Writing — Patrick Durusau @ 3:28 pm

All Presentation Software is Broken by Ilya Grigorik.

From the post:

Whenever the point I’m trying to make lacks clarity, I often find myself trying to dress it up: fade in the points, slide in the chart, make prettier graphics. It is a great tell when you catch yourself doing it. Conversely, I have yet to see a presentation or a slide that could not have been made better by stripping the unnecessary visual dressing. Simple slides require hard work and a higher level of clarity and confidence from the presenter.

All presentation software is broken. Instead of helping you become a better speaker, we are competing on the depth of transition libraries, text effects, and 3D animations. Prezi takes the trophy. As far as I can tell, it is optimized for precisely one thing: generating nausea.

Next Presentation Platform: Browser

If you want your message to travel, then the browser is your (future) presentation platform of choice. No proprietary formats, no conversion nightmares, instant access from billions of devices, easy sharing, and more. Granted, the frameworks and the authoring tools are still lacking, but that is only a matter of time.

Unfortunately, we are off to a false start. Instead of trying to make the presenter more effective, we are too busy trying to replicate the arsenal of useless visual transitions with the HTML5, CSS3 and WebGL stacks. Spinning WebGL cubes and CSS transitions make for a fun technology demo but add zero value – someone, please, stop the insanity. We have web connectivity, ability to build interactive slides, and get realtime feedback and analytics from the audience. There is nothing to prove by imitating the broken features of PowerPoint and Keynote, let’s leverage the strengths of the web platform instead. (emphasis added)

Imagine that. Testing your slides. Sounds like testing software before it is released to paying customers.

Test your slides on a real audience before a conference or meeting with your board or important client. What a novel concept.

By “real audience” I mean someone other than yourself or one of your office mates.

When you are tempted to say, “they just don’t understand….,” substitute, “I didn’t explain …. well.” (Depends on whether you want to feel smart or be an effective communicator. Your call.)

Presentation software isn’t fixable.

Presenters on the other hand, maybe.

But you have to fix yourself, no one can do it for you.

How to Visualize and Compare Distributions

Filed under: Graphics,R,Statistics,Visualization — Patrick Durusau @ 3:08 pm

How to Visualize and Compare Distributions by Nathan Yau.

Nathan writes:

Single data points from a large dataset can make it more relatable, but those individual numbers don’t mean much without something to compare to. That’s where distributions come in.

There are a lot of ways to show distributions, but for the purposes of this tutorial, I’m only going to cover the more traditional plot types like histograms and box plots. Otherwise, we could be here all night. Plus the basic distribution plots aren’t exactly well-used as it is.

Before you get into plotting in R though, you should know what I mean by distribution. It’s basically the spread of a dataset. For example, the median of a dataset is the half-way point. Half of the values are less than the median, and the other half are greater than. That’s only part of the picture.

What happens in between the maximum value and median? Do the values cluster towards the median and quickly increase? Are there a lot of values clustered towards the maximums and minimums with nothing in between? Sometimes the variation in a dataset is a lot more interesting than just mean or median. Distribution plots help you see what’s going on.

You will find distributions useful in many aspects of working with topic maps.

The most obvious use is the end-user display of data in a delivery situation. But distributions can also help you decide what areas of a data set look more “interesting” than others.

Nathan does his typically great job explaining distributions and you will learn a bit of R in the process. Not a bad evening at all.
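If you would rather sketch the same comparison in Python than follow along in R, a minimal matplotlib version (with a made-up, skewed dataset) looks like this:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
values = rng.gamma(shape=2.0, scale=10.0, size=1000)   # a skewed, invented dataset

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(values, bins=30)
ax1.set_title("Histogram: shape of the spread")
ax2.boxplot(values, vert=False)
ax2.set_title("Box plot: median, quartiles, outliers")
plt.tight_layout()
plt.show()
```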

New Features in Apache Pig 0.10

Filed under: Pig — Patrick Durusau @ 2:56 pm

New Features in Apache Pig 0.10 by Daniel Dai.

This is a useful summary of new features.

Daniel covers each new feature, gives an example and when necessary, a pointer to additional documentation. How cool is that?

Just to whet your appetite, Daniel covers:

  • Boolean Data Type
  • Nested Cross/Foreach
  • JRuby UDF
  • Hadoop 0.23 (a.k.a. Hadoop 2.0) Support

and more.

Definitely worth your time to read and to emulate when you write blog posts about new features.

Big Game Hunting in the Database Jungle

Filed under: Calvin,NoSQL,Oracle,SQL — Patrick Durusau @ 2:31 pm

If all these new DBMS technologies are so scalable, why are Oracle and DB2 still on top of TPC-C? A roadmap to end their dominance.

Alexander Thomson and Daniel Abadi write:

In the last decade, database technology has arguably progressed furthest along the scalability dimension. There have been hundreds of research papers, dozens of open-source projects, and numerous startups attempting to improve the scalability of database technology. Many of these new technologies have been extremely influential—some papers have earned thousands of citations, and some new systems have been deployed by thousands of enterprises.

So let’s ask a simple question: If all these new technologies are so scalable, why on earth are Oracle and DB2 still on top of the TPC-C standings? Go to the TPC-C Website with the top 10 results in raw transactions per second. As of today (May 16th, 2012), Oracle 11g is used for 3 of the results (including the top result), 10g is used for 2 of the results, and the rest of the top 10 is filled with various versions of DB2. How is technology designed decades ago still dominating TPC-C? What happened to all these new technologies with all these scalability claims?

The surprising truth is that these new DBMS technologies are not listed in the TPC-C top ten results not because that they do not care enough to enter, but rather because they would not win if they did.

Preview of a paper that Alex is presenting at SIGMOD next week. Introducing “Calvin,” a new approach to database processing.

So where does Calvin fall in the OldSQL/NewSQL/NoSQL trichotomy?

Actually, nowhere. Calvin is not a database system itself, but rather a transaction scheduling and replication coordination service. We designed the system to integrate with any data storage layer, relational or otherwise. Calvin allows user transaction code to access the data layer freely, using any data access language or interface supported by the underlying storage engine (so long as Calvin can observe which records user transactions access).

What I find exciting about this report (and the paper) is the re-thinking of current assumptions concerning data processing. May be successful or may not be. But the exciting part is the attempt to transcend decades of acceptance of the maxims of our forefathers.

BTW, Calvin is reported to support 500,000 transactions a second.

Big game hunting anyone?*


* I don’t mean that as an expression of preference for or against Oracle.

I suspect Calvin will be a wake-up call to R&D at Oracle to redouble their own efforts at groundbreaking innovation.

Breakthroughs in matching up multi-dimensional indexes would be attractive to users who need to match up disparate data sources.

Speed is great but a useful purpose attracts customers.

Elegant exact string match using BWT

Filed under: Burrows-Wheeler Transform (BWT),Compression,Searching — Patrick Durusau @ 1:25 pm

Elegant exact string match using BWT by Santhosh Kumar.

From the post:

This post describes an elegant and fast algorithm to perform exact string match. Why another string matching algorithm? To answer the question, let’s first understand the problem we are trying to solve.

In short, the problem is to match billions of short strings (about 50-100 characters long) to a text which is 3 billion characters long. The 3 billion character string (also called reference) is known ahead and is fixed (at least for a species). The shorter strings (also called reads) are generated as a result of an experiment. The problem arises due to the way the sequencing technology works, which in its current form, breaks the DNA into small fragments and ‘reads’ them. The information about where the fragments came from is lost and hence the need to ‘map’ them back to the reference sequence.

We need an algorithm that allows repeatedly searching on a text as fast as possible. We are allowed to perform some preprocessing on the text once if that will help us achieve this goal. BWT search is one such algorithm. It requires a one-time preprocessing of the reference to build an index, after which the query time is of the order of the length of the query (instead of the reference).

Burrows Wheeler transform is a reversible string transformation that has been widely used in data compression. However the application of BWT to perform string matching was discovered fairly recently in this paper. This technique is the topic of this post. Before we get to the searching application, a little background on how BWT is constructed and some properties of BWT.

Complete with careful illustrations of the operation of the Burrows Wheeler transform (BWT).

A separate post, to follow, will detail finding the BWT index of a long string efficiently.

Definitely a series to follow.
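If you want to experiment before the next installment, here is a compact Python sketch of the whole pipeline: build the BWT of a reference, then count exact matches of short reads with FM-index-style backward search. (My own toy code, not Santhosh's; the O(n × alphabet) occurrence table is fine for a demo but not for a 3-billion-character genome.)

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (demo only)."""
    text = text + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def build_index(bwt_str):
    # C[c]: number of characters in the text lexicographically smaller than c.
    from collections import Counter
    counts = Counter(bwt_str)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i]: occurrences of c in bwt_str[:i].
    occ = {c: [0] * (len(bwt_str) + 1) for c in counts}
    for i, ch in enumerate(bwt_str):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ

def count_matches(pattern, C, occ, n):
    """Backward search: query time depends on the pattern length, not the reference."""
    top, bot = 0, n
    for c in reversed(pattern):
        if c not in C:
            return 0
        top = C[c] + occ[c][top]
        bot = C[c] + occ[c][bot]
        if top >= bot:
            return 0
    return bot - top

if __name__ == "__main__":
    reference = "mississippi"
    b = bwt(reference)
    C, occ = build_index(b)
    for read in ("issi", "ssi", "ipp", "xyz"):
        print(read, count_matches(read, C, occ, len(b)))   # 2, 2, 1, 0
```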

Apache HBase 0.94 is now released

Filed under: Cloudera,HBase — Patrick Durusau @ 10:40 am

Apache HBase 0.94 is now released by Himanshu Vashishtha.

Some of the new features:

  • More powerful first aid box: The previous HBck tool did a good job of fixing inconsistencies related to region assignments but lacked some basic features like fixing orphaned regions, region holes, overlapping regions, etc. HBASE-5128: “Uber hbck”, adds these missing features to the first aid box.
  • Simplified Region Sizing: Deciding a region size is always tricky as it varies on a number of dynamic parameters such as data size, cluster size, workload, etc. HBASE-4365: “Heuristic for Region size” adds a heuristic where it increases the split size threshold of a table region as the data grows, thus limiting the number of region splits.
  • Smarter transaction semantics: Though HBase supports single row level transaction, if there are a number of updates (Puts/Deletes) to an individual row, it will lock the row for each of these operations. HBASE-3584: “Atomic Put & Delete in a single transaction” enhances the HBase single row locking semantics by allowing Puts and Deletes on a row to be executed in a single call. This feature is on by default.

BTW, also from the post:

Apache HBase 0.94.0 has been released! This is the first major release since the January 22nd HBase 0.92 release. In the HBase 0.94.0 release the main focuses were on performance enhancements and the addition of new features (Also, several major bug fixes).

Less than four (4) months, as I count it, between HBase 0.92 and 0.94.

Sounds like a lot of people have been working very hard.

And making serious progress.

May 16, 2012

Google Advertises Topic Maps – Breaking News – Please ReTweet

Filed under: Google Knowledge Graph,Marketing,Topic Maps — Patrick Durusau @ 3:50 pm

Actually the post is titled: Introducing the Knowledge Graph: things, not strings.

It reads in part:

Search is a lot about discovery—the basic human need to learn and broaden your horizons. But searching still requires a lot of hard work by you, the user. So today I’m really excited to launch the Knowledge Graph, which will help you discover new information quickly and easily.

Take a query like [taj mahal]. For more than four decades, search has essentially been about matching keywords to queries. To a search engine the words [taj mahal] have been just that—two words.

But we all know that [taj mahal] has a much richer meaning. You might think of one of the world’s most beautiful monuments, or a Grammy Award-winning musician, or possibly even a casino in Atlantic City, NJ. Or, depending on when you last ate, the nearest Indian restaurant. It’s why we’ve been working on an intelligent model—in geek-speak, a “graph”—that understands real-world entities and their relationships to one another: things, not strings.

The Knowledge Graph enables you to search for things, people or places that Google knows about—landmarks, celebrities, cities, sports teams, buildings, geographical features, movies, celestial objects, works of art and more—and instantly get information that’s relevant to your query. This is a critical first step towards building the next generation of search, which taps into the collective intelligence of the web and understands the world a bit more like people do.

Google’s Knowledge Graph isn’t just rooted in public sources such as Freebase, Wikipedia and the CIA World Factbook. It’s also augmented at a much larger scale—because we’re focused on comprehensive breadth and depth. It currently contains more than 500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects. And it’s tuned based on what people search for, and what we find out on the web.

Google just set the bar for search/information appliances, including topic maps.

What is the value add of your appliance when compared to Google?

When people ask me to explain topic maps now I can say:

You know Google’s Knowledge Graph? It’s like that but customized to your interests and data.

(I would just leave it at that. Let them start imagining what they want to do beyond the reach of Google. In their “dark data.”)

Who knew? Google advertising for topic maps. Without any click-through. Amazing.

Mobilizing Knowledge Networks for Development

Filed under: Conferences,Marketing — Patrick Durusau @ 3:35 pm

Mobilizing Knowledge Networks for Development

June 19—20, 2012
The World Bank Group
1818 H Street NW, Washington DC 20433

From the webpage:

The goal of the workshop is to explore ways to become better providers and connectors of knowledge in a world where the sources of knowledge are increasingly diverse and disbursed. At the World Bank, for example, we are seeking ways to connect with new centers of research, emerging communities of practice, and tap the practical experience of development organizations and the policy makers in rapidly developing economies. Our goal is to find better ways to connect those that have the development knowledge with those that need it, when they need it.

We are also seeking to engage research communities and civil society organizations through an Open Development initiative that makes data and publications freely available. We understand that many other organizations are exploring similar initiatives. The Conference and Knowledge fair will provide an opportunity for knowledge organizations working in development to learn from one another about their knowledge services, practices, and successes and challenges in providing these services.

You can register to attend in person or over the Internet.

As always, networking opportunities are what you make of them. This will be a good opportunity to spread the good news about topic maps.

From the Bin Laden Letters: Mapping OBL’s Reach into Yemen

Filed under: Intelligence — Patrick Durusau @ 3:25 pm

From the Bin Laden Letters: Mapping OBL’s Reach into Yemen

I puzzled over this headline. A close friend refers to President Obama as “OB1” so I had a moment of confusion when reading the headline. Didn’t make sense for Bin Laden’s letters to map President Obama’s reach into Yemen.

With some diplomatic cables and White House internal documents, that would be an interesting visualization as well.

The mining of a larger corpus of 70,000+ public sources for individuals mentioned in the Bin Laden letters is responsible for the visualizations.

What we don’t know is what means of analysis produced the visualizations in question.

Some process was used to reduce redundant references to the same actors, events and relationships. Just by way of example.

That isn’t a complaint, simply an observation. It isn’t possible to evaluate the techniques used to obtain the results.

It would be interesting to see Recorded Future in one of the TREC competitions. At least then the results would be against a shared data set.

Do be aware that when the text says “open source,” what is meant is “open source intelligence.”

The better practice would be to say “open source intelligence or (OSINT)” and not “open source,” the latter having a well recognized meaning in the software community.

Need cash? NLnet advances open source technology by funding new projects

Filed under: Funding,Open Source — Patrick Durusau @ 1:52 pm

Need cash? NLnet advances open source technology by funding new projects

Next Round of Ideas Due: June 1st 2012.

Lead story at OpenSource.com today.

From the story:

If you have a valuable idea or project that can help create a more open global information society, and are looking for financial means to make your ideas come through, we might be able to help you. Indeed our mission is to fund open source projects and individuals to improve important and strategic networking technologies for the better of mankind. Whether this concerns more robust internet technologies and standards, privacy enhancing technologies or open document formats – we are open for your proposals.

We are independent. We are not like other funding bodies you may have experience with, because we only have to judge on quality and relevance, and not on politics or any other dimension. What is important for us is that the technology you develop and promote is usable for others and has real impact. And we are also interested to hear your inspiring ideas if you are unable to manage it yourself.

We spend our money in supporting strategic initiatives that contribute to an open information society, especially where these are aimed at development and dissemination of open standards and network related technology.

More details in the story or at the NLnet website.

What’s your great idea?

OpenSource.com

Filed under: Open Data,Open Source — Patrick Durusau @ 1:30 pm

OpenSource.com

Not sure how I got to OpenSource.com but it showed up as a browser tab after a crash. Maybe it is a new feature and not a bug.

Thought I would take the opportunity to point it out (and record it here) as a source of projects and news from the open source community.

Not to mention data sets, source code, marketing opportunities, etc.

Identifying And Weighting Integration Hypotheses On Open Data Platforms

Filed under: Crowd Sourcing,Data Integration,Integration,Open Data — Patrick Durusau @ 12:58 pm

Identifying And Weighting Integration Hypotheses On Open Data Platforms by Julian Eberius, Katrin Braunschweig, Maik Thiele, and Wolfgang Lehner.

Abstract:

Open data platforms such as data.gov or opendata.socrata.com provide a huge amount of valuable information. Their free-for-all nature, the lack of publishing standards and the multitude of domains and authors represented on these platforms lead to new integration and standardization problems. At the same time, crowd-based data integration techniques are emerging as new way of dealing with these problems. However, these methods still require input in form of specific questions or tasks that can be passed to the crowd. This paper discusses integration problems on Open Data Platforms, and proposes a method for identifying and ranking integration hypotheses in this context. We will evaluate our findings by conducting a comprehensive evaluation using one of the largest Open Data platforms.

This is interesting work on Open Data platforms but it is marred by claims such as:

Open Data Platforms have some unique integration problems that do not appear in classical integration scenarios and which can only be identified using a global view on the level of datasets. These problems include partial- or duplicated datasets, partitioned datasets, versioned datasets and others, which will be described in detail in Section 4.

Really?

That would come as a surprise to the World Data Centre for Aerosols, whose Synthesis and INtegration of Global Aerosol Data Sets project (Contract No. ENV4-CT98-0780 (DG 12 –EHKN)) worked on data sets from 1999 to 2001. One of the specific issues they addressed was duplicate data sets.

More than a decade ago counts as a “classical integration scenario,” I think.

Another quibble. Cited sources do not support the text.

New forms of data management such as dataspaces and pay-as-you-go data integration [2, 6] are a hot topic in database research. They are strongly related to Open Data Platforms in that they assume large sets of heterogeneous data sources lacking a global or mediated schemata, which still should be queried uniformly.

[2] M. Franklin, A. Halevy, and D. Maier. From databases to dataspaces: a new abstraction for information management. SIGMOD Rec., 34:27-33, December 2005.

[6] J. Madhavan, S. R. Jeffery, S. Cohen, X. Dong, D. Ko, C. Yu, and A. Halevy. Web-scale Data Integration: You Can Only Afford to Pay As You Go. In Proc. of CIDR-07, 2007.

Articles written seven (7) and five (5) years ago do not justify a claim of “hot topic(s) in database research” today.

There are other issues, major and minor, but for all that, this is important work.

I want to see reports that do justice to its importance.

Modeling vs Mining?

Filed under: Data Mining,Data Models — Patrick Durusau @ 12:07 pm

Steve Miller writes in Politics of Data Models and Mining:

I recently came across an interesting thread, “Is data mining still a sin against the norms of econometrics?”, from the Advanced Business Analytics LinkedIn Discussion Group. The point of departure for the dialog is a paper entitled “Three attitudes towards data mining”, written by couple of academic econometricians.

The data mining “attitudes” range from the extremes that DM techniques are to be avoided like the plague, to one where “data mining is essential and that the only hope that we have of using econometrics to uncover true economic relationships is to be found in the intelligent mining of data.” The authors note that machine learning phobia is currently the norm in economics research.

Why is this? “Data mining is considered reprehensible largely because the world is full of accidental correlations, so that what a search turns up is thought to be more a reflection of what we want to find than what is true about the world.” In contrast, “Econometrics is regarded as hypothesis testing. Only a well specified model should be estimated and if it fails to support the hypothesis, it fails; and the economist should not search for a better specification.”

In other words, econometrics focuses on explanation, expecting its practitioners to generate hypotheses for testing with regression models. ML, on the other hand, obsesses on discovery and prediction, often content to let the data talk directly, without the distraction of “theory.” Just as bad, the results of black-box ML might not be readily interpretable for tests of economic hypotheses.

Watching other communities fight over odd questions is always more enjoyable than serious disputes of grave concern in our own. (See Using “Punning” to Answer httpRange-14 for example.)

I mention the economist’s dispute, not simply to make jests at the expense of “econometricians.” (Do topic map supporters need a difficult name? TopicMapologists? Too short.)

The economist’s debate is missing an understanding that modeling requires some knowledge of the domain (mining whether formal or informal) and mining requires some idea of an output (models whether spoken or unspoken). A failing that is all too common across modeling/mining domains.

To put it another way:

We never stumble upon data that is “untouched by human hands.”

We never build models without knowledge of the data we are modeling.

The relevant question is: Does the model or data mining provide a useful result?

(Typically measured by your client’s joy or sorrow over your results.)

Progressive NoSQL Tutorials

Filed under: Cassandra,Couchbase,CouchDB,MongoDB,Neo4j,NoSQL,RavenDB,Riak — Patrick Durusau @ 10:20 am

Have you ever gotten an advertising email with clean links in it? I mean a link without all the marketing crap appended to the end. The stuff you have to clean off before using it in a post or sending it to a friend?

Got my first one today. From Skills Matter on the free videos for their Progressive NoSQL Tutorials that just concluded.

High quality presentations, videos freely available after presentation, friendly links in email, just a few of the reasons to support Skills Matter.

The tutorials:

Lucene-1622

Filed under: Indexing,Lucene,Synonymy — Patrick Durusau @ 9:32 am

Multi-word synonym filter (synonym expansion at indexing time) Lucene-1622

From the description:

It would be useful to have a filter that provides support for indexing-time synonym expansion, especially for multi-word synonyms (with multi-word matching for original tokens).

The problem is not trivial, as observed on the mailing list. The problems I was able to identify (mentioned in the unit tests as well):

  • if multi-word synonyms are indexed together with the original token stream (at overlapping positions), then a query for a partial synonym sequence (e.g., “big” in the synonym “big apple” for “new york city”) causes the document to match;
  • there are problems with highlighting the original document when synonym is matched (see unit tests for an example),
  • if the synonym is of different length than the original sequence of tokens to be matched, then phrase queries spanning the synonym and the original sequence boundary won’t be found. Example “big apple” synonym for “new york city”. A phrase query “big apple restaurants” won’t match “new york city restaurants”.

I am posting the patch that implements phrase synonyms as a token filter. This is not necessarily intended for immediate inclusion, but may provide a basis for many people to experiment and adjust to their own scenarios.
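To make the first problem in the issue concrete, here is a tiny, self-contained illustration (plain Python, not Lucene): once “big apple” is injected at the same positions as “new york city”, a query for just “big” matches a document that never contained the word.

```python
def index_with_synonyms(tokens, synonyms):
    """Return {term: set(positions)}, injecting multi-word synonyms at overlapping positions."""
    postings = {}
    for pos, tok in enumerate(tokens):
        postings.setdefault(tok, set()).add(pos)
    for phrase, replacement in synonyms.items():
        phrase_toks = phrase.split()
        for start in range(len(tokens) - len(phrase_toks) + 1):
            if tokens[start:start + len(phrase_toks)] == phrase_toks:
                for offset, syn_tok in enumerate(replacement.split()):
                    postings.setdefault(syn_tok, set()).add(start + offset)
    return postings

doc = "the new york city restaurants are crowded".split()
postings = index_with_synonyms(doc, {"new york city": "big apple"})

print("big" in postings)       # True -- a query for the partial synonym 'big' now matches
print(postings.get("apple"))   # the synonym spans fewer positions, so phrase offsets drift
```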

This remains an open issue as of 16 May 2012.

It is also an important open issue.

Think about it.

As “big data” gets larger and larger, at some point traditional ETL isn’t going to be practical. Due to storage, performance, selective granularity or other issues, ETL is going to fade into the sunset.

Indexing, on the other hand, which treats data “in situ” (“in position” for you non-archaeologists in the audience), avoids many of the issues with ETL.

The treatment of synonyms (synonyms across data sets, multi-word synonyms, specifying the ranges of synonyms for both indexing and search, synonym expansion, a whole range of synonym features and capabilities) needs to “man up” to take on “big data.”

