Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 22, 2012

Draft (polysemy and ambiguity)

Filed under: Ambiguity,Language,Polysemy — Patrick Durusau @ 7:40 pm

Draft by Mark Liberman

From the post:

In a series of Language Log posts, Geoff Pullum has called attention to the prevalence of polysemy and ambiguity:

The people who think clarity involves lack of ambiguity, so we have to strive to eliminate all multiple meanings and should never let a word develop a new sense… they simply don’t get it about how language works, do they?

Languages love multiple meanings. They lust after them. They roll around in them like a dog in fresh grass.

The other day, as I was reading a discussion in our comments about whether English draftable does or doesn’t refer to the same concept as Finnish asevelvollisuus (“obligation to serve in the military”), I happened to be sitting in a current of uncomfortably cold air. So of course I wondered how the English word draft came to refer to military conscription as well as air flow. And a few seconds of thought brought to mind several other senses of the noun draft and its associated verb. I figured that this must represent a confusion of several originally separate words. But then I looked it up.

If you like language and have an appreciation for polysemy and ambiguity, you will enjoy this post a lot.

Combining Heterogeneous Classifiers for Relational Databases (Of Relational Prisons and such)

Filed under: Business Intelligence,Classifier,Database,Schema — Patrick Durusau @ 7:39 pm

Combining Heterogeneous Classifiers for Relational Databases by Geetha Manjunatha, M Narasimha Murty and Dinkar Sitaram.

Abstract:

Most enterprise data is distributed in multiple relational databases with expert-designed schema. Using traditional single-table machine learning techniques over such data not only incur a computational penalty for converting to a ‘flat’ form (mega-join), even the human-specified semantic information present in the relations is lost. In this paper, we present a practical, two-phase hierarchical meta-classification algorithm for relational databases with a semantic divide and conquer approach. We propose a recursive, prediction aggregation technique over heterogeneous classifiers applied on individual database tables. The proposed algorithm was evaluated on three diverse datasets, namely TPCH, PKDD and UCI benchmarks and showed considerable reduction in classification time without any loss of prediction accuracy.

When I read:

So, a typical enterprise dataset resides in such expert-designed multiple relational database tables. On the other hand, as known, most traditional classification algorithms still assume that the input dataset is available in a single table – a flat representation of data attributes. So, for applying these state-of-art single-table data mining techniques to enterprise data, one needs to convert the distributed relational data into a flat form.

a couple of things dropped into place.

First, the problem being described, the production of a flat form for analysis, reminds me of the problem of record linkage in the late 1950s (predating relational databases). There, records were regularized to enable very similar analysis.

Second, as the authors state a paragraph or so later, conversion to such a format is not possible in most cases. It is interesting that the choice of relational database table design has the effect of limiting the types of analysis that can be performed on the data.

Therefore, knowledge mining over real enterprise data using machine learning techniques is very valuable for what is called an intelligent enterprise. However, application of state-of-art pattern recognition techniques in the mainstream BI has not yet taken off [Gartner report] due to lack of in-memory analytics among others. The key hurdle to make this possible is the incompatibility between the input data formats used by most machine learning techniques and the formats used by real enterprises.

If freeing data from its relational prison is a key aspect to empowering business intelligence (BI), what would you suggest as a solution?
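
To make the two-phase idea concrete, here is a minimal sketch of per-table classification with a simple meta-aggregation step, assuming scikit-learn and pandas; the table names, features and the averaging rule are invented for illustration and are not the paper’s actual algorithm.

```python
# Hypothetical sketch of per-table classification + meta-aggregation.
# Table names, columns, and the aggregation rule are illustrative only.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def train_per_table(tables, labels):
    """Train one classifier per relational table (phase 1)."""
    models = {}
    for name, df in tables.items():
        clf = LogisticRegression(max_iter=1000)
        clf.fit(df.values, labels)
        models[name] = clf
    return models

def meta_predict(models, tables):
    """Aggregate per-table class probabilities (phase 2)."""
    # Stack each table's predicted probability of the positive class.
    probs = np.column_stack(
        [models[name].predict_proba(df.values)[:, 1] for name, df in tables.items()]
    )
    # Simple aggregation: average the probabilities (the paper describes a
    # recursive, hierarchy-aware scheme; this only conveys the flavor).
    return (probs.mean(axis=1) > 0.5).astype(int)

# Toy data: two "tables" describing the same 6 customers, no mega-join needed.
rng = np.random.default_rng(0)
orders  = pd.DataFrame(rng.normal(size=(6, 3)), columns=["amt", "freq", "recency"])
support = pd.DataFrame(rng.normal(size=(6, 2)), columns=["tickets", "sentiment"])
labels  = np.array([0, 1, 0, 1, 1, 0])

models = train_per_table({"orders": orders, "support": support}, labels)
print(meta_predict(models, {"orders": orders, "support": support}))
```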

Construction of Learning Path Using Ant Colony Optimization from a Frequent Pattern Graph

Filed under: Authoring Topic Maps,Education,Graphs — Patrick Durusau @ 7:38 pm

Construction of Learning Path Using Ant Colony Optimization from a Frequent Pattern Graph by Souvik Sengupta, Sandipan Sahu and Ranjan Dasgupta.

Abstract:

In an e-Learning system a learner may come across multiple unknown terms, which are generally hyperlinked, while reading a text definition or theory on any topic. It becomes even harder when one tries to understand those unknown terms through further such links and they again find some new terms that have new links. As a consequence they get confused where to initiate from and what are the prerequisites. So it is very obvious for the learner to make a choice of what should be learnt before what. In this paper we have taken the data mining based frequent pattern graph model to define the association and sequencing between the words and then adopted the Ant Colony Optimization, an artificial intelligence approach, to derive a searching technique to obtain an efficient and optimized learning path to reach to a unknown term.

The phrase “multiple unknown terms, which are generally hyperlinked” is a good description of any location in a topic map for anyone other than its author and other experts in the field it describes.

Although couched in terms of a classroom educational setting, I suspect techniques very similar to these could be used with any topic map interface with users.
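
For readers who have not met Ant Colony Optimization before, a toy sketch of the idea applied to a prerequisite graph is below; the graph, parameters and scoring are invented for illustration and are not taken from the paper.

```python
# Toy ant colony optimization over a made-up prerequisite graph.
# The graph, parameters, and scoring are illustrative, not the paper's.
import random

# term -> list of (next_term, cost); cost ~ effort to learn the next term
graph = {
    "variable":  [("function", 1), ("loop", 2)],
    "loop":      [("recursion", 3)],
    "function":  [("recursion", 1), ("closure", 2)],
    "recursion": [("dynamic programming", 2)],
    "closure":   [("dynamic programming", 3)],
}
pheromone = {(u, v): 1.0 for u, edges in graph.items() for v, _ in edges}

def walk(start, goal, alpha=1.0, beta=2.0):
    """One ant walks from start toward goal, biased by pheromone and 1/cost."""
    path, node = [start], start
    while node != goal and node in graph:
        edges = graph[node]
        weights = [pheromone[(node, v)] ** alpha * (1.0 / c) ** beta for v, c in edges]
        node = random.choices([v for v, _ in edges], weights=weights)[0]
        path.append(node)
    return path

def run(start, goal, ants=50, evaporation=0.1):
    best = None
    for _ in range(ants):
        path = walk(start, goal)
        if path[-1] != goal:
            continue
        if best is None or len(path) < len(best):
            best = path
        # Evaporate, then deposit pheromone along this ant's path.
        for edge in pheromone:
            pheromone[edge] *= (1 - evaporation)
        for u, v in zip(path, path[1:]):
            pheromone[(u, v)] += 1.0 / len(path)
    return best

print(run("variable", "dynamic programming"))
```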

A Description Logic Primer

Filed under: Description Logic,Logic,Ontology — Patrick Durusau @ 7:37 pm

A Description Logic Primer by Markus Krötzsch, Frantisek Simancik and Ian Horrocks.

Abstract:

This paper provides a self-contained first introduction to description logics (DLs). The main concepts and features are explained with examples before syntax and semantics of the DL SROIQ are defined in detail. Additional sections review light-weight DL languages, discuss the relationship to the Web Ontology Language OWL and give pointers to further reading.

It’s an introduction to description logics but it is also a readable introduction to description logics (DLs). And it will give you a good overview of the area.

As the paper points out, DLs are older than their use with web ontology languages but that is the use that you are most likely to encounter.

You won’t find any new information here but it may be a good refresher.
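
For a taste of the notation the primer explains, here are two textbook-style DL axioms (my own examples, not drawn from the paper): a concept definition and a concept inclusion.

```latex
% Illustrative description logic axioms (not from the primer itself)
\mathit{Mother} \equiv \mathit{Woman} \sqcap \exists \mathit{hasChild}.\mathit{Person}
\qquad
\mathit{Orphan} \sqsubseteq \mathit{Person} \sqcap \forall \mathit{hasParent}.\mathit{Dead}
```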

The Role of Social Networks in Information Diffusion

Filed under: Networks,Social Graphs,Social Media,Social Networks — Patrick Durusau @ 7:35 pm

The Role of Social Networks in Information Diffusion by Eytan Bakshy, Itamar Rosenn, Cameron Marlow and Lada Adamic.

Abstract:

Online social networking technologies enable individuals to simultaneously share information with any number of peers. Quantifying the causal effect of these technologies on the dissemination of information requires not only identification of who influences whom, but also of whether individuals would still propagate information in the absence of social signals about that information. We examine the role of social networks in online information diffusion with a large-scale field experiment that randomizes exposure to signals about friends’ information sharing among 253 million subjects in situ. Those who are exposed are significantly more likely to spread information, and do so sooner than those who are not exposed. We further examine the relative role of strong and weak ties in information propagation. We show that, although stronger ties are individually more influential, it is the more abundant weak ties who are responsible for the propagation of novel information. This suggests that weak ties may play a more dominant role in the dissemination of information online than currently believed.

Sample size: 253 million Facebook users.

Pay attention to the line:

We show that, although stronger ties are individually more influential, it is the more abundant weak ties who are responsible for the propagation of novel information.

If you have a “Web scale” (whatever that means) information delivery issue, you should not only target CNN and Drudge with press releases but also consider targeting actors with abundant weak ties.

Thinking this could be important in topic map driven applications that “push” novel information into the social network of a large, distributed company. You know how few of us actually read the tiresome broadcast stuff from HR, etc., so what if the important parts were “reported” piecemeal by others?

It is great to have a large functioning topic map but it doesn’t become useful until people make the information it delivers their own and take action based upon it.
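
One way to act on that finding is to rank people by how many weak ties they hold rather than by how strong any single tie is. A rough sketch with networkx follows; the toy graph and the interaction-count threshold are my own assumptions, not how the paper measures tie strength.

```python
# Rough sketch: find actors with the most weak ties in an interaction graph.
# The threshold and the toy graph are made up; tie strength in the paper
# comes from Facebook interaction data.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("ann", "bob", 12), ("ann", "carol", 1), ("ann", "dave", 2),
    ("bob", "carol", 15), ("carol", "erin", 1), ("dave", "erin", 3),
    ("erin", "frank", 1), ("frank", "ann", 2),
])

WEAK = 3  # interactions per month below this count as a "weak" tie

def weak_tie_count(g, node):
    return sum(1 for _, _, w in g.edges(node, data="weight") if w < WEAK)

ranked = sorted(G.nodes, key=lambda n: weak_tie_count(G, n), reverse=True)
for node in ranked:
    print(node, weak_tie_count(G, node))
```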

TXR: a Pattern Matching Language (Not Just)….

Filed under: Pattern Matching,Text Extraction — Patrick Durusau @ 7:34 pm

TXR: a Pattern Matching Language (Not Just) for Convenient Text Extraction

From the webpage:

TXR (“texer” or “tee ex ar”) is a new and growing language oriented toward processing text, packaged as a utility (the txr command) that runs in POSIX environments and on Microsoft Windows.

Working with TXR is different from most text processing programming languages. Constructs in TXR aren’t imperative statements, but rather pattern-matching directives: each construct terminates by matching, failing, or throwing an exception. Searching and backtracking behaviors are implicit.

The development of TXR began when I needed a utility to be used in shell programming which would reverse the action of a “here-document”. Here-documents are a programming language feature for generating multi-line text from a boiler-plate template which contains variables to be substituted, and possibly other constructs such as various functions that generate text. Here-documents appeared in the Unix shell decades ago, but most of today’s web is basically a form of here-document, because all non-trivial websites generate HTML dynamically, substituting variable material into templates on the fly. Well, in the given situation I was programming in, I didn’t want here documents as much as “there documents”: I wanted to write a template of text containing variables, but not to generate text but to do the reverse: match the template against existing text which is similar to it, and capture pieces of it into variables. So I developed a utility to do just that: capture these variables from a template, and then generate a set of variable assignments that could be eval-ed in a shell script.

That was sometime in the middle of 2009. Since then TXR has become a lot more powerful. It has features like structured named blocks with nonlocal exits, structured exception handling, pattern matching functions, and numerous other features. TXR is powerful enough to parse grammars, yet simple to use on trivial tasks.

For things that can’t be easily done in the pattern matching language, TXR has a built-in Lisp dialect, which supports goodies like first class functions with closures over lexical environments, I/O (including string and list streams), hash tables with optional weak semantics, and arithmetic with arbitrary precision (“bignum”) integers.

A powerful tool for text extraction/manipulation.
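
The “there document” idea, matching a template against existing text and capturing the variable parts, is worth pausing on. The sketch below imitates the concept with a plain Python regular expression; it is not TXR syntax, just the reverse-of-a-here-document idea in miniature.

```python
# Imitating the "there document" idea with a regular expression: take a
# template with named holes, match it against existing text, and capture
# the holes. A concept sketch only; TXR's own template syntax differs.
import re

template = "User: {user}\nHost: {host}\nUptime: {uptime}"
text = "User: alice\nHost: build-07\nUptime: 14 days"

# Split the template on {name} holes, escape the literal parts, and join
# them back together with named capture groups.
parts = re.split(r"\{(\w+)\}", template)
pattern = "".join(
    re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
    for i, part in enumerate(parts)
)
match = re.fullmatch(pattern, text)
print(match.groupdict())
# {'user': 'alice', 'host': 'build-07', 'uptime': '14 days'}
```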

Stan: A (Bayesian) Directed Graphical Model Compiler

Filed under: Bayesian Data Analysis,Graphical Models — Patrick Durusau @ 7:33 pm

Stan: A (Bayesian) Directed Graphical Model Compiler

Post with a link to a presentation to the NYC machine learning meetup.

Stan, a C++ library for probability and sampling, has not (yet) been released (BSD license), but has the following components:

From the Google Code page:

  • Directed Graphical Model Compiler
  • (Adaptive) Hamiltonian Monte Carlo Sampling
  • Hamiltonian Monte Carlo Sampling
  • Gibbs Sampling for Discrete Parameters
  • Reverse Mode Algorithmic Differentiation
  • Probability Distributions
  • Special Functions
  • Matrices and Linear Algebra
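
If “Hamiltonian Monte Carlo” in that list is unfamiliar, the core of the algorithm fits in a few lines: simulate Hamiltonian dynamics with leapfrog steps, then accept or reject the endpoint. Here is a bare-bones sketch for a 1-D standard normal target, the textbook algorithm rather than anything from Stan’s (unreleased) code.

```python
# Bare-bones Hamiltonian Monte Carlo for a 1-D standard normal target.
# Textbook algorithm only; Stan's (adaptive) implementation is not public yet.
import numpy as np

rng = np.random.default_rng(42)

def U(q):      return 0.5 * q * q     # negative log density of N(0, 1)
def grad_U(q): return q

def hmc_step(q, step=0.1, n_leapfrog=20):
    p = rng.normal()                   # resample momentum
    q_new, p_new = q, p
    # Leapfrog integration: half momentum step, alternating full steps,
    # then a closing half momentum step.
    p_new -= 0.5 * step * grad_U(q_new)
    for i in range(n_leapfrog):
        q_new += step * p_new
        if i < n_leapfrog - 1:
            p_new -= step * grad_U(q_new)
    p_new -= 0.5 * step * grad_U(q_new)
    # Metropolis accept/reject on the change in total energy.
    dH = (U(q_new) + 0.5 * p_new ** 2) - (U(q) + 0.5 * p ** 2)
    return q_new if np.log(rng.uniform()) < -dH else q

samples, q = [], 0.0
for _ in range(5000):
    q = hmc_step(q)
    samples.append(q)
print(np.mean(samples), np.std(samples))   # should land near 0 and 1
```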

Is it time to get rid of the Linux OS model in the cloud?

Filed under: Linux OS,Topic Map Software,Topic Map Systems — Patrick Durusau @ 7:32 pm

Is it time to get rid of the Linux OS model in the cloud?

From the post:

You program in a dynamic language, that runs on a JVM, that runs on a OS designed 40 years ago for a completely different purpose, that runs on virtualized hardware. Does this make sense? We’ve talked about this idea before in Machine VM + Cloud API – Rewriting The Cloud From Scratch, where the vision is to treat cloud virtual hardware as a compiler target, and converting high-level language source code directly into kernels that run on it.

As new technologies evolve the friction created by our old tool chains and architecture models becomes ever more obvious. Take, for example, what a team at UCSD is releasing: a phase-change memory prototype – a solid state storage device that provides performance thousands of times faster than a conventional hard drive and up to seven times faster than current state-of-the-art solid-state drives (SSDs). However, PCM has access latencies several times slower than DRAM.

This technology has obvious mind blowing implications, but an interesting not so obvious implication is what it says about our current standard datacenter stack. Gary Athens has written an excellent article, Revamping storage performance, spelling it all out in more detail:

Computer scientists at UCSD argue that new technologies such as PCM will hardly be worth developing for storage systems unless the hidden bottlenecks and faulty optimizations inherent in storage systems are eliminated.

Moneta bypasses a number of functions in the operating system (OS) that typically slow the flow of data to and from storage. These functions were developed years ago to organize data on disk and manage input and output (I/O). The overhead introduced by them was so overshadowed by the inherent latency in a rotating disk that they seemed not to matter much. But with new technologies such as PCM, which are expected to approach dynamic random-access memory (DRAM) in speed, the delays stand in the way of the technologies’ reaching their full potential. Linux, for example, takes 20,000 instructions to perform a simple I/O request.

By redesigning the Linux I/O stack and by optimizing the hardware/software interface, researchers were able to reduce storage latency by 60% and increase bandwidth as much as 18 times.

The I/O scheduler in Linux performs various functions, such as assuring fair access to resources. Moneta bypasses the scheduler entirely, reducing overhead. Further gains come from removing all locks from the low-level driver, which block parallelism, by substituting more efficient mechanisms that do not.

Moneta performs I/O benchmarks 9.5 times faster than a RAID array of conventional disks, 2.8 times faster than a RAID array of flash-based solid-state drives (SSDs), and 2.2 times faster than fusion-io’s high-end, flash-based SSD.

Read the rest of the post and then ask yourself: what architecture do you envision for a topic map application?

What if, rather than moving data from one data structure to another, the data structure being addressed were identified by the data itself? If you wish to “see” the data as a table, it reports its location by table/column/row. If you wish to “see” the data as a matrix, it reports its matrix position. If you wish to “see” the data as a linked list, it can report its value, plus those ahead of and behind it.

It isn’t that difficult to imagine data reporting its location on a graph as the result of an operation, perhaps storing its graph location for every graphing operation that is “run” using that data point.
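
A toy version of that idea, purely my own illustration rather than anything proposed in the post: a value that carries its own coordinates under several “views” and reports whichever one you ask for.

```python
# Toy illustration of data that reports its own location per view.
# Entirely illustrative; not from the post being quoted.
class Datum:
    def __init__(self, value):
        self.value = value
        self.locations = {}          # view name -> coordinates in that view

    def place(self, view, coords):
        self.locations[view] = coords

    def where(self, view):
        return self.locations.get(view, "not placed in this view")

d = Datum(42)
d.place("table",  ("orders", "amount", 17))           # table / column / row
d.place("matrix", (3, 5))                             # row / column indices
d.place("list",   {"prev": 41, "next": 43})           # linked-list neighbors
d.place("graph",  ("node-9", ["node-2", "node-4"]))   # node id, adjacent nodes

print(d.where("matrix"))   # (3, 5)
print(d.where("graph"))    # ('node-9', ['node-2', 'node-4'])
```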

True enough, we need to create topic maps that run on conventional hardware/software, but that isn’t an excuse to ignore possible futures.

Reminds me of a “grook” that I read years ago: “You will conquer the present suspiciously fast – if you smell of the future and stink of the past.” (Piet Hein but I don’t remember which book.)

NGINX: The Faster Web Server Alternative

Filed under: Software,Web Server — Patrick Durusau @ 7:31 pm

NGINX: The Faster Web Server Alternative by Steven J. Vaughan-Nichols.

From the post:

Picking a Web server used to be easy. If you ran a Windows shop, you used Internet Information Server (IIS); if you didn’t, you used Apache. No fuss. No muss. Now, though, you have more Web server choices, and far more decisions to make. One of the leading alternatives, the open-source NGINX, is now the number two Web server in the world, according to Netcraft, the Web server analytics company.

NGINX (pronounced “engine X”) is an open-source HTTP Web server that also includes mail services with an Internet Message Access Protocol (IMAP) and Post Office Protocol (POP) server. NGINX is ready to be used as a reverse proxy, too. In this mode NGINX is used to load balance among back-end servers, or to provide caching for a slower back-end server.

Companies like the online TV video on demand company Hulu use NGINX for its stability and simple configuration. Other users, such as Facebook and WordPress.com, use it because the web server’s asynchronous architecture gives it a small memory footprint and low resource consumption, making it ideal for handling multiple, actively changing Web pages.

That’s a tall order. According to NGINX’s principal architect Igor Sysoev, here’s how NGINX can support hundreds of millions of Facebook users.

I have to admit, NGINX being web server #2 caught my attention. Not to mention that it powers Hulu, Facebook and WordPress.com.

It has been years since I have even looked at an Apache web server (used to run them) but I do remember their stability and performance. And Apache would be my reflex recommendation for delivering web pages from a topic map application. Why re-write what already works?

Now NGINX comes along with impressive performance numbers and potentially new ways to organize on the server side.

Read the article, grab a copy of NGINX and let me know what you think.

January 21, 2012

Open-sourcing Sky Map….

Filed under: Astroinformatics — Patrick Durusau @ 10:13 pm

Open-sourcing Sky Map and collaborating with Carnegie Mellon University

In May 2009 we launched Google Sky Map: our “window on the sky” for Android phones. Created by half a dozen Googlers at the Pittsburgh office in our 20% time, the app was designed to show off the amazing capabilities of the sensors in the first generation Android phones. Mostly, however, we wrote it because we love astronomy. And, thanks to Android’s broad reach, we have managed to share this passion with over 20 million Android users as well as with our local community at events such as the Urban Sky Party.

Today, we are delighted to announce that we are going to share Sky Map in a different way: we are donating Sky Map to the community. We are collaborating with Carnegie Mellon University in an exciting partnership that will see further development of Sky Map as a series of student projects. Sky Map’s development will now be driven by the students, with Google engineers remaining closely involved as advisors. Additionally, we have open-sourced the app so that other astronomy enthusiasts can take the code and augment it as they wish.

I mention this because I am sure there will be opportunities to use topic maps to map additional astronomical information into the app.

Chart of Congressional activity on Twitter related to SOPA/PIPA

Filed under: Graphs,Tweets,Visualization — Patrick Durusau @ 10:12 pm

Chart of Congressional activity on Twitter related to SOPA/PIPA by Drew Conway.

From the post:

As many of you know, this week thousands of people mobilized to protest two laws being considered in Congress: the Stop Online Piracy Act (SOPA) and its Senate version, the PROTECT IP Act (PIPA). Several Internet mainstays, such as Wikipedia, Reddit and O’Reilly, blacked out their sites to protest the bill. For some information on why this legislation is so dangerous check out this excellent video by The Guardian.

The mobilization against SOPA/PIPA also included many grassroots efforts to contact Congress and demand the bill be stopped. Given the attention the bill was getting, I was curious if there was any surge in discussion of the bill by members of Congress on Twitter.

So, I created a visualization that is a cumulative timeline of tweets by members of the U.S. Congress for “SOPA” or “PIPA.” To see if there was any surge, check out the visualization for yourself.

Go see Drew’s post and draw the graph for yourself.

OK, but my question would be, who are they tweeting to? Need to distinguish targets of the tweets from those who actually read the tweets. One possible mechanism would be retweets.

That is, who retweeted messages from a particular member of Congress? It would be interesting to map that to, say, a list of congressional contributors. Different sets of identifiers for Twitter versus donation records, but the same subjects.
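
That last point is the classic topic map move: two record sets, two identifier schemes, one set of subjects. A minimal sketch of such a merge is below; the handles, donor names and amounts are all invented.

```python
# Minimal sketch of merging records about the same subjects that carry
# different identifiers. All names, handles, and amounts are invented.
twitter = {
    "@rep_smith": {"retweeters": 412, "subject": "us-house-smith"},
    "@sen_jones": {"retweeters": 97,  "subject": "us-senate-jones"},
}
donations = {
    "SMITH, JOHN (US HOUSE)":  {"total": 1_250_000, "subject": "us-house-smith"},
    "JONES, MARY (US SENATE)": {"total": 480_000,   "subject": "us-senate-jones"},
}

# Merge on the shared subject identifier rather than on either native key.
merged = {}
for records in (twitter, donations):
    for native_key, rec in records.items():
        proxy = merged.setdefault(rec["subject"], {"identifiers": []})
        proxy["identifiers"].append(native_key)
        proxy.update({k: v for k, v in rec.items() if k != "subject"})

for subject, proxy in merged.items():
    print(subject, proxy)
```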

But Twitter is just surface traffic and public traffic at that. I assume after the “see my pants” episode last year that most members of Congress are a little more careful with Twitter accounts. Perhaps not.

What I would be interested in seeing is all the incoming/outgoing phone and other hidden traffic. Like Blackberries. Would not care about the content, just the points of contact. A “pen register” I think they used to call them. Not sure what you would call it for cellphone traffic.

Getting Started With The Gephi…

Filed under: Gephi,Graphs,Networks,Visualization — Patrick Durusau @ 10:10 pm

Getting Started With The Gephi Network Visualisation App – My Facebook Network, Part I by Tony Hirst.

From the post:

A couple of weeks ago, I came across Gephi, a desktop application for visualising networks.

And quite by chance, a day or two after I was asked about any tools I knew of that could visualise and help analyse social network activity around an OU course… which I take as a reasonable justification for exploring exactly what Gephi can do 🙂

So, after a few false starts, here’s what I’ve learned so far…

First up, we need to get some graph data – netvizz – facebook to gephi suggests that the netvizz facebook app can be used to grab a copy of your Facebook network in a format that Gephi understands, so I installed the app, downloaded my network file, and then uninstalled the app… (can’t be too careful 😉

Once Gephi is launched (and updated, if it’s a new download – you’ll see an updates prompt in the status bar along the bottom of the Gephi window, right hand side) Open… the network file you downloaded.

If you like part 1 as an introduction to Gephi, be sure to take in:

Getting Started With Gephi Network Visualisation App – My Facebook Network, Part II: Basic Filters

which starts out:

In Getting Started With Gephi Network Visualisation App – My Facebook Network, Part I I described how to get up and running with the Gephi network visualisation tool using social graph data pulled out of my Facebook account. In this post, I’ll explore some of the tools that Gephi provides for exploring a network in a more structured way.

If you aren’t familiar with Gephi, and if you haven’t read Part I of this series, I suggest you do so now…

…done that…?

Okay, so where do we begin? As before, I’m going to start with a fresh worksheet, and load my Facebook network data, downloaded via the netvizz app, into Gephi, but as an undirected graph this time! So far, so exactly the same as last time. Just to give me some pointers over the graph, I’m going to set the node size to be proportional to the degree of each node (that is, the number of people each person is connected to).

You will find lots more to explore with Gephi but this should give you a good start.

Sensei – Major Update

Filed under: NoSQL,Sensei — Patrick Durusau @ 10:08 pm

Sensei

My first post on Sensei was December 10, 2010 – Sensei – which if you follow the link given there, redirects to the new page.

The present homepage reads in part:

SenseiDB

Open-source, distributed, realtime, semi-structured database

Powering LinkedIn homepage and LinkedIn Signal.

Some Features:

  • Full-text search
  • Fast realtime updates
  • Structured and faceted search
  • Fast key-value lookup
  • High performing under concurrent heavy update and query volumes
  • Hadoop integration

Quite different, and not idle claims about numbers, either. I have heard of LinkedIn, as I am sure you have as well. 😉

I appreciate the effort to stay as close to SQL as possible but, lacking a copy of the current SQL standard (I need to fix that), I don’t know how much Sensei has diverged from SQL or why.

Not to nit-pick too much but entries like:

Note that wildcards % and _, not Lucene’s * and ? are used in BQL. This is mainly to make BQL more compatible with SQL. However, if * or ? is used, it is also accepted.

that I saw just scanning the documentation say to me that a close editing pass would be a useful thing.

I haven’t run the examples (yet) but good marks for the cars data example and for capturing a Twitter stream.

This post on Google+ statistics is a billion* times better than any other post

Filed under: Authoring Topic Maps,Humor — Patrick Durusau @ 10:07 pm

This post on Google+ statistics is a billion* times better than any other post by Rocky Agrawal.

From the post:

In Thursday’s Google earnings call, CEO Larry Page told the world that the company’s fledgling social network, Google+ has reached 90 million registered users. He went on to say that, “Over 60 percent of Google+ users use Google products on a daily basis. Over 80 percent of Google+ users use Google products every week.”

I’m not impressed by the numbers, and I’m not impressed by what Page was trying to do with them.

Counting registered users instead of daily active users tells us nothing about the popularity of the service. Think of the millions of people who’ve registered for Google+ but never use it. Second, given the huge popularity of Google search, Gmail, and YouTube, it’s actually surprising that so few people who have registered for Google+ are using those more popular services on a daily basis — only 60 percent. After all, remember that a lot of Google+ users accidentally became Google+ users only because they were already attached to another Google service.

But what concerns me most is that Google is touting these meaningless statistics in the hopes that journalists will misunderstand them and report that Google+ is seeing rapid growth. The bottom line is, those 60 percents, 80 percents and 90 million registered users are just there to mask the fact that Google doesn’t want to tell us how many people are actually using Google+.

It’s intellectually dishonest. And as a public company, it raises questions of Google’s intent — the market is watching Google’s moves in social and needs to see traction. I expect better from Google.

Some journalists, to say nothing of the great mass of the unwashed, will be misled by Larry Page’s statements, whether that was his intent or not. But many of them would have misunderstood his comments had they been delivered with the aid of anatomically correct dolls.

To illustrate another reporting approach to statements about usage by Larry Page, consider the Wall Street Journal coverage on the same topic:

The company has pushed into social media, launching Google+ as an alternative to Facebook’s popular website. (From: Google Shares Plunge As Earnings Report Raises Growth Concerns. Viewed January 21, 2012, 4:46 PM East Coast time.)

That’s it, one sentence. And apparently investors were among those not misled by Page’s comments. If they cared at all about Page’s comments on usage of Google+.

Topic map authoring tip:

For business topic maps, remember your readers are interested in facts or statements of facts that can form the basis for investment decisions or fraud lawsuits following investments. Lies about extraneous or irrelevant matters need not be included.

Neo4j Graph Database Tutorial:…Route Planner, and other Examples

Filed under: Graphs,Neo4j — Patrick Durusau @ 10:06 pm

Neo4j Graph Database Tutorial: How to build a Route Planner and other Examples by Micha Kops.

From the post:

Often in a developer’s life there is a scenario where using a relational database tends to get complicated or sometimes even slow – especially when there are fragments with multiple relationships or multiple connections present. This often leads to complex database queries or desperate software engineers trying to handle those problems with their ORM framework.

A possible solution might be to switch from a relational database to a graph database – and – neo4j is our tool of choice here. In the following tutorial we’re going to implement several examples to demonstrate the strengths of a graph database .. from a route planner to a social graph.

A very well organized and illustrated tutorial!

Uses the yEd graph editor for the illustrations.

Neo4j Videos – The Motherlode!

Filed under: Neo4j — Patrick Durusau @ 10:05 pm

Neo4j Videos

At present count, thirty-one (31) videos on different aspects of Neo4j, from introductions to the latest query language developments.

Just what you need if you plan to enter the Seed the Cloud challenge using Neo4j.

If you are making a presentation and/or video about Neo4j, this would be a great place to leave a copy.

January 20, 2012

Exploring News

Filed under: Data Mining,News — Patrick Durusau @ 9:22 pm

Exploring News by Matthew Hurst.

From the post:

In experimenting with news aggregation and mining on the d8taplex site, I’ve come up with the following questions:

  • Why are some news articles picked up and others not? News sources such as Reuters create articles that are either directly consumed or which are picked up by other publications and passed along.
  • Who are these people writing these articles? What are their interests, areas of expertise and personalities?
  • What is the role of the editor and how do they influence the selection and form of the content produced by the news machine?
The next round of experimentation with news aggregation has resulted in the current news site. It has the following features.

    Drop by and give Matthew a hand.

    Story selection has many factors. Which ones do you think are important?

    Will DOJ Tech Project Die After 10 Years?

    Filed under: Government — Patrick Durusau @ 9:21 pm

    Will DOJ Tech Project Die After 10 Years?

    From the post:

    A secure, interoperable radio network that the Department of Justice has been working on for more than a decade and that has cost the agency $356 million may be headed for failure, according to a new report by the agency’s inspector general.

According to the report, inadequate funding, frequent revisions to DOJ’s plans, and poor coordination threaten the success of the Integrated Wireless Network (IWN) and could leave the agency with obsolete radio equipment that doesn’t communicate well with other radio systems, which could in turn pose a threat to public safety.

    The program, DOJ’s response to the 9/11 Commission’s push for interoperable law enforcement communications, originally aimed to provide wireless communications to more than 81,000 federal law enforcement agents nationwide in the Department of Justice, Department of Homeland Security, and Department of the Treasury. General Dynamics, one of IWN’s top contractors, claims that the project could also save agency costs in the long run by reducing the number of radio towers by more than 50%.

    At page 33:

    As we found in our 2007 audit, establishing the requirements and designing a system that met the needs of a diverse group of users proved troublesome and time consuming because the users had conflicting priorities and all users believed that their needs were the most important. We found that little has changed since our 2007 audit.

I mention this horror story not because it is particularly relevant to topic maps from a technical standpoint, but because it is very relevant in terms of future government contracts.

You may or may not have heard about plans to consolidate federal agencies, cost-cutting measures, etc., most of which have a misguided faith that technology can solve what are governance or personnel issues.

    Trust me on this one: Technology cannot solve governance or personnel issues. Not even topic maps can do that. 😉 Truth be told, no technology can.

    Nor can any technology make us more willing to listen to other opinions, willing to consider evidence for positions we “know” aren’t true, etc.

To be sure, all the contractors got paid in this sad story, but the public (that would be the rest of us) isn’t any better off. In fact we may be worse off.

    Semantic Tech the Key to Finding Meaning in the Media

    Filed under: Ambiguity,Linked Data — Patrick Durusau @ 9:21 pm

    Semantic Tech the Key to Finding Meaning in the Media by Chris Lamb.

    Chris starts off well enough:

    News volume has moved from infoscarcity to infobesity. For the last hundred years, news in print was delivered in a container, called a newspaper, periodically, typically every twenty-four hours. The container constrained the product. The biggest constraints of the old paradigm were periodic delivery and limitations of column inches.

    Now information continually bursts through our Google Readers, our cell phones, our tablets, display screens in elevators and grocery stores. Do we really need to read all 88,731 articles on the Bernie Madoff trial? Probably not. And that’s the dilemma for news organizations.

    In the old metaphor, column-inches was the constraint. In the new metaphor, reader attention span becomes the constraint.

    But, then quickly starts to fade:

    Disambiguation is a technique to uniquely identify named entities: people, cities, and subjects. Disambiguation can identify that one article is about George Herbert Walker Bush, the 41st President of the US, and another article is about George Walker Bush, number 43. Similarly, the technology can distinguish between Lincoln Continental, the car, and Lincoln, Nebraska, the town. As part of the metadata, many tagging engines that disambiguate return unique identifiers called Uniform Resource Identifiers (URI). A URI is a pointer into a database.

    If tagging creates machine readable assets, disambiguation is the connective tissue between these assets. Leveraging tagging and disambiguation technologies, applications can now connect content with very disparate origins. Today’s article on George W. Bush can be automatically linked to an article he wrote when he owned the Texas Ranger’s baseball team. Similarly the online bio of Bill Gates can be automatically tied to his online New Mexico arrest record in April 1975.

    Apparently he didn’t read the paper The communicative function of ambiguity in language.

The problem with disambiguation is that you and I may well set up systems that disambiguate named entities differently. To be sure, we will get some of them the same, but the question becomes: which ones? Is 80% agreement enough?

That depends on the application, doesn’t it? What if we are looking for a terrorist who may have fissionable material? Does 80% look good enough?
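
To put a number on that worry, here is a trivial way to measure how often two disambiguation systems assign the same identifier to the same mention; the mentions and URIs are invented.

```python
# How often do two disambiguation systems agree? Toy mappings, invented URIs.
system_a = {
    "George Bush (1989 speech)": "uri:george-h-w-bush",
    "George Bush (2003 speech)": "uri:george-w-bush",
    "Lincoln (dealership ad)":   "uri:lincoln-motor-co",
    "Lincoln (census record)":   "uri:lincoln-nebraska",
}
system_b = {
    "George Bush (1989 speech)": "uri:george-h-w-bush",
    "George Bush (2003 speech)": "uri:george-h-w-bush",    # disagreement
    "Lincoln (dealership ad)":   "uri:lincoln-motor-co",
    "Lincoln (census record)":   "uri:abraham-lincoln",    # disagreement
}

shared = set(system_a) & set(system_b)
agree = sum(1 for m in shared if system_a[m] == system_b[m])
print(f"{agree}/{len(shared)} mentions agree ({agree / len(shared):.0%})")
# Whether 50% (or 80%) is "enough" depends entirely on the application.
```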

Ironic. Disambiguation is subject to the same ambiguity it set out to solve.

    PS: URIs aren’t necessarily pointers into databases.

    The communicative function of ambiguity in language

    Filed under: Ambiguity,Context,Corpus Linguistics,Linguistics — Patrick Durusau @ 9:20 pm

    The communicative function of ambiguity in language by Steven T. Piantadosi, Harry Tily and Edward Gibson. (Cognition, 2011) (PDF file)

    Abstract:

    We present a general information-theoretic argument that all efficient communication systems will be ambiguous, assuming that context is informative about meaning. We also argue that ambiguity allows for greater ease of processing by permitting efficient linguistic units to be re-used. We test predictions of this theory in English, German, and Dutch. Our results and theoretical analysis suggest that ambiguity is a functional property of language that allows for greater communicative efficiency. This provides theoretical and empirical arguments against recent suggestions that core features of linguistic systems are not designed for communication.

This is a must-read paper if you are interested in ambiguity and similar issues.

    At page 289, the authors report:

    These findings suggest that ambiguity is not enough of a problem to real-world communication that speakers would make much effort to avoid it. This may well be because actual language in context provides other information that resolves the ambiguities most of the time.

    I don’t know if our communication systems are efficient or not but I think the phrase “in context” is covering up a very important point.

    Our communication systems came about in very high-bandwidth circumstances. We were in the immediate presence of a person speaking. With all the context that provides.

Even if we accept an origin of language of, say, 200,000 years ago, written language, which provides the basis for communication without the presence of another person, emerges only in the last five or six thousand years. Just to keep it simple, 5 thousand years would be 2.5% of the entire history of language.

    So for 97.5% of the history of language, it has been used in a high bandwidth situation. No wonder it has yet to adapt to narrow bandwidth situations.

If writing puts us into a narrow-bandwidth situation where ambiguity is harder to resolve, where does that leave our computers?

    Simon says: Single Byte Norms are Dead!

    Filed under: Indexing,Lucene — Patrick Durusau @ 9:19 pm

    Simon says: Single Byte Norms are Dead!

    From the post:

    Apache Lucene turned 10 last year with a limitation that bugged many many users from day one. You may know Lucene’s core scoring model is based on TF/IDF (Vector Space Model). Lucene encapsulates all related calculations in a class called Similarity. Among pure TF/IDF factors Similarity also provides a norm value per document that is, by default a float value composed out of length normalization and field boost. Nothing special so far! However, this float value is encoded into a single byte and written down to the index. Lucene trades some precision lost for space on disk and eventually in memory since norms are loaded into memory per field upon first access. 

    In lots of cases this precision lost is a fair trade-off, but once you find yourself in a situation where you need to store more information based on statistics collected during indexing you end up writing your own auxiliary data structure or “fork” Lucene for your app and mess with the source. 

    The upcoming version of Lucene already added support for a lot more scoring models like:

    The abstractions added to Lucene to implement those models already opens the door for applications that either want to roll their own “awesome” scoring model or modify the low level scorer implementations. Yet, norms are still one byte!

    Don’t worry! The post has a happy ending!
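
To see why a single byte hurts, here is a rough illustration of the precision you give up when a float norm is squeezed into 256 values. The linear quantization below only mimics the general idea; it is not Lucene’s actual SmallFloat encoding.

```python
# Rough illustration of the precision lost when a float norm is stored in
# one byte (256 possible values). This mimics the idea only; it is not
# Lucene's actual encoding scheme.
import math

def encode(norm, lo=0.0, hi=1.0):
    """Quantize a float in [lo, hi] to a single byte."""
    return round((norm - lo) / (hi - lo) * 255)

def decode(byte, lo=0.0, hi=1.0):
    return lo + byte / 255 * (hi - lo)

for length in (5, 50, 500, 5000):
    norm = 1.0 / math.sqrt(length)      # typical length normalization
    restored = decode(encode(norm))
    print(f"doc length {length:5d}: norm {norm:.6f} -> byte "
          f"{encode(norm):3d} -> {restored:.6f}")
```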

    Read on if you want to be on the cutting edge of Lucene work.

    Thanks Lucene Team!

ISO 25964-1 Thesauri for information retrieval

    Filed under: Cloud Computing,Information Retrieval,ISO/IEC,JTC1,Standards,Thesaurus — Patrick Durusau @ 9:18 pm

Information and documentation – Thesauri and interoperability with other vocabularies – Part 1: Thesauri for information retrieval

    Actually that is the homepage for Networked Knowledge Organization Systems/Services – N K O S but the lead announcement item is for ISO 25964-1, etc.

    From that webpage:

    New international thesaurus standard published

ISO 25964-1 is the new international standard for thesauri, replacing ISO 2788 and ISO 5964. The full title is Information and documentation – Thesauri and interoperability with other vocabularies – Part 1: Thesauri for information retrieval. As well as covering monolingual and multilingual thesauri, it addresses 21st century needs for data sharing, networking and interoperability.

    Content includes:

    • construction of mono- and multi-lingual thesauri;
    • clarification of the distinction between terms and concepts, and their inter-relationships;
    • guidance on facet analysis and layout;
    • guidance on the use of thesauri in computerized and networked systems;
    • best practice for the management and maintenance of thesaurus development;
    • guidelines for thesaurus management software;
    • a data model for monolingual and multilingual thesauri;
    • brief recommendations for exchange formats and protocols.

An XML schema for data exchange has been derived from the data model, and is available free of charge at http://www.niso.org/schemas/iso25964/. ISO 25964-1 is the first of two publications; coming next, Part 2: Interoperability with other vocabularies is in the public review stage and will be available by the end of 2012.

    Find out how you can obtain a copy from the news release.

    Let me help you there, the correct number is: ISO 25964-1:2011 and the list price for a PDF copy is CHF 238,00, or in US currency (today), $257.66 (for 152 pages).

    Shows what I know about semantic interoperability.

If you want semantic interoperability, you charge people $1.69 per page (152 pages) for access to the principles of thesauri to be used for information retrieval.

    ISO/IEC and JTC 1 are all parts of a system of viable international (read non-vendor dominated) organizations for information/data standards. They are the natural homes for the management of data integration standards that transcend temporal, organizational, governmental and even national boundaries.

    But those roles will not fall to them by default. They must seize the initiative and those roles. Clinging to old-style publishing models for support makes them appear timid in the face of current challenges.

    Even vendors recognize their inability to create level playing fields for technology/information standards. And the benefits that come to vendors from de jure as well as non-de jure standards organizations.

    ISO/IEC/JTC1, provided they take the initiative, can provide an international, de jure home for standards that form the basis for information retrieval and integration.

    The first step to take is to make ISO/IEC/JTC1 information standards publicly available by default.

The second step is to call upon all members and beneficiaries, both direct and indirect, of ISO/IEC/JTC 1 work to assist in the creation of mechanisms to support the vital roles played by ISO/IEC/JTC 1 as de jure standards bodies.

We can all learn something from ISO 25964-1, but how many of us will at that sticker price?

    January 19, 2012

    Balisage: Where the Wild Things Are (Conference)

    Filed under: Conferences — Patrick Durusau @ 7:43 pm

    I had to add the (Conference) at the end because Wikipedia thinks there is a film with the title: Where the Wild Things Are. Maybe so, maybe so, but in any event, Balisage is an event!

    From the “official” call for participation:

    Each year, Balisage gathers together an eclectic mix of participants interested in markup and puts them together in one of the world’s great cities for three and half days of discussion about points of interest in the use of descriptive markup to build strong, lasting information systems. Practitioners and theorists, vendors and users, tool-users and tool-makers, all provide their perspectives at Balisage.

    Nominations for paper proposals and peer reviewers are solicited.

    As always, papers at Balisage can address any aspect of the use of markup and markup languages to represent information and build information systems. Possible topics include but are not limited to:

    • XML and related technologies
    • Non-XML markup languages
    • Implementation experience with XML parsing, XSLT processors, XQuery processors, XML databases, Topic Map engines, XProc integrations, or any markup-related technology
    • Semantics, overlap, and other complex fundamental issues for markup languages
    • Case studies of markup design and deployment
    • Quality of information in markup systems
    • JSON and XML
    • Efficiency of Markup Software
    • Markup systems in and for the mobile web
    • The future of XML and of descriptive markup in general
    • Interesting applications of markup

    Gee, I remember having a lot more fun than that!

    There was the year that we all did the Wilbur Mills tidal pool thing in the Europa lobby. Yeah, all of us at one time.

    And the year of the big head, sorry, hat competition. Liam was declared to be the “Cat in the Hat.”

    Did you know many drugs are available without a prescription in Montreal? 😉

    You get a free pass to walk/watch St. Catherine Street. (I would help pay for a webcam there if you are interested.)

Oh, and there are presentations with really smart people talking about computer stuff. SGML/XML/overlap/DOM/SAX/XSLT/XPath/XQuery and a bunch of other funny letters strung together. Reminds me of a Numb3rs episode but without the FBI, at least openly.

    Go, submit a paper, be a peer reviewer, you won’t regret it!

    Beepl Launches A Twitter-Simple, “Social Q&A Site”

    Filed under: Search Engines — Patrick Durusau @ 7:42 pm

    Beepl Launches A Twitter-Simple, “Social Q&A Site” by Kit Eaton.

    From the post:

    People, meet Beepl. It launched to the general public yesterday in the online expertise-sharing/question-and-answer sphere after a short private test run. Branding itself as a “social Q&A site” that “lets users seek answers and opinion from subject specialists, enthusiasts and their social graph,” Beepl also “understands the topics that questions relate to and users’ interests and expertise so that questions automatically reach the best people to reach them.” That bit of lateral thinking differentiates Beepl in a pretty bustling market, but it’s only one of the novel surprises from the company (starting with the lack of a launch press release).

    There was this exchange with the founder, Steve O’Hear:

    ….How can you trust that it’ll connect you to something interesting to you, or perhaps something you have vital insight into for others? Does it mean you may miss out on fringe questions about things you never knew about, but may be fascinated by?

    Beepl addresses this, O’Hear says, because the “most aggressive part is for people that are actively using the site. It looks at questions you’ve clicked on, any you’ve answered, any you’ve asked. It even takes a tiny amount from if you do a search on the site.”

    I guess if you think politicians really answer each other in debates you could consider that to be a response. 😉 Well, from a dialogue standpoint it was a response but it wasn’t a very helpful one.

    From a topic map standpoint, how would you go about mapping the stream of questions and answers? Clues you would look for? Not quite as short as tweets. Enough for context?

    All Your HBase Are Belong to Clojure

    Filed under: Clojure,Hadoop,HBase — Patrick Durusau @ 7:41 pm

    All Your HBase Are Belong to Clojure by

    I’m sure you’ve heard a variation on this story before…

    So I have this web crawler and it generates these super-detailed log files, which is great ‘cause then we know what it’s doing but it’s also kind of bad ‘cause when someone wants to know why the crawler did this thing but not that thing I have, like, literally gajigabytes of log files and I’m using grep and awk and, well, it’s not working out. Plus what we really want is a nice web application the client can use.

    I’ve never really had a good solution for this. One time I crammed this data into a big Lucene index and slapped a web interface on it. One time I turned the data into JSON and pushed it into CouchDB and slapped a web interface on that. Neither solution left me with a great feeling although both worked okay at the time.

    This time I already had a Hadoop cluster up and running, I didn’t have any experience with HBase but it looked interesting. After hunting around the internet, thought this might be the solution I had been seeking. Indeed, loading the data into HBase was fairly straightforward and HBase has been very responsive. I mean, very responsive now that I’ve structured my data in such a way that HBase can be responsive.

    And that’s the thing: if you are loading literally gajigabytes of data into HBase you need to be pretty sure that it’s going to be able to answer your questions in a reasonable amount of time. Simply cramming it in there probably won’t work (indeed, that approach probably won’t work great for anything). I loaded and re-loaded a test set of twenty thousand rows until I had something that worked.

    An excellent tutorial on Hadoop, HBase and Clojure!

First seen at myNoSQL but the URL is no longer working in my Google Reader.

    Designing Search (part 1): Entering the query

    Filed under: HCIR,Search Behavior,Search Interface,Searching — Patrick Durusau @ 7:40 pm

    Designing Search (part 1): Entering the query by Tony Russell-Rose.

    From the post:

    In an earlier post we reviewed models of information seeking, from an early focus on documents and queries through to a more nuanced understanding of search as an information journey driven by dynamic information needs. While each model emphasizes different aspects of the search process, what they share is the principle that search begins with an information need which is articulated in some form of query. What follows below is the first in a mini-series of articles exploring the process of query formulation, starting with the most ubiquitous of design elements: the search box.

    If you are designing or using search interfaces, you will benefit from reading this post.

    Suggestion: Don’t jump to the summary and best practices. Tony’s analysis is just as informative as the conclusions he reaches.

    csvkit 0.4.2 (beta)

    Filed under: CSV — Patrick Durusau @ 7:40 pm

    csvkit 0.4.2 (beta)

    From the webpage:

    csvkit is a suite of utilities for converting to and working with CSV, the king of tabular file formats.

    It is inspired by pdftk, gdal and the original csvcut utility by Joe Germuska and Aaron Bycoffe.

    Important links:

    Something for the toolbox that may prove to be useful.

    Comprehensions

    Filed under: Functional Programming — Patrick Durusau @ 7:38 pm

    Comprehensions

    From the post:

    Prompted by some recent work I’ve been doing on reasoning about monadic computations, I’ve been looking back at the work from the 1990s by Phil Trinder, Limsoon Wong, Leonidas Fegaras, Torsten Grust, and others, on monad comprehensions as a framework for database queries.

    The idea goes back to the adjunction between extension and intension in set theory—you can define a set by its extension, that is by listing its elements:

\{ 1, 9, 25, 49, 81 \}

    or by its intension, that is by characterizing those elements:

\{ n^2 \mid 0 < n < 10 \land n \equiv 1 \pmod{2} \}

    Expressions in the latter form are called set comprehensions. They inspired a programming notation in the SETL language from NYU, and have become widely known through list comprehensions in languages like Haskell. The structure needed of sets or of lists to make this work is roughly that of a monad, and Phil Wadler showed how to generalize comprehensions to arbitrary monads, which led to the “do” notation in Haskell. Around the same time, Phil Trinder showed that comprehensions make a convenient database query language. The comprehension notation has been extended to cover other important aspects of database queries, particularly aggregation and grouping. Monads and aggregations have very nice algebraic structure, which leads to a useful body of laws to support database query optimization.

If you are interested in functional programming and/or “database query optimization” you will enjoy this post.
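
The intensional set in the quoted post translates almost verbatim into a Python comprehension, and the same query can be phrased in the aggregation style the post goes on to discuss:

```python
# The intensional set from the quoted post, written as a Python comprehension.
squares = [n ** 2 for n in range(1, 10) if n % 2 == 1]
print(squares)   # [1, 9, 25, 49, 81]

# The same query in a fold/aggregation style, the kind of operation the post
# notes must be added to comprehensions for database queries.
from functools import reduce
total = reduce(lambda acc, n: acc + n ** 2,
               (n for n in range(1, 10) if n % 2 == 1), 0)
print(total)     # 165, the sum of the squares: a simple aggregation
```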

    NYT uses R to map the 1%

    Filed under: R,Visualization — Patrick Durusau @ 7:36 pm

    NYT uses R to map the 1%

    From the post:

    Last Saturday, the New York Times published a feature article on the wealthiest 1% of Americans. The on-line version of the article included interactive features like this interactive map showing where your household ranks in the country and in local regions. The print edition, however, included some different (and necessarily static) representations of US wealth data, such as this map of where the wealthiest 1% live:

    Interesting for a couple of reasons:

    First, the ease with which the initial mapping was done using R. Meaning that you can quickly map/experiment with mappings using R.

Second, consider that R was the vehicle to map the data onto the map. It could have just as easily mapped campaign finance records along with the top 1% data. What subjects you decide to map together is entirely up to you.
