September « 2011 « Another Word For It

September 27, 2011

k-means Approach to the Karhunen-Loéve Transform (aka PCA – Principal Component Analysis)

Filed under: Algorithms,Principal Component Analysis (PCA) — Patrick Durusau @ 6:52 pm

k-means Approach to the Karhunen-Loeve Transform by Krzysztof Misztal, Przemyslaw Spurek, and Jacek Tabor.

Abstract:

We present a simultaneous generalization of the well-known Karhunen-Loeve (PCA) and k-means algorithms. The basic idea lies in approximating the data with k affine subspaces of a given dimension n. In the case n=0 we obtain the classical k-means, while for k=1 we obtain PCA algorithm.

We show that for some data exploration problems this method gives better result then either of the classical approaches.
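To make the "k affine subspaces of dimension n" idea concrete, here is a minimal sketch of the assignment step only: each point goes to the nearest affine subspace. It is not the authors' implementation. The function names, the sample points, and the two hand-picked subspaces are all assumptions for illustration, and the update step (refitting each subspace to its cluster) is left out.

```python
def sq_dist_to_subspace(point, center, basis):
    """Squared distance from point to the affine subspace center + span(basis).

    basis is a list of orthonormal vectors. An empty list (n = 0) shrinks the
    subspace to the single point `center`, which is the k-means case.
    """
    diff = [p - c for p, c in zip(point, center)]
    sq = sum(d * d for d in diff)
    for v in basis:
        proj = sum(d * vi for d, vi in zip(diff, v))
        sq -= proj * proj  # remove the component lying inside the subspace
    return max(sq, 0.0)


def assign(points, subspaces):
    """Return, for every point, the index of the nearest subspace."""
    return [
        min(range(len(subspaces)),
            key=lambda j: sq_dist_to_subspace(p, *subspaces[j]))
        for p in points
    ]


# Hypothetical data: two points in the plane.
points = [(3.0, 0.1), (5.2, 7.0)]

# n = 1: a horizontal line through (0, 0) and a vertical line through (5, 0).
lines = [((0.0, 0.0), [(1.0, 0.0)]), ((5.0, 0.0), [(0.0, 1.0)])]

# n = 0: the same centers with empty bases, i.e. plain k-means centroids.
centroids = [((0.0, 0.0), []), ((5.0, 0.0), [])]

by_lines = assign(points, lines)
by_centroids = assign(points, centroids)
```

With these made-up inputs the first point sits almost on the horizontal line but is closer to the second centroid, so the two settings assign it differently.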

I know, it is a very forbidding title but once you look at the paper you will be glad you did.

First, the authors begin with a graphic illustration of the goal of their technique (no, you have to look at the paper to see it), which even the most “lay” reader can appreciate.

Second, the need for topic maps strikes again as in the third paragraph we learn: “…Karhunen-Loéve transform (called also PCA – Principle Component Analysis)….”

Third, some of the uses of this technique:

data mining – we can detect important coordinates and subsets with similar properties;
clustering – our modification of k-means can detect different, high dimensional relation in data;
image compression and image segmentation;
pattern recognition – thanks to detection of relation in data we can use it to assign data to defined before classes.

A sample implementation is available at: http://www.ii.uj.edu.pl/~misztalk.

Linked Data Semantic Issues (same for topic maps?)

Filed under: Linked Data,LOD,Marketing,Merging,Topic Maps — Patrick Durusau @ 6:51 pm

Sebastian Schaffert posted a message on the public-lod@w3.org list that raised several issues about Linked Data. Those issues sound relevant to topic maps. See what you think.

From the post:

We are working together with many IT companies (with excellent software developers) and trying to convince them that Semantic Web technologies are superior for information integration. They are already overwhelmed when they have to understand that a database ID for an object is not enough. If they have to start distinguishing between the data object and the real world entity the object might be representing, they will be lost completely.

I guess being told that a “real world entity” may have different ways to be identified must seem to be the road to perdition.

Curious because the “real world” is a messy place. Or is that the problem? That the world of developers is artificially “clean,” at least as far as identification and reference.

Perhaps CS programs need to train developers for encounters with the messy “real world.”

From the post:

> When you dereference the URL for a person (such as …/561666514#), you get back RDF. Our _expectation_, of course, is that that RDF will include some remarks about that person (…/561666514#), but there can be no guarantee of this, and no guarantee that it won’t include more information than you asked for. All you can reliably expect is that _something_ will come back, which the service believes to be true and hopes will be useful. You add this to your knowledge of the world, and move on.

There I have my main problem. If I ask for “A”, I am not really interested in “B”. What our client implementation therefore does is to throw away everything that is about B and only keeps data about A. Which is – in case of the FB data – nothing. The reason why we do this is that often you will get back a large amount of irrelevant (to us) data even if you only requested information about a specific resource. I am not interested in the 999 other resources the service might also want to offer information about, I am only interested in the data I asked for. Also, you need to have some kind of “handle” on how to start working with the data you get back, like:
1. I ask for information about A, and the server gives me back what it knows about A (there, my expectation again …)
2. From the data I get, I specifically ask for some common properties, like A foaf:name ?N and do something with the bindings of N. Now how would I know how to even formulate the query if I ask for A but get back B?

Ouch! That one cuts a little close. 😉

What about the folks who are “…not really interested in ‘B’”?

How do topic maps serve their interests?

Or have we decided for them that more information about a subject is better?

Or is that a matter of topic map design? What information to include?

That “merging” and what gets “merged” is a user/client decision?

That is how it works in practice simply due to time, resources, and other constraints.

Marketing questions:

How to discover data users would like to have appear with other data, prior to having a contract to do so?

Can we re-purpose search logs for that?

Tying up the loose ends in fully LZW-compressed pattern matching

Filed under: Pattern Compression,Pattern Matching — Patrick Durusau @ 6:51 pm

Tying up the loose ends in fully LZW-compressed pattern matching by Pawel Gawrychowski.

Abstract:

We consider a natural generalization of the classical pattern matching problem: given compressed representations of a pattern p[1..M] and a text t[1..N] of sizes m and n, respectively, does p occur in t? We develop an optimal linear time solution for the case when both p and t are compressed using the LZW method. This improves the previously known O((n+m)log(n+m)) time solution of Gasieniec and Rytter, and essentially closes the line of research devoted to studying LZW-compressed exact pattern matching.
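For readers who have not met LZW, here is a small sketch of the compression scheme together with the naive baseline: decompress both inputs, then search. This is not the paper's algorithm. It takes time proportional to the uncompressed lengths N and M, which is exactly the cost that fully compressed matching avoids. The function names are mine, and the code assumes every character has a code point below 256.

```python
def lzw_compress(text):
    """Return the list of LZW codes for text (characters must be < 256)."""
    table = {chr(i): i for i in range(256)}
    phrase, codes = "", []
    for ch in text:
        if phrase + ch in table:
            phrase += ch                       # keep extending the phrase
        else:
            codes.append(table[phrase])        # emit the longest known phrase
            table[phrase + ch] = len(table)    # learn the new phrase
            phrase = ch
    if phrase:
        codes.append(table[phrase])
    return codes


def lzw_decompress(codes):
    """Invert lzw_compress."""
    if not codes:
        return ""
    table = {i: chr(i) for i in range(256)}
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            entry = prev + prev[0]             # code defined by this very step
        else:
            raise ValueError("invalid LZW code: %d" % code)
        out.append(entry)
        table[len(table)] = prev + entry[0]
        prev = entry
    return "".join(out)


def occurs_after_decompression(pattern_codes, text_codes):
    """Naive baseline: expand both inputs, then do an ordinary search."""
    return lzw_decompress(pattern_codes) in lzw_decompress(text_codes)
```

For example, `occurs_after_decompression(lzw_compress("bab"), lzw_compress("abababab"))` answers the question in the abstract, just not within the linear-in-compressed-size bound the paper achieves.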

I don’t know of any topic maps large enough that they, and queries against them, need compressed queries against compressed data, but this paper outlines “an optimal linear time solution … for fully LZW-compressed pattern matching.” (page 2)

I suspect it may be more relevant to data mining prior to the construction of a topic map. But in either case, when needed it will be a welcome solution.

Learning Discriminative Metrics via Generative Models and Kernel Learning

Filed under: Kernel Methods,Machine Learning — Patrick Durusau @ 6:50 pm

Learning Discriminative Metrics via Generative Models and Kernel Learning by Yuan Shi, Yung-Kyun Noh, Fei Sha, and Daniel D. Lee.

Abstract:

Metrics specifying distances between data points can be learned in a discriminative manner or from generative models. In this paper, we show how to unify generative and discriminative learning of metrics via a kernel learning framework. Specifically, we learn local metrics optimized from parametric generative models. These are then used as base kernels to construct a global kernel that minimizes a discriminative training criterion. We consider both linear and nonlinear combinations of local metric kernels. Our empirical results show that these combinations significantly improve performance on classification tasks. The proposed learning algorithm is also very efficient, achieving order of magnitude speedup in training time compared to previous discriminative baseline methods.

Combination of machine learning techniques within a framework.

It may be some bias in my reading patterns, but I don’t recall any explicit combination of human and machine learning techniques. I don’t take analysis of search logs to be an explicit human contribution, since the analysis is guessing as to why a particular link and not another was chosen. I suppose time on the resource chosen might be an indication, but a search log per se isn’t going to give that level of detail.

For that level of detail you would need browsing history. Would be interesting to see if a research library or perhaps employer (fewer “consent” issues) would permit browsing history collection over some long period of time, say 3 to 6 months. So that not only is the search log captured but the entire browsing history.

Hard to say if that would result in enough increased accuracy on search results to be worth the trouble.

An interesting paper about combining purely machine learning techniques, and one that promises significant gains. What these plus human learning would produce remains a subject for future research papers.

Production and Network Formation Games with Content Heterogeneity

Filed under: Games,Group Theory,Networks — Patrick Durusau @ 6:49 pm

Production and Network Formation Games with Content Heterogeneity by Yu Zhang, Jaeok Park, and Mihaela van der Schaar.

Abstract:

Online social networks (e.g. Facebook, Twitter, Youtube) provide a popular, cost-effective and scalable framework for sharing user-generated contents. This paper addresses the intrinsic incentive problems residing in social networks using a game-theoretic model where individual users selfishly trade off the costs of forming links (i.e. whom they interact with) and producing contents personally against the potential rewards from doing so. Departing from the assumption that contents produced by difference users is perfectly substitutable, we explicitly consider heterogeneity in user-generated contents and study how it influences users’ behavior and the structure of social networks. Given content heterogeneity, we rigorously prove that when the population of a social network is sufficiently large, every (strict) non-cooperative equilibrium should consist of either a symmetric network topology where each user produces the same amount of content and has the same degree, or a two-level hierarchical topology with all users belonging to either of the two types: influencers who produce large amounts of contents and subscribers who produce small amounts of contents and get most of their contents from influencers. Meanwhile, the law of the few disappears in such networks. Moreover, we prove that the social optimum is always achieved by networks with symmetric topologies, where the sum of users’ utilities is maximized. To provide users with incentives for producing and mutually sharing the socially optimal amount of contents, a pricing scheme is proposed, with which we show that the social optimum can be achieved as a non-cooperative equilibrium with the pricing of content acquisition and link formation.

The “content heterogeneity” caught my eye but after reading the abstract, this appears relevant to topic maps for another reason.

One of the projects I hear discussed from time to time is a “public” topic map that encourages users to interact in a social context and to add content to the topic map. Group dynamics and the study of the same seem directly relevant to such “public” topic maps.

Interesting paper, but I am not altogether sure about the “social optimum” as outlined in the paper. Not that I find it objectionable, but “social optimums” are more a matter of social practice than of engineering.

LucidWorks 2.0, the search platform for Apache Solr/Lucene (stolen post)

Filed under: Lucene,Search Engines,Solr — Patrick Durusau @ 6:48 pm

LucidWorks 2.0, the search platform for Apache Solr/Lucene by David M. Fishman.

Apologies to David because I stole his entire post, with links to the Lucid site. Could not figure out what to leave out so I included it all.

If you’re a search application developer or architect, if you’ve got big data on your hands or on the brain, or if you’ve got big plans for Apache Lucene/Solr, this announcement is for you.

Today marks the 2.0 release of LucidWorks, the search platform that accelerates and simplifies development of highly accurate, scalable, and cost-effective search applications. We’ve bottled the best of Apache Lucene/Solr, including key innovations from the 4.x branch, in a commercial-grade package that’s designed for the rigors of production search application deployment.

Killer search applications are popping up everywhere, and it’s no surprise. On the one hand, big data technologies disrupting old barriers of speed, structure, cost and addressability of data storage; on the other, the new frontier of query-driven analytics is shifting from old-school reporting to instant, unlimited reach into mixed data structures, driven by users. (There are places these converge: 7 years of data in Facebook combine content with user context, creating a whole new way to look at life as we know it on line.)

Or, to put it a little less breathlessly: Search is now the UI for Big Data. LucidWorks 2.0 is the only distribution of Apache Solr/Lucene that lets you:

Build killer business-critical search apps more quickly and easily

Streamline search setup and optimization for more reliable operations

Access big data and enterprise content faster and more securely

Scale to billions without spending millions

If you surf through our website, you’ll find info on features and benefits, screenshots, a detailed technical overview, and access to product documentation. But that’s all talk. Download LucidWorks Enterprise 2.0, or apply for a spot in the Private Beta for LucidWorks Cloud, and take it for a spin.

They say imitation is the sincerest form of flattery. Maybe that will make David feel better!

Seriously, this is an important milestone, both for today and for what is yet to come in the search arena.

Faceted Search using Solr – what it is and what benefits does it provide..?

Filed under: Education,Facets,Solr — Patrick Durusau @ 6:47 pm

Faceted Search using Solr – what it is and what benefits does it provide..? by James Spencer (eduserv blog).

From the post:

What is Faceted Search?

Faceted search is a more advanced searching technology that enables the end user to structure their search and ultimately drill down using categories to find the end result they are looking for via the site search. Rather than relying on simple keyword searching, faceted searching allows a user to perform a keyword search but then filter content by pre-defined categories and filtering criteria.

Faceted searching also enables you to gain advanced functionality like suggested search terms, auto completion on search terms and have associated links to content. This provides users with quicker, more flexible, dynamic and accurate search results.
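The "keyword search, then filter by pre-defined categories" flow described above can be sketched in a few lines. This is an in-memory illustration only, not the Solr API. The documents, field names, and function names are invented for the example.

```python
from collections import Counter


def search(docs, keyword, filters=None):
    """Keyword match on the title, then narrow by exact facet values."""
    filters = filters or {}
    hits = [d for d in docs if keyword.lower() in d["title"].lower()]
    return [
        d for d in hits
        if all(d.get(field) == value for field, value in filters.items())
    ]


def facet_counts(hits, fields):
    """Count how many hits fall under each value of each facet field."""
    return {f: Counter(d[f] for d in hits if f in d) for f in fields}


# Hypothetical documents with two facet fields, "type" and "year".
DOCS = [
    {"title": "Solr faceting basics", "type": "guide", "year": 2011},
    {"title": "Solr schema design", "type": "guide", "year": 2010},
    {"title": "Solr release notes", "type": "news", "year": 2011},
    {"title": "Lucene internals", "type": "guide", "year": 2011},
]

hits = search(DOCS, "solr")                          # plain keyword search
counts = facet_counts(hits, ["type", "year"])        # what the sidebar shows
narrowed = search(DOCS, "solr", {"type": "guide"})   # user clicks a facet
```

The counts are what a faceted interface displays next to each category, and clicking a category simply reruns the search with one more filter.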

The post goes on to list the benefits of faceted searching in a very accessible way, explains Solr, uses of Solr by the Department of Education (US), and gives additional examples of faceted searching.

Very high marks for presenting the material at a web developer/advanced user level. Hard to judge that consistently but this post comes as close as any I have seen recently.

The Node Beginner Book

Filed under: Javascript,node-js — Patrick Durusau @ 6:46 pm

The Node Beginner Book by Manuel Kiessling.

I ran across this the other day when I was posting Node.js at Scale and just forgot to post it.

Nice introduction to Node.js.

The ubiquity of small-world networks

Filed under: Clustering,Graphs,Networks — Patrick Durusau @ 6:46 pm

The ubiquity of small-world networks by Qawi K. Telesford, Karen E. Joyce, Satoru Hayasaka, Jonathan H. Burdette, and Paul J. Laurienti.

Abstract:

Small-world networks by Watts and Strogatz are a class of networks that are highly clustered, like regular lattices, yet have small characteristic path lengths, like random graphs. These characteristics result in networks with unique properties of regional specialization with efficient information transfer. Social networks are intuitive examples of this organization with cliques or clusters of friends being interconnected, but each person is really only 5-6 people away from anyone else. While this qualitative definition has prevailed in network science theory, in application, the standard quantitative application is to compare path length (a surrogate measure of distributed processing) and clustering (a surrogate measure of regional specialization) to an equivalent random network. It is demonstrated here that comparing network clustering to that of a random network can result in aberrant findings and networks once thought to exhibit small-world properties may not. We propose a new small-world metric, ω (omega), which compares network clustering to an equivalent lattice network and path length to a random network, as Watts and Strogatz originally described. Example networks are presented that would be interpreted as small-world when clustering is compared to a random network but are not small-world according to ω. These findings have significant implications in network science as small-world networks have unique topological properties, and it is critical to accurately distinguish them from networks without simultaneous high clustering and low path length.
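The abstract describes omega only in words. As I understand the paper, it is usually written as omega = L_rand / L - C / C_latt, where L and C are the path length and clustering of the network under study, L_rand is the path length of an equivalent random network, and C_latt is the clustering of an equivalent lattice. The sketch below computes the two ingredients for a small ring lattice and plugs them into that formula. The helper names are mine, and the random-network path length in the example is a made-up placeholder, not a measured value.

```python
from collections import deque


def ring_lattice(n, k):
    """Adjacency sets for n nodes on a ring, each linked to k/2 neighbours per side."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


def average_clustering(adj):
    """Mean fraction of each node's neighbour pairs that are themselves linked."""
    total = 0.0
    for nbrs in adj.values():
        d = len(nbrs)
        if d < 2:
            continue
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        total += 2.0 * links / (d * (d - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over reachable pairs, by breadth-first search."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def omega(path_len, clustering, path_len_random, clustering_lattice):
    """Near 0: small-world. Toward -1: lattice-like. Toward +1: random-like."""
    return path_len_random / path_len - clustering / clustering_lattice


lattice = ring_lattice(12, 4)
C = average_clustering(lattice)
L = average_path_length(lattice)

# Hypothetical value: path length of an "equivalent" random network is assumed.
L_RAND_PLACEHOLDER = 1.7
w = omega(L, C, L_RAND_PLACEHOLDER, C)  # the lattice is its own lattice reference
```

Because the lattice is compared against itself for clustering, the second term is 1 and omega comes out negative, which is the lattice-like end of the scale.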

What sort of network is your topic map?

I wonder if classes of topic maps will emerge, some of which are small-world networks and others not. I ask because knowing the conditions/requirements that lead to one type or the other would be another tool for designing topic maps for particular purposes.

A Faster LZ77-Based Index

Filed under: Bioinformatics,Biomedical,Indexing — Patrick Durusau @ 7:19 am

A Faster LZ77-Based Index by Travis Gagie and Pawel Gawrychowski.

Abstract:

Suppose we are given an AVL-grammar with $r$ rules for a string $S[1..n]$ whose LZ77 parse consists of $z$ phrases. Then we can add $O(z \log \log z)$ words and obtain a compressed self-index for $S$ such that, given a pattern $P[1..m]$, we can list the occurrences of $P$ in $S$ in $O(m^2 + (m + \mathrm{occ}) \log \log n)$ time.

Not the best abstract I have ever read, at least in terms of attracting its most likely audience.

I would have started with: “Indexing of genomes, which are 99.9% the same, can be improved in terms of searching, response times and reporting of secondary occurrences.” Then follow with the technical description of the contribution. Don’t make people work for a reason to read the paper.

Any advancement in indexing, but particularly in an area like genomics, is important to topic maps.

Update: See the updated version of this paper: A Faster Grammar-Based Self-Index.

September 26, 2011

Lucene and Solr’s CheckIndex to the Rescue!

Filed under: Indexing,Lucene,Solr — Patrick Durusau @ 7:03 pm

Lucene and Solr’s CheckIndex to the Rescue! by Rafał Kuć.

From the post:

While using Lucene and Solr we are used to a very high reliability. However, there may come a day when Solr will inform us that our index is corrupted, and we need to do something about it. Is the only way to repair the index to restore it from the backup or do full indexation? No – there is hope in the form of CheckIndex tool.

What is CheckIndex ?

CheckIndex is a tool available in the Lucene library, which allows you to check the files and create new segments that do not contain problematic entries. This means that this tool, with little loss of data is able to repair a broken index, and thus save us from having to restore the index from the backup (of course if we have it) or do the full indexing of all documents that were stored in Solr.

The question about when the last backup was run at the end of the article isn’t meant to be funny.

When I was training to be a NetWare sysadmin, more than a little while ago, one of the manuals advised that the #1 reason for sysadmins being fired was failure to maintain proper backups. I suspect that is probably still the case. Or at least I hope it is. There really is no excuse for failing to maintain proper backups.

Index external websites with Apache Nutch

Filed under: Indexing,Solr — Patrick Durusau @ 7:02 pm

Index external websites with Apache Nutch by Stefan Sprenger.

Walks through using Apache Nutch with Solr.

Comes at an opportune time because I have a data set (URIs) that I want to explore using a variety of methods, no one of which will be useful for all use cases.

If you need a mapping metaphor, think of it as setting off into unexplored territory and the map (read tool) I am using changes the landscape I will have to explore.

I probably won’t post the first installment right away, but expect it either late this week or early next.

Building Distributed Indexing for Solr: MurmurHash3 for Java

Filed under: Indexing,Java,Solr — Patrick Durusau @ 7:01 pm

Building Distributed Indexing for Solr: MurmurHash3 for Java by Yonik Seeley.

From the post:

Background

I needed a really good hash function for the distributed indexing we’re implementing for Solr. Since it will be used for partitioning documents, it needed to be really high quality (well distributed) since we don’t want uneven shards. It also needs to be cross-platform, so a client could calculate this hash value themselves if desired, to predict which node has a given document.

MurmurHash3

MurmurHash3 is one of the top favorite new hash function these days, being both really fast and of high quality. Unfortunately it’s written in C++, and a quick google did not yield any suitable high quality port. So I took 15 minutes (it’s small!) to port the 32 bit version, since it should be faster than the other versions for small keys like document ids. It works in 32 bit chunks and produces a 32 bit hash – more than enough for partitioning documents by hash code.
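As a rough illustration of what a port involves, here is a Python sketch of the 32-bit MurmurHash3 variant (not Yonik's Java code), followed by a hypothetical `shard_for` helper that maps a document id to a shard by taking the hash modulo the shard count. The modulo scheme is an assumption for the example. The excerpt does not say how Solr turns the hash into a partition.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data, seed=0):
    """MurmurHash3 x86 32-bit over a bytes object; returns an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n_blocks = len(data) // 4

    # Body: mix each 4-byte little-endian block into the state.
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def shard_for(doc_id, num_shards):
    """Hypothetical partitioner: hash the UTF-8 id, take it modulo the shard count."""
    return murmur3_32(doc_id.encode("utf-8")) % num_shards
```

Because the function depends only on the bytes and the seed, a client in any language can run the same computation and predict which node holds a given document, which is the cross-platform property the post asks for.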

Something for your Solr friends.

Node.js at Scale

Filed under: Javascript,node-js — Patrick Durusau @ 7:01 pm

Node.js at Scale by Tom Hughes-Croucher (Joyent).

ABSTRACT

When we talk about performance what do we mean? There are many metrics that matter in different scenarios but it’s difficult to measure them all. Tom Hughes-Croucher looks at what performance is achievable with Node today, which metrics matter and how to pick the ones that most matter to you. Most importantly he looks at why metrics don’t matter as much as you think and the critical decision making involved in picking a programming language, a framework, or even just the way you write code.

BIOGRAPHY

Tom Hughes-Croucher is the Chief Evangelist at Joyent, sponsors of the Node.js project. Tom mostly spends his days helping companies build really exciting projects with Node and seeing just how far it will scale. Tom is also the author of the O’Reilly book “Up and running with Node.js”. Tom has worked for many well known organizations including Yahoo, NASA and Tesco.

I thought the discussion of metrics was going to be the best part. It is worth your time but I stayed around for the node.js demonstration and it was impressive!

> 100 New KDD Models/Methods Appear Every Month

Filed under: Astroinformatics,Data Mining,KDD,Knowledge Discovery — Patrick Durusau @ 7:00 pm

Got your attention? It certainly got mine when I read:

Make an inventory of existing methods relevant for astrophysical applications (more than 100 new KDD models and methods appear every month on specialized journals).

A line from the charter of the KDD-IG (Knowledge Discovery and Data Mining-Interest Group) of IVOA (International Virtual Observatory Alliance).

See: IVOA Knowledge Discovery in Databases

I checked the “A census of Data Mining and Machine Learning methods for astronomy” wiki page, but it had no takers, much less any content.

I have written to Professor Giuseppe Longo of University Federico II in Napoli, the chair of this activity, to inquire about opportunities to participate in the KDD census. I will post an updated entry when I have more information.

Separate and apart from the census, over 1,200 new KDD models/methods a year is an impressive number. I don’t think a census will make that slow down. If anything, greater knowledge of other efforts may spur the creation of even more new models/methods.

DAta Mining & Exploration (DAME)

Filed under: Astroinformatics,Data Mining,Machine Learning — Patrick Durusau @ 7:00 pm

DAta Mining & Exploration (DAME)

From the website:

What is DAME

Nowadays, many scientific areas share the same need of being able to deal with massive and distributed datasets and to perform on them complex knowledge extraction tasks. This simple consideration is behind the international efforts to build virtual organizations such as, for instance, the Virtual Observatory (VObs). DAME (DAta Mining & Exploration) is an innovative, general purpose, Web-based, distributed data mining infrastructure specialized in Massive Data Sets exploration with machine learning methods.

Initially fine tuned to deal with astronomical data only, DAME has evolved in a general purpose platform program, hosting a cloud of applications and services useful also in other domains of human endeavor.

DAME is an evolving platform and new services as well as additional features are continuously added. The modular architecture of DAME can also be exploited to build applications, finely tuned to specific needs.

Follow DAME on YouTube

The project represents what is commonly considered an important element of e-science: a stronger multi-disciplinary approach based on the mutual interaction and interoperability between different scientific and technological fields (nowadays defined as X-Informatics, such as Astro-Informatics). Such an approach may have significant implications in the Knowledge Discovery in Databases process, where even near-term developments in the computing infrastructure which links data, knowledge and scientists will lead to a transformation of the scientific communication paradigm and will improve the discovery scenario in all sciences.

So far there is only one video at YouTube and it could lose the background music with no ill-effect.

The lessons learned (or applied) here should be applicable to other situations with very large data sets, say from satellites orbiting the Earth?

VOGCLUSTERS: an example of DAME web application

Filed under: Astroinformatics,Data Integration,Marketing — Patrick Durusau @ 6:59 pm

VOGCLUSTERS: an example of DAME web application by Marco Castellani, Massimo Brescia, Ettore Mancini, Luca Pellecchia, and Giuseppe Longo.

Abstract:

We present the alpha release of the VOGCLUSTERS web application, specialized for data and text mining on globular clusters. It is one of the web2.0 technology based services of Data Mining & Exploration (DAME) Program, devoted to mine and explore heterogeneous information related to globular clusters data.

VOGCLUSTERS (The alpha website.)

From the webpage:

This page is the entry point to the VOGCLUSTERS Web Application (alpha release) specialized for data and text mining on globular clusters. It is a toolset of DAME Program to manage and explore GC data in various formats.

In this page the users can obtain news, documentation and technical support about the web application.

The goal of the project VOGCLUSTERS is the design and development of a web application specialized in the data and text mining activities for astronomical archives related to globular clusters. Main services are employed for the simple and quick navigation in the archives (uniformed under VO standards and constraints) and their manipulation to correlate and integrate internal scientific information. The project has not to be intended as a straightforward website for the globular clusters, but as a web application. A website usually refers to the front-end interface through which the public interact with your information online. Websites are typically informational in nature with a limited amount of advanced functionality. Simple websites consist primarily of static content where the data displayed is the same for every visitor and content changes are infrequent. More advanced websites may have management and interactive content. A web application, or equivalently Rich Internet Application (RIA) usually includes a website component but features additional advanced functionality to replace or enhance existing processes. The interface design objective behind a web application is to simulate the intuitive, immediate interaction a user experiences with a desktop application.

Note the use of DAME as a foundation to “…manage and explore GC data in various formats.”

Just in case you are unaware, astronomy/radio astronomy, along with High Energy Physics (HEP), were the original big data.

If you have an interest in astronomy, this would be a good project to follow and perhaps to suggest topic map techniques.

Effective marketing of topic maps requires more than writing papers and hoping that someone reads them. Invest your time and effort into a project, then suggest (appropriately) the use of topic maps. You and your proposal will have more credibility that way.

Ergodic Control and Polyhedral approaches to PageRank Optimization

Filed under: PageRank,Search Algorithms,Search Engines,Searching — Patrick Durusau @ 6:58 pm

Ergodic Control and Polyhedral approaches to PageRank Optimization by Olivier Fercoq, Marianne Akian, Mustapha Bouhtou, Stéphane Gaubert (Submitted on 10 Nov 2010 (v1), last revised 19 Sep 2011 (this version, v2))

Abstract:

We study a general class of PageRank optimization problems which consist in finding an optimal outlink strategy for a web site subject to design constraints. We consider both a continuous problem, in which one can choose the intensity of a link, and a discrete one, in which in each page, there are obligatory links, facultative links and forbidden links. We show that the continuous problem, as well as its discrete variant when there are no constraints coupling different pages, can both be modeled by constrained Markov decision processes with ergodic reward, in which the webmaster determines the transition probabilities of websurfers. Although the number of actions turns out to be exponential, we show that an associated polytope of transition measures has a concise representation, from which we deduce that the continuous problem is solvable in polynomial time, and that the same is true for the discrete problem when there are no coupling constraints. We also provide efficient algorithms, adapted to very large networks. Then, we investigate the qualitative features of optimal outlink strategies, and identify in particular assumptions under which there exists a “master” page to which all controlled pages should point. We report numerical results on fragments of the real web graph.
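To see why a site's outlink choices matter at all, here is a plain power-iteration PageRank on a toy graph, run under two outlink strategies for one controlled page. This is only a sketch of the quantity being optimized, not the paper's Markov decision process formulation. The three-page web, the damping factor of 0.85, and the function name are assumptions.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration. links maps each page to the list of pages it links to.

    Every link target must also appear as a key. Pages with no outlinks are
    treated as linking to every page.
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank


# Hypothetical web: the site owns "home" and "about"; "ext" is someone else's page.
keep_internal = {"home": ["about"], "about": ["home"], "ext": ["home"]}
link_out = {"home": ["about", "ext"], "about": ["home"], "ext": ["home"]}

for name, graph in (("internal only", keep_internal), ("also external", link_out)):
    ranks = pagerank(graph)
    site_total = ranks["home"] + ranks["about"]
    print(name, round(site_total, 4))
```

In this toy case the only decision is whether "home" also links to "ext". The paper studies the same kind of choice at scale, with obligatory, facultative, and forbidden links on each page.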

I mention this research to raise several questions:

1. Does PageRank have a role to play in presentation for topic map systems?
2. Should PageRank results in topic map systems be used to assign subject identifications?
3. If your answer to #2 is yes, what sort of subjects and how would you design the user choices leading to them?
4. Are you monitoring user navigations of your topic maps?
5. Has user navigation of your topic maps affected their revision or the design of following maps?
6. Are the navigations in #5 the same as choices based on search results? (In theory or practice.)
7. Is there an optimal strategy for linking nodes in a topic map?

NOSQL means Neo4j plus Spring Roo

Filed under: Neo4j,Spring Data — Patrick Durusau @ 6:58 pm

NOSQL means Neo4j plus Spring Roo

Interesting post on the use of Neo4J with Spring Roo.

Interesting on its own but it also uses Greek Mythology as a data set so that could explain my interest. 😉

The next post promises to show inferring new facts based on existing relationships.

The HasGP user manual

Filed under: Functional Programming,Gaussian Processes,Haskell — Patrick Durusau @ 6:58 pm

The HasGP user manual (pdf)

Abstract:

HasGP is an experimental library implementing methods for supervised learning using Gaussian process (GP) inference, in both the regression and classification settings. It has been developed in the functional language Haskell as an investigation into whether the well known advantages of the functional paradigm can be exploited in the field of machine learning, which traditionally has been dominated by the procedural/object-oriented approach, particularly involving C/C++ and Matlab. HasGP is open-source software released under the GPL3 license. This manual provides a short introduction on how install the library, and how to apply it to supervised learning problems. It also provides some more in-depth information on the implementation of the library, which is aimed at developers. In the latter, we also show how some of the specific functional features of Haskell, in particular the ability to treat functions as first-class objects, and the use of typeclasses and monads, have informed the design of the library. This manual applies to HasGP version 0.1, which is the initial release of the library.

HasGP website

What a nice surprise for a Monday morning, something new and different (not the same thing). Just scanning the pages before a conference call I would say you need to both read and forward this to your Haskell/Gaussian friends.

Comes with demo programs. Release 0.1 so it will be interesting to see what the future holds.

The project does need a mailing list so users can easily discuss their experiences, suggestions, etc. (One may already exist but isn’t apparent from the project webpage. If so, apologies.)

Comments Off

Twitter Storm: Open Source Real-time Hadoop

Filed under: Hadoop,NoSQL,Storm — Patrick Durusau @ 6:55 pm

Twitter Storm: Open Source Real-time Hadoop by Bienvenido David III.

From the post:

Twitter has open-sourced Storm, its distributed, fault-tolerant, real-time computation system, at GitHub under the Eclipse Public License 1.0. Storm is the real-time processing system developed by BackType, which is now under the Twitter umbrella. The latest package available from GitHub is Storm 0.5.2, and is mostly written in Clojure.

Storm provides a set of general primitives for doing distributed real-time computation. It can be used for “stream processing”, processing messages and updating databases in real-time. This is an alternative to managing your own cluster of queues and workers. Storm can be used for “continuous computation”, doing a continuous query on data streams and streaming out the results to users as they are computed. It can also be used for “distributed RPC”, running an expensive computation in parallel on the fly.

See the post for links, details, quotes, etc.

My bet is that topologies are going to be data set specific. You?

BTW, I don’t think the local coffee shop offers free access to its cluster. Will have to check with them next week.

Comments Off

September 25, 2011

Modeling Item Difficulty for Annotations of Multinomial Classifications

Filed under: Annotation,Classification,LingPipe,Linguistics — Patrick Durusau @ 7:49 pm

Modeling Item Difficulty for Annotations of Multinomial Classifications by Bob Carpenter

From the post:

We all know from annotating data that some items are harder to annotate than others. We know from the epidemiology literature that the same holds true for medical tests applied to subjects, e.g., some cancers are easier to find than others.

But how do we model item difficulty? I’ll review how I’ve done this before using an IRT-like regression, then move on to Paul Mineiro’s suggestion for flattening multinomials, then consider a generalization of both these approaches.

For your convenience, links for the “…tutorial for LREC with Massimo Poesio” can be found at: LREC 2010 Tutorial: Modeling Data Annotation.

Comments Off

Tang and Lease (2011) Semi-Supervised Consensus Labeling for Crowdsourcing

Filed under: Crowd Sourcing,LingPipe — Patrick Durusau @ 7:49 pm

Tang and Lease (2011) Semi-Supervised Consensus Labeling for Crowdsourcing

From the post:

I came across this paper, which, among other things, describes the data collection being used for the 2011 TREC Crowdsourcing Track:

Tang, Wei and Matthew Lease. 2011. Semi-supervised consensus labeling for crowdsourcing. SIGIR Workshop on Crowdsourcing for Information Retrieval.

But that’s not why we’re here today. I want to talk about their modeling decisions.

Tang and Lease apply a Dawid-and-Skene-style model to crowdsourced binary relevance judgments for highly-ranked system responses from a previous TREC information retrieval evaluation. The workers judge document/query pairs as highly relevant, relevant, or irrelevant (though highly relevant and relevant are collapsed in the paper).

The Dawid and Skene model was relatively unsupervised, imputing all of the categories for items being classified as well as the response distribution for each annotator for each category of input (thus characterizing both bias and accuracy of each annotator).
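To make the quoted description concrete, here is a minimal sketch of a Dawid-and-Skene-style EM loop in Python. It is not the Tang and Lease semi-supervised variant, nor the code from the post: the function name `dawid_skene`, the `(item, annotator, label)` input format, the smoothing constant, and the toy votes are all assumptions made for illustration.

```python
from collections import defaultdict


def dawid_skene(labels, n_classes=2, n_iter=50, smoothing=0.01):
    """Sketch of Dawid-Skene-style EM (hypothetical helper, not a library API).

    labels: list of (item, annotator, label) with label in range(n_classes).
    Returns (post, conf): post[item][k] is the estimated probability that the
    item's true category is k; conf[annotator][k][l] is the estimated
    probability that the annotator answers l when the true category is k.
    """
    items = sorted({i for i, _, _ in labels})
    annotators = sorted({a for _, a, _ in labels})
    by_item = defaultdict(list)
    for i, a, l in labels:
        by_item[i].append((a, l))

    # Initialise each item's category distribution with its vote fractions.
    post = {}
    for i in items:
        counts = [0.0] * n_classes
        for _, l in by_item[i]:
            counts[l] += 1.0
        total = sum(counts)
        post[i] = [c / total for c in counts]

    conf = {}
    for _ in range(n_iter):
        # M-step: category prevalence and one response matrix per annotator.
        prior = [sum(post[i][k] for i in items) / len(items)
                 for k in range(n_classes)]
        conf = {a: [[smoothing] * n_classes for _ in range(n_classes)]
                for a in annotators}
        for i in items:
            for a, l in by_item[i]:
                for k in range(n_classes):
                    conf[a][k][l] += post[i][k]
        for a in annotators:
            for k in range(n_classes):
                row_total = sum(conf[a][k])
                conf[a][k] = [v / row_total for v in conf[a][k]]

        # E-step: re-impute each item's category from the current estimates.
        for i in items:
            scores = []
            for k in range(n_classes):
                s = prior[k]
                for a, l in by_item[i]:
                    s *= conf[a][k][l]
                scores.append(s)
            z = sum(scores)
            post[i] = [s / z for s in scores]
    return post, conf


# Toy data (assumed): annotators "a" and "b" agree, "c" always disagrees.
truth = [1, 1, 1, 0, 0, 0]
votes = []
for item, t in enumerate(truth):
    votes += [(item, "a", t), (item, "b", t), (item, "c", 1 - t)]
posteriors, confusion = dawid_skene(votes)
```

The per-annotator matrix in `confusion` is what lets a model of this kind separate bias from accuracy: an annotator who systematically flips labels ends up with large off-diagonal entries instead of simply being outvoted.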

I post this in part for the review of the model in question and also as a warning that competent people really do read research papers in their areas. Yes, on the WWW you can publish anything you want, of whatever quality. But, others in your field will notice. Is that what you want?

Comments Off

Domain Adaptation with Hierarchical Naive Bayes Classifiers

Filed under: Bayesian Models,Classifier,LingPipe — Patrick Durusau @ 7:48 pm

Domain Adaptation with Hierarchical Naive Bayes Classifiers by Bob Carpenter.

From the post:

This will be the first of two posts exploring hierarchical and multilevel classifiers. In this post, I’ll describe a hierarchical generalization of naive Bayes (what the NLP world calls a “generative” model). The next post will explore hierarchical logistic regression (called a “discriminative” or “log linear” or “max ent” model in NLP land).

Very entertaining and useful if you use NLP at all in your pre-topic map phase.

Comments Off

Furnace — A Property Graph Algorithms Package

Filed under: Algorithms,Blueprints,Frames,Furnace,Graphs,Gremlin,Neo4j,Pipes,Rexster,TinkerPop — Patrick Durusau @ 7:48 pm

Furnace — A Property Graph Algorithms Package

Marko Rodriguez posted the following note to the Gremlin-users mailing list today:

Hello,

For many months, the TinkerPop community has been trying to realize the best way to go about providing a graph analysis package to the TinkerPop stack ( http://bit.ly/qCMlcP ). With the increased flexibility and power of Pipes and the partitioning of Gremlin into multiple JVM languages, we feel that the stack is organized correctly now to support Furnace — A Property Graph Algorithms Package.

http://furnace.tinkerpop.com
( https://github.com/tinkerpop/furnace/wiki if the domain hasn’t propagated to your DNS yet )

The project is currently just stubbed, but over time you can expect the ability to evaluate standard (and non-standard) graph analysis algorithms over Blueprints-enabled graphs in a way that respects explicit and implicit associations in the graph. In short, it will implement the ideas articulated in:

http://markorodriguez.com/2011/02/08/property-graph-algorithms/
http://arxiv.org/abs/0806.2274

This will be possible due to Pipes and the ability to represent abstract relationships using Pipes, Gremlin_groovy (and the upcoming Gremlin_scala). Moreover, while more thought is needed, there will be a way to talk at the Frames-levels (http://frames.tinkerpop.com) and thus, calculate graph algorithms according to one’s domain model. Ultimately, in time, as Furnace develops, we will see a Rexster-Kibble that supports the evaluation of algorithms via Rexster.

While the project is still developing, please feel free to contribute ideas and/or participate in the development process. To conclude, we hope people are excited about the promises that Furnace will bring by raising the processing abstraction level above the imperative representations of Pipes/Gremlin.

Thank you,
Marko.

http://markorodriguez.com

You have been waiting for the opportunity to contribute to the TinkerPop stack, particularly on graph analysis, so here is your chance! Seriously, you need to forward this to every graph person, graph project and graduate student taking graph theory.

We can use simple graphs and hope (pray?) the world is a simple place. Or use more complex graphs to model the world. Do you feel lucky? Do you?

Comments Off

Artificial Intelligence Resources

Filed under: Artificial Intelligence,Indexing,Searching,Semantic Diversity,Topic Maps — Patrick Durusau @ 7:48 pm

Artificial Intelligence Resources

A collection of collections of resources on artificial intelligence. Useful but also illustrates a style of information delivery that has advantages over “search style foraging” and disadvantages as well.

Its biggest advantage over “search style foraging” is that it presents a manageable listing of resources and not several thousand links. Even very dedicated researchers are unlikely to follow more than a few hundred links, and even if they did, some of the material would be outdated by the time they reached it.

Another advantage is that one hopes (I haven’t tried all the links) that the resources have been vetted to some degree, with the superficial and purely advertising sites being filtered out. Results are more “hit” than “miss,” which with search results can be a very mixed bag.

But a manageable list is just that: manageable. The very link you need may have missed the cut-off point. The list had to stop somewhere.

And you can’t know the author’s criteria for the listing. Their definition of “algorithm” may be broader or narrower than your own.

In the days of professional indexes, researchers developed a sense for the categories used by indexing services. At least that was a smaller set than the vocabulary range of every author.

How would you use topic maps to bridge the gap between those two solutions?

Comments Off

Automatic transcription of 17th century English text in Contemporary English with NooJ: Method and Evaluation

Filed under: Language,Semantic Diversity,Vocabularies — Patrick Durusau @ 7:48 pm

Automatic transcription of 17th century English text in Contemporary English with NooJ: Method and Evaluation by Odile Piton (SAMM), Slim Mesfar (RIADI), and Hélène Pignot (SAMM).

Abstract:

Since 2006 we have undertaken to describe the differences between 17th century English and contemporary English thanks to NLP software. Studying a corpus spanning the whole century (tales of English travellers in the Ottoman Empire in the 17th century, Mary Astell’s essay A Serious Proposal to the Ladies and other literary texts) has enabled us to highlight various lexical, morphological or grammatical singularities. Thanks to the NooJ linguistic platform, we created dictionaries indexing the lexical variants and their transcription in CE. The latter is often the result of the validation of forms recognized dynamically by morphological graphs. We also built syntactical graphs aimed at transcribing certain archaic forms in contemporary English. Our previous research implied a succession of elementary steps alternating textual analysis and result validation. We managed to provide examples of transcriptions, but we have not created a global tool for automatic transcription. Therefore we need to focus on the results we have obtained so far, study the conditions for creating such a tool, and analyze possible difficulties. In this paper, we will be discussing the technical and linguistic aspects we have not yet covered in our previous work. We are using the results of previous research and proposing a transcription method for words or sequences identified as archaic.

Everyone working on search engines needs to print a copy of this article and read it at least once a month.

Seriously, the senses of both words and grammar evolve over centuries and even more quickly. What seem like correct search results from as recently as the 1950s may be quite incorrect.

For example (I don’t have the episode reference, perhaps someone can supply it), there was an “I Love Lucy” episode where Lucy says on the phone to Ricky that some visitor (at home) is “making love to her,” which meant nothing more than sweet talk, not sexual intercourse.

I leave it to your imagination how large the semantic gap may be between English texts and originals composed in another language, culture, and historical context between 2,000 and 6,000 years ago. Flattening the complexities of ancient texts to bumper sticker snippets does a disservice to them and to ourselves.

Comments (2)

Musimetrics

Filed under: Multivariate Statistics,Music — Patrick Durusau @ 7:48 pm

Musimetrics by Vilson Vieira, Renato Fabbri, and Luciano da Fontoura Costa.

Abstract:

Can the arts be analyzed in a quantitative manner? We propose a methodology to study music development by applying multivariate statistics on composers characteristics. Seven representative composers were considered in terms of eight main musical features. Grades were assigned to each characteristic and their correlations were analyzed. A bootstrap method was applied to simulate hundreds of artificial composers influenced by the seven representatives chosen. Applying dimensionality reduction we obtained a planar space used to quantify non-numeric relations like dialectics, opposition and innovation. Composers differences on style and technique were represented as geometrical distances in the planar space, making it possible to quantify, for example, how much Bach and Stockhausen differ from other composers or how much Beethoven influenced Brahms. In addition, we compared the results with a prior investigation on philosophy. The influence of dialectics, strong on philosophy, was not remarkable on music. Instead, supporting an observation already considered by music theorists, strong influences were identified between subsequent composers, implying inheritance and suggesting a stronger master-disciple evolution when compared to the philosophy analysis.

The article concludes:

While taking the first steps on the direction of a quantitative approach to arts and philosophy we believe that an understanding of the creative process could also be eventually quantified. We want to end this work going back to Webern, who early envisioned these relations: “It is clear that where relatedness and unity are omnipresent, comprehensibility is also guaranteed. And all the rest is dilettantism, nothing else, for all time, and always has been. That’s so not only in music but everywhere.”

You are going to encounter multivariate statistics in a number of contexts. Where are the weak points in this paper? What questions would you ask? (Hint: they don’t involve expertise in music history or theory.) If you are familiar with multivariate statistics, what are the common weak points of that type of analysis?

I remember multivariate statistics from their use in the 1960s/70s in attempts to predict Supreme Court (US) behavior. The Court was quite safe and I think the same can be said for composers in the Western canon.

Comments Off

Scaling with RavenDB

Filed under: NoSQL,RavenDB — Patrick Durusau @ 7:47 pm

Scaling with RavenDB

From the description:

Scaling the data tier is a topic that many find scary. In this webcast, Oren Eini and Nick VanMatre, Solutions Architect at Archstone, sit down to discuss the scaling options for Archstone’s newest project, a re-architecture of their internal and external apartment-management applications.

Discussed are the options for scaling RavenDB, including sharding, replication and multi-master setups.

Something to start your week!

Comments Off

Visualizing Uncertainty

Filed under: Visualization — Patrick Durusau @ 7:47 pm

Visualizing Uncertainty by David Spiegelhalter.

From the post:

We have had a review paper published in Science called Visualising uncertainty about the future, although it primarily focuses on probability forecasts.

You may access the full paper by following the links below.

We know there is uncertainty about the identification of criminals and about what we witness, and rumor has it that there is uncertainty in our data systems.

So how do we visualize that uncertainty?

Comments Off

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity