Archive for April, 2011

When Data Mining Goes Horribly Wrong

Saturday, April 30th, 2011

In When Data Mining Goes Horribly Wrong, Matthew Hurst brings us a cautionary tale about what can happen when “merging” decisions are made badly.

From the blog:

Consequently, when you see a details page – either on Google, Bing or some other search engine with a local search product – you are seeing information synthesized from multiple sources. Of course, these sources may differ in terms of their quality and, as a result, the values they provide for certain attributes.

When combining data from different sources, decisions have to be made as to firstly when to match (that is to say, assert that the data is about the same real world entity) and secondly how to merge (for example: should you take the phone number found in one source or another?).

This process – the conflation of data – is where you either succeed or fail.
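The two quoted decisions, when to match and how to merge, can be sketched in a few lines of Python. The sources, field names, and trust ordering below are invented for illustration and are not from Matthew's post:

```python
# Hypothetical sketch of the match-then-merge step described above.
# The sources, fields, and trust ordering are invented for illustration.

def match(a, b):
    """Assert that two records describe the same real-world entity.
    Here we match on normalized name plus zip code, a stand-in for
    whatever matching rule a real pipeline would use."""
    return (a["name"].lower(), a["zip"]) == (b["name"].lower(), b["zip"])

def merge(records, trust):
    """Merge matched records field by field, preferring the most
    trusted source that supplies a value."""
    merged = {}
    for rec in sorted(records, key=lambda r: trust.index(r["source"])):
        for field, value in rec.items():
            merged.setdefault(field, value)
    return merged

a = {"source": "directory", "name": "Acme Pizza", "zip": "30303",
     "phone": "555-0100"}
b = {"source": "crawl", "name": "ACME PIZZA", "zip": "30303",
     "phone": "555-0199", "hours": "11-9"}

if match(a, b):
    result = merge([a, b], trust=["directory", "crawl"])
    # "directory" wins for phone; "crawl" still contributes hours.
    print(result["phone"], result["hours"])   # 555-0100 11-9
```

Every line of that sketch embodies a judgment call (what counts as a match, which source to trust), which is exactly where, as Matthew shows, things go horribly wrong.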

Read Matthew’s post for encouraging signs that there is plenty of room for the use of topic maps.

What I find particularly amusing is that repairing the merge in this case does nothing to prevent the same mistake from happening again and again.

Not much of a repair if the problem continues to happen elsewhere.

CS 533: Natural Language Processing

Saturday, April 30th, 2011

CS 533: Natural Language Processing

Don’t be misled by the title! This isn’t just another NLP course.

From the Legal Informatics Blog:

Professor Dr. L. Thorne McCarty of the Rutgers University Department of Computer Science has posted lecture videos and other materials in connection with his recent graduate course on Natural Language Processing. The course uses examples from a judicial decision: Carter v. Exxon Company USA, 177 F.3d 197 (3d Cir. 1999).

From Professor McCarty’s post on LinkedIn:

To access most of the files, you will need the username: cs533 and the password: shrdlu. To access the videos, use the same password: shrdlu. Comments are welcome!

NLP and legal materials? Now there is a winning combination!

I must confess that years of practicing law and even now continuing to read legal materials in some areas may be influencing my estimate of the appeal of this course. 😉 I will watch the lectures and get back to you.

Justia.com

Saturday, April 30th, 2011

Justia.com

Collection of sites, blogs, resources, etc., for legal research.

I wasn’t surprised by its U.S.-centric offerings until I saw that it also offered fair coverage of Latin and South America.

What happened to European countries, the EU, Africa, Asia, etc.?

If anyone knows of legal resource collections for other countries or regions, I would like to report them here.

Natural Language Processing for the Working Programmer

Saturday, April 30th, 2011

Natural Language Processing for the Working Programmer

Daniël de Kok and Harm Brouwer have started a book on natural language processing using Haskell.

Functional programming meets NLP!

A work in progress so I am sure the authors would appreciate comments, suggestions, etc.

BTW, there is a companion blog with posts working through the book.

Artificial Intelligence | Natural Language Processing – Videos

Saturday, April 30th, 2011

I first blogged about Christopher Manning’s Artificial Intelligence | Natural Language Processing back in February, 2011.

Videos of the lectures are now online, and I thought that merited a separate update to that entry.

I have also edited that entry to point to the videos.

Enjoy!

Bridging the Gulf:…

Saturday, April 30th, 2011

Bridging the Gulf: Communication and Information in Society, Technology, and Work

October 9-13, 2011, New Orleans, Louisiana

From the website:

The ASIST Annual Meeting is the main venue for disseminating research centered on advances in the information sciences and related applications of information technology.

ASIST 2011 builds on the success of the 2010 conference structure and will have the integrated program that is an ASIST strength. This will be achieved using the six reviewing tracks pioneered in 2010, each with its own committee of respected reviewers to ensure that the conference meets your high expectations for standards and quality. These reviewers, experts in their fields, will assist with a rigorous peer-review process.

Important Dates:

  1. Papers, Panels, Workshops & Tutorials
    • Deadline for submissions: May 31
    • Notification to authors: June 28
    • Final copy: July 15
  2. Posters, Demos & Videos:
    • Deadline for submissions: July 1
    • Notification to authors: July 20
    • Final copy: July 27

One of the premier technical conferences for librarians and information professionals in the United States.

The track listings are:

  • Track 1 – Information Behaviour
  • Track 2 – Knowledge Organization
  • Track 3 – Interactive Information & Design
  • Track 4 – Information and Knowledge Management
  • Track 5 – Information Use
  • Track 6 – Economic, Social, and Political Issues

A number of opportunities for topic map based presentations.

The conference being located in New Orleans is yet another reason to attend! The food, music, and street life have to be experienced to be believed. No description would be adequate.

BigGarbage In -> BigGarbage Out

Friday, April 29th, 2011

Taking Hadoop Mainstream

This is just one example of any number of articles that lament how hard Hadoop is to explain to non-technical users.

Apparently a flood of applications that will have Hadoop “under the hood,” so to speak, is due out later this year.

While I don’t doubt it will be true that enormous amounts of data will be analyzed by those applications, without some underlying understanding of the data, will the results be meaningful?

Note that I said the data and not Hadoop.

Understanding Hadoop is just a technicality.

An important one, but whether one uses a cigar box with paper and pencil or the latest never-down cloud infrastructure with Hadoop, understanding the data and the operations to be performed upon it is far more important.

Processing large amounts of data will not be cheap, and so of necessity the results will be seen as reliable. Yes? Or else we would not have spent all that money, and you can see the answer to the problem is….

You can hear the future conversations as clearly as I can.

BigData simply means you have a big pile of data. (I forego the use of the other term.)

Whether you can extract meaningful results depends on the same factors as before the advent of “BigData.”

The principal one being an understanding of the data and its limitations. Which means human analysis of the data set and its gathering.

Data sets (large or not) are typically generated or used by staff and capturing their insights into particular aspects of a data set can be easily done using a topic map.

A topic map can collect and coordinate multiple views on the use and limitations of data sets.

That way, subsequent users don’t discover too late that a particular data set is unreliable or limited in some unforeseen way.

Hadoop is an important emerging technology subject to the rule:

BigGarbage In -> BigGarbage Out.

Duolingo: The Next Chapter in Human Communication

Friday, April 29th, 2011

Duolingo: The Next Chapter in Human Communication

By one of the co-inventors of CAPTCHA and reCAPTCHA, Luis von Ahn, so his arguments should give us pause.

Luis wants to address the problem of translating the web into multiple languages.

Yes, you heard that right, translate the web into multiple languages.

Whatever you think now, watch the video and decide if you still feel the same way.

My question is how to adapt his techniques to subject identification?

Tamana – Release

Friday, April 29th, 2011

Tamana: A generic topic map browser.

A new release from the TopicmapsLab

I haven’t looked at it yet but plan to over the weekend.

Building Erlang Applications with Rebar

Friday, April 29th, 2011

Since Erlang underlies Riak, a NoSQL distributed key-value store, I thought this might be of interest:

Building Erlang Applications with Rebar

From the description:

Rebar is an Open Source project that provides a set of standardized build tools for OTP applications and releases. This talk will explore how to use Rebar in a typical development environment for everything from simple Erlang code to port drivers and embedded Erlang servers.

Talk Objectives: Introduce the major features and functionality that rebar provides to Erlang developers. Examine the architecture of rebar and discuss how it can be extended.

Target Audience: Any Erlang developer interested in using or extending rebar for their build system.

Additional references:

Rebar: Erlang Build Tool

Erlang App. Management with Rebar (another tutorial)

Rebar Mailing List

Horton: online query execution on large distributed graphs

Friday, April 29th, 2011

Horton: online query execution on large distributed graphs by Sameh Elnikety, Microsoft Research.

The presentation addresses three problems with large, distributed graphs:

  1. How to partition the graph
  2. How to query the graph
  3. How to update the graph

Investigates a graph query language, execution engine and optimizer, and concludes with initial results.
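The first of those problems, partitioning, can be shown in its simplest possible form: hash each vertex id to a server, so any machine can locate a vertex without a directory lookup. This is only a toy sketch of the idea, not Horton's actual partitioner:

```python
import zlib

# Simplest-possible vertex partitioning: hash the vertex id to pick a
# server. Real partitioners (Horton's included) try much harder to
# keep neighbouring vertices on the same server.

def partition(vertex_id, n_servers):
    # crc32 keeps the assignment deterministic across runs
    return zlib.crc32(vertex_id.encode()) % n_servers

vertices = ["v%d" % i for i in range(10)]
shards = {}
for v in vertices:
    shards.setdefault(partition(v, 3), []).append(v)
print({s: len(vs) for s, vs in sorted(shards.items())})
```

The cost of such a scheme is that edges routinely cross server boundaries, which is exactly why query execution and optimization over the partitions is the hard part of the talk.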

ARF Graph Layout

Friday, April 29th, 2011

ARF Graph Layout

From the website:

As networks and their structure have become a major field of research, a strong demand to visualize these networks has emerged. We address this challenge by formalizing the well established spring layout in terms of dynamic equations, thus opening up the design space for new algorithms. Drawing from the knowledge of systems design, we derive a layout algorithm that remedies several drawbacks of the original spring layout. This new algorithm relies on the balancing of two antagonistic forces. We thus call it ARF for “attractive and repulsive forces”. It is, as we claim, particularly suited for dynamic layout of smaller networks. We back this claim with several application examples from ongoing complex systems research.
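The balancing of attractive and repulsive forces the authors describe can be sketched as a toy force-directed layout. The constants, step size, and iteration count here are my own choices, not ARF's equations:

```python
import math
import random

# Toy force-directed layout in the spirit of ARF: every pair of nodes
# repels, every edge attracts, and we iterate a fixed number of steps.
# All constants are invented for the sketch.

def layout(nodes, edges, steps=200, attract=0.06, repel=1.0):
    random.seed(1)
    pos = {n: [random.random(), random.random()] for n in nodes}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        for a in nodes:                     # repulsion between all pairs
            for b in nodes:
                if a == b:
                    continue
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d2 = dx * dx + dy * dy + 1e-9
                force[a][0] += repel * dx / d2
                force[a][1] += repel * dy / d2
        for a, b in edges:                  # attraction along edges
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            force[a][0] += attract * dx
            force[a][1] += attract * dy
            force[b][0] -= attract * dx
            force[b][1] -= attract * dy
        for n in nodes:                     # take a small step
            pos[n][0] += 0.01 * force[n][0]
            pos[n][1] += 0.01 * force[n][1]
    return pos

pos = layout(["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")])
```

The equilibrium falls where the two forces cancel, which is the "balancing of two antagonistic forces" the abstract refers to.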

Reference and Response

Thursday, April 28th, 2011

Reference and Response by Louis deRosset. Australian Journal of Philosophy, March 2011, Vol. 89, No.1.

Before you skip this entry, realize that this article may shine light on why Linked Data works at all and quite possibly how to improve subject identification for Linked Data and topic maps as well.

Abstract:

A standard view of reference holds that a speaker’s use of a name refers to a certain thing in virtue of the speaker’s associating a condition with that use that singles the referent out. This view has been criticized by Saul Kripke as empirically inadequate. Recently, however, it has been argued that a version of the standard view, a response-based theory of reference, survives the charge of empirical inadequacy by allowing that associated conditions may be largely or even entirely implicit. This paper argues that response-based theories of reference are prey to a variant of the empirical inadequacy objection, because they are ill-suited to accommodate the successful use of proper names by pre-school children. Further, I argue that there is reason to believe that normal adults are, by and large, no different from children with respect to how the referents of their names are determined. I conclude that speakers typically refer positionally: the referent of a use of a proper name is typically determined by aspects of the speaker’s position, rather than by associated conditions present, however implicitly, in her psychology.

With apologies to the author, I would sum up his position (sorry) on referents this way: we use proper nouns to identify particular people because we have learned those references from others, that is, from our position in a community of users of that referent.

That is to say, all the characteristics we can recite when called upon to explain why we have identified a particular person are much like the logic that justifies mathematical theorems and insights after the fact. Mathematical theorems and insights are “seen” first and then “proved” as justification for others.

Interesting. Another reason why computers do so poorly at subject identification: computers are asked to act as we imagine ourselves identifying subjects, not as we identify them in fact.

How does that help with Linked Data and topic maps?

First, I would extend the author’s argument to all referents.

Second, it reveals that the URI/L versus properties debate over how to identify a subject is really a canard.

What is important, in terms of subject identification, is the origin of the identification.

For example, if “positionally” I am using .lg as used in Unix in a Nutshell, page 12-7, that is all you need to know to distinguish its reference from all the trash that a web search engine returns.

Adding up properties of “Ligature mode” of Nroff/Troff isn’t going to get you any closer to the referent of .lg. Because that isn’t how anyone used .lg in the same sense I did.

The hot question is how to capture our positional identification of subjects.

Which would include when two or more references are for the same subject.


PS: I rather like deRosset’s conclusion:

Someone, long ago, was well-placed to refer to Cicero. Now, because of our de facto historical position, we are well-placed to refer to Cicero, even though we (or those of us without classical education) wouldn’t know Cicero from Seneca. We don’t need to be able to point to him, or apprehend some condition which singles him out (other, perhaps, than being Cicero). Possessing an appropriately-derived use of ‘Cicero’ suffices. According to the theory of evolution by natural selection, so long as we are appropriately situated (i.e., so long as our local environment is relevantly similar to our ancestors’), we benefit from our biological ancestors’ reproductive successes. Similarly, when we refer positionally, so long as we are appropriately situated, we benefit from our linguistic ancestors’ referential successes. In neither case do the conditions by which we benefit have to be present, even implicitly, in our psychology.25

The Rise of Hadoop…

Thursday, April 28th, 2011

The Rise of Hadoop: How many Hadoop-related solutions exist?

Alex Popescu of myNoSQL enhances a CMSWire listing of fourteen (14) different Hadoop solutions by adding pointers to most of the solutions.

Thanks to Alex for that!

It always puzzles me when “content providers” refer to an online site or software that can be reached online, but don’t include a link.

Searching is easy enough, and including the link takes time only once. Every reader is saved time by the presence of a link.


PS: From the CMSWire article:

Hadoop is Hard

Hadoop is not the most intuitive and easy-to-use technology. Many of the recent startups that have emerged to challenge Cloudera’s dominance have the exclusive value proposition that they make it easier to get answers from the software by abstracting the functions to higher-level products. But none of the companies has found the magic solution to bring the learning curve to a reasonable level.

Do you think topic maps could assist in “…bring[ing] the learning curve to a reasonable level”?

If so, how?

Scientific graphs Generators plugin

Thursday, April 28th, 2011

Scientific graphs Generators plugin

A new plugin for Gephi, described as:

Cezary Bartosiak and Rafał Kasprzyk just released the Complex Generators plugin, introducing many awaited scientific generators. These generators are extremely useful for scientists, as they help to simulate various real networks. They can test their models and algorithms on well-studied graph examples. For instance, the Watts-Strogatz generator creates networks as described by Duncan Watts in his Six Degrees book.

The plugin contains the following generators:

  • Balanced Tree
  • Barabasi Albert
  • Barabasi Albert Generalized
  • Barabasi Albert Simplified A
  • Barabasi Albert Simplified B
  • Erdos Renyi Gnm
  • Erdos Renyi Gnp
  • Kleinberg
  • Watts Strogatz Alpha
  • Watts Strogatz Beta
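For a sense of what these generators do, here is a minimal pure-Python Watts-Strogatz generator: build a ring lattice where each node connects to its k nearest neighbours, then rewire each edge with probability p. Parameter names follow the usual description of the model, not the plugin's API:

```python
import random

# Minimal Watts-Strogatz small-world generator: ring lattice plus
# random rewiring. Returns a set of undirected edges as sorted tuples.

def watts_strogatz(n, k, p, seed=42):
    rng = random.Random(seed)
    lattice = set()
    for i in range(n):                      # each node links to k/2
        for j in range(1, k // 2 + 1):      # neighbours on each side
            lattice.add(tuple(sorted((i, (i + j) % n))))
    edges = set()
    for a, b in sorted(lattice):
        if rng.random() < p:
            b = rng.randrange(n)            # rewire to a random endpoint
        while b == a or tuple(sorted((a, b))) in edges:
            b = rng.randrange(n)            # avoid self-loops, duplicates
        edges.add(tuple(sorted((a, b))))
    return edges

g = watts_strogatz(n=20, k=4, p=0.1)
print(len(g))   # edge count is preserved: n * k / 2 = 40
```

With p near 0 you keep the clustered lattice; with p near 1 you approach a random graph; in between lies the small-world regime the model is famous for.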

NodeBox – Graphs

Thursday, April 28th, 2011

Nodebox – Graphs

Part of NodeBox, a Mac OS X application for visualization using Python.

The graph part is described as:

The NodeBox Graph library includes algorithms from NetworkX for betweenness centrality and eigenvector centrality, Connelly Barnes’ implementation of Dijkstra shortest paths (here) and the spring layout for JavaScript by Aslak Hellesoy and Dave Hoover (here). The goal of this library is visualization of small graphs (<200 elements); if you need something more robust we recommend using NetworkX.
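The Dijkstra shortest-path routine mentioned above is compact enough to sketch with the standard library. The graph here is a made-up example, not NodeBox data:

```python
import heapq

# A compact Dijkstra shortest-path routine of the kind the library
# bundles. graph maps each node to {neighbour: edge weight}.

def dijkstra(graph, start):
    """Return {node: shortest distance from start}."""
    dist = {start: 0}
    queue = [(0, start)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue                        # stale queue entry
        for nbr, w in graph.get(node, {}).items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(queue, (nd, nbr))
    return dist

g = {"a": {"b": 1, "c": 4}, "b": {"c": 2, "d": 5}, "c": {"d": 1}}
print(dijkstra(g, "a"))   # {'a': 0, 'b': 1, 'c': 3, 'd': 4}
```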

PS: See my post: NetworkX Introduction… for an introduction to NetworkX.

Dataset linkage recommendation on the Web of Data

Thursday, April 28th, 2011

Dataset linkage recommendation on the Web of Data by Martijn van der Plaat (Master thesis).

Abstract:

We address the problem of, given a particular dataset, which candidate dataset(s) from the Web of Data have the highest chance of holding co-references, in order to increase the efficiency of coreference resolution. Currently, data publishers manually discover and select the right dataset to perform a co-reference resolution. However, in the near future the size of the Web of Data will be such that data publishers can no longer determine which datasets are candidate to map to. A solution for this problem is finding a method to automatically recommend a list of candidate datasets from the Web of Data and present this to the data publisher as an input for the mapping.

We proposed two solutions to perform the dataset linkage recommendation. The general idea behind our solutions is predicting the chance a particular dataset on the Web of Data holds co-references with respect to the dataset from the data publisher. This prediction is done by generating a profile for each dataset from the Web of Data. A profile is meta-data that represents the structure of a dataset, in terms of used vocabularies, class types, and property types. Subsequently, dataset profiles that correspond with the dataset profile from the data publisher, get a specific weight value. Datasets with the highest weight values have the highest chance of holding co-references.
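The profiling idea in the abstract can be sketched as a similarity computation: describe each dataset by the vocabularies, class types, and property types it uses, then weight candidates by overlap with the publisher's profile. The Jaccard weighting and the sample profiles are my own invention, not the thesis's method:

```python
# Sketch of dataset linkage recommendation via profiles. The facets
# follow the abstract; the Jaccard weighting and sample data are mine.

def profile_weight(mine, candidate):
    score = 0.0
    for facet in ("vocabularies", "classes", "properties"):
        a, b = set(mine[facet]), set(candidate[facet])
        if a | b:
            score += len(a & b) / len(a | b)   # Jaccard per facet
    return score / 3

publisher = {"vocabularies": {"foaf", "dc"},
             "classes": {"foaf:Person"},
             "properties": {"foaf:name", "dc:creator"}}
candidates = {
    "dbpedia": {"vocabularies": {"foaf", "dbo"},
                "classes": {"foaf:Person", "dbo:Place"},
                "properties": {"foaf:name"}},
    "geonames": {"vocabularies": {"gn"},
                 "classes": {"gn:Feature"},
                 "properties": {"gn:name"}},
}

ranked = sorted(candidates,
                key=lambda c: profile_weight(publisher, candidates[c]),
                reverse=True)
print(ranked)   # ['dbpedia', 'geonames']
```

The highest-weighted datasets are the ones recommended to the publisher as candidates for co-reference resolution.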

A useful exercise but what happens when data sets have inconsistent profiles from different sources?

And for all the drum banging, only a very tiny portion of all available datasets are part of Linked Data.

How do we evaluate the scalability of such a profiling technique?

Hard economic lessons for news

Wednesday, April 27th, 2011

Hard economic lessons for news

I saw this in the TechDirt Daily Email. Mike Masnick offered the following summary:

  • Tradition is not a business model. The past is no longer a reliable guide to future success.
  • “Should” is not a business model. You can say that people “should” pay for your product but they will only if they find value in it.
  • Virtue is not a business model. Just because you do good does not mean you deserve to be paid for it.
  • Business models are not made of entitlements and emotions. They are made of hard economics. Money has no heart.
  • Begging is not a business model. It’s lazy to think that foundations and contributions can solve news’ problems. There isn’t enough money there.
  • No one cares what you spent. Arguing that news costs a lot is irrelevant to the market.

One or more of these themes have been offered as justifications for semantic technologies, including topic maps.

I would add for semantic technologies:

  • Saving the world isn’t a business model. Try for something with more immediate and measurable results.
  • Cult like faith in linking isn’t a solution, it’s a delusion. Linking per se is not a value, successful results (by whatever means) are.
  • Sharing my data isn’t a goal. Sharing in someone else’s data is.

MaJorToM server v. 1.0.0 released

Wednesday, April 27th, 2011

MaJorToM server v. 1.0.0 released

From the post:

The MaJorToM server is a Spring application which provides a TMQL HTTP interface to query topic maps. The server is intended as back end for Topic Maps based applications.

Today the Topic Maps Lab has released the version 1.0.0 of the MaJorToM server. The server acts as TMQL endpoint for Topic Maps based applications. You can play around with an instance of the server here.

The sources of the MaJorToM server are available at Google code. In the installation instructions you will learn how you can build and deploy your own instances of the MaJorToM server. Once you have deployed the server on your own, you will have an administration interface and a TMQL interface. But the server is not only a TMQL endpoint for the data. It also provides full text search (based on Beru) and a SPARQL endpoint to the hosted topic maps.

With the next release Maiana will act as a frontend for topic maps delivered by remote MaJorToM server. Besides the Maiana integration the MaJorToM server is very well integrated with TM2O – the OData provider for Topic Maps.

Solr Result Grouping / Field Collapsing Improvements

Wednesday, April 27th, 2011

Solr Result Grouping / Field Collapsing Improvements by Yonik Seeley

From the post:

I previously introduced Solr’s Result Grouping, also called Field Collapsing, that limits the number of documents shown for each “group”, normally defined as the unique values in a field or function query.

Since then, there have been a number of bug fixes, performance improvements, and feature enhancements. You’ll need a recent nightly build of Solr 4.0-dev, or the newly released LucidWorks Enterprise v1.6, our commercial version of Solr.

A short but useful article on new grouping capabilities in Solr.

What you do with results once they are grouped, which could include “merging,” is up to you.
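The grouping idea itself, independent of Solr, fits in a few lines: collapse result documents on a field's value and keep the top-scoring document per group. Field names and scores below are illustrative:

```python
from itertools import groupby

# Not Solr itself, just the idea of field collapsing: one (or a few)
# top-scoring documents per unique value of a field.

docs = [
    {"id": 1, "site": "example.com", "score": 0.9},
    {"id": 2, "site": "example.com", "score": 0.7},
    {"id": 3, "site": "other.org",  "score": 0.8},
]

def collapse(docs, field, limit=1):
    key = lambda d: d[field]
    collapsed = []
    for _, group in groupby(sorted(docs, key=key), key=key):
        best = sorted(group, key=lambda d: d["score"], reverse=True)
        collapsed.extend(best[:limit])      # keep top docs per group
    return sorted(collapsed, key=lambda d: d["score"], reverse=True)

print([d["id"] for d in collapse(docs, "site")])   # [1, 3]
```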

Migration from a Commercial Search Platform (specifically FAST ESP) to Lucene/Solr

Wednesday, April 27th, 2011

Migration from a Commercial Search Platform (specifically FAST ESP) to Lucene/Solr posted by Mitchell Pronschinske.

From the post:

There are many reasons that an IT department with a large scale search installation would want to move from a proprietary platform to Lucene Solr. In the case of FAST Search, the company’s purchase by Microsoft and discontinuation of the Linux platform has created an urgency for FAST users. This presentation will compare Lucene/Solr to FAST ESP on a feature basis, and as applied to an enterprise search installation. We will further explore how various advanced features of commercial enterprise search platforms can be implemented as added functions for Lucene/Solr. Actual cases will be presented describing how to map the various functions between systems.

Excellent presentation if you need to make the case for Lucene/Solr, function by function.

Graph Theory

Wednesday, April 27th, 2011

Graph Theory by Reinhard Diestel.

With all the graph database offerings and papers you may feel the need to brush up on graph theory.

The student edition is quite attractively priced ($9.99)

Springer-Verlag, Heidelberg
Graduate Texts in Mathematics, Volume 173
ISBN 978-3-642-14278-9
July 2010 (2005, 2000, 1997)
451 pages; 125 figures

…Efficient Subgraph Matching on Huge Networks (or, > 1 billion edges < 1 second)

Tuesday, April 26th, 2011

A Budget-Based Algorithm for Efficient Subgraph Matching on Huge Networks by Matthias Bröcheler, Andrea Pugliese, V.S. Subrahmanian. (Presented at GDM 2011.)

Abstract:

As social network and RDF data grow dramatically in size to billions of edges, the ability to scalably answer queries posed over graph datasets becomes increasingly important. In this paper, we consider subgraph matching queries which are often posed to social networks and RDF databases — for such queries, we want to find all matching instances in a graph database. Past work on subgraph matching queries uses static cost models which can be very inaccurate due to long-tailed degree distributions commonly found in real world networks. We propose the BudgetMatch query answering algorithm. BudgetMatch costs and recosts query parts adaptively as it executes and learns more about the search space. We show that using this strategy, BudgetMatch can quickly answer complex subgraph queries on very large graph data. Specifically, on a real world social media data set consisting of 1.12 billion edges, we can answer complex subgraph queries in under one second and significantly outperform existing subgraph matching algorithms.

Built on top of Neo4j, BudgetMatch dynamically updates budgets assigned to vertices.

Aggressive pruning gives some rather attractive results.
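BudgetMatch's contribution is the adaptive costing and recosting, which this sketch does not attempt. It is only a naive backtracking matcher showing the problem being solved: find every mapping of query vertices to data vertices that preserves edges:

```python
# Naive subgraph matching by backtracking. Only illustrates the
# problem BudgetMatch solves; it has none of BudgetMatch's budgets,
# recosting, or pruning and would never scale to a billion edges.

def subgraph_matches(query_edges, data_edges):
    qnodes = sorted({v for e in query_edges for v in e})
    dnodes = sorted({v for e in data_edges for v in e})
    dset = set(data_edges)

    def ok(mapping):
        # every fully-mapped query edge must exist in the data graph
        return all((mapping[a], mapping[b]) in dset
                   for a, b in query_edges
                   if a in mapping and b in mapping)

    def extend(mapping):
        if len(mapping) == len(qnodes):
            yield dict(mapping)
            return
        q = qnodes[len(mapping)]
        for d in dnodes:
            if d not in mapping.values():
                mapping[q] = d
                if ok(mapping):
                    yield from extend(mapping)
                del mapping[q]

    return list(extend({}))

# Query: a directed edge x -> y; data: a directed triangle 1 -> 2 -> 3 -> 1.
matches = subgraph_matches([("x", "y")], [(1, 2), (2, 3), (3, 1)])
print(len(matches))   # 3
```

On long-tailed real-world graphs, the order in which such a matcher binds vertices makes an enormous difference, which is exactly the cost-estimation problem the paper attacks.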

Key-Key-Value Stores for Efficiently Processing Graph Data in the Cloud

Tuesday, April 26th, 2011

Key-Key-Value Stores for Efficiently Processing Graph Data in the Cloud by Alexander G. Connor, Panos K. Chrysanthis, Alexandros Labrinidis.

More slides from GDM 2011.

Some of the slides don’t present well but you can get the gist of what is being said.

Develops an extension of the key-value store, the key-key-value store, which partitions data into sets of related keys.

Not enough detail. Will watch for the paper.
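Pending the paper, here is a guess at a key-key-value store in miniature: a dictionary of dictionaries, where the first key identifies a vertex and the second key one of its relationships. The class and method names are mine, not the authors':

```python
# Hypothetical key-key-value store in miniature. The first key names a
# vertex, the second key a related vertex, and the value describes the
# relationship. Names are invented; the slides give no API.

class KKVStore:
    def __init__(self):
        self.data = {}

    def put(self, key1, key2, value):
        self.data.setdefault(key1, {})[key2] = value

    def get(self, key1, key2):
        return self.data.get(key1, {}).get(key2)

    def related(self, key1):
        """All keys related to key1: one 'set of related keys'."""
        return set(self.data.get(key1, {}))

store = KKVStore()
store.put("alice", "bob", "follows")
store.put("alice", "carol", "follows")
print(store.get("alice", "bob"), sorted(store.related("alice")))
```

The appeal for graph data in the cloud is that everything related to one vertex lives under one top-level key, so a vertex and its edges can be co-located on one node.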

Data Beats Math

Tuesday, April 26th, 2011

Data Beats Math

A more recent post by Jeff Jonas.

Topic maps can capture observations, judgments, conclusions from human analysts.

Do those beat math as well?

It’s All About the Librarian! New Paradigms in Enterprise Discovery and Awareness – Post

Tuesday, April 26th, 2011

It’s All About the Librarian! New Paradigms in Enterprise Discovery and Awareness

This is a must read post by Jeff Jonas.

I won’t spoil your fun but Jeff defines terms such as:

  • Context-less Card Catalogs
  • Semantically Reconciled Directories
  • Semantically Reconciled and Relationship Aware Directories

and a number of others.

Looks very much like he is interested in the same issues as topic maps.

Take the time to read it and see what you think.

Inside Horizon: interactive analysis at cloud scale

Monday, April 25th, 2011

Inside Horizon: interactive analysis at cloud scale

From the website:

Late last year, we were honored to be invited to talk at Reflections|Projections, ACM@UIUC’s annual student-run computing conference. We decided to bring a talk about Horizon, our system for doing aggregate analysis and filtering across very large amounts of data. The video of the talk was posted a few weeks back on the conference website.

Horizon started as research project / technology demonstrator built as part of Palantir’s Hack Week – a periodic innovation sprint that our engineering team uses to build brand new ideas from whole cloth. It was then used by the Center For Public Integrity in their Who’s Behind The Subprime Meltdown report. We produced a short video on the subject, Beyond the Cloud: Project Horizon, released on our analysis blog. Subsequently, it was folded into our product offering, under the name Object Explorer.

In this hour-long talk, two of the engineers that built this technology tell the story of how Horizon came to be, how it works, and show a live demo of doing analysis on hundreds of millions of records in interactive time.

From the presentation:

Mission statement: Organize the world’s information and make it universally accessible and useful. -> Google’s statement

Which should say:

Organize the world’s [public] information and make it universally accessible and useful.

Palantir’s mission:

Organize the world’s [private] information and make it universally accessible and useful.

Closes on human-driven analysis.

A couple of points:

The demo was of a pre-beta version even though the product version shipped several months prior to the presentation. What’s with that?

Long on general statements and short on any specifics.

Did mention this is a column-store solution. Appears to work well with very clean data, but then what solution doesn’t?

Good emphasis on user interface and interactive responses to queries.

I wonder if the emphasis on interactive responses creates unrealistic expectations among customers?

Or an emphasis on problems that can be solved or appear to be solvable, interactively?

My comments about intelligence community bias the other day for example. You can measure and visualize tweets that originate in Tahrir Square, but if they are mostly from Western media, how meaningful is that?

GO Annotation Tools

Monday, April 25th, 2011

GO Annotation Tools

Sponsored by the GO Ontology Consortium that I covered here, but I thought the tools merited separate mention.

Many of these tools will be directly applicable to bioinformatics use of topic maps and/or will give you ideas for similar tools in other domains.

The igraph library

Monday, April 25th, 2011

The igraph library

From the website:

igraph is a free software package for creating and manipulating undirected and directed graphs. It includes implementations for classic graph theory problems like minimum spanning trees and network flow, and also implements algorithms for some recent network analysis methods, like community structure search.

The efficient implementation of igraph allows it to handle graphs with millions of vertices and edges. The rule of thumb is that if your graph fits into the physical memory then igraph can handle it.

….

  • igraph contains functions for generating regular and random graphs according to many algorithms and models from the network theory literature.
  • igraph provides routines for manipulating graphs, adding and removing edges and vertices.
  • You can assign numeric or textual attributes to the vertices or edges of the graph, like edge weights or textual vertex ids.
  • A rich set of functions calculating various structural properties, eg. betweenness, PageRank, k-cores, network motifs, etc. are also included.
  • Force based layout generators for small and large graphs
  • The R package and the Python module can visualize graphs many ways, in 2D and 3D, interactively or non-interactively.
  • igraph provides data types for implementing your own algorithm in C, R, Python or Ruby.
  • Community structure detection algorithms using many recently developed heuristics.
  • igraph can read and write many file formats, e.g., GraphML, GML or Pajek.
  • igraph contains efficient functions for deciding graph isomorphism and subgraph isomorphism
  • It also contains an implementation of the push/relabel algorithm for calculating maximum network flow, and this way minimum cuts, vertex and edge connectivity.
  • igraph is well documented both for users and developers.
  • igraph is open source and distributed under GNU GPL.
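As a taste of the classics in that list, here is a pure-Python Prim's algorithm for minimum spanning trees on a toy weighted graph. igraph's own implementation is in C and far more efficient; this only shows the computation:

```python
import heapq

# Prim's minimum spanning tree on a toy weighted graph, illustrating
# one of the classic problems igraph implements. graph maps each node
# to {neighbour: weight}; the graph itself is made up.

def prim_mst(graph, start):
    """Return the MST as a list of (from, to, weight) edges."""
    visited = {start}
    frontier = [(w, start, n) for n, w in graph[start].items()]
    heapq.heapify(frontier)
    tree = []
    while frontier and len(visited) < len(graph):
        w, a, b = heapq.heappop(frontier)
        if b in visited:
            continue                        # edge leads back into the tree
        visited.add(b)
        tree.append((a, b, w))
        for n, w2 in graph[b].items():
            if n not in visited:
                heapq.heappush(frontier, (w2, b, n))
    return tree

g = {"a": {"b": 1, "c": 3}, "b": {"a": 1, "c": 1}, "c": {"a": 3, "b": 1}}
print(prim_mst(g, "a"))   # [('a', 'b', 1), ('b', 'c', 1)]
```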