Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 15, 2011

How to kill a patent with Python

Filed under: Data Mining,Graphs,Python,Visualization — Patrick Durusau @ 7:56 pm

How to kill a patent with Python: or using NLP and graph theory for great good! by Van Lindberg.

From the description:

Finding the right piece of “prior art” – technical documentation that described a patented piece of technology before the patent was filed – is like finding a needle in a very big haystack. This session will talk about how I am making that process faster and more accurate through the use of natural language processing, graph theory, machine learning, and lots of Python.

A fascinating presentation with practical suggestions on mining patents.

Key topic map statement: “People are inconsistent in the use of language.”

From the OSCON description of the same presentation:

When faced with a patent case, it is essential to find “prior art” – patents and publications that describe a technology before a certain date. The problem is that the indexing mechanisms for patents and publications are not as good as they could be, making good prior art searching more of an art than a science. We can apply some of our natural language processing and “big data” techniques to the US patent database, getting us better results more quickly.

  • Part I: The USPTO as a data source. The full-text of each patent is available from the USPTO (and now from Google.) What does this data look like? How can it be harvested and normalized to create data structures that we can work with?
  • Part II: Once the patents have been cleaned and normalized, they can be turned into data structures that we can use to evaluate their relationship to other documents. This is done in two ways – by modeling each patent as a document vector and a graph node.
  • Part IIA: Patents as document vectors. Once we have a patent as a data structure, we can treat the patent as a vector in an n-dimensional space. In moving from a document into a vector space, we will touch on normalization, stemming, TF/IDF, Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA).
  • Part IIB: Patents as technology graphs. This will show building graph structures using the connections between patents – both the built-in connections in the patents themselves as well as the connections discovered while working with the patents as vectors. We apply some social network analysis to partition the patent graph and find other documents in the same technology space.
  • Part III: What have we built? Now that we have done all this analysis, we can see some interesting things about the patent database as a whole. How does the patent database act as a map to the world of technology? And how has this helped with the original problem – finding better prior art?

My suggestion was to use topic maps to capture the human analysis of the clusters at the end of the day, so that it can be merged with other analysts’ work on other clusters.
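If you want to experiment with the Part IIA step yourself, the jump from raw text to document vectors takes only a few lines of Python. A minimal sketch (I am assuming scikit-learn and invented snippets of patent text; this is not Van’s actual pipeline):

```python
# Minimal sketch: documents (here, fake patent snippets) as TF/IDF vectors,
# then pairwise cosine similarity to find near neighbors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

patents = [
    "method and apparatus for compressing video frames",
    "system for encoding and decoding video streams",
    "chemical process for purifying drinking water",
]

vectorizer = TfidfVectorizer(stop_words="english")  # tokenize, weight terms by TF/IDF
vectors = vectorizer.fit_transform(patents)         # sparse matrix, one row per document

# Documents that share vocabulary end up close together in the vector space.
print(cosine_similarity(vectors))
```

From there, LSI and LDA are the usual next steps for collapsing that space down to topics.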

Waiting to learn more about this project!

Pagination with Neo4j

Filed under: Cypher,Neo4j,py2neo — Patrick Durusau @ 7:53 pm

Pagination with Neo4j by Nigel Small.

I saw this on a retweet from Peter Neubauer.

Following a request on Twitter, I have spent a few hours putting together a quick article on pagination using Neo4j and more specifically the Cypher query language. Predictably, this is written from a Python standpoint and uses py2neo, but the theory should hold true for any language since all the clever bits come from Cypher.

Fundamentally, the method described here exploits the order by, skip and limit features of Cypher in order to return only a segment of the total results from the overall result set. These features are available in the latest stable version of Neo4j at the time of writing, so if you don’t have access to them already, maybe it’s time to consider an upgrade!

The longer you wait to upgrade, the more stuff you will have to learn! 😉

Please upgrade responsibly (to a stable version, unless you are feeling brave!).
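The whole trick really is just ORDER BY, SKIP and LIMIT. Here is a minimal sketch of a paged query from Python (py2neo’s API has changed considerably since 2011, so the Graph.run style, the connection details and the node labels below are my assumptions for illustration, not Nigel’s code):

```python
# Minimal sketch of Cypher-based pagination from Python.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

def fetch_page(page, page_size=10):
    # ORDER BY fixes a stable ordering; SKIP/LIMIT cut out one page of it.
    query = """
    MATCH (p:Person)
    RETURN p.name AS name
    ORDER BY name
    SKIP $skip LIMIT $limit
    """
    return graph.run(query, skip=page * page_size, limit=page_size).data()

print(fetch_page(0))  # first page
print(fetch_page(2))  # third page
```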

5 Steps to Scaling MongoDB

Filed under: Database,MongoDB — Patrick Durusau @ 7:52 pm

5 Steps to Scaling MongoDB (Or Any DB) in 8 Minutes

From the post:

Jared Rosoff concisely, effectively, entertainingly, and convincingly gives an 8 minute MongoDB tutorial on scaling MongoDB at Scale Out Camp. The ideas aren’t just limited to MongoDB, they work for most any database: Optimize your queries; Know your working set size; Tune your file system; Choose the right disks; Shard. Here’s an explanation of all 5 strategies:

Note: The Scale Out Camp link isn’t working as of 9/14/2011. Web domain is there but no content.
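To make the first strategy concrete: “optimize your queries” mostly means “look at the plan and add the index you are missing.” A minimal pymongo sketch (collection and field names invented for illustration):

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
orders = client.shop.orders

# Without an index this query walks the whole collection; with one it does not.
orders.create_index([("customer_id", ASCENDING), ("created_at", ASCENDING)])

# explain() shows the plan, so you can confirm the index is actually used.
plan = orders.find({"customer_id": 42}).sort("created_at", -1).explain()
print(plan)
```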

Lucene and Solr 3.4.0 Released

Filed under: Lucene,Solr — Patrick Durusau @ 7:52 pm

Lucene and Solr 3.4.0 Released

Erik Hatcher writes:

There are several juicy additions, but also a critical bug fix. It is recommended that all 3.x-using applications upgrade to 3.4 as soon as possible. Here’s the scoop on this fixed bug:

* Fixed a major bug (LUCENE-3418) whereby a Lucene index could
  easily become corrupted if the OS or computer crashed or lost
  power.

Lucene 3.4.0 includes: a new faceting module (contrib/facet) for computing facet counts (both hierarchical and non-hierarchical) at search time (LUCENE-3079); a new join module (contrib/join), enabling indexing and searching of nested (parent/child) documents (LUCENE-3171); the ability to index documents with term frequencies included but without positions (LUCENE-2048) – previously omitTermFreqAndPositions always omitted both; and a few other improvements.

Solr 3.4.0 includes: Lucene 3.4.0, which fixes the serious bug mentioned above; a new XsltUpdateRequestHandler that allows posting XML that’s transformed by a provided XSLT into a valid Solr document (SOLR-2630); field grouping/collapsing, where the post-group faceting option (group.truncate) can now compute facet counts for only the highest ranking documents per group (SOLR-2665); the query cache and filter cache can now be disabled per request (SOLR-2429); improved memory usage, build time, and performance of SynonymFilterFactory (LUCENE-3233); various fixes for the multi-threaded DataImportHandler; and a few other improvements.

Here are the links for more information and download access: Lucene 3.4.0 and Solr 3.4.0

The State of Solandra – Summer 2011

Filed under: Lucene,Solandra,Solr — Patrick Durusau @ 7:52 pm

The State of Solandra – Summer 2011

From Sematext:

A little over 18 months ago we talked to Jake Luciani about Lucandra – a Cassandra-based Lucene backend. Since then Jake has moved away from raw Lucene and married Cassandra with Solr, which is why Lucandra now goes by Solandra. Let’s see what Jake and Solandra are up to these days.

What is the current status of Solandra in terms of features and stability?

Solandra has gone through a few iterations. First as Lucandra, which partitioned data by terms and used Thrift to communicate with Cassandra. This worked for a few big use cases, mainly how to manage an index per user, and garnered a number of adopters. But it performed poorly when you had very large indexes with many dense terms, due to the number and size of remote calls needed to fulfill a query.

Last summer I started off on a new approach based on Solr that would address Lucandra’s shortcomings: Solandra. The core idea of Solandra is to use Cassandra as a foundation for scaling Solr. It achieves this by embedding Solr in the Cassandra runtime and using the Cassandra routing layer to auto-shard an index across the ring (by document). This means good random distribution of data for writes (using Cassandra’s RandomPartitioner) and good search performance, since individual shards can be searched in parallel across nodes (using SolrDistributedSearch). Cassandra is responsible for sharding, replication, failover and compaction. The end user now gets a single scalable component for search, without changing APIs, which will scale in the background for them. Since search functionality is performed by Solr, it will support anything Solr does.

I gave a talk recently on Solandra and how it works: http://blip.tv/datastax/scaling-solr-with-cassandra-5491642

…more follows, worth your attention.

Solr Digest, Spring-Summer 2011, Part 2: Solr Cloud and Near Real Time Search

Filed under: Lucene,Solr — Patrick Durusau @ 7:51 pm

Solr Digest, Spring-Summer 2011, Part 2: Solr Cloud and Near Real Time Search

Just to tempt you to read the rest of the post:

As promised in Part 1 of Solr Digest, Spring-Summer 2011, in this Part 2 post we’ll summarize what’s new with Solr’s Near-Real-Time Search support and Solr Cloud (if you love clouds and search with some big data on the side, get in touch). Let’s first examine what is being worked on for Solr Cloud and what else is in the queue for the near future. A good overview of what is currently functional can be found in the old Solr Cloud wiki page. Also, there is now another wiki page covering New Solr Cloud Design, which we find quite useful. The individual pieces of Solr Cloud functionality that are being worked on are as follows:

  • Work is still in progress on Distributed Indexing and Shard distribution policy. Patches exist, although they are now over 6 months old, so you can expect to see them updated soon.
  • As part of the Distributed Indexing effort, shard leader functionality deals with leader election and with publishing the information about which node is a leader of which shard and in Zookeeper in order to notify all interested parties. The development is pretty active here and initial patches already exist.
  • At some point in the future, Replication Handler may become cloud aware, which means it should be possible to switch the roles of masters and slaves, master URLs will be able to change based on cluster state, etc. The work hasn’t started on this issue.
  • Another feature Solr Cloud will have is automatic Splitting and migrating of Indices. The idea is that when some shard’s index becomes too large or the shard itself starts having bad query response times, we should be able to split parts of that index and migrate it (or merge) with indices on other (less loaded) nodes. Again, the work on this hasn’t started yet. Once this is implemented one will be able to split and move/merge indices using a Solr Core Admin as described in SOLR-2593.
  • To achieve more efficiency in search and gain control over where exactly each document gets indexed to, you will be able to define a custom shard lookup mechanism. This way, you’ll be able to limit execution of search requests to only some shards that are known to hold target documents, thus making the query more efficient and faster. This, along with the above mentioned shard distribution policy, is akin to routing functionality in ElasticSearch.

Isn’t that an amazing level of activity? I get tired just reading about it. 😉 Now if it can just be applied as cleverly as it has been written.

BTW, Part 1 if you are interested.

Statistical machine learning for text classification

Filed under: Natural Language Processing,NLTK,Python — Patrick Durusau @ 7:51 pm

Statistical machine learning for text classification with scikit-learn and NLTK by Olivier Grisel. (PyCon 2011)

The goal of this talk is to give a state-of-the-art overview of machine learning algorithms applied to text classification tasks ranging from language and topic detection in tweets and web pages to sentiment analysis in consumer products reviews.

The first third is a review of basic NLP, followed by a review of the basic functions of scikit-learn and the same for NLTK. It also covers, briefly, the Google Prediction API.

Compares all three on the movie review database. Discusses analysis of newsgroups (for topics) and identifying language of webpages.
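If you want to try the scikit-learn side of the talk, a bag-of-words classifier is only a few lines. A minimal sketch (the tiny training set is obviously invented):

```python
# Bag-of-words text classification: TF/IDF features feeding a linear classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

train_texts = [
    "a wonderful, moving film with great performances",
    "utterly boring and a waste of two hours",
    "one of the best movies I have seen this year",
    "the plot made no sense and the acting was wooden",
]
train_labels = ["pos", "neg", "pos", "neg"]

clf = Pipeline([
    ("vec", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams and bigrams
    ("svm", SGDClassifier(loss="hinge")),          # linear SVM trained with SGD
])
clf.fit(train_texts, train_labels)

print(clf.predict(["a boring, senseless plot", "great film, great cast"]))
```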

I would not say “state-of-the-art” as much as “an intro to text classification and its potential.”

Spatial Search Plugin (SSP) for Solr

Filed under: Geographic Information Retrieval,Maps,Solr — Patrick Durusau @ 7:51 pm

Spatial Search Plugin (SSP) for Solr

From the webpage:

With the continuous efforts of adjusting search results to focused target audiences, there’s an increasing demand for incorporating geographical location information into the standard search functionality. Spatial Search Plugin (SSP) for Apache Solr is a free, standalone plug-in which enables Geo / Location Based Search, and is built on top of the open source projects Apache Solr and Apache Lucene. Its main goals and characteristics are:

  • Provide a complete, consistent, robust and fast implementation of advanced geospatial algorithms
  • Act as a standalone pluggable extension to Solr
  • Written in 100% Java
  • Compatible with Apache Solr and Apache Lucene
  • Open source under the Apache2 license
  • Well documented and comes with support

Location plus information about the location is a topic mappish sort of thing.

Explorer for Apache Solr

Filed under: Interface Research/Design,Solr — Patrick Durusau @ 7:50 pm

Explorer for Apache Solr

From the webpage:

One of our products is the Explorer for Apache Solr, a powerful generic Apache Solr client developed by JTeam. Based on Google Web Toolkit and the GWToolbox framework, this powerful explorer is able to connect to several Solr instances/cores and provides a meaningful UI for some of the more advanced features offered by Solr.

With this Explorer for Apache Solr you can:

  • Perform simple term based search
  • Configure highlighting of search terms
  • Configure spellchecking (“Did you mean…” functionality)
  • Explore search facets (field and query facets)
  • View the search request and raw search response
  • Browse the Solr Schema
  • Configure different search result sortings
  • And more…

Do you really have the time to develop a UI from scratch?

DTIC Online

Filed under: Information Retrieval,Library — Patrick Durusau @ 7:50 pm

DTIC Online

From the webpage:

The Defense Technical Information Center (DTIC®) serves the DoD community as the largest central resource for DoD and government-funded scientific, technical, engineering, and business related information available today.

For more than 65 years DTIC has provided the warfighter and researchers, scientists, engineers, laboratories, and universities timely access to over 2 million publications covering over 250 subject areas. Our mission supports the nation’s warfighter.
….

The United States government, and I suspect other national governments, have sponsored decades’ worth of research on text processing, mining and evaluation. This is one of the major interfaces to US-based literature. The Literature-Related Discovery (LRD) material originated from this source.

You will find things such as: “Research in Information Retrieval – Final Report – An investigation of the techniques and concepts of information retrieval,” dated 31 July 1964 as well as current reports.

A real treasure trove of historical and current material on information retrieval. The historical material will help you recognize when you are re-solving a well known problem. And sometimes help you avoid repeating old mistakes.

CERN Document Server (CDS)

Filed under: Library — Patrick Durusau @ 7:50 pm

CERN Document Server (CDS)

If you want to talk about “big data” and tools for dealing with it, what better place to start than where big data is the norm, not the exception?

You may want to start off with the help page as this is one of the barest interfaces I have seen in a while.

Enterprise-level Cloud at no charge

Filed under: Cloud Computing,Hadoop — Patrick Durusau @ 7:49 pm

Enterprise-level cloud at no charge

Promotional period: September 12 – November 11, 2011.

Signup deadline: 28 October 2011

From the webpage:

  • 64-bit Copper and 32-bit Silver machines
  • Virtual machines to run Linux® (Red Hat or Novell SUSE) or Microsoft® Windows® Server 2003/2008
  • Select IBM software images
  • 1 block (256 gigabytes) of persistent storage

For the promotional period, IBM will suppress charges for use of these services. You may terminate the promotion at any time, although we don’t think you’ll want to! At the end of the promotional period, your account will transition to a standard pay-as-you-go account at the rates effective at that time. You may elect to add on more services, including, but not limited to:

  • Reserved virtual machine instances
  • On-boarding support
  • Premium and Advanced Premium support options
  • Virtual Private Network services
  • Additional images from IBM software brands, along with offerings from independent software vendors
  • Access to other IBM SmartCloud data centers
  • Additional services that are regularly being added to the IBM SmartCloud Enterprise offering

With these features and more, don’t miss this opportunity to try the IBM SmartCloud. With our enterprise-level servers, software and services, we offer a cloud computing infrastructure that you can approach with confidence. The IBM SmartCloud is built on the skills, experience and best practices gained from years of managing and operating security-rich data centers for enterprises and public institutions around the world.

If you want to try the cloud computing waters or IBM offerings, this could be your chance.

Naive Bayes Classifiers – Python

Filed under: Bayesian Models,Classifier,Python — Patrick Durusau @ 7:49 pm

Naive Bayes Classifiers – Python

From the post:

In naive Bayes classifiers, every feature gets a say in determining which label should be assigned to a given input value. To choose a label for an input value, the naive Bayes classifier begins by calculating the prior probability of each label, which is determined by checking the frequency of each label in the training set. The contribution from each feature is then combined with this prior probability, to arrive at a likelihood estimate for each label. The label whose likelihood estimate is the highest is then assigned to the input value.
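That description translates almost directly into NLTK. A minimal sketch (the name-gender features and the four training examples are toy material of mine, not from the post):

```python
# Naive Bayes in NLTK: each feature votes, priors come from label frequencies.
import nltk

def gender_features(name):
    return {"last_letter": name[-1].lower(), "length": len(name)}

train = [
    (gender_features("Marie"), "female"),
    (gender_features("Anna"), "female"),
    (gender_features("John"), "male"),
    (gender_features("Peter"), "male"),
]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(gender_features("Laura")))
classifier.show_most_informative_features(2)  # which features carry the vote
```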

Just one recent post from Python Language Processing. There are a number of others, some of which I will call out in future posts.

September 14, 2011

Neo4j Web/Phone Survey

Filed under: Graphs,Neo4j — Patrick Durusau @ 7:30 pm

Neo4j Web/Phone Survey

Neo4j is running a web and phone survey to determine future development.

Web survey is only five (5) questions, well, plus #6 if you have time to participate in a phone survey about Neo4j.

If you are using/enjoying/experimenting with Neo4j, take a moment to help shape its future!

Seven Deadly Sins of Solr

Filed under: Design,Enterprise Integration,Search Engines,Solr — Patrick Durusau @ 7:06 pm

7 Ways to Ensure Your Lucene/Solr Implementation Fails

From the post:

CMSWire spoke with Lucene/Solr expert Jay Hill of Lucid Imagination for a few tips on things to avoid when implementing Lucene/Solr to reduce the risk of your search project biting the dust. Hill calls them the “Seven Deadly Sins of Solr” – sloth, greed, pride, lust, envy, gluttony and wrath.

Read for Solr projects. Recast and read for other projects as well.

Don’t trust your instincts

Filed under: Data Analysis,Language,Recognition,Research Methods — Patrick Durusau @ 7:04 pm

I stumbled upon a review of: “The Secret Life of Pronouns: What Our Words Say About Us” by James W. Pennebaker in the New York Times Book Review, 28 August 2011.

Pennebaker is a word counter whose first rule is: “Don’t trust your instincts.”

Why? In part because our expectations shape our view of the data. (sound familiar?)

The review quotes the Drudge Report as posting a headline about President Obama that reads: “I ME MINE: Obama praises C.I.A. for bin Laden raid – while saying ‘I’ 35 Times.”

If the listener thinks President Obama is self-centered, the “I’s” have it as it were.

But Pennebaker has used his programs to mindlessly count word usage in press conferences since Truman. Obama is the lowest I-word user of modern presidents.

That is only one illustration of how badly we can “look” at text or data and get it seriously wrong.
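The counting itself is the easy part; our expectations are what get in the way. A toy sketch of an “I-word” rate counter (real LIWC-style analysis is far more careful about tokenization and word categories than this):

```python
import re
from collections import Counter

FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

def i_word_rate(text):
    # Crude tokenization: lowercase words and contractions only.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return sum(counts[w] for w in FIRST_PERSON) / max(len(words), 1)

print(i_word_rate("I think my remarks speak for themselves."))
```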

The Secret Life of Pronouns website has exercises to demonstrate how badly we get things wrong. (The videos are very entertaining.)

What does that mean for topic maps and authoring topic maps?

  1. Don’t trust your instincts. (courtesy of Pennebaker)
  2. View your data in different ways, ask unexpected questions.
  3. Ask people unfamiliar with your data how they view it.
  4. Read books on subjects you know nothing about. (Just general good advice.)
  5. Ask known unconventional people to question your data/subjects. (Like me! Sorry, consulting plug.)

Yahoo! Hadoop Tutorial

Filed under: Hadoop,MapReduce,Pig — Patrick Durusau @ 7:03 pm

Yahoo! Hadoop Tutorial

From the webpage:

Welcome to the Yahoo! Hadoop Tutorial. This tutorial includes the following materials designed to teach you how to use the Hadoop distributed data processing environment:

  • Hadoop 0.18.0 distribution (includes full source code)
  • A virtual machine image running Ubuntu Linux and preconfigured with Hadoop
  • VMware Player software to run the virtual machine image
  • A tutorial which will guide you through many aspects of Hadoop’s installation and operation.

The tutorial is divided into seven modules, designed to be worked through in order. They can be accessed from the links below.

  1. Tutorial Introduction
  2. The Hadoop Distributed File System
  3. Getting Started With Hadoop
  4. MapReduce
  5. Advanced MapReduce Features
  6. Related Topics
  7. Managing a Hadoop Cluster
  8. Pig Tutorial

You can also download this tutorial as a single .zip file and burn a CD for use, and easy distribution, offline.
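If you want a taste of the MapReduce module without writing Java, Hadoop Streaming lets you do word count in Python. A minimal sketch (the jar path and the input/output directories in the comment are assumptions, and this is not part of the tutorial itself):

```python
# Word count as a Hadoop Streaming job. Save as wordcount.py and run it as
# both mapper and reducer, e.g.:
#   hadoop jar /path/to/hadoop-streaming.jar \
#       -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" \
#       -input books -output counts
import sys

def map_phase():
    # Emit (word, 1) for every word on stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word.lower())

def reduce_phase():
    # Hadoop sorts by key before the reducer, so equal words arrive as runs.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current and current is not None:
            print("%s\t%d" % (current, total))
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    map_phase() if sys.argv[1] == "map" else reduce_phase()
```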

Literature-Related Discovery (LRD)

Filed under: Information Retrieval,Literature-based Discovery — Patrick Durusau @ 7:02 pm

Literature-Related Discovery (LRD) by Ronald N. Kostoff, Joel A. Block, Jeffrey L. Solka, Michael B. Briggs, Robert L. Rushenberg, Jesse A. Stump, Dustin Johnson, Terence J. Lyons, and Jeffrey R. Wyatt.

Short Abstract:

Discovery in science is the generation of novel, interesting, plausible, and intelligible knowledge about the objects of study. Literature-related discovery (LRD) is the linking of two or more literature concepts that have heretofore not been linked (i.e., disjoint), in order to produce novel interesting, plausible, and intelligible knowledge (i.e., potential discovery).

From the longer abstract in the monograph:

LRD offers the promise of large amounts of potential discovery, for the following reasons:

  • the burgeoning technical literature contains a very large pool of technical concepts in myriad technical areas;
  • researchers spend full time trying to cover the literature in their own research fields and are relatively unfamiliar with research in other especially disparate fields of research;
  • the large number of technical concepts (and disparate technical concepts) means that many combinations of especially disparate technical concepts exist
  • by the laws of probability, some of these combinations will produce novel, interesting, plausible, and intelligible knowledge about the objects of study

This monograph presents the LRD methodology and voluminous discovery results from five problem areas: four medical (treatments for Parkinson’s Disease (PD), Multiple Sclerosis (MS), Raynaud’s Phenomenon (RP), and Cataracts) and one non-medical (Water Purification (WP)). In particular, the ODS aspect of LRD is addressed, rather than the CDS aspect. In the presentation of potential discovery, a ‘vetting’ process is used that insures both requirements for ODS LBD are met: concepts are linked that have not been linked previously, and novel, interesting, plausible, and intelligible knowledge is produced.

The potential discoveries for the PD, MS, Cataracts, and WP problems are the first we have seen reported by this ODS LBD approach, and the numbers of potential discoveries for the ODS LBD benchmark RP problem are almost two orders of magnitude greater than those reported in the open literature by any other ODS LBD researcher who has addressed this benchmark RP problem. The WP problem is the first non-medical technical topic to have been addressed successfully by ODS LBD.

(ODS = open discovery system; CDS = closed discovery system)

If you are looking for validation with supporting data for the literature-related discovery method, seek no further. The text plus annexes runs 884 pages.

This is a technique that fits quite well with topic maps.
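For the curious, the open-discovery idea reduces to surprisingly little code: link a source concept A to intermediate terms B, then propose targets C that co-occur with some B but never directly with A. A toy Python rendering (my own sketch of the ABC model, with invented “documents,” not the authors’ pipeline):

```python
# Toy ABC-style open discovery over term co-occurrence.
from collections import defaultdict

docs = [
    {"raynaud", "blood viscosity"},
    {"blood viscosity", "platelet aggregation"},
    {"fish oil", "platelet aggregation"},
    {"fish oil", "blood viscosity"},
]

cooccur = defaultdict(set)
for doc in docs:
    for term in doc:
        cooccur[term] |= doc - {term}

def candidate_links(a):
    b_terms = cooccur[a]                 # terms directly linked to A
    c_terms = set()
    for b in b_terms:
        c_terms |= cooccur[b]            # terms linked to those B terms
    return c_terms - b_terms - {a}       # linked indirectly, never directly

print(candidate_links("raynaud"))        # e.g. {'fish oil', 'platelet aggregation'}
```

(The example deliberately echoes Swanson’s classic fish oil / Raynaud’s discovery.)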

PS: Yes, I know, this monograph says “literature-related discovery” (5.8 million “hits” in a popular search engine) versus “literature-based discovery” (6.3 million “hits” in the same search engine), another name for the same technique. Sigh, even semantic integration is afflicted with semantic integration woes.

Clojure Zippers

Filed under: Clojure,Functional Programming — Patrick Durusau @ 7:02 pm

Clojure Zippers

Luke VanderHart on zippers in Clojure.

From the website:

An introduction to the Clojure zip data structure, which supports fully functional tree navigation and editing. Includes a discussion of how to use the data structure effectively, as well as an overview of its performance characteristics.

I would supplement this presentation with:

“Editing” trees in Clojure with clojure.zip by Brian Marick.

Zippers – Functional Tree Editing (it’s under Other Included Libraries)

From the webpage:

Clojure includes purely functional, generic tree walking and editing, using a technique called a zipper (in namespace zip) . For background, see the paper by Huet. A zipper is a data structure representing a location in a hierarchical data structure, and the path it took to get there. It provides down/up/left/right navigation, and localized functional ‘editing’, insertion and removal of nodes. With zippers you can write code that looks like an imperative, destructive walk through a tree, call root when you are done and get a new tree reflecting all the changes, when in fact nothing at all is mutated – it’s all thread safe and shareable.

Oh, and the library documentation: API for Clojure.zip

Secondary Indexes in Riak

Filed under: Indexing,Riak — Patrick Durusau @ 7:02 pm

Secondary Indexes in Riak

Hey, “…alternate keys, one-to-many relationships, or many-to-many relationships…,” it sounds like they are playing the topic map song!

From the post:

Developers building an application on Riak typically have a love/hate relationship with Riak’s simple key/value-based approach to storing data. It’s great that anyone can grok the basics (3 simple operations, get/put/delete) quickly. It’s convenient that you can store anything imaginable as an object’s value: an integer, a blob of JSON data, an image, an MP3. And the distributed, scalable, failure-tolerant properties that a key/value storage model enables can be a lifesaver depending on your use case.

But things get much less rosy when faced with the challenge of representing alternate keys, one-to-many relationships, or many-to-many relationships in Riak. Historically, Riak has shifted these responsibilities to the application developer. The developer is forced to either find a way to fit their data into a key/value model, or to adopt a polyglot storage strategy, maintaining data in one system and relationships in another.

This adds complexity and technical risk, as the developer is burdened with writing additional bookkeeping code and/or learning and maintaining multiple systems.

That’s why we’re so happy about Secondary Indexes. Secondary Indexes are the first step toward solving these challenges, lifting the burden from the backs of developers, and enabling more complex data modeling in Riak. And the best part is that it ships in our 1.0 release, just a few weeks from now.
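Once 1.0 lands, using a secondary index from Python might look something like the sketch below. The method names follow the riak-python-client as I understand it, and the bucket, keys and field names are invented, so treat the details as assumptions:

```python
import riak

client = riak.RiakClient()
users = client.bucket("users")

# Store an object and tag it with a secondary index (an "alternate key").
obj = users.new("user:1001", data={"name": "Ada", "group": "engineering"})
obj.add_index("group_bin", "engineering")
obj.store()

# Later: find every key tagged with that group, without scanning the bucket.
for key in users.get_index("group_bin", "engineering"):
    print(key)
```

(Secondary indexes also require a backend that supports them, e.g. LevelDB rather than Bitcask.)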

Flexible ranking in Lucene 4

Filed under: Lucene,Ranking — Patrick Durusau @ 7:01 pm

Flexible ranking in Lucene 4

Robert Muir writes:

Over the summer I served as a Google Summer of Code mentor for David Nemeskey, PhD student at Eötvös Loránd University. David proposed to improve Lucene’s scoring architecture and implement some state-of-the-art ranking models with the new framework.

These improvements are now committed to Lucene’s trunk: you can use these models in tandem with all of Lucene’s features (boosts, slops, explanations, etc) and queries (term, phrase, spans, etc). A JIRA issue has been created to make it easy to use these models from Solr’s schema.xml.

Relevance ranking is the heart of the search engine, and I hope the additional models and flexibility will improve the user experience for Lucene: whether you’ve been frustrated with tuning TF/IDF weights and find an alternative model works better for your case, found it difficult to integrate custom logic that your application needs, or just want to experiment.

The wiki page for this project has a pointer to the search engine in A “Terrier” For Your Tool Box?.

I count a baker’s dozen or so new features described in this post.

Google Pregel: the Rise of the Clones

Filed under: Giraph,GoldenOrb,Hama,Phoebus,Pregel — Patrick Durusau @ 7:01 pm

Google Pregel: the Rise of the Clones

Claudio Martella gives a quick overview of Pregel “clones,” Apache Hama, GoldenOrb, Giraph, and Phoebus.

Claudio concludes:

So, here it is, fire up your Hadoop pseudo-cluster and get back to me if you have something to add.

Simple Search Is Not Enough!! (Homeland Security)

Filed under: Mapping,Searching — Patrick Durusau @ 7:00 pm

Simple Search Is Not Enough!! Map Necessity by Luc Quoniam

Very pictorial review of why simple search is inadequate.

Topical too since the examples all concern “homeland security.”

Leveling Up in The Process Quest

Filed under: Erlang — Patrick Durusau @ 7:00 pm

Leveling Up in The Process Quest: The Hiccups of Appups and Relups

I won’t reproduce the image that “Learn You Some Erlang for Great Good!” uses for a failed update; you will have to visit the blog page to see for yourself.

I can quote the first couple of paragraphs that set the background for it:

Doing some code hot-loading is one of the simplest things in Erlang. You recompile, make a fully-qualified function call, and then enjoy. Doing it right and safe is much more difficult, though.

There is one very simple challenge that makes code reloading problematic. Let’s use our amazing Erlang-programming brain and have it imagine a gen_server process. This process has a handle_cast/2 function that accepts one kind of argument. I update it to one that takes a different kind of argument, compile it, push it in production. All is fine and dandy, but because we have an application that we don’t want to shut down, we decide to load it on the production VM to make it run.

I suspect that Erlang, or something close to it, will become the norm in the not too distant future, mostly because there won’t be an opportunity to “catch up” on all the data streams in the event of re-loading an application. There may be buffering in the event of a reader failure, but not system-wide.

FigShare

Filed under: Data,Dataset — Patrick Durusau @ 6:59 pm

FigShare

From the website:

Scientific publishing as it stands is an inefficient way to do science on a global scale. A lot of time and money is being wasted by groups around the world duplicating research that has already been carried out. FigShare allows you to share all of your data, negative results and unpublished figures. In doing this, other researchers will not duplicate the work, but instead may publish with your previously wasted figures, or offer collaboration opportunities and feedback on preprint figures.

There wasn’t a category on the site for CS data sets, or rather for the results of processing/searching data sets.

Would that be the same thing?

Thinking it would be interesting to have examples of data analysis that failed along with the data sets in question. Or at least pointers to the data sets.

September 13, 2011

3rd Canadian Semantic Web Symposium

Filed under: Biomedical,Concept Detection,Ontology,Semantic Web — Patrick Durusau @ 7:17 pm

CSWS2011: The 3rd Canadian Semantic Web Symposium Proceedings of the 3rd Canadian Semantic Web Symposium
Vancouver, British Columbia, Canada, August 5, 2011

An interesting set of papers! I suppose I can be forgiven for looking at the text mining (Hassanpour & Das) and heterogeneous information systems (Khan, Doucette, and Cohen) papers first. 😉 More comments to follow on those.

What are your favorite papers in this batch and why?

The whole proceedings can also be downloaded as a single PDF file.

Edited by:

Christopher J. O. Baker *
Helen Chen **
Ebrahim Bagheri ***
Weichang Du ****

* University of New Brunswick, Saint John, NB, Canada, Department of Computer Science & Applied Statistics
** University of Waterloo, Waterloo, ON, Canada, School of Public Health and Health Systems
*** Athabasca University, School of Computing and Information Systems
**** University of New Brunswick, NB, Canada, Faculty of Computer Science

Table of Contents

Full Paper

  1. The Social Semantic Subweb of Virtual Patient Support Groups
    Harold Boley, Omair Shafiq, Derek Smith, Taylor Osmun
  2. Leveraging SADI Semantic Web Services to Exploit Fish Ecotoxicology Data
    Matthew M. Hindle, Alexandre Riazanov, Edward S. Goudreau, Christopher J. Martyniuk, Christopher J. O. Baker

Short Paper

  3. Towards Evaluating the Impact of Semantic Support for Curating the Fungus Scientific Literature
    Marie-Jean Meurs, Caitlin Murphy, Nona Naderi, Ingo Morgenstern, Carolina Cantu, Shary Semarjit, Greg Butler, Justin Powlowski, Adrian Tsang, René Witte
  4. Ontology based Text Mining of Concept Definitions in Biomedical Literature
    Saeed Hassanpour, Amar K. Das
  5. Social and Semantic Computing in Support of Citizen Science
    Joel Sachs, Tim Finin
  6. Unresolved Issues in Ontology Learning
    Amal Zouaq, Dragan Gaševic, Marek Hatala

Poster

  7. Towards Integration of Semantically Enabled Service Families in the Cloud
    Marko Boškovic, Ebrahim Bagheri, Georg Grossmann, Dragan Gaševic, Markus Stumptner
  8. SADI for GMOD: Semantic Web Services for Model Organism Databases
    Ben Vandervalk, Michel Dumontier, E Luke McCarthy, Mark D Wilkinson
  9. An Ontological Approach for Querying Distributed Heterogeneous Information Systems
    Atif Khan, John A. Doucette, Robin Cohen

Please see the CSWS2011 website for further details.

Discovering, Summarizing and Using Multiple Clusterings

Filed under: Clustering,Data Analysis,Data Mining — Patrick Durusau @ 7:16 pm

Proceedings of the 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings
Athens, Greece, September 5, 2011.

This collection of papers reflects what I think is rapidly becoming the consensus view: There is no one/right way to look at data.

That is important because by the application of multiple techniques, in these papers clustering techniques, you may make unanticipated discoveries about your data. Recording the trail you followed, as all explorers should, will help others duplicate your steps, to test them or to go further. In topic map terms, I would say you would be discovering and identifying subjects.
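A tiny illustration of the point, for anyone who would rather run it than read it (data and parameters are arbitrary):

```python
# Two algorithms, two legitimate but different views of the same data.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3).fit_predict(X)

# KMeans splits the moons with a straight boundary; DBSCAN follows their
# shapes. Neither is "the" clustering of this data set.
print(kmeans_labels[:10])
print(dbscan_labels[:10])
```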

Edited by

Emmanuel Müller *
Stephan Günnemann **
Ira Assent ***
Thomas Seidl **

* Karlsruhe Institute of Technology, Germany
** RWTH Aachen University, Germany
*** Aarhus University, Denmark


Complete workshop proceedings as one file (~16 MB).

Table of Contents

    Invited Talks

  1. Combinatorial Approaches to Clustering and Feature Selection
    Michael E. Houle
  2. Cartification: Turning Similarities into Itemset Frequencies
    Bart Goethals

    Research Papers

  3. When Pattern Met Subspace Cluster
    Jilles Vreeken, Arthur Zimek
  4. Fast Multidimensional Clustering of Categorical Data
    Tengfei Liu, Nevin L. Zhang, Kin Man Poon, Yi Wang, Hua Liu
  5. Factorial Clustering with an Application to Plant Distribution Data
    Manfred Jaeger, Simon Lyager, Michael Vandborg, Thomas Wohlgemuth
  6. Subjectively Interesting Alternative Clusters
    Tijl De Bie
  7. Evaluation of Multiple Clustering Solutions
    Hans-Peter Kriegel, Erich Schubert, Arthur Zimek
  8. Browsing Robust Clustering-Alternatives
    Martin Hahmann, Dirk Habich, Wolfgang Lehner
  9. Generating a Diverse Set of High-Quality Clusterings
    Jeff M. Phillips, Parasaran Raman, Suresh Venkatasubramanian

Query processing in distributed, taxonomy-based information sources

Filed under: P2P,Query Expansion,Taxonomy — Patrick Durusau @ 7:16 pm

Query processing in distributed, taxonomy-based information sources by Carlo Meghini, Yannis Tzitzikas, Veronica Coltella, and Anastasia Analyti.

Abstract:

We address the problem of answering queries over a distributed information system, storing objects indexed by terms organized in a taxonomy. The taxonomy consists of subsumption relationships between negation-free DNF formulas on terms and negation-free conjunctions of terms. In the first part of the paper, we consider the centralized case, deriving a hypergraph-based algorithm that is efficient in data complexity. In the second part of the paper, we consider the distributed case, presenting alternative ways implementing the centralized algorithm. These ways descend from two basic criteria: direct vs. query re-writing evaluation, and centralized vs. distributed data or taxonomy allocation. Combinations of these criteria allow to cover a wide spectrum of architectures, ranging from client-server to peer-to-peer. We evaluate the performance of the various architectures by simulation on a network with O(10^4) nodes, and derive final results. An extensive review of the relevant literature is finally included.

Two quick comments:

While simulations are informative, I am curious how the five architectures would fare against actual taxonomies. My thinking is that the complexity at any particular level varies greatly from taxonomy to taxonomy, assuming they are taxonomies that record natural phenomena.

Second, I think there is a growing recognition that while some data can be successfully gathered to a single location for processing, there is an increasing amount of data that may be partially accessible but that cannot be transferred for privacy, security or other concerns. And such diverse systems are likely to have their own means of identifying subjects.

COMTOP: Applied and Computational Algebraic Topology

Filed under: Dimension Reduction,High Dimensionality,Topology — Patrick Durusau @ 7:15 pm

COMTOP: Applied and Computational Algebraic Topology

From the website:

The overall goal of this project is to develop flexible topological methods which will allow the analysis of data which is difficult to analyze using classical linear methods. Data obtained by sampling from highly curved manifolds or singular algebraic varieties in Euclidean space are typical examples where our methods will be useful. We intend to develop and refine two pieces of software which have been written by members of our research group, ISOMAP (Tenenbaum) and PLEX (de Silva-Carlsson). ISOMAP is a tool for dimension reduction and parameterization of high dimensional data sets, and PLEX is a homology computing tool which we will use in locating and analyzing singular points in data sets, as well as estimating dimension in situations where standard methods do not work well. We plan to extend the range of applicability of both tools, in the case of ISOMAP by studying embeddings into spaces with non-Euclidean metrics, and in the case of PLEX by building in the Mayer-Vietoris spectral sequence as a tool. Both ISOMAP and PLEX will be adapted for parallel computing. We will also begin the theoretical study of statistical questions relating to topology. For instance, we will initiate the study of higher dimensional homology of subsets sampled from Euclidean space under various sampling hypotheses. The key object of study will be the family of Cech complexes constructed using the distance function in Euclidean space together with a randomly chosen finite set of points in Euclidean space.

The goal of this project is to develop tools for understanding data sets which are not easy to understand using standard methods. This kind of data might include singular points, or might be strongly curved. The data is also high dimensional, in the sense that each data point has many coordinates. For instance, we might have a data set whose points each of which is an image, which has one coordinate for each pixel. Many standard tools rely on linear approximations, which do not work well in strongly curved or singular problems. The kind of tools we have in mind are in part topological, in the sense that they measure more qualitative properties of the spaces involved, such as connectedness, or the number of holes in a space, and so on. This group of methods has the capability of recognizing the number of parameters required to describe a space, without actually parameterizing it. These methods also have the capability of recognizing singular points (like points where two non-parallel planes or non-parallel lines intersect), without actually having to construct coordinates on the space. We will also be further developing and refining methods we have already constructed which can actually find good parameterizations for many high dimensional data sets. Both projects will involve the adaptation for the computer of many methods which have heretofore been used in by-hand calculations for solving theoretical problems. We will also initiate the theoretical development of topological tools in a setting which includes errors and sampling.

Are you still using “classical linear methods” to analyze your data sets?

Could be the right choice but it may be your only choice if you never consider the alternatives.
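The ISOMAP mentioned above is the same algorithm scikit-learn now ships as sklearn.manifold.Isomap, so the dimension-reduction side is easy to try; the PLEX/homology side has no such one-liner. A minimal sketch on the classic swiss roll:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# A curved 2-D sheet rolled up in 3-D: linear methods see three dimensions.
X, _ = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)

# Isomap recovers the two intrinsic coordinates by respecting the curvature.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)  # (1000, 2)
```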

Practical Aggregation of Semantical Program Properties for Machine Learning Based Optimization

Filed under: Machine Learning,Statistical Learning,Vectors — Patrick Durusau @ 7:14 pm

Practical Aggregation of Semantical Program Properties for Machine Learning Based Optimization by Mircea Namolaru, Albert Cohen, Grigori Fursin, Ayal Zaks, and Ari Freund.

ABSTRACT

Iterative search combined with machine learning is a promising approach to design optimizing compilers harnessing the complexity of modern computing systems. While traversing a program optimization space, we collect characteristic feature vectors of the program, and use them to discover correlations across programs, target architectures, data sets, and performance. Predictive models can be derived from such correlations, effectively hiding the time-consuming feedback-directed optimization process from the application programmer.

One key task of this approach, naturally assigned to compiler experts, is to design relevant features and implement scalable feature extractors, including statistical models that filter the most relevant information from millions of lines of code. This new task turns out to be a very challenging and tedious one from a compiler construction perspective. So far, only a limited set of ad-hoc, largely syntactical features have been devised. Yet machine learning is only able to discover correlations from information it is fed with: it is critical to select topical program features for a given optimization problem in order for this approach to succeed.

We propose a general method for systematically generating numerical features from a program. This method puts no restrictions on how to logically and algebraically aggregate semantical properties into numerical features. We illustrate our method on the difficult problem of selecting the best possible combination of 88 available optimizations in GCC. We achieve 74% of the potential speedup obtained through iterative compilation on a wide range of benchmarks and four different general-purpose and embedded architectures. Our work is particularly relevant to embedded system designers willing to quickly adapt the optimization heuristics of a mainstream compiler to their custom ISA, microarchitecture, benchmark suite and workload. Our method has been integrated with the publicly released MILEPOST GCC [14].

Read the portions on extracting features, inference of new relations, extracting relations from programs, extracting features from relations and tell me this isn’t a description of pre-topic map processing! 😉
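To see why I say that, here is a toy rendering of the “numerical features from a program” step (the features, programs and labels are all invented; MILEPOST’s real features come from inside GCC, not from regexes over source text):

```python
# Turn source code into a small numeric feature vector, then learn a mapping
# from those features to the optimization choice that performed best.
import re
from sklearn.tree import DecisionTreeClassifier

def features(source):
    return [
        len(re.findall(r"\bfor\b|\bwhile\b", source)),  # loop count
        len(re.findall(r"\bif\b", source)),             # branch count
        len(re.findall(r"\w+\s*\(", source)),           # call-like sites
        source.count("["),                              # array accesses
    ]

programs = ["for(i=0;i<n;i++) a[i]=b[i]*c[i];",
            "if(x) f(); else g();"]
best_flags = ["-O3", "-Os"]  # pretend measurements from iterative compilation

model = DecisionTreeClassifier().fit([features(p) for p in programs], best_flags)
print(model.predict([features("while(n--) sum += a[n];")]))
```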

