Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 4, 2012

New Paper: Linked Data Strategy for Global Identity

Filed under: Identity,RDF,Semantic Web — Patrick Durusau @ 3:32 pm

New Paper: Linked Data Strategy for Global Identity

Angela Guess writes:

Hugh Glaser and Harry Halpin have published a new PhD thesis for the University of Southampton Research Repository entitled “The Linked Data Strategy for Global Identity” (2012). The paper was published by the IEEE Computer Society. It is available for download here for non-commercial research purposes only. The abstract states, “The Web’s promise for planet-scale data integration depends on solving the thorny problem of identity: given one or more possible identifiers, how can we determine whether they refer to the same or different things? Here, the authors discuss various ways to deal with the identity problem in the context of linked data.”

At first I was hurt that I didn’t see a copy of Harry’s dissertation before it was published. I don’t always agree with him (see below) but I do like keeping up with his writing.

Then I discovered this is a four page dissertation. I guess Angela never got past the cover page. It is an article in the IEEE zine, IEEE Internet Computing.

Harry fails to mention that the HTTP 303 “trick” was made necessary by Tim Berners-Lee’s failure to understand the need to distinguish identifiers from addresses. Rather than admit to or correct that failure, the solution being pushed is to create web traffic overhead in the form of 303 “tricks.” “303” should be renamed “TBL,” so we are reminded with each invocation who made it necessary. (lower middle column, page 3)

I partially agree with:

We’re only just beginning to explore the vast field of identity, and more work is needed before linked data can fulfill its full potential. (on page 5)

The “just beginning” part is true enough. But therein lies the rub. Rather than first exploring the “…vast field of identity…,” which changes from domain to domain, and then proposing a solution, the Linked Data proponents took the other path.

They proposed a solution and, in the face of its failure to work, are now inching towards the “…vast field of identity….” Seems a mite late for that.

Harry concludes:

The entire bet of the linked data enterprise critically rests on using URIs to create identities for everything. Whether this succeeds might very well determine whether information integration will be trapped in centralized proprietary databases or integrated globally in a decentralized manner with open standards. Given the tremendous amount of data being created and the Web’s ubiquitous nature, URIs and equivalence links might be the best chance we have of solving the identity problem, transforming a profoundly difficult philosophical issue into a concrete engineering project.

The first line, “The entire bet…,” fails to say that we need the same URIs for everything. That is the perfect language project, which has a very long history of consistent failure. Recent attempts include Esperanto and Loglan.

The second line, “Whether this succeeds…trapped in centralized proprietary databases…” is fear mongering. “If you don’t support linked data, (insert your nightmare scenario).”

The final line, “…transforming a profoundly difficult philosophical issue into a concrete engineering project” is magical thinking.

Identity is a very troubled philosophical issue, but proposing a solution without understanding the problem doesn’t sound like a high-percentage shot to me. You?

The Problem With Names (and the W3C)

Filed under: RDF,Semantic Web — Patrick Durusau @ 3:30 pm

The Problem With Names by Paul Miller.

Paul details the struggle of museums to make their holdings web accessible.

The problem isn’t reluctance or a host of other issues that Paul points out.

The problem is one of identifiers, that is, names.

Museums have crafted complex identifiers for their holdings and not unreasonably expect to continue to use them.

But all they are being offered are links.

The Rijksmuseum is one of several museums around the world that is actively and enthusiastically working to open up its data, so that it may be used, enjoyed, and enriched by a whole new audience. But until some of the core infrastructure — the names, the identifiers, the terminologies, and the concepts — upon which this and other museums depend becomes truly part of the web, far too much of the opportunity created by big data releases such as the Rijksmuseum’s will be wasted.

When is the W3C going to admit that subjects can have complex names/identifiers? Not just simple links?

That would be a game changer. For everyone.

California abandons $2 billion court management system

Filed under: Marketing — Patrick Durusau @ 3:29 pm

California abandons $2 billion court management system by Michael Krigsman.

From the post:

Despite spending $500 million on the California Case Management System (CCMS), court officials terminated the project and allocated $8.6 million to determine whether they can salvage anything. In 2004, planners expected the system to cost $260 million; today, the price tag would be $2 billion if the project runs to completion.

The multi-billion project, started in 2001, was intended to automate California court operations with a common system across the state and replace 70 different legacy systems. Although benefits from the planned system seem clear, court leadership decided it could no longer afford the cost of completing the system, especially during this period of budget cuts, service reductions, and personnel layoffs.

This failure wasn’t entirely due to the diversity of legacy applications. I expect poor project management, local politics and just bad IT advice all played their parts.

But it is an example of how removing local diversity in IT represents a bridge too far.

Diversity in our population is a common thing. (English-only states notwithstanding.)

Diversity in IT is common as well.

Diversity in plant and animal populations makes them more robust.

Perhaps diversity in IT systems, with engineered interchange, could give us robustness and interoperability.

Apache Bigtop 0.3.0 (incubating) has been released

Filed under: Bigtop,Flume,Hadoop,HBase,Hive,Mahout,Oozie,Sqoop,Zookeeper — Patrick Durusau @ 2:33 pm

Apache Bigtop 0.3.0 (incubating) has been released by Roman Shaposhnik.

From the post:

Apache Bigtop 0.3.0 (incubating) is now available. This is the first fully integrated, community-driven, 100% Apache Big Data management distribution based on Apache Hadoop 1.0. In addition to a major change in the Hadoop version, all of the Hadoop ecosystem components have been upgraded to the latest stable versions and thoroughly tested:

  • Apache Hadoop 1.0.1
  • Apache Zookeeper 3.4.3
  • Apache HBase 0.92.0
  • Apache Hive 0.8.1
  • Apache Pig 0.9.2
  • Apache Mahout 0.6.1
  • Apache Oozie 3.1.3
  • Apache Sqoop 1.4.1
  • Apache Flume 1.0.0
  • Apache Whirr 0.7.0

Thoughts on what is missing from this ecosystem?

What if you moved from the company where you wrote the scripts? And they needed new scripts?

Re-write? On what basis?

Is your “big data” big enough to need “big documentation?”

March 2012 Bay Area HBase User Group meetup summary

Filed under: HBase,Hive — Patrick Durusau @ 2:31 pm

March 2012 Bay Area HBase User Group meetup summary by David S. Wang.

Let’s see:

  • …early access program – HBase In Action
  • …recent HBase releases
  • …Moving HBase RPC to protobufs
  • …Comparing the native HBase client and asynchbase
  • …Using Apache Hive with HBase: Recent improvements
  • …backups and snapshots in HBase
  • …Apache HBase PMC meeting

Do you need any additional reasons to live in the Bay Area? 😉

Seriously, if you do, take advantage of the opportunity meetings like this one offer.

If you don’t, it might be cheaper to create your own HBase/Hive ecosystem than to pay the air fare.

April 3, 2012

Rhetological Fallacies

Filed under: Humor,Logic,Rhetoric — Patrick Durusau @ 4:19 pm

Rhetological Fallacies: Errors and manipulations of rhetorical and logical thinking.

Useful and deeply amusing chart of errors in rhetoric and logic by David McCandless.

Each error has a symbol along with a brief explanation.

These symbols really should appear in Unicode. 😉

Or at least have a TeX symbol set defined for them.

I have written to ask about a version with separate images/glyphs for each fallacy. Would make it easier to “tag” arguments.

Custom security filtering in Solr

Filed under: Faceted Search,Facets,Filters,Solr — Patrick Durusau @ 4:19 pm

Custom security filtering in Solr by Erik Hatcher

Yonik recently wrote about “Advanced Filter Caching in Solr” where he talked about expensive and custom filters; it was left as an exercise to the reader on the implementation details. In this post, I’m going to provide a concrete example of custom post filtering for the case of filtering documents based on access control lists.

Recap of Solr’s filtering and caching

First let’s review Solr’s filtering and caching capabilities. Queries to Solr involve a full-text, relevancy scored, query (the infamous q parameter). As users navigate they will browse into facets. The search application generates filter query (fq) parameters for faceted navigation (eg. fq=color:red, as in the article referenced above). The filter queries are not involved in document scoring, serving only to reduce the search space. Solr sports a filter cache, caching the document sets of each unique filter query. These document sets are generated in advance, cached, and reduce the documents considered by the main query. Caching can be turned off on a per-filter basis; when filters are not cached, they are used in parallel to the main query to “leap frog” to documents for consideration, and a cost can be associated with each filter in order to prioritize the leap-frogging (smallest set first would minimize documents being considered for matching).

Post filtering

Even without caching, filter sets default to generate in advance. In some cases it can be extremely expensive and prohibitive to generate a filter set. One example of this is with access control filtering that needs to take the users query context into account in order to know which documents are allowed to be returned or not. Ideally only matching documents, documents that match the query and straightforward filters, should be evaluated for security access control. It’s wasteful to evaluate any other documents that wouldn’t otherwise match anyway. So let’s run through an example… a contrived example for the sake of showing how Solr’s post filtering works.

Good examples, but also heed the author’s warning to use the techniques in this article only when necessary. Sometimes simple solutions are the best, like using the network authentication layer to prevent unauthorized users from seeing the Solr application at all. No muss, no fuss.
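To make the cached-versus-post-filter distinction concrete from the client side, here is a minimal sketch, not taken from Hatcher’s post, assuming a recent SolrJ client (the class names have changed since the Solr 3.x era) and a hypothetical “acl” query parser that wraps a custom PostFilter. The cache=false and cost local params are the standard Solr switches described in the recap above.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PostFilterQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical core URL.
        HttpSolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build();

        SolrQuery query = new SolrQuery("ipod");

        // Cheap, reusable filter: cached in the filter cache and applied up front.
        query.addFilterQuery("color:red");

        // Expensive ACL filter: cache=false keeps it out of the filter cache,
        // and cost >= 100 asks Solr to run it as a post filter, i.e. only against
        // documents that already matched the query and the cheaper filters.
        // The "acl" parser is hypothetical -- it stands in for a custom
        // PostFilter implementation like the one in Hatcher's post.
        query.addFilterQuery("{!acl cache=false cost=200 user=erik}");

        QueryResponse rsp = solr.query(query);
        System.out.println("Documents visible to user: " + rsp.getResults().getNumFound());
        solr.close();
    }
}

Cheap filters stay in the filter cache; the expensive ACL check only ever sees documents that survived everything else.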

1940 Census (U.S.A.)

Filed under: Census Data,Government Data — Patrick Durusau @ 4:18 pm

1940 Census (U.S.A.)

From the “about” page:

Census records are the only records that describe the entire population of the United States on a particular day. The 1940 census is no different. The answers given to the census takers tell us, in detail, what the United States looked like on April 1, 1940, and what issues were most relevant to Americans after a decade of economic depression.

The 1940 census reflects economic tumult of the Great Depression and President Franklin D. Roosevelt’s New Deal recovery program of the 1930s. Between 1930 and 1940, the population of the Continental United States increased 7.2% to 131,669,275. The territories of Alaska, Puerto Rico, American Samoa, Guam, Hawaii, the Panama Canal, and the American Virgin Islands comprised 2,477,023 people.

Besides name, age, relationship, and occupation, the 1940 census included questions about internal migration; employment status; participation in the New Deal Civilian Conservation Corps (CCC), Works Progress Administration (WPA), and National Youth Administration (NYA) programs; and years of education.

Great for ancestry and demographic studies. What other data would you use with this census information?

Ohio State University Researcher Compares Parallel Systems

Filed under: Cray,GPU,HPC,Parallel Programming,Parallelism — Patrick Durusau @ 4:18 pm

Ohio State University Researcher Compares Parallel Systems

From the post:

Surveying the wide range of parallel system architectures offered in the supercomputer market, an Ohio State University researcher recently sought to establish some side-by-side performance comparisons.

The journal, Concurrency and Computation: Practice and Experience, in February published, “Parallel solution of the subset-sum problem: an empirical study.” The paper is based upon a master’s thesis written last year by former computer science and engineering graduate student Saniyah Bokhari.

“We explore the parallelization of the subset-sum problem on three contemporary but very different architectures, a 128-processor Cray massively multithreaded machine, a 16-processor IBM shared memory machine, and a 240-core NVIDIA graphics processing unit,” said Bokhari. “These experiments highlighted the strengths and weaknesses of these architectures in the context of a well-defined combinatorial problem.”

Bokhari evaluated the conventional central processing unit architecture of the IBM 1350 Glenn Cluster at the Ohio Supercomputer Center (OSC) and the less-traditional general-purpose graphic processing unit (GPGPU) architecture, available on the same cluster. She also evaluated the multithreaded architecture of a Cray Extreme Multithreading (XMT) supercomputer at the Pacific Northwest National Laboratory’s (PNNL) Center for Adaptive Supercomputing Software.

What I found fascinating about this approach was the comparison of:

the strengths and weaknesses of these architectures in the context of a well-defined combinatorial problem.

True enough, there is a place for general methods and solutions, but one pays the price for using general methods and solutions.

Thinking that for subject identity and “merging” in a “big data” context, we will need a deeper understanding of specific identity and merging requirements. So that the result of that study is one or more well-defined combinatorial problems.

That is to say that understanding one or more combinatorial problems precedes proposing a solution.

You can view/download the thesis by Saniyah Bokhari, Parallel Solution of the Subset-sum Problem: An Empirical Study.

Or view the article (assuming you have access):

Parallel solution of the subset-sum problem: an empirical study

Abstract (of the article):

The subset-sum problem is a well-known NP-complete combinatorial problem that is solvable in pseudo-polynomial time, that is, time proportional to the number of input objects multiplied by the sum of their sizes. This product defines the size of the dynamic programming table used to solve the problem. We show how this problem can be parallelized on three contemporary architectures, that is, a 128-processor Cray Extreme Multithreading (XMT) massively multithreaded machine, a 16-processor IBM x3755 shared memory machine, and a 240-core NVIDIA FX 5800 graphics processing unit (GPU). We show that it is straightforward to parallelize this algorithm on the Cray XMT primarily because of the word-level locking that is available on this architecture. For the other two machines, we present an alternating word algorithm that can implement an efficient solution. Our results show that the GPU performs well for problems whose tables fit within the device memory. Because GPUs typically have memories in the order of 10GB, such architectures are best for small problem sizes that have tables of size approximately 10^10. The IBM x3755 performs very well on medium-sized problems that fit within its 64-GB memory but has poor scalability as the number of processors increases and is unable to sustain performance as the problem size increases. This machine tends to saturate for problem sizes of 10^11 bits. The Cray XMT shows very good scaling for large problems and demonstrates sustained performance as the problem size increases. However, this machine has poor scaling for small problem sizes; it performs best for problem sizes of 10^12 bits or more. The results in this paper illustrate that the subset-sum problem can be parallelized well on all three architectures, albeit for different ranges of problem sizes. The performance of these three machines under varying problem sizes show the strengths and weaknesses of the three architectures. Copyright © 2012 John Wiley & Sons, Ltd.
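The “dynamic programming table” in the abstract is the classic pseudo-polynomial construction. As a rough, sequential illustration (nothing like the parallel variants benchmarked in the paper, and keeping only one row of the two-dimensional table), the reachability table looks like this:

public class SubsetSum {
    // reachable[s] is true if some subset of the items seen so far sums to s.
    // The paper's table is the full (number of items) x (target sum) version;
    // this sketch keeps only the current row.
    static boolean subsetSum(int[] items, int target) {
        boolean[] reachable = new boolean[target + 1];
        reachable[0] = true; // the empty subset sums to 0
        for (int item : items) {
            // Walk downward so each item is used at most once.
            for (int s = target; s >= item; s--) {
                if (reachable[s - item]) {
                    reachable[s] = true;
                }
            }
        }
        return reachable[target];
    }

    public static void main(String[] args) {
        int[] items = {3, 34, 4, 12, 5, 2};
        System.out.println(subsetSum(items, 9));  // true: 4 + 5
        System.out.println(subsetSum(items, 30)); // false
    }
}

The parallel versions in the paper differ in how they partition and update this table, which is where the word-level locking on the Cray XMT pays off.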

6 Reasons Why Infographics and Data Visualization Works

Filed under: Infographics,Visualization — Patrick Durusau @ 4:18 pm

6 Reasons Why Infographics and Data Visualization Works by Matthew Fields.

Here are the reasons (see the post for the explanations and the infographic):

  1. Short Attention Spans
  2. Information Overload
  3. Easy to Understand
  4. Reading Retention
  5. More Engaging
  6. People Love Sharing Infographics

What I find interesting is the effective use of text before you get to the infographic.

And there are some points I would add to the list:

  • No Thinking Required
  • Reinforces Prejudices
  • Shallow Understanding
  • Infographic Replaces Data

The last one, Infographic Replaces Data, is the most dangerous.

The debate shifts from what the data may or may not show, upon additional analysis, to what the infographic may or may not show.

Do you see the shift? If you allow me (or anyone else) to create an infographic, we have implicitly defined the boundaries of discussion. We are no longer talking about the “data” (although we may use that terminology) but about an infographic that has replaced the data.

In other words, if you don’t agree with the infographic, you have already lost the debate. Because we are not debating the “data,” but rather my infographic. Which I fashioned because it supports my opinion, obviously.

Thinking of infographics as brainwashed data would not be too far off the mark.

Some infographics are worse than others, in terms of shifting the basis for discussion. I will round up some good examples for a future post.

Apache Hadoop Versions: Looking Ahead

Filed under: Hadoop — Patrick Durusau @ 4:18 pm

Apache Hadoop Versions: Looking Ahead by Aaron Myers.

From the post:

A few months ago, my colleague Charles Zedlewski wrote a great piece explaining Apache Hadoop version numbering. The post can be summed up with the following diagram:

[graphic omitted]

While Charles’s post does a great job of explaining the history of Apache Hadoop version numbering, it doesn’t help users understand where Hadoop version numbers are headed.

A must read to avoid being confused yourself about future Hadoop development.

To say nothing of trying to explain future Hadoop development to others.

What we have gained from this self-inflicted travail remains unclear.

Apache Sqoop Graduates from Incubator

Filed under: Database,Hadoop,Sqoop — Patrick Durusau @ 4:18 pm

Apache Sqoop Graduates from Incubator by Arvind Prabhakar.

From the post:

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. You can use Sqoop to import data from external structured datastores into Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.

In its monthly meeting in March of 2012, the board of Apache Software Foundation (ASF) resolved to grant a Top-Level Project status to Apache Sqoop, thus graduating it from the Incubator. This is a significant milestone in the life of Sqoop, which has come a long way since its inception almost three years ago.

For moving data in and out of Hadoop, Sqoop is your friend. Drop by and say hello.

Getting More Out of a Solr Solution

Filed under: Search Engines,Searching,Solr — Patrick Durusau @ 4:17 pm

Getting More Out of a Solr Solution

Jasmine Ashton reports:

Carsabi, a used car search engine, recently posted a write-up that provides readers with useful tips on how to make Solr run faster in “Optimizing Solr (or How to 7x Your Search Speed).”

According to the article, after experiencing problems with their current system, Carsabi switched to WebSolr’s Solr solution which worked until the website reached 1 million listings. In order to make the process work better, Carsabi expanded its hardware storage capacity. While this improved the speed a great deal, they still weren’t satisfied.

A good example of why we should all look beyond the traditional journals, conference proceedings and blogs for experiences with search engines. Where we will find advice like:

Software: Shard that Sh*t

Is that “active” voice? 😉

I should write so clearly and directly.
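In that spirit, a minimal sketch of what “sharding it” looks like from the query side, assuming plain SolrJ and hypothetical host and core names, and leaving index-time document routing as a separate exercise:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardedQueryExample {
    public static void main(String[] args) throws Exception {
        // Send the request to any one node; the "shards" parameter tells Solr
        // to fan the query out to every listed core and merge the results.
        HttpSolrClient solr =
            new HttpSolrClient.Builder("http://search1:8983/solr/cars").build();

        SolrQuery query = new SolrQuery("used camry");
        query.set("shards", "search1:8983/solr/cars,search2:8983/solr/cars");
        query.setRows(20);

        QueryResponse rsp = solr.query(query);
        System.out.println("Total hits across shards: " + rsp.getResults().getNumFound());
        solr.close();
    }
}

With SolrCloud (see the SolrCloud post below) the routing bookkeeping becomes automatic, but the explicit shards parameter is the clearest way to see what distributed search is doing.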

Tracking Video Game Buzz

Filed under: Blogs,Clustering,Data Mining,Tweets — Patrick Durusau @ 4:17 pm

Tracking Video Game Buzz

Matthew Hurst writes:

Briefly, I pushed out an experimental version of track // games to track topics in the blogosphere relating to video games. As with track // microsoft, it gathers posts from blogs, clusters them and uses an attention metric based on Bitly and Twitter to rank the clusters, new posts and videos.

Currently at the top of the stack is Bungie Waves Goodbye To Halo.

Wonder if Matthew could be persuaded to do the same for the elections this Fall in the United States? 😉

Easier literate programming with R

Filed under: R — Patrick Durusau @ 4:17 pm

Easier literate programming with R by Christophe Lalanne.

From the post:

I have been using Sweave over the past 5 or 6 years to process my R documents, and I have been quite happy with this program. However, with the recent release of knitr (already adopted on UCLA Stat Computing and on Vanderbilt Biostatistics Wiki) and all of its nice enhancements, I really need to get more familiar with it.

In fact, there’s a lot of goodies in Yihui Xie‘s knitr, including the automatic processing of graphics (no need to call print() to display a lattice object), local or global control of height/width for any figures, removal of R’s prompt (R’s output being nicely prefixed with comments), tidying and highlighting facilities, image cropping, use of framed or listings for embedding code chunk.

To overcome some of those lacking features in Sweave, I generally have to post-process my files using shell scripts or custom Makefile. For example, I am actually giving a course (in French) on introductory #rstats for biomedical research and I provide a series of exercices written with Sweave. I can easily manage my graphics to have the desired size using a combination of Sweave Gin and lattice‘s aspect= argument. However, the latter means I have to crop my images afterwards. Moreover, I need to “cache” some of the computations and there’s no command-line argument for that, unless you rely on pgfSweave. This leads to complicated stuff like…

Any post that makes content generation less complicated or less of a one-off task sounds good to me.

April 2, 2012

The 1000 Genomes Project

The 1000 Genomes Project

If Amazon is hosting a single dataset > 200 TB, is your data “big data?” 😉

This merits quoting in full:

We're very pleased to welcome the 1000 Genomes Project data to Amazon S3. 

The original human genome project was a huge undertaking. It aimed to identify every letter of our genetic code, 3 billion DNA bases in total, to help guide our understanding of human biology. The project ran for over a decade, cost billions of dollars and became the corner stone of modern genomics. The techniques and tools developed for the human genome were also put into practice in sequencing other species, from the mouse to the gorilla, from the hedgehog to the platypus. By comparing the genetic code between species, researchers can identify biologically interesting genetic regions for all species, including us.

A few years ago there was a quantum leap in the technology for sequencing DNA, which drastically reduced the time and cost of identifying genetic code. This offered the promise of being able to compare full genomes from individuals, rather than entire species, leading to a much more detailed genetic map of where we, as individuals, have genetic similarities and differences. This will ultimately give us better insight into human health and disease.

The 1000 Genomes Project, initiated in 2008, is an international public-private consortium that aims to build the most detailed map of human genetic variation available, ultimately with data from the genomes of over 2,661 people from 26 populations around the world. The project began with three pilot studies that assessed strategies for producing a catalog of genetic variants that are present at one percent or greater in the populations studied. We were happy to host the initial pilot data on Amazon S3 in 2010, and today we're making the latest dataset available to all, including results from sequencing the DNA of approximately 1,700 people.

The data is vast (the current set weighs in at over 200Tb), so hosting the data on S3 which is closely located to the computational resources of EC2 means that anyone with an AWS account can start using it in their research, from anywhere with internet access, at any scale, whilst only paying for the compute power they need, as and when they use it. This enables researchers from laboratories of all sizes to start exploring and working with the data straight away. The Cloud BioLinux AMIs are ready to roll with the necessary tools and packages, and are a great place to get going.

Making the data available via a bucket in S3 also means that customers can crunch the information using Hadoop via Elastic MapReduce, and take advantage of the growing collection of tools for running bioinformatics job flows, such as CloudBurst and Crossbow

You can find more information, the location of the data and how to get started using it on our 1000 Genomes web page, or from the project pages.

If that sounds like a lot of data, just imagine all of the recorded mathematical texts and the relationships between the concepts represented in such texts.

It is in how we view it that data looks smooth or simple. Or complex.

Distributed Graph DB Reading Club Returns! (4th April 2012)

Filed under: B-trees,Boost Graph Library,Graphs,Key-Key-Value Stores — Patrick Durusau @ 5:48 pm

Distributed Graph DB Reading Club Returns! (4th April 2012)

René Pickhardt writes:

The reading club was quite inactive due to traveling and also a not optimal process for the choice of literature. That is why a new format for the reading club has been discussed and agreed upon.

The new Format means that we have 4 new rules

  1. we will only discuss up to 3 papers in 90 minutes of time. So rough speaking we have 30 minutes per paper but this does not have to be strict.
  2. The decided papers should be read by everyone before the reading club takes place.
  3. For every paper there is one responsible person (moderator) who did read the entire paper before he suggested it as a common reading.
  4. Open questions to the (potential) reading assignments and ideas for reading can and should be discussed on http://related-work.rene-pickhardt.de/ (use the same template as I used for the reading assignments in this blogpost) eg:

Moderator:
Paper download:
Why to read it
topics to discuss / open questions:

For next meeting on April 4th 2 pm CET (in two days) the literature will be:

In case you haven’t noticed, the 4th of April is the day after tomorrow!

Time for the reading glasses and strong coffee!

The Total Cost of (Non) Ownership of a NoSQL Database Service

Filed under: Amazon DynamoDB,Amazon Web Services AWS,Cloud Computing — Patrick Durusau @ 5:47 pm

The Total Cost of (Non) Ownership of a NoSQL Database Service

From the post:

We have received tremendous positive feedback from customers and partners since we launched Amazon DynamoDB two months ago. Amazon DynamoDB enables customers to offload the administrative burden of operating and scaling a highly available distributed database cluster while only paying for the actual system resources they consume. We also received a ton of great feedback about how simple it is to get started and how easy it is to scale the database. Since Amazon DynamoDB introduced the new concept of a provisioned throughput pricing model, we also received several questions around how to think about its Total Cost of Ownership (TCO).

We are very excited to publish our new TCO whitepaper: The Total Cost of (Non) Ownership of a NoSQL Database service. Download PDF.

I bet you can guess how the numbers work out without reading the PDF file. 😉

Makes me wonder, though, if there would be a market for a different hosted NoSQL database or topic map application. Particularly a topic map application.

Not along the lines of Maiana but more of a topic-based data set, which could respond to data by merging it with already stored data. Say, for example, a firefighter scans the bar code on a railroad car lying alongside the tracks with fire getting closer. The only thing they want is a list of the necessary equipment and whether to leave now, or not.

Most preparedness agencies would be well pleased to simply pay for the usage they get of such a topic map.

iSAX

Filed under: Data Mining,iSAX,SAX,Time Series — Patrick Durusau @ 5:47 pm

iSAX

An extension of the SAX software for larger data sets. Detailed in: iSAX: Indexing and Mining Terabyte Sized Time Series.

Abstract:

Current research in indexing and mining time series data has produced many interesting algorithms and representations. However, it has not led to algorithms that can scale to the increasingly massive datasets encountered in science, engineering, and business domains. In this work, we show how a novel multiresolution symbolic representation can be used to index datasets which are several orders of magnitude larger than anything else considered in the literature. Our approach allows both fast exact search and ultra fast approximate search. We show how to exploit the combination of both types of search as sub-routines in data mining algorithms, allowing for the exact mining of truly massive real world datasets, containing millions of time series.

There are a number of data sets at this page with “…warning 500meg file.”

SAX (Symbolic Aggregate approXimation)

Filed under: Data Mining,SAX,Time Series — Patrick Durusau @ 5:47 pm

SAX (Symbolic Aggregate approXimation)

From the webpage:

SAX is the first symbolic representation for time series that allows for dimensionality reduction and indexing with a lower-bounding distance measure. In classic data mining tasks such as clustering, classification, index, etc., SAX is as good as well-known representations such as Discrete Wavelet Transform (DWT) and Discrete Fourier Transform (DFT), while requiring less storage space. In addition, the representation allows researchers to avail of the wealth of data structures and algorithms in bioinformatics or text mining, and also provides solutions to many challenges associated with current data mining tasks. One example is motif discovery, a problem which we defined for time series data. There is great potential for extending and applying the discrete representation on a wide class of data mining tasks.

From a testimonial on the webpage:

the performance SAX enables is amazing, and I think a real breakthrough. As an example, we can find similarity searches using edit distance over 10,000 time series in 50 milliseconds. (Ray Cromwell, Timepedia.org)

Don’t usually see “testimonials” on an academic website but they appear to be merited in this case.

Serious similarity software. Take the time to look.

BTW, you may also be interested in a SAX time series/Shape tutorial. (120 slides about what makes SAX special.)
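For a sense of how little machinery is involved, here is a rough sketch of the SAX transform itself, not the authors’ code: z-normalize the series, average it into fixed-width frames (the Piecewise Aggregate Approximation), then map each frame average to a letter using breakpoints that cut the standard normal curve into equiprobable regions (shown here for an alphabet of size 4).

public class SaxSketch {
    // Breakpoints that split the standard normal distribution into 4
    // equiprobable regions (the usual SAX lookup-table values for a = 4).
    private static final double[] BREAKPOINTS = {-0.6745, 0.0, 0.6745};

    static String sax(double[] series, int wordLength) {
        // 1. Z-normalize (mean 0, standard deviation 1).
        double mean = 0, sd = 0;
        for (double v : series) mean += v;
        mean /= series.length;
        for (double v : series) sd += (v - mean) * (v - mean);
        sd = Math.sqrt(sd / series.length);

        // 2. Piecewise Aggregate Approximation: average each frame.
        //    (Assumes series.length is a multiple of wordLength, for brevity.)
        int frame = series.length / wordLength;
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < wordLength; i++) {
            double avg = 0;
            for (int j = i * frame; j < (i + 1) * frame; j++) {
                avg += (series[j] - mean) / sd;
            }
            avg /= frame;

            // 3. Discretize the frame average into a symbol.
            char symbol = 'a';
            for (double bp : BREAKPOINTS) {
                if (avg > bp) symbol++;
            }
            word.append(symbol);
        }
        return word.toString();
    }

    public static void main(String[] args) {
        double[] series = {2, 3, 4, 6, 9, 7, 5, 3, 2, 1, 1, 2};
        System.out.println(sax(series, 4)); // prints "bdba" for this toy series
    }
}

A lower-bounding distance between two such words (MINDIST in the papers) is what makes the representation usable for indexing.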

UCR Time Series Classification/Clustering Page

Filed under: Classification,Clustering,Dataset,Time Series — Patrick Durusau @ 5:46 pm

UCR Time Series Classification/Clustering Page

From the webpage:

This webpage has been created as a public service to the data mining/machine learning community, to encourage reproducible research for time series classification and clustering.

While chasing the details on Eamonn Keogh and his time series presentation, I encountered this collection of data sets.

SAXually Explicit Images: Data Mining Large Shape Databases

Filed under: Data Mining,Image Processing,Image Recognition,Shape — Patrick Durusau @ 5:46 pm

SAXually Explicit Images: Data Mining Large Shape Databases by Eamonn Keogh.

ABSTRACT

The problem of indexing large collections of time series and images has received much attention in the last decade, however we argue that there is potentially great untapped utility in data mining such collections. Consider the following two concrete examples of problems in data mining.

Motif Discovery (duplication detection): Given a large repository of time series or images, find approximately repeated patterns/images.

Discord Discovery: Given a large repository of time series or images, find the most unusual time series/image.

As we will show, both these problems have applications in fields as diverse as anthropology, crime…

Ancient history in the view of some, this is a Google talk from 2006!

But, it is quite well done and I enjoyed the unexpected application of time series representation to shape data for purposes of evaluating matches. It is one of those insights that will stay with you and that seems obvious after they say it.

I think topic map authors (semantic investigators generally) need to report such insights for the benefit of others.

Clojure Game of Life

Filed under: Cellular Automata,Game of Life — Patrick Durusau @ 5:46 pm

Clojure Game of Life

From the post:

This is a Conway’s Game of Life in functional style written in Clojure.

Wikipedia (Cellular Automaton) mentions:

Cellular automata are also called “cellular spaces”, “tessellation automata”, “homogeneous structures”, “cellular structures”, “tessellation structures”, and “iterative arrays”.

You may recall that Stephen Wolfram wrote A New Kind of Science (1280 pages) about cellular automata. Had a great author. Needed a great editor as well.

At a minimum, I take cellular automata for the proposition that computational artifacts exist, whether we expect or forecast them or not.

At a maximum, well, that’s an open research question isn’t it?
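For anyone who wants the update rule itself rather than the philosophy, a minimal sketch of one generation of Life (plain Java here, rather than the Clojure of the linked post): a live cell survives with two or three live neighbors, and a dead cell is born with exactly three.

public class LifeStep {
    // Compute one generation of Conway's Game of Life on a fixed grid.
    // Cells outside the array are treated as dead.
    static boolean[][] step(boolean[][] grid) {
        int rows = grid.length, cols = grid[0].length;
        boolean[][] next = new boolean[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                int neighbors = 0;
                for (int dr = -1; dr <= 1; dr++) {
                    for (int dc = -1; dc <= 1; dc++) {
                        if (dr == 0 && dc == 0) continue;
                        int nr = r + dr, nc = c + dc;
                        if (nr >= 0 && nr < rows && nc >= 0 && nc < cols && grid[nr][nc]) {
                            neighbors++;
                        }
                    }
                }
                // Birth on exactly 3 neighbors; survival on 2 or 3.
                next[r][c] = neighbors == 3 || (grid[r][c] && neighbors == 2);
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // A "blinker": three cells in a row oscillate between horizontal and vertical.
        boolean[][] grid = new boolean[5][5];
        grid[2][1] = grid[2][2] = grid[2][3] = true;
        boolean[][] next = step(grid);
        System.out.println(next[1][2] && next[2][2] && next[3][2]); // true
    }
}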

Elixir – A modern approach to programming for the Erlang VM

Filed under: Elixir,Erlang — Patrick Durusau @ 5:46 pm

Elixir

From the homepage:

Elixir is a programming language built on top of the Erlang VM. As Erlang, it is a functional language built to support distributed, fault-tolerant, non-stop applications with hot code swapping.

Elixir is also dynamic typed but, differently from Erlang, it is also homoiconic, allowing meta-programming via macros. Elixir also supports polymorphism via protocols (similar to Clojure’s), dynamic records and provides a reference mechanism.

Finally, Elixir and Erlang share the same bytecode and data types. This means you can invoke Erlang code from Elixir (and vice-versa) without any conversion or performance hit. This allows a developer to mix the expressiveness of Elixir with the robustness and performance of Erlang.

If you want to install Elixir or learn more about it, check our getting started guide. [Former link, http://elixir-lang.org/getting_started/1.html updated to: http://elixir-lang.org/getting-started/introduction.html.]

Quite possibly of interest to Erlang programmers.

Take a close look at the languages mentioned in the Wikipedia article on homoiconicity as other examples of homoiconic languages.

Question: The list contains “successful” and “unsuccessful” languages. Care to comment on possible differences that account for the outcomes?

Thinking a “successful” semantic mapping language will need to have certain characteristics. The question is, of course, which ones?

Scaling Solr Indexing with SolrCloud, Hadoop and Behemoth

Filed under: Behemoth,Hadoop,Solr,SolrCloud — Patrick Durusau @ 5:45 pm

Scaling Solr Indexing with SolrCloud, Hadoop and Behemoth

Grant Ingersoll writes:

We’ve been doing a lot of work at Lucid lately on scaling out Solr, so I thought I would blog about some of the things we’ve been working on recently and how it might help you handle large indexes with ease. First off, if you want a more basic approach using versions of Solr prior to what will be Solr4 and you don’t care about scaling out Solr indexing to match Hadoop or being fault tolerant, I recommend you read Indexing Files via Solr and Java MapReduce. (Note, you could also modify that code to handle these things. If you need do that, we’d be happy to help.)

Instead of doing all the extra work of making sure instances are up, etc., however, I am going to focus on using some of the new features of Solr4 (i.e. SolrCloud whose development effort has been primarily led by several of my colleagues: Yonik Seeley, Mark Miller and Sami Siren) which remove the need to figure out where to send documents when indexing, along with a convenient Hadoop-based document processing toolkit, created by Julien Nioche, called Behemoth that takes care of the need to write any Map/Reduce code and also handles things like extracting content from PDFs and Word files in a Hadoop friendly manner (think Apache Tika run in Map/Reduce) while also allowing you to output the results to things like Solr or Mahout, GATE and others as well as to annotate the intermediary results. Behemoth isn’t super sophisticated in terms of ETL (Extract-Transform-Load) capabilities, but it is lightweight, easy to extend and gets the job done on Hadoop without you having to spend time worrying about writing mappers and reducers.

If you are pushing the boundaries of your Solr 3.* installation or just want to know more about Solr4, this post is for you.

April 1, 2012

Syrian crowdmapping project documents reports of rape

Filed under: Crowd Sourcing — Patrick Durusau @ 7:13 pm

Syrian crowdmapping project documents reports of rape

Niall Firth, technology editor for the New Scientist, writes:

Earlier this month, an unnamed woman in the village of Sahl Al-Rawj, Syria, left the safety of her hiding place to plead for the lives of her husband and son as government forces advanced. She was captured and five soldiers took turns raping her as she was forced to watch her husband die.

Her shocking story – officially unverified – is just one of many reports of sexual violence against women that has come out of Syria as fighting continues between government forces and rebels. Now a crowd-mapping website, launched this week, will attempt to detail every such rape and incident of sexual violence against women throughout the conflict.

The map is the creation of the Women under Siege initiative, and uses the same crowdsourcing technology developed by Washington DC-based Ushahidi, which is also being used to calculate the death toll in the recent fighting.

I read not all that long ago that underreporting of rape is 60% among civilians and 80% among the military. Military Sexual Abuse: A Greater Menace Than Combat

Would a mapping service such as the one created for the conflict in Syria help with the underreporting of rape in the United States? That would at least document the accounts of rape victims and the locations of their attacks.

Greater reporting of rapes and their locations is a first step.

Topic maps could help with the next step: Outing Rapists.

Outing Rapists means binding the accounts and locations of rapes to Facebook, faculty, department, and government listings of rapists.

Reporting a rape will help you help yourself. Anonymously or otherwise.

Outing a rapist may prevent a future rape.

A couple of resources out of thousands on domestic or sexual violence: National Center on Domestic and Sexual Violence or U.S. Military Violence Against Women.

Stupid Solr tricks: Introduction (SST #0)

Filed under: Indexing,Lucene,Solr — Patrick Durusau @ 7:13 pm

Stupid Solr tricks: Introduction (SST #0)

Bill Dueber writes:

Those of you who read this blog regularly (Hi Mom!) know that while we do a lot of stuff at the University of Michigan Library, our bread-and-butter these days are projects that center around Solr.

Right now, my production Solr is running an ancient nightly of version 1.4 (i.e., before 1.4 was even officially released), and reflects how much I didn’t know when we first started down this path. My primary responsibility is for Mirlyn, our catalog, but there’s plenty of smart people doing smart things around here, and I’d like to be one of them.

Solr has since advanced to 3.x (with version 4 on the horizon), and during that time I’ve learned a lot more about Solr and how to push it around. More importantly, I’ve learned a lot more about our data, the vagaries in the MARC/AACR2 that I process and how awful so much of it really is.

So…starting today I’m going to be doing some on-the-blog experiments with a new version of Solr, reflecting some of the problems I’ve run into and ways I think we can get more out of Solr.

Definitely a series to watch, or to contribute to, or better yet, to start for your software package of choice!

Flexible Searching with Solr and Sunspot

Filed under: Indexing,Query Expansion,Solr — Patrick Durusau @ 7:13 pm

Flexible Searching with Solr and Sunspot.

Mike Pack writes:

Just about every type of datastore has some form of indexing. A typical relational database, such as MySQL or PostgreSQL, can index fields for efficient querying. Most document databases, like MongoDB, contain indexing as well. Indexing in a relational database is almost always done for one reason: speed. However, sometimes you need more than just speed, you need flexibility. That’s where Solr comes in.

In this article, I want to outline how Solr can benefit your project’s indexing capabilities. I’ll start by introducing indexing and expand to show how Solr can be used within a Rails application.

If you are a Ruby fan (or not), this post is a nice introduction to some of the power of Solr for indexing.

At the same time, it is a poster child for what is inflexible about Solr query expansion.

Mike uses the following example for synonyms/query expansion:

# citi is the stem of cities
citi => city

# copi is the stem of copies
copi => copy

Well, that works, no doubt, if those expansions are uniform across a body of texts. Depending on the size of the collection, that may or may not be the case. That is, the expansion of strings may or may not be uniform.

We could say:

#cop is a synonym for the police
cop => police

Meanwhile, elsewhere in the collection we need:

#cop is the stem of copulate
cop => copulate

Without more properties to distinguish the two (or more) cases, we are going to get false positives in one case or the other.

How to set up a maven project with Neo4j in Eclipse

Filed under: Eclipse,Neo4j — Patrick Durusau @ 7:12 pm

How to set up a maven project with Neo4j in Eclipse

Peter Neubauer writes:

repeatedly there are questions on how to set up Neo4j with Eclipse and the Maven integration for it.

Which is followed by the “short” version. Makes me wonder if there is a “long” version? 😉

I don’t see this question/answer in the documentation. If someone thought enough about a topic to ask, it really should appear in the documentation.

Precise data extraction with Apache Nutch

Filed under: Nutch,Search Data,Search Engines,Searching — Patrick Durusau @ 7:12 pm

Precise data extraction with Apache Nutch By Emir Dizdarevic.

From the post:

Nutch’s HtmlParser parses the whole page and returns parsed text, outlinks and additional meta data. Some parts of this are really useful like the outlinks but that’s basically it. The problem is that the parsed text is too general for the purpose of precise data extraction. Fortunately the HtmlParser provides us a mechanism (extension point) to attach an HtmlParserFilter to it.

We developed a plugin, which consists of HtmlParserFilter and IndexingFilter extensions, which provides a mechanism to fetch and index the desired data from a web page through use of XPath 1.0. The name of the plugin is filter-xpath plugin.

Using this plugin we are now able to extract the desired data from a web site with known structure. Unfortunately the plugin is an extension of the HtmlParserFilter extension point, which is tightly coupled to the HtmlParser, hence the plugin won’t work without the HtmlParser. The HtmlParser generates its own metadata (host, site, url, content, title, cache and tstamp) which will be indexed too. One way to control this is by not including IndexFilter plugins which depend on the metadata to generate the indexing data (NutchDocument). The other way is to change the SOLR index mappings in the solrindex-mapping.xml file (maps NutchDocument fields to SolrInputDocument fields). That way we will index only the fields we want.

The next problem arises when it comes to indexing. We want Nutch to fetch every page on the site but we don’t want to index them all. If we use the UrlRegexFilter to control this we will lose the indirect links which we also want to index and add to our URL DB. To address this problem we developed another plugin which is an extension of the IndexingFilter extension point, called the index-omit plugin. Using this plugin we are able to omit indexing on the pages we don’t need.

Great post on precision and data extraction.

And a lesson that indexing more isn’t the same thing as indexing smarter.
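The plugin hooks into Nutch’s HtmlParserFilter extension point, but the heart of the idea is simply evaluating XPath expressions against the parsed document. A rough standalone sketch using standard JAXP (not the actual filter-xpath plugin; the file name, expression, and field name are made up):

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathExtractSketch {
    public static void main(String[] args) throws Exception {
        // In the plugin this Document comes from Nutch's HTML parse;
        // here we just load a well-formed local file for illustration.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("page.xhtml");

        XPath xpath = XPathFactory.newInstance().newXPath();

        // Hypothetical expression: pull product names out of a known page structure.
        NodeList names = (NodeList) xpath.evaluate(
                "//div[@class='product']/h2/text()", doc, XPathConstants.NODESET);

        for (int i = 0; i < names.getLength(); i++) {
            // In the plugin these values would be added to the NutchDocument
            // as an indexed field (e.g. "productName") instead of printed.
            System.out.println(names.item(i).getNodeValue());
        }
    }
}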
