Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

June 16, 2011

GraphLab Abstraction

Filed under: GraphLab,Graphs — Patrick Durusau @ 3:41 pm

GraphLab Abstraction

Abstract:

The GraphLab abstraction is the product of several years of research in designing and implementing systems for statistical inference in probabilistic graphical models. Early in our work [12], we discovered that the high-level parallel abstractions popular in the ML community such as MapReduce [2, 13] and parallel BLAS [14] libraries are unable to express statistical inference algorithms efficiently. Our work revealed that an efficient algorithm for graphical model inference should explicitly address the sparse dependencies between random variables and adapt to the input data and model parameters.

Guided by this intuition we spent over a year designing and implementing various machine learning algorithms on top of low-level threading primitives and distributed communication frameworks such as OpenMP [15], CILK++ [16] and MPI [1]. Through this process, we discovered the following set of core algorithmic patterns that are common to a wide range of machine learning techniques. Following, we detail our findings and motivate why a new framework is needed (see Table 1).
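
To make the abstraction concrete for readers who have not seen it, here is a toy sketch of the vertex-centric, adaptive update loop that GraphLab generalizes. The graph, the update rule, and the scheduling policy below are illustrative assumptions, not GraphLab's actual API; the point is simply that work is rescheduled only where a neighbor's value changed, which is the sparse, adaptive pattern the abstract describes.

    from collections import deque

    # Toy graph: vertex -> neighbors (undirected for simplicity).
    graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    value = {v: 1.0 for v in graph}          # per-vertex state
    DAMPING, TOL = 0.85, 1e-4

    def update(v):
        """Recompute v's value from its neighbors; return the change."""
        new = (1 - DAMPING) + DAMPING * sum(
            value[u] / len(graph[u]) for u in graph[v])
        delta, value[v] = abs(new - value[v]), new
        return delta

    # Adaptive scheduling: only vertices whose inputs changed get re-run.
    work = deque(graph)
    while work:
        v = work.popleft()
        if update(v) > TOL:
            work.extend(graph[v])            # reschedule affected neighbors

    print(value)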

See also our prior post on GraphLab.

Clustering with outliers

Filed under: Clustering — Patrick Durusau @ 3:41 pm

Clustering with outliers

Part of a series of posts on clustering. From the most recent post (13 June 2011):

When abstracting the clustering problem, we often assume that the data is perfectly clusterable and so we only need to find the right clusters. But what if your data is not so perfect? Maybe there’s background noise, or a few random points were added to your data by an adversary. Some clustering formulations, in particular k-center or k-means, are not stable — the addition of a single point can dramatically change the optimal clustering. For the case of k-center, if a single point x is added far away from all of the original data points, it will become its own center in the optimum solution, necessitating that the other points are only clustered with k−1 centers.
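
A tiny numeric illustration of that instability, using scikit-learn's KMeans purely for convenience (the points are made up): one far-away outlier captures a center for itself, leaving the real structure to be covered by the remaining k−1 centers.

    import numpy as np
    from sklearn.cluster import KMeans  # used here only for illustration

    # Two tight, well-separated groups of points on a line.
    clean = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
    noisy = np.vstack([clean, [[1000.0]]])   # one adversarial outlier

    for name, data in [("clean", clean), ("with outlier", noisy)]:
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
        print(name, "centers:", np.round(km.cluster_centers_.ravel(), 1))

    # On the clean data the centers sit near 0.1 and 10.1; with the outlier,
    # one center is spent on 1000.0 and the two real groups share the other.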

Clustering is relevant to topic maps in a couple of ways.

First, there are numerous collective subjects, sports teams, music groups, military units, etc., that all have some characteristic by which they can be gathered together.

Second, in some very real sense, when all the information about a subject is brought together, clustering would be a fair description of that activity. True, it is clustering with some extra processing thrown in, but it is still clustering. Just a bit more fine-grained.

Not to mention that researchers have been working on clustering algorithms for years and they certainly should be part of any topic map authoring tool.

Apache Lucene EuroCon Barcelona

Filed under: Conferences,Lucene,Search Engines — Patrick Durusau @ 3:40 pm

Apache Lucene EuroCon Barcelona

From the webpage:

Apache Lucene EuroCon 2011 is the largest conference for the European Apache Lucene/Solr open source search community. Now in its second year, Apache Lucene Eurocon provides an unparalleled opportunity for European search application developers, thought leaders and market makers to connect and network with their peers and get on board with the technology that’s changing the shape of search: Apache Lucene/Solr.

The conference, taking place in cosmopolitan Barcelona, features a wide range of hands-on technical sessions, spanning the breadth and depth of use cases and technical sessions — plus a complete set of technical training workshops. You will hear from the foremost experts on open source search technology, committers and developers practiced in the art and science of search. When you’re at Apache Lucene Eurocon, you can…

Even with feel-me-up security measures at the airport, a trip to Barcelona would be worthwhile anytime. Add a Lucene conference to boot, and who could refuse?

Seriously, take advantage of this opportunity to travel this year. Next year, a U.S. presidential election year, will see rumors of security alerts, actual security alerts, FBI-informant-sponsored terror plots, and the like, all of which will make travel more difficult.

June 15, 2011

Partitioning Graph Databases

Filed under: Algorithms,Graph Partitioning,Graphs — Patrick Durusau @ 3:11 pm

Partitioning Graph Databases by ALEX AVERBUCH & MARTIN NEUMANN.

Focusing on Neo4j, the thesis reports that, compared to random partitioning, the algorithms evaluated here reduce inter-partition traffic by 40 to 90%, depending on the dataset.

Abstract:

The amount of globally stored, electronic data is growing at an increasing rate. This growth is both in size and connectivity, where connectivity refers to the increasing presence of, and interest in, relationships between data [12]. An example of such data is the social network graph created and stored by Twitter [2].

Due to this growth, demand is increasing for technologies that can process such data. Currently relational databases are the predominant data storage technology, but they are poorly suited to processing connected data as they are optimized for index-intensive operations. Conversely, the storage engines of graph databases are optimized for graph computation as they store records adjacent to one another, linked by direct references. This enables retrieval of adjacent elements in constant time, regardless of graph size, and allows for relationships to be followed without performing costly index lookups. However, as data volume increases these databases outgrow the resources available on a single computer, and partitioning the data becomes necessary. At present, few graph databases are capable of doing this [6].

In this work we evaluate the viability of using graph partitioning algorithms as a means of partitioning graph databases, with focus on the Neo4j graph database [4]. For this purpose, a prototype partitioned database was developed. Then, three partitioning algorithms were explored and one implemented. During evaluation, three graph datasets were used: two from production applications, and one synthetically generated. These were partitioned in various ways and the impact on database performance was measured. To gauge this impact, we defined one synthetic access pattern per dataset and executed each one on the partitioned datasets. Evaluation took place in a simulation environment, which ensured repeatability and made it possible to measure certain metrics, such as network traffic and load balance.

Simulation results show that, compared to random partitioning, use of a graph partitioning algorithm reduced inter-partition traffic by 40 to 90%, depending on dataset. Executing the algorithm intermittently during database usage was shown to maintain partition quality, while requiring only 1% of the computation time of initially partitioning the datasets. Finally, a strong correlation was found between theoretic graph partitioning quality metrics and the generated inter-partition traffic under non-uniform access patterns. Our results suggest that use of such algorithms to partition graph databases can result in significant performance benefits, and warrants further investigation.
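
The 40 to 90% figure is essentially the edge cut of a good partition compared with a random one. A minimal, self-contained sketch of how such a comparison can be measured on any graph (toy data only, unrelated to the thesis datasets):

    import random

    # Toy graph as an edge list; a real test would load Neo4j data instead.
    edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
    nodes = {n for e in edges for n in e}

    def cut(partition):
        """Count edges whose endpoints land in different partitions."""
        return sum(partition[u] != partition[v] for u, v in edges)

    random.seed(0)
    random_part = {n: random.randint(0, 1) for n in nodes}   # random split
    good_part = {n: 0 if n <= 2 else 1 for n in nodes}       # respects the two triangles

    print("random cut:", cut(random_part))   # proxy for inter-partition traffic
    print("good cut:  ", cut(good_part))     # only the single bridging edge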

What you did not see at the JBoss World 2011 keynote demo

Filed under: Hibernate,JBoss,Visualization — Patrick Durusau @ 3:08 pm

JBoss World 2011 keynote demo

From the webpage:

Visualizing data structures is not easy, and I’m confident that a great deal of success of the exceptionally well received demo we presented at the JBoss World 2011 keynote originated from the nice web UIs projected on the multiple big screens. These web applications were effectively visualizing the tweets flowing, the voted hashtags highlighted in the tagcloud, and the animated Infinispan grid while the nodes were dancing on an ideal hashwheel visualizing the data distribution among the nodes.

So I bet that everybody in the room got a clear picture of the fact that the data was stored in Infinispan, and by live unplugging a random server everybody could see the data reorganize itself, making it seem a simple and natural way to process huge amounts of data. Not all technical details were explained, so in this and the following post we’re going to detail what you did not see: how was the data stored, how could Drools filter the data, how could all visualizations load the grid stored data, and still be developed in record time?
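
The "hashwheel" in the demo is a consistent-hash ring: nodes and keys hash onto a circle, each key belongs to the next node clockwise, and unplugging a node only moves that node's slice of keys. A rough sketch of the idea, not Infinispan's actual implementation:

    import bisect
    import hashlib

    def h(s):
        """Hash a string onto the ring [0, 2**32)."""
        return int(hashlib.md5(s.encode()).hexdigest(), 16) % 2**32

    class HashWheel:
        def __init__(self, nodes):
            self.ring = sorted((h(n), n) for n in nodes)

        def owner(self, key):
            """First node clockwise from the key's position."""
            points = [p for p, _ in self.ring]
            i = bisect.bisect(points, h(key)) % len(self.ring)
            return self.ring[i][1]

        def remove(self, node):
            self.ring = [(p, n) for p, n in self.ring if n != node]

    wheel = HashWheel(["node-a", "node-b", "node-c"])
    print(wheel.owner("tweet:42"))
    wheel.remove("node-b")          # "unplugging" a server
    print(wheel.owner("tweet:42"))  # only keys owned by node-b move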

If you follow the link to the video, go to minute 41 for the start. Truly a demo worth watching.

This blog post gives the details behind the demo you see in the video.

An Architecture for Parallel Topic Models

Filed under: Latent Dirichlet Allocation (LDA),Topic Models (LDA) — Patrick Durusau @ 3:08 pm

An Architecture for Parallel Topic Models by Alexander Smola and Shravan Narayanamurthy.

Abstract:

This paper describes a high performance sampling architecture for inference of latent topic models on a cluster of workstations. Our system is faster than previous work by over an order of magnitude and it is capable of dealing with hundreds of millions of documents and thousands of topics.

The algorithm relies on a novel communication structure, namely the use of a distributed (key, value) storage for synchronizing the sampler state between computers. Our architecture entirely obviates the need for separate computation and synchronization phases. Instead, disk, CPU, and network are used simultaneously to achieve high performance. We show that this architecture is entirely general and that it can be extended easily to more sophisticated latent variable models such as n-grams and hierarchies.
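
To picture the "(key, value) storage for synchronizing the sampler state," here is a heavily simplified, single-process sketch in which workers push count deltas to a shared store as they sample. The store interface and key layout are assumptions for illustration, not the authors' system:

    from collections import defaultdict

    class KVStore:
        """Stand-in for a distributed (key, value) store such as memcached."""
        def __init__(self):
            self.counts = defaultdict(int)
        def add(self, key, delta):
            self.counts[key] += delta
        def get(self, key):
            return self.counts[key]

    store = KVStore()

    def worker_step(store, word, old_topic, new_topic):
        """One sampling move: retire the old assignment, publish the new one."""
        store.add(("word-topic", word, old_topic), -1)
        store.add(("word-topic", word, new_topic), +1)

    # Two sampling moves updating shared state, in spirit from different workers:
    worker_step(store, "graph", old_topic=3, new_topic=7)
    worker_step(store, "graph", old_topic=7, new_topic=2)
    print(store.get(("word-topic", "graph", 2)))   # 1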

Interesting how this (key, value) stuff keeps coming up these days.

The authors plan on making the codebase available for public use.


Updated 30 June 2011 to include the URL supplied by Sam Hunting. (Thanks Sam!)

elasticsearch: The Road to a Distributed, (Near) Real Time, Search Engine

Filed under: ElasticSearch,Lucene — Patrick Durusau @ 3:08 pm

elasticsearch: The Road to a Distributed, (Near) Real Time, Search Engine by Shay Banon

Covers Lucene basics and then shards and replicas using elasticsearch.
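
For anyone who has not seen the shard and replica settings in practice, a minimal sketch of creating an index with explicit counts through elasticsearch's REST API (the index name and numbers are arbitrary; adjust the host to your cluster):

    import json
    import urllib.request

    # Create an index with 3 primary shards and 1 replica of each.
    body = json.dumps({
        "settings": {"number_of_shards": 3, "number_of_replicas": 1}
    }).encode()

    req = urllib.request.Request(
        "http://localhost:9200/tweets",     # assumes a local node
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    print(urllib.request.urlopen(req).read().decode())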

LexisNexis – OpenSource – HPCC

Filed under: BigData,HPCC — Patrick Durusau @ 11:07 am

LexisNexis Announces HPCC Systems, an Open Source Platform to Solve Big Data Problems for Enterprise Customers

This rocks!

From the press release:

NEW YORK, June 15, 2011 – LexisNexis Risk Solutions today announced that it will offer its data intensive supercomputing platform under a dual license, open source model, as HPCC Systems. HPCC Systems is designed for the enterprise to solve big data problems. The platform is built on top of high performance computing technology, and has been proven with customers for the past decade. HPCC Systems provides a high performance computing cluster (HPCC) technology with a single architecture and a consistent data centric programming language. HPCC Systems is an alternative to Hadoop.

“We feel the time is right to offer our HPCC technology platform as a dual license, open source solution. We believe that HPCC Systems will take big data computing to the next level,” said James M. Peck, chief executive officer, LexisNexis Risk Solutions. “We’ve been doing this quietly for years for our customers with great success. We are now excited to present it to the community to spur greater adoption. We look forward to leveraging the innovation of the open source community to further the development of the platform for the benefit of our customers and the community,” said Mr. Peck.

To manage, sort, link, and analyze billions of records within seconds, LexisNexis developed a data intensive supercomputer that has been proven for the past ten years with customers who need to process large volumes of data. Customers such as leading banks, insurance companies, utilities, law enforcement and federal government leverage the HPCC platform technology through various LexisNexis® products and services. The HPCC platform specializes in the analysis of structured and unstructured data for enterprise class organizations.

June 14, 2011

Seven Things Human Editors Do that Algorithms Don’t (Yet)

Filed under: Authoring Topic Maps,Marketing — Patrick Durusau @ 10:25 am

Seven Things Human Editors Do that Algorithms Don’t (Yet)

The seven things are all familiar:

  • Anticipation
  • Risk-taking
  • The whole picture
  • Pairing
  • Social importance
  • Mind-blowingness
  • Trust

At least for topic map authors.

How are you selling the human authorial input into your topic maps?

This list looks like a good place to start.

The Information Explosion and a Great Article by Grossman and Cormack on Legal Search

Filed under: e-Discovery,Legal Informatics — Patrick Durusau @ 10:25 am

The Information Explosion and a Great Article by Grossman and Cormack on Legal Search

A discussion of the “information explosion” and a review of Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, Richmond Journal of Law and Technology.

See what you think but I don’t read the article as debunking exhaustive manual review (by humans) so much as introducing technology to make human reviewers more effective.

Still human review, but assisted by technology to leverage it for large document collections.

As the article notes, the jury (sorry!) is still out on what assisted search methods work the best. This is an area where human recognition of subjects and recording that recognition for others, such as the use of different names for parties to litigation, would be quite useful. I would record that recognition using topic maps, but that isn’t surprising.

Quadrennial Defense Review

Filed under: Marketing — Patrick Durusau @ 10:24 am

Quadrennial Defense Review

Not light reading but if you are interested in the broad outlines of how U.S. defense may develop, this is one place to start.

I mention it because topic map applications can compete with existing applications that are already in place, or they can be the solution to problems without an installed base. Which path you take depends upon your particular strategy for promoting topic maps or topic map based services.

Another reason for reading this and similar material is to pick up the vocabulary in which needs will be expressed, so that you can pitch topic maps as solutions in terms of needs as seen by the prospective client, in this case the DoD. Pitching subject-centric processing to someone looking to:

fully implement the National Security Professional (NSP) program to improve cross-agency training, education, and professional experience opportunities.
(page 71)

isn’t likely to be successful. Subject-centric processing may be the best way to accomplish their goal, but the focus should be on achieving their goal. Once they are satisfied that is the case, maybe they will ask how it is being done. Maybe not. The important thing is for them to say: “I want more of that.”

OpenCV haartraining

Filed under: Machine Learning,Topic Maps — Patrick Durusau @ 10:24 am

OpenCV haartraining (Rapid Object Detection With A Cascade of Boosted Classifiers Based on Haar-like Features)

From the post:

The OpenCV library provides us a greatly interesting demonstration for a face detection. Furthermore, it provides us programs (or functions) that they used to train classifiers for their face detection system, called HaarTraining, so that we can create our own object classifiers using these functions. It is interesting.
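
Training a cascade is the slow part; running a pre-trained one is quick. A hedged sketch using OpenCV's Python bindings and the stock frontal-face cascade (the input file name is hypothetical, and the cascade path varies by install):

    import cv2

    # Load a pre-trained Haar cascade; cv2.data.haarcascades points at the
    # bundled XML files in recent OpenCV releases (path is an assumption).
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    image = cv2.imread("people.jpg")                     # hypothetical input
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    for (x, y, w, h) in faces:
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite("people_faces.jpg", image)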

I am not sure about the “rapid” part in the title because the author points out he typically waits a week to check for results. 😉

I suppose it is all relative.

Assuming larger hardware resources, it occurred to me that face detection could be of interest to topic map authors or, more importantly, to people who buy topic maps or topic map services.

At some point, video surveillance will have to improve beyond the convenience store video showing a robbery in progress, to something more sophisticated.

It is all well and good to take video of everyone in the central parts of London, but other than spotting people about to commit a crime or recognizing someone who is a known person of interest, how useful is that?

Imagine a system that assists human reviewers with suggested matches, not only to identity records, but also with suggested links to other individuals, whether seen in their presence or intersecting in other patterns, such as incoming passenger lists.

Hopefully this tutorial will spark your thinking on how to use topic maps with video recognition systems.

Functional thinking: Thinking functionally, Part 2

Filed under: Functional Programming — Patrick Durusau @ 10:23 am

Functional thinking: Thinking functionally, Part 2

From the post:

In the first installment of this series, I began discussing some of the characteristics of functional programming, showing how those ideas manifest in both Java and more-functional languages. In this article, I’ll continue this tour of concepts by talking about first-class functions, optimizations, and closures. But the underlying theme of this installment is control: when you want it, when you need it, and when you should let it go.

Functional thinking:…Part 1
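
The series works in Java and functional JVM languages; the same first-class-function and closure vocabulary can be pinned down in a few lines of Python, shown here only to make the terms concrete:

    def make_counter(start=0):
        """Return a function that closes over its own mutable count."""
        count = start
        def bump(step=1):
            nonlocal count       # the closure captures 'count', not a copy
            count += step
            return count
        return bump

    tick = make_counter()        # functions are values we can pass around
    print(tick(), tick(), tick())                   # 1 2 3
    print(list(map(make_counter(10), [1, 2, 3])))   # [11, 13, 16]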

Hypertable 0.9.5.0.pre6

Filed under: Hypertable — Patrick Durusau @ 10:23 am

Hypertable 0.9.5.0.pre6

From the release notes:

  • Fixed bug in MaintenanceScheduler introduced w/ merging compactions
  • Fixed bug in the FileBlockCache wrt growing to accommodate
  • Added support for DELETE lines in .tsv files
  • Added check for DFSBROKER_BAD_FILENAME on skip_not_found
  • Added --metadata-tsv option to metalog_dump
  • Fixed bug whereby get_table_splits() was returning stale results for previously dropped tables
  • Added MaintenanceScheduler state dump via existence of run/debug-scheduler file
  • Fixed FMR in TableMutator timeout

Clojure Tutorial for Clojure Newbies

Filed under: Clojure,Searching — Patrick Durusau @ 9:19 am

Clojure Tutorial for Clojure Newbies

From the webpage:

If, like so many other developers, you suffer from a mild addiction to Hacker News, then you’ve surely run across a reference to the Clojure language (pronounced “closure”) more than once while perusing the latest headlines. A Lisp dialect targeting the JVM runtime, Clojure users stand to gain from the portability and stability of the JVM, the rich syntax of a functional language, and the ability to integrate with the enormous Java ecosystem. These advantages have clearly resonated with the programming community, as a swarm of activity has amassed around the language in the four years since its inception.

If you’ve been wondering what all the buzz is about, this article offers a practical, hands-on overview to Clojure, covering the installation process, basic syntax, and potential for Web development.

This isn’t really a tutorial in the sense of taking you very far, but it does have a good list of resources on page 2.

Looking for something better to recommend, I got the following search results:

  • clojure tutorial – 468,000 “hits”
  • “clojure tutorial” – 2,040 “hits”

The third “hit” on the last query was Clojure Tutorials at Learn Clojure.

Maybe I should do something more systematic on the power of parentheses. What do you think?

June 13, 2011

Why Schema.org Will Win

Filed under: Ontology,OWL,RDF,Schema,Semantic Web — Patrick Durusau @ 7:04 pm

It isn’t hard to see why schema.org is going to win out over “other” semantic web efforts.

The first paragraph at the schema.org website says why:

This site provides a collection of schemas, i.e., html tags, that webmasters can use to markup their pages in ways recognized by major search providers. Search engines including Bing, Google and Yahoo! rely on this markup to improve the display of search results, making it easier for people to find the right web pages.

  • Easy: Uses HTML tags
  • Immediate Utility: Recognized by Bing, Google and Yahoo!
  • Immediate Payoff: People can find the right web pages (your web pages)

Ironic that when HTML came on the scene, any number of hypertext engines offered more complex and useful approaches to hypertext.

But the advantages of HTML were:

  • Easy: Used simple tags
  • Immediate Utility: Useful to the author
  • Immediate Payoff: Joins hypertext network for others to find (your web pages)

I think the third advantage in each case is the crucial one. We are vain enough that making our information more findable is a real incentive, if there is a reasonable expectation of it being found. Today or tomorrow. Not ten years from now.

Linking Science and Semantics… (webinar), 15 June 2011, 10 AM PT (17:00 GMT)

Filed under: Bioinformatics,Biomedical,OWL,RDF,Semantics — Patrick Durusau @ 7:03 pm

Linking science and semantics with the Annotation Ontology and the SWAN Annotation Tool

Abstract:

The Annotation Ontology (AO) is an open ontology in OWL for annotating scientific documents on the web. AO supports both human and algorithmic content annotation. It enables “stand-off” or independent metadata anchored to specific positions in a web document by any one of several methods. In AO, the document may be annotated but is not required to be under update control of the annotator. AO contains a provenance model to support versioning, and a set model for specifying groups and containers of annotation.

The SWAN Annotation Tool, recently renamed DOMEO (Document Metadata Exchange Organizer), is an extensible web application enabling users to visually and efficiently create and share ontology-based stand-off annotation metadata on HTML or XML document targets, using the Annotation Ontology RDF model. The tool supports manual, fully automated, and semi-automated annotation with complete provenance records, as well as personal or community annotation with access authorization and control.
[AO] http://code.google.com/p/annotation-ontology

I’m interested in how “stand-off” annotation is being handled, being an overlapping markup person myself. Also curious how close it comes to HyTime like mechanisms.

More after the webinar.

Starfish: A Self-Tuning System for Big Data Analytics

Filed under: BigData,Hadoop,Topic Maps,Usability — Patrick Durusau @ 7:02 pm

Starfish: A Self-Tuning System for Big Data Analytics by Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, and Shivnath Babu, of Duke University.

Abstract:

Timely and cost-effective analytics over “Big Data” is now a key ingredient for success in many businesses, scientific and engineering disciplines, and government endeavors. The Hadoop software stack—which consists of an extensible MapReduce execution engine, pluggable distributed storage engines, and a range of procedural to declarative interfaces—is a popular choice for big data analytics. Most practitioners of big data analytics—like computational scientists, systems researchers, and business analysts—lack the expertise to tune the system to get good performance. Unfortunately, Hadoop’s performance out of the box leaves much to be desired, leading to suboptimal use of resources, time, and money (in pay-as-you-go clouds). We introduce Starfish, a self-tuning system for big data analytics. Starfish builds on Hadoop while adapting to user needs and system workloads to provide good performance automatically, without any need for users to understand and manipulate the many tuning knobs in Hadoop. While Starfish’s system architecture is guided by work on self-tuning database systems, we discuss how new analysis practices over big data pose new challenges, leading us to different design choices in Starfish.

Accepts that usability is, at least for this project, more important than peak performance. That is, the goal is to open up use of Hadoop, with reasonable performance, to a large number of non-expert users. That will probably do as much, if not more, than the native performance of Hadoop to spread its use in a number of sectors.

Makes me wonder what acceptance of usability over precision would look like for topic maps? Suggestions?

June 12, 2011

U.S. DoD Is Buying. Are You Selling?

Filed under: BigData,Data Analysis,Data Integration,Data Mining — Patrick Durusau @ 4:14 pm

CTOVision.com reports: Big Data is Critical to the DoD Science and Technology Investment Agenda

Of the seven reported priorities:

(1) Data to Decisions – science and applications to reduce the cycle time and manpower requirements for analysis and use of large data sets.

(2) Engineered Resilient Systems – engineering concepts, science, and design tools to protect against malicious compromise of weapon systems and to develop agile manufacturing for trusted and assured defense systems.

(3) Cyber Science and Technology – science and technology for efficient, effective cyber capabilities across the spectrum of joint operations.

(4) Electronic Warfare / Electronic Protection – new concepts and technology to protect systems and extend capabilities across the electro-magnetic spectrum.

(5) Counter Weapons of Mass Destruction (WMD) – advances in DoD’s ability to locate, secure, monitor, tag, track, interdict, eliminate and attribute WMD weapons and materials.

(6) Autonomy – science and technology to achieve autonomous systems that reliably and safely accomplish complex tasks, in all environments.

(7) Human Systems – science and technology to enhance human-machine interfaces to increase productivity and effectiveness across a broad range of missions

I don’t see any where topic maps would be out of place.

Do you?

clusterPy: Library of spatially constrained clustering algorithms

Filed under: Clustering,Geo Analytics,Geographic Data,Geographic Information Retrieval — Patrick Durusau @ 4:13 pm

clusterPy: Library of spatially constrained clustering algorithms

From the webpage:

Analytical regionalization (also known as spatially constrained clustering) is a scientific way to decide how to group a large number of geographic areas or points into a smaller number of regions based on similarities in one or more variables (i.e., income, ethnicity, environmental condition, etc.) that the researcher believes are important for the topic at hand. Conventional conceptions of how areas should be grouped into regions may either not be relevant to the information one is trying to illustrate (i.e., using political regions to map air pollution) or may actually be designed in ways to bias aggregated results.

A Few Subjects Go A Long Way

Filed under: Data Analysis,Language,Linguistics,Text Analytics — Patrick Durusau @ 4:11 pm

A post by Rich Cooper (Rich AT EnglishLogicKernel DOT com), Analyzing Patent Claims, demonstrates the power of small vocabularies (sets of subjects) for the analysis of patent claims.

It is a reminder that a topic map author need not identify every possible subject, but only so many of those as necessary. Other subjects abound and await other authors who wish to formally recognize them.

It is also a reminder that a topic map need only be as complex or as complete as necessary for a particular task. My topic map may not be useful for Mongolian herdsmen or even the local bank. But, the test isn’t an abstract but a practical. Does it meet the needs of its intended audience?

Dremel: Interactive Analysis of Web-Scale Datasets

Filed under: BigData,Data Analysis,Data Structures,Dremel,MapReduce — Patrick Durusau @ 4:10 pm

Google, along with Bing and Yahoo!, has been attracting a lot of discussion for venturing into web semantics without asking permission.

However that turns out, please don’t miss:

Dremel: interactive analysis of web-scale datasets

Abstract:

Dremel is a scalable, interactive ad hoc query system for analysis of read-only nested data. By combining multilevel execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.
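
Dremel's columnar layout for nested records is subtle (it tracks repetition and definition levels), but the basic move of striping records into per-field columns can be sketched crudely. This toy flattener ignores those levels entirely; it only shows why a query that touches a few fields never reads the others:

    from collections import defaultdict

    records = [
        {"name": "doc1", "links": {"forward": [20, 40]}},
        {"name": "doc2", "links": {"forward": [80]}},
    ]

    def stripe(record, prefix="", columns=None):
        """Crudely flatten a nested record into dotted-path columns."""
        columns = defaultdict(list) if columns is None else columns
        for key, value in record.items():
            path = f"{prefix}{key}"
            if isinstance(value, dict):
                stripe(value, path + ".", columns)
            elif isinstance(value, list):
                columns[path].extend(value)
            else:
                columns[path].append(value)
        return columns

    columns = defaultdict(list)
    for r in records:
        stripe(r, columns=columns)
    print(dict(columns))
    # {'name': ['doc1', 'doc2'], 'links.forward': [20, 40, 80]}
    # A query touching only 'name' never has to read 'links.forward'.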

I am still working through the article but “…aggregation queries over trillion-row tables in seconds,” is obviously of interest for a certain class of topic map.

If You Have Too Much Data, then “Good Enough” Is Good Enough

Filed under: BigData,Data,Data Models — Patrick Durusau @ 4:10 pm

If You Have Too Much Data, then “Good Enough” Is Good Enough by Pat Helland.

This is a must read article where the author concludes:

The database industry has benefited immensely from the seminal work on data theory started in the 1970s. This work changed the world and continues to be very relevant, but it is apparent now that it captures only part of the problem.

We need a new theory and taxonomy of data that must include:

  • Identity and versions. Unlocked data comes with identity and optional versions.
  • Derivation. Which versions of which objects contributed to this knowledge? How is their schema interpreted? Changes to the source would drive a recalculation just as in Excel. If a legal reason means the source data may not be used, you should forget about using the knowledge derived from it.
  • Lossyness of the derivation. Can we invent a bounding that describes the inaccuracies introduced by derived data? Is this a multidimensional inaccuracy? Can we differentiate loss from the inaccuracies caused by sheer size?
  • Attribution by pattern. Just like a Mulligan stew, patterns can be derived from attributes that are derived from patterns (and so on). How can we bound taint from knowledge that we are not legally or ethically supposed to have?
  • Classic locked database data. Let’s not forget that any new theory and taxonomy of data should include the classic database as a piece of the larger puzzle.

The example of data relativity, a local “now” in data systems, which may not be consistent with the state at some other location, was particularly good.

June 11, 2011

Hadoop: What is it Good For? Absolutely … Something

Filed under: Hadoop,Marketing — Patrick Durusau @ 12:43 pm

Hadoop: What is it Good For? Absolutely … Something by James Kobielus is an interesting review of how to contrast Hadoop with an enterprise database warehouse (EDW).

From the post:

So – apart from being an open-source community with broad industry momentum – what is Hadoop good for that you can’t get elsewhere? The answer to that is a mouthful, but a powerful one.

Essentially, Hadoop is vendor-agnostic in-database analytics in the cloud, leveraging an open, comprehensive, extensible framework for building complex advanced analytics and data management functions for deployment into cloud computing architectures. At the heart of that framework is MapReduce, which is the only industry framework for developing statistical analysis, predictive modeling, data mining, natural language processing, sentiment analysis, machine learning, and other advanced analytics. Another linchpin of Hadoop, Pig, is a versatile language for building data integration processing logic.
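
If "MapReduce execution engine" is still abstract, the canonical word count in Hadoop Streaming style makes the division of labor concrete. A rough Python sketch, assuming the usual Streaming contract of tab-separated key/value lines on stdin and stdout (the file name in the comment is hypothetical):

    import sys
    from itertools import groupby

    def mapper(lines):
        """Emit (word, 1) for every word; Streaming joins key/value with a tab."""
        for line in lines:
            for word in line.split():
                print(f"{word}\t1")

    def reducer(lines):
        """Input arrives sorted by key; sum the counts per word."""
        pairs = (line.rstrip("\n").split("\t") for line in lines)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            print(f"{word}\t{sum(int(v) for _, v in group)}")

    if __name__ == "__main__":
        # Run as: ... -mapper "python wc.py map" -reducer "python wc.py reduce"
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)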

Promoting Hadoop without singing Aquarius, promising us a new era in human relationships, or that we are going to be smarter than we were 100, 500, or even 1,000 years ago. Just cold hard data analysis advantages, the sort that reputations, businesses and billings are built upon. Maybe there is a lesson there for topic maps?

“Human Cognition is Limited”

Filed under: Data Mining — Patrick Durusau @ 12:43 pm

In Data Mining and Open APIs, Toby Segaran offers several reasons why data mining is important, including:

Human Cognition is Limited (slide 7)

We have all seen similar claims in data mining/processing presentations and for the most part they are just background noise until we get to the substance of the presentation.

The substance of this presentation is some useful Python code for several open interfaces and I commend it to your review. But I want to take issue with the notion that “human cognition is limited,” that we blow by so easily.

I suspect the real problem is:

Human Cognition is Unlimited

Any data mining task you can articulate could be performed by underpaid and overworked graduate assistants. The problem is that their minds wander from the task at hand: to preparation for lectures the next day, to reading assignments, to that nice piece of work they saw on the way to the lab, and to other concerns. None of which are distractions that trouble data mining programs or the machines on which they run.

What is really needed is an assistant that acts like a checker with a counter that simply “clicks” as the next body in line passes. Just enough cognition to perform the task at hand.

Since it is difficult to find humans with such limited cognition, we turn to computers to take up the gauge.

For example, campaign contributions in the United States form too large a data set for manual processing. While automated processors can dutifully provide totals, etc., they won’t notice, on their own initiative, checks going to Illinois senators and presidential candidates from “Al Capone.” The cognition of data mining programs is bestowed by their creators. That would be us.
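
The mechanical half of the "Al Capone" example is trivial to encode; the hard part is the human insight that the rule is worth writing. A toy sketch of the mechanical part, with invented records and names:

    contributions = [
        {"donor": "Jane Smith", "amount": 250, "recipient": "Sen. Example"},
        {"donor": "Al Capone", "amount": 5000, "recipient": "Sen. Example"},
    ]
    watch_list = {"al capone"}     # the part a human had to think of

    flagged = [c for c in contributions
               if c["donor"].lower() in watch_list]
    for c in flagged:
        print("review:", c["donor"], "->", c["recipient"], c["amount"])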

Image gallery: 22 free tools for data visualization and analysis

Filed under: Data Analysis,Visualization — Patrick Durusau @ 12:42 pm

Image gallery: 22 free tools for data visualization and analysis

A chart of data visualization and analysis tools from ComputerWorld with required skill levels for use. Accompanies a story which reviews each tool, usefully, albeit briefly. Gives references for further study. If you are looking for a new visualization/analysis tool or just want an overview of the area, this is a good place to start.

Writing a Simple Keyword Search Engine Using Haskell and Redis

Filed under: Haskell,Redis — Patrick Durusau @ 12:42 pm

Writing a Simple Keyword Search Engine Using Haskell and Redis

Alex Popescu says this is a good guide to “…translat[ing] logical operators in Redis set commands” which is true, but it is also an entertaining post on writing a search engine.
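
The post itself is in Haskell, but the keyword-to-set mapping it describes is easy to mirror with redis-py: one set per keyword, SINTER for AND, SUNION for OR. The index layout below is an assumption, not the post's exact schema:

    import redis   # pip install redis

    r = redis.Redis()                     # assumes a local Redis server

    def index(doc_id, text):
        """One set per keyword, holding the ids of documents that use it."""
        for word in set(text.lower().split()):
            r.sadd(f"kw:{word}", doc_id)

    index("doc1", "haskell redis search")
    index("doc2", "redis keyword search")

    print(r.sinter("kw:redis", "kw:search"))     # AND -> {b'doc1', b'doc2'}
    print(r.sunion("kw:haskell", "kw:keyword"))  # OR  -> {b'doc1', b'doc2'}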

Advanced Computer Science Courses

Filed under: Computer Science — Patrick Durusau @ 12:41 pm

Advanced Computer Science Courses

Interesting collection of links to advanced computer science courses on the WWW.

All entertaining and most of interest for anyone developing topic map applications.

MLComp

Filed under: Machine Learning — Patrick Durusau @ 12:41 pm

MLComp

Run your machine learning program on existing datasets to compare with other programs.

Or, run existing algorithms against your dataset.

Certainly an interesting idea for developing/testing machine learning algorithms or what algorithms to use with particular datasets.

Lessons Learned in Erlang Land

Filed under: Erlang — Patrick Durusau @ 12:41 pm

Lessons Learned in Erlang Land

From the post:

Kresten Krab Thorup, CTO of the Danish software development company Trifork, delivered a keynote on Erlang at Gotocon 2011. Thorup is sharing this presentation on his lessons learned in Erlang on SlideShare.

Thorup talked about how cloud computing, multi-core processors, the need for fault tolerance, and more are necessitating a shift away from the object-oriented programming paradigm. He suggests that actor programming is the best way to deal with modern programming challenges, and talks about why Erlang in particular is well suited for modern development.

The slides are useful but would be even more useful if a video of the presentation were posted as well.
