Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 18, 2011

Hadoop Basics – Post

Filed under: Hadoop,MapReduce,Subject Identity — Patrick Durusau @ 9:27 pm

Hadoop Basics by Carlo Scarioni illustrates the basics of using Hadoop.

When you read the post you will see why I selected it over any number of others.

Questions:

  1. Perform the exercise and examine the results. How accurate are they?
  2. How would you improve the accuracy?
  3. How would you have to modify the Hadoop example to use your improvements in #2?

December 25, 2010

HaLoop

Filed under: Hadoop,MapReduce — Patrick Durusau @ 5:35 pm

HaLoop, reported by Jack Park.

From the website:

The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce and Dryad are two popular platforms in which the dataflow takes the form of a directed acyclic graph of operators. However, these new platforms do not have built-in support for iterative programs, which arise naturally in many applications including data mining, web ranking, graph processing, model fitting, and so on.

….

Simply speaking, HaLoop = Ha, Loop:-) HaLoop is a modified version of the Hadoop MapReduce framework, designed to serve these applications. HaLoop not only extends MapReduce with programming support for iterative applications, but also dramatically improves their efficiency by making the task scheduler loop-aware and by adding various caching mechanisms. We evaluate HaLoop on real queries and real datasets and find that, on average, HaLoop reduces query runtimes by 1.85 compared with Hadoop, and shuffles only 4% of the data between mappers and reducers compared with Hadoop.
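The "iterative programs" the quote has in mind are computations like PageRank, where the same map/reduce pass runs repeatedly over a loop-invariant input until convergence. Here is a minimal sketch in plain Python (not HaLoop's API; the page names and iteration count are made up) of the pattern HaLoop optimizes — on plain Hadoop each pass around this loop is a separate job that re-reads the invariant `links` data, which is exactly what HaLoop's loop-aware scheduling and caching avoid:

```python
# Hypothetical sketch: an iterative computation as repeated map/reduce passes.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # loop-invariant input

def map_step(ranks):
    # emit (target, contribution) pairs, one per outgoing link
    for page, rank in ranks.items():
        outs = links[page]
        for target in outs:
            yield target, rank / len(outs)

def reduce_step(pairs):
    # sum the contributions arriving at each page, with damping
    totals = {}
    for page, contrib in pairs:
        totals[page] = totals.get(page, 0.0) + contrib
    return {p: 0.15 + 0.85 * c for p, c in totals.items()}

ranks = {p: 1.0 for p in links}
for _ in range(20):  # plain Hadoop resubmits a full job per iteration
    ranks = reduce_step(map_step(ranks))
```

The loop body never changes `links`; shipping it to the mappers once and caching it is where HaLoop's reported savings come from.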

Interesting project, but svn reports the most recent commit was 2010-08-23 and the project wiki shows the UsersManual was last modified on 2010-09-04.

I will follow up with the owner and report back.

*****
Update: 2010-12-26 – Email from the project owner advises of activity not reflected at the project site; updates are to appear in January 2011. I will probably create another post, link back to this one, and forward from this one.

December 11, 2010

Cascalog

Filed under: Cascalog,Clojure,Hadoop,TMQL — Patrick Durusau @ 3:23 pm

Cascalog

From the website:

Cascalog is a tool for processing data on Hadoop with Clojure in a concise and expressive manner. Cascalog combines two cutting edge technologies in Clojure and Hadoop and resurrects an old one in Datalog. Cascalog is high performance, flexible, and robust.

Most query languages, like SQL, Pig, and Hive, are custom languages — and this leads to huge amounts of accidental complexity. Constructing queries dynamically by doing string manipulation is an impedance mismatch and makes usual programming techniques like abstraction and composition difficult.

Cascalog queries are first-class within Clojure and are extremely composable. Additionally, the Datalog syntax of Cascalog is simpler and more expressive than SQL-based languages.

Follow the getting started steps, check out the tutorial, and you’ll be running Cascalog queries on your local computer within 5 minutes.
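The "impedance mismatch" point above is worth unpacking. A rough sketch in plain Python (not Cascalog, and the table and predicates are invented for illustration) of the contrast between splicing query strings and treating queries as first-class values:

```python
# Hypothetical sketch: string-built queries vs. first-class, composable ones.
people = [("alice", 28), ("bob", 17), ("carol", 35)]

# String manipulation: composing or abstracting over this means
# error-prone text surgery.
min_age = 18
sql = "SELECT name FROM people WHERE age >= " + str(min_age)

# First-class queries: each query is an ordinary value, so ordinary
# abstraction and composition apply.
def adults(rows):
    return [(name, age) for name, age in rows if age >= 18]

def names(rows):
    return [name for name, _ in rows]

query = lambda rows: names(adults(rows))  # composed like any function
print(query(people))  # ['alice', 'carol']
```

In Cascalog the composable pieces are Datalog predicates inside Clojure rather than Python functions, but the abstraction argument is the same.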

Seems like I have heard the term Datalog in TMQL discussions. 😉

I wonder what it would be like to define TMQL operators in Cascalog, so that all the other capabilities of Cascalog would also be available.

When the next draft appears that will be an interesting question to explore.

November 29, 2010

Cloud9: a MapReduce library for Hadoop

Filed under: Hadoop,MapReduce,Software — Patrick Durusau @ 1:40 pm

Cloud9: a MapReduce library for Hadoop

From the website:

Cloud9 is a MapReduce library for Hadoop designed to serve as both a teaching tool and to support research in data-intensive text processing. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. Hadoop provides an open-source implementation of the programming model. The library itself is available on github and distributed under the Apache License.

See Data-Intensive Text Processing with MapReduce by Lin and Dyer for more details on MapReduce.
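For readers new to the programming model the quote describes, the canonical example is word count: a mapper emits key/value pairs, the framework groups them by key, and a reducer aggregates each group. A minimal simulation in plain Python (no Hadoop or Cloud9 involved; the documents are invented):

```python
from collections import defaultdict
from itertools import chain

def mapper(doc):
    # emit (word, 1) for every word in the document
    for word in doc.split():
        yield word, 1

def shuffle(pairs):
    # group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(key, values):
    return key, sum(values)

docs = ["big data big ideas", "data intensive text processing"]
pairs = chain.from_iterable(mapper(d) for d in docs)
counts = dict(reducer(k, v) for k, v in shuffle(pairs))
print(counts["data"])  # 2
```

Hadoop's contribution is running the map and reduce phases across a cluster of commodity servers, with the shuffle done over the network.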

A guide to using the Cloud9 library is available, including its use on particular datasets, such as Wikipedia.

November 25, 2010

FuzzyTable

Filed under: Fuzzy Matching,Hadoop,High Dimensionality — Patrick Durusau @ 10:29 am

Tackling Large Scale Data In Government.

OK, but I cite the post because of its coverage of FuzzyTable:

FuzzyTable is a large-scale, low-latency, parallel fuzzy-matching database built over Hadoop. It can use any matching algorithm that can compare two often high-dimensional items and return a similarity score. This makes it suitable not only for comparing fingerprints but other biometric modalities, images, audio, and anything that can be represented as a vector of features.
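The core operation the quote describes — compare two high-dimensional items, return a similarity score — can be sketched with cosine similarity over feature vectors. This is a hypothetical illustration, not FuzzyTable's API, and the "fingerprint" vectors are invented:

```python
import math

def cosine_similarity(a, b):
    # similarity score in [0, 1] for non-negative feature vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Anything reduced to a feature vector can be scored the same way; a
# FuzzyTable-style store would farm such comparisons out across map tasks.
fingerprint_a = [0.9, 0.1, 0.4, 0.0]
fingerprint_b = [0.8, 0.2, 0.5, 0.1]
print(round(cosine_similarity(fingerprint_a, fingerprint_b), 3))
```

Swap in any pairwise scoring function — edit distance, image descriptors, audio features — and the parallelization story is unchanged, which is the quote's point.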

Hmmm, “anything that can be represented as a vector of features?”

Did someone mention subject identity? 😉

Worth a very close read. Software release coming.

November 7, 2010

Parallel Implementation of Classification Algorithms Based on MapReduce

Filed under: Classification,Data Mining,Hadoop,MapReduce — Patrick Durusau @ 8:31 pm

Parallel Implementation of Classification Algorithms Based on MapReduce

Authors: Qing He, Fuzhen Zhuang, Jincheng Li and Zhongzhi Shi

Keywords: Data Mining, Classification, Parallel Implementation, Large Dataset, MapReduce

Abstract:

Data mining has attracted extensive research for several decades. As an important task of data mining, classification plays an important role in information retrieval, web searching, CRM, etc. Most of the present classification techniques are serial, which become impractical for large dataset. The computing resource is under-utilized and the executing time is not waitable. Provided the program mode of MapReduce, we propose the parallel implementation methods of several classification algorithms, such as k-nearest neighbors, naive bayesian model and decision tree, etc. Preparatory experiments show that the proposed parallel methods can not only process large dataset, but also can be extended to execute on a cluster, which can significantly improve the efficiency.

From the paper:

In this paper, we introduced the parallel implementation of several classification algorithms based on MapReduce, which make them be applicable to mine large dataset. The key is to design the proper key/value pairs. (emphasis in original)
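To make the "proper key/value pairs" point concrete, here is a hedged sketch (not the paper's code; the pair design, splits, and data are my own invention) of k-nearest neighbors phrased as map/reduce. The key/value choice of (query_id, (distance, label)) is what lets each mapper score only its own partition of the training data while the reducer merges the partial results into a global top-k:

```python
import heapq

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def map_split(split, query, qid):
    # each mapper scores only its partition of the training data
    return [(qid, (euclidean(vec, query), label)) for vec, label in split]

def reduce_topk(pairs, k=3):
    # the reducer sees all (distance, label) values for one query key
    nearest = heapq.nsmallest(k, (v for _, v in pairs))
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)  # majority vote

split1 = [([0.0, 0.0], "a"), ([0.1, 0.1], "a")]
split2 = [([5.0, 5.0], "b"), ([0.2, 0.0], "a"), ([4.9, 5.1], "b")]
query = [0.0, 0.1]
pairs = map_split(split1, query, 0) + map_split(split2, query, 0)
print(reduce_topk(pairs))  # 'a'
```

Change the key design and the decomposition breaks, which is presumably why the authors call it the key to the whole approach.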

Questions:

  1. Annotated bibliography of parallel classification algorithms (newer than this paper, 3-5 pages, citations)
  2. Report for class on application of parallel classification algorithms (report + paper)
  3. Application of parallel classification algorithm to a library dataset (project)
  4. Can the key/value pairs be interchanged with others? Yes/no, why? (3-5 pages, no citations.)

October 10, 2010

DPSP: Distributed Progressive Sequential Pattern Mining on the Cloud

Filed under: Data Mining,Hadoop,MapReduce,Pattern Recognition — Patrick Durusau @ 10:12 am

DPSP: Distributed Progressive Sequential Pattern Mining on the Cloud

Authors: Jen-Wei Huang, Su-Chen Lin, Ming-Syan Chen

Keywords: sequential pattern mining, period of interest (POI), customer transactions

Abstract:

The progressive sequential pattern mining problem has been discussed in previous research works. With the increasing amount of data, single processors struggle to scale up. Traditional algorithms running on a single machine may have scalability troubles. Therefore, mining progressive sequential patterns intrinsically suffers from the scalability problem. In view of this, we design a distributed mining algorithm to address the scalability problem of mining progressive sequential patterns. The proposed algorithm DPSP, standing for Distributed Progressive Sequential Pattern mining algorithm, is implemented on top of Hadoop platform, which realizes the cloud computing environment. We propose Map/Reduce jobs in DPSP to delete obsolete itemsets, update current candidate sequential patterns and report up-to-date frequent sequential patterns within each POI. The experimental results show that DPSP possesses great scalability and consequently increases the performance and the practicability of mining algorithms.
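The "progressive" part of the problem is worth a small illustration. A hypothetical sketch in plain Python (not DPSP itself; the POI length, support threshold, and transactions are invented) of counting frequent sequential patterns after obsolete transactions age out of the period of interest:

```python
from collections import Counter
from itertools import combinations

POI = 3  # window length in time units

def frequent_pairs(transactions, now, min_support=2):
    # drop transactions that have aged out of the period of interest
    live = [seq for t, seq in transactions if now - t < POI]
    counts = Counter()
    for seq in live:
        # every ordered pair in a customer's sequence is a candidate pattern
        counts.update(set(combinations(seq, 2)))
    return {p for p, c in counts.items() if c >= min_support}

transactions = [
    (0, ["milk", "bread", "jam"]),   # obsolete by now=4
    (2, ["milk", "bread"]),
    (3, ["milk", "bread", "eggs"]),
]
print(frequent_pairs(transactions, now=4))  # {('milk', 'bread')}
```

DPSP distributes this delete/update/report cycle as Map/Reduce jobs so the window can advance over far more transactions than one machine could rescan.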

The phrase mining sequential patterns was coined in Mining Sequential Patterns, a paper by Rakesh Agrawal and Ramakrishnan Srikant that is cited by the authors of this paper.

The original research was to find patterns in customer transactions, which I suspect are important “subjects” for discovery and representation in commerce topic maps.

July 14, 2010

Are simplified hadoop interfaces the next web cash cow? – Post

Filed under: Hadoop,Legends,MapReduce,Semantic Diversity,Subject Identity — Patrick Durusau @ 12:06 pm

Are simplified hadoop interfaces the next web cash cow? is a question that Brian Breslin is asking these days.

It isn’t hard to imagine not only Hadoop interfaces becoming cash cows, but also canned analyses of public data sets being incorporated into those interfaces.

But then the semantics question comes back up when you want to join that canned analysis to your own. What did they mean by X? Or Y? Or for that matter, what are the semantics of the data set?

But we can solve that issue with explicit subject identification! Did I hear someone say topic maps? 😉 Our identifications of subjects in public data sets would themselves become a commodity. There could be competing set-similarity analyses of public data sets.

If a simplified Hadoop interface is the next cash cow, we need to be ready to stuff it with data mapped to subject identifications to make it grow even larger. A large cash cow is a good thing, a larger cash cow is better and a BP-sized cash cow is just about right.

