Archive for the ‘Mahout’ Category

Analyzing Twitter: An End-to-End Data Pipeline Recap

Monday, May 13th, 2013

Analyzing Twitter: An End-to-End Data Pipeline Recap by Jason Barbour.

Jason reviews presentations at a recent Data Science MD meeting:

Starting off the night, Joey Echeverria, a Principal Solutions Architect, first discussed a big data architecture and how key components of a relational data management system can be replaced with current big data technologies. With Twitter being increasingly popular with marketing teams, analyzing Twitter data becomes a perfect use case to demonstrate a complete big data pipeline.

(…)

Following Joey, Sean Busbey, a Solutions Architect at Cloudera, discussed working with Mahout, a scalable machine learning library for Hadoop. Sean first introduced the three C’s of machine learning: classification, clustering, and collaborative filtering. With classification, learning from a training set is supervised, and new examples can be categorized. Clustering allows examples to be grouped together with common features, while collaborative filtering allows new candidates to be suggested.

Great summaries, links to additional resources and the complete slides.

Check the DC Data Community Events Calendar if you plan to visit the DC area. (I assume residents already do.)

Free Data Mining Tools [African Market?]

Wednesday, April 10th, 2013

The Best Data Mining Tools You Can Use for Free in Your Company by: Mawuna Remarque KOUTONIN.

Short descriptions of the usual suspects, plus a couple (jHepWork and PSPP) that were new to me.

  1. RapidMiner
  2. RapidAnalytics
  3. Weka
  4. PSPP
  5. KNIME
  6. Orange
  7. Apache Mahout
  8. jHepWork
  9. Rattle

An interesting site in general.

Consider the following pitch for business success in Africa:

Africa: Your Business Should be Profitable in 45 days or Die

And the reasons for that claim:

1. “It’s almost virgin here. There are lot of opportunities, but you have to fight!”

2. “Target the vanity class with vanity products. The “new rich” have lot of money. They are tough on everything except their big ego and social reputation”

3. “Target the lazy executives and middle managers. Do the job they are paid for as a consultant. Be good, and politically savvy, and the money is yours”

4. “You’ll make more money in selling food or opening a restaurant than working for the Bank”

5. “You can’t avoid politics, but learn to think like the people you are talking with. Always finish your sentence with something like “the most important is the country’s development, not power. We all have to work in that direction”

6. “It’s about hard work and passion, but you should first forget about managing time like in Europe.

Take time to visit people, go to the vanity parties, have the patience to let stupid people finish their long empty sentences, and make the politicians understand that your project could make them win elections and strengthen their positions”

7. “Speed is everything. Think fast, Act fast, Be everywhere through friends, family and informants”

With the exception of #1, all of these points are advice I would give to someone marketing topic maps on any continent.

It may be easier to market topic maps where there are few legacy IT systems that might feel threatened by a new technology.

Beginners Guide To Enhancing Solr/Lucene Search…

Monday, April 8th, 2013

Beginners Guide To Enhancing Solr/Lucene Search With Mahout’s Machine Learning by Doug Turnbull.

From the post:

Yesterday, John and I gave a talk to the DC Hadoop Users Group about using Mahout with Solr to perform Latent Semantic Indexing — calculating and exploiting the semantic relationships between keywords. While we were there, I realized, a lot of people could benefit from a bigger picture, less in-depth, point of view outside of our specific story. In general where do Mahout and Solr fit together? What does that relationship look like, and how does one exploit Mahout to make search even more awesome? So I thought I’d blog about how you too can start to put these pieces together to simultaneously exploit Solr’s search and Mahout’s machine learning capabilities.

The root of how this all works is with a slightly obscure feature of Lucene based search — Term Vectors. Lucene based search applications give you the ability to generate term vectors from documents in the search index. It’s a feature often turned on for specific search features, but other than that can appear to be a weird opaque feature to beginners. What is a term vector, you might ask? And why would you want to get one?
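For readers who have not met term vectors before, here is a minimal sketch of the idea in plain Python. It does not use Lucene or Solr at all; the `term_vector` function, the tokenizing rule and the sample documents are my own assumptions for illustration. A term vector is treated here as nothing more than a map from each term in a document to how often it occurs.

```python
from collections import Counter
import re


def term_vector(text):
    """Map each lowercased term in `text` to its frequency in that text."""
    # Assumption: a "term" is any run of letters or digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Hypothetical two-document "index".
docs = {
    "doc1": "Solr search and Mahout machine learning",
    "doc2": "Mahout clustering with Mahout vectors",
}

# One term vector per document, keyed by document id.
vectors = {doc_id: term_vector(body) for doc_id, body in docs.items()}

print(vectors["doc2"].most_common(1))
```

Vectors of this shape are the sort of input that machine learning code can consume, which is presumably why they are the bridge between the search index and Mahout in Doug's post.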

You know my misgivings about metric approaches to non-metric data (such as semantics) but there is no denying that Latent Semantic Indexing can be useful.

Think of Latent Semantic Indexing as a useful tool.

A saw is a tool too but not every cut made with a saw is a correct one.

Yes?

Mahout on Windows Azure…

Tuesday, January 22nd, 2013

Mahout on Windows Azure – Machine Learning Using Microsoft HDInsight by Istvan Szegedi.

From the post:

Our last post was about Microsoft and Hortonworks joint effort to deliver Hadoop on Microsoft Windows Azure dubbed HDInsight. One of the key Microsoft HDInsight components is Mahout, a scalable machine learning library that provides a number of algorithms relying on the Hadoop platform. Machine learning supports a wide range of use cases from email spam filtering to fraud detection to recommending books or movies, similar to Amazon.com features. These algorithms can be divided into three main categories: recommenders/collaborative filtering, categorization and clustering. More details about these algorithms can be read on Apache Mahout wiki.

Are you hearing Hadoop, Mahout, HBase, Hive, etc., as often as I am?

Does it make you wonder about Apache becoming the locus of transferable IT skills?

Something to think about as you are developing topic map ecosystems.

You can hand roll your own solutions.

Or build upon solutions that have widespread vendor support.

PS: Another great post from Istvan.

Taming Text [Coming real soon now!]

Thursday, December 13th, 2012

Taming Text by Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris.

During a webinar today Grant said that “Taming Text” should be out in ebook form in just a week or two.

Grant is giving up the position of being the second longest running MEAP project. (He didn’t say who was first.)

Let’s all celebrate Grant and his co-authors crossing the finish line with a record number of sales!

This promises to be a real treat!

PS: Not going to put this on my wish list, too random and clumsy a process. Will just order it direct. ;-)

Searching Big Data’s Open Source Roots

Monday, October 22nd, 2012

Searching Big Data’s Open Source Roots by Nicole Hemsoth.

Nicole talks to Grant Ingersoll, Chief Scientist at LucidWorks, about the open source roots of big data.

No technical insights but a nice piece to pass along to the c-suite. Investment in open source projects can pay rich dividends. So long as you don’t need them next quarter. ;-)

And a snapshot of where we are now, which is on the brink of new tools and capabilities in search technologies.

Applying Parallel Prediction to Big Data

Saturday, October 6th, 2012

Applying Parallel Prediction to Big Data by Dan McClary (Principal Product Manager for Big Data and Hadoop at Oracle).

From the post:

One of the constants in discussions around Big Data is the desire for richer analytics and models. However, for those who don’t have a deep background in statistics or machine learning, it can be difficult to know not only just what techniques to apply, but on what data to apply them. Moreover, how can we leverage the power of Apache Hadoop to effectively operationalize the model-building process? In this post we’re going to take a look at a simple approach for applying well-known machine learning approaches to our big datasets. We’ll use Pig and Hadoop to quickly parallelize a standalone machine-learning program written in Jython.

Playing Weatherman

I’d like to predict the weather. Heck, we all would – there’s personal and business value in knowing the likelihood of sun, rain, or snow. Do I need an umbrella? Can I sell more umbrellas? Better yet, groups like the National Climatic Data Center offer public access to weather data stretching back to the 1930s. I’ve got a question I want to answer and some big data with which to do it. On first reaction, because I want to do machine learning on data stored in HDFS, I might be tempted to reach for a massively scalable machine learning library like Mahout.

For the problem at hand, that may be overkill and we can get it solved in an easier way, without understanding Mahout. Something becomes apparent on thinking about the problem: I don’t want my climate model for San Francisco to include the weather data from Providence, RI. Weather is a local problem and we want to model it locally. Therefore what we need is many models across different subsets of data. For the purpose of example, I’d like to model the weather on a state-by-state basis. But if I have to build 50 models sequentially, tomorrow’s weather will have happened before I’ve got a national forecast. Fortunately, this is an area where Pig shines.
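The "many models across different subsets of data" idea can be sketched without Pig or Hadoop. The following is only an illustration under my own assumptions: the observations are made up, the "model" is just a mean temperature per state, and a thread pool stands in for the parallelism Dan gets from Pig. None of the names come from his post.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical (state, temperature) observations.
observations = [
    ("CA", 61.0), ("CA", 63.0),
    ("RI", 38.0), ("RI", 42.0), ("RI", 40.0),
]


def group_by_state(rows):
    """Split the data set into one list of temperatures per state."""
    groups = defaultdict(list)
    for state, temp in rows:
        groups[state].append(temp)
    return groups


def fit_model(item):
    """Stand-in 'model': the mean temperature observed for one state."""
    state, temps = item
    return state, mean(temps)


groups = group_by_state(observations)

# Each state's model is independent of the others, so they can be
# built concurrently instead of one after another.
with ThreadPoolExecutor() as pool:
    models = dict(pool.map(fit_model, groups.items()))

print(models)
```

The design point is the grouping step: once the data is partitioned by state, no model needs to see another state's data.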

Two quick observations:

First, Dan makes my point about your needing the “right” data, which may or may not be the same thing as “big data.” Decide what you want to do before you reach for big iron and data.

Second, I never hear references to the “weatherman” without remembering: “you don’t need to be a weatherman to know which way the wind blows.” (link to the manifesto) If you prefer a softer version, Subterranean Homesick Blues by Bob Dylan.

Scalable Machine Learning with Hadoop (most of the time)

Thursday, October 4th, 2012

Scalable Machine Learning with Hadoop (most of the time) by Grant Ingersoll. (slides)

Grant’s slides from a presentation on machine learning with Hadoop in Taiwan!

Not quite like being there but still useful.

And a reminder that I need to get a copy of Taming Text!

Do You Just Talk About The Weather?

Wednesday, September 12th, 2012

After reading this post by Alex you will still just be talking about the weather, but you may have something interesting to say. ;-)

Locating Mountains and More with Mahout and Public Weather Dataset by Alex Baranau

From the post:

Recently I was playing with Mahout and public weather dataset. In this post I will describe how I used Mahout library and weather statistics to fill missing gaps in weather measurements and how I managed to locate steep mountains in US with a little Machine Learning (n.b. we are looking for people with Machine Learning or Data Mining backgrounds – see our jobs).

The idea was to just play and learn something, so the effort I did and the decisions chosen along with the approaches should not be considered as a research or serious thoughts by any means. In fact, things done during this effort may appear too simple and straightforward to some. Read on if you want to learn about the fun stuff you can do with Mahout!

Tools & Data

The data and tools used during this effort are: Apache Mahout project and public weather statistics dataset. Mahout is a machine learning library which provided a handful of machine learning tools. During this effort I used just small piece of this big pie. The public weather dataset is a collection of daily weather measurements (temperature, wind speed, humidity, pressure, &c.) from 9000+ weather stations around the world.

What other questions could you explore with the weather data set?

The real power of “big data” access and tools may be that we no longer have to rely on the summaries of others.

Summaries still have a value-add, perhaps even more so when the original data is available for verification.

Learning Mahout : Classification

Monday, September 10th, 2012

Learning Mahout : Classification by Sujit Pal.

From the post:

The final part covered in the MIA book is Classification. The popular algorithms available are Stochastic Gradient Descent (SGD), Naive Bayes and Complementary Naive Bayes, Random Forests and Online Passive Aggressive. There are other algorithms in the pipeline, as seen from the Classification section of the Mahout wiki page.

The MIA book has generic classification information and advice that will be useful for any algorithm, but it specifically covers SGD, Bayes and Naive Bayes (the last two via Mahout scripts). Of these SGD and Random Forest are good for classification problems involving continuous variables and small to medium datasets, and the Naive Bayes family is good for problems involving text like variables and medium to large datasets.

In general, a solution to a classification problem involves choosing the appropriate features for classification, choosing the algorithm, generating the feature vectors (vectorization), training the model and evaluating the results in a loop. You continue to tweak stuff in each of these steps until you get the results with the desired accuracy.
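To make the steps in that loop concrete (vectorize, train, evaluate), here is a small sketch in plain Python. It is not Mahout code and not from Sujit's post: the bag-of-words features, the add-one smoothed multinomial Naive Bayes, and the toy training and held-out sets are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words feature vector: term -> count."""
    return Counter(text.lower().split())


def train(examples):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    label_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        features = vectorize(text)
        label_counts[label] += 1
        term_counts[label].update(features)
        vocab.update(features)
    return label_counts, term_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for `text`."""
    label_counts, term_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_docs / total_docs)
        for term, count in vectorize(text).items():
            # Add-one smoothing so an unseen term does not zero out a class.
            p = (term_counts[label][term] + 1) / (total_terms + len(vocab))
            score += count * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, examples):
    """Fraction of (text, label) pairs the model labels correctly."""
    hits = sum(predict(model, text) == label for text, label in examples)
    return hits / len(examples)


# Hypothetical training and held-out data.
training = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
held_out = [("buy cheap money", "spam"), ("monday meeting", "ham")]

model = train(training)
print(accuracy(model, held_out))
```

In practice the "tweak stuff" part of the loop means changing `vectorize` or the algorithm and re-running the evaluation until the accuracy is acceptable.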

Sujit notes that classification is under rapid development. The classification material is likely to become dated.

Some additional resources to consider:

Mahout User List (subscribe)

Mahout Developer List (subscribe)

IRC: Mahout’s IRC channel is #mahout.

Mahout QuickStart

Learning Mahout : Clustering

Sunday, September 2nd, 2012

Learning Mahout : Clustering by Sujit Pal.

From the post:

The next section in the MIA book is Clustering. As with Recommenders, Mahout provides both in-memory and map-reduce versions of various clustering algorithms. However, unlike Recommenders, there are quite a few toolkits (like Weka or Mallet for example) which are more comprehensive than Mahout for small or medium sized datasets, so I decided to concentrate on the M/R implementations.

The full list of clustering algorithms available in Mahout at the moment can be found on its Wiki Page under the Clustering section. The ones covered in the book are K-Means, Canopy, Fuzzy K-Means, LDA and Dirichlet. All these algorithms expect data in the form of vectors, so the first step is to convert the input data into this format, a process known as vectorization. Essentially, clustering is the process of finding nearby points in n-dimensional space, where each vector represents a point in this space, and each element of a vector represents a dimension in this space.

It is important to choose the right vector format for the clustering algorithm. For example, one should use the SequentialAccessSparseVector for KMeans, since there is a lot of sequential access in the algorithm. Other possibilities are the DenseVector and the RandomAccessSparseVector formats. The input to a clustering algorithm is a SequenceFile containing key-value pairs of {IntWritable, VectorWritable} objects. Since the implementations are given, Mahout users would spend most of their time vectorizing the input (and thinking about what feature vectors to use, of course).

Once vectorized, one can invoke the appropriate algorithm either by calling the appropriate bin/mahout subcommand from the command line, or through a program by calling the appropriate Driver’s run method. All the algorithms require the initial centroids to be provided, and the algorithm iteratively perturbs the centroids until they converge. One can either guess randomly or use the Canopy clusterer to generate the initial centroids.
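The "start from initial centroids and move them until they converge" step can be sketched as a tiny in-memory k-means in plain Python. This is an illustration only, not Mahout's implementation or its Driver API; the function names, the convergence tolerance and the sample points are my assumptions.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, max_iter=100, tol=1e-9):
    """Move the given initial centroids until they stop shifting."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: each centroid moves to the mean of its cluster;
        # a centroid with no points stays where it was.
        updated = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
        shift = max(math.dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break
    return centroids, clusters


# Hypothetical 2-D points and guessed initial centroids.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
initial = [(0.0, 0.0), (10.0, 10.0)]

centroids, clusters = kmeans(points, initial)
print(centroids)
```

The initial centroids are passed in rather than generated, mirroring the quoted point that they have to come from somewhere, whether a random guess or a preliminary pass such as Canopy.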

Finally, the output of the clustering algorithm can be read using the Mahout cluster dumper subcommand. To check the quality, take a look at the top terms in each cluster to see how “believable” they are. Another way to measure the quality of clusters is to measure the intercluster and intracluster distances. A lower spread of intercluster and intracluster distances generally implies “good” clusters. Here is code to calculate inter-cluster distance based on code from the MIA book.
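Sujit's code is in his post. As a stand-in for readers of this excerpt, the following plain Python sketch shows one simple way to compute the two measures. It is neither his code nor the MIA book's; the choice of Euclidean distance, centroid-to-centroid comparison and the sample clusters are my assumptions.

```python
import math
from itertools import combinations


def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    return tuple(sum(dim) / len(cluster) for dim in zip(*cluster))


def mean_intracluster_distance(cluster):
    """Average distance from each point to its own cluster centroid."""
    c = centroid(cluster)
    return sum(math.dist(p, c) for p in cluster) / len(cluster)


def mean_intercluster_distance(clusters):
    """Average pairwise distance between cluster centroids.

    Needs at least two clusters.
    """
    centers = [centroid(c) for c in clusters]
    pairs = list(combinations(centers, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Hypothetical clustering result.
clusters = [
    [(1.0, 1.0), (1.0, 2.0)],
    [(8.0, 8.0), (9.0, 8.0)],
]

intra = [mean_intracluster_distance(c) for c in clusters]
inter = mean_intercluster_distance(clusters)
print(intra, inter)
```

Tight clusters that sit far apart show up as small intracluster values next to a large intercluster value.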

Detailed walk through of two of the four case studies in Mahout In Action. This post and the book are well worth your time.

Learning Mahout : Collaborative Filtering [Recommend Your Preferences?]

Friday, August 24th, 2012

Learning Mahout : Collaborative Filtering by Sujit Pal.

From the post:

My Mahout in Action (MIA) book has been collecting dust for a while now, waiting for me to get around to learning about Mahout. Mahout is evolving quite rapidly, so the book is a bit dated now, but I decided to use it as a guide anyway as I work through the various modules in the (currently GA) 0.7 distribution.

My objective is to learn about Mahout initially from a client perspective, ie, find out what ML modules (eg, clustering, logistic regression, etc) are available, and which algorithms are supported within each module, and how to use them from my own code. Although Mahout provides non-Hadoop implementations for almost all its features, I am primarily interested in the Hadoop implementations. Initially I just want to figure out how to use it (with custom code to tweak behavior). Later, I would like to understand how the algorithm is represented as a (possibly multi-stage) M/R job so I can build similar implementations.

I am going to write about my progress, mainly in order to populate my cheat sheet in the sky (ie, for future reference). Any code I write will be available in this GitHub (Scala) project.

The first module covered in the book is Collaborative Filtering. Essentially, it is a technique of predicting preferences given the preferences of others in the group. There are two main approaches – user based and item based. In case of user-based filtering, the objective is to look for users similar to the given user, then use the ratings from these similar users to predict a preference for the given user. In case of item-based recommendation, similarities between pairs of items are computed, then preferences predicted for the given user using a combination of the user’s current item preferences and the similarity matrix.
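The user-based approach described above can be sketched in a few lines of plain Python. This is not Mahout's recommender API; the ratings, the use of cosine similarity over co-rated items, and the similarity-weighted average are assumptions picked for brevity.

```python
import math

# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob":   {"A": 4.0, "B": 3.0, "C": 5.0, "D": 4.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
}


def cosine(u, v):
    """Cosine similarity over the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict(user, item, ratings):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, prefs in ratings.items():
        if other == user or item not in prefs:
            continue
        sim = cosine(ratings[user], prefs)
        num += sim * prefs[item]
        den += abs(sim)
    # None means nobody comparable has rated the item.
    return num / den if den else None


print(predict("alice", "D", ratings))
```

Alice's predicted rating for D lands nearer Bob's rating than Carol's because her existing ratings look more like Bob's. That is also where the outlier question below bites: with no similar users, there is little to average.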

While you are working your way through this post, keep in mind: Collaborative filtering with GraphChi.

Question: What if you are an outlier?

Telephone marketing interviews with me get shortened by responses like: “X? Is that a TV show?”

How would you go about piercing the marketing veil to recommend your preferences?

Now that is a product to which even I might subscribe. (But don’t advertise on TV, I won’t see it.)

Lucene Revolution 2012 – Slides/Videos

Thursday, June 7th, 2012

Lucene Revolution 2012 – Slides/Videos

The slides and videos from Lucene Revolution 2012 are up!

Now you don’t have to search for old re-runs on Hulu to watch during lunch!

Apache Bigtop 0.3.0 (incubating) has been released

Wednesday, April 4th, 2012

Apache Bigtop 0.3.0 (incubating) has been released by Roman Shaposhnik.

From the post:

Apache Bigtop 0.3.0 (incubating) is now available. This is the first fully integrated, community-driven, 100% Apache Big Data management distribution based on Apache Hadoop 1.0. In addition to a major change in the Hadoop version, all of the Hadoop ecosystem components have been upgraded to the latest stable versions and thoroughly tested:

  • Apache Hadoop 1.0.1
  • Apache Zookeeper 3.4.3
  • Apache HBase 0.92.0
  • Apache Hive 0.8.1
  • Apache Pig 0.9.2
  • Apache Mahout 0.6.1
  • Apache Oozie 3.1.3
  • Apache Sqoop 1.4.1
  • Apache Flume 1.0.0
  • Apache Whirr 0.7.0

Thoughts on what is missing from this ecosystem?

What if you moved from the company where you wrote the scripts? And they needed new scripts?

Re-write? On what basis?

Is your “big data” big enough to need “big documentation?”

running mahout collocations over common crawl text

Tuesday, March 6th, 2012

running mahout collocations over common crawl text by Mat Kelcey.

From the post:

Common crawl is a publicly available 30TB web crawl taken between September 2009 and September 2010. As a small project I decided to extract and tokenise the visible text of the web pages in this dataset. All the code to do this is on github.

Can you answer Mat’s question about the incidence of Lithuanian pages? (Please post here.)

Using your Lucene index as input to your Mahout job – Part I

Tuesday, March 6th, 2012

Using your Lucene index as input to your Mahout job – Part I

From the post:

This blog shows you how to use an upcoming Mahout feature, the lucene2seq program or https://issues.apache.org/jira/browse/MAHOUT-944. This program reads the contents of stored fields in your Lucene index and converts them into text sequence files, to be used by a Mahout text clustering job. The tool contains both a sequential and MapReduce implementation and can be run from the command line or from Java using a bean configuration object. In this blog I demonstrate how to use the sequential version on an index of Wikipedia.

Access to original text can help with improving clustering results. See the blog post for details.

Mavuno: Hadoop-Based Text Mining Toolkit

Saturday, January 28th, 2012

Mavuno: A Hadoop-Based Text Mining Toolkit

From the webpage:

Mavuno is an open source, modular, scalable text mining toolkit built upon Hadoop. It supports basic natural language processing tasks (e.g., part of speech tagging, chunking, parsing, named entity recognition), is capable of large-scale distributional similarity computations (e.g., synonym, paraphrase, and lexical variant mining), and has information extraction capabilities (e.g., instance and semantic relation mining). It can easily be adapted to new input formats and text mining tasks.

Just glancing at the documentation I am intrigued by the support for Java regular expressions. More on that this coming week.

I first saw this at myNoSQL.

Apache Mahout user meeting – session slides and videos are now available!

Friday, December 30th, 2011

Apache Mahout user meeting – session slides and videos are now available!

From the post:

The first San Francisco Apache Mahout user meeting was held on November 29th 2011 at Lucid Imagination headquarters in Redwood City. The 3-hour session hosted 2 talks followed by networking, food and drinks.

Session topics -

  • “Using Mahout to cluster, classify and recommend, plus a demonstration of using scripts packaged with Mahout” by Grant Ingersoll from Lucid Imagination.
  • “How using random projection in Machine learning can benefit performance without sacrificing quality” by Ted Dunning from MapR Technologies.

Sharpening your Mahout skills is never a bad idea!

Apache Whirr 0.7.0 has been released

Wednesday, December 28th, 2011

Apache Whirr 0.7.0 has been released

From Patrick Hunt at Cloudera:

Apache Whirr release 0.7.0 is now available. It includes changes covering over 50 issues, four of which were considered blockers. Whirr is a tool for quickly starting and managing clusters running on cloud services like Amazon EC2. This is the first Whirr release as a top level Apache project (previously releases were under the auspices of the Incubator). In addition to improving overall stability some of the highlights are described below:

Support for Apache Mahout as a deployable component is new in 0.7.0. Mahout is a scalable machine learning library implemented on top of Apache Hadoop.

  • WHIRR-384 – Add Mahout as a service
  • WHIRR-49 – Allow Whirr to use Chef for configuration management
  • WHIRR-258 – Add Ganglia as a service
  • WHIRR-385 – Implement support for using nodeless, masterless Puppet to provision and run scripts

Whirr 0.7.0 will be included in a scheduled update to CDH4.

Getting Involved

The Apache Whirr project is working on a number of new features. The How To Contribute page is a great place to start if you’re interested in getting involved as a developer.

Cluster management or even the “cloud” in your topic map future?

You could do worse than learning one of the most recent top-level Apache projects to prepare for a future that may arrive sooner than you think!

Learning Machine Learning with Apache Mahout

Sunday, December 25th, 2011

Learning Machine Learning with Apache Mahout

From the post:

Once in a while I get questions like Where to start learning more on machine learning. Other than the official sources I think there is quite good coverage also in the Mahout community: Since it was founded several presentations have been given that give an overview of Apache Mahout, introduce special features or even go into more details on particular implementations. Below is an attempt to create a collection of talks given so far without any claim to contain links to all videos or lectures. Feel free to add your favourite in the comments section. In addition I linked to some online courses with further material to get you started.

When looking for books of course check out Mahout in Action. Also Taming Text and the data mining book that comes with weka are good starting points for practitioners.

Nice collection of resources on getting started with Apache Mahout.

Recommendation with Apache Mahout in CDH3 – Update

Saturday, November 26th, 2011

Recommendation with Apache Mahout in CDH3 – Update

My original post was to a page at Cloudera. That page has now gone away.

I saw a tweet by Alex Popescu asking about the page and when I checked, all I got was a 404.

Started to update my post but then decided there is a broader question as to whether I should cache local copies of pages and resources? So that at least you will see the page as I saw it when I made the entry?

Comments?

Recommendation with Apache Mahout in CDH3

Saturday, November 12th, 2011

Recommendation with Apache Mahout in CDH3 by Josh Patterson.

From the introduction:

The amount of information we are exposed to on a daily basis is far outstripping our ability to consume it, leaving many of us overwhelmed by the amount of new content we have available. Ideally we’d like machines and algorithms to help us find the more interesting (for us individually) things so we more easily focus our attention on items of relevance.

Have you ever been recommended a friend on Facebook or an item you might be interested in on Amazon? If so then you’ve benefitted from the value of recommendation systems. Recommendation systems apply knowledge discovery techniques to the problem of making recommendations that are personalized for each user. Recommendation systems are one way we can use algorithms to help us sort through the masses of information to find the “good stuff” in a very personalized way.

Due to the explosion of web traffic and users the scale of recommendation poses new challenges for recommendation systems. These systems face the dual challenge of producing high quality recommendations while also calculating recommendations for millions of users. In recent years collaborative filtering (CF) has become popular as a way to effectively meet these challenges. CF techniques start off by analyzing the user-item matrix to identify relationships between different users or items and then use that information to produce recommendations for each user.
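The last sentence, analyzing the user-item matrix to find relationships between items and then recommending from them, can be sketched as item-based filtering in plain Python. This is not the code from Josh's post or Mahout's API; the matrix, the cosine measure and the scoring rule are assumptions for illustration.

```python
import math
from collections import defaultdict

# Hypothetical user-item matrix: user -> {item: rating}.
user_item = {
    "u1": {"book": 5.0, "film": 4.0},
    "u2": {"book": 4.0, "film": 5.0, "game": 1.0},
    "u3": {"film": 2.0, "game": 5.0},
}


def item_vectors(matrix):
    """Transpose the matrix: item -> {user: rating}."""
    items = defaultdict(dict)
    for user, prefs in matrix.items():
        for item, rating in prefs.items():
            items[item][user] = rating
    return items


def cosine(a, b):
    """Cosine similarity between two item vectors keyed by user."""
    dot = sum(a[u] * b[u] for u in set(a) & set(b))
    norm = (math.sqrt(sum(x * x for x in a.values()))
            * math.sqrt(sum(x * x for x in b.values())))
    return dot / norm if norm else 0.0


def recommend(user, matrix):
    """Rank unseen items by rating-weighted similarity to rated items."""
    items = item_vectors(matrix)
    seen = matrix[user]
    scores = {}
    for candidate in items:
        if candidate in seen:
            continue
        scores[candidate] = sum(
            cosine(items[candidate], items[liked]) * rating
            for liked, rating in seen.items()
        )
    return sorted(scores, key=scores.get, reverse=True)


print(recommend("u1", user_item))
```

At the scale the post talks about, the item-item similarities would be computed as a batch job rather than on each request; this sketch recomputes them every time purely to stay short.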

To use this post as an introduction to recommendation with Apache Mahout, is there anything you would change, subtract from or add to this post? If anything.

I am working on my answer to that question but am curious what you think?

I want to use this and similar material in a graduate library course more to demonstrate the principles than to turn any of the students into Hadoop hackers. (Although that would be a nice result as well.)

Apache Mahout: Scalable machine learning for everyone

Wednesday, November 9th, 2011

Apache Mahout: Scalable machine learning for everyone by Grant Ingersoll.

Summary:

Apache Mahout committer Grant Ingersoll brings you up to speed on the current version of the Mahout machine-learning library and walks through an example of how to deploy and scale some of Mahout’s more popular algorithms.

A short summary to a twenty-three (23) page paper that concludes with two (2) pages of pointers to additional resources!

You will learn a lot about Mahout and Amazon Web Services (EC2).

Search + Big Data: It’s (still) All About the User (Users or Documents?)

Tuesday, November 8th, 2011

Search + Big Data: It’s (still) All About the User by Grant Ingersoll.

Slides

Abstract:

Apache Hadoop has rapidly become the primary framework of choice for enterprises that need to store, process and manage large data sets. It helps companies to derive more value from existing data as well as collect new data, including unstructured data from server logs, social media channels, call center systems and other data sets that present new opportunities for analysis. This keynote will provide insight into how Apache Hadoop is being leveraged today and how it is evolving to become a key component of tomorrow’s enterprise data architecture. This presentation will also provide a view into the important intersection between Apache Hadoop and search.

Awesome as always!

Please watch the presentation and review the slides before going further. What follows won’t make much sense without Grant’s presentation as a context. I’ll wait……

Back so soon? ;-)

On slide 4 (I said to review the slides), Grant presents four overlapping areas, starting with Documents: Models, Feature Selection; Content Relationships: Page Rank, etc., Organization; Queries: Phrases, NLP; User Interaction: Clicks, Ratings/Reviews, Learning to Rank, Social Graph; and the intersection of those four areas is where Grant says search is rapidly evolving.

On slide 5 (sorry, last slide reference), Grant says that mining that intersection is a loop composed of: Search -> Discovery -> Analytics -> (back to Search). All of which involve processing of data that has been collected from use of the search interface.

Grant’s presentation made clear something that I have been overlooking:

Search/Indexing, as commonly understood, does not capture any discoveries or insights of users.

Even the search trails that Grant mentions are just lemming tracks complete with droppings. You can follow them if you like, may find interesting data, may not.

My point being that there is no way to capture the user’s insight that LBJ, for instance, is a common acronym for Lyndon Baines Johnson. So that the next user who searches for LBJ will find the information contributed by a prior user. Such as distinguishing application of Lyndon Baines Johnson to a graduate school (Lyndon B. Johnson School of Public Affairs), a hospital (Lyndon B. Johnson General Hospital), a PBS show (American Experience . The Presidents . Lyndon B. Johnson), a biography (American President: Lyndon Baines Johnson), and that is in just the first ten (10) “hits.” Oh, and as the name of an American President.

Grant made that clear for me with his loop of Search -> Discovery -> Analytics -> (back to Search) because Search only ever focuses on the documents, never the user’s insight into the documents.

And with every search, every user (with the exception of search trails), starts over at the beginning.

What if a colleague found a bug in program code, but you had to start at the beginning of the program and work your way there? Good use of your time? To reset with every user? That is what happens with search, nearly a complete reset. (Not complete because of page rank, etc. but only just.)

If we are going to make it “All About the User,” shouldn’t we be indexing their insights* into data? (Big or otherwise.)

*“Clicks” are not insights. Could be an unsteady hand, DTs, etc.

CDH3 update 2 is released (Apache Hadoop)

Friday, October 21st, 2011

CDH3 update 2 is released (Apache Hadoop)

From the post:

There are a number of improvements coming to CDH3 with update 2. Among them are:

  1. New features – Support for Apache Mahout (0.5). Apache Mahout is a popular machine learning library that makes it easier for users to perform analyses like collaborative filtering and k-means clustering on Hadoop. Also added in update 2 is expanded support for Apache Avro’s data file format. Users can:
  • load data into Avro data files in Hadoop via Sqoop or Flume
  • run MapReduce, Pig or Hive workloads on Avro data files
  • view the contents of Avro files from the Hue web client

This gives users the ability to use all the major features of the Hadoop stack without having to switch file formats. Avro file format provides added benefits over text because it is faster and more compact.

  2. Improvements (stability and performance) – HBase in particular has received a number of improvements that improve stability and recoverability. All HBase users are encouraged to use update 2.
  3. Bug fixes – 50+ bug fixes. The enumerated fixes and their corresponding Apache project jiras are provided in the release notes.

Update 2 is available in all the usual formats (RHEL, SLES, Ubuntu, Debian packages, tarballs, and SCM Express). Check out the installation docs for instructions. If you’re running components from the Cloudera Management Suite they will not be impacted by moving to update 2. The next update (update 3) for CDH3 is planned for January, 2012.

Thank you for supporting Apache Hadoop and thank you for supporting Cloudera.

Another aspect of Cloudera’s support for the Hadoop ecosystem is its Cloudera University.

VinWiki Part 1: Building an intelligent Web app using Seam, Hibernate, RichFaces, Lucene and Mahout

Tuesday, October 4th, 2011

VinWiki Part 1: Building an intelligent Web app using Seam, Hibernate, RichFaces, Lucene and Mahout

From the webpage:

This is the first post in a four part series about a wine rating and recommendation Web application, named VinWiki, built using open source technology. The purpose of this series is to document key design and implementation decisions, which may be of interest to anyone wanting to build an intelligent Web application using Java technologies. The end result will not be a 100% functioning Web application, but will have enough functionality to prove the concepts.

I thought about Lars Marius and his expertise at beer evaluation when I saw this series. Not that Lars would need it but it looks like the sort of thing you could build to recommend things you know something about, and like. Whatever that may be. ;-)

Running Mahout in the Cloud using Apache Whirr

Tuesday, September 20th, 2011

Running Mahout in the Cloud using Apache Whirr

From the post:

This blog shows you how to run Mahout in the cloud, using Apache Whirr. Apache Whirr is a promising Apache incubator project for quickly launching cloud instances, from Hadoop to Cassandra, HBase, Zookeeper and so on. I will show you how to set up a Hadoop cluster and run Mahout jobs both via the command line and Whirr’s Java API (version 0.4).

Running Mahout in the cloud with Apache Whirr will prepare you for using Whirr or similar tools to run services in the cloud.

SearchWorkings

Saturday, September 10th, 2011

SearchWorkings

From the About Us page:

SearchWorkings.org was created by a bunch of really passionate search technology professionals who realised that the world (read: other search professionals) doesn’t have a single point of contact or comprehensive resource where they can learn and talk about all the exciting new developments in the wonderful world of open source search solutions. These professionals all work at JTeam, a leading supplier of high-quality custom-built applications and end-to-end solutions provider, and moreover a market leader when it comes to search solutions.

The site offers a wide variety of materials: whitepapers and articles, forums (Lucene, Solr, ElasticSearch, Mahout), training videos, news, and blogs.

You do have to register/join (free) to get access to the good stuff.

Calling Mahout from Clojure

Thursday, August 18th, 2011

Calling Mahout from Clojure

From the post:

Mahout is a set of libraries for running machine learning processes, such as recommendation, clustering and categorisation.

The libraries work against an abstract model that can be anything from a file to a full Hadoop cluster. This means you can start playing around with small data sets in files, a local database, a Hadoop cluster or a custom data store.

After a bit of research, it turned out not to be too complex to call via any JVM language. When you compile and install Mahout, the libraries are installed into your local Maven cache. This makes it very easy to include them into any JVM type project.

Concludes with two interesting references:

Visualizing Mahout’s output with Clojure and Incanter

Monte Carlo integration with Clojure and Mahout

Mahout: Scaleable Data Mining for Everybody

Monday, August 8th, 2011

Mahout: Scaleable Data Mining for Everybody by Ted Dunning.

Has to be the most entertaining and accessible presentation on classification I have seen to date.

Ted is a co-author of Mahout in Action with Sean Owen, Robin Anil, and Ellen Friedman.

If they had more of this sort of thing during the pledge drives to support public television I would bet that their numbers would be better. At least among a certain crowd! ;-)