Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

December 30, 2011

Apache Mahout user meeting – session slides and videos are now available!

Filed under: Mahout — Patrick Durusau @ 6:03 pm

Apache Mahout user meeting – session slides and videos are now available!

From the post:

The first San Francisco Apache Mahout user meeting was held on November 29th 2011 at Lucid Imagination headquarters in Redwood City. The 3-hour session hosted 2 talks followed by networking, food and drinks.

Session topics –

  • “Using Mahout to cluster, classify and recommend, plus a demonstration of using scripts packaged with Mahout” by Grant Ingersoll from Lucid Imagination.
  • “How using random projection in Machine learning can benefit performance without sacrificing quality” by Ted Dunning from MapR Technologies.
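
Ted’s random projection point is easy to see in miniature. Here is a toy Java sketch (mine, not from the talk; dimensions and seed are arbitrary) of projecting vectors through a random Gaussian matrix, which approximately preserves pairwise distances per the Johnson-Lindenstrauss lemma:

  import java.util.Random;

  public class RandomProjection {
      public static void main(String[] args) {
          int d = 1000, k = 50;          // original and reduced dimensionality
          Random rnd = new Random(42);

          // Random Gaussian projection matrix, scaled by 1/sqrt(k) so that
          // Euclidean distances are approximately preserved.
          double[][] proj = new double[k][d];
          for (int i = 0; i < k; i++)
              for (int j = 0; j < d; j++)
                  proj[i][j] = rnd.nextGaussian() / Math.sqrt(k);

          // Two random high-dimensional points.
          double[] x = randomPoint(d, rnd), y = randomPoint(d, rnd);

          System.out.printf("original distance:  %.4f%n", dist(x, y));
          System.out.printf("projected distance: %.4f%n",
                            dist(apply(proj, x), apply(proj, y)));
      }

      static double[] randomPoint(int d, Random rnd) {
          double[] p = new double[d];
          for (int i = 0; i < d; i++) p[i] = rnd.nextDouble();
          return p;
      }

      static double[] apply(double[][] m, double[] v) {
          double[] out = new double[m.length];
          for (int i = 0; i < m.length; i++)
              for (int j = 0; j < v.length; j++)
                  out[i] += m[i][j] * v[j];
          return out;
      }

      static double dist(double[] a, double[] b) {
          double s = 0;
          for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
          return Math.sqrt(s);
      }
  }

The two distances printed come out close, even though the projected vectors are 20 times smaller: that is the performance win Ted’s talk title refers to.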

Sharpening your Mahout skills is never a bad idea!

December 28, 2011

Apache Whirr 0.7.0 has been released

Filed under: Cloud Computing,Clustering (servers),Mahout,Whirr — Patrick Durusau @ 9:30 pm

Apache Whirr 0.7.0 has been released

From Patrick Hunt at Cloudera:

Apache Whirr release 0.7.0 is now available. It includes changes covering over 50 issues, four of which were considered blockers. Whirr is a tool for quickly starting and managing clusters running on cloud services like Amazon EC2. This is the first Whirr release as a top level Apache project (previously, releases were made under the auspices of the Incubator). In addition to improving overall stability, some of the highlights are described below:

Support for Apache Mahout as a deployable component is new in 0.7.0. Mahout is a scalable machine learning library implemented on top of Apache Hadoop.

  • WHIRR-384 – Add Mahout as a service
  • WHIRR-49 – Allow Whirr to use Chef for configuration management
  • WHIRR-258 – Add Ganglia as a service
  • WHIRR-385 – Implement support for using nodeless, masterless Puppet to provision and run scripts

Whirr 0.7.0 will be included in a scheduled update to CDH4.

Getting Involved

The Apache Whirr project is working on a number of new features. The How To Contribute page is a great place to start if you’re interested in getting involved as a developer.

Cluster management or even the “cloud” in your topic map future?

You could do worse than learning one of the most recent Apache top level projects to prepare for a future that may arrive sooner than you think!

December 25, 2011

Learning Machine Learning with Apache Mahout

Filed under: Machine Learning,Mahout — Patrick Durusau @ 6:06 pm

Learning Machine Learning with Apache Mahout

From the post:

Once in a while I get questions like “Where do I start learning more on machine learning?” Other than the official sources I think there is quite good coverage also in the Mahout community: since it was founded, several presentations have been given that give an overview of Apache Mahout, introduce special features or go into more detail on particular implementations. Below is an attempt to create a collection of talks given so far, without any claim to contain links to all videos or lectures. Feel free to add your favourite in the comments section. In addition I linked to some online courses with further material to get you started.

When looking for books, of course check out Mahout in Action. Also Taming Text and the data mining book that comes with Weka are good starting points for practitioners.

Nice collection of resources on getting started with Apache Mahout.

November 26, 2011

Recommendation with Apache Mahout in CDH3 – Update

Filed under: Mahout — Patrick Durusau @ 7:59 pm

Recommendation with Apache Mahout in CDH3 – Update

My original post was to a page at Cloudera. That page has now gone away.

I saw a tweet by Alex Popescu asking about the page and when I checked, all I got was a 404.

Started to update my post but then decided there is a broader question: should I cache local copies of pages and resources, so that you will at least see the page as I saw it when I made the entry?

Comments?

November 12, 2011

Recommendation with Apache Mahout in CDH3

Filed under: Hadoop,Mahout — Patrick Durusau @ 8:46 pm

Recommendation with Apache Mahout in CDH3 by Josh Patterson.

From the introduction:

The amount of information we are exposed to on a daily basis is far outstripping our ability to consume it, leaving many of us overwhelmed by the amount of new content we have available. Ideally we’d like machines and algorithms to help us find the more interesting (for us individually) things so we more easily focus our attention on items of relevance.

Have you ever been recommended a friend on Facebook or an item you might be interested in on Amazon? If so then you’ve benefitted from the value of recommendation systems. Recommendation systems apply knowledge discovery techniques to the problem of making recommendations that are personalized for each user. Recommendation systems are one way we can use algorithms to help us sort through the masses of information to find the “good stuff” in a very personalized way.

Due to the explosion of web traffic and users the scale of recommendation poses new challenges for recommendation systems. These systems face the dual challenge of producing high quality recommendations while also calculating recommendations for millions of users. In recent years collaborative filtering (CF) has become popular as a way to effectively meet these challenges. CF techniques start off by analyzing the user-item matrix to identify relationships between different users or items and then use that information to produce recommendations for each user.
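
If you want to see what that looks like in code, here is a minimal user-based CF sketch against Mahout’s Taste API (the ratings file name, neighborhood size and user ID are illustrative; details vary by Mahout version):

  import java.io.File;
  import java.util.List;

  import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
  import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
  import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
  import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
  import org.apache.mahout.cf.taste.model.DataModel;
  import org.apache.mahout.cf.taste.recommender.RecommendedItem;

  public class UserBasedCF {
      public static void main(String[] args) throws Exception {
          // ratings.csv holds the user-item matrix as userID,itemID,preference rows.
          DataModel model = new FileDataModel(new File("ratings.csv"));

          // Compare users by the correlation of their ratings...
          PearsonCorrelationSimilarity similarity =
              new PearsonCorrelationSimilarity(model);
          // ...and form a neighborhood of the 10 most similar users.
          NearestNUserNeighborhood neighborhood =
              new NearestNUserNeighborhood(10, similarity, model);

          GenericUserBasedRecommender recommender =
              new GenericUserBasedRecommender(model, neighborhood, similarity);

          // Top 3 recommendations for user 1.
          for (RecommendedItem item : recommender.recommend(1, 3)) {
              System.out.println(item.getItemID() + " " + item.getValue());
          }
      }
  }

Swap in an item-based recommender or a different similarity measure and the shape of the code stays the same, which is much of the Taste API’s appeal.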

To use this post as an introduction to recommendation with Apache Mahout, is there anything you would change, subtract from or add to it? If anything.

I am working on my answer to that question, but I am curious what you think.

I want to use this and similar material in a graduate library course, more to demonstrate the principles than to turn any of the students into Hadoop hackers. (Although that would be a nice result as well.)

November 9, 2011

Apache Mahout: Scalable machine learning for everyone

Filed under: Amazon Web Services AWS,Mahout — Patrick Durusau @ 7:41 pm

Apache Mahout: Scalable machine learning for everyone by Grant Ingersoll.

Summary:

Apache Mahout committer Grant Ingersoll brings you up to speed on the current version of the Mahout machine-learning library and walks through an example of how to deploy and scale some of Mahout’s more popular algorithms.

A short summary of a twenty-three (23) page paper that concludes with two (2) pages of pointers to additional resources!

You will learn a lot about Mahout and Amazon Web Services (EC2).

November 8, 2011

Search + Big Data: It’s (still) All About the User (Users or Documents?)

Filed under: Hadoop,Lucene,LucidWorks,Mahout,Solr,Topic Maps — Patrick Durusau @ 7:44 pm

Search + Big Data: It’s (still) All About the User by Grant Ingersoll.

Slides

Abstract:

Apache Hadoop has rapidly become the primary framework of choice for enterprises that need to store, process and manage large data sets. It helps companies to derive more value from existing data as well as collect new data, including unstructured data from server logs, social media channels, call center systems and other data sets that present new opportunities for analysis. This keynote will provide insight into how Apache Hadoop is being leveraged today and how it is evolving to become a key component of tomorrow’s enterprise data architecture. This presentation will also provide a view into the important intersection between Apache Hadoop and search.

Awesome as always!

Please watch the presentation and review the slides before going further. What follows won’t make much sense without Grant’s presentation as a context. I’ll wait……

Back so soon? 😉

On slide 4 (I said to review the slides), Grant presents four overlapping areas: Documents (models, feature selection); Content Relationships (page rank, organization, etc.); Queries (phrases, NLP); and User Interaction (clicks, ratings/reviews, learning to rank, social graph). The intersection of those four areas is where Grant says search is rapidly evolving.

On slide 5 (sorry, last slide reference), Grant says that mining that intersection is a loop: Search -> Discovery -> Analytics -> (back to Search). All of these involve processing data collected from use of the search interface.

Grant’s presentation made clear something that I have been overlooking:

Search/Indexing, as commonly understood, does not capture any discoveries or insights of users.

Even the search trails that Grant mentions are just lemming tracks complete with droppings. You can follow them if you like, may find interesting data, may not.

My point being that there is no way to capture a user’s insight that LBJ, for instance, is a common acronym for Lyndon Baines Johnson, so that the next user who searches for LBJ will find the information contributed by a prior user. Such as distinguishing applications of Lyndon Baines Johnson to a graduate school (Lyndon B. Johnson School of Public Affairs), a hospital (Lyndon B. Johnson General Hospital), a PBS show (American Experience . The Presidents . Lyndon B. Johnson), a biography (American President: Lyndon Baines Johnson), and that is in just the first ten (10) “hits.” Oh, and as the name of an American President.

Grant made that clear for me with his loop of Search -> Discovery -> Analytics -> (back to Search) because Search only ever focuses on the documents, never the user’s insight into the documents.

And with every search, every user (with the exception of search trails) starts over at the beginning.

Imagine a colleague found a bug in program code, but you had to start at the beginning of the program and work your way to it. Good use of your time? To reset with every user? That is what happens with search: nearly a complete reset. (Not a complete reset, because of page rank, etc., but only just.)

If we are going to make it “All About the User,” shouldn’t we be indexing their insights* into data? (Big or otherwise.)

*”Clicks” are not insights. Could be an unsteady hand, DTs, etc.

October 21, 2011

CDH3 update 2 is released (Apache Hadoop)

Filed under: Hadoop,Hive,Mahout,MapReduce,Pig — Patrick Durusau @ 7:27 pm

CDH3 update 2 is released (Apache Hadoop)

From the post:

There are a number of improvements coming to CDH3 with update 2. Among them are:

  1. New features – Support for Apache Mahout (0.5). Apache Mahout is a popular machine learning library that makes it easier for users to perform analyses like collaborative filtering and k-means clustering on Hadoop. Also added in update 2 is expanded support for Apache Avro’s data file format. Users can:
  • load data into Avro data files in Hadoop via Sqoop or Flume
  • run MapReduce, Pig or Hive workloads on Avro data files
  • view the contents of Avro files from the Hue web client

This gives users the ability to use all the major features of the Hadoop stack without having to switch file formats. The Avro file format provides added benefits over text because it is faster and more compact.

  2. Improvements (stability and performance) – HBase in particular has received a number of improvements that improve stability and recoverability. All HBase users are encouraged to use update 2.
  3. Bug fixes – 50+ bug fixes. The enumerated fixes and their corresponding Apache project jiras are provided in the release notes.

Update 2 is available in all the usual formats (RHEL, SLES, Ubuntu, Debian packages, tarballs, and SCM Express). Check out the installation docs for instructions. If you’re running components from the Cloudera Management Suite they will not be impacted by moving to update 2. The next update (update 3) for CDH3 is planned for January, 2012.

Thank you for supporting Apache Hadoop and thank you for supporting Cloudera.
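
As a taste of the Avro support mentioned above: an Avro-capable Sqoop can write a table import straight to Avro data files. A hedged sketch (connection details are placeholders; check your Sqoop version for the flag):

  sqoop import \
    --connect jdbc:mysql://db.example.com/sales \
    --username reporter \
    --table orders \
    --as-avrodatafile \
    --target-dir /data/orders_avro

The resulting files can then feed MapReduce, Pig or Hive jobs without any format conversion, which is the point the release notes are making.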

Another aspect of Cloudera’s support for the Hadoop ecosystem is its Cloudera University.

October 4, 2011

VinWiki Part 1: Building an intelligent Web app using Seam, Hibernate, RichFaces, Lucene and Mahout

Filed under: Lucene,Mahout,Recommendation — Patrick Durusau @ 7:57 pm

VinWiki Part 1: Building an intelligent Web app using Seam, Hibernate, RichFaces, Lucene and Mahout

From the webpage:

This is the first post in a four part series about a wine rating and recommendation Web application, named VinWiki, built using open source technology. The purpose of this series is to document key design and implementation decisions, which may be of interest to anyone wanting to build an intelligent Web application using Java technologies. The end result will not be a 100% functioning Web application, but will have enough functionality to prove the concepts.

I thought about Lars Marius and his expertise at beer evaluation when I saw this series. Not that Lars would need it but it looks like the sort of thing you could build to recommend things you know something about, and like. Whatever that may be. 😉

September 20, 2011

Running Mahout in the Cloud using Apache Whirr

Filed under: Cloud Computing,Hadoop,Mahout — Patrick Durusau @ 7:51 pm

Running Mahout in the Cloud using Apache Whirr

From the post:

This blog shows you how to run Mahout in the cloud, using Apache Whirr. Apache Whirr is a promising Apache incubator project for quickly launching cloud instances, from Hadoop to Cassandra, HBase, ZooKeeper and so on. I will show you how to set up a Hadoop cluster and run Mahout jobs both via the command line and Whirr’s Java API (version 0.4).
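
For a sense of what that setup involves, a Whirr cluster is defined in a small properties file. A hedged sketch (property names follow the Whirr recipes of that era; credentials are placeholders):

  # hadoop.properties – a minimal Whirr cluster recipe
  whirr.cluster-name=mahout-demo
  whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,2 hadoop-datanode+hadoop-tasktracker
  whirr.provider=aws-ec2
  whirr.identity=${env:AWS_ACCESS_KEY_ID}
  whirr.credential=${env:AWS_SECRET_ACCESS_KEY}

Then whirr launch-cluster --config hadoop.properties brings the cluster up and whirr destroy-cluster --config hadoop.properties tears it down again.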

Running Mahout in the cloud with Apache Whirr will prepare you for using Whirr or similar tools to run services in the cloud.

September 10, 2011

SearchWorkings

Filed under: ElasticSearch,Lucene,Mahout,Solr — Patrick Durusau @ 6:02 pm

SearchWorkings

From the About Us page:

SearchWorkings.org was created by a bunch of really passionate search technology professionals who realised that the world (read: other search professionals) doesn’t have a single point of contact or comprehensive resource where they can learn and talk about all the exciting new developments in the wonderful world of open source search solutions. These professionals all work at JTeam, a leading supplier of high-quality custom-built applications and end-to-end solutions provider, and moreover a market leader when it comes to search solutions.

A wide variety of materials: whitepapers, articles, forums (Lucene, Solr, ElasticSearch, Mahout), training videos, news, and blogs.

You do have to register/join (free) to get access to the good stuff.

August 18, 2011

Calling Mahout from Clojure

Filed under: Clojure,Mahout — Patrick Durusau @ 6:51 pm

Calling Mahout from Clojure

From the post:

Mahout is a set of libraries for running machine learning processes, such as recommendation, clustering and categorisation.

The libraries work against an abstract model that can be anything from a file to a full Hadoop cluster. This means you can start playing around with small data sets in files, a local database, a Hadoop cluster or a custom data store.

After a bit of research, it turned out not to be too complex to call Mahout via any JVM language. When you compile and install Mahout, the libraries are installed into your local Maven cache. This makes it very easy to include them in any JVM type project.
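
For example, a Maven-based project would declare something like this (version number illustrative for the Mahout of that era):

  <dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>0.5</version>
  </dependency>

Leiningen, sbt and the other JVM build tools resolve the same coordinates, which is why Clojure can pick the libraries up so easily.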

Concludes with two interesting references:

Visualizing Mahout’s output with Clojure and Incanter

Monte Carlo integration with Clojure and Mahout

August 8, 2011

Mahout: Scaleable Data Mining for Everybody

Filed under: Mahout — Patrick Durusau @ 6:22 pm

Mahout: Scaleable Data Mining for Everybody by Ted Dunning.

Has to be the most entertaining and accessible presentation on classification I have seen to date.

Ted is a co-author of Mahout in Action with Sean Owen, Robin Anil, and Ellen Friedman.

If they had more of this sort of thing during the pledge drives to support public television I would bet that their numbers would be better. At least among a certain crowd! 😉

August 5, 2011

Mahout: Hands on!

Filed under: Artificial Intelligence,Hadoop,Machine Learning,Mahout — Patrick Durusau @ 7:06 pm

Mahout: Hands on!

From the tutorial description at OSCON 2011:

Mahout is an open source machine learning library from Apache. At the present stage of development, it is evolving with a focus on collaborative filtering/recommendation engines, clustering, and classification.

There is no user interface, or a pre-packaged distributable server or installer. It is, at best, a framework of tools intended to be used and adapted by developers. The algorithms in this “suite” can be used in applications ranging from recommendation engines for movie websites to designing early warning systems in credit risk engines supporting the cards industry out there.

This tutorial aims at helping you set up Mahout to run on a Hadoop setup. The instructor will walk you through the basic idea behind each of the algorithms. Having done that, we’ll take a look at how it can be run on some of the large-sized datasets and how it can be used to solve real world problems.

If your site or smartphone app or viral Facebook app collects data which you really want to use a lot more productively, this session is for you!

Not the only resource on Mahout you will want but an excellent place to start.

July 4, 2011

Visualizing Mahout’s Output…

Filed under: Clojure,Mahout,Visualization — Patrick Durusau @ 6:05 pm

Visualizing Mahout’s output with Clojure and Incanter

From the post:

Some Clojure code to visualize clusters built using Apache Mahout’s implementation of the K-Means clustering algorithm.

The code retrieves the output of the algorithm (clustered-points and centroids) from HDFS, builds a Clojure friendly representation of the output (a map and a couple of lazy-seqs) and finally uses Incanter’s wrapper around JFreeChart to visualize the results.

Another tool for data miners and visualizers.

March 10, 2011

Mahout/Hadoop on Amazon EC2 – part 1 – Installation

Filed under: Hadoop,Mahout — Patrick Durusau @ 8:11 am

Mahout/Hadoop on Amazon EC2 – part 1 – Installation

The first of 6 posts where Danny Bickson walks through use of Mahout/Hadoop on Amazon EC2.

Other posts in the series:

Mahout on Amazon EC2 – part 2 – Running Hadoop on a single node

Mahout on Amazon EC2 – part 3 – Debugging

Hadoop on Amazon EC2 – Part 4 – Running on a cluster

Mahout on Amazon EC2 – part 5 – installing Hadoop/Mahout on high performance instance (CentOS/RedHat)

Tunning Hadoop configuration for high performance – Mahut on Amazon EC2

While you are here, take some time to look around. Lots of other interesting material on “distributed/parallel large scale algorithms and applications.”

March 4, 2011

ApacheCon NA 2011

Filed under: Cassandra,Cloud Computing,Conferences,CouchDB,HBase,Lucene,Mahout,Solr — Patrick Durusau @ 7:17 am

ApacheCon NA 2011

Proposals: Be sure to submit your proposal no later than Friday, 29 April 2011 at midnight Pacific Time.

7-11 November 2011, Vancouver

From the website:

This year’s conference theme is “Open Source Enterprise Solutions, Cloud Computing, and Community Leadership”, featuring dozens of highly-relevant technical, business, and community-focused sessions aimed at beginner, intermediate, and expert audiences that demonstrate specific professional problems and real-world solutions that focus on “Apache and …”:

  • … Enterprise Solutions (from ActiveMQ to Axis2 to ServiceMix, OFBiz to Chemistry, the gang’s all here!)
  • … Cloud Computing (Hadoop, Cassandra, HBase, CouchDB, and friends)
  • … Emerging Technologies + Innovation (Incubating projects such as Libcloud, Stonehenge, and Wookie)
  • … Community Leadership (mentoring and meritocracy, GSoC and related initiatives)
  • … Data Handling, Search + Analytics (Lucene, Solr, Mahout, OODT, Hive and friends)
  • … Pervasive Computing (Felix/OSGi, Tomcat, MyFaces Trinidad, and friends)
  • … Servers, Infrastructure + Tools (HTTP Server, SpamAssassin, Geronimo, Sling, Wicket and friends)

February 21, 2011

TF-IDF Weight Vectors With Lucene And Mahout

Filed under: Authoring Topic Maps,Lucene,Mahout — Patrick Durusau @ 6:43 am

How To Easily Build And Observe TF-IDF Weight Vectors With Lucene And Mahout

From the website:

You have a collection of text documents, and you want to build their TF-IDF weight vectors, probably before doing some clustering on the collection or other related tasks.

You would like to be able, for instance, to see which tokens have the biggest TF-IDF weights in any given document of the collection.

Lucene and Mahout can help you to do that almost in a snap.

Why is this important for topic maps?

Wikipedia reports:

The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document’s relevance given a user query. (http://en.wikipedia.org/wiki/Tf-idf, cited in this posting)
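
To see the weighting itself in miniature, separate from the Lucene/Mahout pipeline the linked post builds, here is a small Java sketch of one common tf-idf variant on a toy corpus (corpus and variant are mine, for illustration):

  import java.util.*;

  public class TfIdfSketch {
      public static void main(String[] args) {
          // Toy corpus: each document is a list of tokens.
          List<List<String>> corpus = Arrays.asList(
              Arrays.asList("topic", "maps", "merge", "subjects"),
              Arrays.asList("search", "engines", "rank", "subjects"),
              Arrays.asList("topic", "maps", "and", "search"));
          int n = corpus.size();

          // Document frequency: in how many documents does each term occur?
          Map<String, Integer> df = new HashMap<>();
          for (List<String> doc : corpus)
              for (String term : new HashSet<>(doc))
                  df.merge(term, 1, Integer::sum);

          // tf-idf(t, d) = (count of t in d / |d|) * log(N / df(t))
          List<String> doc = corpus.get(0);
          for (String term : new HashSet<>(doc)) {
              double tf = Collections.frequency(doc, term) / (double) doc.size();
              double idf = Math.log(n / (double) df.get(term));
              System.out.printf("%s: %.4f%n", term, tf * idf);
          }
      }
  }

Run it and “merge” outscores “subjects” in the first document: both occur once there, but “subjects” also appears in the second document, so its inverse document frequency is lower. That is the intuition from the Wikipedia definition above.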

Knowing the important terms in a document collection is one step towards a useful topic map. It may not be definitive, but it is a step in the right direction.

February 7, 2011

Weka – Data Mining

Filed under: Mahout,Natural Language Processing — Patrick Durusau @ 7:10 am

Weka

From the website:

Weka 3: Data Mining Software in Java

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

I would say it is under active development/use since the mailing list archives have an average of about 315 posts per month.

Yes, approximately 315 posts per month.

Another tool for your topic map toolbox!

February 3, 2011

PyBrain: The Python Machine Learning Library

PyBrain: The Python Machine Learning Library

From the website:

PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms.

PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. In fact, we came up with the name first and later reverse-engineered this quite descriptive “Backronym”.

How is PyBrain different?

While there are a few machine learning libraries out there, PyBrain aims to be a very easy-to-use modular library that can be used by entry-level students but still offers the flexibility and algorithms for state-of-the-art research. We are constantly working on more and faster algorithms, developing new environments and improving usability.

What PyBrain can do

PyBrain, as its written-out name already suggests, contains algorithms for neural networks, for reinforcement learning (and the combination of the two), for unsupervised learning, and evolution. Since most of the current problems deal with continuous state and action spaces, function approximators (like neural networks) must be used to cope with the large dimensionality. Our library is built around neural networks in the kernel and all of the training methods accept a neural network as the to-be-trained instance. This makes PyBrain a powerful tool for real-life tasks.

Another tool kit to assist in the construction of topic maps.

And another likely contender for the Topic Map Competition!

MALLET: MAchine Learning for LanguagE Toolkit – Topic Map Competition (TMC) Contender?

MALLET: MAchine Learning for LanguagE Toolkit

From the website:

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

MALLET includes sophisticated tools for document classification: efficient routines for converting text to “features”, a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.

In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text. Algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields. These methods are implemented in an extensible system for finite state transducers.

Topic models are useful for analyzing large collections of unlabeled text. The MALLET topic modeling toolkit contains efficient, sampling-based implementations of Latent Dirichlet Allocation, Pachinko Allocation, and Hierarchical LDA.

Many of the algorithms in MALLET depend on numerical optimization. MALLET includes an efficient implementation of Limited Memory BFGS, among many other optimization methods.

In addition to sophisticated Machine Learning applications, MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system of “pipes”, which handle distinct tasks such as tokenizing strings, removing stopwords, and converting sequences into count vectors.

An add-on package to MALLET, called GRMM, contains support for inference in general graphical models, and training of CRFs with arbitrary graphical structure.
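
To give a feel for the pipe system described above, here is a hedged sketch of training an LDA topic model with MALLET (class names from MALLET 2.x as I recall them; verify the exact signatures against the MALLET javadocs):

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.regex.Pattern;

  import cc.mallet.pipe.CharSequence2TokenSequence;
  import cc.mallet.pipe.Pipe;
  import cc.mallet.pipe.SerialPipes;
  import cc.mallet.pipe.TokenSequence2FeatureSequence;
  import cc.mallet.pipe.iterator.StringArrayIterator;
  import cc.mallet.topics.ParallelTopicModel;
  import cc.mallet.types.InstanceList;

  public class MalletTopics {
      public static void main(String[] args) throws Exception {
          // A "pipe" chain: tokenize, then map tokens to feature indices.
          ArrayList<Pipe> pipes = new ArrayList<Pipe>();
          pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
          pipes.add(new TokenSequence2FeatureSequence());

          InstanceList instances = new InstanceList(new SerialPipes(pipes));
          instances.addThruPipe(new StringArrayIterator(new String[] {
              "topic maps merge subjects across vocabularies",
              "latent dirichlet allocation models documents as topic mixtures",
              "search engines rank documents for user queries" }));

          // LDA with 2 topics; 1.0 and 0.01 are conventional smoothing values.
          ParallelTopicModel lda = new ParallelTopicModel(2, 1.0, 0.01);
          lda.addInstances(instances);
          lda.setNumIterations(200);
          lda.estimate();

          // Print the top 5 words per topic.
          Object[][] topWords = lda.getTopWords(5);
          for (int t = 0; t < topWords.length; t++)
              System.out.println("topic " + t + ": " + Arrays.toString(topWords[t]));
      }
  }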

Another tool to assist in the authoring of a topic map from a large data set.

It would be interesting, but beyond the scope of the topic maps class, to organize a competition around several of the natural language processing packages.

To have a common data set, released on X date, with topic maps due, say, within 24 hours (there is a TV show with that in the title, or so I am told).

Will have to give that some thought.

Could be both interesting and entertaining.

January 7, 2011

Apache Mahout – Data Mining Class

Filed under: Data Mining,Mahout — Patrick Durusau @ 5:27 am

Apache Mahout – Data Mining Class at the Illinois Institute of Technology, by Dr. David Grossman.

Grossman is the co-author of Information Retrieval: Algorithms and Heuristics (The Information Retrieval Series, 2nd Edition).

The class was organized by Grant Ingersoll, see: Apache Mahout Catching on in Academia.

Definitely worth a visit to round out your data mining skills.

November 30, 2010

Apache Mahout – Website

Filed under: Classification,Clustering,Data Mining,Mahout,Pattern Recognition,Software — Patrick Durusau @ 8:54 pm

Apache Mahout

From the website:

Apache Mahout’s goal is to build scalable machine learning libraries. By scalable we mean:

Scalable to reasonably large data sets. Our core algorithms for clustering, classification and batch based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However, we do not restrict contributions to Hadoop based implementations: contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance also for non-distributed algorithms.

Current capabilities include:

  • Collaborative Filtering
  • User and Item based recommenders
  • K-Means, Fuzzy K-Means clustering
  • Mean Shift clustering
  • Dirichlet process clustering
  • Latent Dirichlet Allocation
  • Singular value decomposition
  • Parallel Frequent Pattern mining
  • Complementary Naive Bayes classifier
  • Random forest decision tree based classifier
  • High performance Java collections (previously Colt collections)
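
As a taste of the command-line side, k-means clustering is typically driven something like this (a hedged sketch; flag spellings vary across Mahout releases, and the input is a directory of vectors such as TF-IDF vectors):

  bin/mahout kmeans \
    --input /data/tfidf-vectors \
    --clusters /data/initial-centroids \
    --output /data/kmeans-output \
    --numClusters 10 \
    --maxIter 20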

A topic maps class will only have enough time to show some examples of using Mahout. Perhaps an informal group?
