Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 8, 2011

Counting Triangles

Filed under: Hadoop,MPP,Vectors — Patrick Durusau @ 8:14 pm

Counting Triangles

From the post:

Recently I’ve heard from or read about people who use Hadoop because their analytic jobs can’t achieve the same level of performance in a database. In one case, a professor I visited said his group uses Hadoop to count triangles “because a database doesn’t perform the necessary joins efficiently.”

Perhaps I’m being dense but I don’t understand why a database doesn’t efficiently support these use-cases. In fact, I have a hard time believing they wouldn’t perform better in a columnar, MPP database like Vertica – where memory and storage are laid out and accessed efficiently, query jobs are automatically tuned by the optimizer, and expression execution is vectorized at run-time. There are additional benefits when several, similar jobs are run or data is updated and the same job is re-run multiple times. Of course, performance isn’t everything; ease-of-use and maintainability are important factors that Vertica excels at as well.

Since the “gauntlet was thrown down”, to steal a line from Good Will Hunting, I decided to take up the challenge of computing the number of triangles in a graph (and include the solutions in GitHub so others can experiment – more on this at the end of the post).

I don’t think you will be surprised at the outcome, but the exercise is instructive in a number of ways. Primarily: don’t assume performance without testing. If all the bellowing leads to more competition and close work on software and algorithms, I think there will be some profit from it.
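
To make the join argument concrete, here is a toy Python sketch (mine, not the post’s) of triangle counting as a join: with adjacency sets, joining edges (a,b) and (b,c) against edge (c,a) collapses into a set intersection.

    edges = {(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)}   # undirected, each edge once

    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    triangles = 0
    for a, b in edges:
        # every common neighbor of a and b closes a triangle; this
        # intersection is the "join" a database would execute
        triangles += len(adj[a] & adj[b])

    print(triangles // 3)   # each triangle is counted once per edge -> 2

Whether a column store, Hadoop, or ten lines of Python wins depends on the size of the edge list, which is rather the point of the exercise.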

Machine Learning on Hadoop at Huffington Post | AOL

Filed under: Hadoop,Machine Learning — Patrick Durusau @ 8:12 pm

Machine Learning on Hadoop at Huffington Post | AOL

Nice slide deck on creating a pluggable platform for testing large numbers of algorithms and then selecting the best.
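
The pattern is easy to mimic at small scale. A minimal sketch, assuming scikit-learn in place of the deck’s Hadoop-based platform (the candidate set and data are invented):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)

    candidates = {                      # "pluggable": add or remove freely
        "logistic": LogisticRegression(max_iter=1000),
        "naive_bayes": GaussianNB(),
        "tree": DecisionTreeClassifier(random_state=0),
    }

    scores = {name: cross_val_score(model, X, y, cv=5).mean()
              for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    print(best, round(scores[best], 3))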

October 3, 2011

Our big data/total data survey is now live [the 451 Group]

Filed under: BigData,Data Warehouse,Hadoop,NoSQL,SQL — Patrick Durusau @ 7:05 pm

Our big data/total data survey is now live [the 451 Group]

The post reads in part:

The 451 Group is conducting a survey into end user attitudes towards the potential benefits of ‘big data’ and new and emerging data management technologies.

In return for your participation, you will receive a copy of a forthcoming long-format report introducing Total Data, The 451 Group’s concept for explaining the changing data management landscape, which will include the results. Respondents will also have the opportunity to become members of TheInfoPro’s peer network.

Just a word about the survey.

Question 10 reads:

What is the primary platform used for storing and querying from each of the following types of data?

Good question, but you have to choose one of three answers or put “other” (and say what “other” means); you are not allowed to skip any type of data.

Data types are:

  • Customer Data
  • Transactional Data
  • Online Transaction Data
  • Domain-specific Application Data (e.g., Trade Data in Financial Services, and Call Data in Telecoms)
  • Application Log Data
  • Web Log Data
  • Network Log Data
  • Other Log Files
  • Social Media/Online Data
  • Search Log
  • Audio/Video/Graphics
  • Other Documents/Content

Same thing happens for Question 11:

What is the primary platform used for each of the following analytics workloads?

Eleven required answers that I won’t bother to repeat here.

As a consultant I really don’t have serious iron/data on the premises, but that doesn’t seem to have occurred to the survey designers. Nor, apparently, has the possibility that even a major IT installation might not have all forms of data or analytics.

My solution? I just marked Hadoop on Questions 10 and 11 so I could get to the rest of the survey.

Q12. Which are the top three benefits associated with each of the following data management technologies?

Q13. Which are the top three challenges associated with each of the following data management technologies?

Q14. To what extent do you agree with the following statements? (Includes: “The enterprise data warehouse is the single version of the truth for business intelligence.”)

Questions 12 – 14 all require answers to all options.

Note the clever first agree/disagree statement for Q.14.

Someday someone will conduct a useful survey of business opinions about big data and likely responses to it.

Hopefully with a technical survey of the various options and their advantages/disadvantages.

Please let me know when you see it, I would like to point people to it.

(I completed this form on Sunday, October 2, 2011, around 11 AM Eastern time.)

October 2, 2011

Apache Flume incubation wiki

Filed under: Flume,Hadoop,Probabilistic Models — Patrick Durusau @ 6:34 pm

Apache Flume incubation wiki

From the website:

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop’s HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.

A number of resources for Flume.

Will “data flows” as the dominant means of accessing data be a consequence of an environment where a “local copy” of data is no longer meaningful, or an enabler of such an environment? Or both?

I think topic maps would do well to develop models for streaming and perhaps probabilistic merging or even probabilistic creation of topics/associations from data streams. Static structures only give the appearance of certainty.
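
What probabilistic merging might look like is anyone’s guess; here is one toy reading, in Python, where streamed observations merge into a topic when set similarity crosses a (hypothetical) threshold:

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    MERGE_AT = 0.5          # made-up confidence threshold; tune per stream
    topics = []             # each topic is a set of identifiers seen so far

    def observe(identifiers):
        # fold one streamed observation into the topic store
        for topic in topics:
            if jaccard(topic, identifiers) >= MERGE_AT:
                topic |= identifiers        # probably the same subject: merge
                return
        topics.append(set(identifiers))     # otherwise, a new topic

    observe({"http://example.org/pd", "mailto:pd@example.org"})
    observe({"http://example.org/pd", "mailto:pd@example.org", "tel:555-0100"})
    print(len(topics))   # 1: the second observation merged (similarity 2/3)

A real system would keep the merge score on the association rather than discarding it, so the uncertainty stays visible.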

Machine Learning with Hadoop

Filed under: Data Mining,Hadoop,Machine Learning — Patrick Durusau @ 6:34 pm

Machine Learning with Hadoop by Josh Patterson.

Very current (Sept. 2011) review of Hadoop, data mining and related issues. Plus pointers to software projects such as Lumberyard, which deals with terabyte-sized time series data.

September 29, 2011

Hadoop for Archiving Email

Filed under: Hadoop,Lucene,Solr — Patrick Durusau @ 6:35 pm

Hadoop for Archiving Email by Sunil Sitaula.

When I saw the title of this post I started wondering if the NSA was having trouble with archiving all my email. 😉

From the post:

This post will explore a specific use case for Apache Hadoop, one that is not commonly recognized, but is gaining interest behind the scenes. It has to do with converting, storing, and searching email messages using the Hadoop platform for archival purposes.

Most of us in IT/Datacenters know the challenges behind storing years of corporate mailboxes and providing an interface for users to search them as necessary. The sheer volume of messages, the content structure and its complexity, the migration processes, and the need to provide timely search results stand out as key points that must be addressed before embarking on an actual implementation. For example, in some organizations all email messages are stored in production servers; others just create a backup dump and store them in tapes; and some organizations have proper archival processes that include search features. Regardless of the situation, it is essential to be able to store and search emails because of the critical information they hold as well as for legal compliance, investigation, etc. That said, let’s look at how Hadoop could help make this process somewhat simple, cost effective, manageable, and scalable.

The post concludes:

In this post I have described the conversion of email files into sequence files and their storage in HDFS, and looked at how to search through them to output results. Given the “simply add a node” scalability feature of Hadoop, it is very straightforward to add more storage as well as search capacity. Furthermore, Hadoop clusters are built using commodity hardware, the software itself is open source, and the framework makes it simple to implement specific use cases, which leads to an overall solution that is very cost effective compared to a number of existing software products that provide similar capabilities. The search portion of the solution, however, is very rudimentary. In part 2, I will look at using Lucene/Solr for indexing and searching in a more standard and robust way.

Read part one and get ready for part 2!

And start thinking about what indexing/search capabilities you are going to want.
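
Part one’s core move, packing many small email files into large key/value records, is easy to picture. A rough Python stand-in for the SequenceFile idea (not the post’s code; the paths are invented):

    import glob
    import struct

    def pack(eml_pattern, out_path):
        # length-prefixed (key, value) records: key = path, value = raw message
        with open(out_path, "wb") as out:
            for path in sorted(glob.glob(eml_pattern)):
                with open(path, "rb") as f:
                    body = f.read()
                key = path.encode("utf-8")
                out.write(struct.pack(">I", len(key)) + key)
                out.write(struct.pack(">I", len(body)) + body)

    pack("archive/*.eml", "mail.pack")   # hypothetical locations

HDFS prefers a few large files to millions of small ones, which is why the conversion step comes before everything else.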


Update: Hadoop for Archiving Email – Part 2

September 26, 2011

Twitter Storm: Open Source Real-time Hadoop

Filed under: Hadoop,NoSQL,Storm — Patrick Durusau @ 6:55 pm

Twitter Storm: Open Source Real-time Hadoop by Bienvenido David III.

From the post:

Twitter has open-sourced Storm, its distributed, fault-tolerant, real-time computation system, at GitHub under the Eclipse Public License 1.0. Storm is the real-time processing system developed by BackType, which is now under the Twitter umbrella. The latest package available from GitHub is Storm 0.5.2, and is mostly written in Clojure.

Storm provides a set of general primitives for doing distributed real-time computation. It can be used for “stream processing”, processing messages and updating databases in real-time. This is an alternative to managing your own cluster of queues and workers. Storm can be used for “continuous computation”, doing a continuous query on data streams and streaming out the results to users as they are computed. It can also be used for “distributed RPC”, running an expensive computation in parallel on the fly.

See the post for links, details, quotes, etc.
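
If Storm’s spout/bolt vocabulary is new to you, the shape of the idea fits in a few lines of Python. This is only a single-process mimic, not the Storm API:

    from collections import Counter

    def spout():                        # pretend this tails a live feed
        for word in ["storm", "hadoop", "storm", "flume"]:
            yield word

    counts = Counter()                  # state a bolt updates continuously

    def bolt(word):
        counts[word] += 1

    for tup in spout():                 # Storm runs this loop distributed,
        bolt(tup)                       # fault-tolerant, and in parallel
    print(counts.most_common(1))        # [('storm', 2)]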

My bet is that topologies are going to be data set specific. You?

BTW, I don’t think the local coffee shop offers free access to its cluster. Will have to check with them next week.

September 22, 2011

LexisNexis Open-Sources its Hadoop Alternative

Filed under: Hadoop,HPCC — Patrick Durusau @ 6:16 pm

LexisNexis Open-Sources its Hadoop Alternative

Ryan Rosario writes:

A month ago, I wrote about alternatives to the Hadoop MapReduce platform and HPCC was included in that article. For more information, see here.

LexisNexis has open-sourced its alternative to Hadoop, called High Performance Computing Cluster. The code is available on GitHub. For years the code was restricted to LexisNexis Risk Solutions. The system contains two major components:

  • Thor (Thor Data Refinery Cluster) is the data processing framework. It “crunches, analyzes and indexes huge amounts of data a la Hadoop.”
  • Roxie (Roxie Rapid Data Delivery Cluster) is more like a data warehouse and is designed with quick querying in mind for frontends.

The protocol that drives the whole process is the Enterprise Control Language which is said to be faster and more efficient than Hadoop’s version of MapReduce. A picture is a much better way to show how the system works. Below is a diagram from the Gigaom article from which most of this information originates.

Interesting times ahead.

Slides and replay from “R and Hadoop” webinar

Filed under: Hadoop,R — Patrick Durusau @ 6:10 pm

Slides and replay from “R and Hadoop” webinar

From the post:

So … there’s clearly a lot of interest in integrating R and Hadoop. Today’s webinar was a record-setter for Revolution Analytics, with more than 1000 people signing up to learn how to access Hadoop data from R with the packages from the open-source RHadoop project. If you didn’t catch the live webinar, don’t fret: the slides and replay are available for download, and you can learn more about the RHadoop packages in the white paper from CTO David Champagne, “Advanced ‘Big Data’ Analytics with R and Hadoop“.

I don’t know what the average attendance numbers are for webinars, but I suspect they are below 1,000. Way below 1,000. Does anyone have those numbers handy?

September 20, 2011

Running Mahout in the Cloud using Apache Whirr

Filed under: Cloud Computing,Hadoop,Mahout — Patrick Durusau @ 7:51 pm

Running Mahout in the Cloud using Apache Whirr

From the post:

This blog shows you how to run Mahout in the cloud, using Apache Whirr. Apache Whirr is a promising Apache incubator project for quickly launching cloud instances, from Hadoop to Cassandra, Hbase, Zookeeper and so on. I will show you how to setup a Hadoop cluster and run Mahout jobs both via the command line and Whirr’s Java API (version 0.4).

Running Mahout in the cloud with Apache Whirr will prepare you for using Whirr or similar tools to run services in the cloud.
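
If memory serves, a Whirr recipe of that era was a small properties file plus two commands; treat the property names below as approximate and check them against the Whirr documentation:

    # hadoop.properties -- a sketch, not a verified recipe
    whirr.cluster-name=myhadoopcluster
    whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,5 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}

    # then, from the shell:
    # whirr launch-cluster --config hadoop.properties
    # whirr destroy-cluster --config hadoop.properties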

September 17, 2011

Got Hadoop?

Filed under: Bioinformatics,Biomedical,Hadoop — Patrick Durusau @ 8:12 pm

Got Hadoop?

This is going to require free registration at Genomeweb but I think it will be worth it. (Genomeweb also offers $premium content but I haven’t tried any of it, yet.)

Nice overview of Hadoop in genome research.

Annoying in that it lists the following projects, sans hyperlinks. I have supplied the project listing with hyperlinks, just in case you are interested in Hadoop and genome research.

Crossbow: Whole genome resequencing analysis; SNP genotyping from short reads
Contrail: De novo assembly from short sequencing reads
Myrna: Ultrafast short read alignment and differential gene expression from large RNA-seq datasets
PeakRanger: Cloud-enabled peak caller for ChIP-seq data
Quake: Quality-aware detection and sequencing error correction tool
BlastReduce: High-performance short read mapping (superseded by CloudBurst)
CloudBLAST*: Hadoop implementation of NCBI’s Blast
MrsRF: Algorithm for analyzing large evolutionary trees

*CloudBLAST was the only project without a webpage or similar source of information. This is a paper, perhaps the original paper on the technique. Searching for any of these techniques reveals a wealth of material on using Hadoop in bioinformatics.

Topic maps can capture your path through data (think of bread crumbs or string). So when today you think, “I should have gone left, rather than right”, you can retrace your steps and take another path. Try that with a Google search. If you are lucky, you may get the same ads. 😉
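
The bread-crumb idea is almost embarrassingly simple to sketch. A toy Python version (all names invented):

    trail = []    # the string through the labyrinth

    def visit(subject, via=None):
        trail.append((subject, via))

    visit("hadoop")
    visit("hdfs", via="component-of")
    visit("sequence-files", via="stored-as")

    # "I should have gone left": drop the last two steps and branch again
    trail = trail[:-2]
    visit("mapreduce", via="component-of")
    print([s for s, _ in trail])   # ['hadoop', 'mapreduce']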

You can also share your bread crumbs or string with others, but that is a story for another time.

September 15, 2011

Enterprise-level Cloud at no charge

Filed under: Cloud Computing,Hadoop — Patrick Durusau @ 7:49 pm

Enterprise-level cloud at no charge

From September 12 – November 11, 2011.

Signup deadline: 28 October 2011

From the webpage:

  • 64-bit Copper and 32-bit Silver machines
  • Virtual machines to run Linux® (Red Hat or Novell SUSE) or Microsoft® Windows® Server 2003/2008
  • Select IBM software images
  • 1 block (256 gigabytes) of persistent storage

For the promotional period, IBM will suppress charges for use of these services. You may terminate the promotion at any time, although we don’t think you’ll want to! At the end of the promotional period, your account will transition to a standard pay-as-you-go account at the rates effective at that time. You may elect to add on more services, including, but not limited to:

  • Reserved virtual machine instances
  • On-boarding support
  • Premium and Advanced Premium support options
  • Virtual Private Network services
  • Additional images from IBM software brands, along with offerings from independent software vendors
  • Access to other IBM SmartCloud data centers
  • Additional services that are regularly being added to the IBM SmartCloud Enterprise offering

With these features and more, don’t miss this opportunity to try the IBM SmartCloud. With our enterprise-level servers, software and services, we offer a cloud computing infrastructure that you can approach with confidence. The IBM SmartCloud is built on the skills, experience and best practices gained from years of managing and operating security-rich data centers for enterprises and public institutions around the world.

If you want to try the cloud computing waters or IBM offerings, this could be your chance.

September 14, 2011

Yahoo! Hadoop Tutorial

Filed under: Hadoop,MapReduce,Pig — Patrick Durusau @ 7:03 pm

Yahoo! Hadoop Tutorial

From the webpage:

Welcome to the Yahoo! Hadoop Tutorial. This tutorial includes the following materials designed to teach you how to use the Hadoop distributed data processing environment:

  • Hadoop 0.18.0 distribution (includes full source code)
  • A virtual machine image running Ubuntu Linux and preconfigured with Hadoop
  • VMware Player software to run the virtual machine image
  • A tutorial which will guide you through many aspects of Hadoop’s installation and operation.

The tutorial is divided into seven modules, designed to be worked through in order. They can be accessed from the links below.

  1. Tutorial Introduction
  2. The Hadoop Distributed File System
  3. Getting Started With Hadoop
  4. MapReduce
  5. Advanced MapReduce Features
  6. Related Topics
  7. Managing a Hadoop Cluster
  8. Pig Tutorial

You can also download this tutorial as a single .zip file and burn a CD for use, and easy distribution, offline.
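
If you want a taste before downloading the VM, the canonical first exercise is word count. Here is a hedged Hadoop Streaming version in Python (the tutorial’s own examples use the Java API, and jar paths vary by release):

    # mapper.py -- emit (word, 1) for every word on stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py -- input arrives sorted by key, so identical words are adjacent
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current and current is not None:
            print(current + "\t" + str(total))
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(current + "\t" + str(total))

    # run with something like:
    # hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    #   -input in -output out -mapper mapper.py -reducer reducer.py \
    #   -file mapper.py -file reducer.py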

September 13, 2011

Hadoop Tuesdays!

Filed under: Hadoop — Patrick Durusau @ 7:13 pm

Hadoop Tuesdays with Joe McKendrick: 7-Part Live Webinar Series

I know, I know, the “registration” form is fairly lame. I was tempted to put down “hospitality/travel” as my industry just to see if I started getting free trip offers or something. 😉

Can’t say for sure until I have attended the sessions, but this doesn’t have the appearance of a really technical set of webinars. Still, I need to know the “Cliff’s Notes” version of the story circulating about Hadoop in governmental and business circles.

From the webpage:

Data experts Informatica and Cloudera are jointly producing a 7-part webinar series, called Hadoop Tuesdays. Host Joe McKendrick is an independent industry analyst and contributing editor to Database Trends & Applications.

Why register for Hadoop Tuesdays webinar series?

  • You are interested in learning more about Hadoop but don’t know how to get started
  • You have goals for storing, processing and extracting value from unstructured data so that you can combine and unleash the value of both structured and unstructured data
  • You wish to form a roadmap for adding Hadoop to your environment.

What will be covered in the Hadoop Tuesdays webinar series?

  • What is Hadoop?
  • What are the most popular use cases driving Hadoop adoption?
  • What are the biggest expected benefits?
  • How should you evaluate Hadoop and fit it into your information architecture?

September 9, 2011

R and Hadoop

Filed under: Hadoop,R,RHadoop — Patrick Durusau @ 7:06 pm

From Revolution Analytics:

White paper: Advanced ‘Big Data’ Analytics with R and Hadoop

Webinar: Leveraging R in Hadoop Environments, 21 September 2011, 10:00–10:30 AM Pacific Time

RHadoop: the RHadoop project on GitHub

From GitHub:

RHadoop is a collection of three R packages that allow users to manage and analyze data with Hadoop. The packages have been implemented and tested in Cloudera’s distribution of Hadoop (CDH3) with R 2.13.0. RHadoop consists of the following packages:

  • rmr – functions providing Hadoop MapReduce functionality in R
  • rhdfs – functions providing file management of the HDFS from within R
  • rhbase – functions providing database management for the HBase distributed database from within R

What Can Apache Hadoop do for You?

Filed under: Hadoop — Patrick Durusau @ 7:21 am

What Can Apache Hadoop do for You?

This is extremely amusing! I liked the non-correct answers better than I did the correct ones. 😉

From the description:

What can Apache Hadoop do for you? Watch the video to find out what other people think Apache Hadoop is (or isn’t).

Share your own ideas about Apache Hadoop. Get out your video camera or phone, channel your inner filmmaker and submit a short clip or mini film of what you think Apache Hadoop can do for you. Let your creative energy flow: It can be sincere, funny, educational or witty.

By participating, you could be selected as a Cloudera Hero and win a four-day trip to San Francisco to spend a day hacking code with Doug Cutting, co-founder of the Apache Hadoop project. Find out how to get entered.

Don’t have a video to enter? Help us pick the winner and give your favorite contestant a thumbs up.

Go to www.facebook.com/cloudera and click the “Be a Cloudera Hero for Apache Hadoop” tab for full details.

September 6, 2011

Hadoop Fatigue — Alternatives to Hadoop

Filed under: GraphLab,Hadoop,HPCC,MapReduce,Spark,Storm — Patrick Durusau @ 7:15 pm

Hadoop Fatigue — Alternatives to Hadoop

Can you name six (6) alternatives to Hadoop? Or formulate why you choose Hadoop over those alternatives?

From the post:

After working extensively with (Vanilla) Hadoop professionally for the past 6 months, and at home for research, I have found several nagging issues with Hadoop that have convinced me to look elsewhere for everyday use and certain applications. For these applications, the thought of writing a Hadoop job makes me take a deep breath. Before I continue, I will say that I still love Hadoop and the community.

  • Writing Hadoop jobs in Java is very time consuming because everything must be a class, and many times these classes extend several other classes or extend multiple interfaces; the Java API is very bloated. Adding a simple counter to a Hadoop job becomes a chore of its own.
  • Documentation for the bloated Java API is sufficient, but not the most helpful.
  • HDFS is complicated and has plenty of issues of its own. I recently heard a story about data loss in HDFS just because the IP address block used by the cluster changed.
  • Debugging a failure is a nightmare; is it the code itself? Is it a configuration parameter? Is it the cluster or one/several machines on the cluster? Is it the filesystem or disk itself? Who knows?!
  • Logging is verbose to the point that finding errors is like finding a needle in a haystack. That is, if you are even lucky to have an error recorded! I’ve had plenty of instances where jobs fail and there is absolutely nothing in the stdout or stderr logs.
  • Large clusters require a dedicated team to keep it running properly, but that is not surprising.
  • Writing a Hadoop job becomes a software engineering task rather than a data analysis task.

Hadoop will be around for a long time, and for good reason. MapReduce cannot solve every problem (fact), and Hadoop can solve even fewer problems (opinion?). After dealing with some of the innards of Hadoop, I’ve often said to myself “there must be a better way.” For large corporations that routinely crunch large amounts of data using MapReduce, Hadoop is still a great choice. For research, experimentation, and everyday data munging, one of these other frameworks may be better if the advantages of HDFS are not necessarily imperative:

Out of the six alternatives, I haven’t seen BashReduce or Disco, so I need to look those up.

Ah, the other alternatives: GraphLab, HPCC, Spark, and Preview of Storm: The Hadoop of Realtime Processing.

It is a pet peeve of mine that some authors force me to search for links they could just as easily have included. The New York Times, of all places, refers to websites and does not include the URLs. And that is for paid subscribers.
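
Back to the counter complaint above: for perspective, from the Streaming side (rather than the Java API) a counter is one line written to stderr using the reporter protocol. A sketch:

    # mapper.py -- increments a counter visible on the job tracker page
    import sys

    for line in sys.stdin:
        if not line.strip():
            sys.stderr.write("reporter:counter:QC,blank_lines,1\n")
            continue
        sys.stdout.write(line)

Which rather supports the author’s larger point: the friction is in the Java API, not in MapReduce itself.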

August 31, 2011

BigData University

Filed under: BigData,Hadoop — Patrick Durusau @ 7:45 pm

BigData University

From the website:

Why Register?

Easy and Affordable! Learning Hadoop and other Big Data technologies has never been more affordable! Many courses are FREE!

Latest Industry Trends! Acquire valuable skills and get updated about the industry’s latest trends right here. Today!

Learn from the Experts! Big Data University offers education about Hadoop and other technologies by the industry’s best!

Learn at your Own Pace! Find everything right here when you need it and from wherever you are.

A demonstration of the power of social media. I start complaining about use of the word “big” (Big Learning 2011) and someone starts a university using “big” in the name. Random chance? I don’t think so. 😉

I’m signing up for the free Hadoop course.

BTW, did you notice that they have a “Creating a course in DB2 University” offering? (Yeah, I know, they forgot to change all the names. They will get around to it.)

Will have to see what the course software looks like. Could have possibilities for topic maps.

August 30, 2011

Graph Processing versus Graph Databases

Filed under: Graphs,Hadoop,Neo4j,Pregel — Patrick Durusau @ 7:10 pm

Graph Processing versus Graph Databases

Jim Webber describes the different problems addressed by graph processing and graph databases. Worth reading so you will pick the correct tool for the problem you are facing.

Webber visualizes the following distinctions:

What Pregel and Hadoop have in common is their tendency towards the data analytics (OLAP) end of the spectrum, rather than being focussed on transaction processing. This is in stark contrast to graph databases like Neo4j which optimise storage and querying of connected data for online transaction processing (OLTP) scenarios – much like a regular RDBMS, only with a more expressive and powerful data model.

See the post for the graphic.

August 28, 2011

The Future of Hadoop

Filed under: Hadoop,HBase — Patrick Durusau @ 7:55 pm

The Future of Hadoop – with Doug Cutting and Jeff Hammerbacher

From the description:

With a community of over 500 contributors, Apache Hadoop and related projects are evolving at an ever increasing rate. Join the co-creator of Apache Hadoop, Doug Cutting, and Cloudera’s Chief Scientist, Jeff Hammerbacher, for a discussion of the most exciting new features being developed by the Apache Hadoop community.

The primary focus of the webinar will be the evolution from the Apache Hadoop kernel to the complete Apache Bigtop platform. We’ll cover important changes in the kernel, especially high availability for HDFS and the separation of cluster resource management and MapReduce job scheduling.

We’ll discuss changes throughout the platform, including support for Snappy-based compression and the Avro data file format in all components, performance and security improvements across all components, and additional supported operating systems. Finally, we’ll discuss new additions to the platform, including Mahout for machine learning and HCatalog for metadata management, as well as important improvements to existing platform components like HBase and Hive.

Both the slides and the recording of this webinar are available but I would go for the recording.

One of the most informative and entertaining webinars I have seen, ever. Cites actual issue numbers and lays out how Hadoop is on the road to becoming a stack of applications that offer a range of data handling and analysis capabilities.

If you are interested in data processing/analysis at any scale, you need to see this webinar.

August 15, 2011

Visitor Conversion with Bayesian Discriminant and Hadoop

Filed under: Bayesian Models,Hadoop,Marketing — Patrick Durusau @ 7:31 pm

Visitor Conversion with Bayesian Discriminant and Hadoop

From the post:

You have lots of visitors on your eCommerce web site and obviously you would like most of them to convert. By conversion, I mean buying your product or service. It could also mean the visitor taking an action which potentially could financially benefit the business, e.g., opening an account or signing up for an email newsletter. In this post, I will cover some predictive data mining techniques that may facilitate a higher conversion rate.

Wouldn’t it be nice if, for any ongoing session, you could predict the odds of the visitor converting during the session, based on the visitor’s behavior during the session?

Armed with such information, you could take different kinds of actions to enhance the chances of conversion. You could entice the visitor with a discount offer. Or you could engage the visitor in a live chat to answer any product related questions.

There are simple predictive analytic techniques to predict the probability of a visitor converting. When the predicted probability crosses a predefined threshold, the visitor could be considered to have high potential of converting.
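
A minimal sketch of that threshold idea, assuming scikit-learn’s naive Bayes in place of the post’s hand-rolled discriminant (the features and threshold are invented):

    from sklearn.naive_bayes import GaussianNB

    # toy session features: [pages_viewed, minutes_on_site, cart_adds]
    X = [[2, 1, 0], [8, 12, 1], [3, 2, 0], [10, 15, 2], [1, 1, 0], [7, 9, 1]]
    y = [0, 1, 0, 1, 0, 1]              # 1 = visitor converted

    model = GaussianNB().fit(X, y)

    session = [[6, 8, 1]]               # the visitor currently on the site
    p = model.predict_proba(session)[0][1]
    if p > 0.7:                          # hypothetical threshold
        print("offer live chat, p = %.2f" % p)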

I would ask the “conversion” question more broadly.

That is, how can we dynamically change the model of subject identity in a topic map to match a user’s expectations? What user behavior would we track, and how, to reach such an end?

My reasoning: users are more interested in, and more likely to support, topic maps that reinforce their world views. And selling someone topic map output they find agreeable is easier than selling output they find disagreeable.

August 14, 2011

Recommendation Engine Powered By Hadoop

Filed under: Hadoop,Recommendation — Patrick Durusau @ 7:09 pm

Recommendation Engine Powered By Hadoop by Pranab Ghosh.

Nice set of slides on the use of Hadoop to power a recommendation engine. (Subject recognition is implicit, which makes it difficult to fashion explicit merging across different sources.)

At least on Slideshare the additional resource links aren’t working on the slides. So, for your reading pleasure:

Pranab’s blog: Mawazo. A number of interesting posts on NoSQL and related technologies.

Including Pranab’s two-part blog post on Hadoop and recommendation engines:

Recommendation Engine Powered by Hadoop (Part 1)

Recommendation Engine Powered by Hadoop (Part 2)

and, Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman.
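
The co-occurrence heart of many Hadoop recommenders fits in a toy Python sketch (mine, not Pranab’s; the distributed versions shard exactly this computation):

    from collections import Counter
    from itertools import combinations

    baskets = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]

    cooc = Counter()
    for basket in baskets:
        for i, j in combinations(sorted(basket), 2):
            cooc[(i, j)] += 1           # items bought together often
            cooc[(j, i)] += 1           # are recommended together

    def recommend(item, n=2):
        scored = [(j, c) for (i, j), c in cooc.items() if i == item]
        return sorted(scored, key=lambda pair: -pair[1])[:n]

    print(recommend("a"))   # [('b', 2), ('c', 2)]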

August 11, 2011

Apache Hadoop and HBase

Filed under: Hadoop,HBase — Patrick Durusau @ 6:31 pm

Apache Hadoop and HBase by Todd Lipcon.

Another introductory slide deck. I don’t know which one is going to click for any individual, so I am including it here.

August 10, 2011

Parallel Processing Using the Map Reduce Programming Model

Filed under: Hadoop,MapReduce — Patrick Durusau @ 7:12 pm

Parallel Processing Using the Map Reduce Programming Model

Demonstrates parallel processing using map reduce on the IMDB. It starts with a Perl script (Robert should like that) but then moves to Java to use Hadoop. The code is listed in the article and is available at GitHub.

Nothing ground breaking, but it will help some users gain confidence with Hadoop/MapReduce on a familiar data set.
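
The same pattern, minus Hadoop, in a few lines of Python with invented film data: map over records in parallel, then reduce the partial results.

    from collections import Counter
    from functools import reduce
    from multiprocessing import Pool

    records = [("Alien", 1979), ("Blade Runner", 1982), ("Tron", 1982)]

    def mapper(record):
        title, year = record
        return Counter({year: 1})       # emit (year, 1)

    def reducer(a, b):
        return a + b                    # sum the per-chunk counts

    if __name__ == "__main__":
        with Pool(2) as pool:
            partials = pool.map(mapper, records)
        print(reduce(reducer, partials))   # Counter({1982: 2, 1979: 1})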

August 9, 2011

Parallel Data Warehouse News and Hadoop Interoperability Plans

Filed under: Hadoop — Patrick Durusau @ 7:56 pm

Parallel Data Warehouse News and Hadoop Interoperability Plans

From the MS SQL Server Team Blog:

In the data deluge faced by businesses, there is also an increasing need to store and analyze vast amounts of unstructured data including data from sensors, devices, bots and crawlers. By many accounts, almost 80% of what businesses store is unstructured data – and this volume is predicted to grow exponentially over the next decade. We have entered the age of Big Data. Our customers have been asking us to help store, manage, and analyze both structured and unstructured data – in particular, data stored in Hadoop environments. As a first step, we will soon release a Community Technology Preview (CTP) of two new Hadoop connectors – one for SQL Server and one for PDW. The connectors provide interoperability between SQL Server/PDW and Hadoop environments, enabling customers to transfer data between Hadoop and SQL Server/PDW. With these connectors, customers can more easily integrate Hadoop with their Microsoft Enterprise Data Warehouses and Business Intelligence solutions to gain deeper business insights from both structured and unstructured data.

I don’t have a SQL Server or a Parallel Data Warehouse so someone will need to contribute comments on the new Hadoop connectors when they appear.

I will note that the more seamless data interchange becomes, the greater user satisfaction with the tools they are called upon to use. Which is a good thing for long term market share.

August 6, 2011

Real-time Streaming Analysis for Hadoop and Flume

Filed under: Flume,Hadoop,Interface Research/Design — Patrick Durusau @ 6:52 pm

Real-time Streaming Analysis for Hadoop and Flume

From the description:

This talk introduces an open-source SQL-based system for continuous or ad-hoc analysis of streaming data built on top of the Flume data collection platform for Hadoop.

Big data analytics based on Hadoop often require aggregating data in a large data store like HDFS or HBase, and then running periodic MapReduce processes over this data set. Getting “near real time” results requires running MapReduce jobs more frequently over smaller data sets, which has a practical frequency limit based on the size of the data and complexity of the analytics; the lower bound on analysis latency is on the order of minutes. This has spawned a trend of building custom analytics directly into the data ingestion pipeline, enabling some streaming operations such as early alerting, index generation, or real-time tuning of ad systems before performing less time-sensitive (but more comprehensive) analysis in MapReduce.

We present an open-source tool which extends the Flume data collection platform with a SQL-like language for analysis over streaming event-based data sets. We will discuss the motivation for the system, its architecture and interaction with Flume, potential applications, and examples of its usage.

Deeply awesome! Just wish I had been present to see the demo!

Makes me think of topic map creation from data streams with the ability to test different subject identity merging conditions, in real time. Rather than repetitive stories about a helicopter being downed, you get a summary report and a listing by location and time of publication of repetitive reports. Say one screen full of content and access to the noise. Better use of your time?
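
A toy stand-in for the talk’s continuous SQL, in Python: a rolling per-key count over the last sixty seconds of events, which is roughly the summary-report-over-the-noise idea above (window size and names invented).

    import time
    from collections import Counter, deque

    WINDOW = 60.0             # seconds; a made-up window size
    events = deque()          # (timestamp, key) pairs inside the window

    def observe(key, now=None):
        # the standing "SELECT key, COUNT(*) ... GROUP BY key" over the window
        now = time.time() if now is None else now
        events.append((now, key))
        while events and events[0][0] < now - WINDOW:
            events.popleft()
        return Counter(k for _, k in events)

    print(observe("helicopter-report", now=0.0))     # count 1
    print(observe("helicopter-report", now=10.0))    # count 2
    print(observe("helicopter-report", now=100.0))   # first two expired -> 1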

Machine learning problem settings

Filed under: Hadoop,Machine Learning,MapReduce — Patrick Durusau @ 6:51 pm

Machine learning problem settings

From the post:

After a few successful Apache Mahout projects the goal of this lecture was to introduce students to some of the basic concepts and problems encountered today in a world where huge datasets are generally available and are easy to process with Apache Hadoop. As such the course is targeted at an entry level audience – thorough treatment of the mathematical background of latest machine learning technology is left to the machine learning research groups in Potsdam, at TU Berlin and the neural information processing group at TU.

Slides and exercises that will be useful alongside, or as a warm-up for, the Introduction to Artificial Intelligence Stanford class.

August 5, 2011

Mahout: Hands on!

Filed under: Artificial Intelligence,Hadoop,Machine Learning,Mahout — Patrick Durusau @ 7:06 pm

Mahout: Hands on!

From the tutorial description at OSCON 2011:

Mahout is an open source machine learning library from Apache. At the present stage of development, it is evolving with a focus on collaborative filtering/recommendation engines, clustering, and classification.

There is no user interface, or a pre-packaged distributable server or installer. It is, at best, a framework of tools intended to be used and adapted by developers. The algorithms in this “suite” can be used in applications ranging from recommendation engines for movie websites to designing early warning systems in credit risk engines supporting the cards industry out there.

This tutorial aims at helping you set up Mahout to run on a Hadoop setup. The instructor will walk you through the basic idea behind each of the algorithms. Having done that, we’ll take a look at how it can be run on some of the large-sized datasets and how it can be used to solve real world problems.

If your site or smartphone app or viral facebook app collects data which you really want to use a lot more productively, this session is for you!

Not the only resource on Mahout you will want but an excellent place to start.

August 1, 2011

Create Hadoop clusters the easy peasy way with Pallet

Filed under: Hadoop — Patrick Durusau @ 3:50 pm

Create Hadoop clusters the easy peasy way with Pallet

From the post:

Setting up a Hadoop cluster is usually a pretty involved task. There are certain rules about how the cluster is to be configured. These rules need to be followed strictly for the cluster to work. For example, some nodes need to know how to talk to the other nodes, and some nodes need to allow other nodes to talk to them. Go ahead and check out the official instructions, or this more detailed tutorial on setting up multi-node Hadoop clusters. In this article we describe a solution that will create a fully functional hadoop cluster on any public cloud with very few steps, and in a very flexible way.

Anyone with a cloud account have any comments on this approach to creating Hadoop clusters?

July 26, 2011

Oozie by Example

Filed under: Hadoop,Oozie — Patrick Durusau @ 6:23 pm

Oozie by Example

From the post:

In our previous article [Introduction to Oozie] we described the Oozie workflow server and presented an example of a very simple workflow. We also described deployment and configuration of workflows for Oozie and tools for starting, stopping and monitoring Oozie workflows.

In this article we will describe a more complex Oozie example, which will allow us to discuss more Oozie features and demonstrate how to use them.

More on workflow for Hadoop!
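
For readers who skipped the first article: an Oozie workflow is an XML graph of actions. A minimal, hedged example of the shape (paths, names, and schema version are illustrative, not taken from the post):

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.1">
      <start to="mr-step"/>
      <action name="mr-step">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <property>
              <name>mapred.input.dir</name>
              <value>/user/demo/input</value>
            </property>
            <property>
              <name>mapred.output.dir</name>
              <value>/user/demo/output</value>
            </property>
          </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Map-reduce step failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>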
