Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 21, 2011

Comparing High Level MapReduce Query Languages

Filed under: Hadoop,Hive,JAQL,MapReduce,Pig — Patrick Durusau @ 7:27 pm

Comparing High Level MapReduce Query Languages by R.J. Stewart, P.W. Trinder, and H-W. Loidl.

Abstract:

The MapReduce parallel computational model is of increasing importance. A number of High Level Query Languages (HLQLs) have been constructed on top of the Hadoop MapReduce realization, primarily Pig, Hive, and JAQL. This paper makes a systematic performance comparison of these three HLQLs, focusing on scale up, scale out and runtime metrics. We further make a language comparison of the HLQLs focusing on conciseness and computational power. The HLQL development communities are engaged in the study, which revealed technical bottlenecks and limitations described in this document, and it is impacting their development.

A good starting place for watching these three HLQLs as they develop, which they no doubt will. One expects other candidates to join them, so familiarity with this paper may speed their evaluation as well.

November 19, 2011

Microsoft drops Dryad; bets on Hadoop

Filed under: Dryad,Hadoop — Patrick Durusau @ 10:21 pm

Microsoft drops Dryad; bets on Hadoop

In a November 11 post on the Windows HPC Team Blog, officials said that Microsoft had provided a minor update to the latest test build of the Dryad code as part of Windows High Performance Computing (HPC) Pack 2008 R2 Service Pack (SP) 3. But they also noted that “this will be the final (Dryad) preview and we do not plan to move forward with a production release.”

….

But it now appears Microsoft is putting all its big-data eggs in the Hadoop framework basket. Microsoft officials said a month ago that Microsoft was working with Hortonworks to develop both a Windows Azure and a Windows Server distribution of Hadoop. A Community Technology Preview (CTP) of the Windows Azure version is due out before the end of this calendar year; the Windows Server test build of Hadoop is due some time in 2012.

It might be a good time for the Hadoop community, which now includes Microsoft, to talk about which parts of the syntax and semantics of the Hadoop ecosystem can be standardized.

It would be nice to see competition between Hadoop products on the basis of performance and features, not on who has learned the oddities of particular implementations. The public versions could set a baseline, and commercial versions would be pressed to better it.

After all, there are those who contend that commercial code is measurably better than other types of code. Perhaps it is time to put their faith to the test.

November 15, 2011

Hadoop and Data Quality, Data Integration, Data Analysis

Filed under: Data Analysis,Data Integration,Hadoop — Patrick Durusau @ 7:58 pm

Hadoop and Data Quality, Data Integration, Data Analysis by David Loshin.

From the post:

If you have been following my recent thread, you will of course be anticipating this note, in which we examine the degree to which our favorite data-oriented activities are suited to the elastic yet scalable massive parallelism promised by Hadoop. Let me first summarize the characteristics of problems or tasks that are amenable to the programming model:

  1. Two-Phased (2-φ) – one or more iterations of “computation” followed by “reduction.”
  2. Big data – massive data volumes preclude using traditional platforms
  3. Data parallel (Data-||) – little or no data dependence
  4. Task parallel (Task-||) – task dependence collapsible within phase-switch from Map to Reduce
  5. Unstructured data – No limit on requiring data to be structured
  6. Communication “light” – requires limited or no inter-process communication except what is required for phase-switch from Map to Reduce

OK, so I happen to agree with David’s conclusions. (See his post for the table.) But that isn’t the only reason I posted this note.

Rather, I think this sort of careful analysis lends itself to test cases, which we can post and share along with specifications of the tasks performed.

Much cleaner and more enjoyable than debates measured by who can sink the lowest, fastest.

Test cases to suggest, anyone?
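
As a seed for that discussion, the canonical word count job is about the simplest task that exhibits all six characteristics: two phases, data parallel, communication light. A minimal sketch using the stock Hadoop (0.20-era) API, nothing specific to David’s post:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // "Computation" phase: emit (word, 1) for every token, no inter-task communication.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          ctx.write(word, ONE);
        }
      }
    }
  }

  // "Reduction" phase: after the phase switch, sum the counts for each word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      ctx.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class); // local pre-aggregation keeps communication "light"
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```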

VC funding for Hadoop and NoSQL tops $350m

Filed under: Funding,Hadoop,NoSQL — Patrick Durusau @ 7:58 pm

VC funding for Hadoop and NoSQL tops $350m

From the post:

451 Research has today published a report looking at the funding being invested in Apache Hadoop- and NoSQL database-related vendors. The full report is available to clients, but below is a snapshot of the report, along with a graphic representation of the recent up-tick in funding.

According to our figures, between the beginning of 2008 and the end of 2010 $95.8m had been invested in the various Apache Hadoop- and NoSQL-related vendors. That figure now stands at more than $350.8m, up 266%.

That statistic does not really do justice to the sudden uptick of interest, however. The figures indicate that funding for Apache Hadoop- and NoSQL-related firms has more than doubled since the end of August, at which point the total stood at $157.5m.

It takes more work than winning the lottery, but on the other hand it is encouraging to see that kind of money being spread around.

But past funding is just that: past funding. Encouraging, yes, but the real task is creating solutions that attract future funding.

Suggestions/comments?

November 13, 2011

Hadoop Distributions And Kids’ Soccer

Filed under: BigData,Hadoop — Patrick Durusau @ 10:00 pm

Hadoop Distributions And Kids’ Soccer

From the post:

The big players are moving in for a piece of the Big Data action. IBM, EMC, and NetApp have stepped up their messaging, in part to prevent startup upstarts like Cloudera from cornering the Apache Hadoop distribution market. They are all elbowing one another to get closest to “pure Apache” while still “adding value.” Numerous other startups have emerged, with greater or lesser reliance on, and extensions or substitutions for, the core Apache distribution. Yahoo! has found a funding partner and spun its team out, forming a new firm called Hortonworks, whose claim to fame begins with an impressive roster responsible for much of the code in the core Hadoop projects. Think of the Doctor Seuss children’s book featuring that famous elephant, and you’ll understand the name.

While we’re talking about kids – ever watch young kids play soccer? Everyone surrounds the ball. It takes years to learn their position on the field and play accordingly. There are emerging alphas, a few stragglers on the sidelines hoping for a chance to play, community participants – and a clear need for governance. Tech markets can be like that, and with 1600 attendees packing late June’s Hadoop Summit event, all of those scenarios were playing out. Leaders, new entrants, and the big silents, like the absent Oracle and Microsoft.

The ball is indeed in play; the open source Apache Hadoop stack today boasts “customers” among numerous Fortune 500 companies, running critical business workloads on Hadoop clusters constructed for data scientists and business sponsors – and very often with little or no participation by IT and the corporate data governance and enterprise architecture teams. Thousands of servers, multiple petabytes of data, and growing numbers of users are increasingly to be seen.

…. (after many amusing and interesting observations)

That governance will be critical for the future. Other Apache and non-Apache projects, like HBase, Hive, Zookeeper, Pig, Flume, Sqoop, Oozie, et al all have their own agendas. In Apache locution, each has its own “committers” – owners of the code lines – and the task of integrating disparate pieces – each on its own time line – will fall to somebody. Will your distribution owner test the combination of the particular ones you’re using? If not, that will be up to you. One of the biggest barriers to open source adoption so far has been precisely that degree of required self-integration. Gartner’s second half 2010 open source survey showed that more than half of the 547 surveyed organizations have adopted OSS solutions as part of their IT strategy. Data management and integration is the top initiative they name; 46% of surveyed companies named it. This is where the game is.

Topic maps as a mechanism for easing the process of self-integration?

Would certainly be more agile than searching blog posts, user email lists, FAQs, etc.

The Marriage of R and Hadoop

Filed under: Hadoop,R — Patrick Durusau @ 10:00 pm

The marriage of R and Hadoop: Revolution Analytics at Hadoop World by Josh Wills.

Josh covers the presentation of David Champagne, CTO of Revolution Analytics, titled: Leveraging R in Hadoop Environments.

The slides are very good but not for the C-Suite. More for people who want to get enthusiastic about using R and Hadoop.

What is a “Hadoop”? Explaining Big Data to the C-Suite

Filed under: Hadoop,Humor — Patrick Durusau @ 9:59 pm

What is a “Hadoop”? Explaining Big Data to the C-Suite by Vincent Granville.

From the post:

Keep hearing about Big Data and Hadoop? Having a hard time understanding what is behind the curtain?

Hadoop is an emerging framework for Web 2.0 and enterprise businesses who are dealing with data deluge challenges – store, process and analyze large amounts of data as part of their business requirements.

The continuous challenge online is how to improve site relevance, performance, understand user behavior, and predictive insight. This is a never ending arms race as each firm tries to become the portal of choice in a fast changing world. Take for instance, the competitive world of travel. Every site has to improve at analytics and machine learning as the contextual data is changing by the second- inventory, pricing, recommendations, economic conditions, natural disasters etc.

Hadoop has rapidly emerged as a viable platform for Big Data analytics. Many experts believe Hadoop will subsume many of the data warehousing tasks presently done by traditional relational systems. This will be a huge shift in how IT apps are engineered.

I don’t find it helpful to confuse Big Data and Hadoop. They are very different things, and it does not help folks in the C-Suite to confuse them. Unless, of course, you are selling Hadoop services and want people to think Hadoop every time they hear Big Data.

But I am really too close to Hadoop and related technologies to reliably judge explanations for the C-Suite, so why not have a poll? Nothing fancy; just comment using one of the following descriptions, or make up your own if mine aren’t enough:

I think the “What is ‘Hadoop’?…” explanation:

  1. Is as good as IT explanations get for the C-Suite.
  2. Is adequate but could use (specify changes)
  3. “Everyone […] is now dumber for having listened to it. I award you no points and may God have mercy on your soul.” (Billy Madison)

Comments?

November 12, 2011

Recommendation with Apache Mahout in CDH3

Filed under: Hadoop,Mahout — Patrick Durusau @ 8:46 pm

Recommendation with Apache Mahout in CDH3 by Josh Patterson.

From the introduction:

The amount of information we are exposed to on a daily basis is far outstripping our ability to consume it, leaving many of us overwhelmed by the amount of new content we have available. Ideally we’d like machines and algorithms to help us find the more interesting (for us individually) things so we more easily focus our attention on items of relevance.

Have you ever been recommended a friend on Facebook or an item you might be interested in on Amazon? If so then you’ve benefitted from the value of recommendation systems. Recommendation systems apply knowledge discovery techniques to the problem of making recommendations that are personalized for each user. Recommendation systems are one way we can use algorithms to help us sort through the masses of information to find the “good stuff” in a very personalized way.

Due to the explosion of web traffic and users the scale of recommendation poses new challenges for recommendation systems. These systems face the dual challenge of producing high quality recommendations while also calculating recommendations for millions of users. In recent years collaborative filtering (CF) has become popular as a way to effectively meet these challenges. CF techniques start off by analyzing the user-item matrix to identify relationships between different users or items and then use that information to produce recommendations for each user.

If you were using this post as an introduction to recommendation with Apache Mahout, is there anything you would change, subtract from, or add to it? If anything.

I am working on my own answer to that question, but I am curious what you think.

I want to use this and similar material in a graduate library course, more to demonstrate the principles than to turn any of the students into Hadoop hackers. (Although that would be a nice result as well.)
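
For anyone who wants to kick the tires alongside the post, the single-machine (“Taste”) side of Mahout is the quickest way to watch a recommender work before moving on to the Hadoop-based jobs the post describes. A minimal sketch against the Mahout 0.5-era API; the file name and neighborhood size are my own choices, not the post’s:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class QuickRecommender {
  public static void main(String[] args) throws Exception {
    // One "userID,itemID,preference" triple per line
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Similarity between users, computed over the items they have both rated
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

    // Consider the 10 most similar users as the neighborhood
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 recommendations for user 42
    List<RecommendedItem> items = recommender.recommend(42, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}
```

The distributed, item-based jobs described in the post cover the same ground at HDFS scale.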

November 11, 2011

Uncovering mysteries of InputFormat (Hadoop)

Filed under: Hadoop,MapReduce — Patrick Durusau @ 7:39 pm

Uncovering mysteries of InputFormat: Providing better control for your Map Reduce execution, by Boris Lublinsky and Mike Segel.

From the post:

As more companies adopt Hadoop, there is a greater variety in the types of problems for which Hadoop’s framework is being utilized. As the various scenarios where Hadoop is applied grow, it becomes critical to control how and where map tasks are executed. One key to such control is custom InputFormat implementation.

The InputFormat class is one of the fundamental classes in the Hadoop Map Reduce framework. This class is responsible for defining two main things:

  • Data splits
  • Record reader

Data split is a fundamental concept in Hadoop Map Reduce framework which defines both the size of individual Map tasks and its potential execution server. The Record Reader is responsible for actual reading records from the input file and submitting them (as key/value pairs) to the mapper. There are quite a few publications on how to implement a custom Record Reader (see, for example, [1]), but the information on splits is very sketchy. Here we will explain what a split is and how to implement custom splits for specific purposes.

See the post for the details.

Something for you to explore over the weekend!
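
If you want a concrete starting point for that exploration, here is roughly what the skeleton looks like: a FileInputFormat subclass that refuses to split its files, so each file becomes exactly one map task, while reusing the stock line reader. A hedged sketch, not the authors’ code:

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// One map task per file: useful when a file must be processed as a unit,
// for example a format that cannot be cut at arbitrary byte offsets.
public class WholeFileTextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // Returning false makes getSplits() produce one split per file,
    // which in turn controls how many map tasks run and what each one sees.
    return false;
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    // Still read line by line; only the splitting policy changed.
    return new LineRecordReader();
  }
}
```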

Postgres Plus Connector for Hadoop

Filed under: Hadoop,MapReduce,Pig,PostgreSQL,SQL — Patrick Durusau @ 7:39 pm

Postgres Plus Connector for Hadoop

From the webpage:

The Postgres Plus Connector for Hadoop provides developers easy access to massive amounts of SQL data for integration with or analysis in Hadoop processing clusters. Now large amounts of data managed by PostgreSQL or Postgres Plus Advanced Server can be accessed by Hadoop for analysis and manipulation using Map-Reduce constructs.

EnterpriseDB recognized early on that Hadoop, a framework allowing distributed processing of large data sets across computer clusters using a simple programming model, was a valuable and complimentary data processing model to traditional SQL systems. Map-Reduce processing serves important needs for basic processing of extremely large amounts of data and SQL based systems will continue to fulfill their mission critical needs for complex processing of data well into the future. What was missing was an easy way for developers to access and move data between the two environments.

EnterpriseDB has created the Postgres Plus Connector for Hadoop by extending the Pig platform (an engine for executing data flows in parallel on Hadoop) and using an EnterpriseDB JDBC driver to allow users the ability to load the results of a SQL query into Hadoop where developers can operate on that data using familiar Map-Reduce programming. In addition, data from Hadoop can also be moved back into PostgreSQL or Postgres Plus Advanced Server tables.

A private beta is in progress, see the webpage for details and to register.

Plus, there is a webinar, Tuesday, November 29, 2011 11:00 am Eastern Standard Time (New York, GMT-05:00), Extending SQL Analysis with the Postgres Plus Connector for Hadoop. Registration at the webpage as well.

A step toward seamless data environments. Much like word processing today, minus the visible “dot” commands of old: the same commands for the most part, but unseen. Data is going in the same direction. You will specify the desired results, and the environment will take care of access, processors, operations, and the like. Tables will appear as tables because you have chosen to view them as tables, etc.
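
The connector itself is in private beta, so there is no code to show yet, but Hadoop’s stock DBInputFormat illustrates the pattern the connector is meant to streamline: describe the table and columns, and let the framework partition the query across mappers. A sketch against a hypothetical PostgreSQL table (class, table, and column names are mine):

```java
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class SqlToHadoop {

  // Maps one row of the hypothetical "page_views" table.
  public static class PageView implements Writable, DBWritable {
    long userId;
    String url;

    public void readFields(ResultSet rs) throws SQLException {
      userId = rs.getLong("user_id");
      url = rs.getString("url");
    }
    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, userId);
      ps.setString(2, url);
    }
    public void readFields(java.io.DataInput in) throws IOException {
      userId = in.readLong();
      url = in.readUTF();
    }
    public void write(java.io.DataOutput out) throws IOException {
      out.writeLong(userId);
      out.writeUTF(url);
    }
  }

  public static void configure(Job job) {
    Configuration conf = job.getConfiguration();
    DBConfiguration.configureDB(conf, "org.postgresql.Driver",
        "jdbc:postgresql://dbhost/analytics", "hadoop", "secret");
    // table, conditions, orderBy, then the columns to read;
    // mappers receive (LongWritable row id, PageView) pairs.
    DBInputFormat.setInput(job, PageView.class, "page_views", null, "user_id",
        "user_id", "url");
    job.setInputFormatClass(DBInputFormat.class);
  }
}
```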

Hadoop for Data Analytics: Implementing a Weblog Parser

Filed under: Hadoop — Patrick Durusau @ 7:38 pm

Hadoop for Data Analytics: Implementing a Weblog Parser by Ira Agrawal.

From the post:

With the digitalization of the world, the data analytics function of extracting information or generating knowledge from raw data is becoming increasingly important. Parsing Weblogs to retrieve important information for analysis is one of the applications of data analytics. Many companies have turned to this application of data analytics for their basic needs.

For example, Walmart would want to analyze the bestselling product category for a region so that they could notify users living in that region about the latest products under that category. Another use case could be to capture the area details — using IP address information — about the regions that produce the most visits to their site.

All user transactions and on-site actions are normally captured in weblogs on a company’s websites. To retrieve all this information, developers must parse these weblogs, which are huge. While sequential parsing would be very slow and time consuming, parallelizing the parsing process makes it fast and efficient. But the process of parallelized parsing requires developers to split the weblogs into smaller chunks of data, and the partition of the data should be done in such a way that the final results will be consolidated without losing any vital information from the original data.

Hadoop’s MapReduce framework is a natural choice for parallel processing. Through Hadoop’s MapReduce utility, the weblog files can be split into smaller chunks and distributed across different nodes/systems over the cluster to produce their respective results. These results are then consolidated and the final results are obtained as per the user’s requirements.

The post walks you through the whole process, from setting up the Hadoop cluster to loading the logs and then parsing them. Not a bad introduction to Hadoop.
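
To give you the flavor before you read the full walkthrough, the parsing step boils down to a mapper that pulls fields out of each log line and emits keys for the reducer to total. A sketch under my own assumption of Apache common log format, not Agrawal’s code:

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes Apache common log format, e.g.:
// 127.0.0.1 - - [10/Oct/2011:13:55:36 -0700] "GET /products/123 HTTP/1.1" 200 2326
public class WeblogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final Pattern LOG_LINE = Pattern.compile(
      "^(\\S+) \\S+ \\S+ \\[[^\\]]+\\] \"\\S+ (\\S+) [^\"]*\" (\\d{3}) \\S+");
  private static final IntWritable ONE = new IntWritable(1);
  private final Text path = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    Matcher m = LOG_LINE.matcher(line.toString());
    if (!m.find()) {
      return; // skip malformed lines rather than failing the task
    }
    // m.group(1) is the client IP, the hook for the region analysis the post mentions
    path.set(m.group(2));
    ctx.write(path, ONE); // a summing reducer then totals hits per requested path
  }
}
```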

IT’s Next Hot Job: Hadoop Guru

Filed under: Hadoop,Jobs — Patrick Durusau @ 7:37 pm

IT’s Next Hot Job: Hadoop Guru by Doug Henschen, InformationWeek.

“We’re hiring, and we’re paying 10% more than the other guys.”

Those were the first words from Larry Feinsmith, managing director, office of the CIO, at JPMorgan Chase, in his Tuesday keynote address at Hadoop World in New York. Who JPMorgan Chase is hiring, specifically, are people with Hadoop skills, so Feinsmith was in the right place. More than 1,400 people were in the audience, and attendee polls indicated that at least three quarters of their organizations are already using Hadoop, the open source big data platform.

The “and we’re paying 10% more” bit was actually Feinsmith’s ad-libbed follow-on to the previous keynoter, Hugh Williams, VP of search, experience, and platforms at eBay. After explaining eBay’s Hadoop-based Cassini search engine project, Williams said his company is hiring Hadoop experts to help build out and run the tool.

Feinsmith’s core message was that Hadoop is hugely promising, maturing quickly, and might overlap the functionality of relational databases over the next three years. In fact, Hadoop World 2011 was a coming-out party of sorts, as it’s now clear that Hadoop will matter to more than just Web 2.0 companies like eBay, Facebook, Yahoo, AOL, and Twitter. A straight-laced financial giant with more than 245,000 employees, 24 million checking accounts, 5,500 branches, and 145 million credit cards in use, JPMorgan Chase lends huge credibility to that vision.

JP Morgan Chase has 25,000 IT employees, and it spends about $8 billion on IT each year–$4 billion on apps and $4 billion on infrastructure. The company has been working with Hadoop for more than three years, and it’s easy to see why. It has 150 petabytes (with a “p”) of data online, generated by trading operations, banking activities, credit card transactions, and some 3.5 billion logins each year to online banking and brokerage accounts.

The benefits of Hadoop? Massive scalability, schema-free flexibility to handle a variety of data types, and low cost. Hadoop systems built on commodity hardware now cost about $4,000 per node, according to Cloudera, the Hadoop enterprise support and management software provider (and the organizer and host of Hadoop World). With the latest nodes typically having 16 compute cores and 12 1-terabyte or 2-terabyte drives, that’s massive storage and compute capacity at a very low cost. In comparison, aggressively priced relational data warehouse appliances cost about $10,000 to $12,000 per terabyte.

OK, but what does Hadoop not have out of the box? Can you say cross-domain subject or data semantics? Some “expert – (insert your name)” is going to have to supply the semantics. You have to know the Hadoop ecosystem, but a firm background in mapping between semantic domains will make you a semantic “top gun.”

Clojure on Hadoop: A New Hope

Filed under: Cascalog,Clojure,Hadoop — Patrick Durusau @ 1:30 pm

Clojure on Hadoop: A New Hope by Chun Kuk.

From the post:

Factual’s U.S. Places dataset is built from tens of billions of signals. Our raw data is stored in HDFS and processed using Hadoop.

We’re big fans of the core Hadoop stack, however there is a dark side to using Hadoop. The traditional approach to building and running Hadoop jobs can be cumbersome. As our Director of Engineering once said, “there’s no such thing as an ad-hoc Hadoop job written in Java”.

Factual is a Clojure friendly shop, and the Clojure community led us to Cascalog. We were intrigued by its strength as an agile query language and data processing framework. It was easy to get started, which is a testament to Cascalog’s creator, Nathan Marz.

We were able to leverage Cascalog’s high-level features such as built-in joins and aggregators to abstract away the complexity of commonly performed queries and QA operations.

This article aims to illustrate Cascalog basics and core strengths. We’ll focus on how easy it is to run useful queries against data stored with different text formats such as csv, json, and even raw text.

Somehow, after that lead-in, I was disappointed by what followed.

I am curious what others think. As far as it goes, it is a good article on Clojure, but it doesn’t really reach the “core strengths” of Cascalog, does it?

November 8, 2011

Toad Virtual Expo – 11.11.11 – 24-hour Toad Event

Filed under: Conferences,Hadoop,HBase,Hive,MySQL,Oracle,Toad — Patrick Durusau @ 7:46 pm

Toad Virtual Expo – 11.11.11 – 24-hour Toad Event

From the website:

24 hours of Toad is here! Join us on 11.11.11, and take an around the world journey with Toad and database experts who will share database development and administration best practices. This is your chance to see new products and new features in action, virtually collaborate with other users – and Quest’s own experts, and get a first-hand look at what’s coming in the world of Toad.

If you are not going to see Immortals on 11.11.11, or are looking for something to do after the movie, drop in on the Toad Virtual Expo! 😉 (It doesn’t look like a “chick” movie anyway.)

Times:

Register today for Quest Software’s 24-hour Toad Virtual Expo and learn why the best just got better.

  1. Tokyo Friday, November 11, 2011 6:00 a.m. JST – Saturday, November 12, 2011 6:00 a.m. JST
  2. Sydney Friday, November 11, 2011 8:00 a.m. EDT – Saturday, November 12, 2011 8:00 a.m. EDT

  3. Tel Aviv Thursday, November 10, 2011 11:00 p.m. IST – Friday, November 11, 2011 11:00 p.m. IST
  4. Central Europe Thursday, November 10, 2011 10:00 p.m. CET – Friday, November 11, 2011 10:00 p.m. CET
  5. London Thursday, November 10, 2011 9:00 p.m. GMT – Friday, November 11, 2011 9:00 p.m. GMT
  6. New York Thursday, November 10, 2011 4:00 p.m. EST – Friday, November 11, 2011 4:00 p.m. EST
  7. Los Angeles Thursday, November 10, 2011 1:00 p.m. PST – Friday, November 11, 2011 1:00 p.m. PST

The site wasn’t long on specifics but this could be fun!

Toad for Cloud Databases (Quest Software)

Filed under: BigData,Cloud Computing,Hadoop,HBase,Hive,MySQL,Oracle,SQL Server — Patrick Durusau @ 7:45 pm

Toad for Cloud Databases (Quest Software)

From the news release:

The data management industry is experiencing more disruption than at any other time in more than 20 years. Technologies around cloud, Hadoop and NoSQL are changing the way people manage and analyze data, but the general lack of skill sets required to manage these new technologies continues to be a significant barrier to mainstream adoption. IT departments are left without a clear understanding of whether development and DBA teams, whose expertise lies with traditional technology platforms, can effectively support these new systems. Toad® for Cloud Databases addresses the skill-set shortage head-on, empowering database professionals to directly apply their existing skills to emerging Big Data systems through an easy-to-use and familiar SQL-based interface for managing non-relational data. 

News Facts:

  • Toad for Cloud Databases is now available as a fully functional, commercial-grade product, for free, at www.quest.com/toad-for-cloud-databases.  Toad for Cloud Databases enables users to generate queries, migrate, browse, and edit data, as well as create reports and tables in a familiar SQL view. By simplifying these tasks, Toad for Cloud Databases opens the door to a wider audience of developers, allowing more IT teams to experience the productivity gains and cost benefits of NoSQL and Big Data.
  • Quest first released Toad for Cloud Databases into beta in June 2010, making the company one of the first to provide a SQL-based database management tool to support emerging, non-relational platforms. Over the past 18 months, Quest has continued to drive innovation for the product, growing its list of supported platforms and integrating a UI for its bi-directional data connector between Oracle and Hadoop.
  • Quest’s connector between Oracle and Hadoop, available within Toad for Cloud Databases, delivers a fast and scalable method for data transfer between Oracle and Hadoop in both directions. The bidirectional characteristic of the utility enables organizations to take advantage of Hadoop’s lower cost of storage and analytical capabilities. Quest also contributed the connector to the Apache Hadoop project as an extension to the existing SQOOP framework, and is also available as part of Cloudera’s Distribution Including Apache Hadoop. 
  • Toad for Cloud Databases today supports:
    • Apache Hive
    • Apache HBase
    • Apache Cassandra
    • MongoDB
    • Amazon SimpleDB
    • Microsoft Azure Table Services
    • Microsoft SQL Azure, and
    • All Open Database Connectivity (ODBC)-enabled relational databases (Oracle, SQL Server, MySQL, DB2, etc)


Anything that eases the transition to cloud computing is going to be welcome. Toad being free will increase the ranks of DBAs who will at least experiment on their own.

Search + Big Data: It’s (still) All About the User (Users or Documents?)

Filed under: Hadoop,Lucene,LucidWorks,Mahout,Solr,Topic Maps — Patrick Durusau @ 7:44 pm

Search + Big Data: It’s (still) All About the User by Grant Ingersoll.

Slides

Abstract:

Apache Hadoop has rapidly become the primary framework of choice for enterprises that need to store, process and manage large data sets. It helps companies to derive more value from existing data as well as collect new data, including unstructured data from server logs, social media channels, call center systems and other data sets that present new opportunities for analysis. This keynote will provide insight into how Apache Hadoop is being leveraged today and how it evolving to become a key component of tomorrow’s enterprise data architecture. This presentation will also provide a view into the important intersection between Apache Hadoop and search.

Awesome as always!

Please watch the presentation and review the slides before going further. What follows won’t make much sense without Grant’s presentation as context. I’ll wait……

Back so soon? 😉

On slide 4 (I said to review the slides), Grant presents four overlapping areas:

  • Documents: Models, Feature Selection
  • Content Relationships: Page Rank, etc., Organization
  • Queries: Phrases, NLP
  • User Interaction: Clicks, Ratings/Reviews, Learning to Rank, Social Graph

The intersection of those four areas is where Grant says search is rapidly evolving.

On slide 5 (sorry, last slide reference), Grant says that mining that intersection is a loop: Search -> Discovery -> Analytics -> (back to Search). All of it involves processing data collected from use of the search interface.

Grant’s presentation made clear something that I have been overlooking:

Search/Indexing, as commonly understood, does not capture any discoveries or insights of users.

Even the search trails that Grant mentions are just lemming tracks, complete with droppings. You can follow them if you like; you may find interesting data, or you may not.

My point being that there is no way to capture a user’s insight that LBJ, for instance, is a common acronym for Lyndon Baines Johnson, so that the next user who searches for LBJ finds the information contributed by a prior user. Such as distinguishing the application of Lyndon Baines Johnson to a graduate school (Lyndon B. Johnson School of Public Affairs), a hospital (Lyndon B. Johnson General Hospital), a PBS show (American Experience . The Presidents . Lyndon B. Johnson), a biography (American President: Lyndon Baines Johnson), and that is just in the first ten (10) “hits.” Oh, and as the name of an American President.

Grant made that clear for me with his loop of Search -> Discovery -> Analytics -> (back to Search) because Search only ever focuses on the documents, never the user’s insight into the documents.

And with every search, every user (with the exception of search trails), starts over at the beginning.

What if a colleague had found a bug in program code, but you had to start at the beginning of the program and work your way to it yourself? Would that be a good use of your time? To reset with every user? That is what happens with search: nearly a complete reset. (Not complete, because of PageRank, etc., but only just.)

If we are going to make it “All About the User,” shouldn’t we be indexing their insights* into data? (Big or otherwise.)

*”Clicks” are not insights. Could be an unsteady hand, DTs, etc.

November 5, 2011

The cool aspects of Odiago WibiData

Filed under: Hadoop,HBase,Wibidata — Patrick Durusau @ 6:42 pm

The cool aspects of Odiago WibiData

From the post:

Christophe Bisciglia and Aaron Kimball have a new company.

  • It’s called Odiago, and is one of my gratifyingly more numerous tiny clients.
  • Odiago’s product line is called WibiData, after the justly popular We Be Sushi restaurants.
  • We’ve agreed on a split exclusive de-stealthing launch. You can read about the company/founder/investor stuff on TechCrunch. But this is the place for — well, for the tech crunch.

WibiData is designed for management of, investigative analytics on, and operational analytics on consumer internet data, the main examples of which are web site traffic and personalization and their analogues for games and/or mobile devices. The core WibiData technology, built on HBase and Hadoop,* is a data management and analytic execution layer.

Still in private beta (you can sign up for notice) but the post covers the infrastructure with enough detail to be enticing.

Just as a tease (on my part):

where you’d have a single value in a relational table, you might have the equivalent of a whole relational table (or at least selection/view) in WibiData-enhanced HBase. For example, if a user visits the same web page ten times, and each time 50 attributes are recorded (including a timestamp), all 500 data – to use the word “data” in its original “plural of datum” sense – would likely be stored in the same WibiData cell.

You need to go read the post to put that in context.

I keep thinking all the “good” names are gone and then something like WibiData shows up. 😉

I suspect there are going to be a number of lessons to learn from this combination of HBase and Hadoop.
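
One of those lessons is already visible in plain HBase: a single row/column coordinate can hold many timestamped versions, which is what makes the “whole relational table in a cell” view plausible. A minimal sketch against the 0.90-era client API; the table and column names are mine, and the column family must be created with enough versions to retain them all:

```java
import java.util.NavigableMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedCellDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "user_events"); // hypothetical table, one row per user

    byte[] row = Bytes.toBytes("user-123");
    byte[] family = Bytes.toBytes("visits");       // family created with VERSIONS => 10 or more
    byte[] qualifier = Bytes.toBytes("page:/home");

    // Each visit is written with its own timestamp; all land at the same coordinate.
    long now = System.currentTimeMillis();
    for (int i = 0; i < 10; i++) {
      Put put = new Put(row);
      put.add(family, qualifier, now - i * 60000L, Bytes.toBytes("attrs-for-visit-" + i));
      table.put(put);
    }

    // Read every stored version back, not just the latest one.
    Get get = new Get(row);
    get.setMaxVersions(); // all versions the column family retains
    Result result = table.get(get);
    NavigableMap<Long, byte[]> versions = result.getMap().get(family).get(qualifier);
    System.out.println("versions stored: " + versions.size());

    table.close();
  }
}
```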

November 2, 2011

MarkLogic 5 is Big Data for the Enterprise

Filed under: BigData,Hadoop,MarkLogic — Patrick Durusau @ 6:24 pm

MarkLogic 5 is Big Data for the Enterprise

From the announcement:

SAN CARLOS, Calif. — November 1, 2011 — MarkLogic® Corporation, the company empowering organizations to make high stakes decisions on Big Data in real time, today announced MarkLogic 5, the latest version of its award-winning product designed for Big Data applications across the enterprise. MarkLogic 5 defines Big Data by empowering organizations to build Big Data applications that make information actionable. With MarkLogic 5, organizations get smarter answers faster by analyzing structured, unstructured, and semi-structured data in the same application. This allows a complete view of the health of the enterprise. Key features include the MarkLogic Connector for Hadoop, which marries large-scale batch processing with the real time Big Data applications MarkLogic has been delivering for a decade. MarkLogic 5 is a visionary step forward for organizations who want to manage complex Big Data on an operational database with confidence at scale. MarkLogic 5 is available today.

“Most of the hype around Big Data has focused only on the big or on the analytics,” said Ken Bado, president and CEO, MarkLogic. “For nearly a decade, MarkLogic has been helping its customers build cost effective Big Data applications that create competitive advantage. That means going beyond big and analytics to make information actionable so organizations can create real value for their business. With MarkLogic, multi-billion dollar companies like JP Morgan Chase and LexisNexis have redefined their business models, while organizations like the U.S. Army and the FAA have the real time, mission-critical information they need to get the job done. These aren’t science projects – they’re real organizations using Big Data applications right now.”

“We believe that MarkLogic 5 is well positioned to help solve many of the Big Data challenges that are emerging in the healthcare industry today,” said Jeff Cunningham, CTO at Informatics Corporation of America. “By incorporating MarkLogic 5 into our CareAlign™ Health Information Exchange platform, we have the ability to securely aggregate, manage, share, and analyze large amounts of patient information derived from a wide variety of sources and formats. These capabilities will help doctors, hospitals, and healthcare systems across the country solve many of the care coordination and population health management challenges that exist in healthcare today.”

There is a lot of noise concerning this release and it will take some time to obtain a favorable signal/noise ratio.

You can help contribute to the signal side of that equation:

Available with MarkLogic 5, the new Express license is free for developers looking to check out MarkLogic. It is limited to use on one computer with at most 2 CPUs and can hold up to 40GB of content. It includes options that make sense on a single computer (geospatial, alerting, conversion) and does not include options intended for clusters or enterprise usage (e.g., replication).

October 29, 2011

Mapreduce in Search

Filed under: Hadoop,MapReduce — Patrick Durusau @ 7:25 pm

Mapreduce in Search by Amund Tveit.

Interesting coverage of MapReduce in search applications. Covers basic MapReduce, what I would call “search” MapReduce and concludes with “advanced” MapReduce, as in expert systems, training, etc.

Worth a close look. Or even a tutorial focused on a specific data set with problem sets. Something to think about.

October 28, 2011

Teradata Provides the Simplest Way to Bring the Science of Data to the Art of Business

Filed under: Hadoop,MapReduce,Marketing — Patrick Durusau @ 3:13 pm

Teradata Provides the Simplest Way to Bring the Science of Data to the Art of Business

From the post:

SAN CARLOS, California Teradata (NYSE: TDC), the analytic data solutions company, today announced the new Teradata Aster MapReduce Platform that will speed adoption of big data analytics. Big data analytics can be a valuable tool for increasing corporate profitability by unlocking information that can be used for everything from optimizing digital marketing or detecting fraud to measurement and reporting machine operations in remote locations. However, until now, the cost of mining large volumes of multi-structured data and a widespread scarcity of staff with the required specialized analytical skills have largely prevented adoption of big data analytics.

The new Teradata Aster MapReduce Platform marries MapReduce, the language of big data analytics, with Structured Query Language (SQL), the language of business analytics. It includes Aster Database 5.0, a new Aster MapReduce Appliance—which extends the Aster software deployment options beyond software-only and Cloud—and the Teradata-Aster Adaptor for high-speed data transfer between Teradata and Aster Data systems.

I leave the evaluation of these products to one side for now to draw your attention to:

Teradata Aster makes it easy for any business person to see, explore, and understand multi-structured data. No longer is big data analysis just in the hands of the few data scientists or MapReduce specialists in an organization. (emphasis added)

I am not arguing that is true or even a useful idea, but consider the impact it is going to have on the average business executive. A good marketing move, if not very good for the customers who buy into it. Perhaps there is a kernel of truth we can tap into for marketing topic maps.

October 25, 2011

Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011)

Filed under: Algorithms,Hadoop,MapReduce — Patrick Durusau @ 7:34 pm

Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011) by Amund Tveit.

From the post:

It’s been a year since I updated the mapreduce algorithms posting last time, and it has been truly an excellent year for mapreduce and hadoop – the number of commercial vendors supporting it has multiplied, e.g. with 5 announcements at EMC World only last week (Greenplum, Mellanox, Datastax, NetApp, and Snaplogic) and today’s Datameer funding announcement, which benefits the mapreduce and hadoop ecosystem as a whole (even for small fish like us here in Atbrox). The work-horse in mapreduce is the algorithm, this update has added 35 new papers compared to the prior posting, new ones are marked with *. I’ve also added 2 new categories since the last update – astronomy and social networking.

A truly awesome resource!

This promises to be hours of entertainment!

Dumbo

Filed under: Hadoop,MapReduce,Python — Patrick Durusau @ 7:34 pm

Dumbo

Have you seen Dumbo?

Described as:

Dumbo is a project that allows you to easily write and run Hadoop programs in Python (it’s named after Disney’s flying circus elephant, since the logo of Hadoop is an elephant and Python was named after the BBC series “Monty Python’s Flying Circus”). More generally, Dumbo can be considered to be a convenient Python API for writing MapReduce programs.

I ran across DAG jobs and mapredtest on the Dumbo blog. Seeing DAG meant I had to run the reference down so here we are. 😉


The use of DAGs (directed acyclic graphs) with text representation systems has been studied by Michael Sperberg-McQueen and Claus Huitfeldt for many years. DAGs are thought to be useful for some cases of overlapping markup.

I remain unconvinced by the DAG approach.

October 22, 2011

Cloudera Training Videos

Filed under: Hadoop,HBase,Hive,MapReduce,Pig — Patrick Durusau @ 3:17 pm

Cloudera Training Videos

Cloudera has added several training videos on Hadoop and parts of the Hadoop ecosystem.

You will find:

  • Introduction to HBase – Todd Lipcon
  • Thinking at Scale
  • Introduction to Apache Pig
  • Introduction to Apache MapReduce and HDFS
  • Introduction to Apache Hive
  • Apache Hadoop Ecosystem
  • Hadoop Training Virtual Machine
  • Hadoop Training: Programming with Hadoop
  • Hadoop Training: MapReduce Algorithms

No direct links to the videos because new resources/videos will appear more quickly at the Cloudera site than I will be updating this list.

Now you have something to watch this weekend (Oct. 22-23, 2011) other than reports on and of the World Series! Enjoy!

Mining Wikipedia with Hadoop and Pig for Natural Language Processing

Filed under: Hadoop,Natural Language Processing,Pig — Patrick Durusau @ 3:16 pm

Mining Wikipedia with Hadoop and Pig for Natural Language Processing

One problem with after-the-fact assignment of semantics to text is that the volume of text involved (usually) is too great for manual annotation.

This post walks you through the alternative of using automated annotation based upon Wikipedia content.

From the post:

Instead of manually annotating text, one should try to benefit from an existing annotated and publicly available text corpus that deals with a wide range of topics, namely Wikipedia.

Our approach is rather simple: the text body of Wikipedia articles is rich in internal links pointing to other Wikipedia articles. Some of those articles are referring to the entity classes we are interested in (e.g. person, countries, cities, …). Hence we just need to find a way to convert those links into entity class annotations on text sentences (without the Wikimarkup formatting syntax).

This is also an opportunity to try out cloud based computing if you are so inclined.
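
Before scaling out with Pig, the link-to-annotation idea is easy to prototype on a single article. A sketch of the core step, pulling internal links out of raw wikitext, deliberately ignoring the many corner cases of real Wikimarkup:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WikiLinkExtractor {

  // Matches [[Target]] or [[Target|anchor text]], dropping any #section suffix
  private static final Pattern INTERNAL_LINK =
      Pattern.compile("\\[\\[([^\\]|#]+)(?:#[^\\]|]*)?(?:\\|([^\\]]*))?\\]\\]");

  public static void main(String[] args) {
    String wikitext = "[[Barack Obama|Obama]] was born in [[Honolulu]], [[Hawaii]].";

    Matcher m = INTERNAL_LINK.matcher(wikitext);
    while (m.find()) {
      String target = m.group(1).trim();                        // the linked article, e.g. "Honolulu"
      String anchor = m.group(2) != null ? m.group(2) : target; // the text readers actually see
      // In the post's pipeline, 'target' is checked against lists of persons, cities,
      // countries, etc., and 'anchor' becomes the annotated span in the sentence.
      System.out.println(anchor + " -> " + target);
    }
  }
}
```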

October 21, 2011

CDH3 update 2 is released (Apache Hadoop)

Filed under: Hadoop,Hive,Mahout,MapReduce,Pig — Patrick Durusau @ 7:27 pm

CDH3 update 2 is released (Apache Hadoop)

From the post:

There are a number of improvements coming to CDH3 with update 2. Among them are:

  1. New features – Support for Apache Mahout (0.5). Apache Mahout is a popular machine learning library that makes it easier for users to perform analyses like collaborative filtering and k-means clustering on Hadoop. Also added in update 2 is expanded support for Apache Avro’s data file format. Users can:
  • load data into Avro data files in Hadoop via Sqoop or Flume
  • run MapReduce, Pig or Hive workloads on Avro data files
  • view the contents of Avro files from the Hue web client

This gives users the ability to use all the major features of the Hadoop stack without having to switch file formats. Avro file format provides added benefits over text because it is faster and more compact.

  2. Improvements (stability and performance) – HBase in particular has received a number of improvements that improve stability and recoverability. All HBase users are encouraged to use update 2.
  3. Bug fixes – 50+ bug fixes. The enumerated fixes and their corresponding Apache project jiras are provided in the release notes.

Update 2 is available in all the usual formats (RHEL, SLES, Ubuntu, Debian packages, tarballs, and SCM Express). Check out the installation docs for instructions. If you’re running components from the Cloudera Management Suite they will not be impacted by moving to update 2. The next update (update 3) for CDH3 is planned for January, 2012.

Thank you for supporting Apache Hadoop and thank you for supporting Cloudera.

Another aspect of Cloudera’s support for the Hadoop ecosystem is its Cloudera University.

Build Hadoop from Source

Filed under: Hadoop,MapReduce,NoSQL — Patrick Durusau @ 7:26 pm

Build Hadoop from Source by Shashank Tiwari.

From the post:

If you are starting out with Hadoop, one of the best ways to get it working on your box is to build it from source. Using stable binary distributions is an option, but a rather risky one. You are likely to not stop at Hadoop common but go on to setting up Pig and Hive for analyzing data and may also give HBase a try. The Hadoop suite of tools suffers from a huge version mismatch and version confusion problem. So much so that many start out with Cloudera’s distribution, also known as CDH, simply because it solves this version confusion disorder.

Michael Noll’s well written blog post titled Building an Hadoop 0.20.x version for HBase 0.90.2 serves as a great starting point for building the Hadoop stack from source. I would recommend you read it and follow along the steps stated in that article to build and install Hadoop common. Early on in the article you are told about a critical problem that HBase faces when run on top of a stable release version of Hadoop. HBase may lose data unless it is running on top of an HDFS with durable sync. This important feature is only available in the branch-0.20-append of the Hadoop source and not in any of the release versions.

Assuming you have successfully followed along Michael’s guidelines, you should have the Hadoop jars built and available in a folder named ‘build’ within the folder that contains the Hadoop source. At this stage, it’s advisable to configure Hadoop and take a test drive.

A quick guide to “kicking the tires,” as it were, on part of the Hadoop ecosystem.

I first saw this in the NoSQL Weekly Newsletter from http://www.NoSQLWeekly.com.

October 16, 2011

Hadoop User Group UK: Data Integration

Filed under: Data Integration,Flume,Hadoop,MapReduce,Pig,Talend — Patrick Durusau @ 4:12 pm

Hadoop User Group UK: Data Integration

Three presentations captured as podcasts from the Hadoop User Group UK:

LEVERAGING UNSTRUCTURED DATA STORED IN HADOOP

FLUME FOR DATA LOADING INTO HDFS / HIVE (SONGKICK)

LEVERAGING MAPREDUCE WITH TALEND: HADOOP, HIVE, PIG, AND TALEND FILESCALE

Fresh as of 13 October 2011.

Thanks to Skills Matter for making the podcasts available!

October 14, 2011

Microsoft unites SQL Server with Hadoop

Filed under: Hadoop,SQL Server — Patrick Durusau @ 6:24 pm

Microsoft unites SQL Server with Hadoop by Ted Samson.

From the post:

Microsoft today revealed more details surrounding Windows and SQL Server 12 support for big data analytics via cozier integration with Apache Hadoop, the increasingly popular open source cloud platform for handling the vast quantities of unstructured data spawned daily.

With this move, Microsoft may be able to pull off a feat that has eluded other companies: bring big data to the mainstream. As it stands, only large-scale companies with fat IT budgets have been able to reap that analytical bounty, as the tools on the market tend to be both complex and pricey.

Microsoft’s strategy is to groom Linux-friendlier Hadoop to fit snugly into Windows environments, thus giving organizations on-tap, seamless, and simultaneous access to both structured and unstructured data via familiar desktop apps, such as Excel, as well as BI tools such as Microsoft PowerPivot.

That’s the thing, isn’t it? There are only so many DoD-size contracts to go around. True enough, MS will get its share of those as well (enterprises don’t call the corner IT shop). But the larger market is all the non-supersized enterprises with only internal IT shops and limited budgets.

By making MS apps the information superhighway to information stored/processed elsewhere/elsehow (read: non-MS), MS opens up an entire world for its user base. It needs to be seamless, but I assume MS will devote sufficient resources to that cause.

The more seamless MS makes its apps with non-MS innovations, such as Hadoop, the more attractive its apps become to its user base.

The ultimate irony. Non-MS innovators driving demand for MS products.

October 11, 2011

Introducing Crunch: Easy MapReduce Pipelines for Hadoop

Filed under: Flow-Based Programming (FBP),Hadoop,MapReduce — Patrick Durusau @ 6:08 pm

Introducing Crunch: Easy MapReduce Pipelines for Hadoop

Josh Wills writes:

As a data scientist at Cloudera, I work with customers across a wide range of industries that use Hadoop to solve their business problems. Many of the solutions we create involve multi-stage pipelines of MapReduce jobs that join, clean, aggregate, and analyze enormous amounts of data. When working with log files or relational database tables, we use high-level tools like Pig and Hive for their convenient and powerful support for creating pipelines over structured and semi-structured records.

As Hadoop has spread from web companies to other industries, the variety of data that is stored in HDFS has expanded dramatically. Hadoop clusters are being used to process satellite images, time series data, audio files, and seismograms. These formats are not a natural fit for the data schemas imposed by Pig and Hive, in the same way that structured binary data in a relational database can be a bit awkward to work with. For these use cases, we either end up writing large, custom libraries of user-defined functions in Pig or Hive, or simply give up on our high-level tools and go back to writing MapReduces in Java. Either of these options is a serious drain on developer productivity.

Today, we’re pleased to introduce Crunch, a Java library that aims to make writing, testing, and running MapReduce pipelines easy, efficient, and even fun. Crunch’s design is modeled after Google’s FlumeJava, focusing on a small set of simple primitive operations and lightweight user-defined functions that can be combined to create complex, multi-stage pipelines. At runtime, Crunch compiles the pipeline into a sequence of MapReduce jobs and manages their execution.

Sounds like DataFlow Programming… or Flow-Based Programming (FBP) to me. In which case the claim that:

It’s just Java. Crunch shares a core philosophical belief with Google’s FlumeJava: novelty is the enemy of adoption.

must be true, as FBP is over forty years old now. I doubt programmers involved in Crunch would be aware of it. Programming history started with their first programming language, at least for them.

From a vendor perspective, I would turn the phrase a bit to read: novelty is the enemy of market/mind share.

Unless you are a startup, in which case, novelty is good until you reach critical mass and then novelty loses its luster.

Unnecessary novelty, like new web programming languages for their own sake, can also be a bid for market/mind share.

Interesting to see both within days of each other.
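
For the curious, the style Crunch encourages looks roughly like the word count below: a pipeline of collections and functions rather than explicit Mapper and Reducer classes. Treat it as a sketch; Crunch’s package names and helpers have shifted over time (it started life under com.cloudera.crunch before moving to Apache), so check against the version you have:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchWordCount {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(CrunchWordCount.class);

    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // parallelDo is the "map" step: split each line into words.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          if (!word.isEmpty()) {
            emitter.emit(word);
          }
        }
      }
    }, Writables.strings());

    // count() is the "reduce" step: group identical words and tally them.
    PTable<String, Long> counts = words.count();

    pipeline.writeTextFile(counts, args[1]);
    pipeline.done(); // compile the pipeline into MapReduce jobs and run them
  }
}
```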

October 8, 2011

Wiki PageRank with Hadoop

Filed under: Hadoop,PageRank — Patrick Durusau @ 8:15 pm

Wiki PageRank with Hadoop

From the post:

In this tutorial we are going to create a PageRanking for Wikipedia with the use of Hadoop. This was a good hands-on exercise to get started with Hadoop. The page ranking is not a new thing, but it is a suitable use case and way cooler than a word counter! The Wikipedia (en) has 3.7M articles at the moment and is still growing. Each article has many links to other articles. With those incoming and outgoing links we can determine which page is more important than others, which basically is what PageRanking does.

Excellent tutorial! Non-trivial data set and gets your hands wet with Hadoop, one of the rising stars in data processing. What’s not to like?
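
To make the tutorial’s core step concrete: one PageRank iteration is a mapper that hands each page’s current rank out to its outlinks and a reducer that sums the incoming shares and applies the damping factor. A sketch with my own input format (page, rank, semicolon-separated outlinks via KeyValueTextInputFormat), not the tutorial’s exact code:

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Input (via KeyValueTextInputFormat): page \t "rank,link1;link2;..."
public class PageRankIteration {

  public static class RankMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text page, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split(",", 2);
      double rank = Double.parseDouble(parts[0]);
      String[] links = (parts.length > 1 && !parts[1].isEmpty())
          ? parts[1].split(";") : new String[0];

      // Pass the link structure through so the reducer can re-emit it.
      ctx.write(page, new Text("!" + (parts.length > 1 ? parts[1] : "")));

      // Give each outlink an equal share of this page's current rank.
      for (String link : links) {
        ctx.write(new Text(link), new Text(Double.toString(rank / links.length)));
      }
    }
  }

  public static class RankReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text page, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0.0;
      String links = "";
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("!")) {
          links = s.substring(1);       // the preserved outlink list
        } else {
          sum += Double.parseDouble(s); // an incoming rank share
        }
      }
      double newRank = 0.15 + 0.85 * sum; // standard damping factor
      ctx.write(page, new Text(newRank + "," + links));
    }
  }
}
```

Run the job several times, feeding each output back in as the next input, and the ranks converge.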

Question: What other processing looks interesting for the Wiki pages?

The running time on some jobs would be short enough to plan a job at the start of class from live suggestions, run the job during the presentation/lecture, and present the results/post-mortem of mistakes after the break.

Now that would make an interesting class. Suggestions?

