Archive for the ‘MapR’ Category

MapR on Open Data Platform: Why we declined

Wednesday, April 29th, 2015

MapR on Open Data Platform: Why we declined by John Schroeder.

From the post:

Open Data Platform is “solving” problems that don’t need solving

Companies implementing Hadoop applications do not need to be concerned about vendor lock-in or interoperability issues. Gartner analysts Merv Adrian and Nick Heudecker disclosed in a recent blog that less than 1% of companies surveyed thought that vendor lock-in or interoperability was an issue—dead last on the list of customer concerns. Project and sub-project interoperability are very good and guaranteed by both free and paid-for distributions. Applications built on one distribution can be migrated with virtually zero switching costs to the other distributions.

Open Data Platform participation lacks participation by the Hadoop leaders

~75% of Hadoop implementations run on MapR and Cloudera. MapR and Cloudera have both chosen not to participate. The Open Data Platform without MapR and Cloudera is a bit like one of the Big Three automakers pushing for a standards initiative without the involvement of the other two.

I mention this post because it touches on two issues that should concern all users of Hadoop applications.

On “vendor lock-in,” you will find that the question asked was “…how many attendees considered vendor lock-in a barrier to investment in Hadoop. It came in dead last. With around 1% selecting it.” (Who Asked for an Open Data Platform?) Considering that the question was asked in the context of a Gartner webinar, it could be that only one person selected it. Not what I would call a representative sample.

Still, I think John is right in saying that vendor lock-in isn’t a real issue with Hadoop. Hadoop applications aren’t off-the-shelf items; they are custom constructs for your needs and data. Not much opportunity for vendor lock-in. You’re in greater danger of IT lock-in due to poor or non-existent documentation for your Hadoop application. If anyone tells you a Hadoop application doesn’t need documentation because you can “…read the code…,” they are building up job security, quite possibly at your future expense.

John is spot on about the Open Data Platform not including all of the Hadoop market leaders. As John says, Open Data Platform does not include those responsible for 75% of the existing Hadoop implementations.

I have seen that situation before in standards work and it never leads to a happy conclusion, for the participants, non-participants and especially the consumers, who are supposed to benefit from the creation of standards. Non-standards for a minority of the market only serve to confuse not overly clever consumers. To say nothing of the popular IT press.

The Open Data Platform also raises questions about how one goes about creating a standard. One approach is to create a standard based on your projection of market needs and to campaign for its adoption. Another is to create a definition of an “ODP Core” and see if it is used by customers in development contracts and purchase orders. If consumers find it useful, they will no doubt adopt it as a de facto standard. Formalization can follow in due course.

So long as we are talking about possible future standards, a documentation practice more advanced than C-style comments would be a useful standard for the Hadoop ecosystem.

An Inside Look at the Components of a Recommendation Engine

Thursday, April 16th, 2015

An Inside Look at the Components of a Recommendation Engine by Carol McDonald.

From the post:

Recommendation engines help narrow your choices to those that best meet your particular needs. In this post, we’re going to take a closer look at how all the different components of a recommendation engine work together. We’re going to use collaborative filtering on movie ratings data to recommend movies. The key components are a collaborative filtering algorithm in Apache Mahout to build and train a machine learning model, and search technology from Elasticsearch to simplify deployment of the recommender.
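
To make “collaborative filtering on movie ratings data” concrete, here is a minimal sketch. It is not the Mahout algorithm or the code from Carol's post: it uses plain co-occurrence counts, and the `LIKES` data and the `cooccurrence` and `recommend` names are hypothetical, invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: user -> set of movies that user rated highly.
LIKES = {
    "ann": {"Alien", "Blade Runner", "Brazil"},
    "bob": {"Alien", "Blade Runner"},
    "cat": {"Blade Runner", "Brazil", "Amelie"},
    "dan": {"Amelie", "Chocolat"},
}


def cooccurrence(likes):
    """Count how often each pair of movies is liked by the same user."""
    counts = defaultdict(lambda: defaultdict(int))
    for movies in likes.values():
        for a, b in combinations(sorted(movies), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user, likes, top_n=3):
    """Score unseen movies by summed co-occurrence with the user's liked movies."""
    counts = cooccurrence(likes)
    seen = likes[user]
    scores = defaultdict(int)
    for movie in seen:
        for other, n in counts[movie].items():
            if other not in seen:
                scores[other] += n
    # Sort by score (highest first), then by title so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:top_n]]


if __name__ == "__main__":
    print(recommend("bob", LIKES))
```

The idea is only that movies liked by the same people as your movies rise to the top. A production recommender, per the post, trains the model in Mahout and serves the results through a search engine.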

There are two reasons to read this post:

First, you really don’t know how recommendation engines work. Well, better late than never.

Second, you want an example of how to write an excellent explanation of recommendation engines, hopefully to replicate it for other software.

This is an example of an excellent explanation of recommendation engines but whether you can replicate it for other software remains to be seen. 😉

Still, reading excellent explanations is a first step towards authoring excellent explanations.

Good luck!

Evolving Parquet as self-describing data format –

Monday, April 6th, 2015

Evolving Parquet as self-describing data format – New paradigms for consumerization of Hadoop data by Neeraja Rentachintala.

From the post:

With Hadoop becoming more prominent in customer environments, one of the frequent questions we hear from users is what should be the storage format to persist data in Hadoop. The data format selection is a critical decision especially as Hadoop evolves from being about cheap storage to a pervasive query and analytics platform. In this blog, I want to briefly describe self-describing data formats, how they are gaining a lot of interest as a new management paradigm to consumerize Hadoop data in organizations and the work we have been doing as part of the Parquet community to evolve Parquet as fully self-describing format.

About Parquet

Apache Parquet is a columnar storage format for the Hadoop ecosystem. Since its inception about 2 years ago, Parquet has gotten very good adoption due to the highly efficient compression and encoding schemes used that demonstrate significant performance benefits. Its ground-up design allows it to be used regardless of any data processing framework, data model, and programming language used in Hadoop ecosystem. A variety of tools and frameworks including MapReduce, Hive, Impala, and Pig provided the ability to work with Parquet data and a number of data models such as AVRO, Protobuf, and Thrift have been expanded to be used with Parquet as storage format. Parquet is widely adopted by a number of major companies including tech giants such as Twitter and Netflix.

Self-describing data formats and their growing role in analytics on Hadoop/NoSQL

Self-describing data is where schema or structure is embedded in the data itself. The schema is comprised of metadata such as element names, data types, compression/encoding scheme used (if any), statistics, and a lot more. There are a variety of data formats including Parquet, XML, JSON, and NoSQL databases such as HBase that belong to the spectrum of self-describing data and typically vary in the level of metadata they expose about themselves.

While the self-describing data has been in rise with NoSQL databases (e.g., the Mongo BSON model) for a while now empowering developers to be agile and iterative in application development cycle, the prominence of these has been growing in analytics as well when it comes to Hadoop. So what is driving this? The answer is simple – it’s the same reason – i.e., the requirement to be agile and iterative in BI/analytics.

More and more organizations are now using Hadoop as a data hub to store all their data assets. These data assets often contain existing datasets offloaded from the traditional DBMS/DWH systems, but also new types of data from new data sources (such as IOT sensors, logs, clickstream) including external data (such as social data, 3rd party domain specific datasets). The Hadoop clusters in these organizations are often multi-tenant and shared by multiple groups in the organizations. The traditional data management paradigms of creating centralized models/metadata definitions upfront before the data can be used for analytics are quickly becoming bottlenecks in Hadoop environments. The new complex and schema-less data models are hard to map to relational models and modeling data upfront for unknown ad hoc business questions and data discovery needs is challenging and keeping up with the schema changes as the data models evolve is practically impossible.

By pushing metadata to data and then using tools that can understand this metadata available in self-describing formats to expose it directly for end user consumption, the analysis life cycles can become drastically more agile and iterative. For example, using Apache Drill, the world’s first schema-free SQL query engine, you can query self-describing data (in files or NoSQL databases such as HBase/MongoDB) immediately without having to define and manage schema overlay definitions in centralize metastores. Another benefit of this is business self-service where the users don’t need to rely on IT departments/DBAs constantly for adding/changing attributes to centralized models, but rather focus on getting answers to the business questions by performing queries directly on raw data.

Think of it this way, Hadoop scaled processing by pushing processing to the nodes that have data. Analytics on Hadoop/NoSQL systems can be scaled to the entire organization by pushing more and more metadata to the data and using tools that leverage that metadata automatically to expose it for analytics. The more self-describing the data formats are (i.e., the more metadata they contain about data), the smarter the tools that leverage the metadata can be.

The post walks through example cases and points to additional resources.

To become self-describing, Parquet will need to move beyond assigning data types to tokens. In the example given, “amount” has the datatype “double,” but that doesn’t tell me if we are discussing grams, Troy ounces (for precious metals), carats or pounds.
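
A minimal sketch of the difference, assuming a made-up record layout: the `unit` key below is hypothetical and is not part of the Parquet format or of the example in the post. With only a type, a reader cannot interpret “amount”; with a declared unit, a tool can act on it.

```python
# Hypothetical self-describing record: the schema travels with the data.
# The "unit" key is an assumption made for illustration only.
RECORD = {
    "schema": {
        "item": {"type": "string"},
        "amount": {"type": "double", "unit": "troy_ounce"},
    },
    "rows": [{"item": "gold", "amount": 2.0}],
}

# Conversion factors to grams (metric carat, troy ounce, avoirdupois pound).
GRAMS_PER_UNIT = {
    "gram": 1.0,
    "carat": 0.2,
    "troy_ounce": 31.1034768,
    "pound": 453.59237,
}


def amount_in_grams(record, row_index=0):
    """Convert 'amount' to grams using the unit declared in the embedded schema."""
    field = record["schema"]["amount"]
    unit = field.get("unit")
    if unit is None:
        # A datatype alone ("double") is not enough to interpret the value.
        raise ValueError("schema gives a type but no unit for 'amount'")
    return record["rows"][row_index]["amount"] * GRAMS_PER_UNIT[unit]


if __name__ == "__main__":
    print(amount_in_grams(RECORD))
```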

We all need to start following the work on self-describing data formats more closely.

MapR Sandbox Fastest On-Ramp to Hadoop

Monday, March 23rd, 2015

MapR Sandbox Fastest On-Ramp to Hadoop

From the webpage:

The MapR Sandbox for Hadoop provides tutorials, demo applications, and browser-based user interfaces to let developers and administrators get started quickly with Hadoop. It is a fully functional Hadoop cluster running in a virtual machine. You can try our Sandbox now – it is completely free and available as a VMware or VirtualBox VM.

If you are a business intelligence analyst or a developer interested in self-service data exploration on Hadoop using SQL and BI Tools, the MapR Sandbox including Apache Drill will get you started quickly. You can download the Drill Sandbox here.

You of course know about the Hortonworks and Cloudera (at the very bottom of the page) sandboxes as well.

Don’t expect a detailed comparison of all three because the features and distributions change too quickly for that to be useful. And my interest is more in capturing the style or approach that may make a difference to a starting user.


I first saw this in a tweet by Kirk Borne.

MapR Offers Free Hadoop Training and Certifications

Thursday, January 29th, 2015

MapR Offers Free Hadoop Training and Certifications by Thor Olavsrud.

From the post:

In an effort to make Hadoop training for developers, analysts and administrators more accessible, Hadoop distribution specialist MapR Technologies Tuesday unveiled a free on-demand training program. Another track for HBase developers will be added later this quarter.

“This represents a $50 million, in-kind contribution to the Hadoop community,” says Jack Norris, CMO of MapR. “The focus is overcoming what many people consider the major obstacle to the adoption of big data, particularly Hadoop.”

The developer track is about building big data applications in Hadoop. The topics range from the basics of Hadoop and related technologies to advanced topics like designing and developing MapReduce and HBase applications with hands-on labs. The courses include:

  • Hadoop Essentials. This course, which is immediately available, provides an introduction to Hadoop, the ecosystem, common solutions and use cases.
  • Developing Hadoop Applications. This course is also immediately available and focuses on designing and writing effective Hadoop applications with MapReduce and YARN.
  • HBase Schema Design and Modeling. This course will become available in February and will focus on architecture, schema design and data modeling on HBase.
  • Developing HBase Applications. This course will also debut in February and focuses on real-world application design in HBase (Time Series and Social Application examples).
  • Hadoop Data Analysis – Drill. Slated for debut in March, this course covers interactive SQL on Hadoop for structured, semi-structured and nested data.

I remember how expensive the Novell training classes were back in the NetWare 4.11 days. (Yes, that has been a while.)

I wonder whose software will come to mind after completing the MapR training courses and passing the certification exams?

That’s what I think too. Send kudos to MapR for this effort!

Looking forward to seeing some of you at Hadoop certification exams later this year!

I first saw this in a tweet by Kirk Borne.

MapR and Ubuntu

Wednesday, April 3rd, 2013

MapR has posted all of its Hadoop ecosystem source code to Github: MapR Technologies.

MapR has also partnered with Canonical to release the entire Hadoop stack for the 12.04 LTS and 12.10 releases of Ubuntu, starting April 25, 2013.

For details see: MapR Teams with Canonical to Deliver Hadoop on Ubuntu.

I first saw this at: MapR Turns to Ubuntu in Bid to Increase Footprint by Isaac Lopez.

LucidWorks™ Teams with MapR™… [Not 26% but 5-6% + not from Big Data]

Wednesday, February 20th, 2013

LucidWorks™ Teams with MapR™ Technologies to Offer Best-in-Class Big Data Analytics Solution

Performance Day just keeps on going!

From the press release:

REDWOOD CITY, Calif. – February 20, 2013 – Big Data provides a very real opportunity for organizations to drive business decisions by utilizing new information that has yet to be tapped. However, it is increasingly apparent that organizations are struggling to make effective use of this new multi-structured content for data-driven decision-making. According to a report from the Economist Intelligence Unit, the challenge is not so much the volume, but instead it is the pressing need to analyze and act on Big Data in real-time.

Existing business intelligence (BI) tools have simply not been designed to provide spontaneous search on multi-structured data in motion. Responding directly to this need, LucidWorks, the company transforming the way people access information, and MapR Technologies, the Hadoop technology leader, today announced the integration between LucidWorks Search™ and MapR. Available now, the combined solution allows organizations to easily search their MapR Distributed File System (DFS) in a natural way to discover actionable insights from information maintained in Hadoop.

“Organizations that wait to address big data until this evolution is well under way will lose out competitively in their vertical markets, compared to organizations that have aggressively pursued big data flexibility. Aggressive organizations will demonstrate faster, more accurate analysis and decisions relating to their tactical operations and strategic planning.”

  • Source: Big Data Drives Rapid Changes in Infrastructure and $232 Billion in IT Spending Through 2016, Gartner Group

Integration Solution Highlights

  • Combines the best of Big Data with Search with an integrated and fully distributed solution
  • Supports a pre-defined MapR target data source within LucidWorks Search
  • Enables users to create and configure the MapR data source directly from the LucidWorks Search administration console
  • Leverages enterprise security features offered by both MapR and LucidWorks Search

The Economist Intelligence Unit study found that global companies experienced a 26 percent improvement in performance over the last three years when big data analytics were applied to the decision-making process. And now, those data-savvy executives are forecasting a 41 percent improvement over the next three years. The integration between LucidWorks Search and MapR makes it easier to put Big Data analytics in motion.

I’m really excited about this match up but you know I can’t simply let claims like “…global companies experienced a 26 percent improvement in performance….” slide by. 😉

If you go read the report, The Deciding Factor: Big Data & Decision Making, you will find at page six (6):

On average, survey participants say that big data has improved their organisations’ performance in the past three years by 26%, and they are optimistic that it will improve performance by an average of 41% in the next three years. While “performance” in this instance is not rigorously specified, it is a useful gauge of mood.

The measured difference in performance, from:

firms that emphasise decision-making based on data and analytics performed 5-6% better—as measured by output and performance—than those that rely on intuition and experience for decision-making.

So, not 26% but 5-6% measured and the 5-6% is for decision-making on data and analytics, not big data.

You don’t find code written at either LucidWorks or MapR that is “close enough.” Both have well deserved reputations for clean code and hard work.

Why should communications fall short of that mark?

Reflective Intelligence and Unnatural Acts

Thursday, December 13th, 2012

I wasn’t in the best of shape today but did manage to attend the webinar: Crowd Sourcing Reflected Intelligence Using Search and Big Data.

Not a lot of detail but there were two topics that caught my attention.

The first was “reflective intelligence,” that is, a system that reflects the intelligence of its users back to other users.

Intelligence derived from tracking “clicks,” search terms, etc.

Question: How does your topic map solution “reflect” the intelligence of its users?

That is, how do responses “improve” (by some measure) as a result of user interaction?

It could be measuring user behavior: which links do users select for particular query terms? (That is an example from the webinar.) Or it could be users adding information, perhaps even suggesting/voting on merges.
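
The click-tracking example can be sketched in a few lines. This is a toy under stated assumptions, not what the webinar software does: the `ClickLog` class and its methods are hypothetical names, and ranking by raw click counts is the simplest possible measure.

```python
from collections import Counter, defaultdict


class ClickLog:
    """Records which result users select for each query term."""

    def __init__(self):
        # query (lowercased) -> Counter of link -> number of clicks
        self._clicks = defaultdict(Counter)

    def record(self, query, link):
        """Note that some user chose `link` after searching for `query`."""
        self._clicks[query.lower()][link] += 1

    def rerank(self, query, results):
        """Order results by past clicks for this query.

        sorted() is stable, so results with equal click counts keep
        their original relative order.
        """
        clicks = self._clicks.get(query.lower(), Counter())
        return sorted(results, key=lambda link: -clicks[link])


if __name__ == "__main__":
    log = ClickLog()
    log.record("hadoop", "c")
    log.record("Hadoop", "c")
    log.record("hadoop", "b")
    print(log.rerank("hadoop", ["a", "b", "c"]))
```

Earlier users' choices reorder the results shown to later users, which is the “reflection.” The same pattern would apply to user-suggested merges, with votes in place of clicks.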

The second riff that got my attention was a description of the software under discussion as:

“I don’t have to do unnatural acts.”

Is that like the Papa John’s “better ingredients?” Taken to imply that other pizzas use sub-par ingredients?

Or in this case, other software solutions require “unnatural acts?”

Interesting selling point.

What unusual properties would you claim for topic maps or topic map software?

Crowd Sourcing Reflected Intelligence Using Search and Big Data [Webinar]

Monday, December 3rd, 2012

Crowd Sourcing Reflected Intelligence Using Search and Big Data

Date: December 13, 2012

Time: 10:00 am PT / 1:00 pm ET

From the webpage:

Anyone interested in drawing insights from their Big Data repository/project/application should attend this informative webinar brought to you by MapR and LucidWorks. LucidWorks Search is a development platform that accelerates and simplifies building highly secure, scalable, and cost-effective search applications.

This webinar will show:

  • how search users’ search behavior can be mined
  • how big data analytics can be applied to that raw data
  • how to redeploy that data back to the users to improve their experience

Experts from MapR and Lucidworks will show the strengths of combining the easiest, most dependable and fastest distribution for Hadoop with the real-time, ad hoc data accessibility of LucidWorks Search to provide analytic capabilities along with scalable machine learning algorithms for deeper insight into both content and user behavior.

Speakers: Grant Ingersoll, Chief Scientist for LucidWorks and Ted Dunning, Chief Application Architect for MapR.

I have seen Grant on video and it was great. If Ted is anywhere close to as good as Grant, this is going to be a webinar to remember!

MapR Now Available as an Option on Amazon Elastic MapReduce

Sunday, June 17th, 2012

MapR Now Available as an Option on Amazon Elastic MapReduce

From the post:

MapR Technologies, Inc., the provider of the open, enterprise-grade distribution for Apache Hadoop, today announced the immediate availability of its MapR Distribution for Hadoop as an option within the Amazon Elastic MapReduce service. Customers can now provision dynamically scalable MapR clusters while taking advantage of the flexibility, agility and massive scalability of Amazon Web Services (AWS). In addition, AWS has made its own Hadoop enhancements available to MapR customers, allowing them to seamlessly use MapR with other AWS offerings such as Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB and Amazon CloudWatch.

“We’re excited to welcome MapR’s feature-rich distribution as an option for customers running Hadoop in the cloud,” said Peter Sirota, general manager of Amazon Elastic MapReduce, AWS. “MapR’s innovative high availability data protection and performance features combined with Amazon EMR’s managed Hadoop environment and seamless integration with other AWS services provides customers a powerful tool for generating insights from their data.”

Customers can provision MapR clusters on-demand and automatically terminate them after finishing data processing, reducing costs as they only pay for the resources they consume. Customers can augment their existing on-premise deployments with AWS-based clusters to improve disaster recovery and access additional compute resources as required.

“For many customers there is no longer a compelling business case for deploying an on-premise Hadoop cluster given the secure, flexible and highly cost effective platform for running MapR that AWS provides,” said John Schroeder, CEO and co-founder, MapR Technologies. “The combination of AWS infrastructure and MapR’s technology, support and management tools enables organizations to potentially lower their costs while increasing the flexibility of their data intensive applications.”

Are you doing topic maps in the cloud yet?

A rep from one of the “big iron” companies was telling me how much more reliable owning your own hardware with their software is than the cloud.

True, but that has the same answer as the question: Who needs the capacity to process petabytes of data in real time?

If the truth were told, there are a few companies and organizations that could benefit from that capability.

But the rest of us don’t have that much data or the talent to process it if we did.

Over the summer I am going to try the cloud out, both generally and for topic maps.


The Search Is Over: Integrating Solr and Hadoop to Simplify Big Data Analytics

Sunday, May 27th, 2012

The Search Is Over: Integrating Solr and Hadoop to Simplify Big Data Analytics

From MapR Technologies.

Show of hands. How many of you can name the solution found in these slides?


Slides are great for entertainment.

Solutions require more, a great deal more.

For the “more” on MapR, see: Download Hadoop Software Datasheets, Product Documentation, White Papers

Mr. MapR: A Xoogler

Sunday, January 8th, 2012

Mr. MapR: A Xoogler

Cynthia Murrell of BeyondSearch writes:

Wired Enterprise gives us a glimpse into MapR, a new distribution for Apache Hadoop, in “Ex-Google Man Sells Search Genius to Rest of World.” The ex-Googler in this case is M.C. Srivas, who was so impressed with Google’s MapReduce platform that he decided to spread its concepts to the outside world.

Sounds great! So I head over to the MapR site and choose Unique Features of MapR Hadoop Distribution, where I find:

  • Finish small jobs quickly with MapR ExpressLane
  • Mount your Hadoop cluster with Direct Access NFS™
  • Enable realtime data flows
  • Use the MapR Heatmap™, alerts, and alarms to monitor your cluster
  • Manage your data easily with Volumes
  • Scale up and create an unlimited number of files
  • Get jobs done faster with half the hardware
  • Eliminate downtime and performance bottlenecks with Distributed NameNode HA
  • Eliminate lost jobs with HA Jobtracker
  • Enable Point-in-time Recovery with MapR Snapshots
  • Synchronize data across clusters with Mirroring
  • Let multiple jobs safely share your Hadoop cluster
  • Control data placement for improved performance, security or manageability

Maybe I am missing it. Do you see any Search Genius in that list?

MapR may have improved the usability/reliability of Hadoop, which is no small thing, but disappointing when looking for better search results.

Let’s represent the original Hadoop with this Wikipedia image:


and the MapR version of Hadoop with this Wikipedia image:

It is true that the MapR version has more unique features but none of them appear to relate to search.

I am sure that Hadoop cluster managers and others will be interested in MapR (as will some of the rest of us), as managers.

As searchers, we may have to turn somewhere else. Do you disagree?

PS: Cloudera has made more contributions to the Hadoop and Apache communities than I can list in a very long post. Keep that in mind when you see ill-mannered and juvenile sniping at their approach to Hadoop.