Archive for the ‘HPCC’ Category

Juju Charm (HPCC Systems)

Friday, August 8th, 2014

HPCC Systems from LexisNexis Celebrates Third Open-Source Anniversary, And Releases 5.0 Version

From the post:

LexisNexis® Risk Solutions today announced the third anniversary of HPCC Systems®, its open-source, enterprise-proven platform for big data analysis and processing for large volumes of data in 24/7 environments. HPCC Systems also announced the upcoming availability of version 5.0 with enhancements to provide additional support for international users, visualization capabilities and new functionality such as a Juju charm that makes the platform easier to use.

“We decided to open-source HPCC Systems three years ago to drive innovation for our leading technology that had only been available internally and allow other companies and developers to experience its benefits to solve their unique business challenges,” said Flavio Villanustre, Vice President, Products and Infrastructure, HPCC Systems, LexisNexis.

….

5.0 Enhancements
With community contributions from developers and analysts across the globe, HPCC Systems is offering translations and localization in its version 5.0 for languages including Chinese, Spanish, Hungarian, Serbian and Brazilian Portuguese with other languages to come in the future.
Additional enhancements include:
• Visualizations
• Linux Ubuntu Juju Charm Support
• Embedded language features
• Apache Kafka Integration
• New Regression Suite
• External Database Support (MySQL)
• Web Services-SQL

The HPCC Systems source code can be found here: https://github.com/hpcc-systems
The HPCC Systems platform can be found here: http://hpccsystems.com/download/free-community-edition

Just in time for the Fall upgrade season! 😉

While reading the documentation I stumbled across: Unicode Indexing in ECL, last updated January 09, 2014.

From the page:

ECL’s default indexing logic works great for strings and numbers, but can encounter problems when indexing Unicode data. In some cases, Unicode indexes don’t return all matching records for a query. For example, if you have a Unicode field “ufield” in a dataset and select dataset(ufield BETWEEN u’ma’ AND u’me’), it would bring back records for ‘mai’, ‘Mai’ and ‘may’. However, a query on the index for that dataset, idx(ufield BETWEEN u’ma’ AND u’me’), only brings back a record for ‘mai’.

This is a result of the way Unicode fields are sorted for indexing. Sorting compares the values of two fields byte by byte to see if a field matches, or is less than or greater than, another value. Integers are stored in big-endian format, and signed numbers have an offset added to create an absolute value range.

Unicode fields are different. When compared/sorted in datasets, the comparisons are performed using ICU locale-sensitive collation to ensure correct ordering. However, index lookup operations need to be fast, and therefore they perform binary comparisons on fixed-length blocks of data. Equality checks will return data correctly, but queries involving BETWEEN, > or < may fail.
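The point about integer keys is easy to see in miniature. This is a Python illustration (not HPCC code): adding a fixed offset to a signed value before big-endian encoding makes plain byte-by-byte comparison agree with numeric order.

```python
def key32(n: int) -> bytes:
    """Encode a signed 32-bit integer so byte-wise comparison of the keys
    matches numeric comparison of the values: big-endian, offset by 2**31
    to shift the signed range -2**31..2**31-1 into the unsigned 0..2**32-1."""
    return (n + 2**31).to_bytes(4, "big")

nums = [-5, 3, -2**31, 2**31 - 1, 0]

# Sorting by the encoded keys gives the same order as numeric sorting.
assert sorted(nums, key=key32) == sorted(nums)
```

Strings have no such cheap trick: a byte-sortable key for locale-aware collation requires precomputed collation keys, which is exactly the tension the quoted passage describes.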

If you are considering HPCC, be sure to check your indexing requirements with regard to Unicode.
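To see the mismatch concretely, here is a small Python sketch using the example values from the quoted page. It is a rough illustration only: `casefold` stands in crudely for ICU locale-aware collation, and a real HPCC index compares fixed-length blocks rather than raw UTF-8 strings.

```python
values = ["mai", "Mai", "may"]

# Byte-wise range check, roughly what a binary index lookup does:
byte_hits = [v for v in values if b"ma" <= v.encode("utf-8") <= b"me"]

# Case-insensitive check, a crude stand-in for locale-aware collation:
fold = str.casefold
coll_hits = [v for v in values if fold("ma") <= fold(v) <= fold("me")]

print(byte_hits)  # 'Mai' is missing: byte 0x4D ('M') sorts before 0x6D ('m')
print(coll_hits)  # all three values match under collation
```

The two filters disagree on ‘Mai’, which is the class of bug the documentation warns about for range queries on Unicode index fields.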

Machine Learning (BETA)

Sunday, February 5th, 2012

Machine Learning (BETA)

From HPCC Systems:

An extensible set of Machine Learning (ML) and Matrix processing algorithms to assist with business intelligence; covering supervised and unsupervised learning, document and text analysis, statistics and probabilities, and general inductive inference related problems.

The ML project is designed to create an extensible library of fully parallel machine learning routines; the early stages of a bottom-up implementation of a set of algorithms which are easy to use and efficient to execute. This library leverages the distributed nature of the HPCC Systems architecture, providing extreme scalability to both the high-level implementation of the machine learning algorithms and the underlying matrix algebra library, scaling to tens of thousands of features on billions of training examples.

Some of the most representative algorithms in the different areas of machine learning have been implemented, including k-means for clustering, Naive Bayes classifiers, ordinary linear regression, logistic regression, correlations (including Pearson and Kendall’s Tau), and association routines to perform association analysis and pattern prediction. The document tokenization and text classifiers included, with n-gram extraction and analysis, provide the basis to perform statistical grammar inference based natural language processing. Univariate statistics such as mean, median, mode, variance and percentile ranking are supported along with standard statistical measures such as Student’s t, Normal, Poisson, Binomial, Negative Binomial and Exponential.

In case you need reminding, this is the open-sourced LexisNexis engine.

Unlike algorithms that run on top of summarized big data, these algorithms run on big data.

See if that makes a difference for your use cases.
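For readers unfamiliar with the clustering routine the announcement names, here is a toy sequential k-means in Python. This is purely illustrative of the algorithm, not the ECL-ML API, which runs the same idea in parallel across the cluster.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 2-D points: repeatedly assign each point to its
    nearest center, then move each center to the mean of its cluster."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the nearest center, by squared Euclidean distance
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old center if a cluster goes empty
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers

# Two obvious clumps; the centers should land near (0.33, 0.33) and (10.33, 10.33).
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = sorted(kmeans(pts, 2))
```

The ECL-ML value proposition is that the assignment and averaging steps distribute naturally over Thor nodes, so the same algorithm scales to far larger inputs.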

HPCC vs Hadoop

Sunday, January 29th, 2012

HPCC vs Hadoop

Four factors are said to distinguish HPCC from Hadoop:

  • Enterprise Control Language (ECL)
  • Beyond MapReduce
  • Roxie Delivery Engine
  • Enterprise Ready

After viewing these summaries you may feel you lack information on which to base a choice between these two.

So you follow: Detailed Comparison of HPCC vs. Hadoop.

I’m afraid you are going to be disappointed there as well.

Not enough information to make an investment choice in an enterprise context in favor of either HPCC or Hadoop.

Do you have pointers to meaningful comparisons of these two platforms?

Or perhaps suggestions for what would make a meaningful comparison?

Are there features of HPCC that Hadoop should emulate?

LexisNexis Open-Sources its Hadoop Alternative

Thursday, September 22nd, 2011

LexisNexis Open-Sources its Hadoop Alternative

Ryan Rosario writes:

A month ago, I wrote about alternatives to the Hadoop MapReduce platform and HPCC was included in that article. For more information, see here.

LexisNexis has open-sourced its alternative to Hadoop, called High Performance Computing Cluster. The code is available on GitHub. For years the code was restricted to LexisNexis Risk Solutions. The system contains two major components:

  • Thor (Thor Data Refinery Cluster) is the data processing framework. It “crunches, analyzes and indexes huge amounts of data a la Hadoop.”
  • Roxie (Roxie Rapid Data Delivery Cluster) is more like a data warehouse and is designed with quick querying in mind for frontends.

The protocol that drives the whole process is the Enterprise Control Language which is said to be faster and more efficient than Hadoop’s version of MapReduce. A picture is a much better way to show how the system works. Below is a diagram from the Gigaom article from which most of this information originates.

Interesting times ahead.

Hadoop Fatigue — Alternatives to Hadoop

Tuesday, September 6th, 2011

Hadoop Fatigue — Alternatives to Hadoop

Can you name six (6) alternatives to Hadoop? Or formulate why you choose Hadoop over those alternatives?

From the post:

After working extensively with (Vanilla) Hadoop professionally for the past 6 months, and at home for research, I have found several nagging issues with Hadoop that have convinced me to look elsewhere for everyday use and certain applications. For these applications, the thought of writing a Hadoop job makes me take a deep breath. Before I continue, I will say that I still love Hadoop and the community.

  • Writing Hadoop jobs in Java is very time consuming because everything must be a class, and many times these classes extend several other classes or extend multiple interfaces; the Java API is very bloated. Adding a simple counter to a Hadoop job becomes a chore of its own.
  • Documentation for the bloated Java API is sufficient, but not the most helpful.
  • HDFS is complicated and has plenty of issues of its own. I recently heard a story about data loss in HDFS just because the IP address block used by the cluster changed.
  • Debugging a failure is a nightmare; is it the code itself? Is it a configuration parameter? Is it the cluster or one/several machines on the cluster? Is it the filesystem or disk itself? Who knows?!
  • Logging is verbose to the point that finding errors is like finding a needle in a haystack. That is, if you are even lucky enough to have an error recorded! I’ve had plenty of instances where jobs fail and there is absolutely nothing in the stdout or stderr logs.
  • Large clusters require a dedicated team to keep them running properly, but that is not surprising.
  • Writing a Hadoop job becomes a software engineering task rather than a data analysis task.

Hadoop will be around for a long time, and for good reason. MapReduce cannot solve every problem (fact), and Hadoop can solve even fewer problems (opinion?). After dealing with some of the innards of Hadoop, I’ve often said to myself “there must be a better way.” For large corporations that routinely crunch large amounts of data using MapReduce, Hadoop is still a great choice. For research, experimentation, and everyday data munging, one of these other frameworks may be better if the advantages of HDFS are not necessarily imperative:

Out of the six alternatives, I haven’t seen BashReduce or Disco, so I need to look those up.

Ah, the other alternatives: GraphLab, HPCC, Spark, and Storm (Preview of Storm: The Hadoop of Realtime Processing).

It is a pet peeve of mine that some authors force me to search for links they could just as well have included. The New York Times, of all places, refers to websites and does not include the URLs. And that is for paid subscribers.

The Graph 500 List

Tuesday, July 26th, 2011

The Graph 500 List

From the website:

Data intensive supercomputer applications are increasingly important for HPC workloads, but are ill-suited for platforms designed for 3D physics simulations. Current benchmarks and performance metrics do not provide useful information on the suitability of supercomputing systems for data intensive applications. A new set of benchmarks is needed in order to guide the design of hardware architectures and software systems intended to support such applications and to help procurements. Graph algorithms are a core part of many analytics workloads.

Backed by a steering committee of over 30 international HPC experts from academia, industry, and national laboratories, Graph 500 will establish a set of large-scale benchmarks for these applications. The Graph 500 steering committee is in the process of developing comprehensive benchmarks to address three application kernels: concurrent search, optimization (single source shortest path), and edge-oriented (maximal independent set). Further, we are in the process of addressing five graph-related business areas: Cybersecurity, Medical Informatics, Data Enrichment, Social Networks, and Symbolic Networks.

This is the first serious approach to complement the Top 500 with data intensive applications. Additionally, we are working with the SPEC committee to include our benchmark in their CPU benchmark suite. We anticipate the list will rotate between ISC and SC in future years.
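The “concurrent search” kernel is breadth-first search. As a rough, sequential Python illustration of what the benchmark measures (the real benchmark runs it in parallel on enormous generated graphs):

```python
from collections import deque

def bfs_levels(adj, source):
    """Breadth-first search returning the hop count of each reachable
    vertex -- a toy, single-threaded sketch of the Graph 500 search kernel."""
    level = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in adj.get(u, ()):
            if v not in level:          # first visit fixes the level
                level[v] = level[u] + 1
                q.append(v)
    return level

# Tiny example graph as adjacency lists:
adj = {0: [1, 2], 1: [3], 2: [3], 3: [4]}
levels = bfs_levels(adj, 0)
```

Graph 500 scores systems on traversed edges per second (TEPS) for this kind of traversal, rather than the floating-point throughput the Top 500 rewards.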

LexisNexis – OpenSource – HPCC

Wednesday, June 15th, 2011

LexisNexis Announces HPCC Systems, an Open Source Platform to Solve Big Data Problems for Enterprise Customers

This rocks!

From the press release:

NEW YORK, June 15, 2011 – LexisNexis Risk Solutions today announced that it will offer its data intensive supercomputing platform under a dual license, open source model, as HPCC Systems. HPCC Systems is designed for the enterprise to solve big data problems. The platform is built on top of high performance computing technology, and has been proven with customers for the past decade. HPCC Systems provides a high performance computing cluster (HPCC) technology with a single architecture and a consistent data centric programming language. HPCC Systems is an alternative to Hadoop.

“We feel the time is right to offer our HPCC technology platform as a dual license, open source solution. We believe that HPCC Systems will take big data computing to the next level,” said James M. Peck, chief executive officer, LexisNexis Risk Solutions. “We’ve been doing this quietly for years for our customers with great success. We are now excited to present it to the community to spur greater adoption. We look forward to leveraging the innovation of the open source community to further the development of the platform for the benefit of our customers and the community,” said Mr. Peck.

To manage, sort, link, and analyze billions of records within seconds, LexisNexis developed a data intensive supercomputer that has been proven for the past ten years with customers who need to process large volumes of data. Customers such as leading banks, insurance companies, utilities, law enforcement and federal government leverage the HPCC platform technology through various LexisNexis® products and services. The HPCC platform specializes in the analysis of structured and unstructured data for enterprise class organizations.