Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 25, 2011

Whiz-Kid on Hadoop

Filed under: Daytona,Excel Datascope,Hadoop — Patrick Durusau @ 6:37 pm

Cloudera Whiz-Kid Lipcon Talks Hadoop, Big Data with SiliconANGLE’s Furrier

From the post:

Hadoop, the Big Data processing and analytics framework, isn’t your average open source project.

“If you look at a lot of the open source software that’s been popular out of Apache and elsewhere, its sort of like an open source replacement for something you can already get elsewhere,” said Todd Lipcon, a senior software engineer at Cloudera. “I think Hadoop is kind of unique in that it’s the only option for doing this kind of analysis.”

Lipcon is right. OpenOffice is an open source office suite alternative to Microsoft Office. MySQL is an open source database alternative to Oracle. Hadoop is an open source Big Data framework alternative to …. Well, there is no alternative.

Now that MS has released Daytona along with Excel DataScope, it would be interesting to know how Todd Lipcon sees the ease-of-use issue.

Powerful technology (LaTeX anyone?) may far exceed the capabilities of (insert your favorite word processor), but if it is too difficult to use, poorer alternatives will occupy most of the field.

That may give people with the more powerful technology a righteous feeling, but I am not interested in feeling righteous.

I am interested in winning, which means having a powerful technology that can be used by a wide variety of users of varying skill levels.

Some will use it poorly, barely invoking its capabilities. Others will make good but unimaginative use of it. Still others will push the envelope in terms of what it can do. All are legitimate and all are valuable in their own way.

July 23, 2011

Introduction to Oozie

Filed under: Hadoop,MapReduce,Oozie,Pig — Patrick Durusau @ 3:10 pm

Introduction to Oozie

From the post:

Tasks performed in Hadoop sometimes require multiple Map/Reduce jobs to be chained together to complete its goal. [1] Within the Hadoop ecosystem, there is a relatively new component Oozie [2], which allows one to combine multiple Map/Reduce jobs into a logical unit of work, accomplishing the larger task. In this article we will introduce Oozie and some of the ways it can be used.

What is Oozie ?

Oozie is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store:

  • Workflow definitions
  • Currently running workflow instances, including instance states and variables

Oozie workflow is a collection of actions (i.e. Hadoop Map/Reduce jobs, Pig jobs) arranged in a control dependency DAG (Directed Acyclic Graph), specifying a sequence of action execution. This graph is specified in hPDL (an XML Process Definition Language).

Workflow management for Hadoop!
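To make the control-dependency DAG idea concrete, here is a minimal conceptual sketch in Python (my illustration only, not Oozie's API or hPDL): the workflow is a graph of named actions, and any topological order of that graph is a valid execution sequence.

  from graphlib import TopologicalSorter  # Python 3.9+

  # Hypothetical workflow: two Map/Reduce jobs feed a Pig job,
  # whose output is loaded by a final action.
  workflow = {
      "mr-clean-logs": [],                                  # no dependencies
      "mr-join-users": [],
      "pig-aggregate": ["mr-clean-logs", "mr-join-users"],  # runs after both MR jobs
      "load-results":  ["pig-aggregate"],
  }

  # One valid execution sequence for the control-dependency DAG.
  print(list(TopologicalSorter(workflow).static_order()))
  # e.g. ['mr-clean-logs', 'mr-join-users', 'pig-aggregate', 'load-results']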

Apache Hadoop to get more user-friendly

Filed under: Hadoop,MapReduce — Patrick Durusau @ 3:07 pm

Apache Hadoop to get more user-friendly

From Paul Krill at InfoWorld:

Relief is on the way for users of the open source Apache Hadoop distributed computing platform who have wrestled with the complexity of the technology.

A planned upgrade to Hadoop distributed computing platform, which has become popular for analyzing large volumes of data, is intended to make the platform more user-friendly, said Eric Baldeschwieler, CEO of HortonWorks, which was unveiled as a Yahoo spinoff last month with the intent of building a support and training business around Hadoop. The upgrade also will feature improvements for high availability, installation, and data management. Due in beta releases later this year with a general availability release eyed for the second quarter of 2012, the release is probably going to be called Hadoop 0.23.

I don’t remember seeing any announcements that a product would become “less user-friendly.” You? 😉

Still, this is good news, because it means not only that Hadoop will become easier to use, but that its competitors will too.

July 22, 2011

You Too Can Use Hadoop Inefficiently!!!

Filed under: Algorithms,Graphs,Hadoop,RDF,SPARQL — Patrick Durusau @ 6:15 pm

The headline Hadoop’s tremendous inefficiency on graph data management (and how to avoid it) certainly got my attention.

But when you read the paper, Scalable SPARQL Querying of Large RDF Graphs, it isn’t Hadoop’s “tremendous inefficiency,” but actually that of SHARD, an RDF triple store that uses flat text files for storage.

Or as the authors say in their paper (6.3 Performance Comparison):

Figure 6 shows the execution time for LUBM in the four benchmarked systems. Except for query 6, all queries take more time on SHARD than on the single-machine deployment of RDF-3X. This is because SHARD’s use of hash partitioning only allows it to optimize subject-subject joins. Every other type of join requires a complete redistribution of data over the network within a Hadoop job, which is extremely expensive. Furthermore, its storage layer is not at all optimized for RDF data (it stores data in flat files).

Saying that SHARD (not as well known as Hadoop) was using Hadoop inefficiently would not have the “draw” of allegations about Hadoop’s failure to process graph data efficiently.

Sure, I write blog lines for “draw,” but let’s ‘fess up in the body of the blog article. Readers shouldn’t have to run down other sources to find the real facts.
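To see why hash partitioning only helps subject-subject joins, here is a small Python sketch of the idea (my illustration, not SHARD's code): triples are placed on nodes by hashing their subject, so triples that share a subject are co-located, while joins on anything else may touch triples scattered across the cluster.

  # Hash partitioning of RDF triples by subject (illustration only).
  NUM_NODES = 4

  def node_for(term):
      return hash(term) % NUM_NODES

  triples = [
      ("alice", "worksFor", "acme"),
      ("alice", "knows", "bob"),
      ("bob", "worksFor", "acme"),
  ]

  partitions = {n: [] for n in range(NUM_NODES)}
  for s, p, o in triples:
      partitions[node_for(s)].append((s, p, o))   # partition key = subject

  # Subject-subject join (?x worksFor ?c . ?x knows ?y): all triples for a
  # given ?x hash to the same node, so the join runs locally.
  # Subject-object join (?x knows ?y . ?y worksFor ?c): the ?y triples were
  # partitioned by *their own* subject, so matching them generally requires
  # redistributing data across nodes -- the expensive extra Hadoop job the
  # paper describes.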

Hoop – Hadoop HDFS over HTTP

Filed under: Hadoop — Patrick Durusau @ 6:08 pm

Hoop – Hadoop HDFS over HTTP

From the webpage:

Hoop provides access to all Hadoop Distributed File System (HDFS) operations (read and write) over HTTP/S.

Hoop can be used to:

  • Access HDFS using HTTP REST.
  • Transfer data between clusters running different versions of Hadoop (thereby overcoming RPC versioning issues).
  • Access data in a HDFS cluster behind a firewall. The Hoop server acts as a gateway and is the only system that is allowed to go through the firewall.

Hoop has a Hoop client and a Hoop server component:

  • The Hoop server component is a REST HTTP gateway to HDFS supporting all file system operations. It can be accessed using standard HTTP tools (e.g. curl and wget), HTTP libraries from different programming languages (e.g. Perl, JavaScript) as well as using the Hoop client. The Hoop server component is a standard Java web-application and it has been implemented using Jersey (JAX-RS).
  • The Hoop client component is an implementation of Hadoop FileSystem client that allows using the familiar Hadoop filesystem API to access HDFS data through a Hoop server.
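To give a feel for what “HDFS over HTTP” looks like from the client side, here is a rough Python sketch using the requests library. The host name, the default port (14000) and the WebHDFS-style operation names are my assumptions; check the Hoop documentation for the exact URL parameters.

  import requests

  # Assumed Hoop endpoint and operation names (WebHDFS-style); verify
  # against the Hoop documentation before relying on them.
  HOOP = "http://hoop-server.example.com:14000"
  USER = "alice"

  # List a directory.
  listing = requests.get(f"{HOOP}/user/alice",
                         params={"op": "list", "user.name": USER})
  print(listing.json())

  # Read (part of) a file.
  data = requests.get(f"{HOOP}/user/alice/part-00000",
                      params={"op": "open", "user.name": USER})
  print(data.content[:200])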

Hadoop for Intelligence Analysis???

Filed under: Hadoop,Intelligence — Patrick Durusau @ 6:05 pm

Hadoop for Intelligence Analysis

From the webpage:

CTOlabs.com, a subsidiary of the technology research, consulting and services firm Crucial Point LLC and a peer site of CTOvision.com, has just published a white paper providing context, tips and strategies around Hadoop titled “Hadoop for Intelligence Analysis.” This paper focuses on use cases selected to be informative to any organization thinking through ways to make sense out of large quantities of information.

I’m curious. How little would you have to know about Hadoop or intelligence analysis to get something from the “white paper”?

Or is having “Hadoop” in a title these days enough to gain a certain number of readers?

Unless you want to answer my first question, I suggest that you avoid this “white paper” as “white noise.”

Your time can be better spent doing almost anything.

July 21, 2011

Wonderdog

Filed under: ElasticSearch,Hadoop,Pig — Patrick Durusau @ 6:30 pm

Wonderdog

From the webpage:

Wonderdog is a Hadoop interface to Elastic Search. While it is specifically intended for use with Apache Pig, it does include all the necessary Hadoop input and output formats for Elastic Search. That is, it’s possible to skip Pig entirely and write custom Hadoop jobs if you prefer.

I may just be paying more attention but the search scene seems to be really active.

That’s good for topic maps because the more data that is searched, the greater the likelihood of heterogeneous data. Text messages between teens are probably heterogeneous but who cares?

Medical researchers using different terminology produce heterogeneous data, not just today’s data but yesteryear’s as well. Now that could be important.

Hadoop Advancing Information Management

Filed under: Conferences,Hadoop — Patrick Durusau @ 6:25 pm

Hadoop Advancing Information Management

I first saw this at Alex Popescu’s myNoSQL site and decided to explore further.

From the Ventana Research site:

Newly conducted benchmark research from Ventana Research shows organizations are recognizing that addressing big data needs requires new approaches to data and information management. New processes and new technologies have begun to take hold, as have the beginnings of a set of best practices. In particular, Hadoop is emerging at the forefront as a solution for managing large-scale data. The research findings indicate that Hadoop is already being used in one third of big data environments and evaluated in nearly another fifth. The research also found that Hadoop is additive to existing technologies according to almost two thirds of research participants.

Topping the lists of benefits in Hadoop adoption are newly found capabilities – 87% of organizations using Hadoop report being able to do new things with big data versus 52% of other organizations, 94% perform new types of analytics on large volumes of data, and 88% analyze data at a greater level of detail. These research statistics already validate the arrival of Hadoop as a key component of organizations’ information management efforts. However, challenges remain, with over half the organizations indicating some level of dissatisfaction with Hadoop.

Like me, you probably want something more than breathless numbers, and more detail. That is going to be available:

Ventana Research will detail the findings of this benchmark research in a live interactive webinar on July 28, 2011 at 10:00 AM Pacific time [1 PM Eastern] that will discuss the research findings and offer recommendations for improvement. Key research findings to be discussed will include:

  • The current state of organizations’ thinking on how best to apply Big Data management techniques.
  • Top patterns in the adoption of new methods and technologies
  • The current state, future direction of and potential investments
  • The competencies required to manage large-scale data.
  • Recommendations for organizations to act on immediately.

Of course I am looking for places in the use of Hadoop where subject identity is likely to be recognized as an issue.

July 20, 2011

Harnessing the Power of Apache Hadoop:…

Filed under: Hadoop — Patrick Durusau @ 3:38 pm

Harnessing the Power of Apache Hadoop: How to Rapidly and Reliably Operationalize Apache Hadoop and Unlock the Value of Data

I swear! That really is the title! From the page just below it:

Webinar: Thursday, July 21, 2011 10:00 AM – 10:45 AM PDT

Join Cloudera’s CEO Mike Olson on July 21st for a webinar about optimizing your Apache Hadoop deployment and leveraging this completely open source technology in production for Big Data analytics and asking questions across structured and unstructured data that were previously impossible to solve.

  • Learn why you must consider this open source technology in order to evolve your company
  • Understand how Apache Hadoop can provide your organization with a holistic view and insight into data
  • Learn how you can easily configure Apache Hadoop for your enterprise
  • Find out how several well-known organizations are using Apache Hadoop to solve real-world business problems such as increasing revenue, delivering better business solutions and ensuring network performance

Ambitious goals for forty-five (45) minutes in a forum where a lot of folks will also be reading email, tweets, etc., but maybe Mike really is that good. 😉

I am sure it will be interesting, but hopefully it will also be recorded.

My experience is that most webinars are good for picking up memes and themes that you can then explore at your leisure.


Suggested future title: Hadoop: Where Data Hits The Road. Zero hits in Google for “Where Data Hits The Road” as of 16:30 Eastern Time, 20 July 2011.

To be fair, the present title, “Harnessing the Power of Apache Hadoop: How to Rapidly and Reliably Operationalize Apache Hadoop and Unlock the Value of Data” gets five (5) hits in Google.

Wonder which one would propagate better?

July 19, 2011

Excel DataScope

Filed under: Algorithms,Cloud Computing,Excel Datascope,Hadoop — Patrick Durusau @ 7:51 pm

Excel DataScope

From the webpage:

From the familiar interface of Microsoft Excel, Excel DataScope enables researchers to accelerate data-driven decision making. It offers data analytics, machine learning, and information visualization by using Windows Azure for data and compute-intensive tasks. Its powerful analysis techniques are applicable to any type of data, ranging from web analytics to survey, environmental, or social data.

And:

Excel DataScope is a technology ramp between Excel on the user’s client machine, the resources that are available in the cloud, and a new class of analytics algorithms that are being implemented in the cloud. An Excel user can simply select an analytics algorithm from the Excel DataScope Research Ribbon without concern for how to move their data to the cloud, how to start up virtual machines in the cloud, or how to scale out the execution of their selected algorithm in the cloud. They simply focus on exploring their data by using a familiar client application.

Excel DataScope is an ongoing research and development project. We envision a future in which a model developer can publish their latest data analysis algorithm or machine learning model to the cloud and within minutes Excel users around the world can discover it within their Excel Research Ribbon and begin using it to explore their data collection. (emphasis added)

I added emphasis to the last sentence because that is the sort of convenience and collaboration that will make cloud computing meaningful.

Imagine that sort of sharing across MS and non-MS cloud resources. Well, you would have to have an Excel DataScope interface on non-MS cloud resources, but one hopes that will be a product offering in the near future.

July 18, 2011

Microsoft Research Releases Another Hadoop Alternative for Azure

Filed under: Daytona,Hadoop — Patrick Durusau @ 6:45 pm

Microsoft Research Releases Another Hadoop Alternative for Azure

From the post:

Today Microsoft Research announced the availability of a free technology preview of Project Daytona MapReduce Runtime for Windows Azure. Using a set of tools for working with big data based on Google’s MapReduce paper, it provides an alternative to Apache Hadoop.

Daytona was created by the eXtreme Computing Group at Microsoft Research. It’s designed to help scientists take advantage of Azure for working with large, unstructured data sets. Daytona is also being used to power a data-analytics-as-a-service offering the team calls Excel DataScope.

Excellent coverage of this latest release along with information about related software from Microsoft.

I don’t think anyone disputes that Hadoop is difficult to use effectively, so why not offer an MS product that makes Apache Hadoop easier to use? With all the consumer software skills at Microsoft, it would still be a challenge, but one that Microsoft would be the most likely candidate to overcome.

And that would give Microsoft a window (sorry) into non-Azure environments as well as an opportunity to promote an Excel-like interface. (Hard to argue against the familiar.)

We are going to reach the future of computing more quickly the fewer times we stop to build product silos.

Products yes, product silos, no.

The Future of Hadoop in Bioinformatics

Filed under: BigData,Bioinformatics,Hadoop,Heterogeneous Data — Patrick Durusau @ 6:44 pm

The Future of Hadoop in Bioinformatics: Hadoop and its ecosystem including MapReduce are the dominant open source Big Data solution by Bob Gourley.

From the post:

Earlier, I wrote on the use of Hadoop in the exciting, evolving field of Bioinformatics. I have since had the pleasure of speaking with Dr. Ron Taylor of Pacific Northwest National Laboratory, the author of “An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics,” on what’s changed in the half-year since its publication and what’s to come.

As Dr. Taylor expected, Hadoop and its “ecosystem,” including MapReduce, are the dominant open source Big Data solution for next generation DNA sequencing analysis. This is currently the sub-field generating the most data and requiring the most computationally expensive analysis. For example, de novo assembly pieces together tens of millions of short reads (which may be 50 bases long on ABI SOLiD sequencers). To do so, every read needs to be compared to the others, which scales in proportion to n log n, meaning that, even assuming reads that are 100 base pairs in length and a human genome of 3 billion pairs, analyzing an entire human genome will take 7.5 times longer than if it scaled linearly. By dividing the task up across a Hadoop cluster, the analysis will be faster and, unlike other high performance computing alternatives, it can run on regular commodity servers that are much cheaper than custom supercomputers. This, combined with the savings from using open source software, ease of use due to seamless scaling, and the strength of the Hadoop community, makes Hadoop and related software the parallelization solution of choice in next generation sequencing. In other areas, however, traditional HPC is still more common and Hadoop has not yet caught on. Dr. Taylor believes that in the next year to 18 months, this will change due to the following trends:
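A quick back-of-the-envelope check of that 7.5× figure (my arithmetic, and it only works out if the post's logarithm is base 10): a 3-billion-base genome cut into 100-base reads gives about 30 million reads, and n log n work exceeds linear work by a factor of log n.

  import math

  genome_bases = 3_000_000_000
  read_length = 100
  n_reads = genome_bases // read_length     # ~30 million reads

  # n*log(n) relative to linear n is a factor of log(n).
  print(math.log10(n_reads))                # ~7.48, i.e. roughly 7.5x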

So, over the next year to eighteen months, what do you see as the evolution of topic map software and services?

Or what problems do you see becoming apparent in bioinformatics or other areas (like the Department of Energy’s knowledgebase) that will require topic maps?

(More on the DOE project later this week.)

IBM Targets the Future of Social Media Analytics

Filed under: Analytics,Hadoop — Patrick Durusau @ 6:42 pm

IBM Targets the Future of Social Media Analytics

This is from back in April 2011, but I thought it was worthy of a note. The post reads in part:

The new product, called Cognos Consumer Insight, is built upon IBM’s Cognos business intelligence technology along with Hadoop to process the piles of unstructured social media data. According to Deepak Advani, IBM’s VP of predictive analytics, there’s a lot of value in performing text analytics on data derived from Twitter, Facebook and other social forums to determine how companies or their products are faring among consumers. Cognos lets customers view sentiment levels over time to determine how efforts are working, he added, and skilled analysts can augment their Cognos Consumer Insight usage with IBM’s SPSS product to bring predictive analytics into the mix.

The partnership with Yale is designed to address the current dearth of analytic skills among business leaders, Advani said. Although the program will involve training on analytics technologies, Advani explained that business people still need some grounding in analytic theory and thinking rather than just knowing how to use a particular piece of software. “I think the primary goal is for students to learn analytically,” he said, which will help know which technology to put to work on what data, and how.

Within many organizations, he added, the main problem is that they’re not using analytics at the point of decision or across all their business processes. Advani says partnerships like those with Yale will help instill the thought process of using mathematical algorithms instead of gut feelings.

I was with them up to the point where it says: “…instill the thought process of using mathematical algorithms instead of gut feelings.”

I don’t take “analytical thinking” to be limited to mathematical algorithms.

Moreover, we have been down this road before, when Jack Kennedy was president and Robert McNamara was Secretary of Defense. Operations analysis, they called it back then. Supposedly you could determine, mathematically, how much equipment was needed at any particular location, without asking for local “gut” opinions about it. True, some bases don’t need snow plows every year, but when planes are trying to land, they are very nice to have.

If you object that this is an abuse of operations theory, I would have to concede you are correct, but abused it was, on a regular basis.

I suspect the program will be a very good one, along with the software. My only caution concerns any analytical technique that gives an answer at variance with years of experience in a trade. That is at least a reason to pause and ask why.

July 17, 2011

Hadoop & Startups: Where Open Source Meets Business Data

Filed under: Hadoop,Marketing,Subject Identity — Patrick Durusau @ 7:28 pm

Hadoop & Startups: Where Open Source Meets Business Data

From the post:

A decade ago, the open-source LAMP (Linux, Apache, MySQL, PHP/Python) stack began to transform web startup economics. As new open-source webservers, databases, and web-friendly programming languages liberated developers from proprietary software and big iron hardware, startup costs plummeted. This lowered the barrier to entry, changed the startup funding game, and led to the emergence of the current Angel/Seed funding ecosystem. In addition, of course, to enabling a generation of webapps we all use everyday.

This same process is now unfolding in the Big Data space, with an open-source ecosystem centered around Hadoop displacing the expensive, proprietary solutions. Startups are creating more intelligent businesses and more intelligent products as a result. And perhaps even more importantly, this technological movement has the potential to blur the sharp line between traditional business and traditional web startups, dramatically expanding the playing field for innovation.

So, how do we create an open-source subject identity ecosystem?

Note that I said “subject identity ecosystem” and not URLs pointing at arbitrary resources. Useful, but subject identity, to be re-usable, requires more than that.

July 8, 2011

Hadoop and MapReduce

Filed under: Hadoop,MapReduce — Patrick Durusau @ 3:53 pm

Hadoop and MapReduce

Nice slidedeck on Hadoop and MapReduce by Friso van Vollenhoven.

You can’t tell which illustration or explanation is going to “click” for someone, so keep this one in mind when discussing Hadoop/MapReduce.

July 7, 2011

MapR Releases Commercial Distributions based on Hadoop

Filed under: Hadoop — Patrick Durusau @ 4:16 pm

MapR Releases Commercial Distributions based on Hadoop

From the post:

MapR Technologies released a big data toolkit, based on Apache Hadoop with their own distributed storage alternative to HDFS. The software is commercial, with MapR offering both a free version, M3, as well as a paid version, M5. M5 includes snapshots and mirroring for data, Job Tracker recovery, and commercial support. MapR’s M5 edition will form the basis of EMC Greenplum’s upcoming HD Enterprise Edition, whereas EMC Greenplum’s HD Community Edition will be based on Facebook’s Hadoop distribution rather than MapR technology.

At the Hadoop Summit last week, MapR Technologies announced the general availability of their "Next Generation Distribution for Apache Hadoop." InfoQ interviewed CEO John Schroeder and VP Marketing Jack Norris to learn more about their approach. MapR claims to improve MapReduce and HBase performance by a factor of 2-5, and to eliminate single points of failure in Hadoop. Schroeder says that they measure performance against competing distributions by timing benchmarks such as DFSIO, Terasort, YCSB, Gridmix, and Pigmix. He also said that customers testing MapR’s technology are seeing a 3-5 times improvement in performance against previous versions of Hadoop that they use. Schroeder reports that they had 35 beta testers and that they showed linear scalability in clusters of up to 160 nodes. MapR reports that several of the beta test customers now have their technology in production – including one that has a 140 node cluster in production, and another that "is looking at deploying MapR on 2000 nodes." By comparison, Yahoo is believed to run the largest Hadoop clusters, comprised of 4000 nodes running Apache Hadoop and competitor Cloudera claimed to have more than 80 customers running Hadoop in production in March 2011, with 22 clusters running Cloudera’s distribution that are over a petabyte as of July 2011.

Remember, Hadoop is a buzz word in U.S. government circles.

July 1, 2011

Hadoop Developer Virtual Appliance

Filed under: Hadoop,MapReduce — Patrick Durusau @ 2:48 pm

Hadoop Developer Virtual Appliance

From the webpage:

The Karmasphere Studio Community All-in-One Virtual Appliance combines Apache Hadoop, Eclipse and Karmasphere Studio Community to make it easy to get started with the Hadoop development lifecycle. With pre-configured and documented examples, this easy-to-install environment gives the developer everything they need to learn, prototype, develop and test Hadoop applications.

Use Studio Community Edition to:

  • Learn how to develop Hadoop applications in a familiar graphical environment. Fast!
  • Prototype and develop your Hadoop projects with visual aids and wizards.
  • Test your Hadoop application on any version of Hadoop whether it runs on premise, in a private data center or in the cloud.

Studio Community Edition is perfect for those new to Hadoop, and a great way to explore Karmasphere Studio before jumping into Studio Professional Edition.

  • Supports all Hadoop platforms including Amazon Elastic MapReduce, Apache Hadoop, Cloudera CDH, EMC Greenplum HD Community Edition and IBM InfoSphere BigInsights.
  • Runs on Mac, Linux, and Windows.
  • Includes comprehensive “Getting Started Guide” with easy to use examples

The tools for Hadoop and Map/Reduce just keep getting better. There’s a lesson in there somewhere.

June 28, 2011

Get started with Hadoop…

Filed under: Hadoop,MapReduce — Patrick Durusau @ 9:47 am

Get started with Hadoop: From evaluation to your first production cluster by Brett Sheppard.

From the introduction:

This piece provides tips, cautions and best practices for an organization that would like to evaluate Hadoop and deploy an initial cluster. It focuses on the Hadoop Distributed File System (HDFS) and MapReduce. If you are looking for details on Hive, Pig or related projects and tools, you will be disappointed in this specific article, but I do provide links for where you can find more information.

Highly recommended!

HortonWorks

Filed under: Hadoop,MapReduce — Patrick Durusau @ 9:47 am

In Alex Popescu’s myNoSQL I read Yahoo Launches Hadoop Spinoff, which pointed to GigaOm’s Exclusive: Yahoo launching Hadoop spinoff this week (Dick Harris); the announcement is anticipated Tuesday (June 28th) or at the Hadoop Summit 2011 on Wednesday, June 29th.

Informational only, no release at this point.

Mapreduce & Hadoop Algorithms in Academic Papers (4th update)

Filed under: Hadoop,MapReduce — Patrick Durusau @ 9:46 am

Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011)

From the post:

It’s been a year since I updated the mapreduce algorithms posting last time, and it has been truly an excellent year for mapreduce and hadoop – the number of commercial vendors supporting it has multiplied, e.g. with 5 announcements at EMC World only last week (Greenplum, Mellanox, Datastax, NetApp, and Snaplogic) and today’s Datameer funding announcement, which benefits the mapreduce and hadoop ecosystem as a whole (even for small fish like us here in Atbrox). The work-horse in mapreduce is the algorithm; this update has added 35 new papers compared to the prior posting, and new ones are marked with *. I’ve also added 2 new categories since the last update – astronomy and social networking.

June 26, 2011

DataCaml – a first look at distributed dataflow programming in OCaml

Filed under: CIEL,DataCaml,Dryad,Hadoop — Patrick Durusau @ 4:12 pm

DataCaml – a first look at distributed dataflow programming in OCaml

From the post:

Distributed programming frameworks like Hadoop and Dryad are popular for performing computation over large amounts of data. The reason is programmer convenience: they accept a query expressed in a simple form such as MapReduce, and automatically take care of distributing computation to multiple hosts, ensuring the data is available at all nodes that need it, and dealing with host failures and stragglers.

A major limitation of Hadoop and Dryad is that they are not well-suited to expressing iterative algorithms or dynamic programming problems. These are very commonly found patterns in many algorithms, such as k-means clustering, binomial options pricing or Smith Waterman for sequence alignment.

Over in the SRG in Cambridge, we developed a Turing-powerful distributed execution engine called CIEL that addresses this. The NSDI 2011 paper describes the system in detail, but here’s a shorter introduction.

The post gives an introduction to the OCaml API.

The CIEL Execution Engine description begins with:

CIEL consists of a master coordination server and workers installed on every host. The engine is job-oriented: a job consists of a graph of tasks which results in a deterministic output. CIEL tasks can run in any language and are started by the worker processes as needed. Data flows around the cluster in the form of references that are fed to tasks as dependencies. Tasks can publish their outputs either as concrete references if they can finish the work immediately or as a future reference. Additionally, tasks can dynamically spawn more tasks and delegate references to them, which makes the system Turing-powerful and suitable for iterative and dynamic programming problems where the task graph cannot be computed statically.

BTW, you can also have opaque references, which progress for a while, then stop.
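As a purely conceptual sketch of the future-reference and dynamic-spawning idea (in Python, not the CIEL or DataCaml API): a parent task spawns child tasks at runtime and combines their future results, so the task graph is not known statically.

  from concurrent.futures import ThreadPoolExecutor

  # Conceptual sketch only -- not the CIEL/DataCaml API. A "task" can spawn
  # child tasks at runtime and publish its output as a future reference.
  executor = ThreadPoolExecutor(max_workers=8)

  def word_count(chunk):
      # Leaf task: a concrete result, available as soon as it runs.
      return len(chunk.split())

  def total_count(chunks):
      # Parent task: spawns one child task per chunk and combines their
      # outputs, so the task graph is built dynamically, not statically.
      child_refs = [executor.submit(word_count, c) for c in chunks]
      return sum(ref.result() for ref in child_refs)

  chunks = ["to be or not to be", "that is the question"]
  future_ref = executor.submit(total_count, chunks)  # a "future reference"
  print(future_ref.result())                         # 10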

Deeply interesting work.

June 25, 2011

HackReduce Data

Filed under: Conferences,Dataset,Hadoop,MapReduce — Patrick Durusau @ 8:49 pm

HackReduce Data

Data sets, and instructions on those data sets, for the Hack/Reduce Big Data hackathon.

Includes:

Always nice to have data of interest to a user community when demonstrating topic maps.

June 22, 2011

VoltDB Announces Hadoop Integration

Filed under: Hadoop,VoltDB — Patrick Durusau @ 6:41 pm

VoltDB Announces Hadoop Integration

From the announcement:

VoltDB, a leading provider of high-velocity data management systems, today announced the release of VoltDB Integration for Hadoop. The new product functionality, available in VoltDB Enterprise Edition, allows organizations to selectively stream high velocity data from a VoltDB cluster into Hadoop’s native HDFS file system by leveraging Cloudera’s Distribution Including Apache Hadoop (CDH), which has SQL-to-Hadoop integration technology, Apache Sqoop, built in.

“The term ‘big data’ is being applied to a diverse set of data storage and processing problems related to the growing volume, variety and velocity of data and the desire of organizations to store and process data sets in their totality,” said Matt Aslett, senior analyst, enterprise software, The 451 Group. “Choosing the right tool for the job is crucial: high velocity data requires an engine that offers fast throughput and real-time visibility; high volume data requires a platform that can expose insights in massive data sets. Integration between VoltDB and CDH will help organizations to combine two special purpose engines to solve increasingly complex data management problems.”

See also: Cloudera – Apache Hadoop Connector for Netezza.

I can’t imagine a better environment for promotion of topic maps than “big data.” The more data there is processed, the more semantic integration issues will come to the fore. At least to clients paying the bills for sensible answers. It is sorta like putting teenagers in Indy race cars. It won’t take all that long before some of them will decide they need driving lessons.

Cloudera – Apache Hadoop Connector for Netezza

Filed under: Hadoop,Netezza — Patrick Durusau @ 6:40 pm

Cloudera Delivers Apache Hadoop Connector for Netezza

From the announcement:

Cloudera Inc., a leading provider of Apache Hadoop-based data management software and services, today announced the immediate general availability of the Cloudera Connector for IBM Netezza appliances. The connector allows Netezza users to leverage Cloudera’s Distribution including Apache Hadoop (CDH) and Cloudera Enterprise services, support and management tools to derive highly articulated analytical insights from large unstructured data sets. The Cloudera Connector, which is the first of its kind for CDH and Cloudera Enterprise, enables high-speed, bilateral data transfer between CDH and Netezza environments.

“As the amount of data that organizations need to process, especially for analytics, continues to increase, Apache Hadoop is increasingly becoming an important data integration tool to enhance performance in reducing very large amounts of data to only what is needed in the data warehouse,” said Donald Feinberg, VP and distinguished analyst at Gartner. “Hadoop presents a viable solution for organizations looking to address the challenges presented by large scale data and has the potential to extend the capabilities of a company’s data warehouse by providing expanded opportunities for analysis and storage for complex data sets.”

See also: VoltDB Announces Hadoop Integration.

I can’t imagine a better environment for promotion of topic maps than “big data.” The more data there is processed, the more semantic integration issues will come to the fore. At least to clients paying the bills for sensible answers. It is sorta like putting teenagers in Indy race cars. It won’t take all that long before some of them will decide they need driving lessons.

Best Practices with Hadoop – Real World Data

Filed under: Hadoop — Patrick Durusau @ 6:38 pm

Best Practices with Hadoop – Real World Data

Thursday, June 23, 2011 11:00 AM – 12:00 PM PDT

From the webpage:

Ventana Research recently completed data collection for the first large-scale, effective research on Hadoop and Information Management. Be among the first to get a glimpse into the results by attending this webinar, sponsored by Cloudera and Karmasphere.

David Menninger of Ventana Research will share some of the preliminary findings from the survey. Register now to learn how to improve your information management efforts with specific knowledge of the best practices for managing large-scale data with Hadoop.

About the presenter:

David Menninger is responsible for the research direction of information technologies at Ventana Research covering major areas including Analytics, Business Intelligence and Information Management. David brings to Ventana Research over two decades of experience, through which he has marketed and brought to market some of the leading edge technologies for helping organizations analyze data to support a range of action-taking and decision-making processes.

I suspect it is mostly a promo for the “details,” but it still could be worth attending.

Biodiversity Indexing: Migration from MySQL to Hadoop

Filed under: Hadoop,Hibernate,MySQL — Patrick Durusau @ 6:36 pm

Biodiversity Indexing: Migration from MySQL to Hadoop

From the post:

The Global Biodiversity Information Facility is an international organization, whose mission is to promote and enable free and open access to biodiversity data worldwide. Part of this includes operating a search, discovery and access system, known as the Data Portal; a sophisticated index to the content shared through GBIF. This content includes both complex taxonomies and occurrence data such as the recording of specimen collection events or species observations. While the taxonomic content requires careful data modeling and has its own challenges, it is the growing volume of occurrence data that attracts us to the Hadoop stack.

The Data Portal was launched in 2007. It consists of crawling components and a web application, implemented in a typical Java solution consisting of Spring, Hibernate and SpringMVC, operating against a MySQL database. In the early days the MySQL database had a very normalized structure, but as content and throughput grew, we adopted the typical pattern of denormalisation and scaling up with more powerful hardware. By the time we reached 100 million records, the occurrence content was modeled as a single fixed-width table. Allowing for complex searches containing combinations of species identifications, higher-level groupings, locality, bounding box and temporal filters required carefully selected indexes on the table. As content grew it became clear that real time indexing was no longer an option, and the Portal became a snapshot index, refreshed on a monthly basis, using complex batch procedures against the MySQL database. During this growth pattern we found we were moving more and more operations off the database to avoid locking, and instead partitioned data into delimited files, iterating over those and even performing joins using text files by synthesizing keys, sorting and managing multiple file cursors. Clearly we needed a better solution, so we began researching Hadoop. Today we are preparing to put our first Hadoop process into production.
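The flat-file workaround described above (synthesized keys, sorted files, multiple cursors) is essentially a sort-merge join. Here is a minimal Python sketch of that pattern, with hypothetical file names and one record per key to keep the cursor logic short; this is my illustration, not GBIF's code.

  # Sort-merge join over two tab-delimited files, each already sorted by a
  # synthesized join key in the first column.
  def records(path):
      with open(path) as f:
          for line in f:
              key, _, rest = line.rstrip("\n").partition("\t")
              yield key, rest

  def merge_join(left_path, right_path):
      left, right = records(left_path), records(right_path)
      l, r = next(left, None), next(right, None)
      while l is not None and r is not None:
          if l[0] == r[0]:
              yield l[0], l[1], r[1]      # keys match: emit the joined row
              l, r = next(left, None), next(right, None)
          elif l[0] < r[0]:
              l = next(left, None)        # advance the lagging cursor
          else:
              r = next(right, None)

  # for row in merge_join("occurrences.tsv", "taxa.tsv"):
  #     print(row)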

Awesome project!

Where would you suggest the use of topic maps and subject identity to improve the project?

June 17, 2011

Natural Language Processing with Hadoop and Python

Filed under: Hadoop,Natural Language Processing,Python — Patrick Durusau @ 7:19 pm

Natural Language Processing with Hadoop and Python

From the post:

If you listen to analysts talk about complex data, they all agree, it’s growing, and faster than anything else before. Complex data can mean a lot of things, but to our research group, ever increasing volumes of naturally occurring human text and speech—from blogs to YouTube videos—enable new and novel questions for Natural Language Processing (NLP). The dominating characteristic of these new questions involves making sense of lots of data in different forms, and extracting useful insights.

Now that I think about it, a lot of the input from various intelligence operations consists of “naturally occurring human text and speech….” Anyone can crunch lots of text and speech; the question is being a good enough analyst to extract something useful.
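For readers who have not seen the Hadoop-plus-Python combination, here is a minimal Hadoop Streaming style token-count pair, assuming the usual Streaming contract (records on stdin, tab-separated key/value pairs on stdout, with the framework sorting by key between the two stages). A real NLP job would swap in proper tokenization and analysis; the file names are just for illustration.

  # mapper.py -- emit one (token, 1) pair per whitespace-separated token.
  import sys

  for line in sys.stdin:
      for token in line.strip().lower().split():
          print(f"{token}\t1")

and the matching reducer:

  # reducer.py -- Streaming delivers the mapper output sorted by key.
  import sys

  current, count = None, 0
  for line in sys.stdin:
      token, _, n = line.rstrip("\n").partition("\t")
      if token != current:
          if current is not None:
              print(f"{current}\t{count}")
          current, count = token, 0
      count += int(n)
  if current is not None:
      print(f"{current}\t{count}")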

June 13, 2011

Starfish: A Self-Tuning System for Big Data Analytics

Filed under: BigData,Hadoop,Topic Maps,Usability — Patrick Durusau @ 7:02 pm

Starfish: A Self-Tuning System for Big Data Analytics by Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, and Shivnath Babu, of Duke University.

Abstract:

Timely and cost-effective analytics over “Big Data” is now a key ingredient for success in many businesses, scientific and engineering disciplines, and government endeavors. The Hadoop software stack—which consists of an extensible MapReduce execution engine, pluggable distributed storage engines, and a range of procedural to declarative interfaces—is a popular choice for big data analytics. Most practitioners of big data analytics—like computational scientists, systems researchers, and business analysts—lack the expertise to tune the system to get good performance. Unfortunately, Hadoop’s performance out of the box leaves much to be desired, leading to suboptimal use of resources, time, and money (in pay-as-you-go clouds). We introduce Starfish, a self-tuning system for big data analytics. Starfish builds on Hadoop while adapting to user needs and system workloads to provide good performance automatically, without any need for users to understand and manipulate the many tuning knobs in Hadoop. While Starfish’s system architecture is guided by work on self-tuning database systems, we discuss how new analysis practices over big data pose new challenges, leading us to different design choices in Starfish.

The paper accepts that usability is, at least for this project, more important than peak performance. That is, the goal is to open up the use of Hadoop, with reasonable performance, to a large number of non-expert users. That will probably do as much as, if not more than, Hadoop’s native performance to spread its use in a number of sectors.

It makes me wonder what acceptance of usability over precision would look like for topic maps. Suggestions?

June 11, 2011

Hadoop: What is it Good For? Absolutely … Something

Filed under: Hadoop,Marketing — Patrick Durusau @ 12:43 pm

Hadoop: What is it Good For? Absolutely … Something by James Kobielus is an interesting review of how to contrast Hadoop with an enterprise data warehouse (EDW).

From the post:

So – apart from being an open-source community with broad industry momentum – what is Hadoop good for that you can’t get elsewhere? The answer to that is a mouthful, but a powerful one.

Essentially, Hadoop is vendor-agnostic in-database analytics in the cloud, leveraging an open, comprehensive, extensible framework for building complex advanced analytics and data management functions for deployment into cloud computing architectures. At the heart of that framework is MapReduce, which is the only industry framework for developing statistical analysis, predictive modeling, data mining, natural language processing, sentiment analysis, machine learning, and other advanced analytics. Another linchpin of Hadoop, Pig, is a versatile language for building data integration processing logic.

Promoting Hadoop without singing Aquarius, promising us a new era in human relationships, or claiming that we are going to be smarter than we were 100, 500, or even 1,000 years ago. Just cold hard data analysis advantages, the sort that reputations, businesses and billings are built upon. Maybe there is a lesson there for topic maps?

June 3, 2011

IBM InfoSphere BigInsights

Filed under: Avro,BigInsights,Hadoop,HBase,Lucene,Pig,Zookeeper — Patrick Durusau @ 2:32 pm

IBM InfoSphere BigInsights

Two items stand out in the usual laundry list of “easy administration” and “IBM supports open source” claims:

The Jaql query language. Jaql, a Query Language for JavaScript Object Notation (JSON), provides the capability to process both structured and non-traditional data. Its SQL-like interface is well suited for quick ramp-up by developers familiar with the SQL language and makes it easier to integrate with relational databases.

….

Integrated installation. BigInsights includes IBM value-added technologies, as well as open source components, such as Hadoop, Lucene, Hive, Pig, Zookeeper, Hbase, and Avro, to name a few.

I guess it must include a “few” things, since the 64-bit Linux download is 398 MB.

Just pointing out its availability. More commentary to follow.
