Apache Gora

Monday, December 10th, 2012

Apache Gora

From the webpage:

What is Apache Gora?

The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.

Why Apache Gora?

Although there are various excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differs profoundly from that of their relational cousins. Moreover, data-model agnostic frameworks such as JDO are not sufficient for use cases where one needs to use the full power of the data models in column stores. Gora fills this gap by giving the user an easy-to-use in-memory data model and persistence for big data framework with data store specific mappings and built-in Apache Hadoop support.

The overall goal for Gora is to become the standard data representation and persistence framework for big data. The roadmap of Gora can be grouped as follows.

  • Data Persistence : Persisting objects to Column stores such as HBase, Cassandra, Hypertable; key-value stores such as Voldemort, Redis, etc.; SQL databases, such as MySQL, HSQLDB, flat files in local file system of Hadoop HDFS.
  • Data Access : An easy-to-use Java-friendly common API for accessing the data regardless of its location.
  • Indexing : Persisting objects to Lucene and Solr indexes, accessing/querying the data with Gora API.
  • Analysis : Accessing the data and making analysis through adapters for Apache Pig, Apache Hive and Cascading.
  • MapReduce support : Out-of-the-box and extensive MapReduce (Apache Hadoop) support for data in the data store.
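
To make the "common API regardless of its location" idea in the roadmap concrete, here is a minimal sketch in plain Java. It is not Gora's actual API: the `KeyValueStore` interface, the `InMemoryStore` backend, the `Page` class and the `titleOf` helper are all hypothetical names invented for illustration. The point is only that calling code is written against one interface, so the backend behind it could be swapped without touching that code.

```java
import java.util.HashMap;
import java.util.Map;

public class StoreSketch {

    /** Hypothetical store-agnostic interface (an assumption, not Gora's API). */
    interface KeyValueStore<K, V> {
        void put(K key, V value);
        V get(K key); // returns null when the key is absent
    }

    /** Hypothetical in-memory backend; a real backend might wrap HBase or SQL. */
    static final class InMemoryStore<K, V> implements KeyValueStore<K, V> {
        private final Map<K, V> rows = new HashMap<>();

        @Override
        public void put(K key, V value) {
            rows.put(key, value);
        }

        @Override
        public V get(K key) {
            return rows.get(key);
        }
    }

    /** Hypothetical persistent object: a crawled page keyed by its URL. */
    static final class Page {
        final String url;
        final String title;

        Page(String url, String title) {
            this.url = url;
            this.title = title;
        }
    }

    /** Caller code depends only on the interface, never on a concrete store. */
    static String titleOf(KeyValueStore<String, Page> store, String url) {
        Page page = store.get(url);
        return page == null ? null : page.title;
    }

    public static void main(String[] args) {
        KeyValueStore<String, Page> store = new InMemoryStore<>();
        store.put("http://example.org/", new Page("http://example.org/", "Example"));
        System.out.println(titleOf(store, "http://example.org/"));
    }
}
```

In this sketch, replacing `InMemoryStore` with another implementation of the same interface would leave `titleOf` unchanged, which is the property the "Data Access" roadmap item describes.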

When writing about the Nutch 2.X development path, I discovered my omission of Gora from this blog. Apologies for having overlooked it until now.

Apache Nutch v1.6 and Apache Nutch v2.1 Releases

Monday, December 10th, 2012

Apache Nutch v1.6 Released

From the news:

The Apache Nutch PMC are extremely pleased to announce the release of Apache Nutch v1.6. This release includes over 20 bug fixes, the same in improvements, as well as new functionalities including a new HostNormalizer, the ability to dynamically set fetchInterval by MIME-type and functional enhancements to the Indexer API including the normalization of URLs and the deletion of robots noIndex documents. Other notable improvements include the upgrade of key dependencies to Tika 1.2 and Automaton 1.11-8. Please see the list of changes or the release report made in this version for a full breakdown. The release is available here.

See the Nutch 1.x tutorial.

Apache Nutch v2.1 Released

From the news:

The Apache Nutch PMC are very pleased to announce the release of Apache Nutch v2.1. This release continues to provide Nutch users with a simplified Nutch distribution building on the 2.x development drive which is growing in popularity amongst the community. As well as addressing ~20 bugs, this release also offers improved properties for better Solr configuration, upgrades to various Gora dependencies and the introduction of the option to build indexes in Elasticsearch. Please see the list of changes made in this version for a full breakdown. The release is available here.

See the Nutch 2.x tutorial.

I haven’t done a detailed comparison, but roughly, Nutch 1.x relies upon Solr for storage and Nutch 2.x relies upon Gora and HBase.

Surprised that isn’t in the FAQ.

Perhaps I will investigate further and offer a short summary of the differences.