Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 22, 2011

Biodiversity Indexing: Migration from MySQL to Hadoop

Filed under: Hadoop,Hibernate,MySQL — Patrick Durusau @ 6:36 pm

Biodiversity Indexing: Migration from MySQL to Hadoop

From the post:

The Global Biodiversity Information Facility is an international organization, whose mission is to promote and enable free and open access to biodiversity data worldwide. Part of this includes operating a search, discovery and access system, known as the Data Portal; a sophisticated index to the content shared through GBIF. This content includes both complex taxonomies and occurrence data such as the recording of specimen collection events or species observations. While the taxonomic content requires careful data modeling and has its own challenges, it is the growing volume of occurrence data that attracts us to the Hadoop stack.

The Data Portal was launched in 2007. It consists of crawling components and a web application, implemented in a typical Java solution consisting of Spring, Hibernate and SpringMVC, operating against a MySQL database. In the early days the MySQL database had a very normalized structure, but as content and throughput grew, we adopted the typical pattern of denormalisation and scaling up with more powerful hardware. By the time we reached 100 million records, the occurrence content was modeled as a single fixed-width table. Allowing for complex searches containing combinations of species identifications, higher-level groupings, locality, bounding box and temporal filters required carefully selected indexes on the table. As content grew it became clear that real time indexing was no longer an option, and the Portal became a snapshot index, refreshed on a monthly basis, using complex batch procedures against the MySQL database. During this growth pattern we found we were moving more and more operations off the database to avoid locking, and instead partitioned data into delimited files, iterating over those and even performing joins using text files by synthesizing keys, sorting and managing multiple file cursors. Clearly we needed a better solution, so we began researching Hadoop. Today we are preparing to put our first Hadoop process into production.

Awesome project!

Where would you suggest the use of topic maps and subject identity to improve the project?

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress