Hadoop for Data Analytics: Implementing a Weblog Parser by Ira Agrawal.
From the post:
With the digitization of the world, the data analytics function of extracting information and generating knowledge from raw data is becoming increasingly important. Parsing weblogs to retrieve information for analysis is one application of data analytics, and one that many companies have turned to for their basic business needs.
For example, Walmart might want to identify the bestselling product category in a region so that it could notify users living in that region about the latest products in that category. Another use case is to capture area details, using IP address information, about the regions that produce the most visits to the site.
All user transactions and on-site actions are normally captured in the weblogs of a company's websites. To retrieve this information, developers must parse these weblogs, which are huge. Sequential parsing would be slow, while parallelizing the parsing makes it fast and efficient. But parallel parsing requires splitting the weblogs into smaller chunks, and the data must be partitioned in such a way that the final results can be consolidated without losing any vital information from the original data.
Hadoop’s MapReduce framework is a natural choice for this kind of parallel processing. With MapReduce, the weblog files are split into smaller chunks and distributed across the nodes of the cluster, each producing its own partial results. These partial results are then consolidated into the final output the user requires.
Walks you through the whole process: setting up the Hadoop cluster, loading the logs, and then parsing them. Not a bad introduction to Hadoop.
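To give a concrete flavor of the pattern the article describes, here is a minimal sketch of a MapReduce job that counts hits per client IP. This is not the article's own parser; it assumes Common Log Format input (client IP as the first space-separated field) and uses the standard org.apache.hadoop.mapreduce API, with hypothetical class and path names.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical example: counts hits per client IP in Common Log Format weblogs.
public class WeblogIpCount {

    public static class IpMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text ip = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumes Common Log Format, where the client IP is the first field.
            String[] fields = value.toString().split(" ");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                ip.set(fields[0]);
                context.write(ip, ONE);  // emit (ip, 1) for each log line
            }
        }
    }

    public static class IpReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Consolidate the partial counts produced across the cluster.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "weblog ip count");
        job.setJarByClass(WeblogIpCount.class);
        job.setMapperClass(IpMapper.class);
        job.setCombinerClass(IpReducer.class);  // local pre-aggregation per chunk
        job.setReducerClass(IpReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, a job like this would be run with something like `hadoop jar weblog-ip-count.jar WeblogIpCount /logs/input /logs/output`, where the HDFS paths are placeholders. Note that the reducer doubles as the combiner, which is safe here because the per-IP sums are associative; that is exactly the split-then-consolidate pattern the excerpt describes.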