Working with Small Files in Hadoop – Part 1, Part 2, Part 3

Working with Small Files in Hadoop – Part 1, Part 2, Part 3 by Chris Deptula.

From the post:

Why do small files occur?

The small file problem is an issue Inquidia Consulting frequently sees on Hadoop projects. There are a variety of reasons why companies may have small files in Hadoop, including:

  • Companies are increasingly hungry for data to be available near real time, causing Hadoop ingestion processes to run every hour/day/week with only, say, 10MB of new data generated per period.
  • The source system generates thousands of small files which are copied directly into Hadoop without modification.
  • The configuration of MapReduce jobs using more than the necessary number of reducers, each outputting its own file. Along the same lines, if there is a skew in the data that causes the majority of the data to go to one reducer, then the remaining reducers will process very little data and produce small output files.

Does it sound like you have small files? If so, this series by Chris is what you are looking for.

Comments are closed.