What Do Real-Life Hadoop Workloads Look Like? by Yanpei Chen.
From the post:
Organizations in diverse industries have adopted Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.
Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces. These traces come from Cloudera customers in e-commerce, telecommunications, media, and retail (Table 1). Here I will explain a subset of the observations, and the thoughts they triggered about challenges and opportunities in the Hadoop ecosystem, both present and in the future.
The analysis is specific (and useful) to Hadoop installations, but I suspect the approach is more broadly useful for semantic processing in general.
Questions like:
- What topics are “hot spots” of merging activity?
- Where do those topics originate?
- How do changes in merging rules impact the merging process?
are only a few of those that may be of interest.