The paper CoreBigBench: Benchmarking Big Data Core Operations by Todor Ivanov, Ahmad Ghazal, Alain Crolotte, Pekka Kostamaa and Yoseph Ghazal will be presented at the 8th International Workshop on Testing Database Systems 2020 on June 19.
Significant effort has been put into big data benchmarking, with a focus on end-to-end applications. While such benchmarks cover basic functionalities implicitly, the contributions of individual functions to the overall performance are hidden. As a result, end-to-end benchmarks can be biased toward certain basic functions. Micro-benchmarks cover basic functionalities more explicitly, but they usually target highly specialized functions. In this paper we present CoreBigBench, a benchmark that focuses on the most common functionalities of big data engines and platforms, such as scans, two-way joins, common UDF execution, and more. These common functionalities are benchmarked over relational and key-value data models, which cover the majority of data models. The benchmark consists of 22 queries, applied to sales data and key-value web logs, that cover the basic functionalities. We ran CoreBigBench on Hive as a proof of concept, verified that the benchmark is easy to deploy, and collected performance data. Finally, we believe that CoreBigBench is a good fit for performance testing of commercial big data engines, with a focus on basic engine functionalities not covered in end-to-end benchmarks.