Automatic Spark Tuning For Low Waste Batch Processing
June 4, 2024
Quantcast Computer Science/Mathematics, 2023–24
Liaison(s): Theo Bayard de Volo PZ ’22, Scott McCoy
Advisor(s): Mark Kampe
Student(s): Tesfa Asmara (TL-S), Liam Martin (TL-F), Teja Reddy, Jimmy Chen, Jaime Pacheco
Quantcast is an American technology company, founded in 2006, that specializes in AI-driven real-time advertising, audience insights, and measurement. Many of Quantcast’s data workflows run atop Apache Spark. Although Spark has many built-in optimizations, Quantcast has noticed that the clusters running these workflows leave a significant portion of their memory and processors unused. The goal of this project was to develop a system extension that automatically tunes Spark configurations. We developed a Spark plugin that captures critical statistics previously unavailable directly from Spark, such as memory and CPU utilization. The team has also been training linear regression and decision tree models on simulated data sets to recommend more efficient cluster configurations for Spark jobs. Next steps involve assessing the models’ predictive accuracy and reliability by applying them to real data collected from Quantcast’s Spark jobs through our plugin.
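As a rough illustration of the modeling idea described above, the sketch below fits a one-variable linear regression to simulated job data and turns the prediction into a memory recommendation. It is not the team's model: the single feature (input size), the coefficients of the simulated relationship, the 20% headroom factor, and all function names are hypothetical, chosen only to show the shape of the approach in plain Python.

```python
import random


def simulate_jobs(n, seed=0):
    """Generate (input_gb, peak_memory_gb) pairs.

    The linear law (2.0 + 0.5 * input_gb) and the noise level are
    invented for illustration; they do not come from real Spark jobs.
    """
    rng = random.Random(seed)
    jobs = []
    for _ in range(n):
        input_gb = rng.uniform(1.0, 100.0)
        peak_gb = 2.0 + 0.5 * input_gb + rng.gauss(0.0, 0.5)
        jobs.append((input_gb, peak_gb))
    return jobs


def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def recommend_memory_gb(intercept, slope, input_gb, headroom=1.2):
    """Predicted peak memory scaled by a safety margin.

    The headroom factor is an assumption, not a measured value.
    """
    return headroom * (intercept + slope * input_gb)


jobs = simulate_jobs(200)
intercept, slope = fit_line(jobs)
recommendation = recommend_memory_gb(intercept, slope, input_gb=40.0)
print(f"fitted: peak_gb = {intercept:.2f} + {slope:.2f} * input_gb")
print(f"recommended memory for a 40 GB job: {recommendation:.1f} GB")
```

In practice, the observations would come from utilization statistics such as those the plugin captures rather than from a simulator, and a recommendation like this one would be mapped onto a Spark setting such as `spark.executor.memory`.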