Training deep neural networks (DNNs) is resource-intensive, time-consuming, and expensive. Despite their compute- and memory-intensive nature, DNN applications often underutilize expensive ML hardware accelerators such as GPUs. This talk will explore how to improve the utilization of ML hardware accelerators by eliminating input data processing bottlenecks and efficiently sharing resources between jobs. We will discuss the characteristics of ML input data pipelines, which motivate a new data preprocessing system architecture that disaggregates data processing from model training. I will present Cachew, a fully managed service for ML data processing built on top of TensorFlow's data loading framework, tf.data. Cachew dynamically scales distributed resources for data processing to avoid input data stalls. The service also maintains a global view of data processing across jobs, enabling it to selectively cache preprocessed datasets to maximize training throughput and improve energy efficiency across jobs.
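To make the input-pipeline setting concrete, here is a minimal sketch of a typical tf.data pipeline of the kind discussed above. The data source and transformation are illustrative placeholders, not Cachew's actual workload: the pipeline maps a preprocessing function over the input in parallel, batches the results, and prefetches so the accelerator is not starved while the CPU prepares the next batch.

```python
import tensorflow as tf

# Illustrative tf.data input pipeline (placeholder data and transform):
# preprocess in parallel -> batch -> prefetch, so batch preparation on
# the CPU overlaps with model computation on the accelerator.
dataset = (
    tf.data.Dataset.range(1000)
    .map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

first_batch = next(iter(dataset))  # tensor of shape (32,): [0, 2, 4, ...]
```

When preprocessing still cannot keep up with the accelerator, the same pipeline can be executed on remote workers via the tf.data service, which is the disaggregation point that a system like Cachew builds on.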
I will conclude by discussing further opportunities to improve the energy efficiency of DNN training, particularly in real-world settings where data dynamics require models to be retrained frequently. I will give a preview of our early work on Modyn, an open-source platform and benchmark suite for ML training on dynamic datasets, which enables researchers to explore data selection and retraining policies.