AICurious Logo

What is: ZeRO-Offload?

SourceZeRO-Offload: Democratizing Billion-Scale Model Training
Year2000
Data SourceCC BY-SA - https://paperswithcode.com

ZeRO-Offload is a sharded data parallel method for distributed training. It exploits both CPU memory and compute for offloading, while offering a clear path towards efficiently scaling on multiple GPUs by working with ZeRO-powered data parallelism. The symbiosis allows ZeRO-Offload to maintain a single copy of the optimizer states on the CPU memory regardless of the data parallel degree. Furthermore, it keeps the aggregate communication volume between GPU and CPU, as well as the aggregate CPU computation a constant regardless of data parallelism, allowing ZeRO-Offload to effectively utilize the linear increase in CPU compute with the increase in the data parallelism degree.