An overview of Baidu’s industrial-scale GPU training architecture


Like its US counterpart, Google, Baidu has made significant investments to build robust systems at scale to support global ad programs. As one can imagine, AI / ML has played a central role in the way these systems are built. Massive GPU-accelerated clusters on par with the world’s most powerful supercomputers are the norm, and advancements in AI efficiency and performance are paramount.

Baidu’s history with supercomputer-type systems for advertising dates back over a decade, when the company used a new distributed regression model to determine the click-through rate (CTR) of advertising campaigns. These MPI-based algorithms gave way to nearest neighbor and maximum inner product search just before AI / ML took off, and have now evolved to include ultra-compressed CTR models running on large pools of GPU resources.

Weijie Zhao of Baidu Research’s Cognitive Computing Lab emphasizes the phrase “industrial scale” when talking about Baidu’s infrastructure to support advertising, noting that the training data sets are at petabyte scale with hundreds of billions of training instances. The size of the models can reach more than ten terabytes. Compute and network efficiency matter, but with models this large storage definitely becomes a bottleneck, necessitating new and serious compression / quantization.

“It is not trivial to deploy these models on an industrial scale with hundreds to trillions of input features, which, under budgetary constraints, can pose fundamental problems for storage, communication, and training speed,” he says. Their training systems, which are now loaded with GPUs, were once fully CPU-based, but with large-scale MPI “they consume a large amount of communication and compute to remain fault-tolerant and in sync, which means substantial costs for cluster maintenance and energy consumption.”

What’s interesting about Baidu’s CTR advertising segment is that it’s not possible to trade precision for performance / efficiency. This limits the options for creating robust training systems that aren’t incredibly expensive to run and that don’t stumble on data movement.

What’s also interesting is the node architecture Baidu selected for its large-scale training runs and the clear tradeoffs they make in terms of network, storage, and compute performance.

HBM plays an active role in the training process, with all parameters stored there. Using full-scale GPU communication, updates are synchronized across the entire supercomputer. Main memory is used to manage the cache and to hold any parameters that do not fit in GPU memory. “Compared with a multi-node system, the advantages of single-node system with multi-GPU include low price, low communication cost, much less synchronization and low failure rate,” explains Weijie Zhao.

“This design simply eliminates our concerns about the communication and synchronization costs of MPI clusters and significantly reduces IT cluster maintenance expenses and power consumption. The storage pressure of the embedding table, which is larger than 10TB, is handled by three levels of hardware structure, namely SSDs (solid state drives), main memory and GPU memories. Different from the existing system, which distributes training tasks to both CPUs and GPUs, all training tasks are distributed only to multiple GPUs, further reducing the complexity of the system.”
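The three-level storage idea in the quote can be pictured with a toy lookup path. The following is only a sketch under stated assumptions: the class name `TieredEmbeddingStore`, the LRU eviction policy, and the capacities are hypothetical and not taken from Baidu's system, and plain Python dictionaries stand in for GPU memory, main memory, and SSD.

```python
from collections import OrderedDict


class TieredEmbeddingStore:
    """Toy three-tier lookup: GPU memory -> main memory -> SSD.

    Hypothetical illustration only; dicts stand in for the hardware tiers
    and LRU eviction is an assumed policy.
    """

    def __init__(self, ssd_table, gpu_capacity, mem_capacity):
        self.ssd = ssd_table          # full table, slowest tier
        self.mem = OrderedDict()      # main-memory cache, LRU order
        self.gpu = OrderedDict()      # GPU-memory cache, LRU order
        self.gpu_capacity = gpu_capacity
        self.mem_capacity = mem_capacity
        self.hits = {"gpu": 0, "mem": 0, "ssd": 0}

    def _put(self, cache, capacity, key, value, spill):
        cache[key] = value
        cache.move_to_end(key)
        if len(cache) > capacity:
            old_key, old_value = cache.popitem(last=False)  # evict LRU entry
            spill(old_key, old_value)

    def _put_mem(self, key, value):
        # Entries evicted from main memory are written back to the SSD tier.
        self._put(self.mem, self.mem_capacity, key, value,
                  self.ssd.__setitem__)

    def _put_gpu(self, key, value):
        # Entries evicted from GPU memory fall back to main memory.
        self._put(self.gpu, self.gpu_capacity, key, value, self._put_mem)

    def lookup(self, key):
        if key in self.gpu:
            self.hits["gpu"] += 1
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.mem:
            self.hits["mem"] += 1
            value = self.mem.pop(key)
        else:
            self.hits["ssd"] += 1
            value = self.ssd[key]
        self._put_gpu(key, value)     # promote to the fastest tier
        return value


store = TieredEmbeddingStore({i: [float(i)] for i in range(10)},
                             gpu_capacity=2, mem_capacity=3)
for key in [0, 1, 0, 2, 1]:
    store.lookup(key)
print(store.hits)  # {'gpu': 1, 'mem': 1, 'ssd': 3}
```

The point of the sketch is that frequently used parameters stay in the fastest tier while the full table only has to fit on the slowest one.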

“Our old CPU-only training system performed distributed training and was then upgraded to the current single-node GPU training. The distributed training system consisted of 150 CPU-only compute nodes, each with a 16-core processor and 180 GB of memory, while the GPU compute node in the current training system uses more expensive hardware: more memory (1TB), SSDs, and GPUs. The cost of a GPU compute node is about 15 times that of a CPU-only node. However, we only need a single GPU compute node to complete the work of the 150-node CPU-only cluster: the current single-node system costs much less (one tenth) than the 150-node CPU-only cluster.”
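The "one tenth" figure follows directly from the numbers in the quote. A quick check, assuming a normalized unit cost of 1.0 per CPU-only node (the only value here that is not from the quote):

```python
CPU_NODE_COST = 1.0         # normalized cost of one CPU-only node (assumption)
CPU_NODES = 150             # size of the old cluster, from the quote
GPU_NODE_COST_FACTOR = 15   # one GPU node costs about 15x a CPU node, from the quote
GPU_NODES = 1               # the current system is a single node

cpu_cluster_cost = CPU_NODES * CPU_NODE_COST
gpu_system_cost = GPU_NODES * GPU_NODE_COST_FACTOR * CPU_NODE_COST
cost_ratio = gpu_system_cost / cpu_cluster_cost

# Aggregate main memory of the old cluster, for comparison with the
# 1TB of main memory in the single GPU node.
cpu_cluster_memory_gb = CPU_NODES * 180

print(f"GPU system cost relative to CPU cluster: {cost_ratio:.2f}")  # 0.10
print(f"CPU cluster aggregate memory: {cpu_cluster_memory_gb} GB")   # 27000 GB
```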

A sense of scale for each layer in CTR models.

Weijie Zhao says that within the CTR Group, just over half a billion training instances are generated daily. He says that with the current GPU approach with strict quantization, training takes only 2.5 hours for this sample size, with the delay caused by page viewing, online updating of input features, and online updating of the model not exceeding 3 hours, while the CPU-only multi-node system took 10 hours.
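For a rough sense of what those training times mean in throughput, here is a back-of-the-envelope calculation. It assumes exactly 500 million instances per day, which is a lower bound on "just over half a billion"; the resulting rates are estimates, not reported figures.

```python
DAILY_INSTANCES = 500_000_000  # lower bound on "just over half a billion" (assumption)
GPU_HOURS = 2.5                # single-node GPU training time, from the article
CPU_HOURS = 10.0               # CPU-only multi-node training time, from the article


def instances_per_second(instances, hours):
    """Average training throughput over the whole run."""
    return instances / (hours * 3600)


speedup = CPU_HOURS / GPU_HOURS
gpu_rate = instances_per_second(DAILY_INSTANCES, GPU_HOURS)
cpu_rate = instances_per_second(DAILY_INSTANCES, CPU_HOURS)

print(f"speedup: {speedup:.1f}x")                         # 4.0x
print(f"GPU node:    ~{gpu_rate:,.0f} instances/second")  # ~55,556
print(f"CPU cluster: ~{cpu_rate:,.0f} instances/second")  # ~13,889
```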

This node architecture is remarkable, but it does not fully meet storage demands. Closing the gap took serious quantization work from Baidu Research. Their approach halved storage usage relative to what it was before the effort and reduced it by 75% compared with using standard 32-bit precision.

“Our work also reveals that, unlike many academic studies of 1 or 2-bit quantization-based learning systems, industrial-scale production systems may need many more bits, for example, 16 bits in our system. This quantization step allows us to double the dimensionality of the embedding layer without increasing the storage. We deployed this system into production and observed a substantial increase in prediction accuracy and revenue.”
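The claim about doubling the embedding dimension follows from simple storage arithmetic: table size is rows times dimension times bits per value, so halving the bits leaves room for twice the dimension. The sketch below shows that arithmetic along with a generic uniform 16-bit fixed-point quantizer. The row count, dimensions, scale, and function names are illustrative assumptions, and the quantizer is a textbook scheme, not necessarily the one Baidu uses.

```python
def table_bytes(rows, dim, bits):
    """Storage needed for an embedding table with `bits` per value."""
    return rows * dim * bits // 8


def quantize16(values, scale):
    """Uniform fixed-point quantization to signed 16-bit integer codes."""
    codes = []
    for v in values:
        q = round(v / scale)
        codes.append(max(-32768, min(32767, q)))  # clip to the int16 range
    return codes


def dequantize16(codes, scale):
    """Map 16-bit codes back to approximate real values."""
    return [c * scale for c in codes]


# Hypothetical table: one billion rows, 64 dimensions at 32 bits
# versus 128 dimensions at 16 bits.
full_precision = table_bytes(10**9, 64, 32)
quantized_double_dim = table_bytes(10**9, 128, 16)
print(full_precision == quantized_double_dim)  # True

scale = 2 ** -10  # assumed quantization step
codes = quantize16([0.5, -0.25, 0.1], scale)
print(codes)      # [512, -256, 102]
```

With round-to-nearest, the reconstruction error of any value inside the representable range is at most half the step size, which is one reason 16 bits can preserve accuracy where 1 or 2 bits would not.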
