Figure 3. The hidden gate layer, from the GateNet paper
All session training data
Data is the key to unleashing the full potential of personalized embeddings. Our model was originally trained on impression data from only a segment of sessions, with the understanding that such limited impressions are not sufficient to learn good representations for millions of members and hashtags.
To mitigate sparsity challenges, we incorporate data from all feed sessions. Models may favor specific items, members, and creators (e.g., popular ones or those overrepresented in the training data), and this bias can be reinforced when training on data generated by other models. We determined that we can reduce this effect by sampling our training data with an inverse propensity score (Horvitz & Thompson, 1952; Doubly Robust Policy Evaluation and Optimization):
inverse propensity score(position, response) = RandomSession-CTR(position, response) / NonRandomSession-CTR(position, response)
These weights are subsequently utilized in the cross entropy loss calculation, enabling a more accurate and balanced training process.
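As an illustration, here is a minimal TensorFlow sketch of applying such per-impression weights in a cross-entropy loss; the function names, tensor shapes, and weight plumbing are assumptions, not the production implementation.

```python
import tensorflow as tf

# Minimal sketch (assumed shapes and names): each training example carries an
# inverse propensity weight computed from the CTR ratio above.
bce = tf.keras.losses.BinaryCrossentropy(
    reduction=tf.keras.losses.Reduction.NONE)  # keep per-example losses

def weighted_cross_entropy(labels, predictions, ips_weights):
    # ips_weights[i] = RandomSession-CTR / NonRandomSession-CTR for example i.
    per_example_loss = bce(labels, predictions)
    return tf.reduce_mean(per_example_loss * ips_weights)
```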
Training scalability
Flow footprint optimization
We process data with Spark in multiple steps composed of both row-wise and column-wise manipulations such as sampling, feature joins, and record reweighting. Our original approach was to materialize the output of each step (TB-sized data) for ease of further analysis, but this didn't scale once dozens of flows were producing giant footprints in parallel. We adapted our in-house ML pipeline to output only a virtual view of the data at each step, which stores the computation needed to generate the intermediate data without materializing it, and then triggers the whole DAG right before the trainer step. This reduced the storage footprint by 100x.
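A minimal PySpark sketch of the idea, with illustrative step names and paths (not our actual pipeline code): each step returns an unmaterialized DataFrame, so Spark only records the DAG, and nothing is written until the final trainer input.

```python
from pyspark.sql import SparkSession, DataFrame, functions as F

spark = SparkSession.builder.appName("feed-training-data").getOrCreate()

# Each step returns a lazy DataFrame (a "virtual view"): Spark records the
# transformation in its DAG but does not compute or store anything yet.
def sample_sessions(df: DataFrame) -> DataFrame:
    return df.sample(fraction=0.1, seed=42)

def join_features(df: DataFrame, features: DataFrame) -> DataFrame:
    return df.join(features, on="member_id", how="left")

def reweight_records(df: DataFrame) -> DataFrame:
    return df.withColumn("weight", F.col("random_ctr") / F.col("nonrandom_ctr"))

impressions = spark.read.parquet("/data/impressions")      # illustrative path
features = spark.read.parquet("/data/member_features")     # illustrative path

# Chain the virtual views; nothing is materialized until the final write,
# which triggers the whole DAG right before the trainer step.
trainer_input = reweight_records(join_features(sample_sessions(impressions), features))
trainer_input.write.parquet("/data/trainer_input")
```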
Training speed and scalability
We solved the problem of efficient training by adopting and extending Horovod on our Kubernetes cluster, the first major use case of Horovod at LinkedIn. Horovod uses MPI and gives finer-grained control over how parameters are communicated between multiple GPUs. We used careful profiling to identify and tune the most inefficient ops, batch parameter sharing, and evaluate the tradeoffs between sparse and dense embedding communication. After these optimizations, we saw a 40x speedup and were able to train on billions of records, with hundreds of millions of parameters, on a single node in a few hours. Here are some of the optimizations with the most prominent training speedups:
- Op device placement: We identified that data transfer between GPU and CPU is time consuming for certain TensorFlow ops, so we placed the corresponding kernels on the GPU to avoid the GPU/CPU data copy, resulting in a 2x speedup.
- I/O tuning: I/O and gradient communication are huge overheads when training large models on extensive records that include historical interaction features. We tuned the I/O read buffer size, the number of reader parsing threads (which convert the on-disk format to TFRecord format), and dataset interleaving to prevent data reading from becoming a bottleneck. We also chose the proper format (sparse vs. dense) for efficient passing over the network, which further enhanced training speed by 4x (see the tf.data sketch after this list).
- Gradient accumulation: Profiling TensorFlow showed a training bottleneck in the all-reduce. So, we accumulate the gradients for each batch in-process and perform the all-reduce for multiple batches together. After accumulating five batches, we see a 33% increase in the number of training examples processed per hour (see the Horovod sketch later in this section).
By further incorporating model parallelism for DLRM-style architectures, we achieved comparable training time with the model size enlarged by 20x to 3 billion parameters. This addressed two critical challenges. First, we tackled out-of-memory failures on individual processes by sharding the embedding tables across different GPUs, ensuring efficient memory utilization and smoother, more streamlined operations. Second, we solved the latency problem associated with synchronizing gradients across processes: traditionally, the gradients of all embeddings were passed around at each gradient update step, which leads to significant delays for large embedding tables. With model parallelism, we eliminated this bottleneck, resulting in faster and more efficient model synchronization.
Moreover, the initial implementation bounded the number of usable GPUs by the number of embedding tables in the architecture. By implementing 4-D model parallelism (embedding table-wise split, row-wise split, column-wise split, and regular data parallelism), we unleashed the full computational power, empowering modelers to fully exploit the hardware without being constrained by the model's architecture. For example, with only table-wise split, an architecture with two embedding tables can leverage only 2 GPUs. By using column-wise split to partition each table into three shards along the column dimension and placing each shard on a different GPU, we can leverage all available GPUs on a six-GPU node, achieving about a 3x training speedup and reducing training time by approximately 30%.
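As a simplified illustration of column-wise splitting, the sketch below shards two embedding tables across six GPUs along the embedding dimension; the layer, vocabulary sizes, and device placement are illustrative assumptions, not the production 4-D model parallelism implementation.

```python
import tensorflow as tf

EMBEDDING_DIM = 96          # illustrative full embedding dimension
NUM_COLUMN_SHARDS = 3       # split each table into 3 column shards
SHARD_DIM = EMBEDDING_DIM // NUM_COLUMN_SHARDS

class ColumnShardedEmbedding(tf.keras.layers.Layer):
    """One embedding table split along the column (embedding) dimension,
    with each shard placed on a different GPU."""

    def __init__(self, vocab_size, devices, **kwargs):
        super().__init__(**kwargs)
        self.shards = []
        for i, device in enumerate(devices):
            with tf.device(device):
                self.shards.append(
                    self.add_weight(
                        name=f"col_shard_{i}",
                        shape=(vocab_size, SHARD_DIM),
                        initializer="uniform",
                    )
                )

    def call(self, ids):
        # Each partial lookup runs on the GPU holding that shard; concatenating
        # the parts reconstructs the full EMBEDDING_DIM vector.
        parts = [tf.nn.embedding_lookup(shard, ids) for shard in self.shards]
        return tf.concat(parts, axis=-1)

# Two tables * three column shards each = all six GPUs on one node.
member_embeddings = ColumnShardedEmbedding(
    vocab_size=1_000_000, devices=["/GPU:0", "/GPU:1", "/GPU:2"])
hashtag_embeddings = ColumnShardedEmbedding(
    vocab_size=500_000, devices=["/GPU:3", "/GPU:4", "/GPU:5"])
```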
We implemented this in TensorFlow and contributed the code to Horovod in the open-source community as an easy-to-use gradient tape. We hope this fosters collaboration and inspires further breakthroughs in the field.
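Below is a minimal sketch of the accumulate-then-all-reduce pattern using Horovod's public TensorFlow API (hvd.allreduce); the model, optimizer, feature dimension, and accumulation loop are illustrative assumptions, not the gradient tape implementation we contributed.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
# Standard Horovod setup: pin each worker process to one GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder model with an assumed 16-dimensional input; not our architecture.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(16,))])
optimizer = tf.keras.optimizers.Adam(0.001 * hvd.size())
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)

ACCUM_STEPS = 5  # all-reduce once every 5 batches instead of every batch

def train(dataset):
    accumulated = [tf.zeros_like(v) for v in model.trainable_variables]
    for step, (features, labels) in enumerate(dataset):
        with tf.GradientTape() as tape:
            loss = loss_fn(labels, model(features, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        # Accumulate gradients locally; no cross-worker communication yet.
        accumulated = [a + g for a, g in zip(accumulated, grads)]
        if (step + 1) % ACCUM_STEPS == 0:
            # One all-reduce covers five batches' worth of gradients.
            averaged = [hvd.allreduce(a / ACCUM_STEPS) for a in accumulated]
            optimizer.apply_gradients(zip(averaged, model.trainable_variables))
            accumulated = [tf.zeros_like(v) for v in model.trainable_variables]
```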
Serving scalability
External serving vs in-memory serving
In this multi-quarter effort, we progressively developed larger models. Memory on the serving hosts was initially a bottleneck for serving multiple giant models in parallel. To deliver relevance gains to members earlier under the existing infrastructure, we externalized these features: we partitioned the trained model graph, precomputed the embeddings offline, and stored them in an efficient key-value store for online fetching. As a result, the parameters hosted in the service were limited to only the MLP layers. However, this serving strategy limited us in several ways:
- Iteration flexibility: Individual modelers could not train their MLP jointly with the embedding tables because they were consuming the ID embeddings as features.
- Feature fidelity: Features are precomputed offline on a daily basis and pushed to the online store, causing potential delays in feature delivery.
While this was not the final goal we envisioned, it worked because the member and hashtag ID features are relatively long-lasting.
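For illustration, here is a rough sketch of this split: an offline job runs the embedding sub-graph over all known IDs and writes the vectors to a key-value store, while the online service hosts only the MLP and fetches embeddings at request time. The model paths, the kv_store client, and the key format are hypothetical stand-ins, not our actual serving stack.

```python
import numpy as np
import tensorflow as tf

# Offline job (sketch): run the embedding sub-graph over all known IDs and
# push the vectors to a key-value store. Paths and `kv_store` are hypothetical.
embedding_subgraph = tf.keras.models.load_model("/models/embedding_subgraph")

def precompute_member_embeddings(member_ids, kv_store):
    ids = tf.data.Dataset.from_tensor_slices(member_ids).batch(4096)
    for batch in ids:
        vectors = embedding_subgraph(batch, training=False).numpy()
        for member_id, vector in zip(batch.numpy(), vectors):
            kv_store.put(key=f"member:{member_id}", value=vector.tobytes())

# Online service (sketch): only the MLP layers are hosted in memory; the
# precomputed embedding is fetched from the store per request.
mlp_head = tf.keras.models.load_model("/models/mlp_head")

def score(member_id, other_features, kv_store):
    raw = kv_store.get(f"member:{member_id}")
    embedding = np.frombuffer(raw, dtype=np.float32)[np.newaxis, :]
    return mlp_head([embedding, other_features], training=False)
```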