Meta’s advertising business leverages large-scale machine learning (ML) recommendation models that power millions of ad recommendations per second across Meta’s family of apps. Maintaining the reliability of these ML systems helps ensure the highest level of service and uninterrupted value delivery to our users and advertisers. To minimize disruptions and ensure our ML systems are intrinsically resilient, we have built a comprehensive set of prediction robustness solutions that ensure stability without compromising the performance or availability of our ML systems.
Why is machine learning robustness difficult?
ML prediction stability has many unique characteristics that make it more complex to address than stability challenges in traditional online services:
- ML models are stochastic by nature. Prediction uncertainty is inherent, which makes it difficult to define, identify, diagnose, reproduce, and debug prediction quality issues.
- Constant and frequent refreshing of models and features. ML models and features are continuously updated to learn from and reflect people’s interests, which makes it challenging to locate prediction quality issues, contain their impact, and quickly resolve them.
- Blurred line between reliability and performance. In traditional online services, reliability issues are easier to detect based on service metrics such as latency and availability. ML prediction stability issues, however, manifest as shifts in prediction quality, which are harder to distinguish. For example, an “available” ML recommender system that reliably produces inaccurate predictions is actually “unreliable.”
- Cumulative effect of small distribution shifts over time. Due to the stochastic nature of ML models, small regressions in prediction quality are hard to distinguish from the anticipated organic traffic-pattern changes. However, if undetected, such small prediction regressions could have a significant cumulative negative impact over time.
- Long chain of complex interactions. The final ML prediction is derived from a complex chain of processing and propagation across multiple ML systems. A regression in prediction quality can originate several hops upstream in the chain, making it hard to diagnose and to attribute to a specific ML system.
- Small fluctuations can amplify to become big impacts. Even small changes in the input data (e.g., features, training data, and model hyperparameters) can have a significant and unpredictable impact on the final predictions. This poses a major challenge in containing prediction quality issues at particular ML artifacts (model, feature, label), and it requires end-to-end global protection.
- Rising complexity with rapid modeling innovations. Meta’s ML technologies are evolving rapidly, with increasingly larger and more complex models and new system architectures. This requires prediction robustness solutions to evolve at the same fast pace.
Meta’s approach and progress towards prediction robustness
Meta has developed a systematic framework to build prediction robustness. This framework includes a set of prevention guardrails that build control from the outside in, fundamental understanding of issues to gain ML insights, and a set of technical fortifications to establish intrinsic robustness.
These three approaches are exercised across models, features, training data, calibration, and interpretability to ensure all possible issues are covered throughout the ML ecosystem. With prediction robustness, Meta’s ML systems are robust by design, and any stability issues are actively monitored and resolved to ensure smooth ads delivery for our users and advertisers.
Our prediction robustness solution systematically covers all areas of the recommender system – training data, features, models, calibration, and interpretability.
Model robustness
Model robustness challenges include model snapshot quality, model snapshot freshness, and inference availability. We use Snapshot Validator, an internal real-time, scalable, and low-latency model evaluation system, as the prevention guardrail on the quality of every single model snapshot before it ever serves production traffic.
Snapshot Validator runs evaluations with holdout datasets on newly published model snapshots in real time, and it determines whether each new snapshot can serve production traffic. Snapshot Validator has reduced model snapshot corruption by 74% over the past two years. It has protected >90% of Meta ads ranking models in production without slowing Meta’s real-time model refresh.
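To make the gating idea concrete, here is a minimal sketch of the general snapshot-gating pattern in PyTorch. The function names, the binary cross-entropy metric, and the 2% regression tolerance are illustrative assumptions, not details of Snapshot Validator itself:

```python
import torch
import torch.nn as nn

# Minimal sketch of snapshot gating, assuming a PyTorch model and a holdout
# DataLoader. All names and thresholds are illustrative assumptions.

def evaluate_snapshot(model: nn.Module, holdout: torch.utils.data.DataLoader) -> float:
    """Return the mean binary cross-entropy of a snapshot on a holdout set."""
    criterion = nn.BCEWithLogitsLoss(reduction="sum")
    model.eval()
    total_loss, total_examples = 0.0, 0
    with torch.no_grad():
        for features, labels in holdout:
            logits = model(features).squeeze(-1)
            total_loss += criterion(logits, labels).item()
            total_examples += labels.numel()
    return total_loss / max(total_examples, 1)

def admit_snapshot(candidate_loss: float, serving_loss: float,
                   tolerance: float = 0.02) -> bool:
    """Gate a newly published snapshot: admit it only if its holdout loss does
    not regress more than `tolerance` (relative) vs. the serving snapshot."""
    return candidate_loss <= serving_loss * (1.0 + tolerance)
```

A gate like this runs before a snapshot receives any production traffic, so a corrupted snapshot is rejected up front rather than rolled back after the fact.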
In addition, Meta engineers have built new ML techniques to improve the intrinsic robustness of models, such as pruning less-useful modules inside models, improving model generalization against overfitting, developing more effective quantization algorithms, and ensuring models remain performant even with a small amount of input data anomalies. Together these techniques have improved ads ML model stability, making the models resilient against overfitting, loss divergence, and more.
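As one illustration of resilience to input anomalies, the following sketch skips updates from batches whose loss is non-finite or spikes far above the recent average, so a few corrupted inputs cannot derail training. The window size and spike factor are assumptions chosen for illustration, not Meta’s actual technique:

```python
import math

# Illustrative sketch: drop anomalous batches instead of learning from them.
# The spike_factor and window values are hypothetical.

def robust_training_step(model, optimizer, criterion, features, labels,
                         loss_history, spike_factor=5.0, window=100):
    """Run one training step; return the loss, or None if the batch was skipped."""
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss_value = loss.item()
    recent_avg = sum(loss_history) / len(loss_history) if loss_history else None
    if not math.isfinite(loss_value) or (
            recent_avg is not None and loss_value > spike_factor * recent_avg):
        return None  # drop the suspect batch instead of learning from it
    loss.backward()
    optimizer.step()
    loss_history.append(loss_value)
    del loss_history[:-window]  # keep a bounded trailing window
    return loss_value
```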
Feature robustness
Feature robustness focuses on guaranteeing the quality of ML features across coverage, data distribution, freshness, and training-inference consistency. As prevention guardrails, robust feature monitoring systems run in production to continuously detect anomalies in ML features. Because ML feature-value distributions can shift widely, with non-deterministic effects on model performance, the anomaly detection systems are tuned to the particular traffic and ML prediction patterns to stay accurate.
Upon detection, automated preventive measures kick in to ensure abnormal features are not used in production. Furthermore, a real-time feature importance evaluation system has been built to provide a fundamental understanding of the correlation between feature quality and model prediction quality.
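A simple form of such distribution-aware anomaly detection can be sketched with feature coverage plus a population stability index (PSI); the metric choice and the thresholds below are illustrative assumptions rather than the production logic:

```python
import numpy as np

# Hedged sketch of feature anomaly detection via coverage and PSI.
# Metrics and thresholds are hypothetical.

def coverage(values: np.ndarray) -> float:
    """Fraction of examples where the feature is present (non-NaN)."""
    return float(np.mean(~np.isnan(values)))

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population stability index between a baseline and a current sample."""
    baseline = baseline[~np.isnan(baseline)]
    current = current[~np.isnan(current)]
    edges = np.quantile(baseline, np.linspace(0.0, 1.0, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    edges = np.unique(edges)  # guard against duplicate quantile edges
    b_frac = np.histogram(baseline, bins=edges)[0] / max(len(baseline), 1)
    c_frac = np.histogram(current, bins=edges)[0] / max(len(current), 1)
    b_frac = np.clip(b_frac, 1e-6, None)
    c_frac = np.clip(c_frac, 1e-6, None)
    return float(np.sum((c_frac - b_frac) * np.log(c_frac / b_frac)))

def feature_is_anomalous(baseline, current, min_coverage=0.95, max_psi=0.2):
    """Flag a feature whose coverage drops sharply or whose distribution shifts."""
    coverage_drop = coverage(current) < min_coverage * coverage(baseline)
    return coverage_drop or psi(baseline, current) > max_psi
```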
Together, these solutions have effectively contained ML feature issues at Meta, including coverage drops, data corruption, and training-inference inconsistency.
Training data robustness
The wide spectrum of Meta ads products requires distinct labeling logic for model training, which significantly increases the complexity of labeling. In addition, the data sources for label calculation can be unstable due to complicated logging infrastructure and organic traffic drift. Dedicated training-data-quality systems were built as prevention guardrails to detect label drift over time with high accuracy, swiftly and automatically mitigate abnormal data changes, and prevent models from learning from the affected training data.
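As a hedged sketch of label-drift detection, one could track the positive-label rate of each new training partition against a trailing window and quarantine outliers so the trainer never consumes them; the z-score test, window length, and threshold here are assumptions for illustration:

```python
from collections import deque

# Illustrative sketch of label-drift detection; not the actual system.

def label_rate_is_anomalous(history, current_rate, z_threshold=4.0):
    """Z-score drift test of the current label rate against recent history."""
    if len(history) < 10:
        return False  # not enough history to judge reliably
    mean = sum(history) / len(history)
    variance = sum((r - mean) ** 2 for r in history) / len(history)
    std = max(variance ** 0.5, 1e-6)
    return abs(current_rate - mean) / std > z_threshold

label_rate_history = deque(maxlen=168)  # e.g., one week of hourly label rates
```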
Additionally, fundamental understanding of training data label consistency has resulted in optimizations in training data generation for better model learning.
Calibration robustness
Calibration robustness builds real-time monitoring and auto-mitigation toolsets to guarantee that the final prediction is well calibrated, which is vital for advertiser experiences. The calibration mechanism is technically unique because it relies on real-time model training over unjoined data, which makes it more sensitive to traffic distribution shifts than joined-data training.
To improve the stability and accuracy of calibration, Meta has built prevention guardrails that consist of high-precision alert systems to minimize problem-detection time, as well as rigorous, automatically orchestrated mitigations to minimize problem-mitigation time.
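The core signal behind such guardrails can be illustrated with a calibration ratio, the mean predicted probability over the observed positive rate, which should stay near 1.0 for a well-calibrated model. The tolerance band and mitigation hook below are hypothetical illustrations, not the production guardrail:

```python
# Hedged sketch of calibration monitoring; thresholds and hooks are assumed.

def calibration_ratio(predicted_probs, labels):
    """Mean predicted probability divided by the observed positive rate."""
    observed = sum(labels) / max(len(labels), 1)
    predicted = sum(predicted_probs) / max(len(predicted_probs), 1)
    return predicted / max(observed, 1e-9)

def trigger_mitigation(ratio):
    # Hypothetical hook; production systems might page on-call, roll back a
    # snapshot, or fall back to a safe calibration layer.
    print(f"calibration out of band: ratio={ratio:.3f}")

def check_calibration(predicted_probs, labels, low=0.9, high=1.1):
    """Alert and auto-mitigate when calibration leaves the tolerance band."""
    ratio = calibration_ratio(predicted_probs, labels)
    if not low <= ratio <= high:
        trigger_mitigation(ratio)
```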
ML interpretability
ML interpretability focuses on identifying the root causes of ML instability issues. Hawkeye, our internal AI debugging toolkit, allows engineers at Meta to root-cause tricky ML prediction problems. Hawkeye provides an end-to-end, streamlined diagnostic experience that today covers >80% of ads ML artifacts, and it is now one of the most widely used tools in the Meta ML engineering community.
Beyond debugging, ML interpretability invests heavily in understanding models’ internal states – one of the most complex and technically challenging areas in the realm of ML stability. There are no standardized solutions to this challenge, but Meta uses model graph tracing, which analyzes internal model states such as activations and neuron importance, to accurately explain why models become corrupted.
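The underlying technique, capturing per-module activation statistics during a forward pass, can be sketched with standard PyTorch forward hooks. This illustrates the general idea only, not Meta’s internal tooling:

```python
import torch
import torch.nn as nn

# Minimal sketch of activation tracing with forward hooks.

def capture_activations(model: nn.Module, example_input: torch.Tensor) -> dict:
    """Run one forward pass and record per-module output statistics that can
    reveal corrupted submodules (e.g., NaN/inf or exploding activations)."""
    stats, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                stats[name] = {
                    "mean": output.float().mean().item(),
                    "abs_max": output.float().abs().max().item(),
                    "nan_frac": output.isnan().float().mean().item(),
                }
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module itself
            hooks.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(example_input)
    for h in hooks:
        h.remove()
    return stats

# Example: flag layers whose activations contain NaNs.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
bad_layers = {n: s for n, s in capture_activations(model, torch.randn(4, 8)).items()
              if s["nan_frac"] > 0}
```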
Altogether, advancements in ML interpretability have reduced the time to root-cause ML prediction issues by 50% and have significantly deepened our fundamental understanding of model behaviors.
Improving ranking and productivity with prediction robustness
Going forward, we’ll be extending our prediction robustness solutions to improve ML ranking performance and to boost engineering productivity by accelerating ML development.
Prediction robustness techniques can boost ML performance by making models more robust intrinsically, with more stable training, less normalized entropy explosion or loss divergence, more resilience to data shift, and stronger generalizability. We’ve seen performance gains from applying robustness techniques like gradient clipping and more robust quantization algorithms. And we will continue to identify more systematic improvement opportunities with model understanding techniques.
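Gradient clipping in particular is straightforward to show. A minimal PyTorch training step might look like the following, with the max-norm value as an illustrative assumption:

```python
from torch.nn.utils import clip_grad_norm_

# The gradient-clipping technique mentioned above, as a minimal training step.

def clipped_training_step(model, optimizer, criterion, features, labels,
                          max_grad_norm=1.0):
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    # Rescale gradients so their global norm never exceeds max_grad_norm,
    # preventing one anomalous batch from destabilizing training.
    clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```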
In addition, model performance will be improved with less staleness and stronger consistency between serving and training environments across labels, features, inference platform, and more. We plan to continue upgrading Meta’s ads ML services with stronger guarantees of training-serving consistency and more aggressive staleness SLAs.
Regarding ML development productivity, prediction robustness techniques can facilitate model development, and improve daily operations by reducing the time needed to address ML prediction stability issues. We’re currently building an intelligent ML diagnostic platform that will leverage the latest ML technologies, in the context of prediction robustness, to help even engineers with little ML knowledge locate the root cause of ML stability issues within minutes.
The platform will also evaluate reliability risk continuously across the development lifecycle, minimizing delays in ML development due to reliability regressions. It will embed reliability into every ML development stage, from idea exploration all the way to online experimentation and final launches.
Acknowledgements
We would like to thank all the team members and the leadership who contributed to making the prediction robustness effort successful at Meta. Special thanks to Adwait Tumbde, Alex Gong, Animesh Dalakoti, Ashish Singh, Ashish Srivastava, Ben Dummitt, Booker Gong, David Serfass, David Thompson, Evan Poon, Girish Vaitheeswaran, Govind Kabra, Haibo Lin, Haoyan Yuan, Igor Lytvynenko, Jie Zheng, Jin Zhu, Jing Chen, Junye Wang, Kapil Gupta, Kestutis Patiejunas, Konark Gill, Lachlan Hillman, Lanlan Liu, Lu Zheng, Maggie Ma, Marios Kokkodis, Namit Gupta, Ngoc Lan Nguyen, Partha Kanuparthy, Pedro Perez de Tejada, Pratibha Udmalpet, Qiming Guo, Ram Vishnampet, Roopa Iyer, Rohit Iyer, Sam Elshamy, Sagar Chordia, Sheng Luo, Shuo Chang, Shupin Mao, Subash Sundaresan, Velavan Trichy, Weifeng Cui, Ximing Chen, Xin Zhao, Yalan Xing, Yiye Lin, Yongjun Xie, Yubin He, Yue Wang, Zewei Jiang, Santanu Kolay, Prabhakar Goyal, Neeraj Bhatia, Sandeep Pandey, Uladzimir Pashkevich, and Matt Steiner.