Taming the tail utilization of ads inference at Meta scale
Tail utilization is a significant system issue and a major contributor to both overload-related failures and low overall compute utilization. Tail utilization optimizations at Meta have substantially improved the capacity footprint and reliability of model serving. The failure rate, which consists mostly of timeout errors, fell by two-thirds; the compute footprint delivered 35% more work for…