Dynamic Length Factorization Machines for CTR Prediction

Dec 15, 2021

Ad click-through rate prediction (pCTR) is one of the core tasks of online advertising. Driving the pCTR models of Yahoo Gemini native advertising is OffSet, a feature-enhanced, collaborative-filtering-based event prediction algorithm. Due to data sparsity issues, OffSet models both users and items by mapping their features into a latent space, where the resulting user vector is a non-linear function of the user feature vectors (e.g., age, gender, and hour), which allows pairwise dependencies.
The concept of pairwise dependencies is also used by other algorithms, such as Field-aware Factorization Machines (FFM). However, in both OffSet and FFM, the different pairwise interactions are modeled by latent vectors of constant and equal lengths.
When prediction models are used online to serve real traffic, where the total serving model size is often limited, a non-uniform representation of the pairwise interactions should be considered in order to maximize the accuracy of the model while consuming the same or even less space.
In this work we present a Dynamic Length Factorization Machines (DLFM) algorithm that dynamically optimizes the length of the vectors for each feature interaction during training, while not exceeding a maximal overall latent vector size.
After showing a 1.46% revenue lift and a 2.15% CTR lift online while serving Gemini native traffic, DLFM was pushed into production.
Since being integrated into production, DLFM has not only improved the accuracy of the model by optimizing the length of each latent space, but has also reduced the total size of the model by 25%.
Although the algorithm was applied to OffSet, we show that DLFM can be applied to any FFM-like algorithm to optimize its pairwise feature vector lengths.
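As a rough illustration of the non-uniform representation described above, the following sketch scores an FFM-like model in which every field pair has its own latent vector length, subject to an overall length budget. It is a minimal sketch under stated assumptions, not the DLFM algorithm itself: the field names, the lengths, the budget, the example vectors, and the logistic link are all hypothetical, and the lengths are fixed by hand here rather than optimized during training.

```python
import math

# Hypothetical per-pair latent lengths (illustrative values, not learned ones).
PAIR_LENGTHS = {("age", "gender"): 4, ("age", "hour"): 2, ("gender", "hour"): 1}
MAX_TOTAL_LENGTH = 8  # assumed overall budget across all pairs

# For each pair (f, g): the vector of f used against g, and of g used against f.
EXAMPLE_VECTORS = {
    ("age", "gender"): ([0.1, 0.2, 0.3, 0.4], [0.5, 0.5, 0.5, 0.5]),
    ("age", "hour"): ([1.0, -1.0], [0.5, 0.5]),
    ("gender", "hour"): ([2.0], [0.25]),
}


def within_budget(pair_lengths, max_total):
    """Return True when the summed pair lengths respect the overall budget."""
    return sum(pair_lengths.values()) <= max_total


def pairwise_score(vectors, pair_lengths):
    """Sum dot products over field pairs, each pair using its own length."""
    total = 0.0
    for pair, length in pair_lengths.items():
        v_f, v_g = vectors[pair]
        if len(v_f) != length or len(v_g) != length:
            raise ValueError(f"vectors for {pair} must have length {length}")
        total += sum(a * b for a, b in zip(v_f, v_g))
    return total


def predict_ctr(vectors, pair_lengths, bias=0.0):
    """Map the pairwise score to a probability with a logistic link (assumed)."""
    score = bias + pairwise_score(vectors, pair_lengths)
    return 1.0 / (1.0 + math.exp(-score))
```

With equal lengths for every pair, this reduces to the usual constant-length FFM-style interaction term. The point of the sketch is only that a pair such as `("gender", "hour")` can be given a shorter vector than `("age", "gender")` while the total stays within `MAX_TOTAL_LENGTH`.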
We also present Educated Model Initialization, a novel mechanism for initializing a new model based on an existing model that shares some of its user features. Using this mechanism, we managed to reduce the training time of our models by more than 90% compared to an equivalent model trained from "scratch".
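The abstract does not spell out how Educated Model Initialization works, so the following is only one plausible reading, sketched under explicit assumptions: latent vectors for features that also exist in the old model are copied over, and features new to the model start from small random values. The function name `educated_init`, the truncate-or-zero-pad rule for mismatched lengths, and the Gaussian initialization scale are all hypothetical.

```python
import random


def educated_init(old_model, new_features, vector_length, seed=0):
    """Build initial latent vectors for a new model from an existing one.

    Features present in old_model reuse its trained vector, truncated or
    zero-padded to vector_length (an assumed rule). All other features
    start from small random values.
    """
    rng = random.Random(seed)
    new_model = {}
    for feature in new_features:
        if feature in old_model:
            vec = list(old_model[feature])[:vector_length]  # copy, then truncate
            vec += [0.0] * (vector_length - len(vec))       # zero-pad if short
        else:
            vec = [rng.gauss(0.0, 0.01) for _ in range(vector_length)]
        new_model[feature] = vec
    return new_model


# Hypothetical usage: "age" and "gender" are shared, "device" is new.
old = {"age": [0.3, -0.2, 0.1], "gender": [0.5]}
new = educated_init(old, ["age", "gender", "device"], vector_length=2)
```

Starting the shared features from trained values rather than from noise is the kind of warm start that could explain a shorter training time, although the actual mechanism and the reported reduction of more than 90% belong to the paper, not to this sketch.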

  • IEEE International Conference On Big Data (BigData 2021)
  • Conference/Workshop Paper