On 31st March 2019, a bunch of folks from Facebook released a paper talking about a state-of-the-art deep learning recommendation and personalization model (DLRM). They mentioned in the paper that deep learning for such tasks differ significantly from other deep learning networks due to their need to handle categorical features and they are not well studied or understood.
This post gives a deep dive into how this DLRM works, the institution behind it, the architecture, it’s advantages and a single jupyter notebook implementation of DLRM.
Before getting into the intuition behind it, Let’s talk about a couple of places where personalization and recommendation are currently deployed. Facebook uses recommendation engines for recommending pages you might like or groups you should join. Netflix uses a recommendation engine to suggest to you, what to watch next. Amazon uses it to suggest what else to add to the cart. There are so many internet companies that use personalization and recommendation systems to do as click-through rate (CTR) prediction. It basically means predicting if a user clicks on an ad or not. As clicking on ad translates to revenue generation, it is pivotal for the companies to choose the most probable and relevant ad to show the user for maximizing revenue.
In the paper, they picked this CTR prediction use case to benchmark their results. The dataset they picked is from a Kaggle challenge called Criteo’s Display advertising challenge. The goal of this challenge is to benchmark the most accurate ML algorithms for CTR estimation.
The Criteo AI Labs Ad Kaggle and Terabyte data sets are open-sourced data sets consisting of click logs for ad CTR prediction. Each data set contains 13 continuous and 26 categorical features. The Criteo Ad Kaggle data set contains approximately 45 million samples over 7 days. In experiments, typically the 7th day is split into validation and test set while the first 6 days are used as the training set. The Criteo Ad Terabyte data set is sampled over 24 days, where the 24th day is split into a validation and test set and the first 23 days is used as a training set.
Personalization and Recommendation systems are not new in the industry. They have had long histories. But only recently they started embracing neural networks. The authors of the paper feel that there are two primary perspectives contributed to this architectural change.
The first comes from the view of recommendation systems. These systems initially employed content filtering was a set of experts classified products into categories, while users selected their preferred categories and were matched based on their preferences. The field subsequently evolved to use collaborative filtering, where recommendations are based on past user behaviors, such as prior ratings given to products. Neighborhood methods that provide recommendations by grouping users and products together and latent factor methods that characterize users and products by certain implicit factors via matrix factorization techniques were later deployed with success.
The second view comes from predictive analytics, which relies on statistical models to classify or predict the probability of events based on the given data. Predictive models shifted from using simple models such as linear and logistic regression to models that incorporate deep networks. In order to process categorical data, these models adopted the use of embeddings, which transform the one- and multi-hot vectors into dense representations in an abstract space. This abstract space may be interpreted as the space of the latent factors found by recommendation systems.
In the paper, the authors claimed that they succeeded in unifying the two perspectives above in there DLRM model. The model uses embeddings to process sparse features that represent categorical data and a multilayer perceptron (MLP) to process dense features, then interacts these features explicitly using the statistical techniques proposed in. Finally, it finds the event probability by post-processing the interactions with another MLP.
If you are a newbie, you might wonder why all this fiasco to deal with categorical and numerical features. The answer is twofold. One, embedding are much more rich representations of data than typical on- and multi- hot vectors and can produce much prettier results and two, simple fitting a curve by taking a bunch of numerical and categorical features won’t give you accuracies. The primary reason for that would be the missing interactions among the features. Interactions are how features interact with other features. An example would be how likely a user who likes comedy and horror movies would like a horror-comedy movie. Such interactions play a major role in working of recommendation engines. So, a simple linear or logistic regression can’t be used in recommendation systems. Polynomial regression will have this interactions component but it fails to predict unseen interactions. Factorization machines (FM) incorporate second-order interactions into a linear model with categorical data by defining a model of the form
FMs are notably distinct from support vector machines (SVMs) with polynomial kernels because they factorize the second-order interaction matrix into its latent factors (or embedding vectors) as in matrix factorization, which more effectively handles sparse data. This significantly reduces the complexity of the second-order interactions by only capturing interactions between pairs of distinct embedding vectors, yielding linear computational complexity. These interactions are then fed into another Multilayer perceptron to get the click probabilities.
This section explains the high-level view of DLRM architecture. To understand it easily let’s look at the four major techniques which are involved in DLRM without much of the math.
An embedding is a mapping of a discrete — categorical — variable to a vector of continuous numbers. In the context of neural networks, embeddings are low-dimensional, learned continuous vector representations of discrete variables. Neural network embeddings are useful because they can reduce the dimensionality of categorical variables and meaningfully represent categories in the transformed space.
Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. Matrix factorization algorithms work by decomposing the user-item interaction matrix into the product of two lower dimensionality rectangular matrices.
A factorization machine is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. It is an extension of a linear model that is designed to capture interactions between features within high dimensional sparse datasets economically. For example, in a click prediction system, the factorization machine model can capture click rate patterns observed when ads from a certain ad-category are placed on pages from a certain page-category. Factorization machines are a good choice for tasks dealing with high dimensional sparse datasets, such as click prediction and item recommendation.
A multilayer perceptron (MLP) is a class of feedforward artificial neural network. An MLP consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training. Its multiple layers and non-linear activation distinguish MLP from a linear perceptron. It can distinguish data that is not linearly separable.
Now combining the above techniques, we can easily understand the DLRM architecture.
Let the users and products be described by many continuous and categorical features. To process the categorical features, each categorical feature will be represented by an embedding vector of the same dimension, generalizing the concept of latent factors used in matrix factorization. To handle the continuous features, the continuous features will be transformed by an MLP (which we call the bottom or dense MLP) which will yield a dense representation of the same length as the embedding vectors.
Computation of second-order interaction of different features is done explicitly, following the intuition for handling sparse data provided in FMs, optionally passing them through MLPs. This is done by taking the dot product between all pairs of embedding vectors and processed dense features. These dot products are concatenated with the original processed dense features and post-processed with another MLP (the top or output MLP) and fed into a sigmoid function to give a probability.
The authors evaluated the accuracy of the model on Criteo Ad Kaggle data set and compare the performance of DLRM against a Deep and Cross network (DCN) as-is without extensive tuning. Notice that in this case the models are sized to accommodate the number of features present in the data set. In particular, DLRM consists of both a bottom MLP for processing dense features consisting of three hidden layers with 512, 256 and 64 nodes, respectively and a top MLP consisting of two hidden layers with 512 and 256 nodes. On the other hand, DCN consists of six cross layers and a deep network with 512 and 256 nodes. An embedding dimension of 16 is used. Note that this yields a DLRM and DCN both with approximately 540M parameters.
Both the training (solid) and validation (dashed) accuracies are plots over a full single epoch of training for both models with SGD and Adagrad optimizers. No regularization is used. In this experiment, DLRM obtains slightly higher training and validation accuracy.
Dataset can be found in the GitHub repo provided at the bottom. Optionally you can also get it from https://labs.criteo.com/2014/09/kaggle-contest-dataset-now-available-academic-use/
DLRM is memory intensive and computation intensive.
The schematic representation of this model was discussed in detail in this Open Compute Project Summit keynote address
Preferred hardware for the Criteo dataset would be :
CPU — more than 8 cores
GPU — recommended
Ram — 64GB (anything less won’t work for criteo dataset)
Optionally the authors created two other ways of running the training if you have hardware constraints, one with randomly generated data and other with synthetic data with implicit trends. You can choose the type as a parameter in the code. Play with data_generation: “random”/ “synthetic” to generate smaller datasets and run the training loop for experimentation.
Running on Criteo dataset takes a good amount of time and I suggest to clear up any running tasks and let it train overnight. If you are running for the first time give the train.txt file path to the “raw_data_file” variable, it will run a bunch of preprocessing tasks and generate files indicating day 1 to day 7 and as a final step, it collates all the individual preprocessed files into a single mega final preprocessed file. From the next time, you can give the path to this mega file in the “processed_data_file” variable to skip the preprocessing step for any further experimentation.
I suggest you write up some utility functions to convert categorical variables into sparse features with indices and offset representation. This will come handy for your own dataset as DLRM accepts only sparse representation as input for categorical variables.
It’s a complex implementation and not very intuitive. So, I suggest you go through the code a couple of times before getting hands-on. Follow the algorithm, go through a toy example with one data point. Trust me, it helps a lot.
The paper also talks about parallelism which deals with training the model in parallel as DLRM is computation intensive. But that is out of scope for this article and not discussed.