Starting at the beginning (again)
A few weeks ago, I open-sourced a personal project of mine: Xanthus. Xanthus is a Deep Learning (DL) library built on top of Tensorflow that uses the Keras API to implement various neural recommendation model architectures. It started as an exercise in translating state-of-the-art (SOTA) academic research into 'production ready' Machine Learning (ML) software. I was also interested in scientific replicability: can I re-implement a research paper from scratch and reproduce the same (or similar) results? You can read the full intro here:
As part of that post, I mentioned providing benchmarks to show the performance of the models implemented in Xanthus in comparison to other prior-SOTA approaches. That's what this post is all about. Well, that and comparing the Xanthus implementations against the results presented by He et al in their original paper.
For those of you not quite in the mood to read to the end, here are some headline stats! Based on this first round of benchmarking, the Xanthus model implementations achieve:
- A 4-7% improvement in key performance metrics over baseline models in like-for-like comparisons.
- A 5-9% improvement in key performance metrics over baseline models when using additional item metadata (an extension on the work of He et al).
- General performance that replicates the results for these DL models shown in the literature.
Sounds good, right? Well, there is one catch, but more on that later. If you'd rather just get stuck into the code, here it is:
Now, on to the main event...
Benchmarking a recommender model
So how do you go about evaluating the performance of a recommendation model? There are a few options, but this post adopts the approach outlined in He et al's work, above. Their approach goes something like this...
Say you have n users, each with k interactions (e.g. each has bought k unique items). For each of the n users, you then split their interactions into two sets: a 'training' set with k-m interactions and a 'held out' set with m interactions (where k > m). By setting m=1, you end up with a splitting policy commonly referred to in recommendation system literature as a 'leave-one-out' policy. Figure 1 shows a simple visual example of this policy applied to a single user's interactions (in this case, movie views).
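To make the split concrete, here's a minimal sketch of a leave-one-out split using pandas. The column names and toy data are assumptions for illustration only; Xanthus provides its own utilities for this step.

```python
import pandas as pd

# Toy interaction log: one row per (user, item) event.
# Schema is hypothetical, for illustration only.
interactions = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 2],
    "item": ["A", "B", "C", "B", "C", "D"],
    "timestamp": [1, 2, 3, 1, 2, 3],
})

def leave_one_out_split(df):
    """Hold out each user's most recent interaction (m=1); train on the rest."""
    # Sort so the last row per user is that user's most recent interaction.
    df = df.sort_values(["user", "timestamp"])
    held_out = df.groupby("user").tail(1)
    train = df.drop(held_out.index)
    return train, held_out

train, held_out = leave_one_out_split(interactions)
print(len(train), len(held_out))  # 4 2
```

Each user contributes exactly one row to the held-out set and k-1 rows to the training set.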
Next up, you take the 'held out' interactions and append to them a set of s negative samples. In this context, negative samples refer to items a user has not interacted with (again, in this case, movies a user has not watched). Figure 2 shows an example of the situation you'd then find yourself in for each user. Note that these are randomly sampled negative instances – the ordering of the movies is random.
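Drawing the negative samples for a single user can be sketched like this (the catalogue and item names here are hypothetical, purely for illustration):

```python
import random

# All items in the catalogue, and the items one user has interacted with.
catalogue = {"A", "B", "C", "D", "E", "F", "G", "H"}
seen = {"A", "B", "C"}
held_out_item = "C"  # the user's single held-out positive (m=1)

def sample_negatives(catalogue, seen, s, rng=random.Random(42)):
    """Draw s items the user has never interacted with."""
    candidates = sorted(catalogue - seen)  # sorted for determinism
    return rng.sample(candidates, s)

negatives = sample_negatives(catalogue, seen, s=4)
# The list the model is asked to reorder: one positive plus s negatives.
ranking_pool = [held_out_item] + negatives
print(len(ranking_pool))  # 5
```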
Why is this new setup useful? Well, the idea goes that if a recommender really does learn a user's preferences, it should learn to reorder the configuration shown in Figure 2 so that 'held out' items tend to sit higher in the list than the negative samples. The ideal case of this reordered list is shown in Figure 3. This would be the recommender's ideal suggestion given the input items.
With this in mind, there are two questions you might choose to ask to evaluate the performance of your model:
- How good is the model at keeping a user's preferred item in the top k recommended items? In other words: does it tend to select relevant items for users?
- How good is the model at keeping a user's preferred item towards the top of the top k recommended items? In other words: can it order relevant items well (i.e. place the most preferred items nearer the top)?
To evaluate these aspects, there are two pertinent metrics you might choose to use: hit ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) respectively. These are the metrics selected in He et al's work, and are the ones used here. The former measures how often a held-out item appears in the top k recommended items; the latter measures how 'high up' those items tend to appear in that top k.
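For a single user with one held-out item (the m=1 case above), the two metrics reduce to simple expressions. Here's a minimal illustration – not the Xanthus implementation:

```python
import math

def hit_ratio_at_k(ranked_items, held_out, k):
    """1 if the held-out item appears in the top-k, else 0."""
    return int(held_out in ranked_items[:k])

def ndcg_at_k(ranked_items, held_out, k):
    """With a single relevant item, NDCG reduces to 1 / log2(rank + 2),
    where rank is the item's 0-based position in the ranking."""
    if held_out in ranked_items[:k]:
        rank = ranked_items.index(held_out)
        return 1.0 / math.log2(rank + 2)
    return 0.0

ranked = ["B", "C", "A", "D"]  # a model's ordering of positive + negatives
print(hit_ratio_at_k(ranked, "C", k=3))          # 1
print(round(ndcg_at_k(ranked, "C", k=3), 3))     # 0.631
```

Averaging these per-user scores over all users gives the HR@k and NDCG@k figures reported below.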
Practically, the objective of benchmarking is going to be to evaluate these two metrics for different types of recommender models using different model configurations, and to compare these against one another. This'll give an indication of how good the recommender models are at selecting and ranking recommendations.
The benchmarking performed for this post was carried out (so far as possible) in accordance with the protocol outlined in He et al's work (see 'Evaluation Protocols' in Section 4.1 of their paper). If you'd like a more complete technical rundown of the setup, see the paper. The code used for the benchmarks is also available if you're interested – you can run it yourself if you like. Only a subset of the experiments shown in the paper are reproduced in this post, however. Additionally, note that for the purposes of this post, the evaluation was approached as an implicit recommendation problem. This assumes that there is a binary signal for user interactions: users have either interacted with or not interacted with each item (in this case: movies), and no additional information is available (e.g. user ratings). This again brings the benchmarks into alignment with the results presented in the original paper.
This post looks at evaluating five distinct types of model: two baseline models and three deep learning (DL) models implemented in Xanthus. Both baselines are built on top of the implicit library. The two baselines were:
- Bayesian Personalized Ranking (BPR) - As the name suggests, this approach uses a Bayesian approach to learning a ranking function for each user over items in a dataset. The approach relies on pairwise sampling of positive and negative examples, which allows it to learn a model that tends to rank positive items (items a user has interacted with) higher than negative ones (items a user hasn't interacted with). It is a variant of an approach known as collaborative filtering, and more specifically a form of matrix factorisation (MF) model.
- Alternating Least Squares (ALS) - Variations on this approach are regarded as some of the best 'traditional' (in general: not Deep Learning-based) approaches out there. It's an approach to MF that learns the factorised matrices by alternating: it keeps the user latent features (a factorised representation of the users) fixed while updating the item latent features (a factorised representation of the items), and vice versa. Each alternating step aims to minimise the sum of squared residuals (the objective function) – hence the name. It's fast, simple and effective. Everything you want in an ML model.
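To make the alternating idea concrete, here's a toy numpy sketch of the closed-form ALS updates for an unweighted, L2-regularised factorisation. This is an illustration of the algorithm only; implicit's production implementation uses confidence weighting and heavy optimisation.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy binary (implicit-feedback) interaction matrix: 6 users x 8 items.
R = (rng.random((6, 8)) > 0.6).astype(float)
factors, lam = 4, 0.1
U = rng.normal(scale=0.1, size=(6, factors))  # user latent features
V = rng.normal(scale=0.1, size=(8, factors))  # item latent features

def als_step(R, U, V, lam):
    """One alternating step: fix V, solve for U in closed form, then swap."""
    A = V.T @ V + lam * np.eye(V.shape[1])
    U = np.linalg.solve(A, V.T @ R.T).T   # U = R V (V'V + lam I)^-1
    A = U.T @ U + lam * np.eye(U.shape[1])
    V = np.linalg.solve(A, U.T @ R).T     # V = R' U (U'U + lam I)^-1
    return U, V

def loss(R, U, V, lam):
    """Regularised sum of squared residuals (the objective function)."""
    return np.sum((R - U @ V.T) ** 2) + lam * (np.sum(U**2) + np.sum(V**2))

before = loss(R, U, V, lam)
for _ in range(5):
    U, V = als_step(R, U, V, lam)
print(loss(R, U, V, lam) < before)  # True
```

Because each half-step exactly minimises the objective with respect to one block of variables, the loss is guaranteed not to increase from one alternating step to the next.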
The three DL models included in these benchmarks were taken from He et al's work and were:
- Generalized Matrix Factorization (GMF) - This approach is introduced in the original paper as a generalisation of the above MF methods into a neural network based representation. In effect, this allows the user to specify a model to learn factorised representations of users and items (as above), but enables them to use arbitrary loss functions and also to potentially add additional layers to enable the model to capture non-linear relationships in the dataset – a limitation of 'traditional' methods. This adds a great deal of flexibility to the already powerful 'classic' MF setup. It is structurally a form of 'two tower' model (it has two distinct 'submodels' that feed into a single 'higher level' model).
- Multilayer Perceptron (MLP) - Building on the idea introduced in the GMF approach, this approach preserves the 'two tower' structure, but replaces parts of the GMF's 'higher level' model architecture with a conventional deep feedforward network. This can be stacked as deep as you'd like, and (according to the authors) allows the model to exploit richer interactions in the target dataset.
- Neural Matrix Factorization (NMF) - This approach is an amalgam of the GMF and MLP models presented above. It is still a 'two tower' model of sorts, but in this case the two 'sub-models' are a GMF model and an MLP model respectively. The output layers from these models are concatenated and fed into a 'new' output layer. From this point of view, it's a nice way of ensembling the two aforementioned approaches. The idea here is that the model will learn to exploit the strengths of both. Simple enough!
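As a structural sketch, here's a toy numpy forward pass through an NMF-style 'two tower' scorer with randomly initialised (untrained) weights. It illustrates the architecture only – shapes and layer sizes are assumptions, and this is not the Xanthus implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
factors, mlp_units = 8, 16

# Separate embedding tables for each tower (NMF does not share embeddings).
gmf_user, gmf_item = rng.normal(size=(2, 5, factors))
mlp_user, mlp_item = rng.normal(size=(2, 5, factors))
W1 = rng.normal(size=(2 * factors, mlp_units))  # MLP hidden layer weights
h = rng.normal(size=factors + mlp_units)        # fused output layer weights

def nmf_score(u, i):
    """Concatenate the GMF and MLP tower outputs, then apply one output layer."""
    gmf_out = gmf_user[u] * gmf_item[i]                 # GMF tower: elementwise product
    mlp_in = np.concatenate([mlp_user[u], mlp_item[i]])
    mlp_out = np.maximum(0.0, mlp_in @ W1)              # MLP tower: one ReLU layer
    fused = np.concatenate([gmf_out, mlp_out])
    return 1.0 / (1.0 + np.exp(-fused @ h))             # sigmoid interaction score

print(0.0 < nmf_score(1, 2) < 1.0)  # True
```

Dropping the MLP tower (and the ReLU layer) recovers the GMF model; dropping the GMF tower recovers the MLP model.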
This post isn't going to cover the 'real' technical details of these approaches (both baseline and DL), but if you're interested in how these work (or SOTA recommendation models in general) and would like to see a post on this, let me know!
How about those hyperparameters...
The three DL models were optimised using the well-loved Adam optimiser. As the paper that introduced it indicates, this optimiser is usually performant with its default parameters. The optimiser therefore uses the default parameters provided by Keras. Additionally, a batch size of 256 was chosen for the DL models. Finally, the framework proposed by He et al requires negative instances to be re-sampled in every training epoch. The total number of sampled negative instances was set to 3 for each of the DL models as this was identified as (roughly) optimal in the original work. The effect of varying the negative sample size is not covered in this post.
Additionally, each of the benchmarks was run using one of a fixed number of 'predictive factors' for each model. In the non-DL models, the number of predictive factors is the dimensionality of the factorised matrix representations produced by the model in question. For the DL models, the number of predictive factors corresponds to the number of 'hidden units' in the model's output layer.
The models are compared based on the number of predictive factors, as in the original paper. For example, an ALS model with 16 predictive factors is referred to as ALS-16 and is compared to a GMF model with the same number of factors (GMF-16). This convention is used to label the results in the plots below, so keep it in mind!
Beyond these settings, no further hyperparameter tuning was performed on any of the chosen models. In other words: these are not finely tuned models. The results here are therefore best seen as a direct comparison of the 'out-of-the-box' performance of the models on a commonly used benchmark dataset.
Some additional settings
Benchmarks were run for four predictive factor settings (8, 16, 32 and 64) for each of the five model variants. Each of these configurations was trained for 50 epochs. The models were trained on the Movielens latest dataset (small). This is an important difference from the original paper, which used a larger, older version of this dataset.
However, ad hoc testing with the larger dataset showed similar performance and behaviour to the smaller dataset, so for expediency the smaller set was used for this test (total elapsed benchmarking time exceeded 5 hours; the same setup on the larger set would have exceeded 48 hours). A future benchmark will be run on the larger set once some optimisation upgrades for Xanthus are completed. Finally, all tests were performed on a basic Google Cloud Platform e2-standard-4 VM instance with 4 vCPUs and 16GB of RAM.
Time for some plots! First up, let's take a look at how the models do over different k values. This'll give you a good idea of the relative performance of the models when they produce different numbers of recommendations. Figure 4 shows the NDCG scores for the five models (in this case each with 64 factors) for varying k. How about the hit ratio (HR)? Figure 5 shows the HR scores for the same five models over the same varying values of k.
To avoid cluttering the plots, only the n=64 data is shown, though similar patterns appear over each configuration tested. Note that in both figures, the Xanthus implementations consistently outperform the baselines by between 4 and 7% over all configurations. This is consistent with the reported findings in He et al. Additionally, the behaviour and relative improvement of the models based on their configurations matches the behaviours shown in the original work.
... the Xanthus implementations consistently outperform the baselines by between 4 and 7% over all configurations over both metrics.
You may also notice that in this case the GMF model is performing pretty much as well as the NMF model. This is an interesting tidbit that emerged in the benchmarks: for larger numbers of predictive factors (64+), the GMF tended to perform similarly to NMF. This wasn't expected, but does make sense if the GMF component of the 'two tower' architecture is the 'dominant' component in the setup. As with other ensembling techniques, the outputs from this model will be naturally favoured by the 'higher level' model. Why this would be so is a different question, and one to be looked into.
There was one discrepancy, though: all scores for both baseline and Xanthus models were consistently ~10% higher than their reported counterparts in the original paper. There are a few reasons this might occur, including changes in the dataset (the evaluated dataset is much more recent, is smaller, and includes different users and behaviours) and subtle differences in calculating the metrics. After careful review and testing, the former explanation seems more likely, though a deeper dive with the original dataset is needed to confirm it.
How about a top-level overview of the results? The table below shows the benchmarking scores for the top-performing baseline (ALS) and the top-performing DL model over all runs (NMF). As you can see, NMF performs best over all the tested configurations, and the difference in performance appears to grow with the number of factors. From these tests (for n>=32), the improvement of the DL models over the baselines is ~5% for NDCG, and ~7% for HR.
Interestingly, ALS begins to show signs of overfitting later in its training epochs for n>=64, whereas this doesn't appear to be so for the DL approaches. In particular, the DL models appear to continue to improve as both the number of epochs and the number of factors increases. A further trial with n=128 indicated that the baseline models did indeed begin to strongly overfit, while the DL models continued to improve on the evaluation set (reaching a new 'high score' of HR=0.86 for NMF). This apparently improved capacity for generalisation is an interesting feature of these models in its own right.
Comparison with metadata
Here's where the benchmarks go beyond the original paper from He et al. In the paper, the team indicated the potential to 'easily extend' their proposed framework to include metadata during model training and inference, though they did not perform this step in their work. However, their framework does indeed make it straightforward to include metadata (e.g. item categories, user demographic information) in the model. This capability is supported in Xanthus 'out-of-the-box'.
To evaluate this capability, an additional set of benchmarks was run. These benchmarks added item metadata (in this case: movie genre information) into the model. Figure 6 shows the outcome of these benchmarks for n=64 (for consistency with the previous plots!). Similar results were observed over all other tested configurations.
As you can see, the NMF model using metadata outperforms the 'vanilla' NMF model, and therefore continues to outperform the baseline ALS model, too. This amounts to a ~2% improvement over the plain NMF model, and a >6% improvement over the baseline on this metric. That's surprisingly good.
Comparison of elapsed training time
So far so good! There is a sting in the tail, though. As they say: there's no such thing as a free lunch. The training time for the Xanthus models is much longer than for the baseline models. Figure 7 shows the elapsed training time over all of the tested model variants for different numbers of predictive factors.
Clearly, the Xanthus models are much 'slower' than their baseline counterparts (typically more than 20x slower). This isn't entirely surprising, however. The baseline implementations are built on top of a well-tuned single-purpose library, whereas the Xanthus implementations are built on top of a well-tuned general-purpose library. Single-purpose implementations usually win out in these situations – you can do a lot of tailored optimisation, after all!
While this accounts for some of the difference between the baselines and the Xanthus implementations, it isn't the main source. The main source appears to be the pointwise negative sampling approach outlined in He et al's paper, which is expensive. For each epoch, the approach randomly draws negative samples from the training interactions to augment the training inputs. This involves a fair few filtering and sorting operations that add to the complexity of the algorithm (and thereby increase running time). The implementation in Xanthus is solid and correct, but it can definitely be accelerated. That's a target for a future release!
So that's about it for this post. As you've seen, Xanthus outperforms the baseline models pretty comfortably on the chosen benchmarks, and expands on the original work to increase this lead a bit more, too. The slow training time is a problem, especially for large datasets. That said, even for larger datasets (10 million+ instances) the training time for a model can still be measured in a few hours on a small-ish machine, so this isn't likely to be too much of an issue in practice.
Xanthus outperforms the baseline models pretty comfortably, and expands on the original work to increase this lead a bit more too.
With that said, there's still a fairly long list of outstanding tasks to look into. Here's an abbreviated version:
- Why does the performance of the NMF and GMF models converge for larger factors? It seems there could be a sensible reason for this, but it'd be good to take a deeper dive.
- In the original paper, He et al indicate that a form of pre-training can boost the performance of NMF models. This might resolve the above issue, so it'll be worth a look.
- The effect of the number of negative samples during training is explored in the original paper, but wasn't reproduced here. This is definitely another area to look into!
- The negative sampling code is a bit of a performance bottleneck. Improving performance on this code would allow the approach to scale more gracefully.
- These benchmarks ran on smaller datasets than the original research was performed on. When performance is improved, the tests will be re-run with larger datasets. And more datasets, too.
- The three DL model varieties introduced here are only the tip of the iceberg of the variants in use in academia and industry. Other varieties (inc. VAE-based models) are coming. Soon. Probably. Hopefully.
If you'd like to help out with any of this, let me know!
In the meantime, thanks for reading. If you'd like to sign-up to receive posts from this blog in a newsletter, hit 'Subscribe'.