Softmax training

  • This page focuses on the training data and process for a softmax deep neural network recommendation system.

  • Negative sampling is crucial to avoid "folding," where embeddings from different categories are incorrectly grouped together.

  • Negative sampling involves training the model on both positive (relevant) and negative (irrelevant) examples.

  • Compared to Matrix Factorization, softmax DNNs are more flexible but computationally expensive and susceptible to folding.

  • While Matrix Factorization is better for large-scale applications, DNNs excel at capturing personalized preferences for recommendation tasks.

The previous page explained how to incorporate a softmax layer into a deep neural network for a recommendation system. This page takes a closer look at the training data for this system.

Training data

The softmax training data consists of the query features \(x\) and a vector of items the user interacted with (represented as a probability distribution \(p\)). These are marked in blue in the following figure. The variables of the model are the weights in the different layers. These are marked in orange in the following figure. The model is typically trained using any variant of stochastic gradient descent.

Figure: Training of a softmax deep neural network, with the training data highlighted in blue and the model weights in orange.
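As a rough, self-contained sketch of one training step under these definitions (hypothetical layer sizes, a single ReLU hidden layer, plain NumPy and vanilla SGD rather than the course's actual setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, d_query, d_emb = 1000, 20, 32        # hypothetical sizes

# Model variables (the "orange" parts of the figure): hidden layer + item embeddings.
W_hidden = rng.normal(scale=0.1, size=(d_emb, d_query))
V = rng.normal(scale=0.1, size=(n_items, d_emb))   # item embedding matrix

def query_embedding(x):
    """psi(x): a one-hidden-layer network mapping query features to the embedding space."""
    return np.maximum(W_hidden @ x, 0.0)           # ReLU

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Training data (the "blue" parts): query features x and the ground-truth
# distribution p over the items the user interacted with.
x = rng.normal(size=d_query)
p = np.zeros(n_items)
p[[3, 17, 42]] = 1.0 / 3.0                         # three interacted items

# One SGD step on the full softmax cross-entropy.
lr = 0.1
psi = query_embedding(x)
p_hat = softmax(V @ psi)                           # \hat p(x) over all n items
loss = -np.sum(p * np.log(p_hat + 1e-12))
print("loss:", loss)

# Gradient of the loss w.r.t. the logits is (p_hat - p); backpropagate into V and W_hidden.
d_logits = p_hat - p                               # shape (n_items,)
grad_V = np.outer(d_logits, psi)
grad_psi = V.T @ d_logits
grad_W = np.outer(grad_psi * (psi > 0), x)         # ReLU gradient

V -= lr * grad_V
W_hidden -= lr * grad_W
```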

Negative sampling

Since the loss function compares two probability vectors \(p, \hat p(x) \in \mathbb R^n\) (the ground truth and the output of the model, respectively), computing the gradient of the loss (for a single query \(x\)) can be prohibitively expensive if the corpus size \(n\) is too big.
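To see why, suppose the loss is the usual cross-entropy between \(p\) and \(\hat p(x)\). Then, for a single query,

\[
\mathcal L(x) = -\sum_{j=1}^{n} p_j \log \hat p_j(x),
\qquad
\frac{\partial \mathcal L}{\partial \big(\psi(x) \cdot V_j\big)} = \hat p_j(x) - p_j .
\]

Because \(\hat p_j(x) > 0\) for every item, the gradient is nonzero for all \(n\) items, so a single step must touch every row of the item embedding matrix \(V\).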

You could set up a system to compute gradients only on the positive items (items that are active in the ground truth vector). However, if the system only trains on positive pairs, the model may suffer from folding, as explained below.

Folding
In the following figure, assume that each color represents a different category of queries and items. Each query (represented as a square) mostly interacts with the items (represented as circles) of the same color. For example, consider each category to be a different language on YouTube. A typical user will mostly interact with videos of one given language.

Figure: A plane folded in half, showing three groups of squares (queries) and circles (items). Each group has a different color, and queries only interact with items from the same group.

The model may learn how to place the query/item embeddings of a given color relative to each other (correctly capturing similarity within that color), but embeddings from different colors may end up in the same region of the embedding space by chance. This phenomenon, known as folding, can lead to spurious recommendations: at query time, the model may incorrectly predict a high score for an item from a different group.

Negative examples are items labeled "irrelevant" to a given query. Showing the model negative examples during training teaches the model that embeddings of different groups should be pushed away from each other.

Instead of using all items to compute the gradient (which can be too expensive) or using only positive items (which makes the model prone to folding), you can use negative sampling. More precisely, you compute an approximate gradient using the following items:

  • All positive items (the ones that appear in the target label)
  • A sample of negative items (\(j \in \{1, \dots, n\}\))

There are different strategies for sampling negatives:

  • You can sample uniformly.
  • You can give higher probability to items \(j\) with a higher score \(\psi(x) \cdot V_j\). Intuitively, these are the examples that contribute the most to the gradient; such examples are often called hard negatives.
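A minimal sketch of this approximate gradient with uniform negative sampling (illustrative shapes and names; it omits corrections such as the logQ sampling-bias adjustment that sampled-softmax implementations typically apply):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, d_emb = 1000, 32
V = rng.normal(scale=0.1, size=(n_items, d_emb))   # item embeddings (model variables)

psi = rng.normal(size=d_emb)                       # query embedding psi(x) from the DNN
positives = np.array([3, 17, 42])                  # items in the target label
num_neg = 50

# Uniform negative sampling: draw items outside the positive set.
candidates = np.setdiff1d(np.arange(n_items), positives)
negatives = rng.choice(candidates, size=num_neg, replace=False)

# Restrict the softmax to positives + sampled negatives.
subset = np.concatenate([positives, negatives])
logits = V[subset] @ psi
z = np.exp(logits - logits.max())
p_hat = z / z.sum()

# Target distribution over the subset: uniform over positives, zero on negatives.
p = np.zeros(len(subset))
p[: len(positives)] = 1.0 / len(positives)

# The gradient only touches the |subset| sampled rows of V instead of all n rows.
d_logits = p_hat - p
V[subset] -= 0.1 * np.outer(d_logits, psi)

# "Hard negatives" variant: instead of sampling uniformly, keep the candidates
# with the highest scores psi(x) . V_j, since they contribute most to the gradient.
scores = V[candidates] @ psi
hard_negatives = candidates[np.argsort(scores)[-num_neg:]]
print("hard negative ids:", hard_negatives[:5])
```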

On matrix factorization versus softmax

DNN models solve many limitations of Matrix Factorization, but are typically more expensive to train and query. The table below summarizes some of the important differences between the two models.

|                      | Matrix Factorization | Softmax DNN |
|----------------------|----------------------|-------------|
| Query features       | Not easy to include. | Can be included. |
| Cold start           | Does not easily handle out-of-vocab queries or items. Some heuristics can be used (for example, for a new query, average the embeddings of similar queries). | Easily handles new queries. |
| Folding              | Folding can be easily reduced by adjusting the unobserved weight in WALS. | Prone to folding. Need to use techniques such as negative sampling or gravity. |
| Training scalability | Easily scalable to very large corpora (perhaps hundreds of millions of items or more), but only if the input matrix is sparse. | Harder to scale to very large corpora. Some techniques can be used, such as hashing, negative sampling, etc. |
| Serving scalability  | Embeddings U, V are static, and a set of candidates can be pre-computed and stored. | Item embeddings V are static and can be stored. The query embedding usually needs to be computed at query time, making the model more expensive to serve. |
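To make the serving-time difference concrete, here is an illustrative sketch (hypothetical names and sizes; assumes the trained embeddings and DNN weights have been exported):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, d_query, d_emb = 1000, 20, 32

# Matrix factorization: both U (query/user) and V (item) embeddings are static,
# so top-k candidates per user can be precomputed offline and cached.
U = rng.normal(size=(500, d_emb))
V = rng.normal(size=(n_items, d_emb))
precomputed_top_k = np.argsort(U @ V.T, axis=1)[:, -10:]   # offline job

# Softmax DNN: V is still static, but psi(x) depends on the query features,
# so the query embedding must be computed at request time.
W_hidden = rng.normal(scale=0.1, size=(d_emb, d_query))    # trained DNN weights

def serve(x, k=10):
    psi = np.maximum(W_hidden @ x, 0.0)      # forward pass at query time
    scores = V @ psi
    return np.argsort(scores)[-k:][::-1]     # top-k item ids

print(serve(rng.normal(size=d_query)))
```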

In summary:

  • Matrix factorization is usually the better choice for large corpora. It is easier to scale, cheaper to query, and less prone to folding.
  • DNN models can better capture personalized preferences, but are harder to train and more expensive to query. DNN models are preferable to matrix factorization for scoring because DNN models can use more features to better capture relevance. Also, it is usually acceptable for DNN models to fold, since you mostly care about ranking a pre-filtered set of candidates assumed to be relevant.
