Deep neural network models
Page Summary
- Deep neural networks (DNNs) for recommendation address limitations of matrix factorization by incorporating side features and improving relevance.
- A softmax DNN treats recommendation as a multiclass prediction problem, predicting the probability of user interaction with each item.
- DNNs learn embeddings for both queries and items, using a nonlinear function to map features to embeddings.
- Two-tower neural networks extend this approach by using separate networks to learn query and item embeddings from their features, enabling the use of item features for improved recommendations.
The previous section showed you how to use matrix factorization to learn embeddings. Some limitations of matrix factorization include:
- The difficulty of using side features (that is, any features beyond the query ID/item ID). As a result, the model can only be queried with a user or item present in the training set.
- Relevance of recommendations. Popular items tend to be recommended for everyone, especially when using dot product as a similarity measure. It is better to capture specific user interests.
Deep neural network (DNN) models can address these limitations of matrix factorization. DNNs can easily incorporate query features and item features (due to the flexibility of the input layer of the network), which can help capture the specific interests of a user and improve the relevance of recommendations.
Softmax DNN for recommendation
One possible DNN model is softmax, which treats the problem as a multiclass prediction problem in which:
- The input is the user query.
- The output is a probability vector with size equal to the number of items in the corpus, representing the probability of interacting with each item; for example, the probability of clicking on or watching a YouTube video.
Input
The input to a DNN can include:
- dense features (for example, watch time and time since last watch)
- sparse features (for example, watch history and country)
Unlike the matrix factorization approach, you can add side features such as age or country. We'll denote the input vector by \(x\).
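As a concrete illustration, the following sketch builds an input vector \(x\) for a single query by concatenating two dense features with a one-hot country and a multi-hot watch history. The feature names, country vocabulary, and corpus size are made up for this example; they are not values from the course.

```python
import numpy as np

# Hypothetical vocabulary and corpus size, chosen only for illustration.
COUNTRIES = ["US", "CA", "FR", "IN"]
NUM_VIDEOS = 6

def encode_query(watch_time, days_since_last_watch, country, watch_history):
    """Concatenates dense and sparse query features into a single vector x."""
    dense = np.array([watch_time, days_since_last_watch], dtype=np.float32)

    # Sparse feature: country, encoded as a one-hot vector.
    country_one_hot = np.zeros(len(COUNTRIES), dtype=np.float32)
    country_one_hot[COUNTRIES.index(country)] = 1.0

    # Sparse feature: watch history, encoded as a multi-hot vector over the corpus.
    history_multi_hot = np.zeros(NUM_VIDEOS, dtype=np.float32)
    history_multi_hot[watch_history] = 1.0

    return np.concatenate([dense, country_one_hot, history_multi_hot])

x = encode_query(watch_time=12.5, days_since_last_watch=3.0,
                 country="CA", watch_history=[0, 4])
print(x.shape)  # (12,) = 2 dense + 4 country + 6 history dimensions
```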
Model architecture
The model architecture determines the complexity and expressivity of the model. By adding hidden layers and non-linear activation functions (for example, ReLU), the model can capture more complex relationships in the data. However, increasing the number of parameters also typically makes the model harder to train and more expensive to serve. We will denote the output of the last hidden layer by \(\psi (x) \in \mathbb R^d\).
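For instance, a minimal Keras sketch of such an architecture might look like the following; the layer widths and \(d = 32\) are arbitrary choices for illustration, not values prescribed by the course.

```python
import tensorflow as tf

# A stack of hidden layers with ReLU activations mapping the input vector x
# to the output of the last hidden layer, psi(x) in R^d.
d = 32
hidden_layers = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(d),                # psi(x), the last hidden layer
])

x_batch = tf.random.normal([16, 12])         # a batch of 16 made-up input vectors x
psi_x = hidden_layers(x_batch)               # shape: (16, d)
```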
Softmax Output: Predicted Probability Distribution
The model maps the output of the last layer, \(\psi (x)\), through a softmax layer to a probability distribution \(\hat p = h(\psi(x) V^T)\), where:
- \(h : \mathbb R^n \to \mathbb R^n\) is the softmax function, given by \(h(y)_i=\frac{e^{y_i}}{\sum_j e^{y_j}}\)
- \(V \in \mathbb R^{n \times d}\) is the matrix of weights of the softmax layer.
The softmax layer maps a vector of scores \(y \in \mathbb R^n\) (sometimes called the logits) to a probability distribution.
The name softmax is a play on words. A "hard" max assigns probability 1 to the item with the largest score \(y_i\). By contrast, the softmax assigns a non-zero probability to all items, giving a higher probability to items that have higher scores. When the scores are scaled, the softmax \(h(\alpha y)\) converges to a "hard" max in the limit \(\alpha \to \infty\).
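The following small numeric sketch (with made-up scores) shows both behaviors: softmax assigns some probability to every item, and scaling the scores pushes the distribution toward a "hard" max.

```python
import numpy as np

def softmax(y):
    """h(y)_i = exp(y_i) / sum_j exp(y_j)."""
    exp_y = np.exp(y - np.max(y))   # subtract the max for numerical stability
    return exp_y / exp_y.sum()

scores = np.array([2.0, 1.0, 0.5])   # logits y (made up)
print(softmax(scores))               # ~[0.63, 0.23, 0.14]: every item gets some probability
print(softmax(10.0 * scores))        # ~[1.00, 0.00, 0.00]: approaches a "hard" max
```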
Loss Function
Finally, define a loss function that compares the following:
- \(\hat p\), the output of the softmax layer (a probability distribution)
- \(p\), the ground truth, representing the items the user has interacted with (for example, YouTube videos the user clicked or watched). This can be represented as a normalized multi-hot distribution (a probability vector).
For example, you can use the cross-entropy loss since you are comparing two probability distributions.
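As a minimal sketch (with a made-up corpus of 5 items and made-up predictions), the cross-entropy between \(\hat p\) and a normalized multi-hot \(p\) can be computed as:

```python
import numpy as np

p_hat = np.array([0.10, 0.40, 0.05, 0.30, 0.15])   # softmax output (made up)

# Ground truth: the user interacted with items 1 and 3, normalized to sum to 1.
interactions = np.zeros(5)
interactions[[1, 3]] = 1.0
p = interactions / interactions.sum()              # [0, 0.5, 0, 0.5, 0]

cross_entropy = -np.sum(p * np.log(p_hat))
print(cross_entropy)                               # ~1.06
```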
Softmax Embeddings
The probability of item \(j\) is given by \(\hat p_j = \frac{\exp(\langle \psi(x), V_j\rangle)}{Z}\), where \(Z\) is a normalization constant that does not depend on \(j\).
In other words, \(\log(\hat p_j) = \langle \psi(x), V_j\rangle - \log(Z)\), so the log probability of an item \(j\) is (up to an additive constant) the dot product of two \(d\)-dimensional vectors, which can be interpreted as query and item embeddings:
- \(\psi(x) \in \mathbb R^d\) is the output of the last hidden layer. We call it the embedding of the query \(x\).
- \(V_j \in \mathbb R^d\) is the vector of weights connecting the last hidden layer to output \(j\). We call it the embedding of item \(j\).
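A short sketch (with made-up dimensions \(d = 4\) and \(n = 5\)) makes this interpretation concrete: the logits are the dot products \(\langle \psi(x), V_j \rangle\), and the log probabilities differ from them only by the constant \(\log(Z)\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 5
psi_x = rng.normal(size=d)        # query embedding: output of the last hidden layer
V = rng.normal(size=(n, d))       # softmax weights = item embedding matrix

logits = V @ psi_x                # <psi(x), V_j> for every item j
p_hat = np.exp(logits) / np.exp(logits).sum()

# log(p_hat_j) equals the dot product up to the additive constant -log(Z).
log_Z = np.log(np.exp(logits).sum())
assert np.allclose(np.log(p_hat), logits - log_Z)
```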
DNN and Matrix Factorization
In both the softmax model and the matrix factorization model, the system learns one embedding vector \(V_j\) per item \(j\). What we called the item embedding matrix \(V \in \mathbb R^{n \times d}\) in matrix factorization is now the matrix of weights of the softmax layer.
The query embeddings, however, are different. Instead of learning one embedding \(U_i\) per query \(i\), the system learns a mapping from the query feature \(x\) to an embedding \(\psi(x) \in \mathbb R^d\). Therefore, you can think of this DNN model as a generalization of matrix factorization, in which you replace the query side by a nonlinear function \(\psi(\cdot)\).
Can You Use Item Features?
Can you apply the same idea to the item side? That is, instead of learning one embedding per item, can the model learn a nonlinear function that maps item features to an embedding? Yes. To do so, use a two-tower neural network, which consists of two neural networks:
- One neural network maps query features \(x_{\text{query}}\) to the query embedding \(\psi(x_{\text{query}}) \in \mathbb R^d\)
- One neural network maps item features \(x_{\text{item}}\) to the item embedding \(\phi(x_{\text{item}}) \in \mathbb R^d\)
The output of the model can be defined as the dot product \(\langle \psi(x_{\text{query}}), \phi(x_{\text{item}}) \rangle\). Note that this is not a softmax model anymore. The new model predicts one value per pair \((x_{\text{query}}, x_{\text{item}})\) instead of a probability vector for each query \(x_{\text{query}}\).
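A minimal Keras sketch of a two-tower model might look like the following; the feature sizes, layer widths, and \(d = 32\) are assumptions for illustration, not values from the course.

```python
import tensorflow as tf

d = 32

def make_tower(embedding_dim):
    """A small feed-forward tower that maps features to a d-dimensional embedding."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(embedding_dim),
    ])

query_tower = make_tower(d)                 # learns psi(x_query)
item_tower = make_tower(d)                  # learns phi(x_item)

x_query = tf.random.normal([8, 12])         # a batch of 8 made-up query feature vectors
x_item = tf.random.normal([8, 20])          # the corresponding item feature vectors

psi = query_tower(x_query)                  # (8, d) query embeddings
phi = item_tower(x_item)                    # (8, d) item embeddings

# One score per (query, item) pair: the dot product of the two embeddings.
scores = tf.reduce_sum(psi * phi, axis=1)   # shape: (8,)
```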