Embeddings: Obtaining embeddings

  • Embeddings can be created using dimensionality reduction techniques like PCA or by training them as part of a neural network.

  • Training an embedding within a neural network allows customization for specific tasks, where the embedding layer learns optimal weights to represent data in a lower-dimensional space.

  • Word embeddings, like word2vec, leverage the distributional hypothesis to map semantically similar words to geometrically close vectors.

  • Static word embeddings have limitations as they assign a single representation per word, while contextual embeddings offer multiple representations based on context.

This section covers several means of obtaining embeddings, as well as how to transform static embeddings into contextual embeddings.

Dimensionality reduction techniques

There are many mathematical techniques that capture the important structures of a high-dimensional space in a low-dimensional space. In theory, any of these techniques can be used to create an embedding for a machine learning system.

For example, principal component analysis (PCA) has been used to create word embeddings. Given a set of instances like bag of words vectors, PCA tries to find highly correlated dimensions that can be collapsed into a single dimension.
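
As a rough sketch (not from the course), you could apply PCA to bag-of-words vectors with scikit-learn; the toy corpus and the choice of two components are illustrative:

```python
# A minimal sketch of building document embeddings by applying PCA to
# bag-of-words count vectors. The corpus and component count are illustrative.
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the dog chased the ball",
    "the puppy fetched the ball",
    "stock prices fell sharply today",
]

# Bag of words: one high-dimensional count vector per document.
bow = CountVectorizer().fit_transform(corpus).toarray()

# Collapse correlated dimensions into two principal components.
embeddings = PCA(n_components=2).fit_transform(bow)
print(embeddings.shape)  # (3, 2): each document is now a 2-D vector
```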

Training an embedding as part of a neural network

You can create an embedding while training a neural network for your target task. This approach gets you an embedding well customized for your particular system, but may take longer than training the embedding separately.

In general, you can create a hidden layer of size d in your neural network that is designated as the embedding layer, where d represents both the number of nodes in the hidden layer and the number of dimensions in the embedding space. This embedding layer can be combined with any other features and hidden layers. As in any deep neural network, the parameters will be optimized during training to minimize loss on the nodes in the network's output layer.
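
As a hedged sketch, a Keras model with a designated embedding layer of size d = 3 might look like the following; the vocabulary size, input shape, and layer widths are illustrative assumptions, not part of the course:

```python
# A minimal Keras sketch of an embedding layer of size d = 3 inside a larger
# network. Vocabulary size, input shape, and layer widths are illustrative.
import tensorflow as tf

vocab_size = 5000   # number of distinct input items
d = 3               # nodes in the embedding layer = embedding dimensions

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype="int32"),        # one item ID per example
    tf.keras.layers.Embedding(input_dim=vocab_size,   # learns a d-dimensional
                              output_dim=d),          # vector for each item
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(16, activation="relu"),     # additional hidden layer
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])

# The embedding weights are trained along with every other parameter to
# minimize the loss computed at the output layer.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```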

Returning to our food recommendation example, our goal is to predict new meals a user will like based on their current favorite meals. First, we can compile additional data on our users' top five favorite foods. Then, we can model this task as a supervised learning problem. We set four of these top five foods to be feature data, and then randomly set aside the fifth food as the positive label that our model aims to predict, optimizing the model's predictions using a softmax loss.
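
One way to construct those training examples (a hypothetical sketch using made-up integer food IDs) is to hold out one of each user's five favorites as the label:

```python
# A hypothetical sketch of turning top-five favorite-food lists into
# (features, label) pairs: four foods become features, and one food chosen at
# random becomes the positive label the model should predict.
import random

# Made-up integer food IDs for a few users' top-five favorites.
top_five_foods = [
    [12, 7, 45, 3, 28],
    [7, 19, 3, 52, 40],
]

examples = []
for foods in top_five_foods:
    foods = foods.copy()
    random.shuffle(foods)
    label = foods.pop()              # one food held out as the target
    examples.append((foods, label))  # remaining four foods are the features

print(examples)  # e.g., ([45, 12, 3, 7], 28), ...
```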

During training, the neural network model will learn the optimal weights for the nodes in the first hidden layer, which serves as the embedding layer. For example, if the model contains three nodes in the first hidden layer, it might determine that the three most relevant dimensions of food items are sandwichness, dessertness, and liquidness. Figure 12 shows the one-hot encoded input value for "hot dog" transformed into a three-dimensional vector.

[Figure 12: A five-node one-hot input layer, representing borscht, hot dog, salad, ..., and shawarma with the values [0, 1, 0, ..., 0], feeds a three-node embedding layer whose nodes take the values 2.98, -0.75, and 0; the embedding layer connects to a five-node hidden layer and then a five-node output layer.]
Figure 12. A one-hot encoding of hot dog provided as input to a deep neural network. An embedding layer translates the one-hot encoding into the three-dimensional embedding vector [2.98, -0.75, 0].
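
Concretely, the translation in Figure 12 amounts to a row lookup: multiplying the one-hot vector by the embedding layer's weight matrix selects the row for hot dog. In the sketch below, only the hot dog row uses the values from the figure; the other rows are hypothetical.

```python
# A sketch of the lookup in Figure 12. Only the hot dog row uses the values
# shown in the figure; the other rows are hypothetical.
import numpy as np

embedding_weights = np.array([
    [0.10, 0.85, 0.05],    # borscht (hypothetical)
    [2.98, -0.75, 0.00],   # hot dog (values from Figure 12)
    [0.20, 0.05, 0.02],    # salad (hypothetical)
    [1.50, -0.20, 0.10],   # ... (hypothetical)
    [2.40, -0.60, 0.05],   # shawarma (hypothetical)
])

one_hot_hot_dog = np.array([0, 1, 0, 0, 0])

# A one-hot vector times the weight matrix selects a single row.
print(one_hot_hot_dog @ embedding_weights)  # [ 2.98 -0.75  0. ]
```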

In the course of training, the weights of the embedding layer will be optimized so that the embedding vectors for similar examples are closer to each other. As previously mentioned, the dimensions that an actual model chooses for its embeddings are unlikely to be as intuitive or understandable as in this example.

Contextual embeddings

One limitation of word2vec static embedding vectors is that words can mean different things in different contexts. "Yeah" means one thing on its own, but the opposite in the phrase "Yeah, right." "Post" can mean "mail," "to put in the mail," "earring backing," "marker at the end of a horse race," "postproduction," "pillar," "to put up a notice," "to station a guard or soldier," or "after," among other possibilities.

However, with static embeddings, each word is represented by a single point in vector space, even though it may have a variety of meanings. In the last exercise, you discovered the limitations of static embeddings for the word orange, which can signify either a color or a type of fruit. With only one static embedding, orange will always be closer to other colors than to juice when trained on the word2vec dataset.
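
You can probe this yourself with gensim's pretrained word2vec vectors; this is only a sketch, and the pretrained corpus differs from the exercise's dataset, so the exact similarities may vary:

```python
# A sketch using gensim's pretrained word2vec vectors to probe the single
# static embedding for "orange" (the pretrained model download is large).
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # static word2vec embeddings

# The same vector is used for "orange" regardless of context, so these
# similarities are fixed once training is done.
print(wv.similarity("orange", "yellow"))   # proximity to a color
print(wv.similarity("orange", "juice"))    # proximity to a fruit context
```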

Contextual embeddings were developed to address this limitation. Contextual embeddings allow a word to be represented by multiple embeddings that incorporate information about the surrounding words as well as the word itself. Orange would have a different embedding for every unique sentence containing the word in the dataset.

Some methods for creating contextual embeddings, like ELMo, take the static embedding of an example, such as the word2vec vector for a word in a sentence, and transform it by a function that incorporates information about the words around it. This produces a contextual embedding.

Details on contextual embeddings

  • For ELMo models specifically, the static embedding is aggregated with embeddings taken from other layers, which encode front-to-back and back-to-front readings of the sentence.
  • BERT models mask part of the sequence that the model takes as input.
  • Transformer models use a self-attention layer to weight the relevance of the other words in a sequence to each individual word. They also add the relevant column from a positional embedding matrix (see positional encoding) to each previously learned token embedding, element by element, to produce the input embedding that is fed into the rest of the model for inference. This input embedding, unique to each distinct textual sequence, is a contextual embedding.

NOTE: See the LLM module for more details on transformers and encoder-decoder architecture.
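
To see the contrast with static embeddings, the following sketch uses the Hugging Face transformers library and a BERT model (neither is part of this course) to extract the embedding of orange from two different sentences and compare them; the sentences and helper function are illustrative assumptions:

```python
# A sketch showing that a transformer produces different embeddings for
# "orange" depending on the surrounding sentence.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def orange_embedding(sentence):
    """Return the contextual embedding of the token 'orange' in sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (tokens, 768)
    index = inputs["input_ids"][0].tolist().index(
        tokenizer.convert_tokens_to_ids("orange"))
    return hidden[index]

a = orange_embedding("She wore an orange dress to the party.")
b = orange_embedding("He drank a glass of orange juice.")

# The two vectors differ because each one encodes its surrounding words.
print(torch.cosine_similarity(a, b, dim=0))
```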

While the models described above are language models, contextual embeddings are also useful in generative tasks beyond text, such as image generation. In a photo of a horse, combining the embedding of each pixel's RGB values with a positional matrix representing each pixel and some encoding of the neighboring pixels creates contextual embeddings that provide more information to the model than the original static embeddings of the RGB values alone.
