In the rapidly evolving field of Natural Language Processing (NLP), Sentence Transformers have emerged as a powerful tool for encoding sentences into high-dimensional vectors (also known as embeddings), which can then be used for various tasks such as semantic search, clustering, and sentence similarity. Sentence transformers are a type of deep learning model specifically designed to capture the semantic meaning of sentences, going beyond the capabilities of traditional word embeddings.
Traditional word embeddings like Word2Vec or GloVe focus on representing individual words in a continuous vector space, capturing semantic relationships between words. However, these models often fall short when dealing with longer texts or sentences, as they do not consider the context in which words appear. Sentence transformers address this limitation by encoding entire sentences or text fragments into fixed-size vectors, preserving this contextual meaning. This approach has significantly enhanced the performance of NLP applications, enabling more accurate and meaningful text analysis.
In this article, we’ll explore the fundamentals of Sentence Transformers and provide hands-on examples utilising Hugging Face’s Python library, sentence-transformers.
Before diving in, if you need help, guidance, or want to ask questions, join our Community and a member of the Marqo team will be there to help.
Before we jump into the fundamentals of Sentence Transformers, it’s important to understand the advancements in machine learning that led to the creation of sentence transformers.
Previously, translating text from one language to another heavily relied on encoder-decoder architectures based on Recurrent Neural Networks (RNNs) [1] and Long Short-Term Memory Networks (LSTMs) [2]. These architectures had two main phases:
1. The encoding phase, where the encoder reads the input sentence word by word and compresses it into a single fixed-length context vector (its final hidden state).
2. The decoding phase, where the decoder takes this context vector and generates the output sentence in the target language, one word at a time.
You might be wondering what the meaning of the “hidden state” is. In very simple terms, imagine you are reading a paragraph to understand it (encoding phase). As you read each sentence, you form an idea in your mind (hidden states at each step). After finishing the paragraph, you have a clear understanding of the overall meaning (final hidden state). Now, you want to explain this paragraph to someone else (decoding phase). You start explaining using your clear understanding (initial hidden state), and you continue to elaborate sentence by sentence until you've conveyed the entire meaning. So, "hidden states" are intermediate representations of the input data at each step of the process within the neural network.
The process of this architecture can be seen below for English to Spanish translation:
The primary issue with this approach is the reliance on a single context vector to represent the entire input sentence. If the encoder's summary is inadequate, the quality of the translation deteriorates. This is especially true for longer sentences because of the long-range dependency problem: the difficulty that models like RNNs and LSTMs face in capturing and retaining information about relationships between distant elements in a sequence. The encoder-decoder architecture, with one context vector shared between two models, can act as an information bottleneck, as all information is passed through this point.
To address this, in 2015, Bahdanau et al. [3] proposed an innovative approach: rather than relying on a single context vector, their model incorporates all the input words, assigning varying levels of importance to each. This approach, known as the attention mechanism, allows the model to focus on the most relevant parts of the input sentence when generating the output.
The attention mechanism is inspired by the human visual processing system. When you read a book, the majority of what is in your field of vision is actually disregarded; you pay more attention to the word you are currently reading. This allows your brain to focus on what matters most while ignoring everything else.
In order to imitate the same effects in deep learning models, we assign an attention weight to each of our inputs. These weights represent the relative importance of each input element to the other input elements. This way we guide our model to pay greater attention to particular inputs that are more critical to performing the task at hand.
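To make this concrete, here is a minimal sketch of dot-product attention in Python. The hidden states and the query are made-up toy values for illustration, not part of any real translation model:

import numpy as np

# Toy encoder hidden states for a 4-word input sentence (dimension 3 each)
hidden_states = np.array([
    [0.1, 0.4, 0.2],
    [0.7, 0.1, 0.5],
    [0.3, 0.9, 0.2],
    [0.2, 0.3, 0.8],
])

# A decoder state acting as the "query" for the current decoding step
query = np.array([0.5, 0.2, 0.7])

# Score each input position against the query (dot-product scoring)
scores = hidden_states @ query

# Softmax turns the scores into attention weights that sum to 1
weights = np.exp(scores) / np.exp(scores).sum()

# The context vector is the weighted sum of the hidden states
context = weights @ hidden_states

print("attention weights:", weights)
print("context vector:", context)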
So, with the attention mechanism, our original architecture becomes:
This architecture evaluates a set of positions in the encoder’s hidden states to identify where the most relevant information resides. By doing so, it creates a context vector that dynamically adjusts to include significant details, enhancing the model’s ability to handle long sentences and complex dependencies. This elegant solution has since become a cornerstone in the development of advanced Natural Language Processing models and beyond.
The attention mechanism began to influence further ideas with the famous paper Attention is All You Need, published in 2017 [4]. The authors introduced the transformer model, which eliminated the need for RNNs by using the attention mechanism alone, leading to superior performance and generalization capabilities. This shift revolutionized the NLP ecosystem, moving it away from RNN-based models towards transformers.
The original transformer architecture worked well for sequence-to-sequence problems, but for specific natural language problems like question answering and text summarisation, it needed improvement. There were two main concerns:
To address these concerns, BERT was introduced.
One of the most famous pre-trained models is BERT, Bidirectional Encoder Representations from Transformers, by Google AI [5]. BERT was trained on the BooksCorpus, which has over 800 million words, and English Wikipedia, which has 2.5 billion words. This extensive and diverse dataset enabled BERT to achieve state-of-the-art performance across a variety of NLP tasks.
BERT was built on the idea that different NLP problems all rely on the same fundamental understanding of language. BERT models can be built in two phases. The first is the pre-training phase, where we train the model to understand the language. The second is the fine-tuning phase, where we further train the model on a specific task. This addresses the data concern of transformers; if we already have a pre-trained model that understands language, then we only need data to fine-tune the model for our specific case.
Previously, RNN-based models were mainly built for very specific use cases, and their design would change depending on the task. This is the beauty of transformers; they can be generalized. This means it’s possible to use the same ‘core’ of a model and change the final layers for different use cases. This feature of transformers sparked a whole new type of model in NLP: pretrained models.
Pre-trained transformer models are trained on enormous amounts of training data. During training, the model learns patterns, structures and relationships within the data, allowing it to make accurate predictions and generate meaningful responses when presented with new, unseen data. Typically, this pre-training is done by companies like Google and OpenAI, as training on such vast datasets is expensive. These models are then made available for the public to use for free, which is super helpful!
Imagine we take a question-answering platform and we want to find questions that are similar to a given input question. How would we do this with BERT?
Well, we’d take an input question “What is gravity?” and another question on the platform “How do aeroplanes fly?” and pass them into BERT. BERT would then generate word vectors, which are passed into a feed-forward layer that produces a single output known as the similarity score. The higher the score, the more similar the two questions.
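As an illustration of this pairwise scoring idea, the sentence-transformers library (introduced later in this article) provides cross-encoders that do exactly this: both sentences go through the model together and a single similarity score comes out. The model name below is just one publicly available checkpoint, used here as an example:

from sentence_transformers import CrossEncoder

# A cross-encoder reads both sentences together and outputs one score per pair
model = CrossEncoder('cross-encoder/stsb-distilroberta-base')

pairs = [
    ("What is gravity?", "How do aeroplanes fly?"),
    ("What is gravity?", "Why do objects fall towards the Earth?"),
]
scores = model.predict(pairs)
print(scores)  # higher score = more similar pair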
Of course, we’d have to pass in every question on the platform to determine the similarity. Once complete, you’d select the questions that had the highest similarity as the most relevant/similar questions to the input question.
There’s a big problem here. Imagine we have 10 million questions on the question-answering platform. Then, whenever a new question came in, we’d need to run the forward pass of BERT 10 million times. This is not viable.
The most obvious solution here would be to pass the input question into BERT to get a single vector that represents that question. Then, compare it against all other questions using something like a cosine similarity metric as explained in this article. The final step would be to return the nearest neighbours as the most related questions. This setup would require us to use the BERT model once and not 10 million times.
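Jumping ahead slightly to the library we use later in this article, a minimal sketch of this idea might look like the following; the stored questions are made up for illustration:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('bert-base-nli-mean-tokens')

# Hypothetical questions already on the platform; in practice their embeddings
# would be computed once and stored, not recomputed for every new query
platform_questions = [
    "How do aeroplanes fly?",
    "Why do objects fall towards the Earth?",
    "What is the capital of France?",
]
question_embeddings = model.encode(platform_questions)

# Encode the new question once and rank the stored questions by cosine similarity
query_embedding = model.encode("What is gravity?")
similarities = cos_sim(query_embedding, question_embeddings)[0]
best = int(similarities.argmax())
print(platform_questions[best], float(similarities[best]))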
The issue with BERT is that it only gives us word vectors, so if you want a sentence vector, you’d need to somehow aggregate the word vectors into a single vector. The most straightforward way to do this is to take the average of the word vectors. This technique is called mean pooling. This system, as outlined in the figure below, is the simplest form of sentence transformer.
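Here is a minimal sketch of mean pooling over raw BERT outputs using the Hugging Face transformers library; bert-base-uncased is chosen purely for illustration:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert = AutoModel.from_pretrained('bert-base-uncased')

sentences = ["What is gravity?", "How do aeroplanes fly?"]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    token_embeddings = bert(**encoded).last_hidden_state  # (batch, tokens, 768)

# Mean pooling: average the token vectors, ignoring padding tokens
mask = encoded['attention_mask'].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)  # torch.Size([2, 768])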
Okay great! Well… not so great. The sentence embeddings produced this way are actually really poor quality. So poor, in fact, that you might be better off taking the average of GloVe embeddings and not using BERT at all! Let’s look at how we can fix that.
The solution came in 2019 with Nils Reimers and Iryna Gurevych's SBERT (Sentence-BERT) [6] and the sentence-transformers library [7]. SBERT generated high-quality sentence embeddings, drastically reducing search times and outperforming previous models in semantic textual similarity tasks. Unlike BERT, SBERT allowed for efficient storage and comparison of sentence embeddings, making it highly scalable.
So, how does SBERT work?
SBERT is fine-tuned on sentence pairs using a siamese architecture. This means we have two of the exact same BERT sentence transformer networks connected. This can be seen in the Figure below where two separate sentences are passed through the same architecture. In the original SBERT paper, they tested three different pooling methods (mean, max and [CLS]) and mean pooling performed best.
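The sentence-transformers library (which we use in the coding section below) lets you assemble exactly this kind of model from a BERT encoder plus a pooling layer. A minimal sketch, with bert-base-uncased and mean pooling as illustrative choices:

from sentence_transformers import SentenceTransformer, models

# BERT encoder that produces one vector per token
word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=128)

# Pooling layer that collapses the token vectors into one sentence vector
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode='mean',  # could also be 'max' or 'cls'
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])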
This siamese architecture produces sentence embeddings. Let’s call these embeddings u and v respectively. These embeddings are then combined to form a single vector embedding that represents the relationship between the two sentences. Different concatenation methods were tested but the best performing was (u, v, |u-v|) as illustrated below.
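In code, this concatenation is a one-liner; the random vectors below simply stand in for the two sentence embeddings:

import torch

u = torch.randn(768)  # embedding of sentence one
v = torch.randn(768)  # embedding of sentence two

# Concatenate the embeddings with their element-wise absolute difference
combined = torch.cat([u, v, torch.abs(u - v)])
print(combined.shape)  # torch.Size([2304])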
To train the sentence transformer, this combined vector is fed into a neural network, typically a feed-forward neural network (FFNN), which can be trained with various objectives depending on the task. Let’s take a look at an example: Natural Language Inference (NLI). NLI works by taking two sentences and determining whether sentence one entails or contradicts sentence two, or neither. This allows BERT to understand sentence meanings as a whole and is therefore chosen for training.
An example of an NLI sentence pair can be seen below for the sentences “She purchased a new laptop from the store” and “She works in technology”. This example is considered neutral because the fact that she has purchased a laptop does not necessarily imply that she works in technology.
We then minimise a loss function which trains the network. Let’s break down the steps for training a sentence transformer on an NLI dataset (a minimal code sketch follows the list):
1. Pass each sentence of a pair through the same BERT encoder.
2. Apply mean pooling to obtain the sentence embeddings u and v.
3. Concatenate them as (u, v, |u-v|).
4. Feed this combined vector into a feed-forward classifier with three outputs: entailment, contradiction, and neutral.
5. Minimise the cross-entropy loss between the predicted and true labels, updating the weights of the network.
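Using the sentence-transformers library, a minimal training sketch for this setup might look like the following. The two toy sentence pairs and the label mapping (contradiction = 0, entailment = 1, neutral = 2) are illustrative assumptions, not a real dataset:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('bert-base-uncased')

# Toy NLI pairs; labels: 0 = contradiction, 1 = entailment, 2 = neutral
train_examples = [
    InputExample(texts=["She purchased a new laptop from the store",
                        "She works in technology"], label=2),
    InputExample(texts=["The cat sleeps on the sofa",
                        "The cat is awake"], label=0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# SoftmaxLoss feeds (u, v, |u-v|) into a classifier over the three NLI labels
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)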
The overall architecture for our use case therefore looks like the Figure below.
More generally, this architecture can be used for training on different datasets. The same idea holds: take an input that consists of pairs of sentences (or individual sentences) to be embedded. Each sentence is passed through a BERT encoder. A pooling layer is then applied to the output. For sentence pair tasks, a similarity function is used to compare the embeddings of the two sentences. Then the similarity scores are fed into a loss function which trains the sentence transformer.
Since SBERT, various sentence transformer models have been developed and optimized using loss functions to produce accurate sentence embeddings. These models are trained on diverse sentence pairs to ensure robustness in capturing sentence similarities.
Awesome! Now that you have the background about sentence transformers, let's explore programming them with the sentence-transformers library!
Now that we’ve covered the basics of sentence transformers, we can use Hugging Face’s library sentence-transformers to run some examples. This library was created by the creators of SBERT and it’s free and easy to use!
We’ll be using Google Colab for this tutorial. If you are new to Google Colab, you can follow this guide on getting set up - it’s super easy! For this module, you can find the notebook on Google Colab here or on GitHub here. As always, if you face any issues, join our Community and a member of our team will help!
First we install it with pip:
!pip install sentence-transformers
Next, we create our Sentence Transformer model. We will use the original SBERT model bert-base-nli-mean-tokens:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')
model
This produces the following:
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
The output is the SentenceTransformer object, which is composed of two key components: a Transformer module (the underlying BERT model, with a maximum sequence length of 128 tokens) and a Pooling module (configured for mean pooling over the 768-dimensional word embeddings).
We can now create some sentences to be turned into sentence embeddings. We’ll take some examples from Marqo’s website.
Let’s create a list of these sentences in Python.
marqo_sentences = [
    "With Marqo you can use your data to increase relevance with embedding search",
    "Join the conversation on our Marqo Community Channel",
    "AI search that understands the way your customers think",
    "How RedBubble increased revenue with Marqo",
    "With Marqo you can use your data to increase downloads with embedding search"
]
embeddings = model.encode(marqo_sentences)
As we’ve seen in the SentenceTransformer object above, we expect these sentence embeddings to be 768-dimensional. Let’s see if they are:
embeddings.shape
This returns the output:
(5, 768)
So, yes! We have 5 sentences that are all 768-dimensional. Awesome!
Let’s use cosine similarity to compute how similar the sentences are to each other:
import numpy as np
from sentence_transformers.util import cos_sim

# Assuming `embeddings` is a numpy array or a list of numpy arrays
embeddings = np.array(embeddings)

# Compute the cosine similarity matrix
sim = cos_sim(embeddings, embeddings).numpy()
print(sim)
This produces a 5x5 similarity matrix. The value in each entry corresponds to the similarity between the sentences with that row and column number. For example, the first entry (top-left) is the similarity between sentence 1 and sentence 1; since the two sentences are identical, we expect the cosine value to be 1. The top-right entry is the similarity between sentence 1 and sentence 5. The full matrix is shown below:
[[0.9999999  0.6356604  0.66937524 0.4040454  0.9407738 ]
 [0.6356604  0.9999999  0.6081568  0.21924865 0.58871996]
 [0.66937524 0.6081568  1.         0.2225984  0.59990084]
 [0.4040454  0.21924865 0.2225984  0.99999976 0.5078909 ]
 [0.9407738  0.58871996 0.59990084 0.5078909  1.0000001 ]]
We can see that sentence 1 and sentence 5 are similar as the (5, 1) entry (bottom-left) is 0.9407738. Take a glance yourself at the different values and come to a conclusion about which sentences you think are similar. It’s worth noting that the entries on the diagonal are roughly equal to 1. These are the points at which a sentence is compared against itself; the sentences are obviously identical, so we get maximum similarity.
In this example we used SBERT, but there are many other sentence transformer models available. Since its release, newer and faster models have appeared that significantly outperform the original SBERT. Interestingly, SBERT itself is no longer listed as an available model on SBERT.net’s models page!
Here are some of the best SBERT models currently available:
Feel free to change your code to use these models and see what results you get!
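For example, here’s how you might swap in all-MiniLM-L6-v2, one widely used general-purpose model; any other model name from the sentence-transformers documentation works the same way:

from sentence_transformers import SentenceTransformer

# Reusing the `marqo_sentences` list defined earlier in this tutorial
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(marqo_sentences)
print(embeddings.shape)  # (5, 384): this model produces 384-dimensional embeddings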
Some of the advantages of using these models over SBERT include:
Pretty cool!
In this article, we’ve explored the background behind sentence transformers and started coding with Hugging Face’s Python library, sentence-transformers. In the next article, we’ll explore some of the newer models in more detail and explain how you can train and fine-tune your own sentence transformers!
[1] D. Rumelhart, et al. Learning Internal Representations by Error Propagation (1986)
[2] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory (1997)
[3] D. Bahdanau, K. Cho, and Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate (2015)
[4] A. Vaswani, et al. Attention is All You Need (2017)
[5] J. Devlin, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
[6] N. Reimers and I. Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019)
[7] Sentence Transformers Library, Hugging Face