Embedding

After partitioning, chunking, and summarizing, the embedding step creates arrays of numbers known as vectors, representing the text that is extracted by Unstructured. These vectors are stored or embedded next to the text itself. These vector embeddings are generated by an embedding model that is provided by an embedding provider.

You typically save these embeddings in a vector store. When a user queries a retrieval-augmented generation (RAG) application, the application can use a vector database to perform a similarity search in that vector store and then return the items whose embeddings are the closest to that user’s query.

Here is an example of a document element generated by Unstructured, along with its vector embeddings generated by the embedding model sentence-transformers/all-MiniLM-L6-v2 on Hugging Face:

{
    "type": "Title",
    "element_id": "fdbf5369-4485-453b-9701-1bb42c83b00b",
    "text": "THE CONSTITUTION of the United States",
    "metadata": {
        "filetype": "application/pdf",
        "languages": [
            "eng"
        ],
        "page_number": 1,
        "filename": "constitution.pdf",
        "data_source": {
            "record_locator": {
                "path": "/input/constitution.pdf"
            },
            "date_created": "1723069423.0536132",
            "date_modified": "1723069423.055078",
            "date_processed": "1725666244.571788",
            "permissions_data": [
                {
                    "mode": 33188
                }
            ]
        }
    },
    "embeddings": [
        -0.06138836592435837,
        0.08634615689516068,
        -0.019471267238259315,
        "<full-results-omitted-for-brevity>",
        0.0895417109131813,
        0.05604064092040062,
        0.01376157347112894
    ]
}

Learn more.
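To make the similarity search described above more concrete, here is a minimal sketch in plain Python. It assumes cosine similarity as the distance measure and uses invented 3-dimensional embeddings; the helper names cosine_similarity and closest_elements are hypothetical and are not part of Unstructured. A real vector store may use a different metric and would typically rely on an index rather than the linear scan shown here.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def closest_elements(query_embedding, elements, k=2):
    """Rank elements by similarity to the query and return the top k."""
    ranked = sorted(
        elements,
        key=lambda el: cosine_similarity(query_embedding, el["embeddings"]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional embeddings, invented for illustration only.
elements = [
    {"text": "THE CONSTITUTION of the United States", "embeddings": [0.9, 0.1, 0.0]},
    {"text": "Article I", "embeddings": [0.6, 0.6, 0.1]},
    {"text": "Quarterly sales report", "embeddings": [0.0, 0.2, 0.9]},
]
query_embedding = [1.0, 0.0, 0.0]  # stands in for an embedded user query

for el in closest_elements(query_embedding, elements):
    print(el["text"])
```

The query must be embedded with the same model as the stored elements. Otherwise the vectors do not share a space, and may not even share a length.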

Generate embeddings

To generate embeddings, choose one of the available embedding providers and models in the Select Embedding Model section of an Embedder node in a workflow. When choosing an embedding model, be sure to pay attention to the number of dimensions listed next to each model. This number must match the number of dimensions in the embeddings field of your destination connector’s table, collection, or index.

You can change a workflow’s preconfigured provider only through Custom workflow settings.
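The dimension-matching requirement above can be checked on a sample element before loading data into a destination. The following is a minimal sketch only: check_embedding_dimensions is a hypothetical helper, and the value 1024 stands in for whatever dimension count your destination table, collection, or index was actually created with.

```python
def check_embedding_dimensions(element, expected_dimensions):
    """Raise ValueError if the element's embedding length differs from the destination's."""
    embedding = element.get("embeddings")
    if embedding is None:
        raise ValueError("element has no 'embeddings' field")
    actual = len(embedding)
    if actual != expected_dimensions:
        raise ValueError(
            f"embedding has {actual} dimensions, "
            f"but the destination expects {expected_dimensions}"
        )
    return actual


# Hypothetical element with a 1024-dimension embedding (placeholder values).
element = {
    "type": "Title",
    "text": "THE CONSTITUTION of the United States",
    "embeddings": [0.0] * 1024,
}

# Assumed dimension count of the destination's embeddings field.
DESTINATION_DIMENSIONS = 1024

print(check_embedding_dimensions(element, DESTINATION_DIMENSIONS))
```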

Chunk sizing and embedding models

If your workflow has an Embedder node, your workflow’s Chunker node settings must stay within the selected embedding model’s token limits. Exceeding these limits will cause workflow failures. Set your Chunker node’s Max Characters to a value at or below Unstructured’s recommended maximum chunk size for your selected embedding model, as listed in the following table’s last column.

The following list applies only to Unstructured Let’s Go and Pay-As-You-Go accounts. For Unstructured Business accounts, see your Unstructured account administrator for your list of available embedding models. To add more embedding models to your list, contact your Unstructured account administrator or Unstructured sales representative, or email Unstructured Support at support@unstructured.io.

Embedding model	Dimensions	Tokens	Chunker Max Characters^*
Amazon Bedrock
Cohere Embed English	1024	512	1792
Cohere Embed Multilingual	1024	512	1792
Titan Embeddings G1 - Text	1536	8192	28672
Titan Multimodal Embeddings G1	1024	256	896
Titan Text Embeddings V2	1024	8192	28672
Azure OpenAI
Text Embedding 3 Large	3072	8192	28672
Text Embedding 3 Small	1536	8192	28672
Text Embedding Ada 002	1536	8192	28672
Together AI
M2-Bert 80M 32K Retrieval	768	8192	28672
Voyage AI
Voyage 3	1024	32000	112000
Voyage 3 Large	1024	32000	112000
Voyage 3 Lite	512	32000	112000
Voyage Code 2	1536	16000	56000
Voyage Code 3	1024	32000	112000
Voyage Finance 2	1024	32000	112000
Voyage Law 2	1024	16000	56000
Voyage Multimodal 3	1024	32000	112000

^* This is an approximate value, determined by multiplying the embedding model’s token limit by 3.5.
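The arithmetic behind the last column can be reproduced with a few lines of Python. This sketch copies a handful of token limits from the table above; the names recommended_max_characters and chunker_setting_is_safe are hypothetical helpers for illustration, not Unstructured functions.

```python
import math

# Approximate characters per token, as stated in the table footnote.
CHARS_PER_TOKEN = 3.5

# Token limits copied from a few rows of the table above.
TOKEN_LIMITS = {
    "Cohere Embed English": 512,
    "Titan Multimodal Embeddings G1": 256,
    "Text Embedding 3 Large": 8192,
    "Voyage 3": 32000,
    "Voyage Law 2": 16000,
}


def recommended_max_characters(token_limit):
    """Approximate the largest safe Chunker Max Characters for a token limit."""
    return math.floor(token_limit * CHARS_PER_TOKEN)


def chunker_setting_is_safe(max_characters, model):
    """Return True if max_characters is at or below the model's recommended value."""
    return max_characters <= recommended_max_characters(TOKEN_LIMITS[model])


for model, tokens in TOKEN_LIMITS.items():
    print(model, recommended_max_characters(tokens))
```

For example, a Max Characters setting of 2048 would be within range for Text Embedding 3 Large (28672) but above the recommended value for Cohere Embed English (1792).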
