A blazing fast inference solution for text embeddings models

huggingface.co/docs/text-embeddings-inference/quick_tour

License

Apache-2.0 license

Text Embeddings Inference

A blazing fast inference solution for text embeddings models.

[Figure: benchmark for BAAI/bge-base-en-v1.5 on an NVIDIA A10 with a sequence length of 512 tokens]

Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5. TEI implements many features such as:

No model graph compilation step
Metal support for local execution on Macs
Small docker images and fast boot times. Get ready for true serverless!
Token based dynamic batching
Optimized transformers code for inference using Flash Attention, Candle and cuBLASLt
Safetensors weight loading
ONNX weight loading
Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)

Get Started

Supported Models

Text Embeddings

Text Embeddings Inference currently supports Nomic, BERT, CamemBERT, and XLM-RoBERTa models with absolute positions; the JinaBERT model with ALiBi positions; Mistral, Alibaba GTE, and Qwen2 models with RoPE positions; and MPNet, ModernBERT, and Qwen3 models.

Below are some examples of the currently supported models:

MTEB Rank	Model Size	Model Type	Model ID
2	8B (Very Expensive)	Qwen3	Qwen/Qwen3-Embedding-8B
4	0.6B	Qwen3	Qwen/Qwen3-Embedding-0.6B
6	7B (Very Expensive)	Qwen2	Alibaba-NLP/gte-Qwen2-7B-instruct
7	0.5B	XLM-RoBERTa	intfloat/multilingual-e5-large-instruct
14	1.5B (Expensive)	Qwen2	Alibaba-NLP/gte-Qwen2-1.5B-instruct
17	7B (Very Expensive)	Mistral	Salesforce/SFR-Embedding-2_R
34	0.5B	XLM-RoBERTa	Snowflake/snowflake-arctic-embed-l-v2.0
40	0.3B	Alibaba GTE	Snowflake/snowflake-arctic-embed-m-v2.0
51	0.3B	Bert	WhereIsAI/UAE-Large-V1
N/A	0.4B	Alibaba GTE	Alibaba-NLP/gte-large-en-v1.5
N/A	0.4B	ModernBERT	answerdotai/ModernBERT-large
N/A	0.3B	NomicBert	nomic-ai/nomic-embed-text-v2-moe
N/A	0.1B	NomicBert	nomic-ai/nomic-embed-text-v1
N/A	0.1B	NomicBert	nomic-ai/nomic-embed-text-v1.5
N/A	0.1B	JinaBERT	jinaai/jina-embeddings-v2-base-en
N/A	0.1B	JinaBERT	jinaai/jina-embeddings-v2-base-code
N/A	0.1B	MPNet	sentence-transformers/all-mpnet-base-v2

To explore the list of best performing text embeddings models, visit the Massive Text Embedding Benchmark (MTEB) Leaderboard.

Sequence Classification and Re-Ranking

Text Embeddings Inference currently supports CamemBERT and XLM-RoBERTa Sequence Classification models with absolute positions.

Below are some examples of the currently supported models:

Task	Model Type	Model ID
Re-Ranking	XLM-RoBERTa	BAAI/bge-reranker-large
Re-Ranking	XLM-RoBERTa	BAAI/bge-reranker-base
Re-Ranking	GTE	Alibaba-NLP/gte-multilingual-reranker-base
Re-Ranking	ModernBert	Alibaba-NLP/gte-reranker-modernbert-base
Sentiment Analysis	RoBERTa	SamLowe/roberta-base-go_emotions

Docker

model=Qwen/Qwen3-Embedding-0.6B
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model

You can then make requests like this:

curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
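The same request can be issued from a program. The sketch below uses only the Python standard library and assumes the server started above is listening on 127.0.0.1:8080. The helper names `build_embed_request` and `embed` are hypothetical, not part of TEI. The optional `api_key` argument follows the `--api-key` option, which expects the key as a Bearer token in the Authorization header.

```python
import json
import urllib.request


def build_embed_request(inputs, base_url="http://127.0.0.1:8080", api_key=None):
    """Build (but do not send) a POST request for the /embed route."""
    headers = {"Content-Type": "application/json"}
    if api_key is not None:
        # Only needed when the server was started with --api-key.
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embed", data=body, headers=headers, method="POST"
    )


def embed(inputs, **kwargs):
    """Send the request and return the parsed JSON response."""
    request = build_embed_request(inputs, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a running server, `embed("What is Deep Learning?")` should return the parsed JSON body of the response. Splitting request construction from sending keeps the payload easy to inspect without a live deployment.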

Note: To use GPUs, you need to install the NVIDIA Container Toolkit. NVIDIA drivers on your machine need to be compatible with CUDA version 12.2 or higher.

To see all options to serve your models:

$ text-embeddings-router --help
Text Embedding Webserver

Usage: text-embeddings-router [OPTIONS]

Options:
      --model-id <MODEL_ID>
          The name of the model to load. Can be a MODEL_ID as listed on <https://hf.co/models> like `BAAI/bge-large-en-v1.5`. Or it can be a local directory containing the necessary files as saved by `save_pretrained(...)` methods of transformers

          [env: MODEL_ID=]
          [default: BAAI/bge-large-en-v1.5]

      --revision <REVISION>
          The actual revision of the model if you're referring to a model on the hub. You can use a specific commit id or a branch like `refs/pr/2`

          [env: REVISION=]

      --tokenization-workers <TOKENIZATION_WORKERS>
          Optionally control the number of tokenizer workers used for payload tokenization, validation and truncation. Default to the number of CPU cores on the machine

          [env: TOKENIZATION_WORKERS=]

      --dtype <DTYPE>
          The dtype to be forced upon the model

          [env: DTYPE=]
          [possible values: float16, float32]

      --pooling <POOLING>
          Optionally control the pooling method for embedding models.

          If `pooling` is not set, the pooling configuration will be parsed from the model `1_Pooling/config.json` configuration.

          If `pooling` is set, it will override the model pooling configuration

          [env: POOLING=]

          Possible values:
          - cls:        Select the CLS token as embedding
          - mean:       Apply Mean pooling to the model embeddings
          - splade:     Apply SPLADE (Sparse Lexical and Expansion) to the model embeddings. This option is only available if the loaded model is a `ForMaskedLM` Transformer model
          - last-token: Select the last token as embedding

      --max-concurrent-requests <MAX_CONCURRENT_REQUESTS>
          The maximum amount of concurrent requests for this particular deployment.
          Having a low limit will refuse clients requests instead of having them wait for too long and is usually good to handle backpressure correctly

          [env: MAX_CONCURRENT_REQUESTS=]
          [default: 512]

      --max-batch-tokens <MAX_BATCH_TOKENS>
          **IMPORTANT** This is one critical control to allow maximum usage of the available hardware.

          This represents the total amount of potential tokens within a batch.

          For `max_batch_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens.

          Overall this number should be the largest possible until the model is compute bound. Since the actual memory overhead depends on the model implementation, text-embeddings-inference cannot infer this number automatically.

          [env: MAX_BATCH_TOKENS=]
          [default: 16384]

      --max-batch-requests <MAX_BATCH_REQUESTS>
          Optionally control the maximum number of individual requests in a batch

          [env: MAX_BATCH_REQUESTS=]

      --max-client-batch-size <MAX_CLIENT_BATCH_SIZE>
          Control the maximum number of inputs that a client can send in a single request

          [env: MAX_CLIENT_BATCH_SIZE=]
          [default: 32]

      --auto-truncate
          Automatically truncate inputs that are longer than the maximum supported size

          Unused for gRPC servers

          [env: AUTO_TRUNCATE=]

      --default-prompt-name <DEFAULT_PROMPT_NAME>
          The name of the prompt that should be used by default for encoding. If not set, no prompt will be applied.

          Must be a key in the `sentence-transformers` configuration `prompts` dictionary.

          For example if ``default_prompt_name`` is "query" and the ``prompts`` is {"query": "query: ", ...}, then the sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" because the prompt text will be prepended before any text to encode.

          The argument '--default-prompt-name <DEFAULT_PROMPT_NAME>' cannot be used with '--default-prompt <DEFAULT_PROMPT>`

          [env: DEFAULT_PROMPT_NAME=]

      --default-prompt <DEFAULT_PROMPT>
          The prompt that should be used by default for encoding. If not set, no prompt will be applied.

          For example if ``default_prompt`` is "query: " then the sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" because the prompt text will be prepended before any text to encode.

          The argument '--default-prompt <DEFAULT_PROMPT>' cannot be used with '--default-prompt-name <DEFAULT_PROMPT_NAME>`

          [env: DEFAULT_PROMPT=]

      --hf-token <HF_TOKEN>
          Your Hugging Face Hub token

          [env: HF_TOKEN=]

      --hostname <HOSTNAME>
          The IP address to listen on

          [env: HOSTNAME=]
          [default: 0.0.0.0]

  -p, --port <PORT>
          The port to listen on

          [env: PORT=]
          [default: 3000]

      --uds-path <UDS_PATH>
          The name of the unix socket some text-embeddings-inference backends will use as they communicate internally with gRPC

          [env: UDS_PATH=]
          [default: /tmp/text-embeddings-inference-server]

      --huggingface-hub-cache <HUGGINGFACE_HUB_CACHE>
          The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk for instance

          [env: HUGGINGFACE_HUB_CACHE=]

      --payload-limit <PAYLOAD_LIMIT>
          Payload size limit in bytes

          Default is 2MB

          [env: PAYLOAD_LIMIT=]
          [default: 2000000]

      --api-key <API_KEY>
          Set an api key for request authorization.

          By default the server responds to every request. With an api key set, the requests must have the Authorization header set with the api key as Bearer token.

          [env: API_KEY=]

      --json-output
          Outputs the logs in JSON format (useful for telemetry)

          [env: JSON_OUTPUT=]

      --disable-spans
          [env: DISABLE_SPANS=]

      --otlp-endpoint <OTLP_ENDPOINT>
          The grpc endpoint for opentelemetry. Telemetry is sent to this endpoint as OTLP over gRPC. e.g. `http://localhost:4317`

          [env: OTLP_ENDPOINT=]

      --otlp-service-name <OTLP_SERVICE_NAME>
          The service name for opentelemetry. e.g. `text-embeddings-inference.server`

          [env: OTLP_SERVICE_NAME=]
          [default: text-embeddings-inference.server]

      --prometheus-port <PROMETHEUS_PORT>
          The Prometheus port to listen on

          [env: PROMETHEUS_PORT=]
          [default: 9000]

      --cors-allow-origin <CORS_ALLOW_ORIGIN>
          Unused for gRPC servers

          [env: CORS_ALLOW_ORIGIN=]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
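The arithmetic behind `--max-batch-tokens` can be illustrated with a small sketch. This is not TEI's actual scheduler: `plan_batches` is a hypothetical helper that greedily groups requests, in arrival order, so that the token total of each batch stays within the budget. It only shows how the budget bounds what fits in one batch.

```python
def plan_batches(token_counts, max_batch_tokens=16384):
    """Greedily group requests so each batch stays within the token budget."""
    batches, current, used = [], [], 0
    for count in token_counts:
        if count > max_batch_tokens:
            raise ValueError(f"request of {count} tokens exceeds the budget")
        if used + count > max_batch_tokens:
            # The next request would overflow the budget: close this batch.
            batches.append(current)
            current, used = [], 0
        current.append(count)
        used += count
    if current:
        batches.append(current)
    return batches
```

With a budget of 1000 tokens, ten requests of 100 tokens land in a single batch, as does one request of 1000 tokens, which matches the example in the help text. Two requests of 600 tokens each would need two batches.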

Docker Images

Text Embeddings Inference ships with multiple Docker images that you can use to target a specific backend:

Architecture	Image
CPU	ghcr.io/huggingface/text-embeddings-inference:cpu-1.7
Volta	NOT SUPPORTED
Turing (T4, RTX 2000 series, ...)	ghcr.io/huggingface/text-embeddings-inference:turing-1.7 (experimental)
Ampere 80 (A100, A30)	ghcr.io/huggingface/text-embeddings-inference:1.7
Ampere 86 (A10, A40, ...)	ghcr.io/huggingface/text-embeddings-inference:86-1.7
Ada Lovelace (RTX 4000 series, ...)	ghcr.io/huggingface/text-embeddings-inference:89-1.7
Hopper (H100)	ghcr.io/huggingface/text-embeddings-inference:hopper-1.7 (experimental)

Warning: Flash Attention is turned off by default for the Turing image as it suffers from precision issues. You can turn Flash Attention v1 ON by using the USE_FLASH_ATTENTION=True environment variable.
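Picking an image from the table can be scripted. The sketch below is a hypothetical helper, not part of TEI. It maps a CUDA compute capability to the image tags listed above, using the compute capability numbers from the Docker build section of this README (75 for Turing, 80 for A100, 86 for A10, 89 for Ada Lovelace, 90 for H100).

```python
REGISTRY = "ghcr.io/huggingface/text-embeddings-inference"

# Compute capability -> image tag, copied from the table above.
CUDA_TAGS = {
    75: "turing-1.7",  # experimental
    80: "1.7",
    86: "86-1.7",
    89: "89-1.7",
    90: "hopper-1.7",  # experimental
}


def image_for(compute_cap=None):
    """Return the image for a compute capability, or the CPU image for None."""
    if compute_cap is None:
        return f"{REGISTRY}:cpu-1.7"
    if compute_cap not in CUDA_TAGS:
        # Covers Volta (70), which the table marks as not supported.
        raise ValueError(f"no image listed for compute capability {compute_cap}")
    return f"{REGISTRY}:{CUDA_TAGS[compute_cap]}"
```

For example, `image_for(86)` gives the A10 image and `image_for()` gives the CPU image. Raising on unknown values avoids silently falling back to an image built for a different architecture.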

API documentation

You can consult the OpenAPI documentation of the text-embeddings-inference REST API using the /docs route. The Swagger UI is also available at: https://huggingface.github.io/text-embeddings-inference.

Using a private or gated model

You can use the HF_TOKEN environment variable to configure the token used by text-embeddings-inference. This gives you access to protected resources.

For example:

Go to https://huggingface.co/settings/tokens
Copy your cli READ token
Export HF_TOKEN=<your cli READ token>

or with Docker:

model=<your private model>
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>

docker run --gpus all -e HF_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model

Air gapped deployment

To deploy Text Embeddings Inference in an air-gapped environment, first download the weights and then mount them inside the container using a volume.

For example:

# (Optional) create a `models` directory
mkdir models
cd models

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/Qwen/Qwen3-Embedding-0.6B

# Set the models directory as the volume path
volume=$PWD

# Mount the models directory inside the container with a volume and set the model ID
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id /data/Qwen3-Embedding-0.6B

Using Re-rankers models

text-embeddings-inference v0.4.0 added support for CamemBERT, RoBERTa, XLM-RoBERTa, and GTE Sequence Classification models. Re-ranker models are Sequence Classification cross-encoder models with a single class that scores the similarity between a query and a text.

See this blogpost by the LlamaIndex team to understand how you can use re-ranker models in your RAG pipeline to improve downstream performance.

model=BAAI/bge-reranker-large
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model

And then you can rank the similarity between a query and a list of texts with:

curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query": "What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
    -H 'Content-Type: application/json'
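Once the response is parsed, the texts can be put in ranked order on the client. The sketch below assumes the /rerank response is a JSON list of objects with an `index` field (position in the submitted `texts` list) and a `score` field. That shape is an assumption to check against the /docs route of your deployment. `order_by_score` is a hypothetical helper, and the scores in the sample are made up for illustration.

```python
def order_by_score(texts, results):
    """Return (text, score) pairs sorted from most to least relevant."""
    ranked = sorted(results, key=lambda item: item["score"], reverse=True)
    return [(texts[item["index"]], item["score"]) for item in ranked]


texts = ["Deep Learning is not...", "Deep learning is..."]
# Illustrative response only: these scores are invented, not model output.
sample_results = [{"index": 0, "score": 0.1}, {"index": 1, "score": 0.9}]

for text, score in order_by_score(texts, sample_results):
    print(f"{score:.2f}  {text}")
```

Sorting on the client does not depend on the order in which the server returns results, so the helper works whether or not the response is already sorted.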

Using Sequence Classification models

You can also use classic Sequence Classification models like SamLowe/roberta-base-go_emotions:

model=SamLowe/roberta-base-go_emotions
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model

Once you have deployed the model, you can use the predict endpoint to get the emotions most associated with an input:

curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs":"I like you."}' \
    -H 'Content-Type: application/json'

Using SPLADE pooling

You can choose to activate SPLADE pooling for BERT and DistilBERT MaskedLM architectures:

model=naver/efficient-splade-VI-BT-large-query
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model --pooling splade

Once you have deployed the model, you can use the /embed_sparse endpoint to get the sparse embedding:

curl 127.0.0.1:8080/embed_sparse \
    -X POST \
    -d '{"inputs":"I like you."}' \
    -H 'Content-Type: application/json'
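A sparse embedding is usually easier to work with as a mapping from vocabulary index to weight. The sketch below assumes each embedding in the /embed_sparse response is a list of objects with `index` and `value` fields. That shape is an assumption to verify against the /docs route. The helper names are hypothetical, and the sample values are made up for illustration.

```python
def to_sparse_dict(sparse_embedding):
    """Convert a list of {"index", "value"} entries to an {index: value} dict."""
    return {entry["index"]: entry["value"] for entry in sparse_embedding}


def top_terms(sparse_embedding, k=3):
    """Return the k (index, value) pairs with the largest weights."""
    weights = to_sparse_dict(sparse_embedding)
    return sorted(weights.items(), key=lambda pair: pair[1], reverse=True)[:k]


# Illustrative entries only: indices and values are invented, not model output.
sample_embedding = [
    {"index": 2066, "value": 1.5},
    {"index": 1045, "value": 0.25},
    {"index": 2017, "value": 0.75},
]
print(top_terms(sample_embedding, k=2))
```

Only the non-zero entries are stored, so a dictionary keeps lookups cheap without materialising a vector the size of the vocabulary.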

Distributed Tracing

text-embeddings-inference is instrumented with distributed tracing using OpenTelemetry. You can use this feature by setting the address to an OTLP collector with the --otlp-endpoint argument.

gRPC

text-embeddings-inference offers a gRPC API as an alternative to the default HTTP API for high performance deployments. The API protobuf definition can be found here.

You can use the gRPC API by adding the -grpc tag to any TEI Docker image. For example:

model=Qwen/Qwen3-Embedding-0.6B
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7-grpc --model-id $model

grpcurl -d '{"inputs": "What is Deep Learning"}' -plaintext 0.0.0.0:8080 tei.v1.Embed/Embed

Local install

CPU

You can also opt to install text-embeddings-inference locally.

First, install Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Then run:

# On x86 with ONNX backend (recommended)
cargo install --path router -F ort

# On x86 with Intel backend
cargo install --path router -F mkl

# On M1 or M2
cargo install --path router -F metal

You can now launch Text Embeddings Inference on CPU with:

model=Qwen/Qwen3-Embedding-0.6B

text-embeddings-router --model-id $model --port 8080

Note: on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

sudo apt-get install libssl-dev gcc -y

CUDA

GPUs with CUDA compute capabilities < 7.5 are not supported (V100, Titan V, GTX 1000 series, ...).

Make sure you have CUDA and the NVIDIA drivers installed. NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher. You also need to add the NVIDIA binaries to your path:

export PATH=$PATH:/usr/local/cuda/bin

Then run:

# This can take a while as we need to compile a lot of cuda kernels

# On Turing GPUs (T4, RTX 2000 series ... )
cargo install --path router -F candle-cuda-turing -F http --no-default-features

# On Ampere and Hopper
cargo install --path router -F candle-cuda -F http --no-default-features

You can now launch Text Embeddings Inference on GPU with:

model=Qwen/Qwen3-Embedding-0.6B

text-embeddings-router --model-id $model --port 8080

Docker build

You can build the CPU container with:

docker build .

To build the CUDA containers, you need to know the compute cap of the GPU you will be using at runtime.

Then you can build the container with:

# Get submodule dependencies
git submodule update --init

# Example for Turing (T4, RTX 2000 series, ...)
runtime_compute_cap=75

# Example for A100
runtime_compute_cap=80

# Example for A10
runtime_compute_cap=86

# Example for Ada Lovelace (RTX 4000 series, ...)
runtime_compute_cap=89

# Example for H100
runtime_compute_cap=90

docker build . -f Dockerfile-cuda --build-arg CUDA_COMPUTE_CAP=$runtime_compute_cap

Apple M1/M2 arm64 architectures

DISCLAIMER

As explained here (MPS-Ready, ARM64 Docker Image), Metal / MPS is not supported via Docker. As such, inference will be CPU bound and most likely pretty slow when using this Docker image on an M1/M2 ARM CPU.

docker build . -f Dockerfile --platform=linux/arm64

Examples

About

Releases: 25 (latest: v1.7.4, Jul 7, 2025)
