# Ingest data with `semantic_text` fields
This page provides instructions for ingesting data into `semantic_text` fields. Learn how to index pre-chunked content, use `copy_to` and multi-fields to collect values from multiple fields, and perform full and partial updates to optimize ingestion costs.
## Index pre-chunked content

To index pre-chunked content, provide your text as an array of strings. Each element in the array represents a single chunk that will be sent directly to the inference service without further chunking.
### Disable automatic chunking
Disable automatic chunking in your index mapping by setting `chunking_settings.strategy` to `none`:

```console
PUT test-index
{
  "mappings": {
    "properties": {
      "my_semantic_field": {
        "type": "semantic_text",
        "chunking_settings": {
          "strategy": "none"
        }
      }
    }
  }
}
```

The `"strategy": "none"` setting disables automatic chunking on `my_semantic_field`.
### Index documents
Index documents with pre-chunked text as an array:
```console
PUT test-index/_doc/1
{
  "my_semantic_field": ["my first chunk", "my second chunk", ...]
  ...
}
```

The text is pre-chunked and provided as an array of strings. Each element represents a single chunk.
When providing pre-chunked input:

- Set the chunking strategy to `none` to avoid additional processing.
- Size each chunk carefully, staying within the token limit of the inference service and the underlying model.
- If a chunk exceeds the model's token limit, the behavior depends on the service:
  - Some services (such as OpenAI) return an error.
  - Others (such as `elastic` and `elasticsearch`) automatically truncate the input.
## Collect values from multiple fields

You can use a single `semantic_text` field to collect values from multiple fields for semantic search. The `semantic_text` field type can serve as the target of `copy_to` fields, be part of a multi-field structure, or contain multi-fields internally.
Use `copy_to` to copy values from source fields to a `semantic_text` field:

```console
PUT test-index
{
  "mappings": {
    "properties": {
      "source_field": {
        "type": "text",
        "copy_to": "infer_field"
      },
      "infer_field": {
        "type": "semantic_text",
        "inference_id": ".elser-2-elasticsearch"
      }
    }
  }
}
```
Declare `semantic_text` as a multi-field:

```console
PUT test-index
{
  "mappings": {
    "properties": {
      "source_field": {
        "type": "text",
        "fields": {
          "infer_field": {
            "type": "semantic_text",
            "inference_id": ".elser-2-elasticsearch"
          }
        }
      }
    }
  }
}
```
## Update documents with `semantic_text` fields

When updating documents that contain `semantic_text` fields, it's important to understand how inference is triggered:

- **Full document updates** re-run inference on all `semantic_text` fields, even if their values did not change. This ensures that embeddings remain consistent with the current document state, but can increase ingestion costs.
- **Partial updates using the Bulk API** reuse existing embeddings when you omit `semantic_text` fields. Inference does not run for omitted fields, which can significantly reduce processing time and cost.
- **Partial updates using the Update API** re-run inference on all `semantic_text` fields, even when you omit them from the `doc` object. Embeddings are re-generated regardless of whether field values changed (see the sketch after this list).
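To illustrate the Update API behavior, here is a hedged sketch (the document ID and `other_field` are hypothetical): even though the `doc` object below omits `my_semantic_field`, inference still re-runs on it.

```console
POST test-index/_update/1
{
  "doc": {
    "other_field": "new value"
  }
}
```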
To preserve existing embeddings and avoid unnecessary inference costs (see the example below):

- Use partial updates with the Bulk API.
- Omit any `semantic_text` fields that did not change from the `doc` object in your request.
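The following sketch shows this pattern (index name, document ID, and field names are illustrative). Because `my_semantic_field` is omitted from the `doc` object, its existing embeddings are reused and no inference runs for it:

```console
POST _bulk
{ "update": { "_index": "test-index", "_id": "1" } }
{ "doc": { "other_field": "updated value" } }
```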
## Scripted updates

For indices containing `semantic_text` fields, updates that use scripts have the following behavior:

- ✅ Supported: Update API
- ❌ Not supported: Bulk API. Scripted updates will fail even if the script targets non-`semantic_text` fields.
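A minimal sketch of a supported scripted update through the Update API (the document ID, field name, and value are hypothetical):

```console
POST test-index/_update/1
{
  "script": {
    "source": "ctx._source.other_field = params.value",
    "params": {
      "value": "updated"
    }
  }
}
```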