Input data format and structure

To build a new index or update an existing index, provide vectors to Vector Search in the format and structure described in the following sections.

Prerequisites

Store your input data in a Cloud Storage bucket in your Google Cloud project.

Input data files should be organized as follows:

  • Each batch of input data files should be under a single Cloud Storage directory.
  • Data files should be placed directly under batch_root and named with the following suffixes: .csv, .json, and .avro.
  • There is a limit of 5000 objects (files) in the batch root directory.
  • Each data file is interpreted as a set of records. The format of the record is determined by the suffix of the filename, and the requirements for each format are described in Data file formats.
  • Each record should have an id, a feature vector, and any optional fields supported by Vertex AI Feature Store, like restricts and crowding.
  • A subdirectory named delete may be present. Each file directly under batch_root/delete is taken as a text file of id records with one id in each line.
  • All other subdirectories are not allowed.
  • Transcoding of gzip-compressed files isn't supported as input data.
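
The layout rules above can be checked before upload. The following is a minimal sketch, not part of any Google Cloud library: check_batch_layout is a hypothetical helper that takes object paths relative to batch_root, and it assumes the 5000-object limit counts only files directly under batch_root.

```python
from pathlib import PurePosixPath

ALLOWED_SUFFIXES = {".csv", ".json", ".avro"}
MAX_ROOT_OBJECTS = 5000  # limit stated in the list above


def check_batch_layout(relative_paths):
    """Return a list of problems found in object paths relative to batch_root."""
    problems = []
    root_files = 0
    for raw in relative_paths:
        path = PurePosixPath(raw)
        if len(path.parts) == 1:
            # A data file placed directly under batch_root.
            root_files += 1
            if path.suffix not in ALLOWED_SUFFIXES:
                problems.append(f"{raw}: unsupported suffix")
        elif len(path.parts) == 2 and path.parts[0] == "delete":
            # A delete file: plain text with one ID per line.
            continue
        else:
            problems.append(f"{raw}: only the delete subdirectory is allowed")
    if root_files > MAX_ROOT_OBJECTS:
        # Assumption: the limit applies to files directly under batch_root.
        problems.append(f"{root_files} files exceed the limit of {MAX_ROOT_OBJECTS}")
    return problems


print(check_batch_layout(["feature_file_1.csv", "delete/delete_file.txt", "other/a.csv"]))
```

An empty list means no rule in the sketch was violated; it does not check file contents or gzip encoding.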

Input data processing

  • All records from all data files, including those under delete, constitute a single batch of input.
  • The relative ordering of records within a data file is not important.
  • A single ID should only appear once in a batch. If there is a duplicate with the same ID, it is counted as one vector.
  • An ID cannot appear both in a regular data file and a delete data file.
  • Every ID listed in a data file under delete is removed from the next index version.
  • Records from regular data files are included in the next version, overwriting any value for the same ID in an older index version.
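
The processing rules above can be modeled with a plain dictionary. This is an illustrative sketch only: apply_batch is a hypothetical function, the index is simplified to a mapping from ID to vector, and rejecting duplicate IDs up front is a choice made here that is stricter than the counting behavior described in the list.

```python
def apply_batch(index, upserts, delete_ids):
    """Return the next index version as a new dict mapping ID -> vector.

    index:      current index version, {id: vector}
    upserts:    list of (id, vector) pairs from regular data files
    delete_ids: IDs listed in files under delete/
    """
    upsert_ids = [record_id for record_id, _ in upserts]
    if len(set(upsert_ids)) != len(upsert_ids):
        # Stricter than the service: this sketch refuses duplicate IDs.
        raise ValueError("an ID appears more than once in the batch")
    both = set(upsert_ids) & set(delete_ids)
    if both:
        raise ValueError(f"IDs in both regular and delete files: {sorted(both)}")

    next_index = dict(index)  # leave the current version untouched
    for record_id in delete_ids:
        next_index.pop(record_id, None)  # removed from the next version
    for record_id, vector in upserts:
        next_index[record_id] = vector  # overwrites any older value
    return next_index


current = {"1": [1, 1, 1], "2": [2, 2, 2]}
print(apply_batch(current, [("2", [9, 9, 9]), ("3", [3, 3, 3])], ["1"]))
```

Because record order within a file does not matter, the sketch never depends on the order of upserts.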

The following are examples of dense, sparse, and hybrid embeddings:

  • Dense embeddings:

    {"id":"1","embedding":[1,1,1]}
    {"id":"2","embedding":[2,2,2]}
  • Sparse embeddings:

    {"id":"3","sparse_embedding":{"values":[0.1,0.2],"dimensions":[1,4]}}
    {"id":"4","sparse_embedding":{"values":[-0.4,0.2,-1.3],"dimensions":[10,20,20]}}
  • Hybrid embeddings:

    {"id":"5","embedding":[5,5,-5],"sparse_embedding":{"values":[0.1],"dimensions":[500]}}
    {"id":"6","embedding":[6,7,-8.1],"sparse_embedding":{"values":[0.1,-0.2],"dimensions":[40,901]}}
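
Records like these can be produced with the standard json module. The sketch below is one possible way to serialize them, with one JSON object per line as the JSON format rules for data files require; to_json_lines is a hypothetical helper name.

```python
import json

records = [
    {"id": "1", "embedding": [1, 1, 1]},
    {"id": "3", "sparse_embedding": {"values": [0.1, 0.2], "dimensions": [1, 4]}},
    {"id": "5", "embedding": [5, 5, -5],
     "sparse_embedding": {"values": [0.1], "dimensions": [500]}},
]


def to_json_lines(records):
    """Serialize records as one compact JSON object per line."""
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records) + "\n"


text = to_json_lines(records)
print(text, end="")
```

Writing text to a file opened with encoding="utf-8" would match the UTF-8 requirement for JSON data files.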

The following is an example of a valid input data file organization:

batch_root/
  feature_file_1.csv
  feature_file_2.csv
  delete/
    delete_file.txt

The feature_file_1.csv and feature_file_2.csv files contain records in CSV format. The delete_file.txt file contains a list of record IDs to be deleted from the next index version.

Data file formats

JSON

  • Encode the JSON file using UTF-8.
  • Each line of the JSON file will be interpreted as a separate JSON object.
  • Each record must contain an id field to specify the ID of the vector.
  • Each record must contain at least one of embedding or sparse_embedding.
  • The embedding field is an array of N floating point numbers that represents the feature vector, where N is the dimension of the feature vector that was configured when the index was created. This field can be used for dense embeddings only.
    • The length of the embedding array must equal configs.dimensions, which is specified at index creation time. configs.dimensions applies only to embedding, not to sparse_embedding.
  • The sparse_embedding field is an object with values and dimensions fields. The values field is a list of floating point numbers that represents the feature vector and the dimensions field is a list of integers that represent the dimension in which the corresponding value is located. For example, a sparse embedding that looks like [0,0.1,0,0,0.2] can be represented as "sparse_embedding": {"values": [0.1, 0.2], "dimensions": [1,4]}. This field can be used for sparse embeddings only.
    • The length of sparse_embedding.values must be the same length as sparse_embedding.dimensions. They don't need to be the same length as configs.dimensions, which is specified at index creation time and doesn't apply to sparse_embedding.
  • An optional restricts field can be included that specifies an array of TokenNamespace objects in restricts. For each object:
    • Specify a namespace field that is the TokenNamespace.namespace.
    • An optional allow field can be set to an array of strings which are the list of TokenNamespace.string_tokens.
    • An optional deny field can be set to an array of strings which are the list of TokenNamespace.string_blacklist_tokens.
    • The value of the field crowding_tag, if present, must be a string.
  • An optional numeric_restricts field can be included that specifies an array of NumericRestrictNamespace. For each object:
    • Specify a namespace field that is the NumericRestrictNamespace.namespace.
    • Specify one of the value fields: value_int, value_float, or value_double.
    • It must not have a field named op. This field is only for queries.
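
Several of the JSON rules above lend themselves to a quick pre-upload check. The following sketch is illustrative and not an official validator: dense_to_sparse and check_record are hypothetical helpers, index_dimensions stands in for configs.dimensions, and reading "one of the value fields" as "exactly one" is an assumption.

```python
def dense_to_sparse(vector):
    """Convert a dense list into the sparse_embedding object form."""
    pairs = [(i, v) for i, v in enumerate(vector) if v != 0]
    return {"values": [v for _, v in pairs], "dimensions": [i for i, _ in pairs]}


def check_record(record, index_dimensions):
    """Return a list of problems for one parsed JSON record."""
    problems = []
    if "id" not in record:
        problems.append("missing id")
    dense = record.get("embedding")
    sparse = record.get("sparse_embedding")
    if dense is None and sparse is None:
        problems.append("needs embedding or sparse_embedding")
    if dense is not None and len(dense) != index_dimensions:
        problems.append("embedding length differs from configs.dimensions")
    if sparse is not None and len(sparse["values"]) != len(sparse["dimensions"]):
        problems.append("sparse values and dimensions differ in length")
    if "crowding_tag" in record and not isinstance(record["crowding_tag"], str):
        problems.append("crowding_tag must be a string")
    for restrict in record.get("numeric_restricts", []):
        if "op" in restrict:
            problems.append("numeric_restricts must not contain op")
        present = [k for k in ("value_int", "value_float", "value_double") if k in restrict]
        if len(present) != 1:
            # Assumption: exactly one value field per numeric restrict.
            problems.append("numeric_restricts needs exactly one value field")
    return problems


print(dense_to_sparse([0, 0.1, 0, 0, 0.2]))
print(check_record({"id": "1", "embedding": [1, 1]}, index_dimensions=3))
```

The first call reproduces the [0,0.1,0,0,0.2] conversion described in the list; the second reports a length mismatch against a three-dimensional index.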

Avro

  • Use a valid Avro file.
  • To represent a sparse-only datapoint, provide a sparse embedding in the sparse_embedding field and enter an empty list in the embedding field.
  • Make records that conform to the following schema:

    {
      "type": "record",
      "name": "FeatureVector",
      "fields": [
        {"name": "id", "type": "string"},
        {"name": "embedding", "type": {"type": "array", "items": "float"}},
        {
          "name": "sparse_embedding",
          "type": [
            "null",
            {
              "type": "record",
              "name": "sparse_embedding",
              "fields": [
                {"name": "values", "type": {"type": "array", "items": "float"}},
                {"name": "dimensions", "type": {"type": "array", "items": "long"}}
              ]
            }
          ]
        },
        {
          "name": "restricts",
          "type": [
            "null",
            {
              "type": "array",
              "items": {
                "type": "record",
                "name": "Restrict",
                "fields": [
                  {"name": "namespace", "type": "string"},
                  {"name": "allow", "type": ["null", {"type": "array", "items": "string"}]},
                  {"name": "deny", "type": ["null", {"type": "array", "items": "string"}]}
                ]
              }
            }
          ]
        },
        {
          "name": "numeric_restricts",
          "type": [
            "null",
            {
              "type": "array",
              "items": {
                "name": "NumericRestrict",
                "type": "record",
                "fields": [
                  {"name": "namespace", "type": "string"},
                  {"name": "value_int", "type": ["null", "int"], "default": null},
                  {"name": "value_float", "type": ["null", "float"], "default": null},
                  {"name": "value_double", "type": ["null", "double"], "default": null}
                ]
              }
            }
          ],
          "default": null
        },
        {"name": "crowding_tag", "type": ["null", "string"]}
      ]
    }
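
In this schema, only numeric_restricts and the value_* fields declare a default, so it can help to set every top-level field explicitly before serializing. The sketch below builds such a record as a plain Python dict; to_avro_record is a hypothetical helper, and the actual Avro encoding (for example with a package such as fastavro or avro) is deliberately left out.

```python
def to_avro_record(record_id, embedding=None, sparse_embedding=None,
                   restricts=None, numeric_restricts=None, crowding_tag=None):
    """Build a dict with every top-level FeatureVector field set explicitly."""
    if embedding is None and sparse_embedding is None:
        raise ValueError("provide embedding, sparse_embedding, or both")
    return {
        "id": str(record_id),
        # A sparse-only datapoint carries an empty list in embedding.
        "embedding": list(embedding) if embedding is not None else [],
        "sparse_embedding": sparse_embedding,  # None corresponds to Avro null
        "restricts": restricts,
        "numeric_restricts": numeric_restricts,
        "crowding_tag": crowding_tag,
    }


sparse_only = to_avro_record(
    "3", sparse_embedding={"values": [0.1, 0.2], "dimensions": [1, 4]}
)
print(sparse_only)
```

Each entry in restricts would similarly need both allow and deny keys, since neither declares a default in the schema.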

CSV

  • Format: ID,N feature vector values,Any number of dimension:value sparse values,name=value lists
  • Encode the CSV file using UTF-8.
  • Each line of the CSV must contain exactly one record.
  • The first value in each line must be the vector ID, which must be a valid UTF-8 string.
  • Following the ID, at least one of dense embedding or sparse embedding must be specified.
  • For a dense embedding, the next N values represent the feature vector, where N is the dimension of the feature vector that was configured when the index was created.
  • For a sparse embedding, any number of dimension:value can be specified, in which value is parsed as a float and dimension is parsed as a long.
  • For a hybrid embedding that has both dense and sparse embeddings, dense embeddings must be specified before sparse embeddings.
  • Feature vector values must be floating point literals as defined in the Java language spec.
  • Additional values may be in the form name=value.
  • The name crowding_tag is interpreted as the crowding tag and may only appear once in the record.
  • All other name=value pairs are interpreted as token namespace restricts. The same name may be repeated if there are multiple values in a namespace.

    For example, color=red,color=blue represents this TokenNamespace:

    {
      "namespace": "color",
      "string_tokens": ["red", "blue"]
    }
  • If value starts with !, the rest of the string is interpreted as an excluded value.

    For example, color=!red represents this TokenNamespace:

    {
      "namespace": "color",
      "string_blacklist_tokens": ["red"]
    }
  • #name=numericValue pairs with a number type suffix are interpreted as numeric namespace restricts. The number type suffix is i for int, f for float, and d for double. The same name shouldn't be repeated, because each namespace should have a single associated value.

    For example, #size=3i represents this NumericRestrictNamespace:

    {
      "namespace": "size",
      "value_int": 3
    }

    #ratio=0.1f represents this NumericRestrictNamespace:

    {
      "namespace": "ratio",
      "value_float": 0.1
    }

    #weight=0.3d represents this NumericRestrictNamespace:

    {
      "namespace": "weight",
      "value_double": 0.3
    }
  • The following example is a datapoint with id: "6", embedding: [7,-8.1], sparse_embedding: {values: [0.1, -0.2, 0.5], dimensions: [40, 901, 1111]}, crowding tag test, token allowlist of color: red, blue, token denylist of color: purple, and numeric restrict of ratio with float 0.1:

    6,7,-8.1,40:0.1,901:-0.2,1111:0.5,crowding_tag=test,color=red,color=blue,color=!purple,#ratio=0.1f
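
The CSV rules above fix a field order: ID, dense values, dimension:value pairs, then name=value entries. As a rough sketch, a formatter for one record might look like the following; to_csv_line and its parameter shapes are hypothetical, numeric restricts are written with the # prefix described in the list, and no escaping is attempted for tokens that contain commas.

```python
def to_csv_line(record_id, embedding=(), sparse=None, crowding_tag=None,
                allow=None, deny=None, numeric=None):
    """Format one datapoint as a CSV record.

    sparse:  {"values": [...], "dimensions": [...]}
    allow:   {namespace: [token, ...]}
    deny:    {namespace: [token, ...]}
    numeric: {namespace: (value, suffix)} with suffix "i", "f", or "d"
    """
    fields = [str(record_id)]
    fields += [repr(v) for v in embedding]  # dense values come first
    if sparse:
        fields += [f"{d}:{v!r}" for d, v in zip(sparse["dimensions"], sparse["values"])]
    if crowding_tag is not None:
        fields.append(f"crowding_tag={crowding_tag}")
    for namespace, tokens in (allow or {}).items():
        fields += [f"{namespace}={t}" for t in tokens]
    for namespace, tokens in (deny or {}).items():
        fields += [f"{namespace}=!{t}" for t in tokens]  # "!" marks a denied token
    for namespace, (value, suffix) in (numeric or {}).items():
        fields.append(f"#{namespace}={value!r}{suffix}")
    return ",".join(fields)


line = to_csv_line(
    "6", [7, -8.1],
    sparse={"values": [0.1, -0.2, 0.5], "dimensions": [40, 901, 1111]},
    crowding_tag="test",
    allow={"color": ["red", "blue"]},
    deny={"color": ["purple"]},
    numeric={"ratio": (0.1, "f")},
)
print(line)
```

Using a dict for numeric keeps one value per namespace, which matches the rule that a numeric restrict name shouldn't be repeated.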

Last updated 2026-02-19 UTC.