Tabular Data#

While arrays (also known as ValueVector) represent a one-dimensional sequence of homogeneous values, data often comes in the form of two-dimensional sets of heterogeneous data (such as database tables, CSV files…). Arrow provides several abstractions to handle such data conveniently and efficiently.

Fields#

Fields are used to denote the particular columns of tabular data. A field, i.e. an instance of Field, holds together a field name, a data type, and some optional key-value metadata.

    // Create a column "document" of string type with metadata
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.arrow.vector.types.pojo.ArrowType;
    import org.apache.arrow.vector.types.pojo.Field;
    import org.apache.arrow.vector.types.pojo.FieldType;

    Map<String, String> metadata = new HashMap<>();
    metadata.put("A", "Id card");
    metadata.put("B", "Passport");
    metadata.put("C", "Visa");
    // The field is nullable, has no dictionary encoding and no child fields
    Field document = new Field("document",
        new FieldType(true, new ArrowType.Utf8(), /*dictionary*/ null, metadata),
        /*children*/ null);
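
Once a field exists, its parts can be read back through its accessors. The following is a minimal sketch that rebuilds the "document" field from the example above and prints its name, type, nullability, and metadata. The class name `FieldInspection` and the helper `buildDocumentField` are illustrative names, not part of Arrow.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

class FieldInspection {
    // Hypothetical helper: rebuilds the nullable "document" string field shown above.
    static Field buildDocumentField() {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("A", "Id card");
        metadata.put("B", "Passport");
        metadata.put("C", "Visa");
        return new Field("document",
            new FieldType(true, new ArrowType.Utf8(), /*dictionary*/ null, metadata),
            /*children*/ null);
    }

    public static void main(String[] args) {
        Field document = buildDocumentField();
        // Each accessor returns one of the parts the field holds together.
        System.out.println("name:     " + document.getName());
        System.out.println("type:     " + document.getType());
        System.out.println("nullable: " + document.isNullable());
        System.out.println("metadata: " + document.getMetadata());
    }
}
```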

Schemas#

A Schema describes the overall structure consisting of any number of columns. It holds a sequence of fields together with some optional schema-wide metadata (in addition to per-field metadata).

    // Create a schema describing datasets with two columns:
    // an int32 column "A" and a utf8-encoded string column "B"
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.arrow.vector.types.pojo.ArrowType;
    import org.apache.arrow.vector.types.pojo.Field;
    import org.apache.arrow.vector.types.pojo.FieldType;
    import org.apache.arrow.vector.types.pojo.Schema;
    import static java.util.Arrays.asList;

    // Schema-wide metadata
    Map<String, String> metadata = new HashMap<>();
    metadata.put("K1", "V1");
    metadata.put("K2", "V2");
    // Int(32, true) is a signed 32-bit integer; neither field has children
    Field a = new Field("A", FieldType.nullable(new ArrowType.Int(32, true)), null);
    Field b = new Field("B", FieldType.nullable(new ArrowType.Utf8()), null);
    Schema schema = new Schema(asList(a, b), metadata);
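
A schema can be queried for its fields and metadata, and serialized to JSON and back. The sketch below rebuilds the two-column schema from the example above, looks up a field by name, and checks that a JSON round trip yields an equal schema. The class name `SchemaInspection` and its helper methods are illustrative names, not part of Arrow.

```java
import static java.util.Arrays.asList;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

class SchemaInspection {
    // Hypothetical helper: rebuilds the schema with columns "A" (int32) and "B" (utf8).
    static Schema buildSchema() {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("K1", "V1");
        metadata.put("K2", "V2");
        Field a = new Field("A", FieldType.nullable(new ArrowType.Int(32, true)), null);
        Field b = new Field("B", FieldType.nullable(new ArrowType.Utf8()), null);
        return new Schema(asList(a, b), metadata);
    }

    // Hypothetical helper: serializes the schema to JSON and parses it back.
    static Schema roundTrip(Schema schema) {
        try {
            return Schema.fromJSON(schema.toJson());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Schema schema = buildSchema();
        // Fields are kept in the order in which they were passed to the constructor.
        for (Field field : schema.getFields()) {
            System.out.println(field.getName() + " -> " + field.getType());
        }
        // findField throws IllegalArgumentException if no field has the given name.
        System.out.println("B is nullable: " + schema.findField("B").isNullable());
        System.out.println("schema metadata: " + schema.getCustomMetadata());
        System.out.println("round trip equal: " + roundTrip(schema).equals(schema));
    }
}
```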

VectorSchemaRoot#

A VectorSchemaRoot is a container for batches of data. Batches flow through VectorSchemaRoot as part of a pipeline.

Note

VectorSchemaRoot is somewhat analogous to tables or record batches in the other Arrow implementations in that they are all 2D datasets, but their usage is different.

The recommended usage is to create a single VectorSchemaRoot based on a known schema and populate data over and over into that root in a stream of batches, rather than creating a new instance each time (see Flight or ArrowFileWriter as examples). Thus at any one point, a VectorSchemaRoot may have data or may have no data (say it was transferred downstream or not yet populated).
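
The reuse pattern described above can be sketched as follows: one root is created from a schema and refilled for every batch. This is a minimal sketch, not a full pipeline. The class name `ReusedRoot`, the method `streamBatches`, and the single "id" column are assumptions made for illustration, and the point where a real pipeline would hand the batch to a writer is only marked with a comment.

```java
import java.util.Collections;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

class ReusedRoot {
    // Hypothetical helper: pushes `batches` batches of `rowsPerBatch` rows through
    // a single root and returns the total number of rows that passed through it.
    static int streamBatches(int batches, int rowsPerBatch) {
        Schema schema = new Schema(Collections.singletonList(
            Field.nullable("id", new ArrowType.Int(32, true))));
        int totalRows = 0;
        try (BufferAllocator allocator = new RootAllocator();
             VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
            IntVector ids = (IntVector) root.getVector("id");
            for (int batch = 0; batch < batches; batch++) {
                // Re-allocate the root's vectors; the row count starts again at zero.
                root.allocateNew();
                for (int i = 0; i < rowsPerBatch; i++) {
                    ids.setSafe(i, batch * rowsPerBatch + i);
                }
                root.setRowCount(rowsPerBatch);
                // A real pipeline would pass the populated root downstream here.
                totalRows += root.getRowCount();
            }
        }
        return totalRows;
    }

    public static void main(String[] args) {
        System.out.println("rows streamed: " + streamBatches(3, 4));
    }
}
```

Closing the root before the allocator, as the try-with-resources statement does here, lets the allocator check on close that no buffers were leaked.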

Here is an example of creating a VectorSchemaRoot:

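    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.arrow.vector.BitVector;
    import org.apache.arrow.vector.FieldVector;
    import org.apache.arrow.vector.VarCharVector;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.types.pojo.Field;

    // "allocator" is assumed to be an existing BufferAllocator
    BitVector bitVector = new BitVector("boolean", allocator);
    VarCharVector varCharVector = new VarCharVector("varchar", allocator);
    bitVector.allocateNew();
    varCharVector.allocateNew();
    // Fill both vectors with ten values
    for (int i = 0; i < 10; i++) {
        bitVector.setSafe(i, i % 2 == 0 ? 0 : 1);
        varCharVector.setSafe(i, ("test" + i).getBytes(StandardCharsets.UTF_8));
    }
    bitVector.setValueCount(10);
    varCharVector.setValueCount(10);

    // Build the root from the vectors and their fields
    List<Field> fields = Arrays.asList(bitVector.getField(), varCharVector.getField());
    List<FieldVector> vectors = Arrays.asList(bitVector, varCharVector);
    VectorSchemaRoot vectorSchemaRoot = new VectorSchemaRoot(fields, vectors);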

Data can be loaded into/unloaded from a VectorSchemaRoot via VectorLoader and VectorUnloader. They handle converting between VectorSchemaRoot and ArrowRecordBatch (a representation of a RecordBatch IPC message). For example:

    // create a VectorSchemaRoot root1 and convert its data into recordBatch
    VectorSchemaRoot root1 = new VectorSchemaRoot(fields, vectors);
    VectorUnloader unloader = new VectorUnloader(root1);
    ArrowRecordBatch recordBatch = unloader.getRecordBatch();

    // create a VectorSchemaRoot root2 and load the recordBatch
    VectorSchemaRoot root2 = VectorSchemaRoot.create(root1.getSchema(), allocator);
    VectorLoader loader = new VectorLoader(root2);
    loader.load(recordBatch);
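
The roots and the record batch hold off-heap memory, so in practice they are usually closed once they are no longer needed. The following self-contained sketch performs the same unload/load round trip inside try-with-resources blocks and reads the values back from the second root. The class name `BatchRoundTrip`, the method `copyThroughRecordBatch`, and the single "id" column are illustrative assumptions.

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;

class BatchRoundTrip {
    // Hypothetical helper: moves three int values from one root to another
    // through an ArrowRecordBatch and returns what the second root contains.
    static int[] copyThroughRecordBatch() {
        try (BufferAllocator allocator = new RootAllocator()) {
            IntVector ids = new IntVector("id", allocator);
            ids.allocateNew(3);
            for (int i = 0; i < 3; i++) {
                ids.set(i, i * 10);
            }
            ids.setValueCount(3);

            // root1 wraps the populated vector; root2 starts empty with the same schema.
            try (VectorSchemaRoot root1 = VectorSchemaRoot.of(ids);
                 VectorSchemaRoot root2 =
                     VectorSchemaRoot.create(root1.getSchema(), allocator)) {
                VectorUnloader unloader = new VectorUnloader(root1);
                // The record batch is closed as soon as it has been loaded into root2.
                try (ArrowRecordBatch recordBatch = unloader.getRecordBatch()) {
                    new VectorLoader(root2).load(recordBatch);
                }
                IntVector copy = (IntVector) root2.getVector("id");
                int[] values = new int[root2.getRowCount()];
                for (int i = 0; i < values.length; i++) {
                    values[i] = copy.get(i);
                }
                return values;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(copyThroughRecordBatch()));
    }
}
```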

A new VectorSchemaRoot can be sliced from an existing root without copying data:

    // 0 is the start index (inclusive) and 5 is the length, i.e. the number of rows in the slice.
    VectorSchemaRoot newRoot = vectorSchemaRoot.slice(0, 5);
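
To make the meaning of the two arguments concrete, here is a small sketch that slices a ten-row root and reads the slice back. With a start index of 2 and a length of 5, the slice holds the rows at indices 2 through 6 of the original root. The class name `SliceExample`, the method `sliceValues`, and the "id" column are illustrative assumptions.

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;

class SliceExample {
    // Hypothetical helper: builds a root whose "id" column holds 0..9,
    // slices it, and returns the values found in the slice.
    static int[] sliceValues(int start, int length) {
        try (BufferAllocator allocator = new RootAllocator()) {
            IntVector ids = new IntVector("id", allocator);
            ids.allocateNew(10);
            for (int i = 0; i < 10; i++) {
                ids.set(i, i);
            }
            ids.setValueCount(10);

            // The slice is a root of its own and is closed separately.
            try (VectorSchemaRoot root = VectorSchemaRoot.of(ids);
                 VectorSchemaRoot slice = root.slice(start, length)) {
                IntVector slicedIds = (IntVector) slice.getVector("id");
                int[] values = new int[slice.getRowCount()];
                for (int i = 0; i < values.length; i++) {
                    values[i] = slicedIds.get(i);
                }
                return values;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(sliceValues(2, 5)));
    }
}
```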

Table#

A Table is an immutable tabular data structure, very similar to VectorSchemaRoot, in that it is also built on ValueVectors and schemas. Unlike VectorSchemaRoot, Table is not designed for batch processing. Here is a version of the example above, showing how to create a Table, rather than a VectorSchemaRoot:

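    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.arrow.vector.BitVector;
    import org.apache.arrow.vector.FieldVector;
    import org.apache.arrow.vector.VarCharVector;
    import org.apache.arrow.vector.table.Table;

    // "allocator" is assumed to be an existing BufferAllocator
    BitVector bitVector = new BitVector("boolean", allocator);
    VarCharVector varCharVector = new VarCharVector("varchar", allocator);
    bitVector.allocateNew();
    varCharVector.allocateNew();
    // Fill both vectors with ten values
    for (int i = 0; i < 10; i++) {
        bitVector.setSafe(i, i % 2 == 0 ? 0 : 1);
        varCharVector.setSafe(i, ("test" + i).getBytes(StandardCharsets.UTF_8));
    }
    bitVector.setValueCount(10);
    varCharVector.setValueCount(10);

    // Build the table directly from the vectors
    List<FieldVector> vectors = Arrays.asList(bitVector, varCharVector);
    Table table = new Table(vectors);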
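
Since a Table is immutable, typical usage is to read from it. The sketch below builds a small table and iterates over its rows with the Row cursor. The class name `TableRows`, the method `readRows`, and the column names "id" and "name" are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.table.Row;
import org.apache.arrow.vector.table.Table;

class TableRows {
    // Hypothetical helper: builds a three-row table and returns each row
    // rendered as "id:name".
    static List<String> readRows() {
        try (BufferAllocator allocator = new RootAllocator()) {
            IntVector ids = new IntVector("id", allocator);
            VarCharVector names = new VarCharVector("name", allocator);
            ids.allocateNew(3);
            names.allocateNew();
            for (int i = 0; i < 3; i++) {
                ids.setSafe(i, i);
                names.setSafe(i, ("test" + i).getBytes(StandardCharsets.UTF_8));
            }
            ids.setValueCount(3);
            names.setValueCount(3);

            List<FieldVector> vectors = Arrays.asList(ids, names);
            List<String> rows = new ArrayList<>();
            // Closing the table releases the memory it holds.
            try (Table table = new Table(vectors)) {
                // Iterating moves a Row cursor over the table, one row at a time.
                for (Row row : table) {
                    rows.add(row.getInt("id") + ":" + row.getVarCharObj("name"));
                }
            }
            return rows;
        }
    }

    public static void main(String[] args) {
        readRows().forEach(System.out::println);
    }
}
```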

See the Table documentation for more information.
