Table #

NOTE: The Table API is experimental and subject to change. See the list of limitations below.

Table is an immutable tabular data structure based onFieldVector. LikeVectorSchemaRoot,Table is a columnar data structure backed by Arrow arrays, or more specifically, byFieldVector objects. It differs fromVectorSchemaRoot mainly in that it is fully immutable and lacks support for batch operations. Anyone processing batches of tabular data in a pipeline should continue to useVectorSchemaRoot. Finally, theTable API is mainly row-oriented, so in some ways it’s more like the JDBC API than theVectorSchemaRoot API, but you can still useFieldReaders to work with data in a columnar fashion.

Mutation in Table and VectorSchemaRoot#

VectorSchemaRoot provides a thin wrapper on the vectors that hold its data. Individual vectors can be retrieved from a vector schema root. These vectors havesetters for modifying their elements, makingVectorSchemaRoot immutable only by convention. The protocol for mutating a vector is documented in theValueVector interface:

values need to be written in order (e.g. index 0, 1, 2, 5)
null vectors start with all values as null before writing anything
for variable width types, the offset vector should be all zeros before writing
you must call setValueCount before a vector can be read
you should never write to a vector once it has been read.

The rules aren’t enforced by the API so the programmer is responsible for ensuring that they are followed. Failure to do so could lead to runtime exceptions.

Table, on the other hand, is immutable. The underlying vectors are not exposed. When a table is created from existing vectors, their memory is transferred to new vectors, so subsequent changes to the original vectors can’t impact the new table’s values.

Features and limitations#

A basic set of table functionality is currently available:

Create a table from vectors orVectorSchemaRoot
Iterate tables by row, or set the current row index directly
Access vector values as primitives, objects, and/or nullableValueHolder instances (depending on type)
Get aFieldReader for any vector
Add and remove vectors, creating new tables
Encode and decode a table’s vectors using dictionary encoding
Export table data for use by native code
Print representative data to TSV strings
Get a table’s schema
Slice tables
Convert table toVectorSchemaRoot

Limitations in the 11.0.0 release:

No supportChunkedArray or any form of row-group. Support for chunked arrays or row groups will be considered for a future release.
No support for the C-Stream API. Support for the streaming API is contingent on chunked array support
No support for creating tables directly from Java POJOs. All data held by a table must be imported via aVectorSchemaRoot, or from collections or arrays of vectors.

The Table API#

LikeVectorSchemaRoot, a table contains aSchema and an ordered collection ofFieldVector objects, but it is designed to be accessed via a row-oriented interface.

Creating a Table from a VectorSchemaRoot#

Tables are created from aVectorSchemaRoot as shown below. The memory buffers holding the data are transferred from the vector schema root to new vectors in the new table, clearing the source vectors in the process. This ensures that the data in your new table is never changed. Since the buffers are transferred rather than copied, this is a very low overhead operation.

Tablet=newTable(someVectorSchemaRoot);

If you now update the vectors held by theVectorSchemaRoot (using some version ofValueVector#setSafe()), it would reflect those changes, but the values in tablet are unchanged.

Creating a Table from FieldVectors#

Tables can be created fromFieldVectors as shown below, using ‘var-arg’ array arguments:

IntVectormyVector=createMyIntVector();VectorSchemaRootvsr1=newVectorSchemaRoot(myVector);

or by passing a collection:

IntVectormyVector=createMyIntVector();List<FieldVector>fvList=List.of(myVector);VectorSchemaRootvsr1=newVectorSchemaRoot(fvList);

It is rarely a good idea to share vectors between multiple vector schema roots, and it would not be a good idea to share them between vector schema roots and tables. Creating aVectorSchemaRoot from a list of vectors does not cause the reference counts for the vectors to be incremented. Unless you manage the counts manually, the code below would lead to more references than reference counts, and that could lead to trouble. There is an implicit assumption that the vectors were created for use byoneVectorSchemaRoot that this code violates.

Don’t do this:

IntVectormyVector=createMyIntVector();// Reference count for myVector = 1VectorSchemaRootvsr1=newVectorSchemaRoot(myVector);// Still one referenceVectorSchemaRootvsr2=newVectorSchemaRoot(myVector);// Ref count is still one, but there are two VSRs with a reference to myVectorvsr2.clear();// Reference count for myVector is 0.

What is happening is that the reference counter works at a lower level than theVectorSchemaRoot interface. A reference counter counts references toArrowBuf instances that control memory buffers. It doesn’t count references to the vectors that hold those ArrowBufs. In the example above, eachArrowBuf is held by one vector, so there is only one reference. This distinction is blurred when you call theVectorSchemaRoot’s clear() method, which frees the memory held by each of the vectors it references even though another instance references the same vectors.

When you create tables from vectors, it’s assumed that there are no external references to those vectors. To be certain, the buffers underlying these vectors are transferred to new vectors in the new table, and the original vectors are cleared.

Don’t do this either, but note the difference from above:

IntVectormyVector=createMyIntVector();// Reference count for myVector = 1Tablet1=newTable(myVector);// myVector is cleared; Table t1 has a new hidden vector with the data from myVectorTablet2=newTable(myVector);// t2 has no rows because myVector was just cleared// t1 continues to have the data from the original vectort2.clear();// no change because t2 is already empty and t1 is independent

With tables, memory is explicitly transferred on instantiation so the buffers held by a table are held byonly that table.

Creating Tables with dictionary-encoded vectors#

Another point of difference is thatVectorSchemaRoot is uninformed about any dictionary-encoding of its vectors, while tables hold an optionalDictionaryProvider instance. If any vectors in the source data are encoded, a DictionaryProvider must be set to un-encode the values.

VectorSchemaRootvsr=myVsr();DictionaryProviderprovider=myProvider();Tablet=newTable(vsr,provider);

InTable, dictionaries are used like they are with vectors. To decode a vector, the user provides the name of the vector to decode and the dictionary id:

Tablet=newTable(vsr,provider);ValueVectordecodedName=t.decode("name",1L);

To encode a vector from a table, a similar approach is used:

Tablet=newTable(vsr,provider);ValueVectorencodedName=t.encode("name",1L);

Freeing memory explicitly#

Tables use off-heap memory that must be freed when it is no longer needed.Table implementsAutoCloseable so the best way to create one is in a try-with-resources block:

try(VectorSchemaRootvsr=myMethodForGettingVsrs();Tablet=newTable(vsr)){// do useful things.}

If you don’t use a try-with-resources block, you must close the table manually:

try{VectorSchemaRootvsr=myMethodForGettingVsrs();Tablet=newTable(vsr);// do useful things.}finally{vsr.close();t.close();}

Manual closing should be performed in a finally block.

Getting the schema#

You get the table’s schema just as you would with a vector schema root:

Schemas=table.getSchema();

Adding and removing vectors#

Table provides facilities for adding and removing vectors modeled on the same functionality inVectorSchemaRoot. These operations return new instances rather than modifying the original instance in-place.

try(Tablet=newTable(vectorList)){IntVectorv3=newIntVector("3",intFieldType,allocator);Tablet2=t.addVector(2,v3);Tablet3=t2.removeVector(1);// don't forget to close t2 and t3}

Slicing tables#

Table supportsslice() operations, where a slice of a source table is a second Table that refers to a single, contiguous range of rows in the source.

try(Tablet=newTable(vectorList)){Tablet2=t.slice(100,200);// creates a slice referencing the values in range (100, 200]...}

This raises the question: If you create a slice withall the values in the source table (as shown below), how would that differ from a new Table constructed with the same vectors as the source?

try(Tablet=newTable(vectorList)){Tablet2=t.slice(0,t.getRowCount());// creates a slice referencing all the values in t// ...}

The difference is that when youconstruct a new table, the buffers are transferred from the source vectors to new vectors in the destination. With a slice, both tables share the same underlying vectors. That’s OK, though, since both tables are immutable.

Using FieldReaders#

You can get aFieldReader for any vector in the Table passing either theField, vector index, or vector name as an argument. The signatures are the same as inVectorSchemaRoot.

FieldReadernameReader=table.getReader("user_name");

Row operations#

Row-based access is supported by theRow object.Row providesget() methods by both vector name and vector position, but noset() operations.

It is important to recognize that rows are NOT reified as objects, but rather operate like a cursor where the data from numerous logical rows in the table can be viewed (one at a time) using the sameRow instance. See “Moving from row-to-row” below for information about navigating through the table.

Getting a row#

CallingimmutableRow() on any table instance returns a newRow instance.

Rowr=table.immutableRow();

Moving from row-to-row#

Since rows are iterable, you can traverse a table using a standard while loop:

Rowr=table.immutableRow();while(r.hasNext()){r.next();// do something useful here}

Table implementsIterable<Row> so you can access rows directly from a table in an enhancedfor loop:

for(Rowrow:table){intage=row.getInt("age");booleannameIsNull=row.isNull("name");...}

Finally, while rows are usually iterated in the order of the underlying data vectors, but they are also positionable using theRow#setPosition() method, so you can skip to a specific row. Row numbers are 0-based.

Rowr=table.immutableRow();intage101=r.setPosition(101);// change position directly to 101

Any changes to position are applied to all the columns in the table.

Note that you must callnext(), orsetPosition() before accessing values via a row. Failure to do so results in a runtime exception.

Read operations using rows#

Methods are available for getting values by vector name and vector index, where index is the 0-based position of the vector in the table. For example, assuming ‘age’ is the 13th vector in ‘table’, the following two gets are equivalent:

Rowr=table.immutableRow();r.next();// position the row at the first valueintage1=r.get("age");// gets the value of vector named 'age' in the table at row 0intage2=r.get(12);// gets the value of the 13th vector in the table at row 0

You can also get value using a nullableValueHolder. For example:

NullableIntHolderholder=newNullableIntHolder();intb=row.getInt("age",holder);

This can be used to retrieve values without creating a new Object for each.

In addition to getting values, you can check if a value is null usingisNull(). This is important if the vector contains any nulls, as asking for a value from a vector can cause NullPointerExceptions in some cases.

booleanname0isNull=row.isNull("name");

You can also get the current row number:

introw=row.getRowNumber();

Reading values as Objects#

For any given vector type, the basicget() method returns a primitive value wherever possible. For example,getTimeStampMicro() returns a long value that encodes the timestamp. To get the LocalDateTime object representing that timestamp in Java, another method with ‘Obj’ appended to the name is provided. For example:

longts=row.getTimeStampMicro();LocalDateTimetsObject=row.getTimeStampMicroObj();

The exception to this naming scheme is for complex vector types (List, Map, Schema, Union, DenseUnion, and ExtensionType). These always return objects rather than primitives so no “Obj” extension is required. It is expected that some users may subclassRow to add getters that are more specific to their needs.

Reading VarChars and LargeVarChars#

Strings in arrow are represented as byte arrays encoded with the UTF-8 charset. You can get either a String result or the actual byte array.

byte[]b=row.getVarChar("first_name");Strings=row.getVarCharObj("first_name");// uses the default encoding (UTF-8)

Converting a Table to a VectorSchemaRoot#

Tables can be converted to vector schema roots using thetoVectorSchemaRoot() method. Buffers are transferred to the vector schema root and the source table is cleared.

VectorSchemaRootroot=myTable.toVectorSchemaRoot();

Working with the C-Data interface#

The ability to work with native code is required for many Arrow features. This section describes how tables can be be exported for use with native code

Exporting works by converting the data to aVectorSchemaRoot instance and using the existing facilities to transfer the data. You could do it yourself, but that isn’t ideal because conversion to a vector schema root breaks the immutability guarantees. Using theexportTable() methods in theData class avoids this concern.

Data.exportTable(bufferAllocator,table,dictionaryProvider,outArrowArray);

If the table contains dictionary-encoded vectors and was constructed with aDictionaryProvider, the provider argument toexportTable() can be omitted and the table’s provider attribute will be used:

Data.exportTable(bufferAllocator,table,outArrowArray);

On this page

Edit on GitHub

Movatterモバイル変換

Table#