Integration Testing#

To ensure Arrow implementations are interoperable with each other, the Arrow project includes cross-language integration tests which are regularly run as Continuous Integration tasks.

The integration tests exercise compliance with several Arrow specifications: the IPC format, the Flight RPC protocol, and the C Data Interface.

Strategy#

Our strategy for integration testing between Arrow implementations is:

  • Test datasets are specified in a custom human-readable, JSON-based format designed exclusively for Arrow’s integration tests.

  • The JSON files are generated by the integration test harness. Different files are used to represent different data types and features, such as numerics, lists, dictionary encoding, etc. This makes it easier to pinpoint incompatibilities than if all data types were represented in a single file.

  • Each implementation provides entry points capable of converting between the JSON and the Arrow in-memory representation, and of exposing Arrow in-memory data using the desired format.

  • Each format (whether Arrow IPC, Flight or the C Data Interface) is tested for all supported pairs of (producer, consumer) implementations. The producer typically reads a JSON file, converts it to in-memory Arrow data, and exposes this data using the format under test. The consumer reads the data in said format and converts it back to Arrow in-memory data; it also reads the same JSON file as the producer, and validates that both datasets are identical.

Example: IPC format#

Let’s say we are testing Arrow C++ as a producer and Arrow Java as a consumer of the Arrow IPC format. Testing a JSON file would go as follows:

  1. A C++ executable reads the JSON file, converts it into Arrow in-memory data and writes an Arrow IPC file (the file paths are typically given on the command line).

  2. A Java executable reads the JSON file, converts it into Arrow in-memory data; it also reads the Arrow IPC file generated by C++. Finally, it validates that both Arrow in-memory datasets are equal.
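In shell terms, the flow for this pair might look like the sketch below. The executable names and flags here are purely illustrative, not the actual entry points; in practice the Archery runner (described later) wires up the real ones for you.

# Hypothetical producer step (C++): JSON -> Arrow IPC file
cpp-json-to-ipc --json generated_primitive.json --arrow /tmp/primitive.arrow

# Hypothetical consumer step (Java): read the IPC file, re-read the JSON,
# and compare the two in-memory datasets
java -jar java-ipc-validator.jar --json generated_primitive.json --arrow /tmp/primitive.arrow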

Example: C Data Interface#

Now, let’s say we are testing Arrow Go as a producer and Arrow C# as a consumer of the Arrow C Data Interface.

  1. The integration testing harness allocates a C ArrowArray structure on the heap.

  2. A Go in-process entrypoint (for example a C-compatible function call) reads a JSON file and exports one of its record batches into the ArrowArray structure.

  3. A C# in-process entrypoint reads the same JSON file, converts the same record batch into Arrow in-memory data; it also imports the record batch exported by Arrow Go in the ArrowArray structure. It validates that both record batches are equal, and then releases the imported record batch.

  4. Depending on the implementation languages’ abilities, the integration testing harness may assert that memory consumption remained identical (i.e., that the exported record batch didn’t leak).

  5. At the end, the integration testing harness deallocates the ArrowArray structure.
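For reference, the ArrowArray struct exchanged in these steps has the layout mandated by the C Data Interface. The following minimal C sketch shows the ownership protocol with toy producer/consumer stand-ins (the real entry points live inside the implementations under test); only the struct definition itself is normative here.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* ArrowArray exactly as defined by the Arrow C Data Interface. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Toy release callback: a real producer would free its buffers here. */
static void release_exported(struct ArrowArray* array) {
  array->release = NULL;  /* mark as released, per the spec */
}

/* Toy producer: exports a zero-row array.  A real producer would fill
   buffers/children from the JSON-decoded record batch. */
static void producer_export(struct ArrowArray* out) {
  *out = (struct ArrowArray){0};
  out->release = release_exported;
}

/* Toy consumer: "validates" the data, then calls release exactly once. */
static int consumer_import_and_check(struct ArrowArray* in) {
  int ok = (in->length == 0);  /* stand-in for comparing against the JSON */
  in->release(in);
  return ok;
}

int main(void) {
  /* Step 1: the harness allocates the structure on the heap. */
  struct ArrowArray* c_array = calloc(1, sizeof(*c_array));

  /* Steps 2-3: producer exports into it; consumer imports, validates,
     and releases the imported record batch. */
  producer_export(c_array);
  int ok = consumer_import_and_check(c_array);

  /* Step 4 (leak checking) is language-specific and omitted here. */

  /* Step 5: the harness deallocates the structure itself. */
  free(c_array);
  printf(ok ? "OK\n" : "MISMATCH\n");
  return ok ? 0 : 1;
}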

Running integration tests#

The integration test data generator and runner are implemented inside the Archery utility. You need to install the integration component of archery:

$ pip install -e "dev/archery[integration]"

The integration tests are run using the archery integration command.

$ archery integration --help

In order to run integration tests, you’ll first need to build each component you want to include. See the respective developer docs for C++, Java, etc. for instructions on building those.

Some languages may require additional build options to enable integration testing. For C++, for example, you need to add -DARROW_BUILD_INTEGRATION=ON to your cmake command.
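For instance, a C++ build with the integration entry points enabled could look like the following (a minimal sketch, assuming you are at the root of an Arrow checkout; any other options you normally use still apply):

$ cmake -S cpp -B cpp/build -DARROW_BUILD_INTEGRATION=ON
$ cmake --build cpp/build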

Depending on which components you have built, you can enable and add them to the archery test run. For example, if you only have the C++ project built and want to run the Arrow IPC integration tests, run:

archery integration --run-ipc --with-cpp=1

For Java, it may look like:

VERSION=14.0.0-SNAPSHOT
export ARROW_JAVA_INTEGRATION_JAR=$JAVA_DIR/tools/target/arrow-tools-$VERSION-jar-with-dependencies.jar
archery integration --run-ipc --with-cpp=1 --with-java=1

To run all tests, including Flight and C Data Interface integration tests, do:

archery integration --with-all --run-flight --run-ipc --run-c-data

Note that we run these tests in continuous integration, and the CI job uses Docker Compose. You may also run the Docker Compose job locally, or at least refer to it if you have questions about how to build other languages or enable certain tests.

See Running Docker Builds for more information about the project’s docker compose configuration.

JSON test data format#

A JSON representation of Arrow columnar data is provided for cross-language integration testing purposes. This representation is not canonical, but it provides a human-readable way of verifying language implementations.

See here for some examples of this JSON data.

The high-level structure of a JSON integration test file is as follows:

Data file

{
  "schema": /* Schema */,
  "batches": [ /* RecordBatch */ ],
  "dictionaries": [ /* DictionaryBatch */ ]
}

All files contain schema and batches, while dictionaries is only present if there are dictionary type fields in the schema.

Schema

{
  "fields": [ /* Field */ ],
  "metadata": /* Metadata */
}

Field

{
  "name": "name_of_the_field",
  "nullable": /* boolean */,
  "type": /* Type */,
  "children": [ /* Field */ ],
  "dictionary": {
    "id": /* integer */,
    "indexType": /* Type */,
    "isOrdered": /* boolean */
  },
  "metadata": /* Metadata */
}

The dictionary attribute is present if and only if the Field corresponds to a dictionary type, and its id maps onto a column in the DictionaryBatch. In this case the type attribute describes the value type of the dictionary.

For primitive types, children is an empty array.

Metadata

null | [ { "key": /* string */, "value": /* string */ } ]

A key-value mapping of custom metadata. It may be omitted or null, in which case it is considered equivalent to [] (no metadata). Duplicated keys are not forbidden here.
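For example, the following metadata carries two entries, one of which reuses the key of the other (allowed, per the above); the keys and values here are arbitrary:

[
  {"key": "k1", "value": "v1"},
  {"key": "k1", "value": "v2"}
]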

Type:

{
  "name": "null|struct|list|largelist|listview|largelistview|fixedsizelist|union|int|floatingpoint|utf8|largeutf8|binary|largebinary|utf8view|binaryview|fixedsizebinary|bool|decimal|date|time|timestamp|interval|duration|map|runendencoded"
}

A Type will have other fields as defined in Schema.fbs depending on its name.

Int:

{"name": "int", "bitWidth": /* integer */, "isSigned": /* boolean */}

FloatingPoint:

{"name": "floatingpoint", "precision": "HALF|SINGLE|DOUBLE"}

FixedSizeBinary:

{"name": "fixedsizebinary", "byteWidth": /* byte width */}

Decimal:

{"name": "decimal", "precision": /* integer */, "scale": /* integer */}

Timestamp:

{"name": "timestamp", "unit": "$TIME_UNIT", "timezone": "$timezone"}

$TIME_UNIT is one of "SECOND|MILLISECOND|MICROSECOND|NANOSECOND"

“timezone” is an optional string.
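For instance, a timezone-aware timestamp type with microsecond resolution would read:

{"name": "timestamp", "unit": "MICROSECOND", "timezone": "UTC"}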

Duration:

{"name": "duration", "unit": "$TIME_UNIT"}

Date:

{"name": "date", "unit": "DAY|MILLISECOND"}

Time:

{"name": "time", "unit": "$TIME_UNIT", "bitWidth": /* integer: 32 or 64 */}

Interval:

{"name": "interval", "unit": "YEAR_MONTH|DAY_TIME"}

Union:

{"name": "union", "mode": "SPARSE|DENSE", "typeIds": [ /* integer */ ]}

The typeIds field in Union holds the codes used to denote which member of the union is active in each array slot. Note that in general these discriminants are not identical to the index of the corresponding child array.
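For example, here is a sketch of a dense union field whose two children are tagged with type IDs 5 and 7 (values chosen arbitrarily, to show that the discriminants need not match the child indices 0 and 1):

{
  "name": "some_union",
  "nullable": true,
  "type": {"name": "union", "mode": "DENSE", "typeIds": [5, 7]},
  "children": [
    {"name": "f1", "type": {"name": "int", "bitWidth": 32, "isSigned": true}, "nullable": true, "children": []},
    {"name": "f2", "type": {"name": "utf8"}, "nullable": true, "children": []}
  ]
}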

List:

{"name": "list"}

The type that the list is a “list of” will be included in the Field’s “children” member, as a single Field there. For example, for a list of int32,

{
  "name": "list_nullable",
  "type": {"name": "list"},
  "nullable": true,
  "children": [
    {
      "name": "item",
      "type": {"name": "int", "isSigned": true, "bitWidth": 32},
      "nullable": true,
      "children": []
    }
  ]
}

FixedSizeList:

{"name": "fixedsizelist", "listSize": /* integer */}

This type likewise comes with a length-1 “children” array.

Struct:

{"name": "struct"}

The Field’s “children” contains an array of Fields with meaningful names and types.

Map:

{"name": "map", "keysSorted": /* boolean */}

The Field’s “children” contains a single struct field, which itself contains 2 children, named “key” and “value”.
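To illustrate, a map from utf8 keys to nullable int32 values might be declared as follows. The intermediate struct child is named “entries” here, the name conventionally used by Arrow implementations; treat this sketch as illustrative rather than normative:

{
  "name": "some_map",
  "nullable": true,
  "type": {"name": "map", "keysSorted": false},
  "children": [
    {
      "name": "entries",
      "type": {"name": "struct"},
      "nullable": false,
      "children": [
        {"name": "key", "type": {"name": "utf8"}, "nullable": false, "children": []},
        {"name": "value", "type": {"name": "int", "bitWidth": 32, "isSigned": true}, "nullable": true, "children": []}
      ]
    }
  ]
}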

Null:

{"name": "null"}

RunEndEncoded:

{"name": "runendencoded"}

The Field’s “children” must contain exactly two child fields. The first child must be named “run_ends”, be non-nullable and be either an int16, int32, or int64 type field. The second child must be named “values”, but can be of any type.
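For example, a run-end encoded field of nullable utf8 values with int32 run ends, following the constraints just stated:

{
  "name": "some_ree",
  "nullable": true,
  "type": {"name": "runendencoded"},
  "children": [
    {"name": "run_ends", "type": {"name": "int", "bitWidth": 32, "isSigned": true}, "nullable": false, "children": []},
    {"name": "values", "type": {"name": "utf8"}, "nullable": true, "children": []}
  ]
}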

Extension types are, as in the IPC format, represented as their underlying storage type plus some dedicated field metadata to reconstruct the extension type. For example, assuming a “rational” extension type backed by a struct<numer: int32, denom: int32> storage, here is how a “rational” field would be represented:

{
  "name": "name_of_the_field",
  "nullable": /* boolean */,
  "type": {"name": "struct"},
  "children": [
    {
      "name": "numer",
      "type": {"name": "int", "bitWidth": 32, "isSigned": true}
    },
    {
      "name": "denom",
      "type": {"name": "int", "bitWidth": 32, "isSigned": true}
    }
  ],
  "metadata": [
    {"key": "ARROW:extension:name", "value": "rational"},
    {"key": "ARROW:extension:metadata", "value": "rational-serialized"}
  ]
}

RecordBatch:

{"count": /* integer number of rows */, "columns": [ /* FieldData */ ]}

DictionaryBatch:

{"id": /* integer */, "data": [ /* RecordBatch */ ]}

FieldData:

{
  "name": "field_name",
  "count": "field_length",
  "$BUFFER_TYPE": /* BufferData */,
  ...
  "$BUFFER_TYPE": /* BufferData */,
  "children": [ /* FieldData */ ]
}

The “name” member of a Field in the Schema corresponds to the “name” of a FieldData contained in the “columns” of a RecordBatch. For nested types (list, struct, etc.), Field’s “children” each have a “name” that corresponds to the “name” of a FieldData inside the “children” of that FieldData. For FieldData inside of a DictionaryBatch, the “name” field does not correspond to anything.

Here $BUFFER_TYPE is one of VALIDITY, OFFSET (for variable-length types, such as strings and lists), TYPE_ID (for unions), or DATA.

BufferData is encoded based on the type of buffer:

  • VALIDITY: a JSON array of 1 (valid) and 0 (null). Data for non-nullable Field still has a VALIDITY array, even though all values are 1.

  • OFFSET: a JSON array of integers for 32-bit offsets or string-formatted integers for 64-bit offsets.

  • TYPE_ID: a JSON array of integers.

  • DATA: a JSON array of encoded values.

  • VARIADIC_DATA_BUFFERS: a JSON array of data buffers represented as hex-encoded strings.

  • VIEWS: a JSON array of encoded views, which are JSON objects with the following members (a sketch follows this list):

    • SIZE: an integer indicating the size of the view,

    • INLINED: an encoded value (this field will be present if SIZE is smaller than 12, otherwise the next three fields will be present),

    • PREFIX_HEX: the first four bytes of the view encoded as hex,

    • BUFFER_INDEX: the index in VARIADIC_DATA_BUFFERS of the buffer viewed,

    • OFFSET: the offset in the buffer viewed.
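As a sketch, the FieldData for a binaryview column with one short (inlined) value and one 13-byte (referenced) value might carry buffers like the following; the inlined value is hex-encoded here by analogy with the encoding of binary DATA described below, an assumption of this sketch rather than a normative rule:

{
  "name": "some_binaryview",
  "count": 2,
  "VALIDITY": [1, 1],
  "VARIADIC_DATA_BUFFERS": ["404142434445464748494A4B4C"],
  "VIEWS": [
    {"SIZE": 3, "INLINED": "404142"},
    {"SIZE": 13, "PREFIX_HEX": "40414243", "BUFFER_INDEX": 0, "OFFSET": 0}
  ]
}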

The value encoding for DATA is different depending on the logical type (a combined example follows this list):

  • For boolean type: an array of 1 (true) and 0 (false).

  • For integer-based types (including timestamps): an array of JSON numbers.

  • For 64-bit integers: an array of integers formatted as JSON strings, so as to avoid loss of precision.

  • For floating point types: an array of JSON numbers. Values are limited to 3 decimal places to avoid loss of precision.

  • For binary types, an array of uppercase hex-encoded strings, so as to represent arbitrary binary data.

  • For UTF-8 string types, an array of JSON strings.
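Putting these pieces together, the FieldData for a nullable int32 column with three rows, the second of which is null, might read (values are illustrative):

{
  "name": "some_ints",
  "count": 3,
  "VALIDITY": [1, 0, 1],
  "DATA": [12, 0, 34]
}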

For “list” and “largelist” types, BufferData has VALIDITY and OFFSET, and the rest of the data is inside “children”. These child FieldData contain all of the same attributes as non-child data, so in the example of a list of int32, the child data has VALIDITY and DATA.

For “fixedsizelist”, there is no OFFSET member because the offsets are implied by the field’s “listSize”.

Note that the “count” for these child data may not match the parent “count”. For example, if a RecordBatch has 7 rows and contains a FixedSizeList of listSize 4, then the data inside the “children” of that FieldData will have count 28.
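Continuing the list-of-int32 schema example from above, a matching FieldData could look like the sketch below (values are illustrative). The second row is null, so its offsets do not advance, and the child “count” of 5 differs from the parent “count” of 3:

{
  "name": "list_nullable",
  "count": 3,
  "VALIDITY": [1, 0, 1],
  "OFFSET": [0, 2, 2, 5],
  "children": [
    {
      "name": "item",
      "count": 5,
      "VALIDITY": [1, 1, 1, 1, 1],
      "DATA": [1, 2, 3, 4, 5]
    }
  ]
}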

For “null” type, BufferData does not contain any buffers.

Archery Integration Test Cases#

This list makes it easier to understand what manual testing may be needed for future Arrow Format changes, by showing which cases the automated integration tests actually cover.

There are two types of integration test cases: the ones populated on the fly by the data generator in the Archery utility, and gold files that exist in the arrow-testing repository.

Data Generator Tests#

This is the high-level description of the cases which are generated and tested using the archery integration command (see get_generated_json_files in datagen.py):

  • Primitive Types

    • No Batches

    • Various Primitive Values

    • Batches with Zero Length

    • String and Binary Large offset cases

  • Null Type

    • Trivial Null batches

  • Decimal128

  • Decimal256

  • DateTime with various units

  • Durations with various units

  • Intervals

    • MonthDayNano interval is a separate case

  • Map Types

    • Non-Canonical Maps

  • Nested Types

    • Lists

    • Structs

    • Lists with Large Offsets

  • Unions

  • Custom Metadata

  • Schemas with Duplicate Field Names

  • Dictionary Types

    • Signed indices

    • Unsigned indices

    • Nested dictionaries

  • Run end encoded

  • Binary view and string view

  • List view and large list view

  • Extension Types

Gold File Integration Tests#

Pre-generated JSON and Arrow IPC files (both file and stream format) exist in the arrow-testing repository in the data/arrow-ipc-stream/integration directory. These serve as gold files that are assumed to be correct for use in testing. They are referenced by runner.py in the code for the Archery utility. Below are the test cases which are covered by them:

  • Backwards Compatibility

    • The following cases are tested using the 0.14.1 format:

      • datetime

      • decimals

      • dictionaries

      • intervals

      • maps

      • nested types (list, struct)

      • primitives

      • primitive with no batches

      • primitive with zero length batches

    • The following case is tested using the 0.17.1 format:

      • unions

  • Endianness

    • The following cases are tested with both Little Endian and Big Endian versions, for automatic conversion:

      • custom metadata

      • datetime

      • decimals

      • decimal256

      • dictionaries

      • dictionaries with unsigned indices

      • record batches with duplicate field names

      • extension types

      • interval types

      • map types

      • non-canonical map data

      • nested types (lists, structs)

      • nested dictionaries

      • nested large offset types

      • nulls

      • primitive data

      • large offset binary and strings

      • primitives with no batches included

      • primitive batches with zero length

      • recursive nested types

      • union types

  • Compression tests

    • LZ4

    • ZSTD

  • Batches with Shared Dictionaries

Generating new Gold Files#

From time to time, it is desirable to add new gold files, for example when the Columnar format or the IPC specification is updated. Archery provides a dedicated option to do that.

It is recommended to generate gold files using a well-known version of an Arrow implementation. For example, if a build of Arrow C++ exists in ./build/release/, one can generate new gold files in the /tmp/gold-files directory using the following command:

export ARROW_CPP_EXE_PATH=./build/release/
archery integration --with-cpp=1 --write-gold-files=/tmp/gold-files