feat: Improve ClickHouse plugin performance by using HTTP interface with ArrowStream format #20621

Open

Description

@ei-grad

Which problem is this feature request solving?

This feature request proposes switching the data insertion mechanism in the CloudQuery ClickHouse destination plugin from the current native protocol (clickhouse-go) approach to using ClickHouse's HTTP interface with FORMAT ArrowStream. The goal is to significantly improve insertion performance and reduce plugin-side resource consumption (CPU, memory) by eliminating the current Go-based intermediate type conversion layer (typeconv/ch/values) and leveraging ClickHouse's native, optimized C++ Arrow parser directly.

Current Behavior & Problem

  • The plugin currently receives data as apache/arrow-go arrow.Record batches via the WriteTableBatch method.
  • Before sending data to ClickHouse using the clickhouse-go/v2 driver's batch interface (driver.Batch), the plugin performs a conversion step within the typeconv/ch/values package (specifically values.FromArray and values.BatchAddRecords).
  • This conversion translates each arrow.Array chunk into a corresponding Go slice type (e.g., []*int32, []*time.Time, []map[string]any, [][]*string) that the clickhouse-go driver's Append method expects (see the sketch after this list). This effectively follows an Arrow (plugin) -> Go Slice (plugin) -> CH Native (driver) data flow before reaching the server.
  • Problem: This intermediate Arrow -> Go Slice conversion step within the Go plugin (typeconv/ch/values) can be a significant performance bottleneck and source of high resource usage:
    • CPU Intensive: It involves iterating over potentially large Arrow arrays and performing type-specific conversions. Reflection is used for complex types (lists, maps, structs), adding overhead.
    • Memory Intensive: It requires allocating Go slices for every column chunk in the batch. For complex or nested types, these allocations can become substantial.
    • Redundant Work: ClickHouse itself has highly optimized native capabilities to ingest data directly from the Arrow IPC Stream format. The current plugin performs work (Arrow -> Go Slice) that the database could potentially do more efficiently.
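
As an illustration of the intermediate step described above, the following sketch shows the kind of per-column conversion the plugin has to perform today. It is not the actual typeconv/ch/values code; the function name and the single handled type are purely illustrative.

```go
package sketch

import (
	"github.com/apache/arrow-go/v18/arrow/array"
)

// int32Slice converts one Arrow Int32 column chunk into the []*int32 shape
// that clickhouse-go's Append expects for a Nullable(Int32) column.
// Every column chunk of every batch pays for one such allocation and loop;
// lists, maps, and structs additionally need reflection or dedicated code.
func int32Slice(col *array.Int32) []*int32 {
	out := make([]*int32, col.Len())
	for i := 0; i < col.Len(); i++ {
		if col.IsNull(i) {
			continue // leave nil for NULL values
		}
		v := col.Value(i)
		out[i] = &v
	}
	return out
}
```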

Alternatives Considered

  • Optimizing typeconv/ch/values: While possible (e.g., using code generation instead of reflection), it's unlikely to match the performance of ClickHouse's native C++ Arrow parser and still incurs Go allocation overhead for the intermediate slices.
  • Using clickhouse-go's HTTP Interface: The clickhouse-go/v2 library does have HTTP support, but currently only implements the Native format over HTTP for compatibility reasons. Support for other formats like Arrow over HTTP is planned for a future v3 but is low priority and not yet implemented. Therefore, relying on the existing driver's HTTP capabilities is not currently a viable alternative for sending Arrow data.

Success Criteria

  • Measurable improvement in insertion throughput (rows/sec or MB/sec) compared to the current implementation under various load conditions (simple vs. complex schemas, small vs. large batches).
  • Measurable reduction in plugin CPU and peak memory usage during insertion operations.
  • Successful migration and data insertion for all currently supported data types and schema configurations via the new HTTP transport.

Describe the solution you'd like

Proposed Behavior

  • Replace Transport: Replace the clickhouse-go native protocol connection (clickhouse.Conn) for insertion operations with Go's standard net/http client, configured for efficiency (connection pooling, keep-alives, HTTP/2).
  • Utilize FORMAT ArrowStream: Modify the INSERT query generation to use FORMAT ArrowStream.
  • Direct Arrow Serialization: In WriteTableBatch (or a refactored equivalent):
    • Take the incoming []arrow.Record batch.
    • Use the apache/arrow-go/v18/arrow/ipc package (ipc.NewWriter) to serialize the batch directly into the Arrow IPC Stream format.
    • Crucially: The schema written to the IPC stream must be canonized based on the target ClickHouse table schema, using Arrow types that ClickHouse's ArrowStream parser expects for input (referencing ClickHouse documentation for type mapping, e.g., SDK UUID type -> Arrow String type, SDK JSON type -> Arrow String type). The existing typeconv.CanonizedTable logic needs adaptation for this mapping.
    • Stream the serialized Arrow IPC data as the body of an HTTP POST request using memory-efficient techniques like io.Pipe (sketched after this list).
    • Optionally support request body compression (gzip) via Content-Encoding.
  • Offload Parsing: ClickHouse receives the Arrow stream via HTTP and uses its native parser to ingest the data directly.
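
A minimal sketch of the proposed write path, assuming a reachable ClickHouse HTTP endpoint (default port 8123) and records that already carry a canonized schema; the function name, hard-coded credentials, and error formatting are illustrative, not a finished design:

```go
package sketch

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/url"

	"github.com/apache/arrow-go/v18/arrow"
	"github.com/apache/arrow-go/v18/arrow/ipc"
)

// insertArrowStream streams a batch to ClickHouse over HTTP as
// FORMAT ArrowStream, using io.Pipe so the IPC payload is never fully
// buffered in plugin memory.
func insertArrowStream(ctx context.Context, client *http.Client, endpoint, table string, recs []arrow.Record) error {
	if len(recs) == 0 {
		return nil
	}
	pr, pw := io.Pipe()

	go func() {
		// Serialize the records into the Arrow IPC stream format directly
		// into the request body. Optionally wrap pw in a gzip.Writer and set
		// Content-Encoding: gzip on the request.
		w := ipc.NewWriter(pw, ipc.WithSchema(recs[0].Schema()))
		for _, rec := range recs {
			if err := w.Write(rec); err != nil {
				pw.CloseWithError(err)
				return
			}
		}
		pw.CloseWithError(w.Close())
	}()

	q := url.Values{}
	q.Set("query", fmt.Sprintf("INSERT INTO %s FORMAT ArrowStream", table))
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint+"/?"+q.Encode(), pr)
	if err != nil {
		return err
	}
	// Credentials via the ClickHouse HTTP headers; TLS, pooling, and
	// timeouts are configured on the *http.Client itself.
	req.Header.Set("X-ClickHouse-User", "default")
	req.Header.Set("X-ClickHouse-Key", "")

	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(resp.Body)
		return fmt.Errorf("clickhouse insert failed: %s: %s", resp.Status, body)
	}
	return nil
}
```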

Clarification on Conversion

  • Conversion Still Occurs: It's important to acknowledge that the conversion from the Arrow format to ClickHouse's internal columnar representation still needs to happen.
  • Shifted Responsibility: This feature request shifts this conversion responsibility from the CloudQuery Go plugin (Arrow -> Go Slice -> CH Native) to the ClickHouse server itself (Arrow IPC Stream -> CH Internal).
  • Cost Difference: The direct Arrow -> CH Internal conversion performed by the ClickHouse server (typically implemented in optimized C++) is expected to be significantly cheaper in terms of CPU and memory overhead compared to the current intermediate Arrow -> Go Slice conversion performed within the Go plugin, which involves Go runtime overhead, potential reflection, and explicit Go slice allocations.

Expected Benefits

  • Improved Insertion Throughput: Significant reduction in plugin-side CPU usage by eliminating the Go conversion step should lead to higher insertion rates, especially for large batches or complex schemas.
  • Reduced Plugin Memory Usage: Eliminating the allocation of large Go slices in typeconv/ch/values should lower the plugin's peak memory footprint during writes. Streaming the request body via io.Pipe further enhances memory efficiency.
  • Leverage Native ClickHouse Optimizations: Directly utilizes ClickHouse's highly optimized C++ implementation for Arrow data ingestion.
  • Simplified Plugin Code (Runtime): While canonization logic remains (see the sketch below), the complex runtime conversion code in Go (typeconv/ch/values) is removed.
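
For context on the canonization that remains, the sketch below shows the schema side of the idea: rewrite extension-typed fields to Arrow base types before building the IPC stream. The blanket "extension type -> string" rule is a placeholder; the real mapping would follow the ClickHouse Arrow type table and the adapted typeconv rules, and the column data would have to be rebuilt to match the rewritten schema.

```go
package sketch

import "github.com/apache/arrow-go/v18/arrow"

// canonizeSchema returns a copy of the schema where extension types (e.g. the
// SDK UUID and JSON types) are replaced with plain Arrow strings, so the IPC
// stream only carries base types ClickHouse's ArrowStream input understands.
func canonizeSchema(in *arrow.Schema) *arrow.Schema {
	fields := make([]arrow.Field, in.NumFields())
	for i, f := range in.Fields() {
		if _, ok := f.Type.(arrow.ExtensionType); ok {
			f.Type = arrow.BinaryTypes.String // illustrative simplification
		}
		fields[i] = f
	}
	return arrow.NewSchema(fields, nil)
}
```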

Implementation Considerations & Challenges

  • Transport Layer Rewrite: Requires replacing clickhouse.Conn usage for writes with a properly configured net/http.Client (handling TLS, pooling, timeouts, auth).
  • Configuration Changes: The spec.Spec needs updates to handle HTTP connection parameters (URL, user, password, etc.) instead of or alongside the DSN; spec.Options needs rewriting as well.
  • Serialization Logic: Implement Arrow IPC Stream serialization using arrow/ipc and io.Pipe for efficient streaming.
  • Schema Canonization Adaptation: The logic in typeconv.CanonizedTable/CanonizedField must be updated to map from SDK Arrow types to the specific Arrow base types expected by ClickHouse's ArrowStream input format for the corresponding target ClickHouse column type.
  • DDL and Schema Introspection: All operations currently relying on conn.Exec or conn.Query (e.g., MigrateTables, getTableDefinitions, getPartitionKeyAndSortingKey) must be rewritten to use HTTP requests (e.g., POST /?query=CREATE TABLE... FORMAT ..., POST /?query=SELECT ... FROM system.columns ... FORMAT JSON), including parsing the responses (e.g., JSON) from these administrative queries; a sketch follows this list. This is a substantial change.
  • Error Handling: Implement robust handling for HTTP status codes and parsing error messages from ClickHouse HTTP responses.
  • Compression: Decide on and potentially implement optional request body compression (gzip).
  • Testing: Thorough performance benchmarking (throughput, CPU, memory) against the current implementation is essential to validate the benefits across different data shapes and volumes. Functional testing for all types and DDL operations is critical.
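
The sketch below illustrates the introspection side mentioned above: one administrative query sent over HTTP and its FORMAT JSON response decoded into a Go struct. The selected columns and struct shape are illustrative; real code would also escape the identifiers and cover the other queries (CREATE TABLE, partition/sorting key lookups) the plugin issues.

```go
package sketch

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// columnsResponse models the subset of a FORMAT JSON response we care about:
// ClickHouse returns {"meta": [...], "data": [...], "rows": N, ...}.
type columnsResponse struct {
	Data []struct {
		Name string `json:"name"`
		Type string `json:"type"`
	} `json:"data"`
}

// describeTable queries system.columns over the HTTP interface instead of
// clickhouse-go's conn.Query. Identifier escaping is omitted for brevity.
func describeTable(ctx context.Context, client *http.Client, endpoint, database, table string) (*columnsResponse, error) {
	query := fmt.Sprintf(
		"SELECT name, type FROM system.columns WHERE database = '%s' AND table = '%s' FORMAT JSON",
		database, table,
	)
	q := url.Values{}
	q.Set("query", query)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint+"/?"+q.Encode(), nil)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(resp.Body)
		return nil, fmt.Errorf("clickhouse query failed: %s: %s", resp.Status, body)
	}
	var out columnsResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}
```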

Pull request (optional)

  • I can submit a pull request
