chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse

chdb-io/chdb

chDB

chDB is an in-process SQL OLAP Engine powered by ClickHouse [1]. For more details, see: The birth of chDB

Features

  • In-process SQL OLAP Engine, powered by ClickHouse
  • No need to install ClickHouse
  • Minimized data copy from C++ to Python with Python memoryview
  • Input and output support Parquet, CSV, JSON, Arrow, ORC, and 60+ more formats (samples)
  • Supports Python DB API 2.0 (example)

Arch

Get Started

Get started with chdb using our Installation and Usage Examples


Installation

Currently, chDB supports Python 3.8+ on macOS and Linux (x86_64 and ARM64).

pip install chdb

Usage

Run in command line

python3 -m chdb SQL [OutputFormat]

python3 -m chdb "SELECT 1,'abc'" Pretty

Data Input

The following methods are available to access on-disk and in-memory data formats:

🗂️ Connection based API (recommended)

import chdb

# Create a connection (in-memory by default)
conn = chdb.connect(":memory:")
# Or use file-based: conn = chdb.connect("test.db")

# Create a cursor
cur = conn.cursor()

# Execute queries
cur.execute("SELECT number, toString(number) as str FROM system.numbers LIMIT 3")

# Fetch data in different ways
print(cur.fetchone())    # Single row: (0, '0')
print(cur.fetchmany(2))  # Multiple rows: ((1, '1'), (2, '2'))

# Get column information
print(cur.column_names())  # ['number', 'str']
print(cur.column_types())  # ['UInt64', 'String']

# Use the cursor as an iterator
cur.execute("SELECT number FROM system.numbers LIMIT 3")
for row in cur:
    print(row)

# Always close resources when done
cur.close()
conn.close()

For more details, see examples/connect.py.

🗂️ Query On File

(Parquet, CSV, JSON, Arrow, ORC and 60+)

You can execute SQL and get the result back in the output format you choose.

import chdb
res = chdb.query('select version()', 'Pretty'); print(res)

Work with Parquet or CSV

# See more data type format in tests/format_output.py
res = chdb.query('select * from file("data.parquet", Parquet)', 'JSON'); print(res)
res = chdb.query('select * from file("data.csv", CSV)', 'CSV'); print(res)
print(f"SQL read {res.rows_read()} rows, {res.bytes_read()} bytes, storage read {res.storage_rows_read()} rows, {res.storage_bytes_read()} bytes, elapsed {res.elapsed()} seconds")

Pandas dataframe output

# See more in https://clickhouse.com/docs/en/interfaces/formats
chdb.query('select * from file("data.parquet", Parquet)', 'Dataframe')

🗂️ Query On Table

(Pandas DataFrame, Parquet file/bytes, Arrow bytes)

Query On Pandas DataFrame

import chdb  # needed for the chdb.query() call at the end
import chdb.dataframe as cdf
import pandas as pd

# Join 2 DataFrames
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ["one", "two", "three"]})
df2 = pd.DataFrame({'c': [1, 2, 3], 'd': ["①", "②", "③"]})
ret_tbl = cdf.query(sql="select * from __tbl1__ t1 join __tbl2__ t2 on t1.a = t2.c",
                    tbl1=df1, tbl2=df2)
print(ret_tbl)

# Query on the DataFrame Table
print(ret_tbl.query('select b, sum(a) from __table__ group by b'))

# Pandas DataFrames are automatically registered as temporary tables in ClickHouse
chdb.query("SELECT * FROM Python(df1) t1 JOIN Python(df2) t2 ON t1.a = t2.c").show()

🗂️ Query with Stateful Session

from chdb import session as chs

## Create DB, Table, View in temp session, auto cleanup when session is deleted.
sess = chs.Session()
sess.query("CREATE DATABASE IF NOT EXISTS db_xxx ENGINE = Atomic")
sess.query("CREATE TABLE IF NOT EXISTS db_xxx.log_table_xxx (x String, y Int) ENGINE = Log;")
sess.query("INSERT INTO db_xxx.log_table_xxx VALUES ('a', 1), ('b', 3), ('c', 2), ('d', 5);")
sess.query("CREATE VIEW db_xxx.view_xxx AS SELECT * FROM db_xxx.log_table_xxx LIMIT 4;")
print("Select from view:\n")
print(sess.query("SELECT * FROM db_xxx.view_xxx", "Pretty"))

See also: test_stateful.py.

🗂️ Query with Python DB-API 2.0

import chdb.dbapi as dbapi

print("chdb driver version: {0}".format(dbapi.get_client_info()))

conn1 = dbapi.connect()
cur1 = conn1.cursor()
cur1.execute('select version()')
print("description: ", cur1.description)
print("data: ", cur1.fetchone())
cur1.close()
conn1.close()
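
Since chdb.dbapi follows the Python DB-API 2.0 interface, helpers written against that specification should in principle work with its cursors. The sketch below is an illustration and not part of chDB: rows_as_dicts is a hypothetical helper that pairs the column names in cursor.description with the fetched rows. It is exercised against the standard library's sqlite3 driver so that it runs without chDB installed; that chdb.dbapi cursors behave the same way is an assumption to verify.

```python
import sqlite3


def rows_as_dicts(cursor):
    """Return all remaining rows of a DB-API 2.0 cursor as a list of dicts.

    Hypothetical helper: relies only on cursor.description and fetchall(),
    both defined by the DB-API 2.0 specification (PEP 249).
    """
    # description is a sequence of 7-item sequences; item 0 is the column name
    names = [col[0] for col in cursor.description]
    return [dict(zip(names, row)) for row in cursor.fetchall()]


# Demonstrated with sqlite3 as a stand-in DB-API driver (not chDB).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("SELECT 1 AS a, 'abc' AS b")
result = rows_as_dicts(cur)
print(result)
cur.close()
conn.close()
```

With chDB, the same helper would be called on a cursor obtained from dbapi.connect().cursor() after execute().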

🗂️ Query with UDF (User Defined Functions)

from chdb.udf import chdb_udf
from chdb import query

@chdb_udf()
def sum_udf(lhs, rhs):
    return int(lhs) + int(rhs)

print(query("select sum_udf(12,22)"))

Some notes on the chDB Python UDF (User Defined Function) decorator:

  1. The function should be stateless. So, only UDFs are supported, not UDAFs (User Defined Aggregation Functions).
  2. The default return type is String. To change it, pass the return type as an argument. The return type should be one of the following: https://clickhouse.com/docs/en/sql-reference/data-types
  3. The function should take in arguments of type String. As the input is TabSeparated, all arguments are strings.
  4. The function will be called for each line of input. Something like this:
    import sys

    def sum_udf(lhs, rhs):
        return int(lhs) + int(rhs)

    for line in sys.stdin:
        args = line.strip().split('\t')
        lhs = args[0]
        rhs = args[1]
        print(sum_udf(lhs, rhs))
        sys.stdout.flush()
  5. The function should be a pure Python function. You SHOULD import all Python modules used IN THE FUNCTION.
    def func_use_json(arg):
        import json
        ...
  6. The Python interpreter used is the same as the one used to run the script. It is taken from sys.executable.
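
Notes 3 and 4 describe a line-oriented, tab-separated calling convention. The following minimal sketch simulates that loop over an in-memory list of lines instead of sys.stdin, so it runs without chDB. apply_udf_to_lines is a hypothetical helper introduced only for illustration; it is not part of chdb.udf.

```python
def sum_udf(lhs, rhs):
    # Arguments arrive as strings, so convert them explicitly
    return int(lhs) + int(rhs)


def apply_udf_to_lines(func, lines):
    """Hypothetical helper: mimic the per-line loop from note 4.

    Each input line holds the arguments of one call, separated by tabs;
    each result is rendered back to a string, one per input line.
    """
    outputs = []
    for line in lines:
        args = line.rstrip("\n").split("\t")
        outputs.append(str(func(*args)))
    return outputs


outputs = apply_udf_to_lines(sum_udf, ["12\t22\n", "1\t2\n"])
print(outputs)  # ['34', '3']
```

The explicit int() conversions are the important part: dropping them would concatenate the two strings instead of adding numbers.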

See also: test_udf.py.

🗂️ Streaming Query

Process large datasets with constant memory usage through chunked streaming.

from chdb import session as chs

sess = chs.Session()

# Example 1: Basic example of using streaming query
rows_cnt = 0
with sess.send_query("SELECT * FROM numbers(200000)", "CSV") as stream_result:
    for chunk in stream_result:
        rows_cnt += chunk.rows_read()

print(rows_cnt)  # 200000

# Example 2: Manual iteration with fetch()
rows_cnt = 0
stream_result = sess.send_query("SELECT * FROM numbers(200000)", "CSV")
while True:
    chunk = stream_result.fetch()
    if chunk is None:
        break
    rows_cnt += chunk.rows_read()

print(rows_cnt)  # 200000

# Example 3: Early cancellation demo
rows_cnt = 0
stream_result = sess.send_query("SELECT * FROM numbers(200000)", "CSV")
while True:
    chunk = stream_result.fetch()
    if chunk is None:
        break
    if rows_cnt > 0:
        stream_result.close()
        break
    rows_cnt += chunk.rows_read()

print(rows_cnt)  # 65409

# Example 4: Using PyArrow RecordBatchReader for batch export and integration with other libraries
import pyarrow as pa
from deltalake import write_deltalake

# Get streaming result in arrow format
stream_result = sess.send_query("SELECT * FROM numbers(100000)", "Arrow")

# Create RecordBatchReader with custom batch size (default rows_per_batch=1000000)
batch_reader = stream_result.record_batch(rows_per_batch=10000)

# Use RecordBatchReader with external libraries like Delta Lake
write_deltalake(
    table_or_uri="./my_delta_table",
    data=batch_reader,
    mode="overwrite"
)

stream_result.close()
sess.close()

Important Note: When using streaming queries, if the StreamingResult is not fully consumed (due to errors or early termination), you must explicitly call stream_result.close() to release resources, or use the with statement for automatic cleanup. Failure to do so may block subsequent queries.
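
One way to make this cleanup hard to forget is to wrap manual iteration in try/finally. The sketch below is illustrative only: count_rows is a hypothetical helper that relies solely on the fetch(), rows_read(), and close() calls shown in the examples above, and FakeStream and FakeChunk are stand-in classes (not chDB classes) so that the sketch runs without chDB installed. With chDB, the object returned by sess.send_query(...) would be passed instead.

```python
def count_rows(stream_result, max_rows=None):
    """Hypothetical helper: sum rows_read() over chunks, always closing the stream.

    Stops early once max_rows is reached; the finally clause releases the
    stream on early exit and on exceptions alike.
    """
    total = 0
    try:
        while True:
            chunk = stream_result.fetch()
            if chunk is None:
                break
            total += chunk.rows_read()
            if max_rows is not None and total >= max_rows:
                break
    finally:
        stream_result.close()
    return total


class FakeChunk:
    """Stand-in for a result chunk (not a chDB class)."""

    def __init__(self, n):
        self._n = n

    def rows_read(self):
        return self._n


class FakeStream:
    """Stand-in for a streaming result (not a chDB class)."""

    def __init__(self, sizes):
        self._chunks = [FakeChunk(n) for n in sizes]
        self.closed = False

    def fetch(self):
        return self._chunks.pop(0) if self._chunks else None

    def close(self):
        self.closed = True


full = FakeStream([100, 100, 50])
full_count = count_rows(full)
partial = FakeStream([100, 100, 50])
partial_count = count_rows(partial, max_rows=150)
print(full_count, partial_count, full.closed, partial.closed)
```

Early termination counts whole chunks, so the partial count can exceed max_rows; the stream is closed in both cases.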

For more details, see test_streaming_query.py and test_arrow_record_reader_deltalake.py.

🗂️ Python Table Engine

Query on Pandas DataFrame

import chdb
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
        "dict_col": [
            {'id': 1, 'tags': ['urgent', 'important'], 'metadata': {'created': '2024-01-01'}},
            {'id': 2, 'tags': ['normal'], 'metadata': {'created': '2024-02-01'}},
            {'id': 3, 'name': 'tom'},
            {'id': 4, 'value': '100'},
            {'id': 5, 'value': 101},
            {'id': 6, 'value': 102},
        ],
    }
)

chdb.query("SELECT b, sum(a) FROM Python(df) GROUP BY b ORDER BY b").show()
chdb.query("SELECT dict_col.id FROM Python(df) WHERE dict_col.value='100'").show()

Query on Arrow Table

import chdb
import pyarrow as pa

arrow_table = pa.table(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
        "dict_col": [
            {'id': 1, 'value': 'tom'},
            {'id': 2, 'value': 'jerry'},
            {'id': 3, 'value': 'auxten'},
            {'id': 4, 'value': 'tom'},
            {'id': 5, 'value': 'jerry'},
            {'id': 6, 'value': 'auxten'},
        ],
    }
)

chdb.query("SELECT b, sum(a) FROM Python(arrow_table) GROUP BY b ORDER BY b").show()
chdb.query("SELECT dict_col.id FROM Python(arrow_table) WHERE dict_col.value='tom'").show()

Query on chdb.PyReader class instance

  1. You must inherit from the chdb.PyReader class and implement the read method.
  2. The read method should:
    1. return a list of lists, where the first dimension is the column and the second dimension is the row; the column order should match the first argument, col_names, of read.
    2. return an empty list when there is no more data to read.
    3. be stateful; the cursor should be updated in the read method.
  3. An optional get_schema method can be implemented to return the schema of the table. The prototype is def get_schema(self) -> List[Tuple[str, str]]:. The return value is a list of tuples, each containing the column name and the column type. The column type should be one of the following: https://clickhouse.com/docs/en/sql-reference/data-types

import chdb

class myReader(chdb.PyReader):
    def __init__(self, data):
        self.data = data
        self.cursor = 0
        super().__init__(data)

    def read(self, col_names, count):
        print("Python func read", col_names, count, self.cursor)
        if self.cursor >= len(self.data["a"]):
            self.cursor = 0
            return []
        block = [self.data[col] for col in col_names]
        self.cursor += len(block[0])
        return block

    def get_schema(self):
        return [
            ("a", "int"),
            ("b", "str"),
            ("dict_col", "json")
        ]

reader = myReader(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
        "dict_col": [
            {'id': 1, 'tags': ['urgent', 'important'], 'metadata': {'created': '2024-01-01'}},
            {'id': 2, 'tags': ['normal'], 'metadata': {'created': '2024-02-01'}},
            {'id': 3, 'name': 'tom'},
            {'id': 4, 'value': '100'},
            {'id': 5, 'value': 101},
            {'id': 6, 'value': 102}
        ],
    }
)

chdb.query("SELECT b, sum(a) FROM Python(reader) GROUP BY b ORDER BY b").show()
chdb.query("SELECT dict_col.id FROM Python(reader) WHERE dict_col.value='100'").show()

See also: test_query_py.py and test_query_json.py.
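
The read contract (column-major blocks, an empty list at the end, a cursor advanced on every call) can be illustrated without chDB. In the sketch below, ChunkedReader is a plain stand-in class that returns at most count rows per call. It deliberately does not inherit from chdb.PyReader, so it cannot be passed to Python() as written; treating count as an upper bound on the rows returned per call is an assumption of this sketch.

```python
class ChunkedReader:
    """Illustrative stand-in following the read() contract described above.

    Not a chdb.PyReader subclass; to use it with chDB it would need to
    inherit from chdb.PyReader and call super().__init__(data).
    """

    def __init__(self, data):
        self.data = data
        self.cursor = 0

    def read(self, col_names, count):
        n_rows = len(self.data[col_names[0]])
        if self.cursor >= n_rows:
            return []  # no more data
        end = min(self.cursor + count, n_rows)
        # Column-major block: one list per requested column, in col_names order
        block = [self.data[col][self.cursor:end] for col in col_names]
        self.cursor = end  # stateful: advance past the rows just returned
        return block


reader = ChunkedReader({"a": [1, 2, 3, 4, 5], "b": ["v", "w", "x", "y", "z"]})
blocks = []
while True:
    block = reader.read(["b", "a"], 2)
    if not block:
        break
    blocks.append(block)
print(blocks)
```

Each block lists the values for "b" first and "a" second, mirroring the order of col_names rather than the order of the underlying dictionary.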

JSON Type Inference

chDB automatically converts Python dictionary objects to ClickHouse JSON types from these sources:

  1. Pandas DataFrame

    • Columns with object dtype are sampled (default 10,000 rows) to detect JSON structures.
    • Control sampling via SQL settings:
      SET pandas_analyze_sample = 10000  -- Default sampling
      SET pandas_analyze_sample = 0      -- Force String type
      SET pandas_analyze_sample = -1     -- Force JSON type
    • Columns are converted to String if sampling finds non-dictionary values.
  2. chdb.PyReader

    • Implement custom schema mapping in get_schema():
      def get_schema(self):
          return [
              ("c1", "JSON"),    # Explicit JSON mapping
              ("c2", "String")
          ]
    • Column types declared as "JSON" will bypass auto-detection.
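
The sampling rules for Pandas object columns can be summarized as a small decision function. This is only a model of the behavior documented above, not chDB's implementation: infer_object_column_type is a hypothetical name, taking the leading rows as the sample is an assumption (the text does not say how rows are chosen), and so is the handling of an empty column.

```python
def infer_object_column_type(values, sample=10000):
    """Hypothetical model of the pandas_analyze_sample rules listed above."""
    if sample == 0:
        return "String"  # sampling disabled: force String
    if sample == -1:
        return "JSON"  # force JSON without inspecting values
    if sample < -1:
        raise ValueError("sample must be -1, 0, or a positive row count")
    sampled = values[:sample]  # assumption: sample the leading rows
    # Assumption: an empty sample gives no evidence of JSON, so fall back to String
    if sampled and all(isinstance(v, dict) for v in sampled):
        return "JSON"
    return "String"


print(infer_object_column_type([{"id": 1}, {"id": 2}]))
print(infer_object_column_type([{"id": 1}, "plain text"]))
```

A consequence of sampling is that a non-dictionary value beyond the sampled rows would not be seen by a check of this kind.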

When converting Python dictionary objects to JSON columns:

  1. Nested Structures

    • Recursively process nested dictionaries, lists, tuples and NumPy arrays.
  2. Primitive Types

    • Automatic type recognition for basic types such as integers, floats, strings, booleans, and more.
  3. Complex Objects

    • Non-primitive types will be converted to strings.
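
The three rules above can be pictured as one recursive function. The sketch below is an illustration under stated assumptions, not chDB's conversion code: to_json_value is a hypothetical name, NumPy arrays are left out to keep the example standard-library only, and passing None through unchanged is an assumption.

```python
import datetime


def to_json_value(obj):
    """Hypothetical illustration of the three conversion rules above."""
    # Rule 2: primitive types pass through unchanged (None: assumption)
    if obj is None or isinstance(obj, (bool, int, float, str)):
        return obj
    # Rule 1: recurse into nested dictionaries, lists, and tuples
    if isinstance(obj, dict):
        return {str(k): to_json_value(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_json_value(v) for v in obj]
    # Rule 3: any other object is converted to its string form
    return str(obj)


row = {
    "id": 1,
    "tags": ("urgent", "important"),
    "metadata": {"created": datetime.date(2024, 1, 1)},
}
converted = to_json_value(row)
print(converted)
```

Here the tuple becomes a list and the date object, being non-primitive, ends up as the string '2024-01-01'.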

Limitations

  1. Column types supported: pandas.Series, pyarrow.array, chdb.PyReader
  2. Data types supported: Int, UInt, Float, String, Date, DateTime, Decimal
  3. Python Object type will be converted to String
  4. Performance: Pandas DataFrame is the fastest of the three; Arrow Table is faster than PyReader

For more examples, see examples and tests.


Demos and Examples

Benchmark

Documentation

Events

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. Here are some ways you can help:

  • Help test and report bugs
  • Help improve documentation
  • Help improve code quality and performance

Bindings

We welcome bindings for other languages. Please refer to bindings for more details.

Version Guide

Please refer to VERSION-GUIDE.md for more details.

Paper

License

Apache 2.0, see LICENSE for more information.

Acknowledgments

chDB is mainly based on ClickHouse [1]. For trademark and other reasons, I named it chDB.

Contact


Footnotes

  1. ClickHouse® is a trademark of ClickHouse Inc. All trademarks, service marks, and logos mentioned or depicted are the property of their respective owners. The use of any third-party trademarks, brand names, product names, and company names does not imply endorsement, affiliation, or association with the respective owners.
