ClearTax/duckdb-excelPublic

forked fromduckdb/duckdb-excel

NotificationsYou must be signed in to change notification settings
Fork0
Star0

Excel extension for DuckDB

License

MIT license

0 stars 16 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 141 Commits
.github/workflows		.github/workflows
duckdb @ 71c5c07		duckdb @ 71c5c07
extension-ci-tools @ 71d2002		extension-ci-tools @ 71d2002
src/excel		src/excel
test		test
vcpkg_ports		vcpkg_ports
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
extension_config.cmake		extension_config.cmake
vcpkg.json		vcpkg.json

Repository files navigation

DuckDB Excel Extension

This extension adds support for theTEXT function for formatting numbers from Microsoft Excel.

Example usage:

SELECTtext(1234567.897,'$#.##')AS result;┌────────────┐│   result   ││varchar   │├────────────┤│ $1234567.9 │└────────────┘

Documentation

See theExcel page in the DuckDB documentation.

XLSX Files

Reading XLSX Files

.xlsx files can be read using theread_xlsx function. The following named parameters are supported.

Options:

Option	Type	Default	Description
`header`	`BOOLEAN`	automatically inferred	Whether to treat the first row as containing the names of the resulting columns
`sheet`	`VARCHAR`	automatically inferred	The name of the sheet in the xlsx file to read. Default is the first sheet.
`all_varchar`	`BOOLEAN`	`false`	Whether to read all cells as containing`VARCHAR`s.
`ignore_errors`	`BOOLEAN`	`false`	Whether to ignore errors and silently replace cells that cant be cast to the corresponding inferred column type with`NULL`'s.
`range`	`VARCHAR`	automatically inferred	The range of cells to read. For example,`A1:B2` reads the cells from A1 to B2. If not specified the resulting range will be inferred as rectangular region of cells between the first row of consecutive non-empty cells and the first empty row spanning the same columns
`stop_at_empty`	`BOOLEAN`	`false/true`	Whether to stop reading the file when an empty row is encountered. If an explicit`range` option is provided, this is`false` by default, otherwise`true`
`empty_as_varchar`	`BOOLEAN`	`false`	Whether to treat empty cells as`VARCHAR` instead of`DOUBLE` when trying to automatically infer column types

Example usage:

SELECT*FROM read_xlsx('test.xlsx', header'true');----┌────────┬────────┐│   a    │   b    ││ double │ double │├────────┼────────┤│1.0 │2.0 ││3.0 │4.0 │└────────┴────────┘-- Alternatively, we can use a xlsx file as a "replacement scans" and select from it immediately-- but without being able to pass options.SELECT*FROM'test.xlsx';----┌────────┬────────┐│   a    │   b    ││ double │ double │├────────┼────────┤│1.0 │2.0 ││3.0 │4.0 │└────────┴────────┘

Writing XLSX Files

Writing.xlsx files is supported using theCOPY statement withXLSX given as the format. The following additional parameters are supported.

Options:

Option	Type	Default	Description
`header`	`BOOLEAN`	`false`	Whether to write the column names as the first row in the sheet
`sheet`	`VARCHAR`	`Sheet1`	The name of the sheet in the xlsx file to write.
`sheet_row_limit`	`INTEGER`	`1048576`	The maximum number of rows in a sheet. An error is thrown if this limit is exceeded.

Example usage:

CREATETABLEtestASSELECT*FROM (VALUES (1,2), (3,4))AS t(a, b);COPY test TO'test.xlsx' (format'xlsx', header'true');

Type Conversions and Inference

Because XLSX files only really support storing strings and numbers, the equivalent ofVARCHAR andDOUBLE, the following type conversions are applied when writing XLSX files.

Numeric types are cast toDOUBLE when writing to an XLSX file.
Temporal types (TIMESTAMP,DATE,TIME, etc.) are converted to excel "serial" numbers, that is the number of days since 1900-01-01 for dates and the fraction of a day for times. These are then styled with a "number format" so that they appear as dates or times in Excel.
TIMESTAMP_TZ andTIME_TZ are cast to UTCTIMESTAMP andTIME respectively, with the timezone information being lost.
BOOLEANs are converted to1 and0, with a "number format" applied to make them appear asTRUE andFALSE in Excel.
All other types are cast toVARCHAR and then written as text cells.

When reading XLSX files, almost everything is read as eitherDOUBLE orVARCHAR depending on the Excel cell type. However, there are some caveats.

We try to inferTIMESTAMP,TIME,DATE andBOOLEAN types when possible based on the cell format.
We infer text cells containingTRUE andFALSE asBOOLEAN, but that is the only type-inference we do that is based on the actual content of the cell.
Empty cells are considered to beDOUBLE by default, unless theempty_as_varchar option is set totrue, in which case they are typed asVARCHAR.

If theall_varchar option is set totrue, none of the above applies and all cells are read asVARCHAR.

When no types are specified explicitly, (e.g. when using theread_xlsx function instead ofCOPY TO ... FROM '<file>.xlsx')the types of the resulting columns are inferred based on the first "data" row in the sheet, that is:

If no explicit range is given
- The first row after the header if a header is found or forced by theheader option
- The first non-empty row in the sheet if no header is found or forced
If an explicit range is given
- The second row of the range if a header is found in the first row or forced by theheader option
- The first row of the range if no header is found or forced

This can sometimes lead to issues if the first "data row" is not representative of the rest of the sheet (e.g. it contains empty cells) in which case theignore_errors orempty_as_varchar options can be used to work around this.Alternatively, when theCOPY TO ... FROM '<file>.xlsx' syntax is used, no type inference is done and the types of the resulting columns are determined by the types of the columns in the table being copied to. All cells will simply be converted by casting fromDOUBLE orVARCHAR to the target column type.

About

Excel extension for DuckDB

Releases

No releases published

Packages

No packages published

Languages

C++97.9%
CMake1.9%
Other0.2%

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

License

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

DuckDB Excel Extension

Documentation

XLSX Files

Reading XLSX Files

Writing XLSX Files

Type Conversions and Inference

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages

Uh oh!

Languages

Movatterモバイル変換

License

ClearTax/duckdb-excel

Folders and files

Latest commit

History

Repository files navigation

DuckDB Excel Extension

Documentation

XLSX Files

Reading XLSX Files

Writing XLSX Files

Type Conversions and Inference

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages0

Uh oh!

Languages

Packages