importinto: scanning large amount of compressed files is slow #64770

Open

#64769

Labels

component/import
type/enhancement (The issue or PR belongs to an enhancement.)

Description

joechenrh opened on Nov 29, 2025

Enhancement

In some cases, the individual files in the import data are small, so there are too many files to scan. Reading about 1 million files can take more than an hour. Below is part of the log; as you can see, it takes about 40 minutes to generate subtasks.

[2025/11/29 05:37:51.623 +00:00] [INFO] [scheduler.go:309] ["on next subtasks batch"] [keyspaceName=SYSTEM] [task-id=1] [task-key=xxx] [curr-step=init] [next-step=encode] [node-count=18] [table-id=15]
[2025/11/29 06:14:22.097 +00:00] [INFO] [table_import.go:420] ["populate chunks start"] [keyspaceName=SYSTEM] [task-id=1] [task-key=xxx] [curr-step=init] [next-step=encode] [node-count=18] [table-id=15]

There are two things we can improve here:

- Skip reading the files before submitting a task with global sort, since we read the files again in the DXF scheduler and only use the total file size to calculate DXF node resources.
- Use a subset of the files to estimate an overall compression ratio, to avoid opening every file.
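
The second idea can be illustrated with a minimal sketch. It is written in Python for brevity and is not TiDB's implementation; the function `estimate_total_uncompressed_size`, its parameters, and the in-memory gzip "files" are all hypothetical. It assumes the compressed size of every file is already known from a cheap listing, and that only measuring the uncompressed size requires opening a file.

```python
import gzip
import random
from typing import Callable, Mapping


def estimate_total_uncompressed_size(
    compressed_sizes: Mapping[str, int],
    measure_uncompressed: Callable[[str], int],
    sample_size: int = 16,
    seed: int = 0,
) -> int:
    """Estimate the total uncompressed size by opening only a sample of files.

    compressed_sizes: file name -> compressed size (cheap, from a listing).
    measure_uncompressed: expensive callback that opens one file and
        returns its uncompressed size.
    """
    names = sorted(compressed_sizes)
    if not names:
        return 0
    # Pick a reproducible random sample so that a few unusual files at the
    # start of the listing do not dominate the estimate.
    rng = random.Random(seed)
    sample = rng.sample(names, min(sample_size, len(names)))
    sampled_compressed = sum(compressed_sizes[n] for n in sample)
    sampled_uncompressed = sum(measure_uncompressed(n) for n in sample)
    if sampled_compressed == 0:
        return 0
    # Overall ratio observed on the sample, applied to the whole data set.
    ratio = sampled_uncompressed / sampled_compressed
    return round(ratio * sum(compressed_sizes.values()))


# Usage with hypothetical in-memory data: 200 small gzip "files".
files = {
    f"part-{i:04d}.csv.gz": gzip.compress(b"id,name\n" * 1000)
    for i in range(200)
}
sizes = {name: len(blob) for name, blob in files.items()}
opened = []


def measure(name: str) -> int:
    opened.append(name)  # record which files were actually opened
    return len(gzip.decompress(files[name]))


estimate = estimate_total_uncompressed_size(sizes, measure, sample_size=10)
print(estimate, len(opened))
```

Here only 10 of the 200 files are decompressed. The estimate is only as good as the sample: if compression ratios vary widely between files, a larger or size-weighted sample would be needed.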
