Description
Enhancement
In some cases, the individual files of the import data are small, so there are too many files to scan. Reading about 1 million files can take more than an hour. Below is part of the log; as you can see, it takes about 40 minutes to generate subtasks.
[2025/11/29 05:37:51.623 +00:00] [INFO] [scheduler.go:309] ["on next subtasks batch"] [keyspaceName=SYSTEM] [task-id=1] [task-key=xxx] [curr-step=init] [next-step=encode] [node-count=18] [table-id=15]

[2025/11/29 06:14:22.097 +00:00] [INFO] [table_import.go:420] ["populate chunks start"] [keyspaceName=SYSTEM] [task-id=1] [task-key=xxx] [curr-step=init] [next-step=encode] [node-count=18] [table-id=15]

There are two things we can improve here:
- Skip reading the files before submitting a task that uses global sort: the DXF scheduler reads the files again anyway, and the total file size is only used to calculate DXF node resources.
- Estimate an overall compression ratio from a sample of the files, instead of opening every file.
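
A minimal sketch of the sampling idea in the second point, to make the proposal concrete. It is not TiDB code: `SourceFile`, `EstimateCompressionRatio`, `EstimateTotalSize`, and the `measure` callback are all hypothetical names, and the sketch assumes the stored size of every file is already known from the file listing, so only the sampled files need to be opened.

```go
package main

import "errors"

// SourceFile is a hypothetical descriptor for one data file.
// Size is the stored (possibly compressed) size, assumed to come
// from the file listing without opening the file.
type SourceFile struct {
	Path string
	Size int64
}

// EstimateCompressionRatio opens only sampleCount evenly spaced files.
// measure is a hypothetical callback that opens one file and returns
// its real (uncompressed) size.
func EstimateCompressionRatio(
	files []SourceFile,
	sampleCount int,
	measure func(SourceFile) (int64, error),
) (float64, error) {
	if len(files) == 0 || sampleCount <= 0 {
		return 0, errors.New("need at least one file and a positive sample count")
	}
	if sampleCount > len(files) {
		sampleCount = len(files)
	}
	var stored, real int64
	for i := 0; i < sampleCount; i++ {
		// Evenly spaced indices, so the sample covers the whole listing.
		f := files[i*len(files)/sampleCount]
		n, err := measure(f)
		if err != nil {
			return 0, err
		}
		stored += f.Size
		real += n
	}
	if stored == 0 {
		return 0, errors.New("sampled files are empty")
	}
	return float64(real) / float64(stored), nil
}

// EstimateTotalSize extrapolates the real size of all files from the
// stored sizes and the sampled ratio, without opening any file.
func EstimateTotalSize(files []SourceFile, ratio float64) int64 {
	var stored int64
	for _, f := range files {
		stored += f.Size
	}
	return int64(float64(stored) * ratio)
}
```

With 1 million files and a sample of a few hundred, only those few hundred files are opened; the rest contribute their listed size only. Evenly spaced sampling is one simple choice here. A random sample would work as well if the files are ordered in a way that correlates with their compression ratio.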