
Reduce copies when reading files in pyio, match behavior of _io #129005

Closed

Labels

performance (Performance or resource usage), stdlib (Standard Library Python modules in the Lib/ directory), type-feature (A feature request or enhancement)

Description

cmaloney opened on Jan 18, 2025

Feature or enhancement

Proposal:

Currently, _pyio uses ~2x as much memory as _io to read all the data from a file, because it makes more than one copy of the data.

Details from a test_largefile run

    $ ./python -m test -M8g -uall test_largefile -m test_large_read -vvv
    == CPython 3.14.0a4+ (heads/main-dirty:3829104ab41, Jan 17 2025, 21:40:47) [Clang 19.1.6 ]
    == Linux-6.12.9-arch1-1-x86_64-with-glibc2.40 little-endian
    == Python build: debug
    == cwd: <$HOME>/python/build/build/test_python_worker_32392æ
    == CPU count: 32
    == encodings: locale=UTF-8 FS=utf-8
    == resources: all

    Using random seed: 1740056613
    0:00:00 load avg: 0.53 Run 1 test sequentially in a single process
    0:00:00 load avg: 0.53 [1/1] test_largefile
    test_large_read (test.test_largefile.CLargeFileTest.test_large_read) ...
     ... expected peak memory use: 4.7G
     ... process data size: 2.3G
    ok
    test_large_read (test.test_largefile.PyLargeFileTest.test_large_read) ...
     ... expected peak memory use: 4.7G
     ... process data size: 2.3G
     ... process data size: 4.3G
     ... process data size: 4.7G
    ok

    ----------------------------------------------------------------------
    Ran 2 tests in 3.711s

    OK

    == Tests result: SUCCESS ==

    1 test OK.

    Total duration: 3.7 sec
    Total tests: run=2 (filtered)
    Total test files: run=1/1 (filtered)
    Result: SUCCESS

Plan:

Switch to ~~os.readv()~~ os.readinto() to read directly into a buffer, as the C _Py_read used by _io does. os.read() can't take a buffer to use. This aligns behavior between _io.FileIO.readall and _pyio.FileIO.readall. os.readv works well today and takes a caller-allocated buffer rather than needing to add a new os API. readv(2) mirrors the behavior and errors of read(2), so this should keep the same end behavior.
Update _pyio.BufferedIO to not force a copy of the buffer for readall when its internal buffer is empty. Currently it always slices its internal buffer, then adds the result of _pyio.FileIO.readall to it.
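
The first item of the plan can be sketched roughly as follows. This is only an illustration under stated assumptions, not the _pyio implementation: read_all_into_buffer is a hypothetical helper, it assumes a platform where os.readv is available and a regular file whose size os.fstat reports, and it omits the retry and buffer-growth handling a real readall needs.

```python
import os


def read_all_into_buffer(fd):
    """Hypothetical helper: read a whole regular file into one bytearray.

    Assumes os.readv exists on this platform and that os.fstat reports a
    usable size. A file that grows during the read is truncated here.
    """
    size = os.fstat(fd).st_size
    # One spare byte so the read at the expected end of file can report EOF.
    buf = bytearray(size + 1)
    filled = 0
    with memoryview(buf) as view:
        while filled < len(buf):
            with view[filled:] as chunk:
                # os.readv fills the caller-allocated buffer in place, so no
                # intermediate bytes object is created, unlike os.read().
                n = os.readv(fd, [chunk])
            if n == 0:  # end of file
                break
            filled += n
    # All memoryviews are released, so the bytearray can shrink in place.
    del buf[filled:]
    return buf
```

The data lands in the one buffer that is eventually returned, which is the property the plan is after; converting the result with bytes(buf) would reintroduce a copy.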

For iterating, I'm using a small tracemalloc script to find where copies are:

    from _pyio import open
    import tracemalloc

    with open("README.rst", 'rb') as file:
        tracemalloc.start()
        data = file.read()
        snap = tracemalloc.take_snapshot()
        stats = snap.statistics('lineno')
        for stat in stats:
            print(stat)
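
To put a single number on the copies rather than listing allocation sites, the same approach can use tracemalloc.get_traced_memory(). A minimal sketch, assuming a hypothetical helper named peak_read_bytes; the peak it reports depends on the Python version and build, so no expected figures are given.

```python
import tracemalloc


def peak_read_bytes(open_func, path):
    """Hypothetical helper: return (bytes read, peak traced bytes).

    open_func is an open() implementation such as io.open or _pyio.open.
    Tracing starts after the file is opened, so only read() is measured.
    """
    with open_func(path, "rb") as file:
        tracemalloc.start()
        try:
            data = file.read()
            # get_traced_memory() returns (current, peak) in bytes.
            _, peak = tracemalloc.get_traced_memory()
        finally:
            tracemalloc.stop()
    return len(data), peak
```

Calling it once with io.open and once with _pyio.open on the same file gives two peaks to compare; a peak near twice the file size suggests one extra copy of the data.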

Loose Ends

os.readv seems to be well supported but is currently guarded by a configure check. I'd like to just make _pyio require readv, but can do conditional code if needed. If making readv non-optional generally is feasible, happy to work on that.
- os.readv is not supported on WASI, so need to add conditional code.
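
The conditional code mentioned above could look something like the sketch below. The names read_into and _HAVE_READV are hypothetical and the fallback is an assumption about one reasonable shape, not what _pyio does: where os.readv is missing, it falls back to os.read and pays for one extra copy.

```python
import os

# os.readv is only defined where the platform provides readv(2).
_HAVE_READV = hasattr(os, "readv")


def read_into(fd, buffer, use_readv=_HAVE_READV):
    """Hypothetical helper: fill a writable buffer from fd.

    Returns the number of bytes read (0 at end of file).
    """
    if use_readv:
        # The data is written straight into the caller's buffer.
        return os.readv(fd, [buffer])
    # Fallback: os.read allocates a new bytes object, which is then
    # copied into the caller's buffer.
    with memoryview(buffer) as view, view.cast("B") as flat:
        data = os.read(fd, flat.nbytes)
        flat[:len(data)] = data
    return len(data)
```

Keeping the check in one helper means callers always work with a caller-allocated buffer, and only the fallback path makes the extra copy.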

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

