Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork33.7k
Closed
Description
Feature or enhancement
Proposal:
Currently_pyio uses ~2x as much memory to read all data from a file compared to _io. This is because it makes more than one copy of the data.
Details from test_fileio run
$ ./python -mtest -M8g -uall test_largefile -m test_large_read -vvv== CPython 3.14.0a4+ (heads/main-dirty:3829104ab41, Jan 17 2025, 21:40:47) [Clang 19.1.6 ]== Linux-6.12.9-arch1-1-x86_64-with-glibc2.40 little-endian== Python build: debug== cwd:<$HOME>/python/build/build/test_python_worker_32392æ== CPU count: 32== encodings: locale=UTF-8 FS=utf-8== resources: allUsing random seed: 17400566130:00:00 load avg: 0.53 Run 1test sequentiallyin a single process0:00:00 load avg: 0.53 [1/1] test_largefiletest_large_read (test.test_largefile.CLargeFileTest.test_large_read) ... ... expected peak memory use: 4.7G ... process data size: 2.3Goktest_large_read (test.test_largefile.PyLargeFileTest.test_large_read) ... ... expected peak memory use: 4.7G ... process data size: 2.3G ... process data size: 4.3G ... process data size: 4.7Gok----------------------------------------------------------------------Ran 2 testsin 3.711sOK== Tests result: SUCCESS ==1test OK.Total duration: 3.7 secTotal tests: run=2 (filtered)Totaltest files: run=1/1 (filtered)Result: SUCCESS
Plan:
- Switch to
os.readv()os.readinto()to do readinto like C_Py_readused by_iodoes.os.read()can't take a buffer to use. This aligns behavior between_io.FileIO.readalland_pyio.FileIO.readall.os.readvworks well today and takes a caller allocated buffer rather than needing to add a newosAPI.readv(2)mirrors the behavior and errors ofread(2), so this should keep the same end behavior. - Update
_pyio.BufferedIOto not force a copy of the buffer for readall when its internal buffer is empty. Currently italways slices its internal buffer then adds the result of_pyio.FileIO.readallto it.
For iterating, I'm using a small tracemalloc script to find where copies are:
from_pyioimportopenimporttracemallocwithopen("README.rst",'rb')asfile:tracemalloc.start()data=file.read()snap=tracemalloc.take_snapshot()stats=snap.statistics('lineno')forstatinstats:print(stat)
Loose Ends
os.readvseems to be well supported but is currently guarded by a configure check. I'd like to just make pyio requirereadv, but can do conditional code if needed. If makingreadvnon-optional generally is feasible, happy to work on that.os.readvis not supported on WASI, so need to add conditional code.
Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
No response
Linked PRs
- gh-129005: Avoid copy in _pyio.FileIO.readinto #129006
- gh-129005: Avoid copy in _pyio.FileIO.readinto #129324
- gh-129005: Align FileIO.readall allocation #129424
- gh-129005: _pyio.BufferedIO remove copy on readall #129454
- gh-129005: Align FileIO.readall allocation #129458
- gh-129005: Remove copy in
_pyio.FileIO.readall()#129496 - Revert "gh-129005: _pyio.BufferedIO remove copy on readall (#129454)" #129500
- gh-129005: Fix buffer expansion in _pyio.FileIO.readall #129541
- Revert "gh-129005: Align FileIO.readall() allocation (#129458)" #129572
- gh-129005: Update _pyio.BytesIO to use bytearray.resize on write #129702
- gh-129005: Align FileIO.readall between _pyio and _io #129705
- gh-129005: Move bytearray to use bytes as a buffer #130563
- gh-129005: Remove copies from _pyio using take_bytes #141539