tarfile's cache balloons in memory when streaming a big tarfile #102120

Closed
Labels: stdlib (Standard Library Python modules in the Lib/ directory), type-bug (An unexpected behavior, bug, or error)
@spenczar


Bug report

I've got a bunch of tar files containing millions of small files. Never mind how we got here - I need to process those tars, handling each of the files inside. Furthermore, I need to process a lot of these tars, and I'd like to do it relatively quickly on cheapish hardware, so I'm at least a little sensitive to memory consumption.

The natural thing to do is to iterate through the tarfile, extracting each file one at a time, and carefully closing them when done:

    with tarfile.open(filepath, "r:gz") as tar:
        for member in tar:
            file_buf = tar.extractfile(member)
            try:
                handle(file_buf)
            finally:
                file_buf.close()

This looks like it should handle each small file and discard it when done, so memory should stay pretty tame. I was very surprised to discover that this actually uses gigabytes of memory. That's fixed if you do this:

    with tarfile.open(filepath, "r:gz") as tar:
        for member in tar:
            file_buf = tar.extractfile(member)
            try:
                handle(file_buf)
            finally:
                file_buf.close()
            tar.members = []  # evil!

That works because tarfile.TarFile has a cache, self.members. That cache is appended to in TarFile.next(), which in turn is used in TarFile.__iter__.
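A quick way to see the cache at work is to build a throwaway archive in memory and look at the list after a full pass. This is only a minimal sketch: build_tar and cached_after_iteration are made-up helper names, and members is an undocumented attribute, so its behavior may differ between Python versions.

```python
import io
import tarfile


def build_tar(n_files: int) -> bytes:
    """Build a small uncompressed tar archive in memory with n_files tiny members."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i in range(n_files):
            payload = f"payload {i}".encode()
            info = tarfile.TarInfo(name=f"file_{i}.txt")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def cached_after_iteration(data: bytes) -> int:
    """Iterate over every member and report how many TarInfo objects were kept."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r") as tar:
        for _member in tar:
            pass  # no per-member work; iterating alone fills the cache
        return len(tar.members)
```

Calling cached_after_iteration(build_tar(50)) should report one retained header per member, even though the loop body never keeps a reference to any of them.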

That cache is storing TarInfos, which are headers describing each file. They're not very large, but with lots and lots of files, those headers can add up.
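To get a rough feel for how much the headers add up, one option is to compare traced allocations with and without clearing the cache. The sketch below uses tracemalloc from the standard library; build_tar and traced_growth are hypothetical names, the archive is synthetic, and the absolute numbers will vary by Python version and platform, so only the relative difference is meaningful.

```python
import io
import tarfile
import tracemalloc


def build_tar(n_files: int) -> bytes:
    """Build an in-memory tar archive of n_files empty members (headers only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i in range(n_files):
            tar.addfile(tarfile.TarInfo(name=f"file_{i}.txt"))
    return buf.getvalue()


def traced_growth(data: bytes, clear_cache: bool) -> int:
    """Return how many bytes remain allocated after iterating over the archive."""
    tracemalloc.start()
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r") as tar:
            before, _peak = tracemalloc.get_traced_memory()
            for _member in tar:
                if clear_cache:
                    # Same trick as in the report: drop the cached headers.
                    tar.members = []
            after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before
```

Comparing traced_growth(data, clear_cache=False) against traced_growth(data, clear_cache=True) for an archive with a few thousand members should show the retained headers accounting for most of the growth.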

The net result is that it's not possible to stream a tarfile's contents without memory growing linearly with the number of files in the tarfile. This has been partially addressed in the past (see #46334, from way back in 2008), but never fully resolved. It shows up in StackOverflow and probably elsewhere, with a clumsy recommended solution of resetting tar.members each time, but there ought to be a better way.
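Until something better exists, the reset can at least be kept in one place. The generator below is a sketch of that idea, not a proposed API: iter_files is a hypothetical name, it still relies on the undocumented members attribute, and it deliberately skips anything that is not a regular file, since resolving a link needs the very cache it throws away.

```python
import tarfile
from typing import IO, Iterator, Tuple


def iter_files(tar: tarfile.TarFile) -> Iterator[Tuple[tarfile.TarInfo, IO[bytes]]]:
    """Yield (member, file object) for each regular file without keeping headers."""
    for member in tar:
        # Drop cached headers right away; `members` is an undocumented attribute.
        tar.members = []
        if not member.isfile():
            continue  # links would need the cache to find their targets
        file_buf = tar.extractfile(member)
        if file_buf is None:
            continue  # defensive: extractfile() may return None for non-files
        try:
            yield member, file_buf
        finally:
            # Runs when the caller asks for the next item or closes the generator.
            file_buf.close()
```

With this in place the loop in the report shrinks to "for member, file_buf in iter_files(tar): handle(file_buf)", and each file object is closed as soon as the caller moves on to the next member.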

Your environment

CPython 3.10, mostly; I don't think OS or architecture etc are relevant.
