GH-98363: Have batched() return tuples #100118
Conversation
I think it might be safer to call `_PyTuple_Resize` on the final batch when it comes up short.
I thought about that as well, but what we're doing is reasonable. It just leaves the tuple over-allocated, which does not defeat any subsequent reuse.
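For readers following along, a pure-Python sketch of `batched()` (close to the recipe in the itertools docs) makes the short final batch under discussion concrete. In this sketch the final tuple is built at exactly the right size; the C implementation pre-allocates the full batch size, which is what the resize question is about:

```python
from itertools import islice

def batched_py(iterable, n):
    """Pure-Python sketch of batched(); yields tuples of length <= n."""
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(iterable)
    # tuple(islice(it, n)) naturally sizes the last batch exactly.
    while batch := tuple(islice(it, n)):
        yield batch
```

For example, `list(batched_py('ABCDEFG', 3))` gives `[('A', 'B', 'C'), ('D', 'E', 'F'), ('G',)]`.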
rhettinger commented Dec 8, 2022 • edited
Hmph, I pushed the green button without giving you a chance to respond. Sorry about that. If you feel strongly about the resize, we can still add it.
No worries. I agree that there is no genuine "safety" concern, just a question of memory overuse here and there.

This was my fear: someone may be doing an operation that involves very high per-batch overhead, and so they choose a large batch size, say 0.1% of working memory. I've written code roughly like this before, and probably would have used `batched()` for it:

```python
BATCHSIZE = 1_000_000

def get_data(db_cursor, ids):
    for batch in batched(ids, BATCHSIZE):
        db_cursor.executemany(SQL_GET_DATA, batch)
        yield from db_cursor.fetchall()

def file_pairs():
    ...

for in_file, out_file in file_pairs():
    ids = get_data_from_file(in_file)
    for row in get_data(db_cursor, ids):
        append_data_to_file(out_file, row)
```

If there are many files whose id counts leave a small final batch, the over-allocated leftover tuples can pile up. There are some mitigating factors, though.
But this data-dependence leads to another issue: the code seems like it ought to take a large constant amount of memory, but for some obscure inputs (many small leftover batches), it could take 40000x as much memory, caught up in mysterious places like the tuple freelist.

I don't really know how to assess how probable these cases are, especially with the LIFO nature of our freelists, so my thought was to do the more predictable thing and be as conservative as possible with memory usage, giving things back to the operating system as soon as possible. But maybe my usage of huge batches is playing with fire anyway, so it would be understandable to keep the code as is for that reason. Another option would be to only do a resize when the final batch is substantially smaller than the requested size.

Thoughts?
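One reading of where a figure like 40000x can come from, using the `BATCHSIZE = 1_000_000` from the example above and a hypothetical 25-item leftover batch (illustrative numbers, not taken from the PR itself):

```python
# Back-of-envelope sketch: without a resize, a short final batch keeps
# pointer storage for all BATCHSIZE slots alive even though only a
# handful of items landed in it.
BATCHSIZE = 1_000_000
leftover = 25                      # hypothetical short final batch
allocated = BATCHSIZE * 8          # bytes of pointer slots kept alive
used = leftover * 8                # bytes actually needed
print(allocated // used)           # -> 40000
```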
The scenario seems bogus, but I can't divert any more time to this. So, I've put the resize in. I quite liked that the original code never needed a resize, but so be it.
On python-ideas, I got some pushback on the decision to return lists. After rereading the original thread and looking at the use cases with fresh eyes, it looks like tuples are a marginally better choice.

Using a tuple better matches tools like enumerate, zip, zip_longest, pairwise, and the combinatoric iterators. Tuples are better suited for cases where hashing is desired. And tuples will be a better fit if we ever add zip_longest-style `fillvalue` support.

The notion that people might want to mutate the result proved to be dubious. A quick code search for `more-itertools.chunked()` turned up no examples. If the need did arise, a list comprehension would likely be better than being tempted to mutate in place.

Mostly this is a usability/consistency decision, but there are slight performance benefits as well. Tuples have the data in the tuple object rather than in a separate data array, so tuples require only one memory allocation instead of two. Tuples use less memory than lists (two fewer fields and no losses due to data array alignment). Lookups are slightly faster because lists have one additional indirection step. Also, tuples have better cache performance: their unibody design lets them fit into a single cache line rather than two for lists.
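A small demonstration of both points, hashability and size, on CPython 3.12+ (where `itertools.batched` landed); the exact `getsizeof` numbers vary by version and platform:

```python
import sys
from itertools import batched  # new in Python 3.12

# Consistency with the other itertools: batches are tuples, so they hash
# and can go straight into a set or serve as dict keys.
assert list(batched('roygbiv', 3)) == [('r', 'o', 'y'), ('g', 'b', 'i'), ('v',)]
unique = set(batched('roygbiv', 3))      # lists would raise TypeError here

# Memory: a tuple stores its items inline; a list adds a separately
# allocated pointer array.  Sizes are platform-dependent, but the tuple
# is consistently smaller for the same number of items.
print(sys.getsizeof(tuple(range(10))))   # e.g. 120 on a 64-bit build
print(sys.getsizeof(list(range(10))))    # e.g. 136 on the same build
```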