Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork11.9k
ENH: implementation of array_reduce_ex#12011
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
d383f8a to57c4304Comparenumpy/core/src/multiarray/methods.c Outdated
| /* if the python version is below 3.8, the pickle module does not provide | ||
| * built-in support for protocol 5. We try importing the pickle5 | ||
| * backport instead */ | ||
| if (PY_VERSION_HEX<0x03080000){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Rather than a runtime condition, it should be possible to use a build time preprocessor directive here.
pierreglaserSep 25, 2018 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Yes, although it was not necessary for the code to compile - definitely can do it in further iterations, but did not want to generate any misunderstandings :)
numpy/core/src/multiarray/methods.c Outdated
| { | ||
| intprotocol; | ||
| PyObject*ret=NULL,*numeric_mod=NULL,*from_buffer_func=NULL; | ||
| PyObject*buffer_tuple=NULL,*pickle_module=NULL,*pickle_class=NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
spaces here, fixing it.
ogrisel commentedSep 27, 2018
Thank you very much for the new benchmarks. They look good: there is no more spurious memory copies when pickling and loading contiguous numpy arrays when using protocol 5 either with in-band or out-of-band buffers. |
ogrisel commentedSep 27, 2018
Next steps:
|
ahaldane commentedSep 30, 2018 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
I don't think we can modify the signature of Otherwise looks fine so far, at inital skim. |
numpy/core/src/multiarray/methods.c Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
It's not necessary to try to importpickle5 whenprotocol < 5. In particulardata.dump(f) should always work for Python 3.6 and 3.7 even if thepickle5 is not installed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
We actually need an explicit test for those cases.
e5a1de5 to40b81bfComparenumpy/core/numeric.py Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
sys.version_info[1] in [6, 7]
numpy/core/tests/test_dtype.py Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This test should not raise ImportError if pickle5 is not installed in Python 3.7 or 3.6: pickle5 should be an optional dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
specifically this file or all tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
To avoid replicating complex import everywhere, you could probably do:
fromnumpy.core.numericimportpickle
numpy/core/tests/test_records.py Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Same comment here: pickle5 should never be a mandatory test dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
fromnumpy.core.numericimportpickle
numpy/core/tests/test_ufunc.py Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I don't think the local import is necessary. I think this can be moved to the top level as:
from numpy.core.numeric import pickletools/travis-test.sh Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Instead of hardcoding the installation ofpickle5 based on the Python version number, please use an explicit environment variable
if ["$INSTALL_PICKLE5"=="1" ];then pip install pickle5fi
and then set that variable on some of the Python 3.6 and/or 3.7 builds in.travis.yml.
a5faa92 tob415bacCompare
ogrisel left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Another batch of comments:
.appveyor.yml Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
No need to set the variable when it's 0-valued.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
ActuallyINSTALL_PICKLE5=0 will not work as expected with the following bash conditions fromtools/travis-test.sh:
if [ -n "$INSTALL_PICKLE5" ]; then pip install pickle5fiso allINSTALL_PICKLE5: 0 occurrences must be removed from this configuration file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Actually this is appveyor. But still I think we should not have to setINSTALL_PICKLE5: 0 explicitly.
numpy/core/src/multiarray/methods.c Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I would change this comment to:
The PickleBuffer class from version 5 of the pickle protocol can only be used for arrays backed by a contiguous data buffer. For all other cases we fallback to the generic array_reduce method that involves using a temporary bytes allocation. However we do not call array_reduce directly but instead lookup and call the__reduce__ method to make sure that it's possible customize pickling in sub-classes.
numpy/core/tests/test_multiarray.py Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
nitpick: I think the style for binary operators in numpy is to use whitespaces around operators:
forprotoinrange(2,pickle.HIGHEST_PROTOCOL+1): ...
this issue is repeated many times.
pierreglaserOct 3, 2018 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I believe you, however:
- I don't see any mention of it in the docs (https://docs.scipy.org/doc/numpy-1.15.0/dev/index.html )
- PEP8 does not explicitly recommendsystematically adding spaces between the
+binary operator (https://www.python.org/dev/peps/pep-0008/)
I can change it, but there was no easy way for me to infer that Ihad to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Yes I agree this is not mandatory. But I looked around and it seems that the numpy convention is to use spaces more often than not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
numpy/core/tests/test_ufunc.py Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
unused import?
shippable.yml Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
If the python version is 3.5 or 3.8 this command will fail.
The test should pass if pickle5 is not there. I think this can be removed now.
ogrisel left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Some more comments:
numpy/core/tests/test_multiarray.py Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Instead of passing the test, this should be skipped with pytest:
@pytest.mark.skipif(pickle.HIGHEST_PROTOCOL<5,reason="requires pickle protocol 5")deftest_array_with_dtype_has_object_pickling(self):my_object=object() ...
numpy/core/tests/test_multiarray.py Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Same here, let's use@pytest.mark.skipif:
@pytest.mark.skipif(pickle.HIGHEST_PROTOCOL<5,reason="requires pickle protocol 5")deftest_pickle_with_buffercallback(self): ...
numpy/core/tests/test_multiarray.py Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I would rename this totest_record_array_with_object_dtype. There is already "pickling" in the test class name.
ogrisel commentedOct 3, 2018 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
By playing this PR I noticed the following: >>>importnumpyasnp>>>importpickle5>>>a=np.arange(12) Let's mark the array read-only and then pickle it: >>>a.flags.writeable=False>>>pickle5.loads(a.dumps(protocol=2)).flags.writeableTrue The writeable flag is lost: it means that currently (in numpy master and all past versions), a readonly array would be loaded back as a read-write array. I think this is a bug of numpy. (I actually checked that I could reproduce the behavior with the last stable release of numpy). On the contrary, the >>>pickle5.loads(a.dumps(protocol=5)).flags.writeableFalse @ahaldane@charris do you confirm that the fact that a |
ahaldane commentedOct 3, 2018
My initial impression is it's not a bug: We don't preserve the writeable flag by default when we construct a new array from another one by |
numpy/core/numeric.py Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I had not realized that you could reshape with order. So it's better to passorder itself directly as an argument to_frombuffer instead of thetranspose flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I would renamec_contiguous_array toflat_array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
But actually that would just become a one liner:
def_frombuffer(buffer,dtype,shape,order):returnfrombuffer(buffer,dtype=dtype).reshape(shape,order=order)
numpy/core/src/multiarray/methods.c Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
We cannot do this, without setting a proper error message.
For instance with Python 3.5 I get the following with the current implementation:
>>>importnumpyasnp>>>a=np.arange(12)>>>s=a.dumps(protocol=6)Traceback (mostrecentcalllast):File"<ipython-input-9-2e271e7e7bc8>",line1,in<module>s=a.dumps(protocol=6)ValueError:pickleprotocolmustbe<=4>>>s=a.dumps(protocol=5)Traceback (mostrecentcalllast):File"<ipython-input-10-7689e2059d65>",line1,in<module>s=a.dumps(protocol=5)SystemError:<built-inmethoddumpsofnumpy.ndarrayobjectat0x7f2a181c22b0>returnedNULLwithoutsettinganerror
For Python <= 3.5, we need to raise:
ValueError("pickle protocol must be <= %d"%pickle.HIGEST_PROTOCOL)
ogrisel commentedOct 3, 2018
@ahaldane thanks for the feedback. However I hope that you don't consider a bug that this PR would preserve the writeable flag for contiguous array with the protocol 5? The fact that this behavior is not consistent if the array is contiguous or not might be surprising to the users though. |
ahaldane commentedOct 3, 2018
OK, I just reread the PEP and I see what you're getting at: Now that an unpickled ndarray might be just a view of the original pickled ndarray, we have to maintain the writeable flags to avoid creating writeable views of arrays which have writeable=False. That seems like a strong argument for maintaining the writeable flag. One might object that it's a bit strange that whether or not an unpickled ndarray is writeable depends on the pickle protocol and whether it was saved to disk or not. But on the other hand, maybe it's easy to explain to people: Protocol <5 always returns a copy, so it is always writeable, while protocol 5 can return a view in which case users should expect it to sometimes be unwriteable. That's the same as many other numpy operations where copies are writeable while views might not be, eg fancy indexing vs slice indexing. In summary, the way you have it sounds fine. |
7c021e3 to3878df8Comparepierreglaser commentedOct 5, 2018 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
So a little clarification on the error handling:
For Otherwise, we let pickle raise its own error messages, that tell the right information. With the code used right now, there will be no This is all being tested by the way, yielding a drop in the coverage because some tests are mutually exclusive on the python version, but the coverage report is specified using only python 3.6 |
ogrisel commentedOct 5, 2018
We might want to enable coverage collection on Python 3.5 for this reason, no? |
numpy/core/tests/test_dtype.py Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
You changed the style of the spacing around the operator for no reason. Please stay consistent with the existing code style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I'll go all over therange(...) iterations on the pickle protocol and change them all to one unique style.
numpy/core/tests/test_multiarray.py Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This is an implementation detail not worth testing in my opinion. Instead please check that you can loads the pickles and check that the dtypes are correct.
tylerjereddy commentedOct 5, 2018
We'd just have to be a little cautious since some projects (i.e., SciPy) appear to have reached the codecov data limit, and then nothing works. |
mattip commentedOct 8, 2018
ogrisel commentedOct 9, 2018
Also the changes in the |
37dd980 to4084decComparepierreglaser commentedOct 10, 2018 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
UPDATE:
ifsys.version_info[0]>=3:ifsys.version_info[1]in (6,7):try:importpickle5aspickleexceptImportError:importpickleelse:importpickleelse:importcPickleaspickle each time we import pickle in python, we import itonce this way, in Now,
This is what the first commit is about. The second commit implements Happy to hear your suggestions, |
4084dec to7fce5eaComparecharris commentedOct 10, 2018
Needs a release note. |
7fce5ea to64a855fComparecharris commentedOct 10, 2018
Any thoughts about adding this to the NumPy API as some point? |
ogrisel commentedOct 11, 2018
Great work. Thanks everyone! @pitrou are you confident that PEP 574 will be accepted and your protocol 5 branch be merged in CPython before the release of Python 3.8? Otherwise some assumptions made in the code of this PR might be slightly off although they should not cause any breakage (I believe).
Can you be more specific? I don't think I understand what you are suggesting. |
pitrou commentedOct 11, 2018 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
I don't think the PEP will be controversial. The main uncertainty at this point is around the schedule for choosing and putting the new governance model in place. But I don't expect 3.8 to be feature-frozen before that is done anyway :-) |
| array=np.arange(10) | ||
| buffers= [] | ||
| bytes_string=pickle.dumps(array,buffer_callback=buffers.append, | ||
| protocol=5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
@pitrou in the current draft of the PEP states thatbuffer_callback theexample can accept a list of buffers withbuffer_callback=buffers.extend. However, the implementation we test here only accepts a single buffer at a time. This is in line with this section of the PEP:https://www.python.org/dev/peps/pep-0574/#passing-a-sequence-of-buffers-in-buffer-callback
I think the example of the PEP should be fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Thanks! You're right, I forgot to fix this.
ogrisel commentedOct 11, 2018
For reference, using the current numpy master and the latest stable release of pandas, I confirm that pickle protocol 5 makes it possible to have no-copy semantics for pandas dataframes without changing anything in pandas: >>>importpandasaspd>>>importnumpyasnp>>>importpickle5aspickle>>>df=pd.DataFrame({'a':np.random.randn(3),'b':np.random.randint(0,2,3)})>>>buffers= []>>>df_pkl=pickle.dumps(df,protocol=5,buffer_callback=buffers.append)>>>buffers[<pickle.PickleBufferobjectat0x7f32f7eea5c8>,<pickle.PickleBufferobjectat0x7f32f7e5f848>]>>>pickle.loads(df_pkl,buffers=buffers)ab0-0.577441110.168650120.7832610 |
Uh oh!
There was an error while loading.Please reload this page.
Follow up on#11161.
In order to enable no-copy pickling as mentioned in thePEP 574 proposal,
numpy arrays need to have:
__reduce_ex__method, returning metadata and a buffer on their internal datadtypeand ashapeThe first one is mostly inspired from the
array_reducemethod incore/src/multiarray/methods.cThe second one is a wrapper written in python around
np.frombuffer(..).reshape(..)Here is the decrease in peak memory and cpu time:

pinging@ogrisel and@pitrou as of now.