ENH: implementation of array_reduce_ex (#12011)


Merged

charris merged 2 commits into numpy:master from pierreglaser:implement-reduce-ex on Oct 10, 2018

Conversation

@pierreglaser (Contributor) commented Sep 21, 2018 (edited):
Follow-up on #11161.
In order to enable no-copy pickling as mentioned in the PEP 574 proposal, numpy arrays need to have:

  • a __reduce_ex__ method, returning metadata and a buffer on their internal data
  • a method that reconstructs an array from a buffer, a dtype, and a shape

The first one is mostly inspired by the array_reduce method in core/src/multiarray/methods.c.

The second one is a wrapper written in Python around np.frombuffer(..).reshape(..).
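The reconstructor described above can be sketched in a few lines of Python. The name `_frombuffer` and the exact signature are illustrative here, not necessarily the final code in the PR:

```python
import numpy as np

# Illustrative reconstructor: rebuild an array from a raw buffer plus
# its dtype and shape, without copying the buffer's data.
def _frombuffer(buffer, dtype, shape):
    return np.frombuffer(buffer, dtype=dtype).reshape(shape)

# Round-trip an array through a bytes buffer.
a = np.arange(12, dtype=np.int64).reshape(3, 4)
b = _frombuffer(a.tobytes(), a.dtype, a.shape)
assert (a == b).all()
```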

Here is the decrease in peak memory and CPU time:

[benchmark results image]

Pinging @ogrisel and @pitrou for now.

/* If the Python version is below 3.8, the pickle module does not provide
 * built-in support for protocol 5. We try importing the pickle5
 * backport instead. */
if (PY_VERSION_HEX < 0x03080000) {
Contributor:

Rather than a runtime condition, it should be possible to use a build time preprocessor directive here.

@pierreglaser (Contributor, Author) commented Sep 25, 2018 (edited):
Yes, although it was not necessary for the code to compile - definitely can do it in further iterations, but did not want to generate any misunderstandings :)

{
    int protocol;
    PyObject *ret = NULL, *numeric_mod = NULL, *from_buffer_func = NULL;
    PyObject *buffer_tuple = NULL, *pickle_module = NULL, *pickle_class = NULL;
@pierreglaser (Contributor, Author):

Spaces here, fixing it.

@ogrisel (Contributor) commented:

Thank you very much for the new benchmarks. They look good: there are no more spurious memory copies when pickling and loading contiguous numpy arrays using protocol 5, with either in-band or out-of-band buffers.

@ogrisel (Contributor) commented:

Next steps:


@ahaldane (Member) commented Sep 30, 2018 (edited):

I don't think we can modify the signature of PyArray_FromBuffer: it is part of the numpy C-API, which we are very reluctant to modify, so it's better to go back to the wrapper function you had before.

Otherwise looks fine so far, at initial skim.

Contributor:

It's not necessary to try to import pickle5 when protocol < 5. In particular, data.dump(f) should always work for Python 3.6 and 3.7 even if pickle5 is not installed.

Contributor:

We actually need an explicit test for those cases.

Contributor:

sys.version_info[1] in [6, 7]

Contributor:

This test should not raise ImportError if pickle5 is not installed in Python 3.7 or 3.6: pickle5 should be an optional dependency.

Contributor (Author):

Specifically this file, or all tests?

Contributor:

To avoid replicating the complex import everywhere, you could probably do:

from numpy.core.numeric import pickle

Contributor:

Same comment here: pickle5 should never be a mandatory test dependency.

Contributor:

from numpy.core.numeric import pickle

Contributor:

I don't think the local import is necessary. I think this can be moved to the top level as:

from numpy.core.numeric import pickle

Contributor:

Instead of hardcoding the installation of pickle5 based on the Python version number, please use an explicit environment variable:

if [ "$INSTALL_PICKLE5" == "1" ]; then
    pip install pickle5
fi

and then set that variable on some of the Python 3.6 and/or 3.7 builds in .travis.yml.

@pierreglaser changed the title from "WIP: implementation of array_reduce_ex" to "ENH: implementation of array_reduce_ex" on Oct 3, 2018
@ogrisel (Contributor) left a comment:

Another batch of comments:

.appveyor.yml Outdated
Contributor:

No need to set the variable when it's 0-valued.

Contributor:

Actually, INSTALL_PICKLE5=0 will not work as expected with the following bash condition from tools/travis-test.sh:

if [ -n "$INSTALL_PICKLE5" ]; then
    pip install pickle5
fi

so all INSTALL_PICKLE5: 0 occurrences must be removed from this configuration file.

Contributor:

Actually, this is appveyor. But still, I think we should not have to set INSTALL_PICKLE5: 0 explicitly.

Contributor:

I would change this comment to:

The PickleBuffer class from version 5 of the pickle protocol can only be used for arrays backed by a contiguous data buffer. For all other cases we fall back to the generic array_reduce method, which involves a temporary bytes allocation. However, we do not call array_reduce directly but instead look up and call the __reduce__ method, to make sure that it's possible to customize pickling in subclasses.
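The dispatch suggested in that comment can be sketched in Python. This is a hypothetical illustration of the logic, not numpy's actual C implementation, and the return values are placeholders:

```python
import pickle
import numpy as np

def reduce_ex_sketch(arr, protocol):
    # PickleBuffer is only usable for protocol >= 5 and arrays backed by
    # a contiguous data buffer; everything else falls back to the generic
    # __reduce__ path, which goes through a temporary bytes copy.
    if protocol >= 5 and (arr.flags.c_contiguous or arr.flags.f_contiguous):
        return "picklebuffer", pickle.PickleBuffer(arr)
    return "copy", arr.__reduce__()

print(reduce_ex_sketch(np.arange(6), 5)[0])       # contiguous, proto 5
print(reduce_ex_sketch(np.arange(6)[::2], 5)[0])  # strided view: falls back
```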

Contributor:

Nitpick: I think the style for binary operators in numpy is to use whitespace around operators:

for proto in range(2, pickle.HIGHEST_PROTOCOL + 1):
    ...

This issue is repeated many times.

@pierreglaser (Contributor, Author) commented Oct 3, 2018 (edited):

I believe you, however:

I can change it, but there was no easy way for me to infer that I had to do it.

Contributor:

Yes, I agree this is not mandatory. But I looked around, and it seems that the numpy convention is to use spaces more often than not.

Contributor:

unused import?

shippable.yml Outdated
Contributor:

If the Python version is 3.5 or 3.8, this command will fail.

The test should pass if pickle5 is not there. I think this can be removed now.

@ogrisel (Contributor) left a comment:

Some more comments:

Contributor:

Instead of passing the test, this should be skipped with pytest:

@pytest.mark.skipif(pickle.HIGHEST_PROTOCOL < 5,
                    reason="requires pickle protocol 5")
def test_array_with_dtype_has_object_pickling(self):
    my_object = object()
    ...

Contributor:

Same here, let's use @pytest.mark.skipif:

@pytest.mark.skipif(pickle.HIGHEST_PROTOCOL < 5,
                    reason="requires pickle protocol 5")
def test_pickle_with_buffercallback(self):
    ...

Contributor:

I would rename this to test_record_array_with_object_dtype. There is already "pickling" in the test class name.

@ogrisel (Contributor) commented Oct 3, 2018 (edited):

While playing with this PR, I noticed the following:

>>> import numpy as np
>>> import pickle5
>>> a = np.arange(12)

Let's mark the array read-only and then pickle it:

>>> a.flags.writeable = False
>>> pickle5.loads(a.dumps(protocol=2)).flags.writeable
True

The writeable flag is lost: it means that currently (in numpy master and all past versions), a read-only array would be loaded back as a read-write array. I think this is a bug in numpy. (I actually checked that I could reproduce the behavior with the last stable release of numpy.)

On the contrary, the PickleBuffer class of protocol 5 propagates the readonly/writeable state, and therefore:

>>> pickle5.loads(a.dumps(protocol=5)).flags.writeable
False

@ahaldane @charris do you confirm that the fact that a dumps/loads cycle does not preserve the writeable flag can be considered a bug?

@ahaldane (Member) commented:

My initial impression is it's not a bug: we don't preserve the writeable flag by default when we construct a new array from another one with b = np.array(a), even though we retain other properties like the memory order. To me, doing a save + load is like constructing a new buffer from an old one, like the np.array call.

Contributor:

I had not realized that you could reshape with order. So it's better to pass order itself directly as an argument to _frombuffer instead of the transpose flag.

Contributor:

I would rename c_contiguous_array to flat_array.

Contributor:

But actually that would just become a one-liner:

def _frombuffer(buffer, dtype, shape, order):
    return frombuffer(buffer, dtype=dtype).reshape(shape, order=order)

@ogrisel (Contributor) commented Oct 3, 2018 (edited):

We cannot do this without setting a proper error message.

For instance, with Python 3.5 I get the following with the current implementation:

>>> import numpy as np
>>> a = np.arange(12)
>>> s = a.dumps(protocol=6)
Traceback (most recent call last):
  File "<ipython-input-9-2e271e7e7bc8>", line 1, in <module>
    s = a.dumps(protocol=6)
ValueError: pickle protocol must be <= 4
>>> s = a.dumps(protocol=5)
Traceback (most recent call last):
  File "<ipython-input-10-7689e2059d65>", line 1, in <module>
    s = a.dumps(protocol=5)
SystemError: <built-in method dumps of numpy.ndarray object at 0x7f2a181c22b0> returned NULL without setting an error

For Python <= 3.5, we need to raise:

ValueError("pickle protocol must be <= %d" % pickle.HIGHEST_PROTOCOL)

@ogrisel (Contributor) commented:

@ahaldane thanks for the feedback. However, I hope you don't consider it a bug that this PR would preserve the writeable flag for contiguous arrays with protocol 5?

The fact that this behavior differs depending on whether the array is contiguous or not might be surprising to users, though.

@ahaldane (Member) commented:

OK, I just reread the PEP and I see what you're getting at: Now that an unpickled ndarray might be just a view of the original pickled ndarray, we have to maintain the writeable flags to avoid creating writeable views of arrays which have writeable=False.

That seems like a strong argument for maintaining the writeable flag.

One might object that it's a bit strange that whether or not an unpickled ndarray is writeable depends on the pickle protocol and whether it was saved to disk or not. But on the other hand, maybe it's easy to explain to people: protocol <5 always returns a copy, so it is always writeable, while protocol 5 can return a view, in which case users should expect it to sometimes be unwriteable. That's the same as many other numpy operations where copies are writeable while views might not be, e.g. fancy indexing vs. slice indexing.

In summary, the way you have it sounds fine.


@pierreglaser (Contributor, Author) commented Oct 5, 2018 (edited):

So, a little clarification on the error handling:
I think we need personalized messages for Python 3.6 and 3.7, alerting users that the highest possible protocol is 5, and not 4, as pickle would report. Using numpy.ndarray.dumps, numpy.ndarray.dump, and numpy.ndarray.__reduce_ex__:

  • if protocol > 5, a ValueError will be raised saying 'pickle protocol must be <= 5'
  • if protocol == 5 and pickle5 is not installed, an ImportError will be raised telling the user that pickle5 is needed

For __reduce_ex__, we raise a ValueError if protocol > 5, and if protocol == 5 and sys.version_info[:2] < (3, 6).

Otherwise, we let pickle raise its own error messages, which convey the right information. With the code used right now, there will be no SystemError anymore, only pickle error messages or the messages stated above.
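The validation rules above can be sketched as a small Python helper. The function name and error wording are illustrative; this only mirrors the rules described in the comment, not numpy's exact implementation:

```python
import pickle

def check_protocol(protocol):
    # Rule 1: protocol > 5 is always rejected with a ValueError.
    if protocol > 5:
        raise ValueError("pickle protocol must be <= 5")
    # Rule 2: protocol 5 on interpreters whose stdlib pickle caps below 5
    # needs the pickle5 backport; raise ImportError when it is missing.
    if protocol == 5 and pickle.HIGHEST_PROTOCOL < 5:
        try:
            import pickle5  # noqa: F401
        except ImportError:
            raise ImportError("pickle protocol 5 requires the pickle5 backport")
```

For any supported protocol, the helper simply returns None; the error paths only trigger on the two cases listed above.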

This is all tested, by the way, yielding a drop in coverage because some tests are mutually exclusive on the Python version, but the coverage report is generated using only Python 3.6.

@ogrisel (Contributor) commented:

This is all being tested by the way, yielding a drop in the coverage because some tests are mutually exclusive on the python version, but the coverage report is specified using only python 3.6

We might want to enable coverage collection on Python 3.5 for this reason, no?

Contributor:

You changed the style of the spacing around the operator for no reason. Please stay consistent with the existing code style.

Contributor (Author):

I'll go over all the range(...) iterations on the pickle protocol and change them all to one unique style.

Contributor:

This is an implementation detail not worth testing, in my opinion. Instead, please check that you can load the pickles and that the dtypes are correct.

@tylerjereddy (Contributor) commented:

codecov aggregates all the coverage stats and reports the union of coverage reports.

We'd just have to be a little cautious since some projects (i.e., SciPy) appear to have reached the codecov data limit, and then nothing works.

@mattip (Member) commented:

Now that #12090 and #12091 have been merged, this should be rebased off master and squashed to a smaller number (1?) of relevant commits.


@ogrisel (Contributor) commented:

Also, the changes to ndarray.dump(s) should be removed from this PR.

@pierreglaser force-pushed the implement-reduce-ex branch 2 times, most recently from 37dd980 to 4084dec on October 10, 2018 12:17
@pierreglaser (Contributor, Author) commented Oct 10, 2018 (edited):

UPDATE:
Just rebased the PR. I made 2 commits though (and not one), because there were still some cleanups to do about pickle that were not included in #12090. Most notably:

if sys.version_info[0] >= 3:
    if sys.version_info[1] in (6, 7):
        try:
            import pickle5 as pickle
        except ImportError:
            import pickle
    else:
        import pickle
else:
    import cPickle as pickle

Each time we import pickle in Python, we import it once this way, in numpy.core.numeric, and then each time pickle is used in other files, we import it from numpy.core.numeric. The only place we import pickle another way is in numpy/core/setup.py.

Now, numpy should be clean pickle-wise, meaning:

  • there are no more np.ndarray.dump(s) calls in the test suite.
  • there is consistent pickle importing behavior.
  • each time pickle.dumps is tested, it is tested for every protocol.
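The "every protocol" pattern from the last bullet can be sketched as a round-trip loop (an illustration of the testing pattern, not a verbatim test from the PR):

```python
import pickle
import numpy as np

# Round-trip an array under each pickle protocol, from 2 up to the
# highest protocol the running interpreter supports.
a = np.arange(10)
for proto in range(2, pickle.HIGHEST_PROTOCOL + 1):
    b = pickle.loads(pickle.dumps(a, protocol=proto))
    assert (a == b).all()
```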

This is what the first commit is about.

The second commit implements array_reduce_ex, as well as some tests to check the good behavior of pickle protocol 5 on numpy arrays. This PR can thus easily be broken into two smaller PRs, one for each commit.
Preventively, I created a PR (#12133) for the first commit only, on which this branch can be further rebased.

Happy to hear your suggestions,
Pierre

@charris (Member) commented:

Needs a release note.

@charris added the "component: numpy._core", "component: numpy.ma (masked arrays)", and "56 - Needs Release Note (needs an entry in doc/release/upcoming_changes)" labels on Oct 10, 2018
@charris removed the "56 - Needs Release Note" label on Oct 10, 2018
@charris merged commit 2ed08ba into numpy:master on Oct 10, 2018
@charris (Member) commented:

Any thoughts about adding this to the NumPy API at some point?

@ogrisel (Contributor) commented:

Great work. Thanks everyone!

@pitrou are you confident that PEP 574 will be accepted and your protocol 5 branch merged into CPython before the release of Python 3.8? Otherwise some assumptions made in the code of this PR might be slightly off, although they should not cause any breakage (I believe).

Any thoughts about adding this to the NumPy API at some point?

Can you be more specific? I don't think I understand what you are suggesting.

@pitrou (Member) commented Oct 11, 2018 (edited):

@pitrou are you confident that PEP 574 will be accepted and your protocol 5 branch be merged in CPython before the release of Python 3.8?

I don't think the PEP will be controversial. The main uncertainty at this point is around the schedule for choosing and putting the new governance model in place. But I don't expect 3.8 to be feature-frozen before that is done anyway :-)

array = np.arange(10)
buffers = []
bytes_string = pickle.dumps(array, buffer_callback=buffers.append,
                            protocol=5)
@ogrisel (Contributor) commented Oct 11, 2018 (edited):

@pitrou the current draft of the PEP states that the buffer_callback example can accept a list of buffers with buffer_callback=buffers.extend. However, the implementation we test here only accepts a single buffer at a time. This is in line with this section of the PEP: https://www.python.org/dev/peps/pep-0574/#passing-a-sequence-of-buffers-in-buffer-callback

I think the example of the PEP should be fixed.

Member:

Thanks! You're right, I forgot to fix this.

@ogrisel (Contributor) commented:

For reference, using the current numpy master and the latest stable release of pandas, I confirm that pickle protocol 5 makes it possible to have no-copy semantics for pandas dataframes without changing anything in pandas:

>>> import pandas as pd
>>> import numpy as np
>>> import pickle5 as pickle
>>> df = pd.DataFrame({'a': np.random.randn(3),
...                    'b': np.random.randint(0, 2, 3)})
>>> buffers = []
>>> df_pkl = pickle.dumps(df, protocol=5, buffer_callback=buffers.append)
>>> buffers
[<pickle.PickleBuffer object at 0x7f32f7eea5c8>,
 <pickle.PickleBuffer object at 0x7f32f7e5f848>]
>>> pickle.loads(df_pkl, buffers=buffers)
          a  b
0 -0.577441  1
1  0.168650  1
2  0.783261  0

@pitrou (Member) commented:

@ogrisel This is great :-) I suspect @mrocklin will be interested.


Reviewers

@mattip, @tylerjereddy, @ogrisel, @pitrou (+2 more) left review comments.

7 participants: @pierreglaser, @ogrisel, @ahaldane, @mattip, @tylerjereddy, @charris, @pitrou
