Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

ENH: lib: Allow usecols to be a callable in loadtxt().#21800

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Draft
WarrenWeckesser wants to merge15 commits intonumpy:main
base:main
Choose a base branch
Loading
fromWarrenWeckesser:usecols-callable

Conversation

WarrenWeckesser
Copy link
Member

@WarrenWeckesserWarrenWeckesser commentedJun 20, 2022
edited by seberg
Loading

This is the C version of#15995.

Actually, it is a subset of that PR. The new feature added here is to allowusecols to be callable. In this PR, I haven't included the option forusecols to be a slice, and I haven't included the convenience functionskip. Any column selection that could be done with a slice can also be done with a callable, and there are selections that a callable can make that cannot be done with a slice--i.e. the callable option is the most flexible. The smaller scope of this PR should make it easier to review; handling a slice can be added in a follow-up if there is interest.

Theskip function in#15995 is just a Python function, so it could simply be copied over from that PR to here if folks think it is important. Of course, it too could be added in a follow-up PR.

(seberg: xrefgh-13878)

Copy link
Member

@sebergseberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't have a strong opinion on the API addition, but it seems good to me and does allow further additions pretty nicely (by doing them in Python).

A few small comments on the code, the only bigger one is that I think we need to reject (or add logic?) for changes in the number of columnsn, since that would affect the result of the function.

// create the usecols array.
PyObject *seq = PyObject_CallFunction(usecols_obj, "n",
current_num_fields);
if (PyErr_Occurred()) {
Copy link
Member

@sebergsebergJun 20, 2022
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change
if (PyErr_Occurred()) {
if (seq==NULL) {

"sequence of ints, but it returned an instance of "
"type '%s'", Py_TYPE(seq)->tp_name);
Py_DECREF(seq);
goto error;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I am wondering if the "length check" can't be just part of the helper? We use the same error in any case?
Could even usePySequence_Fast in the helper, but that doesn't really matter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

That would mean that the helper should probably returnnum_usecols and fill inusecols (passed by pointer).

Py_DECREF(tmp);
}
return arr;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Can we move this toconversion_utils.c? There is similar functionality there already, and now that it is split out, maybe it is time to just put it there.

// This function owns usecols if a callable usecols_obj was given.
PyMem_Free(usecols);
usecols = NULL; // An overabundance of caution...
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Would it be ugly to move all usecols handling here? Its fine, but conditional ownership tens to be an anti-pattern.

Another way to spell would be to haveusecol_from_func = ... andusecol = usecol_from_func, so the owner is unconditionallyusecol_from_func and not "usecol" itself.

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Would it be ugly to move all usecols handling here?

No, I think that would be fine. In fact, with a bit more of the argument processing code moved intoread_rows(), I think the function_readtext_from_stream() can be eliminated, and_load_from_filelike() can callread_rows() directly. I'll give that shot, and push to the PR if it looks reasonable.

seberg reacted with thumbs up emoji
with pytest.raises(RuntimeError,
match='the number of fields in the given dtype'):
np.loadtxt(txt, usecols=lambda n: [0, 2, 3], dtype=dt)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

We are missing a test (and I think also the logic) to reject a change in the number of columns when a callable is given. This is a bit of a tricky case!

The problem is that ifn changes, the return value of theusecol function might also change! While normally, when usecols are given we do not have to worry about that at all.

The example would be something like:

np.loadtxt(["1 2", "1 2 3"], usecols=lambda n: range(n)[-2:])

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

As you can see in the code, I've been working under the assumption that once the first row has been read, the value ofn is determined once and for all. If given, the user-defined callable is called justonce, after the first row is read. If a structured dtype is given,n is the number of fields in the dtype; otherwisen is the number of columns found in the first row of the file. That's because I didn't think we had such strong support for ragged text files. I wish we didn't, and I don't recall the history of how we got here, but if that's where we are now, so be it. (There is no formal specification for the format of the text files that we support, so the scope ofloadtxt tends to creep in fits and starts.)

It seems that a simple way to handle this is to cache the results of the callableusecols in a dictionary, and look up the result after each line is parsed. We only actually call the function when a new number of columns is encountered. There will have to be a check added to ensure that the length of the sequence returned by subsequent calls is the same as that of the first call. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't mind raising an error. We already do ifusecols isnot given. We do have oddly strong support for changing number of columns, simply becauseloadtxt used to use list indexingfor col in usecol: line.split(delimiter)[col].

Yes, you could cacheusecols(n) with changingn (in the unlikely event that it chagned). Or... you just raise an error :).

I would be OK even deprecating that whole thing. Just dropping support fornegative usecols together withragged columns, might simplify things also.

@sebergseberg added the 62 - Python APIChanges or additions to the Python API. Mailing list should usually be notified. labelJun 20, 2022
@charrischarris added the 56 - Needs Release Note.Needs an entry in doc/release/upcoming_changes labelJun 26, 2022
@charris
Copy link
Member

Needs a release note.

Copy link
Member

@sebergseberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Just a comment in case you are looking for whats failing (since I don't think the size of intp and py_ssize_t should ever differ, if they do we probably have more to fix).

I do think some use ofNPY_UNLIKELY is probably worthwhile here (think of a single column CSV file). Maybe enough so to show up in the benchmarks, but did not try.

}
}
else {
if (usecols_iscallable && current_num_fields != prev_num_fields) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Probably makes sense to reorganize the conditions so that the outermost can beif (NPY_UNLIKELY(current_num_fields != prev_num_fields)) {.
There is a lot of code here now in a relatively hot path, so lets help the compiler shovel it to the side somewhere.

(Could move most of this into a helper as well probably).

assert(homogeneous || num_field_types == num_usecols);
actual_num_fields = num_usecols;
}
else if (!homogeneous) {
assert(usecols == NULL ||num_field_types == num_usecols);
assert(num_field_types == num_usecols);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This assert is failing (if you are looking for that).num_usecols is not necessarily defined here, is it?

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Ah, the result of an incomplete clean up. I just removed the assert().

seberg reacted with thumbs up emoji
@WarrenWeckesserWarrenWeckesser marked this pull request as draftFebruary 19, 2023 21:32
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@sebergsebergseberg left review comments

Assignees
No one assigned
Labels
01 - Enhancement56 - Needs Release Note.Needs an entry in doc/release/upcoming_changes62 - Python APIChanges or additions to the Python API. Mailing list should usually be notified.component: numpy.lib
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

3 participants
@WarrenWeckesser@charris@seberg

[8]ページ先頭

©2009-2025 Movatter.jp