Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork10.9k
ENH: lib: Allow usecols to be a callable in loadtxt().#21800
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
base:main
Are you sure you want to change the base?
Uh oh!
There was an error while loading.Please reload this page.
Conversation
5b4165b
to8360279
Compare8360279
to5ffae65
CompareThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I don't have a strong opinion on the API addition, but it seems good to me and does allow further additions pretty nicely (by doing them in Python).
A few small comments on the code, the only bigger one is that I think we need to reject (or add logic?) for changes in the number of columnsn
, since that would affect the result of the function.
// create the usecols array. | ||
PyObject *seq = PyObject_CallFunction(usecols_obj, "n", | ||
current_num_fields); | ||
if (PyErr_Occurred()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
if (PyErr_Occurred()) { | |
if (seq==NULL) { |
"sequence of ints, but it returned an instance of " | ||
"type '%s'", Py_TYPE(seq)->tp_name); | ||
Py_DECREF(seq); | ||
goto error; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I am wondering if the "length check" can't be just part of the helper? We use the same error in any case?
Could even usePySequence_Fast
in the helper, but that doesn't really matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
That would mean that the helper should probably returnnum_usecols
and fill inusecols
(passed by pointer).
Py_DECREF(tmp); | ||
} | ||
return arr; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Can we move this toconversion_utils.c
? There is similar functionality there already, and now that it is split out, maybe it is time to just put it there.
// This function owns usecols if a callable usecols_obj was given. | ||
PyMem_Free(usecols); | ||
usecols = NULL; // An overabundance of caution... | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Would it be ugly to move all usecols handling here? Its fine, but conditional ownership tens to be an anti-pattern.
Another way to spell would be to haveusecol_from_func = ...
andusecol = usecol_from_func
, so the owner is unconditionallyusecol_from_func
and not "usecol" itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Would it be ugly to move all usecols handling here?
No, I think that would be fine. In fact, with a bit more of the argument processing code moved intoread_rows()
, I think the function_readtext_from_stream()
can be eliminated, and_load_from_filelike()
can callread_rows()
directly. I'll give that shot, and push to the PR if it looks reasonable.
with pytest.raises(RuntimeError, | ||
match='the number of fields in the given dtype'): | ||
np.loadtxt(txt, usecols=lambda n: [0, 2, 3], dtype=dt) | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
We are missing a test (and I think also the logic) to reject a change in the number of columns when a callable is given. This is a bit of a tricky case!
The problem is that ifn
changes, the return value of theusecol
function might also change! While normally, when usecols are given we do not have to worry about that at all.
The example would be something like:
np.loadtxt(["1 2", "1 2 3"], usecols=lambda n: range(n)[-2:])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
As you can see in the code, I've been working under the assumption that once the first row has been read, the value ofn
is determined once and for all. If given, the user-defined callable is called justonce, after the first row is read. If a structured dtype is given,n
is the number of fields in the dtype; otherwisen
is the number of columns found in the first row of the file. That's because I didn't think we had such strong support for ragged text files. I wish we didn't, and I don't recall the history of how we got here, but if that's where we are now, so be it. (There is no formal specification for the format of the text files that we support, so the scope ofloadtxt
tends to creep in fits and starts.)
It seems that a simple way to handle this is to cache the results of the callableusecols
in a dictionary, and look up the result after each line is parsed. We only actually call the function when a new number of columns is encountered. There will have to be a check added to ensure that the length of the sequence returned by subsequent calls is the same as that of the first call. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I don't mind raising an error. We already do ifusecols
isnot given. We do have oddly strong support for changing number of columns, simply becauseloadtxt
used to use list indexingfor col in usecol: line.split(delimiter)[col]
.
Yes, you could cacheusecols(n)
with changingn
(in the unlikely event that it chagned). Or... you just raise an error :).
I would be OK even deprecating that whole thing. Just dropping support fornegative usecols together withragged columns, might simplify things also.
Needs a release note. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Just a comment in case you are looking for whats failing (since I don't think the size of intp and py_ssize_t should ever differ, if they do we probably have more to fix).
I do think some use ofNPY_UNLIKELY
is probably worthwhile here (think of a single column CSV file). Maybe enough so to show up in the benchmarks, but did not try.
} | ||
} | ||
else { | ||
if (usecols_iscallable && current_num_fields != prev_num_fields) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Probably makes sense to reorganize the conditions so that the outermost can beif (NPY_UNLIKELY(current_num_fields != prev_num_fields)) {
.
There is a lot of code here now in a relatively hot path, so lets help the compiler shovel it to the side somewhere.
(Could move most of this into a helper as well probably).
assert(homogeneous || num_field_types == num_usecols); | ||
actual_num_fields = num_usecols; | ||
} | ||
else if (!homogeneous) { | ||
assert(usecols == NULL ||num_field_types == num_usecols); | ||
assert(num_field_types == num_usecols); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This assert is failing (if you are looking for that).num_usecols
is not necessarily defined here, is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Ah, the result of an incomplete clean up. I just removed the assert().
Uh oh!
There was an error while loading.Please reload this page.
This is the C version of#15995.
Actually, it is a subset of that PR. The new feature added here is to allow
usecols
to be callable. In this PR, I haven't included the option forusecols
to be a slice, and I haven't included the convenience functionskip
. Any column selection that could be done with a slice can also be done with a callable, and there are selections that a callable can make that cannot be done with a slice--i.e. the callable option is the most flexible. The smaller scope of this PR should make it easier to review; handling a slice can be added in a follow-up if there is interest.The
skip
function in#15995 is just a Python function, so it could simply be copied over from that PR to here if folks think it is important. Of course, it too could be added in a follow-up PR.(seberg: xrefgh-13878)