NotificationsYou must be signed in to change notification settings
Fork11.1k
Star29.9k

ENH: lib: Allow usecols to be a callable in loadtxt().#21800

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Closed

WarrenWeckesser wants to merge15 commits intonumpy:mainfromWarrenWeckesser:usecols-callable

Closed

ENH: lib: Allow usecols to be a callable in loadtxt().#21800

WarrenWeckesser wants to merge15 commits intonumpy:mainfromWarrenWeckesser:usecols-callable

Conversation

Copy link

Member

WarrenWeckesser commentedJun 20, 2022•
edited by seberg
Loading

This is the C version of#15995.

Actually, it is a subset of that PR. The new feature added here is to allowusecols to be callable. In this PR, I haven't included the option forusecols to be a slice, and I haven't included the convenience functionskip. Any column selection that could be done with a slice can also be done with a callable, and there are selections that a callable can make that cannot be done with a slice--i.e. the callable option is the most flexible. The smaller scope of this PR should make it easier to review; handling a slice can be added in a follow-up if there is interest.

Theskip function in#15995 is just a Python function, so it could simply be copied over from that PR to here if folks think it is important. Of course, it too could be added in a follow-up PR.

(seberg: xrefgh-13878)

WarrenWeckesser added 01 - Enhancement component: numpy.lib labels

Jun 20, 2022

WarrenWeckesser force-pushed theusecols-callable branch from5b4165b to8360279Compare

June 20, 2022 06:41

ENH: lib: Allow usecols to be a callable in loadtxt().

5ffae65

WarrenWeckesser force-pushed theusecols-callable branch from8360279 to5ffae65Compare

June 20, 2022 14:23

seberg reviewed

Jun 20, 2022

View reviewed changes

Copy link

Member

seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't have a strong opinion on the API addition, but it seems good to me and does allow further additions pretty nicely (by doing them in Python).

A few small comments on the code, the only bigger one is that I think we need to reject (or add logic?) for changes in the number of columnsn, since that would affect the result of the function.

numpy/core/src/multiarray/textreading/rows.c Outdated

		// create the usecols array.
		PyObject *seq = PyObject_CallFunction(usecols_obj, "n",
		current_num_fields);
		if (PyErr_Occurred()) {

Copy link

Member

sebergJun 20, 2022•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change

	if (PyErr_Occurred()) {
	if (seq==NULL) {

numpy/core/src/multiarray/textreading/rows.c

		"sequence of ints, but it returned an instance of "
		"type '%s'", Py_TYPE(seq)->tp_name);
		Py_DECREF(seq);
		goto error;

Copy link

Member

sebergJun 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I am wondering if the "length check" can't be just part of the helper? We use the same error in any case?
Could even usePySequence_Fast in the helper, but that doesn't really matter.

Copy link

Member

sebergJun 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

That would mean that the helper should probably returnnum_usecols and fill inusecols (passed by pointer).

numpy/core/src/multiarray/textreading/seq_to_ssize_c_array.c Outdated

		Py_DECREF(tmp);
		}
		return arr;
		}

Copy link

Member

sebergJun 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Can we move this toconversion_utils.c? There is similar functionality there already, and now that it is split out, maybe it is time to just put it there.

numpy/core/src/multiarray/textreading/rows.c Outdated

		// This function owns usecols if a callable usecols_obj was given.
		PyMem_Free(usecols);
		usecols = NULL; // An overabundance of caution...
		}

Copy link

Member

sebergJun 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Would it be ugly to move all usecols handling here? Its fine, but conditional ownership tens to be an anti-pattern.

Another way to spell would be to haveusecol_from_func = ... andusecol = usecol_from_func, so the owner is unconditionallyusecol_from_func and not "usecol" itself.

Copy link

MemberAuthor

WarrenWeckesserJun 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Would it be ugly to move all usecols handling here?

No, I think that would be fine. In fact, with a bit more of the argument processing code moved intoread_rows(), I think the function_readtext_from_stream() can be eliminated, and_load_from_filelike() can callread_rows() directly. I'll give that shot, and push to the PR if it looks reasonable.

numpy/lib/tests/test_loadtxt.py

		with pytest.raises(RuntimeError,
		match='the number of fields in the given dtype'):
		np.loadtxt(txt, usecols=lambda n: [0, 2, 3], dtype=dt)

Copy link

Member

sebergJun 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

We are missing a test (and I think also the logic) to reject a change in the number of columns when a callable is given. This is a bit of a tricky case!

The problem is that ifn changes, the return value of theusecol function might also change! While normally, when usecols are given we do not have to worry about that at all.

The example would be something like:

np.loadtxt(["1 2", "1 2 3"], usecols=lambda n: range(n)[-2:])

Copy link

MemberAuthor

WarrenWeckesserJun 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

As you can see in the code, I've been working under the assumption that once the first row has been read, the value ofn is determined once and for all. If given, the user-defined callable is called justonce, after the first row is read. If a structured dtype is given,n is the number of fields in the dtype; otherwisen is the number of columns found in the first row of the file. That's because I didn't think we had such strong support for ragged text files. I wish we didn't, and I don't recall the history of how we got here, but if that's where we are now, so be it. (There is no formal specification for the format of the text files that we support, so the scope ofloadtxt tends to creep in fits and starts.)

It seems that a simple way to handle this is to cache the results of the callableusecols in a dictionary, and look up the result after each line is parsed. We only actually call the function when a new number of columns is encountered. There will have to be a check added to ensure that the length of the sequence returned by subsequent calls is the same as that of the first call. What do you think?

Copy link

Member

sebergJun 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't mind raising an error. We already do ifusecols isnot given. We do have oddly strong support for changing number of columns, simply becauseloadtxt used to use list indexingfor col in usecol: line.split(delimiter)[col].

Yes, you could cacheusecols(n) with changingn (in the unlikely event that it chagned). Or... you just raise an error :).

I would be OK even deprecating that whole thing. Just dropping support fornegative usecols together withragged columns, might simplify things also.

seberg added the 62 - Python APIChanges or additions to the Python API. Mailing list should usually be notified. label

Jun 20, 2022

Fix check of the result of PyObject_CallFunction()

10f1d08

WarrenWeckesser mentioned this pull request

Jun 23, 2022

BUG: lib:loadtxt treatsusecols='' the same asusecols=[].#21839

Closed

charris added the 56 - Needs Release Note.Needs an entry in doc/release/upcoming_changes label

Jun 26, 2022

Copy link

Member

charris commentedJun 26, 2022

Needs a release note.

WarrenWeckesser added7 commits

June 28, 2022 13:48

Lots of updates. (Squash this commit!)

0d8d3c5

BUG: Fix free_conv_funcs() declaration and a few compiler warnings.

571a5e6

BUG: remove unnecessary (and incorrect) clause from an assert()

6d69407

Initialize a var to avoid a compiler warning.

3e30cd8

Ensure PY_SSIZE_T_CLEAN is set before including Python.h

231321e

TMP: Check sizes of integers

d48333c

TMP: Move code that checks sizes of integers

b283709

seberg reviewed

Jun 29, 2022

View reviewed changes

Copy link

Member

seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Just a comment in case you are looking for whats failing (since I don't think the size of intp and py_ssize_t should ever differ, if they do we probably have more to fix).

I do think some use ofNPY_UNLIKELY is probably worthwhile here (think of a single column CSV file). Maybe enough so to show up in the benchmarks, but did not try.

numpy/core/src/multiarray/textreading/rows.c Outdated

		}
		}
		else {
		if (usecols_iscallable && current_num_fields != prev_num_fields) {

Copy link

Member

sebergJun 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Probably makes sense to reorganize the conditions so that the outermost can beif (NPY_UNLIKELY(current_num_fields != prev_num_fields)) {.
There is a lot of code here now in a relatively hot path, so lets help the compiler shovel it to the side somewhere.

(Could move most of this into a helper as well probably).

numpy/core/src/multiarray/textreading/rows.c Outdated

		assert(homogeneous \|\| num_field_types == num_usecols);
		actual_num_fields = num_usecols;
		}
		else if (!homogeneous) {
		assert(usecols == NULL \|\|num_field_types == num_usecols);
		assert(num_field_types == num_usecols);

Copy link

Member

sebergJun 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This assert is failing (if you are looking for that).num_usecols is not necessarily defined here, is it?

Copy link

MemberAuthor

WarrenWeckesserJun 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Ah, the result of an incomplete clean up. I just removed the assert().

WarrenWeckesser added6 commits

June 28, 2022 21:54

Revert previous TMP commit.

700dcb7

Remove invalid assert()

9c7bf60

get_usecols_arr_from_callable now handles the initial and subsequent …

69ab6c3

…calls.

Expand some code comments.

b7e02dd

Split a test in test_loadtxt.py

28c32ca

Merge branch 'main' into usecols-callable

f3a344a

WarrenWeckesser marked this pull request as draft

February 19, 2023 21:32

Copy link

Member

bmwoodruff commentedJun 11, 2025

Based on a discussion at the triage meeting, we decided to close inactive draft PRs that are more than 1 year old. Feel free to open a new PR to continue working on this.

bmwoodruff closed this

Jun 11, 2025

Labels

01 - Enhancement 56 - Needs Release Note.

Needs an entry in doc/release/upcoming_changes

62 - Python API

Changes or additions to the Python API. Mailing list should usually be notified.

component: numpy.lib

4 participants

Movatterモバイル変換

Uh oh!

ENH: lib: Allow usecols to be a callable in loadtxt().#21800

ENH: lib: Allow usecols to be a callable in loadtxt().#21800

Uh oh!

Conversation

WarrenWeckesser commentedJun 20, 2022• edited by sebergLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

seberg left a comment

Choose a reason for hiding this comment

Uh oh!

sebergJun 20, 2022• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

charris commentedJun 26, 2022

Uh oh!

seberg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bmwoodruff commentedJun 11, 2025

Uh oh!

Uh oh!

WarrenWeckesser commentedJun 20, 2022•
edited by seberg
Loading

sebergJun 20, 2022•
edited
Loading