Documentation of iterative_train_test_split incomplete #282

Answered by zbeloki
edufonseca asked this question in Q&A

iterative_train_test_split is briefly documented here (at the bottom), but the input params X, y are not explained. I tried passing y as a list of lists, encoding the labels as categorical integers, e.g.

[[2], [0,3], [1], [0,2,3]]

but it crashed.

By debugging the example provided here, X, y turn out to be scipy.sparse.lil_matrix. Is this the only format allowed?

Any indication on the possible formats for X, y in iterative_train_test_split? Thanks


As it states in the README, X and y must be matrices of two dimensions.

For instance, if you have a pandas column that you want to use as X, you should first convert it to a numpy array of shape (n, 1):

    X = df.text.to_numpy()
    X.shape
    # (3,) -> bad! we need to add a new axis
    X = X[..., np.newaxis]
    # (3,1) -> good

To prepare the y parameter you can use MultiLabelBinarizer from scikit-learn:

    from sklearn.preprocessing import MultiLabelBinarizer

    labels = [['white', 'black'], ['blue'], ['blue', 'white', 'pink']]
    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(labels)
    y.shape
    # (3,4)
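For the integer-coded label lists in the question, the same 2-D indicator layout can be built by hand. This is only a minimal sketch with NumPy: `labels_to_indicator` is a hypothetical helper (not part of scikit-multilearn or scikit-learn), and it assumes the label ids are the integers 0..k-1.

```python
import numpy as np


def labels_to_indicator(label_lists, n_labels=None):
    """Turn lists of integer label ids into a 2-D 0/1 indicator matrix.

    Hypothetical helper for illustration; assumes ids are 0..k-1.
    """
    if n_labels is None:
        # Highest id seen across all non-empty rows, plus one.
        n_labels = max(max(row) for row in label_lists if row) + 1
    y = np.zeros((len(label_lists), n_labels), dtype=int)
    for i, row in enumerate(label_lists):
        for j in row:
            y[i, j] = 1  # mark label j as present for sample i
    return y


# The list of lists from the question: one row per sample.
y = labels_to_indicator([[2], [0, 3], [1], [0, 2, 3]])
print(y.shape)  # (4, 4)
print(y)
```

The result has one row per sample and one column per label, which is the two-dimensional shape the answer asks for.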

Replies: 6 comments


I'm having this issue as well. I've tried converting my inputs to a list of lists, a np.array of lists, a np.array of np.arrays, etc.

I can only get the function to work with the test example, which also works with non-sparse (dense) matrices:

    from skmultilearn.model_selection.iterative_stratification import iterative_train_test_split
    from skmultilearn.dataset import load_dataset

    X, y, _, _ = load_dataset('scene', 'undivided')
    X_train, y_train, X_test, y_test = iterative_train_test_split(
        X.A,
        y.A,
        test_size = 0.2
    )

^^^ This works fine for me

In this case, we have

    print(type(X.A))
    print(X.A.shape)
    X.A

Returns

    <class 'numpy.ndarray'>
    (2407, 294)
    array([[0.646467, 0.666435, 0.685047, ..., 0.247298, 0.014025, 0.029709],
           [0.770156, 0.767255, 0.761053, ..., 0.137833, 0.082672, 0.03632 ],
           [0.793984, 0.772096, 0.76182 , ..., 0.051125, 0.112506, 0.083924],
           ...,
           [0.952281, 0.944987, 0.905556, ..., 0.0319  , 0.017547, 0.019734],
           [0.88399 , 0.899004, 0.901019, ..., 0.256158, 0.226332, 0.22307 ],
           [0.974915, 0.866425, 0.818144, ..., 0.005131, 0.025059, 0.004033]])

And

    print(type(y.A))
    print(y.A.shape)
    y.A

Returns

    <class 'numpy.ndarray'>
    (2407, 6)
    array([[1, 0, 0, 0, 1, 0],
           [1, 0, 0, 0, 0, 1],
           [1, 0, 0, 0, 0, 0],
           ...,
           [0, 0, 0, 0, 0, 1],
           [0, 0, 0, 0, 0, 1],
           [0, 0, 0, 0, 0, 1]])

However, with my own data,

    print(type(df_train["text"].values))
    print(df_train["text"].values.shape)
    df_train["text"].values

Which returns

    <class 'numpy.ndarray'>
    (23455,)
    array(['Wholeheartedly support these protests &amp; acts of civil disobedience &amp; will join when I can! #Ferguson #AllLivesMatter http://t.co/D8Phc8UakE',
           'This Sandra Bland situation man no disrespect rest her soul , but people die everyday in a unjustified matter #AllLivesMatter',
           'Commitment to peace, healing and loving neighbors. Give us strength and patience. #PortlandPride #AllLivesMatter #Peace',
           ...,
           'After losing the election to 2 unisex names, maybe it is time for the GOP to support Marriage Equality and Civil Unions. #Sandy #Christie',
           '@FoxNews:Price gouging, looting and rage: #Sandy crimes stories grow http://t.co/zL3iI, Good Luck with their Gun Control Laws and 0 cops!',
           "Might devastated #Sandy victims lose the oppurtunity to vote, thus having their rights violated? Looting their vote. It shouldn't happen."],
          dtype=object)

And

    print(type(df_train["labels"].values))
    print(df_train["labels"].values.shape)
    df_train["labels"].values

Which returns

    <class 'numpy.ndarray'>
    (23455,)
    array([list([0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]),
           list([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]),
           list([1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]), ...,
           list([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]),
           list([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]),
           list([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0])], dtype=object)

And coded another way as

    print(type(df_train_labels_split))
    print(df_train_labels_split.shape)
    df_train_labels_split

Which returns

    <class 'numpy.ndarray'>
    (23455, 11)
    array([[0, 0, 0, ..., 0, 0, 0],
           [1, 1, 1, ..., 0, 0, 0],
           [1, 0, 0, ..., 1, 0, 0],
           ...,
           [0, 0, 1, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           [0, 1, 0, ..., 0, 0, 0]])

^^^ All of these give me errors:

    X_train, y_train, X_test, y_test = iterative_train_test_split(
        df_train["text"].values,
        df_train_labels_split,
        test_size = 0.2
    )

Throws

    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-67-d7e6efda299e> in <module>
          1 # Get multi-label train/test splits of data
          2 from sklearn.model_selection import train_test_split
    ----> 3 X_train, y_train, X_test, y_test = iterative_train_test_split(
          4     df_train["text"].values,
          5     df_train_labels_split,

    ~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in iterative_train_test_split(X, y, test_size)
         93     train_indexes, test_indexes = next(stratifier.split(X, y))
         94 
    ---> 95     X_train, y_train = X[train_indexes, :], y[train_indexes, :]
         96     X_test, y_test = X[test_indexes, :], y[test_indexes, :]
         97 

    IndexError: too many indices for array

^^^ The number of rows matches perfectly, so this is really unclear
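The traceback above stops at `X[train_indexes, :]`, which uses two indices. A 1-D array such as `df_train["text"].values` only has one axis, so NumPy rejects the second index no matter how many rows there are. A minimal sketch of that behaviour with made-up data (the strings and the index array are placeholders, not the poster's dataset):

```python
import numpy as np

# Placeholder data standing in for a 1-D column of texts.
texts = np.array(["first doc", "second doc", "third doc"], dtype=object)
train_idx = np.array([0, 2])

print(texts.shape)  # (3,)

hit_index_error = False
try:
    texts[train_idx, :]  # two indices on a 1-D array
except IndexError as err:
    hit_index_error = True
    print("IndexError:", err)

# Adding a second axis makes the same indexing expression valid.
texts_2d = texts.reshape(-1, 1)  # same effect as texts[..., np.newaxis]
print(texts_2d.shape)                # (3, 1)
print(texts_2d[train_idx, :].shape)  # (2, 1)
```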

And

    X_train, y_train, X_test, y_test = iterative_train_test_split(
        df_train["text"].values,
        df_train["labels"].values,
        test_size = 0.2
    )

Gives me

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-69-75472797614a> in <module>
          1 # Get multi-label train/test splits of data
          2 from sklearn.model_selection import train_test_split
    ----> 3 X_train, y_train, X_test, y_test = iterative_train_test_split(
          4     df_train["text"].values,
          5     df_train["labels"].values,

    ~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in iterative_train_test_split(X, y, test_size)
         91 
         92     stratifier = IterativeStratification(n_splits=2, order=2, sample_distribution_per_fold=[test_size, 1.0-test_size])
    ---> 93     train_indexes, test_indexes = next(stratifier.split(X, y))
         94 
         95     X_train, y_train = X[train_indexes, :], y[train_indexes, :]

    ~/anaconda3/envs/transformers/lib/python3.8/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
        334                 .format(self.n_splits, n_samples))
        335 
    --> 336         for train, test in super().split(X, y, groups):
        337             yield train, test
        338 

    ~/anaconda3/envs/transformers/lib/python3.8/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
         78         X, y, groups = indexable(X, y, groups)
         79         indices = np.arange(_num_samples(X))
    ---> 80         for test_index in self._iter_test_masks(X, y, groups):
         81             train_index = indices[np.logical_not(test_index)]
         82             test_index = indices[test_index]

    ~/anaconda3/envs/transformers/lib/python3.8/site-packages/sklearn/model_selection/_split.py in _iter_test_masks(self, X, y, groups)
         90         By default, delegates to _iter_test_indices(X, y, groups)
         91         """
    ---> 92         for test_index in self._iter_test_indices(X, y, groups):
         93             test_mask = np.zeros(_num_samples(X), dtype=np.bool)
         94             test_mask[test_index] = True

    ~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in _iter_test_indices(self, X, y, groups)
        339 
        340         rows, rows_used, all_combinations, per_row_combinations, samples_with_combination, folds = \
    --> 341             self._prepare_stratification(y)
        342 
        343         self._distribute_positive_evidence(rows_used, folds, samples_with_combination, per_row_combinations)

    ~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in _prepare_stratification(self, y)
        236 
        237         """
    --> 238         self.n_samples, self.n_labels = y.shape
        239         self.desired_samples_per_fold = np.array([self.percentage_per_fold[i] * self.n_samples
        240                                                   for i in range(self.n_splits)])

    ValueError: not enough values to unpack (expected 2, got 1)
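The last frame unpacks `y.shape` into two names, so a 1-D object array of Python lists (shape `(n,)`) fails there even though every list has the same length. A minimal sketch with made-up rows, assuming all label lists are equally long; converting through `tolist()` is one way, not the only way, to get a real 2-D array:

```python
import numpy as np

# Placeholder label rows; build a 1-D object array whose cells are lists,
# similar in shape to what a pandas column of lists returns from .values.
rows = [[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
y_obj = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    y_obj[i] = row

print(y_obj.shape)  # (4,)

hit_value_error = False
try:
    n_samples, n_labels = y_obj.shape  # same unpacking as in the traceback
except ValueError as err:
    hit_value_error = True
    print("ValueError:", err)

# Rebuild as a genuine 2-D integer array.
y_2d = np.array(y_obj.tolist())
n_samples, n_labels = y_2d.shape
print(y_2d.shape)  # (4, 3)
```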

I think this isn't an issue with my data, as

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        df_train["text"].values,
        df_train["labels"].values,
        test_size = 0.2
    )

Runs successfully. I'd really love to use this package, but these errors and the documentation gaps are really preventing me from doing so. Any advice would be great!

Also, as a side note, it's a little odd to me that sklearn returns X_train, X_test, y_train, y_test while skmultilearn returns X_train, y_train, X_test, y_test.
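If the differing return order is a concern, a thin wrapper can reorder the tuple. This is only a sketch: `sklearn_order` and `fake_split` are hypothetical names, and `fake_split` is a stand-in that merely mimics the (X_train, y_train, X_test, y_test) order so the example runs without scikit-multilearn.

```python
import numpy as np


def sklearn_order(split_fn, X, y, **kwargs):
    """Call a splitter returning (X_train, y_train, X_test, y_test)
    and hand the pieces back in sklearn's order."""
    X_train, y_train, X_test, y_test = split_fn(X, y, **kwargs)
    return X_train, X_test, y_train, y_test


def fake_split(X, y, test_size):
    """Hypothetical stand-in: a plain head/tail cut, no stratification."""
    cut = int(round(len(X) * (1 - test_size)))
    return X[:cut], y[:cut], X[cut:], y[cut:]


X = np.arange(10).reshape(5, 2)
y = np.eye(5, dtype=int)
X_train, X_test, y_train, y_test = sklearn_order(fake_split, X, y, test_size=0.2)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```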

@edufonseca, did you ever find a solution?


@AlexMRuch no, I did not. It's a pity. It'd be great to have this work.


Yeah, looks like the last update was a year ago. Wonder if the package is dead :-(


You can simply customize the function iterative_train_test_split for a pandas Series of text data, as below:

    from skmultilearn.model_selection import IterativeStratification

    def iterative_train_test_split(X, y, test_size):
        stratifier = IterativeStratification(n_splits=2, order=2, sample_distribution_per_fold=[test_size, 1.0-test_size])
        train_indexes, test_indexes = next(stratifier.split(X, y))
        X_train, y_train = X.iloc[train_indexes], y[train_indexes, :]
        X_test, y_test = X.iloc[test_indexes], y[test_indexes, :]
        return X_train, y_train, X_test, y_test

@AlexMRuch You could try this and see if it works:

    X_train, y_train, X_test, y_test = iterative_train_test_split(
        df_train[["text"]].values,
        df_train[["labels"]].values,
        test_size = 0.2
    )

I also got an error when using this method, but using double brackets solved it for me.
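As a rough illustration of why the double brackets change things: selecting with `[["col"]]` yields a one-column DataFrame, so `.values` is 2-D with shape (n, 1), whereas `["col"]` yields a Series whose `.values` is 1-D. The toy frame below is made up. Note that a labels column holding lists still comes out as a single object column of shape (n, 1) rather than (n, n_labels), so expanding the lists may still be needed depending on the data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; values are placeholders.
df = pd.DataFrame({
    "text": ["a", "b", "c"],
    "labels": [[0, 1], [1, 0], [1, 1]],
})

print(df["text"].values.shape)      # (3,)   Series -> 1-D
print(df[["text"]].values.shape)    # (3, 1) one-column DataFrame -> 2-D
print(df[["labels"]].values.shape)  # (3, 1) each cell is still a Python list

# Expanding the lists gives one column per label instead.
y = np.array(df["labels"].tolist())
print(y.shape)                      # (3, 2)
```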


Answer selected by ChristianSch
5 participants: @edufonseca, @zbeloki, @AlexMRuch, @kevin-yauris, @valeriich
This discussion was converted from issue #160 on March 14, 2023 17:04.

