scikit-multilearn/scikit-multilearnPublic


Documentation of iterative_train_test_split incomplete #282

Answered by zbeloki

edufonseca asked this question in Q&A

edufonseca

Mar 13, 2019

· 6 comments


iterative_train_test_split is briefly documented here (at the bottom), but the input params X, y are not explained. I tried passing y as a list of lists, encoding the labels as categorical integers, e.g.

[[2], [0,3], [1], [0,2,3]]

but it crashed.

By debugging the example provided here, X, y turn out to be scipy.sparse.lil_matrix. Is this the only format allowed?

Any indication on the possible formats for X, y in iterative_train_test_split? Thanks
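
For readers landing on this question: a minimal sketch, assuming (as the accepted answer in this thread says) that y has to be a two-dimensional matrix, here a binary indicator matrix, rather than a list of label-index lists. It converts the list of lists from the question into that shape with plain NumPy; the variable names are illustrative and not part of scikit-multilearn.

```python
import numpy as np

# Label indices per sample, as in the question above.
label_lists = [[2], [0, 3], [1], [0, 2, 3]]

# Assumption: labels are integers 0..k-1, so k = largest index + 1.
n_labels = max(max(row) for row in label_lists) + 1

# Build an (n_samples, n_labels) matrix with a 1 wherever a label is present.
y = np.zeros((len(label_lists), n_labels), dtype=int)
for i, row in enumerate(label_lists):
    y[i, row] = 1

print(y.shape)  # (4, 4)
```

Each row of y then corresponds to one sample and each column to one label.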


AlexMRuch
May 23, 2020

I'm having this issue as well. I've tried converting my inputs to a list of lists, a np.array of lists, a np.array of np.arrays, etc.

I can only get it to work with the test example, which also works with non-sparse (dense) matrices:

    from skmultilearn.model_selection.iterative_stratification import iterative_train_test_split
    from skmultilearn.dataset import load_dataset

    X, y, _, _ = load_dataset('scene', 'undivided')
    X_train, y_train, X_test, y_test = iterative_train_test_split(
        X.A,
        y.A,
        test_size = 0.2
    )

^^^ This works fine for me

In this case, we have

    print(type(X.A))
    print(X.A.shape)
    X.A

returns

    <class 'numpy.ndarray'>
    (2407, 294)
    array([[0.646467, 0.666435, 0.685047, ..., 0.247298, 0.014025, 0.029709],
           [0.770156, 0.767255, 0.761053, ..., 0.137833, 0.082672, 0.03632 ],
           [0.793984, 0.772096, 0.76182 , ..., 0.051125, 0.112506, 0.083924],
           ...,
           [0.952281, 0.944987, 0.905556, ..., 0.0319  , 0.017547, 0.019734],
           [0.88399 , 0.899004, 0.901019, ..., 0.256158, 0.226332, 0.22307 ],
           [0.974915, 0.866425, 0.818144, ..., 0.005131, 0.025059, 0.004033]])

And

    print(type(y.A))
    print(y.A.shape)
    y.A

returns

    <class 'numpy.ndarray'>
    (2407, 6)
    array([[1, 0, 0, 0, 1, 0],
           [1, 0, 0, 0, 0, 1],
           [1, 0, 0, 0, 0, 0],
           ...,
           [0, 0, 0, 0, 0, 1],
           [0, 0, 0, 0, 0, 1],
           [0, 0, 0, 0, 0, 1]])

However, with my own data,

    print(type(df_train["text"].values))
    print(df_train["text"].values.shape)
    df_train["text"].values

Which returns

    <class 'numpy.ndarray'>
    (23455,)
    array(['Wholeheartedly support these protests &amp; acts of civil disobedience &amp; will join when I can! #Ferguson #AllLivesMatter http://t.co/D8Phc8UakE',
           'This Sandra Bland situation man no disrespect rest her soul , but people die everyday in a unjustified matter #AllLivesMatter',
           'Commitment to peace, healing and loving neighbors. Give us strength and patience. #PortlandPride #AllLivesMatter #Peace',
           ...,
           'After losing the election to 2 unisex names, maybe it is time for the GOP to support Marriage Equality and Civil Unions. #Sandy #Christie',
           '@FoxNews:Price gouging, looting and rage: #Sandy crimes stories grow http://t.co/zL3iI, Good Luck with their Gun Control Laws and 0 cops!',
           "Might devastated #Sandy victims lose the oppurtunity to vote, thus having their rights violated? Looting their vote. It shouldn't happen."],
          dtype=object)

And

    print(type(df_train["labels"].values))
    print(df_train["labels"].values.shape)
    df_train["labels"].values

Which returns

    <class 'numpy.ndarray'>
    (23455,)
    array([list([0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]),
           list([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]),
           list([1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]), ...,
           list([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]),
           list([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]),
           list([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0])], dtype=object)

And coded another way as

    print(type(df_train_labels_split))
    print(df_train_labels_split.shape)
    df_train_labels_split

Which returns

    <class 'numpy.ndarray'>
    (23455, 11)
    array([[0, 0, 0, ..., 0, 0, 0],
           [1, 1, 1, ..., 0, 0, 0],
           [1, 0, 0, ..., 1, 0, 0],
           ...,
           [0, 0, 1, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           [0, 1, 0, ..., 0, 0, 0]])

^^^ All of these give me errors:

    X_train, y_train, X_test, y_test = iterative_train_test_split(
        df_train["text"].values,
        df_train_labels_split,
        test_size = 0.2
    )

Throws

    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-67-d7e6efda299e> in <module>
          1 # Get multi-label train/test splits of data
          2 from sklearn.model_selection import train_test_split
    ----> 3 X_train, y_train, X_test, y_test = iterative_train_test_split(
          4     df_train["text"].values,
          5     df_train_labels_split,

    ~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in iterative_train_test_split(X, y, test_size)
         93     train_indexes, test_indexes = next(stratifier.split(X, y))
         94
    ---> 95     X_train, y_train = X[train_indexes, :], y[train_indexes, :]
         96     X_test, y_test = X[test_indexes, :], y[test_indexes, :]
         97

    IndexError: too many indices for array

^^^ The number of rows matches perfectly, so this is really unclear
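
A likely explanation, judging only from the traceback: line 95 indexes X[train_indexes, :], which needs two axes, while df_train["text"].values has shape (23455,). The sketch below reproduces this with a small made-up array and shows that reshaping to a single column avoids the error. It uses only NumPy, not scikit-multilearn, and the data is invented for illustration.

```python
import numpy as np

# Toy stand-in for df_train["text"].values: a 1-D object array of strings.
X = np.array(["first tweet", "second tweet", "third tweet", "fourth tweet"], dtype=object)
train_indexes = np.array([0, 2, 3])

try:
    X[train_indexes, :]  # same indexing pattern as line 95 of the traceback
    raised = False
except IndexError:
    raised = True  # a 1-D array cannot take a second index

# Reshape to a single column so the array has two axes.
X_2d = X.reshape(-1, 1)
X_train = X_2d[train_indexes, :]

print(raised, X_2d.shape, X_train.shape)  # True (4, 1) (3, 1)
```

So the row counts can match and the call can still fail, because the check that breaks is on the number of dimensions rather than the number of rows.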

And

    X_train, y_train, X_test, y_test = iterative_train_test_split(
        df_train["text"].values,
        df_train["labels"].values,
        test_size = 0.2
    )

Gives me

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-69-75472797614a> in <module>
          1 # Get multi-label train/test splits of data
          2 from sklearn.model_selection import train_test_split
    ----> 3 X_train, y_train, X_test, y_test = iterative_train_test_split(
          4     df_train["text"].values,
          5     df_train["labels"].values,

    ~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in iterative_train_test_split(X, y, test_size)
         91
         92     stratifier = IterativeStratification(n_splits=2, order=2, sample_distribution_per_fold=[test_size, 1.0-test_size])
    ---> 93     train_indexes, test_indexes = next(stratifier.split(X, y))
         94
         95     X_train, y_train = X[train_indexes, :], y[train_indexes, :]

    ~/anaconda3/envs/transformers/lib/python3.8/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
        334                 .format(self.n_splits, n_samples))
        335
    --> 336         for train, test in super().split(X, y, groups):
        337             yield train, test
        338

    ~/anaconda3/envs/transformers/lib/python3.8/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
         78         X, y, groups = indexable(X, y, groups)
         79         indices = np.arange(_num_samples(X))
    ---> 80         for test_index in self._iter_test_masks(X, y, groups):
         81             train_index = indices[np.logical_not(test_index)]
         82             test_index = indices[test_index]

    ~/anaconda3/envs/transformers/lib/python3.8/site-packages/sklearn/model_selection/_split.py in _iter_test_masks(self, X, y, groups)
         90         By default, delegates to _iter_test_indices(X, y, groups)
         91         """
    ---> 92         for test_index in self._iter_test_indices(X, y, groups):
         93             test_mask = np.zeros(_num_samples(X), dtype=np.bool)
         94             test_mask[test_index] = True

    ~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in _iter_test_indices(self, X, y, groups)
        339
        340         rows, rows_used, all_combinations, per_row_combinations, samples_with_combination, folds = \
    --> 341             self._prepare_stratification(y)
        342
        343         self._distribute_positive_evidence(rows_used, folds, samples_with_combination, per_row_combinations)

    ~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in _prepare_stratification(self, y)
        236
        237         """
    --> 238         self.n_samples, self.n_labels = y.shape
        239         self.desired_samples_per_fold = np.array([self.percentage_per_fold[i] * self.n_samples
        240                                                   for i in range(self.n_splits)])

    ValueError: not enough values to unpack (expected 2, got 1)
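
Reading the last frame of this traceback, y.shape is unpacked into two names, so a 1-D object array whose elements are lists (shape (23455,)) cannot satisfy it. A small sketch with made-up labels, using only NumPy, that reproduces the message and then stacks the per-row lists into a 2-D matrix:

```python
import numpy as np

# Toy stand-in for df_train["labels"].values: 1-D object array of equal-length lists.
rows = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 1, 0]]
labels = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    labels[i] = row

try:
    n_samples, n_labels = labels.shape  # same unpacking as line 238 of the traceback
    message = None
except ValueError as err:
    message = str(err)

# Stack the per-row lists into one (n_samples, n_labels) integer matrix.
y = np.array(labels.tolist())

print(labels.shape, message, y.shape)
```

This assumes every row has the same number of label entries; ragged lists would need a different conversion.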

I think this isn't an issue with my data, as

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        df_train["text"].values,
        df_train["labels"].values,
        test_size = 0.2
    )

Runs successfully. I'd really love to use this package, but these errors and the documentation gaps are really preventing me from doing so. Any advice would be great!

Also, as a side note, it's a little odd to me that sklearn's returned params are X_train, X_test, y_train, y_test while the multilearn returns are X_train, y_train, X_test, y_test
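
If the differing return order is a concern, one option is a thin adapter. The sketch below is hypothetical: split_in_sklearn_order is not part of either library, and it takes the splitting function as an argument so that it can wrap iterative_train_test_split (or anything else returning X_train, y_train, X_test, y_test). The fake_split helper exists only to make the example runnable on its own.

```python
def split_in_sklearn_order(split_fn, X, y, test_size):
    """Call split_fn and reorder its result to X_train, X_test, y_train, y_test."""
    X_train, y_train, X_test, y_test = split_fn(X, y, test_size=test_size)
    return X_train, X_test, y_train, y_test


# Stand-in splitter for demonstration only; it mimics the multilearn return order.
def fake_split(X, y, test_size):
    cut = int(round(len(X) * (1 - test_size)))
    return X[:cut], y[:cut], X[cut:], y[cut:]


X_train, X_test, y_train, y_test = split_in_sklearn_order(
    fake_split, [10, 11, 12, 13, 14], [0, 1, 0, 1, 1], test_size=0.2
)
print(X_train, X_test, y_train, y_test)  # [10, 11, 12, 13] [14] [0, 1, 0, 1] [1]
```

In real use, iterative_train_test_split would be passed in place of fake_split.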

@edufonseca, did you ever find a solution?


edufonseca
May 25, 2020
Author

@AlexMRuch no, I did not. It's a pity. It'd be great to have this work.


AlexMRuch
May 25, 2020

Yeah, looks like the last update was a year ago. Wonder if the package is dead :-(


valeriich
Sep 7, 2020

You may simply customize the function iterative_train_test_split for a pandas Series with text data as below:

    from skmultilearn.model_selection import IterativeStratification

    def iterative_train_test_split(X, y, test_size):
        stratifier = IterativeStratification(n_splits=2, order=2, sample_distribution_per_fold=[test_size, 1.0-test_size])
        train_indexes, test_indexes = next(stratifier.split(X, y))

        X_train, y_train = X.iloc[train_indexes], y[train_indexes, :]
        X_test, y_test = X.iloc[test_indexes], y[test_indexes, :]

        return X_train, y_train, X_test, y_test


kevin-yauris
Sep 8, 2020

@AlexMRuch You may try this and see if it works:

    X_train, y_train, X_test, y_test = iterative_train_test_split(
        df_train[["text"]].values,
        df_train[["labels"]].values,
        test_size = 0.2
    )

I also got an error when using this method, but using double brackets solved it for me
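
Why the double brackets change the outcome, at least as far as shapes go: df["text"] is a Series whose .values is 1-D, whereas df[["text"]] is a one-column DataFrame whose .values is 2-D. The toy frame below is made up for illustration. Note that the same trick applied to a column of label lists yields an (n, 1) object array rather than an (n, n_labels) indicator matrix, so the label column may still need converting before stratification behaves as intended.

```python
import pandas as pd

# Made-up frame with a text column and a column of per-row label lists.
df = pd.DataFrame({
    "text": ["a", "b", "c"],
    "labels": [[0, 1], [1, 0], [1, 1]],
})

print(df["text"].values.shape)      # (3,)
print(df[["text"]].values.shape)    # (3, 1)
print(df[["labels"]].values.shape)  # (3, 1): one cell per row, each holding a list
```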


zbeloki
Mar 6, 2023

As it states in the README, X and y must be matrices of two dimensions.

For instance, if you have a pandas column that you want to use as X, you should first convert it to a numpy array of shape (n, 1):

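    import numpy as np

    # df is assumed to be a pandas DataFrame with a "text" column of 3 rows
    X = df.text.to_numpy()
    X.shape  # (3,) -> bad! we need to add a new axis
    X = X[..., np.newaxis]
    X.shape  # (3, 1) -> good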

To prepare the y parameter you can use MultiLabelBinarizer from scikit-learn:

    from sklearn.preprocessing import MultiLabelBinarizer

    labels = [['white', 'black'], ['blue'], ['blue', 'white', 'pink']]
    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(labels)
    y.shape  # (3, 4)
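
One follow-up that may help: after splitting, X_train and X_test keep the (n, 1) shape, while many text pipelines expect a flat sequence of strings. The sketch below uses made-up data and hand-picked index arrays standing in for whatever a splitter would return; it only illustrates dropping the extra axis again.

```python
import numpy as np

texts = np.array(["red shirt", "blue jeans", "pink hat"], dtype=object)
X = texts[..., np.newaxis]  # (3, 1), as in the answer above

# Hypothetical indices, standing in for what a splitter would choose.
train_idx, test_idx = np.array([0, 2]), np.array([1])
X_train, X_test = X[train_idx, :], X[test_idx, :]

# Drop the extra axis to get 1-D arrays of strings back.
train_texts = X_train[:, 0]
test_texts = X_test[:, 0]

print(train_texts.shape, test_texts.tolist())  # (2,) ['blue jeans']
```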


Answer selected by ChristianSch
