I'm having this issue as well. I've tried converting my inputs to a list of lists, a `np.array` of lists, a `np.array` of `np.array`s, etc. I can only get the split to work with the test example, which works for non-sparse matrices:

```python
from skmultilearn.model_selection.iterative_stratification import iterative_train_test_split
from skmultilearn.dataset import load_dataset

X, y, _, _ = load_dataset('scene', 'undivided')
X_train, y_train, X_test, y_test = iterative_train_test_split(X.A, y.A, test_size=0.2)
```
^^^ This works fine for me. In this case, we have

```python
print(type(X.A))
print(X.A.shape)
X.A
```
Returns

```
<class 'numpy.ndarray'>
(2407, 294)
array([[0.646467, 0.666435, 0.685047, ..., 0.247298, 0.014025, 0.029709],
       [0.770156, 0.767255, 0.761053, ..., 0.137833, 0.082672, 0.03632 ],
       [0.793984, 0.772096, 0.76182 , ..., 0.051125, 0.112506, 0.083924],
       ...,
       [0.952281, 0.944987, 0.905556, ..., 0.0319  , 0.017547, 0.019734],
       [0.88399 , 0.899004, 0.901019, ..., 0.256158, 0.226332, 0.22307 ],
       [0.974915, 0.866425, 0.818144, ..., 0.005131, 0.025059, 0.004033]])
```
And

```python
print(type(y.A))
print(y.A.shape)
y.A
```
Returns

```
<class 'numpy.ndarray'>
(2407, 6)
array([[1, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0],
       ...,
       [0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1]])
```
However, with my own data,

```python
print(type(df_train["text"].values))
print(df_train["text"].values.shape)
df_train["text"].values
```
Which returns

```
<class 'numpy.ndarray'>
(23455,)
array(['Wholeheartedly support these protests & acts of civil disobedience & will join when I can! #Ferguson #AllLivesMatter http://t.co/D8Phc8UakE',
       'This Sandra Bland situation man no disrespect rest her soul , but people die everyday in a unjustified matter #AllLivesMatter',
       'Commitment to peace, healing and loving neighbors. Give us strength and patience. #PortlandPride #AllLivesMatter #Peace',
       ...,
       'After losing the election to 2 unisex names, maybe it is time for the GOP to support Marriage Equality and Civil Unions. #Sandy #Christie',
       '@FoxNews:Price gouging, looting and rage: #Sandy crimes stories grow http://t.co/zL3iI, Good Luck with their Gun Control Laws and 0 cops!',
       "Might devastated #Sandy victims lose the oppurtunity to vote, thus having their rights violated? Looting their vote. It shouldn't happen."],
      dtype=object)
```
And

```python
print(type(df_train["labels"].values))
print(df_train["labels"].values.shape)
df_train["labels"].values
```
Which returns

```
<class 'numpy.ndarray'>
(23455,)
array([list([0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]),
       list([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]),
       list([1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]),
       ...,
       list([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]),
       list([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]),
       list([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0])], dtype=object)
```
And, coded another way,

```python
print(type(df_train_labels_split))
print(df_train_labels_split.shape)
df_train_labels_split
```
Which returns

```
<class 'numpy.ndarray'>
(23455, 11)
array([[0, 0, 0, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0]])
```
^^^ All of these give me errors:

```python
X_train, y_train, X_test, y_test = iterative_train_test_split(
    df_train["text"].values,
    df_train_labels_split,
    test_size=0.2
)
```
Throws

```
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-67-d7e6efda299e> in <module>
      1 # Get multi-label train/test splits of data
      2 from sklearn.model_selection import train_test_split
----> 3 X_train, y_train, X_test, y_test = iterative_train_test_split(
      4     df_train["text"].values,
      5     df_train_labels_split,

~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in iterative_train_test_split(X, y, test_size)
     93     train_indexes, test_indexes = next(stratifier.split(X, y))
     94
---> 95     X_train, y_train = X[train_indexes, :], y[train_indexes, :]
     96     X_test, y_test = X[test_indexes, :], y[test_indexes, :]
     97

IndexError: too many indices for array
```
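Digging into that traceback, line 95 of `iterative_stratification.py` indexes `X` as `X[train_indexes, :]`, which only works on a 2-D array, while `df_train["text"].values` has shape `(23455,)`. So one workaround I can think of (just a sketch on my side, not documented usage) is to reshape the text column into a single-column 2-D array before the split and flatten it back afterwards:

```python
from skmultilearn.model_selection.iterative_stratification import iterative_train_test_split

# Make X 2-D so that X[train_indexes, :] works: (23455,) -> (23455, 1)
X_2d = df_train["text"].values.reshape(-1, 1)

X_train, y_train, X_test, y_test = iterative_train_test_split(
    X_2d, df_train_labels_split, test_size=0.2
)

# Flatten the text column back to 1-D for downstream use
X_train, X_test = X_train.ravel(), X_test.ravel()
```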
^^^ The number of rows matches perfectly, so the error message itself is really unclear. And

```python
X_train, y_train, X_test, y_test = iterative_train_test_split(
    df_train["text"].values,
    df_train["labels"].values,
    test_size=0.2
)
```
Gives me

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-69-75472797614a> in <module>
      1 # Get multi-label train/test splits of data
      2 from sklearn.model_selection import train_test_split
----> 3 X_train, y_train, X_test, y_test = iterative_train_test_split(
      4     df_train["text"].values,
      5     df_train["labels"].values,

~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in iterative_train_test_split(X, y, test_size)
     91
     92     stratifier = IterativeStratification(n_splits=2, order=2, sample_distribution_per_fold=[test_size, 1.0-test_size])
---> 93     train_indexes, test_indexes = next(stratifier.split(X, y))
     94
     95     X_train, y_train = X[train_indexes, :], y[train_indexes, :]

~/anaconda3/envs/transformers/lib/python3.8/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
    334                     .format(self.n_splits, n_samples))
    335
--> 336         for train, test in super().split(X, y, groups):
    337             yield train, test
    338

~/anaconda3/envs/transformers/lib/python3.8/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
     78         X, y, groups = indexable(X, y, groups)
     79         indices = np.arange(_num_samples(X))
---> 80         for test_index in self._iter_test_masks(X, y, groups):
     81             train_index = indices[np.logical_not(test_index)]
     82             test_index = indices[test_index]

~/anaconda3/envs/transformers/lib/python3.8/site-packages/sklearn/model_selection/_split.py in _iter_test_masks(self, X, y, groups)
     90         By default, delegates to _iter_test_indices(X, y, groups)
     91         """
---> 92         for test_index in self._iter_test_indices(X, y, groups):
     93             test_mask = np.zeros(_num_samples(X), dtype=np.bool)
     94             test_mask[test_index] = True

~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in _iter_test_indices(self, X, y, groups)
    339
    340         rows, rows_used, all_combinations, per_row_combinations, samples_with_combination, folds = \
--> 341             self._prepare_stratification(y)
    342
    343         self._distribute_positive_evidence(rows_used, folds, samples_with_combination, per_row_combinations)

~/anaconda3/envs/transformers/lib/python3.8/site-packages/skmultilearn/model_selection/iterative_stratification.py in _prepare_stratification(self, y)
    236
    237         """
--> 238         self.n_samples, self.n_labels = y.shape
    239         self.desired_samples_per_fold = np.array([self.percentage_per_fold[i] * self.n_samples
    240                                                   for i in range(self.n_splits)])

ValueError: not enough values to unpack (expected 2, got 1)
```
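This one at least points at the cause: `_prepare_stratification` does `self.n_samples, self.n_labels = y.shape`, and `df_train["labels"].values` is a 1-D object array of Python lists, so `y.shape` is `(23455,)` and there's nothing to unpack into `n_labels`. Stacking the per-row lists into a real `(n_samples, n_labels)` matrix first should give it the shape it expects; something like this (again just a sketch, assuming every row has the same 11 labels):

```python
import numpy as np

# Stack the per-row label lists into a proper 2-D integer matrix
y_2d = np.vstack(df_train["labels"].values).astype(int)
print(y_2d.shape)  # (23455, 11), i.e. the same thing as df_train_labels_split
```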
I don't think this is an issue with my data, as

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df_train["text"].values,
    df_train["labels"].values,
    test_size=0.2
)
```
runs successfully. I'd really love to use this package, but these errors and the documentation gaps are preventing me from doing so. Any advice would be great!

Also, as a side note, it's a little odd to me that sklearn returns `X_train, X_test, y_train, y_test` while scikit-multilearn returns `X_train, y_train, X_test, y_test`.

@edufonseca, did you ever find a solution?
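P.S. For anyone hitting the same wall, combining the two shape fixes above is what I plan to try next, roughly like this (still only a sketch, and note the return order differs from sklearn's):

```python
import numpy as np
from skmultilearn.model_selection.iterative_stratification import iterative_train_test_split

X_2d = df_train["text"].values.reshape(-1, 1)              # (23455, 1) text column
y_2d = np.vstack(df_train["labels"].values).astype(int)    # (23455, 11) label matrix

# Return order is X_train, y_train, X_test, y_test, not sklearn's X_train, X_test, y_train, y_test
X_train, y_train, X_test, y_test = iterative_train_test_split(X_2d, y_2d, test_size=0.2)
```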