fetch_20newsgroups
- sklearn.datasets.fetch_20newsgroups(*, data_home=None, subset='train', categories=None, shuffle=True, random_state=42, remove=(), download_if_missing=True, return_X_y=False, n_retries=3, delay=1.0) [source]
Load the filenames and data from the 20 newsgroups dataset (classification).
Download it if necessary.
- Classes: 20
- Samples total: 18846
- Dimensionality: 1
- Features: text
Read more in the User Guide.
- Parameters:
- data_home : str or path-like, default=None
Specify a download and cache folder for the datasets. If None, all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.
- subset : {‘train’, ‘test’, ‘all’}, default=’train’
Select the dataset to load: ‘train’ for the training set, ‘test’ for the test set, ‘all’ for both, with shuffled ordering.
- categories : array-like, dtype=str, default=None
If None (default), load all the categories. If not None, list of category names to load (other categories ignored).
- shuffle : bool, default=True
Whether or not to shuffle the data: might be important for models that make the assumption that the samples are independent and identically distributed (i.i.d.), such as stochastic gradient descent.
- random_state : int, RandomState instance or None, default=42
Determines random number generation for dataset shuffling. Pass an int for reproducible output across multiple function calls. See Glossary.
- remove : tuple, default=()
May contain any subset of (‘headers’, ‘footers’, ‘quotes’). Each of these is a kind of text that will be detected and removed from the newsgroup posts, preventing classifiers from overfitting on metadata.
‘headers’ removes newsgroup headers, ‘footers’ removes blocks at the ends of posts that look like signatures, and ‘quotes’ removes lines that appear to be quoting another post.
‘headers’ follows an exact standard; the other filters are not always correct.
- download_if_missing : bool, default=True
If False, raise an OSError if the data is not locally available instead of trying to download the data from the source site.
- return_X_y : bool, default=False
If True, returns (data.data, data.target) instead of a Bunch object.
Added in version 0.22.
- n_retries : int, default=3
Number of retries when HTTP errors are encountered.
Added in version 1.5.
- delay : float, default=1.0
Number of seconds between retries.
Added in version 1.5.
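A minimal sketch of the parameters above, wrapping the fetch in a function so nothing is downloaded until it is called (the first call downloads and caches the dataset):

```python
from sklearn.datasets import fetch_20newsgroups

def load_cleaned_train():
    # Strip headers, signature blocks, and quoted replies so a classifier
    # cannot overfit on newsgroup metadata (see the `remove` parameter).
    return fetch_20newsgroups(
        subset="train",
        remove=("headers", "footers", "quotes"),
        random_state=42,  # reproducible shuffling
    )

# train = load_cleaned_train()   # downloads the data on first use
# train.data is a list of raw posts; train.target the integer labels
```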
- Returns:
- bunch : Bunch
Dictionary-like object, with the following attributes.
- data : list of shape (n_samples,)
The data list to learn.
- target : ndarray of shape (n_samples,)
The target labels.
- filenames : list of shape (n_samples,)
The path to the location of the data.
- DESCR : str
The full description of the dataset.
- target_names : list of shape (n_classes,)
The names of target classes.
- (data, target) : tuple if return_X_y=True
A tuple of two objects: the first is the list of shape (n_samples,) holding the raw text of each post, and the second is an ndarray of shape (n_samples,) holding the target labels.
Added in version 0.22.
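The tuple form can be unpacked directly; a small sketch, again deferred behind a function so it only downloads when invoked:

```python
from sklearn.datasets import fetch_20newsgroups

def load_as_arrays():
    # X is a list of n_samples raw posts, y an ndarray of integer labels.
    X, y = fetch_20newsgroups(subset="train", return_X_y=True)
    return X, y

# X, y = load_as_arrays()   # downloads the data if it is not cached
# len(X) == y.shape[0] == 11314 for the default ‘train’ subset
```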
Examples
>>> from sklearn.datasets import fetch_20newsgroups
>>> cats = ['alt.atheism', 'sci.space']
>>> newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
>>> list(newsgroups_train.target_names)
['alt.atheism', 'sci.space']
>>> newsgroups_train.filenames.shape
(1073,)
>>> newsgroups_train.target.shape
(1073,)
>>> newsgroups_train.target[:10]
array([0, 1, 1, 1, 0, 1, 1, 0, 0, 0])
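The returned texts are plain strings, so they are typically vectorized before modelling. A hedged sketch extending the example above (the choice of TfidfVectorizer and LogisticRegression is illustrative, not prescribed by this page):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# TF-IDF features feeding a linear classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

def train_and_score(categories=("alt.atheism", "sci.space")):
    # Fetch matching train/test splits with metadata stripped, so the
    # score reflects the message text rather than headers or signatures.
    remove = ("headers", "footers", "quotes")
    train = fetch_20newsgroups(subset="train", categories=list(categories),
                               remove=remove)
    test = fetch_20newsgroups(subset="test", categories=list(categories),
                              remove=remove)
    clf.fit(train.data, train.target)
    return clf.score(test.data, test.target)

# accuracy = train_and_score()   # downloads the data on first use
```

Stripping metadata with `remove` typically lowers the measured accuracy, but gives a more honest estimate of how the model generalizes to text from outside this corpus.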
Gallery examples
Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation
Biclustering documents with the Spectral Co-clustering algorithm
Column Transformer with Heterogeneous Data Sources
Sample pipeline for text feature extraction and evaluation
Classification of text documents using sparse features
