load_svmlight_file

sklearn.datasets.load_svmlight_file(f, *, n_features=None, dtype=<class 'numpy.float64'>, multilabel=False, zero_based='auto', query_id=False, offset=0, length=-1)

Load datasets in the svmlight / libsvm format into a sparse CSR matrix.

This format is a text-based format, with one sample per line. It does not store zero-valued features and is therefore suitable for sparse datasets.

The first element of each line can be used to store a target variable to predict.

This format is used as the default format for both the svmlight and the libsvm command line programs.
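As a minimal sketch of the format described above, the snippet below loads two hand-written samples from an in-memory binary buffer. The sample data is made up for illustration, and `zero_based=False` is passed explicitly because the toy indices start at 1.

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file

# Each line is "<target> <index>:<value> ...". Features that are zero are
# simply left out. The indices here are one-based (illustrative data only).
raw = b"1 1:0.5 3:2.0\n-1 2:1.5\n"

# A file-like object must be opened in binary mode, hence BytesIO.
X, y = load_svmlight_file(BytesIO(raw), zero_based=False)

print(X.shape)      # (2, 3): two samples, highest one-based index is 3
print(X.toarray())  # dense view; omitted features appear as 0.0
print(y)            # targets taken from the first element of each line
```

The returned `X` is sparse, so `toarray()` is only sensible for small examples like this one.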

Parsing a text-based source can be expensive. When repeatedly working on the same dataset, it is recommended to wrap this loader with joblib.Memory.cache to store a memmapped backup of the CSR results of the first call and benefit from the near-instantaneous loading of memmapped structures for the subsequent calls.
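The caching recommendation above could look roughly like the following sketch. The temporary directory, the file name `toy.svmlight`, and the helper `get_data` are hypothetical names chosen for this example. In practice the cache location would be a persistent directory.

```python
import os
import tempfile

from joblib import Memory
from sklearn.datasets import load_svmlight_file

# Write a tiny illustrative dataset to disk (hypothetical file name).
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "toy.svmlight")
with open(data_path, "w") as fh:
    fh.write("1 1:0.5 3:2.0\n-1 2:1.5\n")

# Results of cached calls are stored under this directory.
memory = Memory(os.path.join(workdir, "cache"), verbose=0)


@memory.cache
def get_data(path):
    # Parsed only on the first call for a given path; later calls with
    # the same argument are served from the on-disk cache.
    X, y = load_svmlight_file(path, zero_based=False)
    return X, y


X_first, y_first = get_data(data_path)  # parses the text file
X_again, y_again = get_data(data_path)  # loaded from the cache
```

Note that the cache is keyed on the function arguments, not on the file contents, so a changed file under the same path would need the cache to be cleared.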

If the file contains pairwise preference constraints (known as “qid” in the svmlight format), these are ignored unless the query_id parameter is set to True. These pairwise preference constraints can be used to constrain the combination of samples when using pairwise loss functions (as is the case in some learning-to-rank problems) so that only pairs with the same query_id value are considered.
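To illustrate the qid handling described above, here is a small sketch with made-up data. The pair enumeration at the end is not part of the loader. It only shows one way the returned query ids might be used to restrict pairs to the same query.

```python
from io import BytesIO
from itertools import combinations

from sklearn.datasets import load_svmlight_file

# Three samples: the first two belong to query 1, the last to query 2.
raw = b"2 qid:1 1:0.4 2:1.0\n1 qid:1 1:0.1\n3 qid:2 2:0.7\n"

# With query_id=True a third array is returned alongside X and y.
X, y, qid = load_svmlight_file(BytesIO(raw), zero_based=False, query_id=True)

# Keep only index pairs whose samples share the same query id, as a
# pairwise loss would require.
same_query_pairs = [
    (i, j) for i, j in combinations(range(len(y)), 2) if qid[i] == qid[j]
]
print(qid)               # one query id per sample
print(same_query_pairs)  # only samples 0 and 1 share a query
```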

This implementation is written in Cython and is reasonably fast. However, a faster API-compatible loader is also available at: mblondel/svmlight-loader

Parameters:

f : str, path-like, file-like or int
    (Path to) a file to load. If a path ends in “.gz” or “.bz2”, it will be uncompressed on the fly. If an integer is passed, it is assumed to be a file descriptor. A file-like or file descriptor will not be closed by this function. A file-like object must be opened in binary mode.
    Changed in version 1.2: Path-like objects are now accepted.
n_features : int, default=None
    The number of features to use. If None, it will be inferred. This argument is useful to load several files that are subsets of a bigger sliced dataset: each subset might not have examples of every feature, hence the inferred shape might vary from one slice to another. n_features is only required if offset or length are passed a non-default value.
dtype : numpy data type, default=np.float64
    Data type of dataset to be loaded. This will be the data type of the output numpy arrays X and y.
multilabel : bool, default=False
    Samples may have several labels each (see https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html).
zero_based : bool or “auto”, default=“auto”
    Whether column indices in f are zero-based (True) or one-based (False). If column indices are one-based, they are transformed to zero-based to match Python/NumPy conventions. If set to “auto”, a heuristic check is applied to determine this from the file contents. Both kinds of files occur “in the wild”, but they are unfortunately not self-identifying. Using “auto” or True should always be safe when no offset or length is passed. If offset or length are passed, the “auto” mode falls back to zero_based=True to avoid having the heuristic check yield inconsistent results on different segments of the file.
query_id : bool, default=False
    If True, will return the query_id array for each file.
offset : int, default=0
    Ignore the offset first bytes by seeking forward, then discarding the following bytes up until the next new line character.
length : int, default=-1
    If strictly positive, stop reading any new line of data once the position in the file has reached the (offset + length) bytes threshold.
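The interplay of n_features, zero_based, offset and length can be sketched as follows. This is an illustrative example on made-up data. The split position of 10 bytes is arbitrary and deliberately falls in the middle of the first line, to show that each line ends up in exactly one chunk.

```python
from io import BytesIO

import numpy as np
from scipy.sparse import vstack
from sklearn.datasets import load_svmlight_file

# Three samples with one-based indices (illustrative data only).
raw = b"1 1:0.5 4:1.0\n0 2:2.0\n1 3:3.0\n"
n_features = 4  # required with offset/length so all chunks share one width
split = 10      # arbitrary byte position inside the first line

buf = BytesIO(raw)

# First chunk: read from the start until the position passes `split`.
X_a, y_a = load_svmlight_file(
    buf, n_features=n_features, zero_based=False, offset=0, length=split
)

# Second chunk: seek to `split`, skip the rest of the (already consumed)
# line, then read to the end of the buffer.
buf.seek(0)
X_b, y_b = load_svmlight_file(
    buf, n_features=n_features, zero_based=False, offset=split, length=-1
)

# Stacking the chunks recovers the full dataset.
X = vstack([X_a, X_b])
y = np.concatenate([y_a, y_b])
print(X_a.shape, X_b.shape, X.shape)
```

Passing zero_based explicitly avoids relying on the “auto” fallback to zero_based=True, which would be wrong for one-based data like this.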

Returns:

X : scipy.sparse matrix of shape (n_samples, n_features)
    The data matrix.
y : ndarray of shape (n_samples,), or a list of tuples of length n_samples
    The target. It is a list of tuples when multilabel=True, else an ndarray.
query_id : array of shape (n_samples,)
    The query_id for each sample. Only returned when query_id is set to True.
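The return types listed above can be checked with a short sketch on made-up multilabel data. It also passes a non-default dtype to show its effect on X.

```python
from io import BytesIO

import numpy as np
from scipy.sparse import issparse
from sklearn.datasets import load_svmlight_file

# In the multilabel variant of the format, labels are comma-separated.
raw = b"1,2 1:0.5\n3 2:1.0\n"

X, y = load_svmlight_file(
    BytesIO(raw), multilabel=True, zero_based=False, dtype=np.float32
)

print(issparse(X), X.dtype)  # sparse matrix with the requested dtype
print(y)                     # a list with one tuple of labels per sample
```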
