
Prepare data for building a model

2023-04-07

Learn how to use ML.NET to prepare data for additional processing or building a model.

Data is often unclean and sparse. ML.NET machine learning algorithms expect input or features to be in a single numerical vector. Similarly, the value to predict (label), especially when it's categorical data, has to be encoded. Therefore, one of the goals of data preparation is to get the data into the format expected by ML.NET algorithms.

Split data into training & test sets

The following sections outline two common problems encountered when training a model: overfitting and underfitting. Splitting your data and validating your models on a held-out set can help you identify and mitigate these problems.

Overfitting & underfitting

Overfitting and underfitting are the two most common problems you encounter when training a model. Underfitting means the selected trainer isn't capable enough to fit the training dataset, and it usually results in a high loss during training and a low score/metric on the test dataset. To resolve this, you need to either select a more powerful model or perform more feature engineering. Overfitting is the opposite, and happens when the model learns the training data too well. This usually results in a low loss metric during training but a high loss on the test dataset.

A good analogy for these concepts is studying for an exam. Let's say you knew the questions and answers ahead of time. After studying, you take the test and get a perfect score. Great news! However, when you're given the exam again with the questions rearranged and with slightly different wording you get a lower score. That suggests you memorized the answers and didn't actually learn the concepts you were being tested on. This is an example of overfitting. Underfitting is the opposite where the study materials you were given don't accurately represent what you're evaluated on for the exam. As a result, you resort to guessing the answers since you don't have enough knowledge to answer correctly.

Split data

Take the following input data and load it into an IDataView called data:

```csharp
var homeDataList = new HomeData[]
{
    new() { NumberOfBedrooms = 1f, Price = 100_000f },
    new() { NumberOfBedrooms = 2f, Price = 300_000f },
    new() { NumberOfBedrooms = 6f, Price = 600_000f },
    new() { NumberOfBedrooms = 3f, Price = 300_000f },
    new() { NumberOfBedrooms = 2f, Price = 200_000f }
};
```

To split data into train and test sets, use the TrainTestSplit(IDataView, Double, String, Nullable&lt;Int32&gt;) method.

```csharp
// Split data into an 80% training set and a 20% test set
TrainTestData dataSplit = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);
```

The testFraction parameter is used to take 0.2, or 20%, of the dataset for testing. The remaining 80% is used for training.

The result is a DataOperationsCatalog.TrainTestData with two IDataViews, which you can access via TrainSet and TestSet.
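For example, the two splits can be pulled out of the result like this (a minimal sketch continuing the snippet above):

```csharp
// Access the two resulting IDataViews from the split
IDataView trainSet = dataSplit.TrainSet; // 80% of the data, used for training
IDataView testSet = dataSplit.TestSet;   // 20% of the data, held out for evaluation
```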

Filter data

Sometimes, not all data in a dataset is relevant for analysis. An approach to remove irrelevant data is filtering. The DataOperationsCatalog contains a set of filter operations that take in an IDataView containing all of the data and return an IDataView containing only the data points of interest. It's important to note that because filter operations are not an IEstimator or ITransformer like those in the TransformsCatalog, they can't be included as part of an EstimatorChain or TransformerChain data preparation pipeline.

Take the following input data and load it into an IDataView called data:

```csharp
HomeData[] homeDataList = new HomeData[]
{
    new() { NumberOfBedrooms = 1f, Price = 100000f },
    new() { NumberOfBedrooms = 2f, Price = 300000f },
    new() { NumberOfBedrooms = 6f, Price = 600000f }
};
```

To filter data based on the value of a column, use the FilterRowsByColumn method.

```csharp
// Apply filter
IDataView filteredData = mlContext.Data.FilterRowsByColumn(data, "Price", lowerBound: 200000, upperBound: 1000000);
```

The sample above keeps rows in the dataset with a price between 200,000 and 1,000,000. Applying this filter returns only the last two rows in the data and excludes the first row, because its price is 100,000 and not within the specified range.

Replace missing values

Missing values are a common occurrence in datasets. One approach to dealing with missing values is to replace them with the default value for the given type, if any, or another meaningful value, such as the mean value in the data.

Take the following input data and load it into an IDataView called data:

```csharp
HomeData[] homeDataList = new HomeData[]
{
    new() { NumberOfBedrooms = 1f, Price = 100000f },
    new() { NumberOfBedrooms = 2f, Price = 300000f },
    new() { NumberOfBedrooms = 6f, Price = float.NaN }
};
```

Notice that the last element in the list has a missing value for Price. To replace the missing values in the Price column, use the ReplaceMissingValues method to fill in that missing value.

Important

ReplaceMissingValues only works with numerical data.

```csharp
// Define replacement estimator
var replacementEstimator = mlContext.Transforms.ReplaceMissingValues("Price", replacementMode: MissingValueReplacingEstimator.ReplacementMode.Mean);

// Fit data to estimator
// Fitting generates a transformer that applies the operations defined by the estimator
ITransformer replacementTransformer = replacementEstimator.Fit(data);

// Transform data
IDataView transformedData = replacementTransformer.Transform(data);
```

ML.NET supports various replacement modes. The sample above uses the Mean replacement mode, which fills in the missing value with the column's average value. The result fills in the Price property for the last element in the data with 200,000, since it's the average of 100,000 and 300,000.

Use normalizers

Normalization is a data preprocessing technique used to scale features to be in the same range, usually between 0 and 1, so that they can be more accurately processed by a machine learning algorithm. For example, the ranges for age and income vary significantly, with age generally being in the range of 0-100 and income generally being in the range of zero to thousands. Visit the transforms page for a more detailed list and description of normalization transforms.

Min-max normalization

Take the following input data and load it into an IDataView called data:

```csharp
HomeData[] homeDataList = new HomeData[]
{
    new() { NumberOfBedrooms = 2f, Price = 200000f },
    new() { NumberOfBedrooms = 1f, Price = 100000f }
};
```

Normalization can be applied to columns with single numerical values as well as vectors. Normalize the data in the Price column using min-max normalization with the NormalizeMinMax method.

```csharp
// Define min-max estimator
var minMaxEstimator = mlContext.Transforms.NormalizeMinMax("Price");

// Fit data to estimator
// Fitting generates a transformer that applies the operations defined by the estimator
ITransformer minMaxTransformer = minMaxEstimator.Fit(data);

// Transform data
IDataView transformedData = minMaxTransformer.Transform(data);
```

The original price values [ 200000, 100000 ] are converted to [ 1, 0.5 ] using the MinMax normalization formula, which generates output values in the range 0-1.
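To see where [ 1, 0.5 ] comes from, the scaling can be sketched in plain C#. This is a minimal standalone sketch, not the ML.NET implementation: NormalizeMinMax preserves zero by default (its fixZero parameter), so for non-negative data the formula reduces to dividing each value by the maximum absolute value in the column.

```csharp
using System;
using System.Linq;

class MinMaxSketch
{
    static void Main()
    {
        // Prices from the example above
        float[] prices = { 200_000f, 100_000f };

        // With zero fixed, min-max scaling reduces to x / max(|x|),
        // which maps [200000, 100000] to [1, 0.5]
        float maxAbs = prices.Max(p => Math.Abs(p));
        float[] scaled = prices.Select(p => p / maxAbs).ToArray();

        Console.WriteLine(string.Join(", ", scaled)); // 1, 0.5
    }
}
```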

Binning

Binning converts continuous values into a discrete representation of the input. For example, suppose one of your features is age. Instead of using the actual age value, binning creates ranges for that value. 0-18 could be one bin, another could be 19-35 and so on.

Take the following input data and load it into an IDataView called data:

```csharp
HomeData[] homeDataList = new HomeData[]
{
    new() { NumberOfBedrooms = 1f, Price = 100000f },
    new() { NumberOfBedrooms = 2f, Price = 300000f },
    new() { NumberOfBedrooms = 6f, Price = 600000f }
};
```

Normalize the data into bins using the NormalizeBinning method. The maximumBinCount parameter enables you to specify the number of bins needed to classify your data. In this example, data is put into two bins.

```csharp
// Define binning estimator
var binningEstimator = mlContext.Transforms.NormalizeBinning("Price", maximumBinCount: 2);

// Fit data to estimator
// Fitting generates a transformer that applies the operations defined by the estimator
var binningTransformer = binningEstimator.Fit(data);

// Transform data
IDataView transformedData = binningTransformer.Transform(data);
```

The result of binning creates bin bounds of [ 0, 200000, Infinity ]. Therefore the resulting bins are [ 0, 1, 1 ], because the first observation is between 0-200000 and the others are greater than 200000 but less than infinity.

Work with categorical data

One of the most common types of data is categorical data. Categorical data has a finite number of categories. For example, the states of the USA, or a list of the types of animals found in a set of pictures. Whether the categorical data are features or labels, they must be mapped onto a numerical value so they can be used to generate a machine learning model. There are a number of ways of working with categorical data in ML.NET, depending on the problem you are solving.

Key value mapping

In ML.NET, a key is an integer value that represents a category. Key value mapping is most often used to map string labels into unique integer values for training, then back to their string values when the model is used to make a prediction.

The transforms used to perform key value mapping are MapValueToKey and MapKeyToValue.

MapValueToKey adds a dictionary of mappings in the model, so that MapKeyToValue can perform the reverse transform when making a prediction.
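As a sketch of how these two transforms pair up, the following hypothetical pipeline fragment maps a string Label column to keys before training, and defines the reverse mapping for the PredictedLabel column that a trainer would produce. The column names here are illustrative, not from this article's dataset:

```csharp
// Map the string Label column to numeric keys for training
var mappingEstimator = mlContext.Transforms.Conversion.MapValueToKey("Label");

// Fitting builds the value-to-key dictionary from the data
ITransformer mappingTransformer = mappingEstimator.Fit(data);
IDataView keyedData = mappingTransformer.Transform(data);

// Typically appended after the trainer in a pipeline, so that the
// PredictedLabel key is mapped back to its original string value
var unmappingEstimator = mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel");
```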

One hot encoding

One hot encoding takes a finite set of values and maps them onto integers whose binary representation has a single 1 bit in a unique position. One hot encoding can be the best choice if there is no implicit ordering of the categorical data. The following table shows an example with zip codes as raw values.

| Raw value | One hot encoded value |
|-----------|-----------------------|
| 98052     | 00...01               |
| 98100     | 00...10               |
| ...       | ...                   |
| 98109     | 10...00               |

The transform to convert categorical data to one-hot encoded numbers is OneHotEncoding.
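A minimal sketch of applying this transform to a hypothetical ZipCode column (the column name is illustrative, not from this article's dataset):

```csharp
// Define one-hot encoding estimator for the categorical ZipCode column
var oneHotEstimator = mlContext.Transforms.Categorical.OneHotEncoding("ZipCode");

// Fit data to estimator and transform
// Each distinct zip code becomes an indicator vector with a single 1 bit
ITransformer oneHotTransformer = oneHotEstimator.Fit(data);
IDataView transformedData = oneHotTransformer.Transform(data);
```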

Hashing

Hashing is another way to convert categorical data to numbers. A hash function maps data of an arbitrary size (a string of text, for example) onto a number with a fixed range. Hashing can be a fast and space-efficient way of vectorizing features. One notable example of hashing in machine learning is email spam filtering where, instead of maintaining a dictionary of known words, every word in the email is hashed and added to a large feature vector. Using hashing in this way avoids malicious circumvention of the spam filter through words that aren't in the dictionary.

ML.NET provides the Hash transform to perform hashing on text, dates, and numerical data. Like key value mapping, the outputs of the hash transform are key types.
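A minimal sketch, again using a hypothetical ZipCode column; the numberOfBits parameter bounds the output range, giving 2^numberOfBits possible hash values:

```csharp
// Define hashing estimator; numberOfBits bounds the output to 2^16 possible key values
var hashEstimator = mlContext.Transforms.Conversion.Hash("ZipCode", numberOfBits: 16);

// Fit data to estimator and transform
// Each zip code is mapped to a key within the fixed hash range
ITransformer hashTransformer = hashEstimator.Fit(data);
IDataView hashedData = hashTransformer.Transform(data);
```

Note that a smaller numberOfBits saves space but increases the chance of hash collisions, where two distinct categories map to the same key.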

Work with text data

Like categorical data, text data needs to be transformed into numerical features before using it to build a machine learning model. Visit the transforms page for a more detailed list and description of text transforms.

Using data like the data below that has been loaded into an IDataView:

```csharp
ReviewData[] reviews = new ReviewData[]
{
    new ReviewData
    {
        Description = "This is a good product",
        Rating = 4.7f
    },
    new ReviewData
    {
        Description = "This is a bad product",
        Rating = 2.3f
    }
};
```

ML.NET provides the FeaturizeText transform, which takes a text's string value and creates a set of features from the text by applying a series of individual transforms.

```csharp
// Define text transform estimator
var textEstimator = mlContext.Transforms.Text.FeaturizeText("Description");

// Fit data to estimator
// Fitting generates a transformer that applies the operations defined by the estimator
ITransformer textTransformer = textEstimator.Fit(data);

// Transform data
IDataView transformedData = textTransformer.Transform(data);
```

The resulting transform converts the text values in the Description column to a numerical vector that looks similar to the output below:

```
[ 0.2041241, 0.2041241, 0.2041241, 0.4082483, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0, 0, 0, 0, 0.4472136, 0.4472136, 0.4472136, 0.4472136, 0.4472136, 0 ]
```

The transforms that make up FeaturizeText can also be applied individually for finer-grained control over feature generation.

```csharp
// Define text transform estimator
var textEstimator = mlContext.Transforms.Text.NormalizeText("Description")
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Description"))
    .Append(mlContext.Transforms.Text.RemoveDefaultStopWords("Description"))
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Description"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("Description"))
    .Append(mlContext.Transforms.NormalizeLpNorm("Description"));
```

textEstimator contains a subset of the operations performed by the FeaturizeText method. The benefit of a more complex pipeline is control and visibility over the transformations applied to the data.

Using the first entry as an example, the following is a detailed description of the results produced by the transformation steps defined by textEstimator:

Original Text: This is a good product

| Transform | Description | Result |
|-----------|-------------|--------|
| 1. NormalizeText | Converts all letters to lowercase by default | this is a good product |
| 2. TokenizeWords | Splits string into individual words | ["this","is","a","good","product"] |
| 3. RemoveDefaultStopWords | Removes stop words like *is* and *a* | ["good","product"] |
| 4. MapValueToKey | Maps the values to keys (categories) based on the input data | [1,2] |
| 5. ProduceNGrams | Transforms text into a sequence of consecutive words | [1,1,1,0,0] |
| 6. NormalizeLpNorm | Scales inputs by their lp-norm | [ 0.577350529, 0.577350529, 0.577350529, 0, 0 ] |