Learn how to use ML.NET to prepare data for additional processing or building a model.
Data is often unclean and sparse. ML.NET machine learning algorithms expect input or features to be in a single numerical vector. Similarly, the value to predict (label), especially when it's categorical data, has to be encoded. Therefore one of the goals of data preparation is to get the data into the format expected by ML.NET algorithms.
The following section outlines common problems encountered when training a model, known as overfitting and underfitting. Splitting your data and validating your models using a held-out set can help you identify and mitigate these problems.
Overfitting and underfitting are the two most common problems you encounter when training a model. Underfitting means the selected trainer isn't capable enough to fit the training dataset, and typically results in a high loss during training and a low score/metric on the test dataset. To resolve underfitting, either select a more powerful model or perform more feature engineering. Overfitting is the opposite: it happens when the model learns the training data too well. This usually results in a low loss metric during training but a high loss on the test dataset.
A good analogy for these concepts is studying for an exam. Let's say you knew the questions and answers ahead of time. After studying, you take the test and get a perfect score. Great news! However, when you're given the exam again with the questions rearranged and with slightly different wording you get a lower score. That suggests you memorized the answers and didn't actually learn the concepts you were being tested on. This is an example of overfitting. Underfitting is the opposite where the study materials you were given don't accurately represent what you're evaluated on for the exam. As a result, you resort to guessing the answers since you don't have enough knowledge to answer correctly.
Take the following input data and load it into an `IDataView` called `data`:
```csharp
var homeDataList = new HomeData[]
{
    new() { NumberOfBedrooms = 1f, Price = 100_000f },
    new() { NumberOfBedrooms = 2f, Price = 300_000f },
    new() { NumberOfBedrooms = 6f, Price = 600_000f },
    new() { NumberOfBedrooms = 3f, Price = 300_000f },
    new() { NumberOfBedrooms = 2f, Price = 200_000f }
};
```
To split the data into train/test sets, use the `TrainTestSplit(IDataView, Double, String, Nullable<Int32>)` method.
```csharp
// Split the dataset into train and test sets
TrainTestData dataSplit = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);
```
The `testFraction` parameter takes 0.2, or 20%, of the dataset for testing. The remaining 80% is used for training.
The result is a `DataOperationsCatalog.TrainTestData` with two `IDataView`s, which you can access via `TrainSet` and `TestSet`.
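Continuing the sample above, the two sets can be accessed as properties of `dataSplit` (a minimal sketch using the variable from the previous snippet):

```csharp
// Access the two resulting IDataViews
IDataView trainSet = dataSplit.TrainSet; // ~80% of the data
IDataView testSet = dataSplit.TestSet;   // ~20% of the data
```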
Sometimes, not all data in a dataset is relevant for analysis. An approach to remove irrelevant data is filtering. The `DataOperationsCatalog` contains a set of filter operations that take in an `IDataView` containing all of the data and return an `IDataView` containing only the data points of interest. It's important to note that because filter operations are not an `IEstimator` or `ITransformer` like those in the `TransformsCatalog`, they can't be included as part of an `EstimatorChain` or `TransformerChain` data preparation pipeline.
Take the following input data and load it into an `IDataView` called `data`:
```csharp
HomeData[] homeDataList = new HomeData[]
{
    new() { NumberOfBedrooms = 1f, Price = 100000f },
    new() { NumberOfBedrooms = 2f, Price = 300000f },
    new() { NumberOfBedrooms = 6f, Price = 600000f }
};
```
To filter data based on the value of a column, use the `FilterRowsByColumn` method.
```csharp
// Apply filter
IDataView filteredData = mlContext.Data.FilterRowsByColumn(data, "Price", lowerBound: 200000, upperBound: 1000000);
```
The sample above takes rows in the dataset with a price between 200,000 and 1,000,000. Applying this filter returns only the last two rows in the data and excludes the first row, because its price of 100,000 is not within the specified range.
Missing values are a common occurrence in datasets. One approach to dealing with missing values is to replace them with the default value for the given type, if one exists, or another meaningful value, such as the mean value in the data.
Take the following input data and load it into an `IDataView` called `data`:
```csharp
HomeData[] homeDataList = new HomeData[]
{
    new() { NumberOfBedrooms = 1f, Price = 100000f },
    new() { NumberOfBedrooms = 2f, Price = 300000f },
    new() { NumberOfBedrooms = 6f, Price = float.NaN }
};
```
Notice that the last element in the list has a missing value for `Price`. To replace the missing values in the `Price` column, use the `ReplaceMissingValues` method to fill in that missing value.
Important

`ReplaceMissingValues` only works with numerical data.
```csharp
// Define replacement estimator
var replacementEstimator = mlContext.Transforms.ReplaceMissingValues("Price", replacementMode: MissingValueReplacingEstimator.ReplacementMode.Mean);

// Fit data to estimator
// Fitting generates a transformer that applies the operations defined by the estimator
ITransformer replacementTransformer = replacementEstimator.Fit(data);

// Transform data
IDataView transformedData = replacementTransformer.Transform(data);
```
ML.NET supports various replacement modes. The sample above uses the `Mean` replacement mode, which fills in the missing value with that column's average value. As a result, the `Price` property for the last element in the data is filled in with 200,000, since that's the average of 100,000 and 300,000.
Normalization is a data preprocessing technique used to scale features to be in the same range, usually between 0 and 1, so that they can be more accurately processed by a machine learning algorithm. For example, the ranges for age and income vary significantly, with age generally being in the range of 0-100 and income generally being in the range of zero to thousands. Visit the transforms page for a more detailed list and description of normalization transforms.
Take the following input data and load it into an `IDataView` called `data`:
```csharp
HomeData[] homeDataList = new HomeData[]
{
    new() { NumberOfBedrooms = 2f, Price = 200000f },
    new() { NumberOfBedrooms = 1f, Price = 100000f }
};
```
Normalization can be applied to columns with single numerical values as well as vectors. Normalize the data in the `Price` column using min-max normalization with the `NormalizeMinMax` method.
```csharp
// Define min-max estimator
var minMaxEstimator = mlContext.Transforms.NormalizeMinMax("Price");

// Fit data to estimator
// Fitting generates a transformer that applies the operations defined by the estimator
ITransformer minMaxTransformer = minMaxEstimator.Fit(data);

// Transform data
IDataView transformedData = minMaxTransformer.Transform(data);
```
The original price values `[200000, 100000]` are converted to `[1, 0.5]` using the `MinMax` normalization formula, which generates output values in the range of 0-1.
Binning converts continuous values into a discrete representation of the input. For example, suppose one of your features is age. Instead of using the actual age value, binning creates ranges for that value. 0-18 could be one bin, another could be 19-35 and so on.
Take the following input data and load it into an `IDataView` called `data`:
```csharp
HomeData[] homeDataList = new HomeData[]
{
    new() { NumberOfBedrooms = 1f, Price = 100000f },
    new() { NumberOfBedrooms = 2f, Price = 300000f },
    new() { NumberOfBedrooms = 6f, Price = 600000f }
};
```
Normalize the data into bins using the `NormalizeBinning` method. The `maximumBinCount` parameter enables you to specify the number of bins needed to classify your data. In this example, data is put into two bins.
```csharp
// Define binning estimator
var binningEstimator = mlContext.Transforms.NormalizeBinning("Price", maximumBinCount: 2);

// Fit data to estimator
// Fitting generates a transformer that applies the operations defined by the estimator
var binningTransformer = binningEstimator.Fit(data);

// Transform data
IDataView transformedData = binningTransformer.Transform(data);
```
The result of binning creates bin bounds of `[0, 200000, Infinity]`. Therefore, the resulting bins are `[0, 1, 1]`, because the first observation is between 0 and 200,000 and the others are greater than 200,000 but less than infinity.
One of the most common types of data is categorical data. Categorical data has a finite number of categories. For example, the states of the USA, or a list of the types of animals found in a set of pictures. Whether the categorical data are features or labels, they must be mapped onto a numerical value so they can be used to generate a machine learning model. There are a number of ways of working with categorical data in ML.NET, depending on the problem you are solving.
In ML.NET, a key is an integer value that represents a category. Key value mapping is most often used to map string labels into unique integer values for training, then back to their string values when the model is used to make a prediction.
The transforms used to perform key value mapping are `MapValueToKey` and `MapKeyToValue`. `MapValueToKey` adds a dictionary of mappings in the model, so that `MapKeyToValue` can perform the reverse transform when making a prediction.
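The round trip can be sketched as follows. The `"Label"` and `"PredictedLabel"` column names here are illustrative assumptions, not taken from this article:

```csharp
// Map string labels in the "Label" column to key (integer) values.
// MapValueToKey stores the value-to-key dictionary in the model.
var mappingEstimator = mlContext.Transforms.Conversion.MapValueToKey("Label");

// A trainer would typically go between these two steps. Appending
// MapKeyToValue converts predicted keys back to their original strings.
var pipeline = mappingEstimator
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));
```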
One hot encoding takes a finite set of values and maps them onto integers whose binary representation has a single `1` value in a unique position in the string. One hot encoding can be the best choice if there is no implicit ordering of the categorical data. The following table shows an example with zip codes as raw values.
| Raw value | One hot encoded value |
|---|---|
| 98052 | 00...01 |
| 98100 | 00...10 |
| ... | ... |
| 98109 | 10...00 |
The transform to convert categorical data to one-hot encoded numbers is `OneHotEncoding`.
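As a minimal sketch, assuming categorical data loaded into an `IDataView` called `data` with a `ZipCode` column (an illustrative column name matching the table above):

```csharp
// Define one-hot encoding estimator for the ZipCode column
var oneHotEstimator = mlContext.Transforms.Categorical.OneHotEncoding("ZipCode");

// Fit data to estimator and transform the data
ITransformer oneHotTransformer = oneHotEstimator.Fit(data);
IDataView oneHotData = oneHotTransformer.Transform(data);
```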
Hashing is another way to convert categorical data to numbers. A hash function maps data of an arbitrary size (a string of text for example) onto a number with a fixed range. Hashing can be a fast and space-efficient way of vectorizing features. One notable example of hashing in machine learning is email spam filtering where, instead of maintaining a dictionary of known words, every word in the email is hashed and added to a large feature vector. Using hashing in this way avoids the problem of malicious spam filtering circumvention by the use of words that are not in the dictionary.
ML.NET provides the `Hash` transform to perform hashing on text, dates, and numerical data. Like key value mapping, the outputs of the hash transform are key types.
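A minimal sketch of the `Hash` transform, again assuming an illustrative `ZipCode` column in an `IDataView` called `data`; the `numberOfBits` parameter limits the output to 2^8 = 256 possible hash slots:

```csharp
// Define hash estimator for the ZipCode column
var hashEstimator = mlContext.Transforms.Conversion.Hash("ZipCode", numberOfBits: 8);

// Fit data to estimator and transform the data
ITransformer hashTransformer = hashEstimator.Fit(data);
IDataView hashedData = hashTransformer.Transform(data);
```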
Like categorical data, text data needs to be transformed into numerical features before being used to build a machine learning model. Visit the transforms page for a more detailed list and description of text transforms.
Using data like the data below, which has been loaded into an `IDataView`:
```csharp
ReviewData[] reviews = new ReviewData[]
{
    new ReviewData { Description = "This is a good product", Rating = 4.7f },
    new ReviewData { Description = "This is a bad product", Rating = 2.3f }
};
```
ML.NET provides the `FeaturizeText` transform, which takes a text's string value and creates a set of features from the text by applying a series of individual transforms.
```csharp
// Define text transform estimator
var textEstimator = mlContext.Transforms.Text.FeaturizeText("Description");

// Fit data to estimator
// Fitting generates a transformer that applies the operations defined by the estimator
ITransformer textTransformer = textEstimator.Fit(data);

// Transform data
IDataView transformedData = textTransformer.Transform(data);
```
The resulting transform converts the text values in the `Description` column to a numerical vector that looks similar to the output below:
[ 0.2041241, 0.2041241, 0.2041241, 0.4082483, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0, 0, 0, 0, 0.4472136, 0.4472136, 0.4472136, 0.4472136, 0.4472136, 0 ]
The transforms that make up `FeaturizeText` can also be applied individually for finer-grained control over feature generation.
```csharp
// Define text transform estimator
var textEstimator = mlContext.Transforms.Text.NormalizeText("Description")
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Description"))
    .Append(mlContext.Transforms.Text.RemoveDefaultStopWords("Description"))
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Description"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("Description"))
    .Append(mlContext.Transforms.NormalizeLpNorm("Description"));
```
`textEstimator` contains a subset of operations performed by the `FeaturizeText` method. The benefit of a more complex pipeline is control and visibility over the transformations applied to the data.
Using the first entry as an example, the following is a detailed description of the results produced by the transformation steps defined by `textEstimator`:
Original Text: This is a good product
| Transform | Description | Result |
|---|---|---|
| 1. NormalizeText | Converts all letters to lowercase by default | this is a good product |
| 2. TokenizeIntoWords | Splits string into individual words | ["this","is","a","good","product"] |
| 3. RemoveDefaultStopWords | Removes stop words like *is* and *a* | ["good","product"] |
| 4. MapValueToKey | Maps the values to keys (categories) based on the input data | [1,2] |
| 5. ProduceNgrams | Transforms text into a sequence of consecutive words | [1,1,1,0,0] |
| 6. NormalizeLpNorm | Scales inputs by their lp-norm | [0.577350529, 0.577350529, 0.577350529, 0, 0] |