
This sample tutorial shows how to use ML.NET to build a GitHub issue classifier: a model that classifies and predicts the Area label for a GitHub issue. You build it as a .NET console application using C# in Visual Studio.

In this tutorial, you learn how to:

Prepare your data
Transform the data
Train the model
Evaluate the model
Predict with the trained model
Deploy and Predict with a loaded model

You can find the source code for this tutorial at the dotnet/samples repository.

Prerequisites

Visual Studio 2022 with the ".NET Desktop Development" workload installed.
The GitHub issues training tab-separated file (issues_train.tsv).
The GitHub issues test tab-separated file (issues_test.tsv).

Create a console application

Create a project

Create a C# Console Application called "GitHubIssueClassification". Select Next.
Choose .NET 7 as the framework to use. Select Create.
Create a directory named Data in your project to save your data set files:
In Solution Explorer, right-click on your project and select Add > New Folder. Type "Data" and press Enter.
Create a directory named Models in your project to save your model:
In Solution Explorer, right-click on your project and select Add > New Folder. Type "Models" and press Enter.
Install the Microsoft.ML NuGet Package:
Note
This sample uses the latest stable version of the NuGet packages mentioned unless otherwise stated.
In Solution Explorer, right-click on your project and select Manage NuGet Packages. Choose "nuget.org" as the Package source, select the Browse tab, search for Microsoft.ML and select Install. Select the OK button on the Preview Changes dialog and then select the I Accept button on the License Acceptance dialog if you agree with the license terms for the packages listed.

Prepare your data

Download the issues_train.tsv and the issues_test.tsv data sets and save them to the Data folder you created previously. The first dataset trains the machine learning model and the second can be used to evaluate how accurate your model is.
In Solution Explorer, right-click each of the *.tsv files and select Properties. Under Advanced, change the value of Copy to Output Directory to Copy if newer.

Create classes and define paths

Add the following additional using directives to the top of the Program.cs file:

using Microsoft.ML;
using GitHubIssueClassification;

Create three global fields to hold the paths to the recently downloaded files, and global variables for the MLContext, DataView, and PredictionEngine:

_trainDataPath has the path to the dataset used to train the model.
_testDataPath has the path to the dataset used to evaluate the model.
_modelPath has the path where the trained model is saved.
_mlContext is the MLContext that provides processing context.
_trainingDataView is the IDataView used to process the training dataset.
_predEngine is the PredictionEngine<TSrc,TDst> used for single predictions.

Add the following code to the line directly below the using directives to specify those paths and the other variables:

string _appPath = Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]) ?? ".";
string _trainDataPath = Path.Combine(_appPath, "..", "..", "..", "Data", "issues_train.tsv");
string _testDataPath = Path.Combine(_appPath, "..", "..", "..", "Data", "issues_test.tsv");
string _modelPath = Path.Combine(_appPath, "..", "..", "..", "Models", "model.zip");

MLContext _mlContext;
PredictionEngine<GitHubIssue, IssuePrediction> _predEngine;
ITransformer _trainedModel;
IDataView _trainingDataView;
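The three ".." segments in these paths step up from the build output folder to the project folder. The following is a minimal sketch of that resolution, assuming a hypothetical output folder of MyProject/bin/Debug/net7.0 (the real value comes from the running executable and may differ on your machine):

```csharp
using System;
using System.IO;

// Hypothetical build output folder; in the tutorial this value comes from
// Environment.GetCommandLineArgs()[0] at run time.
string appPath = Path.Combine("MyProject", "bin", "Debug", "net7.0");

// Same shape as _trainDataPath in the tutorial.
string relative = Path.Combine(appPath, "..", "..", "..", "Data", "issues_train.tsv");

// GetFullPath collapses the ".." segments against the current directory.
string full = Path.GetFullPath(relative);
string expected = Path.GetFullPath(Path.Combine("MyProject", "Data", "issues_train.tsv"));

Console.WriteLine(full);
Console.WriteLine(full == expected); // True
```

If your build output sits at a different depth, the number of ".." segments needs to change accordingly.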

Create some classes for your input data and predictions. Add a new class to your project:

In Solution Explorer, right-click the project, and then select Add > New Item.
In the Add New Item dialog box, select Class and change the Name field to GitHubIssueData.cs. Then, select Add.
The GitHubIssueData.cs file opens in the code editor. Add the following using directive to the top of GitHubIssueData.cs:
```
using Microsoft.ML.Data;
```
Remove the existing class definition and add the following code to the GitHubIssueData.cs file. This code has two classes, GitHubIssue and IssuePrediction.
```
public class GitHubIssue
{
    [LoadColumn(0)]
    public string? ID { get; set; }

    [LoadColumn(1)]
    public string? Area { get; set; }

    [LoadColumn(2)]
    public required string Title { get; set; }

    [LoadColumn(3)]
    public required string Description { get; set; }
}

public class IssuePrediction
{
    [ColumnName("PredictedLabel")]
    public string? Area;
}
```
The label is the column you want to predict. The identified Features are the inputs you give the model to predict the Label.
Use the LoadColumnAttribute to specify the indices of the source columns in the data set.
GitHubIssue is the input dataset class and has the following String fields:
- The first column, ID (GitHub issue ID).
- The second column, Area (the label the model learns to predict during training).
- The third column, Title (GitHub issue title), is the first feature used for predicting the Area.
- The fourth column, Description, is the second feature used for predicting the Area.
IssuePrediction is the class used for prediction after the model has been trained. It has a single string (Area) with a ColumnName attribute that maps it to the PredictedLabel column. The PredictedLabel is used during prediction and evaluation. For evaluation, an input with training data, the predicted values, and the model are used.
All ML.NET operations start in the MLContext class. Initializing mlContext creates a new ML.NET environment that can be shared across the model creation workflow objects. It's similar, conceptually, to DbContext in Entity Framework.
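To make the column indices concrete, here is a small sketch in plain C# that splits one tab-separated row and reads the fields by position. The row content is hypothetical, and this is not how ML.NET parses the file internally; it only illustrates which field each LoadColumn index refers to.

```csharp
using System;

// Hypothetical row in the same column order as issues_train.tsv:
// ID <tab> Area <tab> Title <tab> Description
string row = "1\tarea-System.Net\tWebSockets are slow\tSignalR traffic is slow on my machine";

string[] columns = row.Split('\t');

// The indices mirror [LoadColumn(0)] through [LoadColumn(3)] on GitHubIssue.
string id = columns[0];
string area = columns[1];        // the label to predict
string title = columns[2];       // first feature
string description = columns[3]; // second feature

Console.WriteLine($"{id} | {area} | {title} | {description}");
```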

Initialize variables

Initialize the _mlContext global variable with a new instance of MLContext with a fixed random seed (seed: 0) for repeatable/deterministic results across multiple trainings. Replace the Console.WriteLine("Hello, World!") line with the following code:

_mlContext = new MLContext(seed: 0);

Load the data

ML.NET uses the IDataView interface as a flexible, efficient way of describing numeric or text tabular data. IDataView can load data either from text files or in real time (for example, from a SQL database or log files).

To initialize and load the _trainingDataView global variable in order to use it for the pipeline, add the following code after the _mlContext initialization:

_trainingDataView = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_trainDataPath, hasHeader: true);

The LoadFromTextFile() method defines the data schema and reads in the file. It takes in the data path variable and returns an IDataView.

Add the following after calling the LoadFromTextFile() method:

var pipeline = ProcessData();

The ProcessData method executes the following tasks:

Extracts and transforms the data.
Returns the processing pipeline.

Create the ProcessData method at the bottom of the Program.cs file using the following code:

IEstimator<ITransformer> ProcessData()
{

}

Extract features and transform the data

As you want to predict the Area GitHub label for a GitHubIssue, use the MapValueToKey() method to transform the Area column into a numeric key type Label column (a format accepted by classification algorithms) and add it as a new dataset column:

var pipeline = _mlContext.Transforms.Conversion.MapValueToKey(inputColumnName: "Area", outputColumnName: "Label")
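Conceptually, a value-to-key mapping assigns each distinct string a small integer. The following plain C# sketch illustrates the idea with hypothetical Area values. It assumes 1-based keys assigned in order of first appearance; treat the exact numbering ML.NET produces as an assumption rather than something this tutorial confirms.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical Area values, in the order they appear in the data.
string[] areas = { "area-System.Net", "area-System.Data", "area-System.Net", "area-System.IO" };

// Assign each distinct value a 1-based key in order of first appearance.
var keyOf = new Dictionary<string, uint>();
var labels = new List<uint>();
foreach (string area in areas)
{
    if (!keyOf.TryGetValue(area, out uint key))
    {
        key = (uint)keyOf.Count + 1;
        keyOf[area] = key;
    }
    labels.Add(key);
}

Console.WriteLine(string.Join(", ", labels)); // 1, 2, 1, 3
```

The reverse lookup (key back to string) is what MapKeyToValue performs later in the pipeline, so that the prediction is reported as a readable Area name.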

Next, call _mlContext.Transforms.Text.FeaturizeText, which transforms each text column (Title and Description) into a numeric vector, named TitleFeaturized and DescriptionFeaturized respectively. Append the featurization for both columns to the pipeline with the following code:

    .Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Title", outputColumnName: "TitleFeaturized"))
    .Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Description", outputColumnName: "DescriptionFeaturized"))

The last step in data preparation combines all of the feature columns into the Features column using the Concatenate() method. By default, a learning algorithm processes only features from the Features column. Append this transformation to the pipeline with the following code:

.Append(_mlContext.Transforms.Concatenate("Features", "TitleFeaturized", "DescriptionFeaturized"))

Next, append an AppendCacheCheckpoint to cache the DataView. When you iterate over the data multiple times, using the cache might give better performance. Use the following code:

.AppendCacheCheckpoint(_mlContext);

Warning

Use AppendCacheCheckpoint for small/medium datasets to lower training time. Do NOT use it (remove .AppendCacheCheckpoint()) when handling very large datasets.

Return the pipeline at the end of the ProcessData method.

return pipeline;

This step handles preprocessing/featurization. Using additional components available in ML.NET can enable better results with your model.

Build and train the model

Add the following call to the BuildAndTrainModel method as the next line after the call to the ProcessData() method:

var trainingPipeline = BuildAndTrainModel(_trainingDataView, pipeline);

The BuildAndTrainModel method executes the following tasks:

Creates the training algorithm class.
Trains the model.
Predicts area based on training data.
Returns the model.

Create the BuildAndTrainModel method, just after the declaration of the ProcessData() method, using the following code:

IEstimator<ITransformer> BuildAndTrainModel(IDataView trainingDataView, IEstimator<ITransformer> pipeline)
{

}

About the classification task

Classification is a machine learning task that uses data to determine the category, type, or class of an item or row of data and is frequently one of the following types:

Binary: either A or B.
Multiclass: multiple categories that can be predicted by using a single model.

For this type of problem, use a Multiclass classification learning algorithm, since your issue category prediction can be one of multiple categories (multiclass) rather than just two (binary).

Append the machine learning algorithm to the data transformation definitions by adding the following as the first line of code in BuildAndTrainModel():

var trainingPipeline = pipeline.Append(_mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features"))
        .Append(_mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

SdcaMaximumEntropy is your multiclass classification training algorithm. It is appended to the pipeline and accepts the featurized Title and Description (Features) and the Label input parameters to learn from the historic data. The final MapKeyToValue step converts the predicted key back into the original Area string.
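A maximum entropy classifier produces a probability for each class and predicts the most probable one. The sketch below shows that last step only, in plain C#: it turns hypothetical per-class scores into probabilities with a softmax and picks the highest. The class names and scores are invented for illustration, and this is not the trainer's implementation.

```csharp
using System;
using System.Linq;

// Hypothetical raw scores for three classes, as a linear model might produce.
string[] classes = { "area-System.Net", "area-System.Data", "area-System.IO" };
double[] scores = { 2.0, 1.0, 0.1 };

// Softmax: exponentiate (shifted by the max for numerical stability) and normalize.
double max = scores.Max();
double[] exps = scores.Select(s => Math.Exp(s - max)).ToArray();
double sum = exps.Sum();
double[] probabilities = exps.Select(e => e / sum).ToArray();

// The predicted class is the one with the highest probability.
int best = Array.IndexOf(probabilities, probabilities.Max());
Console.WriteLine($"{classes[best]} ({probabilities[best]:0.###})");
```

The probabilities sum to one, which is what makes the log-loss metrics used during evaluation meaningful.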

Train the model

Fit the model to the training data (trainingDataView) and return the trained model by adding the following as the next line of code in the BuildAndTrainModel() method:

_trainedModel = trainingPipeline.Fit(trainingDataView);

The Fit() method trains your model by transforming the dataset and applying the training.

The PredictionEngine is a convenience API that allows you to pass in and then perform a prediction on a single instance of data. Add this as the next line in the BuildAndTrainModel() method:

_predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssue, IssuePrediction>(_trainedModel);

Predict with the trained model

Add a GitHub issue to test the trained model's prediction in the BuildAndTrainModel method by creating an instance of GitHubIssue:

GitHubIssue issue = new GitHubIssue()
{
    Title = "WebSockets communication is slow in my machine",
    Description = "The WebSockets communication used under the covers by SignalR looks like is going slow in my development machine.."
};

Use the Predict() function to make a prediction on a single row of data:

var prediction = _predEngine.Predict(issue);

Use the model: Prediction results

Display the GitHubIssue and the corresponding Area label prediction in order to share the results and act on them accordingly. Create a display for the results using the following Console.WriteLine() code:

Console.WriteLine($"=============== Single Prediction just-trained-model - Result: {prediction.Area} ===============");

Return the model trained to use for evaluation

Return the model at the end of the BuildAndTrainModel method.

return trainingPipeline;

Evaluate the model

Now that you've created and trained the model, you need to evaluate it with a different dataset for quality assurance and validation. The Evaluate method evaluates the model trained in BuildAndTrainModel, which is held in the _trainedModel global variable. Create the Evaluate method, just after BuildAndTrainModel, as in the following code:

void Evaluate(DataViewSchema trainingDataViewSchema)
{

}

The Evaluate method executes the following tasks:

Loads the test dataset.
Creates the multiclass evaluator.
Evaluates the model and creates metrics.
Displays the metrics.

Add a call to the new method, right under the BuildAndTrainModel method call, using the following code:

Evaluate(_trainingDataView.Schema);

As you did previously with the training dataset, load the test dataset by adding the following code to the Evaluate method:

var testDataView = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_testDataPath, hasHeader: true);

The Evaluate() method computes the quality metrics for the model using the specified dataset. It returns a MulticlassClassificationMetrics object that contains the overall metrics computed by multiclass classification evaluators. To display the metrics to determine the quality of the model, you need to get them first. Notice the use of the Transform() method of the machine learning _trainedModel global variable (an ITransformer) to input the features and return predictions. Add the following code to the Evaluate method as the next line:

var testMetrics = _mlContext.MulticlassClassification.Evaluate(_trainedModel.Transform(testDataView));

The following metrics are evaluated for multiclass classification:

Micro Accuracy - Every sample-class pair contributes equally to the accuracy metric. You want Micro Accuracy to be as close to one as possible.
Macro Accuracy - Every class contributes equally to the accuracy metric. Minority classes are given equal weight as the larger classes. You want Macro Accuracy to be as close to one as possible.
Log-loss - see Log Loss. You want Log-loss to be as close to zero as possible.
Log-loss reduction - Ranges from (-inf, 1.00], where 1.00 is perfect predictions and 0 indicates mean predictions. You want Log-loss reduction to be as close to one as possible.
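The difference between these metrics is easiest to see on a tiny example. The following plain C# sketch computes all four by hand for six hypothetical test issues. The labels, predictions, and probabilities are invented, and the log-loss reduction baseline is assumed to be a predictor that always outputs the class frequencies of the test set.

```csharp
using System;
using System.Linq;

// Hypothetical true and predicted class indices for six test issues.
int[] actual    = { 0, 0, 0, 0, 1, 2 };
int[] predicted = { 0, 0, 0, 1, 1, 0 };
// Hypothetical probability the model assigned to the TRUE class of each issue.
double[] pTrue  = { 0.9, 0.8, 0.7, 0.3, 0.6, 0.2 };
int classCount = 3;

var pairs = actual.Zip(predicted, (a, p) => (Actual: a, Predicted: p)).ToArray();

// Micro accuracy: every sample counts equally (4 of 6 correct here).
double micro = pairs.Count(x => x.Actual == x.Predicted) / (double)pairs.Length;

// Macro accuracy: accuracy per class, then the average over classes
// ((3/4 + 1/1 + 0/1) / 3 here). Assumes every class occurs at least once.
double macro = Enumerable.Range(0, classCount)
    .Select(c =>
    {
        var ofClass = pairs.Where(x => x.Actual == c).ToArray();
        return ofClass.Count(x => x.Actual == x.Predicted) / (double)ofClass.Length;
    })
    .Average();

// Log-loss: average negative log of the probability given to the true class.
double logLoss = -pTrue.Select(Math.Log).Average();

// Baseline: always predict the class frequencies (the "prior").
double[] prior = Enumerable.Range(0, classCount)
    .Select(c => actual.Count(a => a == c) / (double)actual.Length)
    .ToArray();
double priorLogLoss = -actual.Select(a => Math.Log(prior[a])).Average();

// Log-loss reduction: relative improvement over the baseline.
double logLossReduction = (priorLogLoss - logLoss) / priorLogLoss;

Console.WriteLine($"Micro: {micro:0.###}  Macro: {macro:0.###}");
Console.WriteLine($"LogLoss: {logLoss:0.###}  Reduction: {logLossReduction:0.###}");
```

Micro accuracy is higher than macro accuracy here because the model does well on the majority class and misses the only example of the smallest class.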

Display the metrics for model validation

Use the following code to display the metrics, share the results, and then act on them:

Console.WriteLine($"*************************************************************************************************************");
Console.WriteLine($"*       Metrics for Multi-class Classification model - Test Data     ");
Console.WriteLine($"*------------------------------------------------------------------------------------------------------------");
Console.WriteLine($"*       MicroAccuracy:    {testMetrics.MicroAccuracy:0.###}");
Console.WriteLine($"*       MacroAccuracy:    {testMetrics.MacroAccuracy:0.###}");
Console.WriteLine($"*       LogLoss:          {testMetrics.LogLoss:#.###}");
Console.WriteLine($"*       LogLossReduction: {testMetrics.LogLossReduction:#.###}");
Console.WriteLine($"*************************************************************************************************************");

Save the model to a file

Once satisfied with your model, save it to a file to make predictions at a later time or in another application. Add the following code to the Evaluate method.

SaveModelAsFile(_mlContext, trainingDataViewSchema, _trainedModel);

Create the SaveModelAsFile method below your Evaluate method.

void SaveModelAsFile(MLContext mlContext, DataViewSchema trainingDataViewSchema, ITransformer model)
{

}

Add the following code to your SaveModelAsFile method. This code uses the Save method to serialize and store the trained model as a zip file.

mlContext.Model.Save(model, trainingDataViewSchema, _modelPath);

Deploy and Predict with a model

Add a call to the new method, right under the Evaluate method call, using the following code:

PredictIssue();

Create the PredictIssue method, just after the Evaluate method (and just before the SaveModelAsFile method), using the following code:

void PredictIssue()
{

}

The PredictIssue method executes the following tasks:

Loads the saved model.
Creates a single issue of test data.
Predicts area based on test data.
Combines test data and predictions for reporting.
Displays the predicted results.

Load the saved model into your application by adding the following code to the PredictIssue method:

ITransformer loadedModel = _mlContext.Model.Load(_modelPath, out var modelInputSchema);

Add a GitHub issue to test the loaded model's prediction in the PredictIssue method by creating an instance of GitHubIssue:

GitHubIssue singleIssue = new GitHubIssue() { Title = "Entity Framework crashes", Description = "When connecting to the database, EF is crashing" };

As you did previously, create a PredictionEngine instance with the following code:

_predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssue, IssuePrediction>(loadedModel);

The PredictionEngine is a convenience API that allows you to perform a prediction on a single instance of data. PredictionEngine is not thread-safe. It's acceptable to use in single-threaded or prototype environments. For improved performance and thread safety in production environments, use the PredictionEnginePool service, which creates an ObjectPool of PredictionEngine objects for use throughout your application. See this guide on how to use PredictionEnginePool in an ASP.NET Core Web API.

Note

PredictionEnginePool service extension is currently in preview.

Use the PredictionEngine to predict the Area GitHub label by adding the following code to the PredictIssue method for the prediction:

var prediction = _predEngine.Predict(singleIssue);

Use the loaded model for prediction

Display Area in order to categorize the issue and act on it accordingly. Create a display for the results using the following Console.WriteLine() code:

Console.WriteLine($"=============== Single Prediction - Result: {prediction.Area} ===============");

Results

Your results should be similar to the following. As the pipeline processes, it displays messages. You might see warnings, or processing messages. These messages have been removed from the following results for clarity.

=============== Single Prediction just-trained-model - Result: area-System.Net ===============
*************************************************************************************************************
*       Metrics for Multi-class Classification model - Test Data
*------------------------------------------------------------------------------------------------------------
*       MicroAccuracy:    0.738
*       MacroAccuracy:    0.668
*       LogLoss:          .919
*       LogLossReduction: .643
*************************************************************************************************************
=============== Single Prediction - Result: area-System.Data ===============

Congratulations! You've now successfully built a machine learning model for classifying and predicting an Area label for a GitHub issue. You can find the source code for this tutorial at the dotnet/samples repository.

Next steps

In this tutorial, you learned how to:

Prepare your data
Transform the data
Train the model
Evaluate the model
Predict with the trained model
Deploy and Predict with a loaded model

Advance to the next tutorial to learn more.

Taxi Fare Predictor