This sample tutorial shows how to use ML.NET in a C# .NET console application in Visual Studio to create a GitHub issue classifier that trains a model to classify and predict the Area label for a GitHub issue.
In this tutorial, you learn how to:
Prepare your data
Transform the data
Train the model
Evaluate the model
Predict with the trained model
Deploy and predict with a loaded model
You can find the source code for this tutorial at the dotnet/samples repository.
Create a C# Console Application called "GitHubIssueClassification". Select Next.
Choose .NET 7 as the framework to use. Select Create.
Create a directory named Data in your project to save your data set files:
In Solution Explorer, right-click on your project and select Add > New Folder. Type "Data" and press Enter.
Create a directory named Models in your project to save your model:
In Solution Explorer, right-click on your project and select Add > New Folder. Type "Models" and press Enter.
Install the Microsoft.ML NuGet Package:
Note
This sample uses the latest stable version of the NuGet packages mentioned unless otherwise stated.
In Solution Explorer, right-click on your project and select Manage NuGet Packages. Choose "nuget.org" as the Package source, select the Browse tab, search for Microsoft.ML and select Install. Select the OK button on the Preview Changes dialog and then select the I Accept button on the License Acceptance dialog if you agree with the license terms for the packages listed.
Download the issues_train.tsv and the issues_test.tsv data sets and save them to the Data folder you created previously. The first dataset trains the machine learning model and the second can be used to evaluate how accurate your model is.
In Solution Explorer, right-click each of the *.tsv files and select Properties. Under Advanced, change the value of Copy to Output Directory to Copy if newer.
Add the following additional using directives to the top of the Program.cs file:
using Microsoft.ML;
using GitHubIssueClassification;
Create three global fields to hold the paths to the recently downloaded files, and global variables for the MLContext, DataView, and PredictionEngine:
_trainDataPath has the path to the dataset used to train the model.
_testDataPath has the path to the dataset used to evaluate the model.
_modelPath has the path where the trained model is saved.
_mlContext is the MLContext that provides processing context.
_trainingDataView is the IDataView used to process the training dataset.
_predEngine is the PredictionEngine<TSrc,TDst> used for single predictions.
Add the following code to the line directly below the using directives to specify those paths and the other variables:
string _appPath = Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]) ?? ".";
string _trainDataPath = Path.Combine(_appPath, "..", "..", "..", "Data", "issues_train.tsv");
string _testDataPath = Path.Combine(_appPath, "..", "..", "..", "Data", "issues_test.tsv");
string _modelPath = Path.Combine(_appPath, "..", "..", "..", "Models", "model.zip");
MLContext _mlContext;
PredictionEngine<GitHubIssue, IssuePrediction> _predEngine;
ITransformer _trainedModel;
IDataView _trainingDataView;
Create some classes for your input data and predictions. Add a new class to your project:
In Solution Explorer, right-click the project, and then select Add > New Item.
In the Add New Item dialog box, select Class and change the Name field to GitHubIssueData.cs. Then, select Add.
The GitHubIssueData.cs file opens in the code editor. Add the following using directive to the top of GitHubIssueData.cs:
using Microsoft.ML.Data;
Remove the existing class definition and add the following code to the GitHubIssueData.cs file. This code has two classes, GitHubIssue and IssuePrediction.
public class GitHubIssue
{
    [LoadColumn(0)]
    public string? ID { get; set; }
    [LoadColumn(1)]
    public string? Area { get; set; }
    [LoadColumn(2)]
    public required string Title { get; set; }
    [LoadColumn(3)]
    public required string Description { get; set; }
}

public class IssuePrediction
{
    [ColumnName("PredictedLabel")]
    public string? Area;
}
The label is the column you want to predict. The identified Features are the inputs you give the model to predict the Label.
Use the LoadColumnAttribute to specify the indices of the source columns in the data set.
GitHubIssue is the input dataset class and has the following String fields:
ID (the GitHub issue ID).
Area (the prediction for training).
Title (the GitHub issue title) is the first feature used for predicting the Area.
Description is the second feature used for predicting the Area.
IssuePrediction is the class used for prediction after the model has been trained. It has a single string (Area) and a PredictedLabel ColumnName attribute. The PredictedLabel is used during prediction and evaluation. For evaluation, an input with training data, the predicted values, and the model are used.
All ML.NET operations start in the MLContext class. Initializing mlContext creates a new ML.NET environment that can be shared across the model creation workflow objects. It's similar, conceptually, to DbContext in Entity Framework.
Initialize the _mlContext global variable with a new instance of MLContext with a random seed (seed: 0) for repeatable/deterministic results across multiple trainings. Replace the Console.WriteLine("Hello World!") line with the following code:
_mlContext = new MLContext(seed: 0);
ML.NET uses the IDataView interface as a flexible, efficient way of describing numeric or text tabular data. IDataView can load data either from text files or in real time (for example, from a SQL database or log files).
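For instance, loading from something other than a text file could look like the following minimal sketch (not part of this tutorial), which wraps an in-memory collection as an IDataView with LoadFromEnumerable; the sample issues here are hypothetical:
// Hypothetical in-memory data; in a real app this might come from a database or log source.
var inMemoryIssues = new[]
{
    new GitHubIssue { Title = "App crashes on startup", Description = "The console app throws an exception when launched." },
    new GitHubIssue { Title = "Readme typo", Description = "The readme file has a spelling mistake." }
};
// LoadFromEnumerable exposes any IEnumerable<T> as an IDataView without writing it to disk.
IDataView inMemoryDataView = _mlContext.Data.LoadFromEnumerable(inMemoryIssues);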
To initialize and load the _trainingDataView global variable in order to use it for the pipeline, add the following code after the mlContext initialization:
_trainingDataView = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_trainDataPath, hasHeader: true);
The LoadFromTextFile() method defines the data schema and reads in the file. It takes in the data path variables and returns an IDataView.
Add the following after calling the LoadFromTextFile() method:
var pipeline = ProcessData();
The ProcessData method executes the following tasks:
Extracts and transforms the data.
Returns the processing pipeline.
Create the ProcessData method at the bottom of the Program.cs file using the following code:
IEstimator<ITransformer> ProcessData()
{

}
As you want to predict the Area GitHub label for a GitHubIssue, use the MapValueToKey() method to transform the Area column into a numeric key type Label column (a format accepted by classification algorithms) and add it as a new dataset column:
var pipeline = _mlContext.Transforms.Conversion.MapValueToKey(inputColumnName: "Area", outputColumnName: "Label")
Next, call mlContext.Transforms.Text.FeaturizeText, which transforms the text columns (Title and Description) into numeric vectors called TitleFeaturized and DescriptionFeaturized. Append the featurization for both columns to the pipeline with the following code:
.Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Title", outputColumnName: "TitleFeaturized"))
.Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Description", outputColumnName: "DescriptionFeaturized"))
The last step in data preparation combines all of the feature columns into the Features column using the Concatenate() method. By default, a learning algorithm processes only features from the Features column. Append this transformation to the pipeline with the following code:
.Append(_mlContext.Transforms.Concatenate("Features", "TitleFeaturized", "DescriptionFeaturized"))
Next, append AppendCacheCheckpoint to cache the DataView, so that when you iterate over the data multiple times using the cache, you might get better performance, as with the following code:
.AppendCacheCheckpoint(_mlContext);
Warning
Use AppendCacheCheckpoint for small/medium datasets to lower training time. Do NOT use it (remove .AppendCacheCheckpoint()) when handling very large datasets.
Return the pipeline at the end of the ProcessData method.
return pipeline;
This step handles preprocessing/featurization. Using additional components available in ML.NET can enable better results with your model.
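As one hedged illustration (an assumption for this sketch, not a step in this tutorial), you could normalize the raw text before featurizing it; the TitleNormalized column name below is hypothetical:
// A minimal sketch: lower-case and clean the Title text before featurizing it.
var pipelineWithNormalization = _mlContext.Transforms.Conversion.MapValueToKey(inputColumnName: "Area", outputColumnName: "Label")
    .Append(_mlContext.Transforms.Text.NormalizeText(outputColumnName: "TitleNormalized", inputColumnName: "Title"))
    .Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "TitleNormalized", outputColumnName: "TitleFeaturized"));
If you adopt a step like this, the later Concatenate call would still reference the featurized column names as before.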
Add the following call to the BuildAndTrainModel method as the next line after the call to the ProcessData() method:
var trainingPipeline = BuildAndTrainModel(_trainingDataView, pipeline);
The BuildAndTrainModel method executes the following tasks:
Creates the training algorithm class.
Trains the model.
Predicts the area based on training data.
Returns the model.
Create the BuildAndTrainModel method, just after the declaration of the ProcessData() method, using the following code:
IEstimator<ITransformer> BuildAndTrainModel(IDataView trainingDataView, IEstimator<ITransformer> pipeline)
{

}
Classification is a machine learning task that uses data to determine the category, type, or class of an item or row of data, and is frequently one of the following types:
Binary: either A or B.
Multiclass: multiple categories that can be predicted by using a single model.
For this type of problem, use a Multiclass classification learning algorithm, since your issue category prediction can be one of multiple categories (multiclass) rather than just two (binary).
Append the machine learning algorithm to the data transformation definitions by adding the following as the first line of code in BuildAndTrainModel():
var trainingPipeline = pipeline.Append(_mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features"))
    .Append(_mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));
The SdcaMaximumEntropy is your multiclass classification training algorithm. It's appended to the pipeline and accepts the featurized Title and Description (Features) and the Label input parameters to learn from the historic data.
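Because the trainer is just another stage appended to the pipeline, a different multiclass trainer can be slotted in at the same point. The following is a minimal sketch for illustration only (not used in this tutorial), swapping in LbfgsMaximumEntropy:
// A sketch of the same pipeline with an alternative multiclass trainer.
var alternativeTrainingPipeline = pipeline.Append(_mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy("Label", "Features"))
    .Append(_mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));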
Fit the model to the trainingDataView data and return the trained model by adding the following as the next line of code in the BuildAndTrainModel() method:
_trainedModel = trainingPipeline.Fit(trainingDataView);
The Fit() method trains your model by transforming the dataset and applying the training.
The PredictionEngine is a convenience API that allows you to pass in and then perform a prediction on a single instance of data. Add this as the next line in the BuildAndTrainModel() method:
_predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssue, IssuePrediction>(_trainedModel);
Add a GitHub issue to test the trained model's prediction in the BuildAndTrainModel method by creating an instance of GitHubIssue:
GitHubIssue issue = new GitHubIssue()
{
    Title = "WebSockets communication is slow in my machine",
    Description = "The WebSockets communication used under the covers by SignalR looks like is going slow in my development machine.."
};
Use the Predict() function to make a prediction on a single row of data:
var prediction = _predEngine.Predict(issue);
Display the GitHubIssue and corresponding Area label prediction in order to share the results and act on them accordingly. Create a display for the results using the following Console.WriteLine() code:
Console.WriteLine($"=============== Single Prediction just-trained-model - Result: {prediction.Area} ===============");
Return the model at the end of the BuildAndTrainModel method.
return trainingPipeline;
Now that you've created and trained the model, you need to evaluate it with a different dataset for quality assurance and validation. In the Evaluate method, the model created in BuildAndTrainModel is passed in to be evaluated. Create the Evaluate method, just after BuildAndTrainModel, as in the following code:
void Evaluate(DataViewSchema trainingDataViewSchema)
{

}
The Evaluate method executes the following tasks:
Loads the test dataset.
Creates the multiclass evaluator.
Evaluates the model and creates metrics.
Displays the metrics.
Add a call to the new method, right under the BuildAndTrainModel method call, using the following code:
Evaluate(_trainingDataView.Schema);
As you did previously with the training dataset, load the test dataset by adding the following code to the Evaluate method:
var testDataView = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_testDataPath, hasHeader: true);
The Evaluate() method computes the quality metrics for the model using the specified dataset. It returns a MulticlassClassificationMetrics object that contains the overall metrics computed by multiclass classification evaluators. To display the metrics and determine the quality of the model, you need to get them first. Notice the use of the Transform() method of the machine learning _trainedModel global variable (an ITransformer) to input the features and return predictions. Add the following code to the Evaluate method as the next line:
var testMetrics = _mlContext.MulticlassClassification.Evaluate(_trainedModel.Transform(testDataView));
The following metrics are evaluated for multiclass classification:
Micro Accuracy: every sample-class pair contributes equally to the accuracy metric. You want micro accuracy to be as close to one as possible.
Macro Accuracy: every class contributes equally to the accuracy metric, so minority classes are given the same weight as larger classes. You want macro accuracy to be as close to one as possible.
Log-loss: measures the uncertainty of the classifier's predictions. You want log-loss to be as close to zero as possible.
Log-loss reduction: measures the improvement of the classifier's predictions over a random prediction. You want log-loss reduction to be as close to one as possible.
Use the following code to display the metrics, share the results, and then act on them:
Console.WriteLine($"*************************************************************************************************************");Console.WriteLine($"* Metrics for Multi-class Classification model - Test Data ");Console.WriteLine($"*------------------------------------------------------------------------------------------------------------");Console.WriteLine($"* MicroAccuracy: {testMetrics.MicroAccuracy:0.###}");Console.WriteLine($"* MacroAccuracy: {testMetrics.MacroAccuracy:0.###}");Console.WriteLine($"* LogLoss: {testMetrics.LogLoss:#.###}");Console.WriteLine($"* LogLossReduction: {testMetrics.LogLossReduction:#.###}");Console.WriteLine($"*************************************************************************************************************");
Once satisfied with your model, save it to a file to make predictions at a later time or in another application. Add the following code to the Evaluate method.
SaveModelAsFile(_mlContext, trainingDataViewSchema, _trainedModel);
Create the SaveModelAsFile method below your Evaluate method.
void SaveModelAsFile(MLContext mlContext, DataViewSchema trainingDataViewSchema, ITransformer model)
{

}
Add the following code to your SaveModelAsFile method. This code uses the Save method to serialize and store the trained model as a zip file.
mlContext.Model.Save(model, trainingDataViewSchema, _modelPath);
Add a call to the new method, right under the Evaluate method call, using the following code:
PredictIssue();
Create the PredictIssue method, just after the Evaluate method (and just before the SaveModelAsFile method), using the following code:
void PredictIssue()
{

}
The PredictIssue method executes the following tasks:
Loads the saved model.
Creates a single issue of test data.
Predicts the Area based on the test data.
Combines the test data and predictions for reporting.
Displays the predicted results.
Load the saved model into your application by adding the following code to the PredictIssue method:
ITransformer loadedModel = _mlContext.Model.Load(_modelPath, out var modelInputSchema);
Add a GitHub issue to test the trained model's prediction in the PredictIssue method by creating an instance of GitHubIssue:
GitHubIssue singleIssue = new GitHubIssue()
{
    Title = "Entity Framework crashes",
    Description = "When connecting to the database, EF is crashing"
};
As you did previously, create a PredictionEngine instance with the following code:
_predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssue, IssuePrediction>(loadedModel);
The PredictionEngine is a convenience API that allows you to perform a prediction on a single instance of data. PredictionEngine is not thread-safe. It's acceptable to use in single-threaded or prototype environments. For improved performance and thread safety in production environments, use the PredictionEnginePool service, which creates an ObjectPool of PredictionEngine objects for use throughout your application. See this guide on how to use PredictionEnginePool in an ASP.NET Core Web API.
Note
The PredictionEnginePool service extension is currently in preview.
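As a rough sketch of what that registration could look like in an ASP.NET Core app (assuming the Microsoft.Extensions.ML package and a hypothetical "model.zip" path; this is outside the scope of this console tutorial):
// Program.cs of a hypothetical ASP.NET Core app; requires the Microsoft.Extensions.ML NuGet package.
var builder = WebApplication.CreateBuilder(args);
// Registers a pool of PredictionEngine objects backed by the saved model file.
builder.Services.AddPredictionEnginePool<GitHubIssue, IssuePrediction>()
    .FromFile("model.zip");
var app = builder.Build();
app.Run();
A controller or endpoint would then take a PredictionEnginePool<GitHubIssue, IssuePrediction> dependency and call its Predict method, which borrows an engine from the pool instead of sharing a single PredictionEngine across threads.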
Use the PredictionEngine to predict the Area GitHub label by adding the following code to the PredictIssue method for the prediction:
var prediction = _predEngine.Predict(singleIssue);
Display the Area in order to categorize the issue and act on it accordingly. Create a display for the results using the following Console.WriteLine() code:
Console.WriteLine($"=============== Single Prediction - Result: {prediction.Area} ===============");
Your results should be similar to the following. As the pipeline processes, it displays messages. You might see warnings, or processing messages. These messages have been removed from the following results for clarity.
=============== Single Prediction just-trained-model - Result: area-System.Net ===============
*************************************************************************************************************
*       Metrics for Multi-class Classification model - Test Data
*------------------------------------------------------------------------------------------------------------
*       MicroAccuracy:    0.738
*       MacroAccuracy:    0.668
*       LogLoss:          .919
*       LogLossReduction: .643
*************************************************************************************************************
=============== Single Prediction - Result: area-System.Data ===============
Congratulations! You've now successfully built a machine learning model for classifying and predicting an Area label for a GitHub issue. You can find the source code for this tutorial at the dotnet/samples repository.
In this tutorial, you learned how to:
Prepare your data
Transform the data
Train the model
Evaluate the model
Predict with the trained model
Deploy and predict with a loaded model
Advance to the next tutorial to learn more.