gyrdym/ml_algo


Machine learning algorithms in Dart programming language

License

BSD-2-Clause license

Machine learning algorithms for Dart developers - ml_algo library

The library is a part of the ecosystem:

ml_algo library - implementation of popular machine learning algorithms
ml_preprocessing library - a library for data preprocessing
ml_linalg library - a library for linear algebra
ml_dataframe library - a library for storing and manipulating data

What is ml_algo for?

The main purpose of the library is to give a native Dart implementation of machine learning algorithms to those who are interested in both the Dart language and data science. The library targets the Dart VM and Flutter. Its core features can also be used in web applications via WebAssembly.

The library content

Model selection
- CrossValidator. A factory that creates instances of cross-validators. Cross-validation lets researchers tune the hyperparameters of machine learning algorithms by assessing prediction quality on different parts of a dataset.
Classification algorithms
- LogisticRegressor. A class that performs linear binary classification of data. To use this kind of classifier your data has to be linearly separable.
  - LogisticRegressor.SGD. Implementation of the logistic regression algorithm based on stochastic gradient descent with L2 regularisation. To use this kind of classifier your data has to be linearly separable.
  - LogisticRegressor.BGD. Implementation of the logistic regression algorithm based on batch gradient descent with L2 regularisation. To use this kind of classifier your data has to be linearly separable.
  - LogisticRegressor.newton. Implementation of the logistic regression algorithm based on the Newton-Raphson method with L2 regularisation. To use this kind of classifier your data has to be linearly separable.
- SoftmaxRegressor. A class that performs linear multiclass classification of data. To use this kind of classifier your data has to be linearly separable.
- DecisionTreeClassifier. A class that performs classification using decision trees. May work with data with non-linear patterns.
- KnnClassifier. A class that performs classification using the k nearest neighbours algorithm - it makes predictions based on the first k closest observations to the given one.
Regression algorithms
- LinearRegressor. A general class for finding a linear pattern in training data and predicting outcomes as real numbers.
  - LinearRegressor.lasso. Implementation of the linear regression algorithm based on coordinate descent with lasso regularisation.
  - LinearRegressor.SGD. Implementation of the linear regression algorithm based on stochastic gradient descent with L2 regularisation.
  - LinearRegressor.BGD. Implementation of the linear regression algorithm based on batch gradient descent with L2 regularisation.
  - LinearRegressor.newton. Implementation of the linear regression algorithm based on the Newton-Raphson method with L2 regularisation.
- KnnRegressor. A class that makes predictions for each new observation based on the first k closest observations from training data. It may catch non-linear patterns of the data.
Clustering and retrieval algorithms
- KDTree. An algorithm for efficient data retrieval.
- Locality sensitive hashing. A family of algorithms that randomly partition all reference data points into different bins, which makes it possible to perform efficient K Nearest Neighbours search, since there is no need to search for the neighbours through the entire data. The family is represented by the following classes:
  - RandomBinaryProjectionSearcher
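
To illustrate the binning idea behind locality sensitive hashing, here is a minimal pure-Dart sketch of random binary projection. It is not the code of RandomBinaryProjectionSearcher: the function name binIndex, the number of hyperplanes and the sample points are all hypothetical. Each random hyperplane contributes one bit to the bin index, so points lying on the same side of every hyperplane end up in the same bin:

```dart
import 'dart:math';

/// Hashes [point] to a bin index: one bit per hyperplane, set when the
/// point lies on the non-negative side of that hyperplane.
/// (Hypothetical helper, not part of ml_algo.)
int binIndex(List<double> point, List<List<double>> hyperplanes) {
  var index = 0;
  for (final plane in hyperplanes) {
    var dot = 0.0;
    for (var i = 0; i < point.length; i++) {
      dot += point[i] * plane[i];
    }
    index = (index << 1) | (dot >= 0 ? 1 : 0);
  }
  return index;
}

void main() {
  final random = Random(42);
  const dimension = 2;
  const bitCount = 3;

  // Random hyperplanes through the origin, given by their normal vectors.
  final hyperplanes = List.generate(
    bitCount,
    (_) => List.generate(dimension, (_) => random.nextDouble() * 2 - 1),
  );

  final points = [
    [1.0, 1.0],
    [1.1, 0.9],
    [-1.0, -1.0],
  ];

  // Group point indices by bin; a query only needs to scan its own bin.
  final bins = <int, List<int>>{};
  for (var i = 0; i < points.length; i++) {
    bins.putIfAbsent(binIndex(points[i], hyperplanes), () => []).add(i);
  }
  print(bins);
}
```

A nearest-neighbour query then hashes the query point the same way and compares it only with the points stored in the matching bin.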

For more information on the library's API, please visit the API reference.

Examples

Logistic regression

Let's classify records from a well-known dataset - Pima Indians Diabetes Database - via Logistic regressor.

Important note:

Please pay attention to the problems that the classifiers and regressors exposed by the library solve. For example, Logistic regressor solves only binary classification problems, which means that you can't use this classifier with a dataset that has more than two classes. To find out more about regressors and classifiers, please refer to the API documentation of the package.

Import all necessary packages. First, make sure that you have the ml_preprocessing and ml_dataframe packages in your dependencies:

    dependencies:
      ml_dataframe: ^1.5.0
      ml_preprocessing: ^7.0.2

We need these packages to parse raw data in order to use it further. For more details, please visit the ml_preprocessing repository page.

Important note:

Regressors and classifiers exposed by the library do not handle strings, booleans and nulls; they can only deal with numbers! You need to convert all the improper values of your dataset to numbers. Please refer to the ml_preprocessing library to find out more about data preprocessing.

    import 'package:ml_algo/ml_algo.dart';
    import 'package:ml_dataframe/ml_dataframe.dart';
    import 'package:ml_preprocessing/ml_preprocessing.dart';

Read a dataset's file

We have 2 options here:

Download the dataset from Pima Indians Diabetes Database.

Instructions

For a desktop application:

Just provide a proper path to your downloaded file and use the function-factory fromCsv from the ml_dataframe package to read the file:

    final samples = await fromCsv('datasets/pima_indians_diabetes_database.csv');

For a flutter application:

You need to add the dataset to the Flutter assets by adding the following config to pubspec.yaml:

    flutter:
      assets:
        - assets/datasets/pima_indians_diabetes_database.csv

You need to create the assets directory in the file system and put the dataset's file there. After that you can access the dataset:

    import 'package:flutter/services.dart' show rootBundle;
    import 'package:ml_dataframe/ml_dataframe.dart';

    void main() async {
      final rawCsvContent = await rootBundle.loadString('assets/datasets/pima_indians_diabetes_database.csv');
      final samples = DataFrame.fromRawCsv(rawCsvContent);
    }

Or we may simply use the getPimaIndiansDiabetesDataFrame function from the ml_dataframe package. The function returns a ready to use DataFrame instance filled with Pima Indians Diabetes Database data.

Instructions

    import 'package:ml_dataframe/ml_dataframe.dart';

    void main() {
      final samples = getPimaIndiansDiabetesDataFrame();
    }

Prepare datasets for training and testing

Data in this file is represented by 768 records and 8 features. The 9th column is a label column; it contains either 0 or 1 on each row. This column is our target - we should predict a class label for each observation. The column's name is Outcome. Let's store it:

    final targetColumnName = 'Outcome';

Now it's time to prepare data splits. Since we have a smallish dataset (only 768 records), we can't afford to split the data into just train and test sets and evaluate the model on them; the best approach in our case is cross-validation. Accordingly, let's split the data in the following way using the library's splitData function:

    final splits = splitData(samples, [0.7]);
    final validationData = splits[0];
    final testData = splits[1];

splitData accepts a DataFrame instance as the first argument and a list of ratios as the second one. Now we have 70% of our data as a validation set and 30% as a test set for evaluating generalization errors.
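
To make the ratio list more concrete, here is a pure-Dart sketch of a ratio-based split over a plain list. It is only an illustration: splitByRatios is a hypothetical helper, and the way the real splitData rounds part sizes or orders rows is an assumption here.

```dart
/// Splits [rows] into consecutive parts. Each ratio is the share of rows
/// that goes into the corresponding part; the remainder forms the last part.
/// (Hypothetical helper, not the library's splitData.)
List<List<T>> splitByRatios<T>(List<T> rows, List<double> ratios) {
  final parts = <List<T>>[];
  var start = 0;
  for (final ratio in ratios) {
    final end = start + (rows.length * ratio).round();
    parts.add(rows.sublist(start, end));
    start = end;
  }
  parts.add(rows.sublist(start));
  return parts;
}

void main() {
  final rows = List.generate(10, (i) => i);
  final parts = splitByRatios(rows, [0.7]);

  print(parts[0].length); // 7
  print(parts[1].length); // 3
}
```

One ratio therefore yields two parts, which is why splits[0] and splits[1] are both available after splitData(samples, [0.7]).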

Set up a model selection algorithm

Then we may create an instance of the CrossValidator class to fit the hyperparameters of our model. We should pass validation data (our validationData variable) and a number of folds into the CrossValidator constructor.

    final validator = CrossValidator.kFold(validationData, numberOfFolds: 5);
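
As a rough picture of what k-fold validation does with the rows, the following pure-Dart sketch assigns row indices to folds in round-robin order. The helper kFoldIndices is hypothetical, and the actual fold assignment used by CrossValidator.kFold may differ (for example, it may use contiguous blocks):

```dart
/// Returns, for each fold, the row indices held out for validation.
/// All other indices are used for training in that fold.
/// (Hypothetical helper, not part of ml_algo.)
List<List<int>> kFoldIndices(int sampleCount, int numberOfFolds) {
  final folds = List.generate(numberOfFolds, (_) => <int>[]);
  for (var i = 0; i < sampleCount; i++) {
    folds[i % numberOfFolds].add(i);
  }
  return folds;
}

void main() {
  const sampleCount = 10;
  final folds = kFoldIndices(sampleCount, 5);

  for (final heldOut in folds) {
    final train = [
      for (var i = 0; i < sampleCount; i++)
        if (!heldOut.contains(i)) i
    ];
    print('validate on $heldOut, train on $train');
  }
}
```

Every row is held out exactly once, so each fold produces one score and the validator ends up with numberOfFolds scores in total.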

Let's create a factory for the classifier with the desired hyperparameters. After the cross-validation, we have to decide whether the selected hyperparameters are good enough or not:

    final createClassifier = (DataFrame samples) =>
      LogisticRegressor(
        samples,
        targetColumnName,
      );

If we want to evaluate the learning process more thoroughly, we may pass the collectLearningData argument to the classifier constructor:

    final createClassifier = (DataFrame samples) =>
      LogisticRegressor(
        ...,
        collectLearningData: true,
      );

This argument activates collecting costs for each optimization iteration, and you can see the cost values right after the model creation.
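
To show what "costs per iteration" can look like, here is a small self-contained sketch that fits a one-feature logistic model with batch gradient descent and records the log-loss at every iteration. It is not the library's optimizer: trainAndCollectCosts, the toy data and the learning rate are all assumptions made for illustration.

```dart
import 'dart:math';

double sigmoid(double z) => 1 / (1 + exp(-z));

/// Fits weight and bias with batch gradient descent and returns the
/// average log-loss measured at the start of every iteration.
/// (Hypothetical helper, not part of ml_algo.)
List<double> trainAndCollectCosts(
  List<double> xs,
  List<int> ys, {
  int iterations = 50,
  double learningRate = 0.5,
}) {
  var weight = 0.0;
  var bias = 0.0;
  final costs = <double>[];

  for (var iteration = 0; iteration < iterations; iteration++) {
    var cost = 0.0;
    var weightGradient = 0.0;
    var biasGradient = 0.0;

    for (var i = 0; i < xs.length; i++) {
      final probability = sigmoid(weight * xs[i] + bias);
      cost -= ys[i] == 1 ? log(probability) : log(1 - probability);
      weightGradient += (probability - ys[i]) * xs[i];
      biasGradient += probability - ys[i];
    }

    costs.add(cost / xs.length);
    weight -= learningRate * weightGradient / xs.length;
    bias -= learningRate * biasGradient / xs.length;
  }

  return costs;
}

void main() {
  final costs = trainAndCollectCosts([-2, -1, 1, 2], [0, 0, 1, 1]);

  print(costs.first); // cost before any update
  print(costs.last); // cost after the last recorded iteration
}
```

A steadily decreasing sequence suggests the optimizer is converging; a growing or oscillating one usually points to a learning rate that is too high.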

Evaluate the performance of the model

Let's assume we chose perfect hyperparameters. To validate this hypothesis, let's use the CrossValidator instance created before:

    final scores = await validator.evaluate(createClassifier, MetricType.accuracy);

Since the CrossValidator instance returns a Vector of scores as a result of our predictor evaluation, we may choose any way to reduce all the collected scores to a single number. For instance, we may use Vector's mean method:

    final accuracy = scores.mean();

Let's print the score:

print('accuracy on k fold validation: ${accuracy.toStringAsFixed(2)}');

We can see something like this:

accuracy on k fold validation: 0.75
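
For reference, accuracy is the share of observations whose predicted label equals the actual one. A minimal pure-Dart sketch, where the accuracy function and the labels are made up for illustration:

```dart
/// Share of positions where the predicted label matches the actual label.
/// (Hypothetical helper, not part of ml_algo.)
double accuracy(List<int> predicted, List<int> actual) {
  var correct = 0;
  for (var i = 0; i < actual.length; i++) {
    if (predicted[i] == actual[i]) {
      correct++;
    }
  }
  return correct / actual.length;
}

void main() {
  // Three of the four predictions match the actual labels.
  print(accuracy([1, 0, 0, 1], [1, 0, 1, 1])); // 0.75
}
```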

Let's assess our hyperparameters on the test set in order to evaluate the model's generalization error:

    final testSplits = splitData(testData, [0.8]);
    final classifier = createClassifier(testSplits[0]);
    final finalScore = classifier.assess(testSplits[1], MetricType.accuracy);

The final score is like:

    print(finalScore.toStringAsFixed(2)); // approx. 0.75

If we specified the collectLearningData parameter, we may see costs for each iteration in order to evaluate how our cost changed from iteration to iteration during the learning process:

print(classifier.costPerIteration);

Write the model to a json file

It seems our model has good generalization ability, which means we may use it in the future. To do so, we may store the model in a file as JSON:

await classifier.saveAsJson('diabetes_classifier.json');

After that we can simply read the model from the file and make predictions:

    import 'dart:io';
    import 'package:ml_algo/ml_algo.dart';
    import 'package:ml_dataframe/ml_dataframe.dart';

    // main has to be async because the body uses await
    void main() async {
      // ...
      final fileName = 'diabetes_classifier.json';
      final file = File(fileName);
      final encodedModel = await file.readAsString();
      final classifier = LogisticRegressor.fromJson(encodedModel);
      final unlabelledData = await fromCsv('some_unlabelled_data.csv');
      final prediction = classifier.predict(unlabelledData);

      print(prediction.header); // ('class variable (0 or 1)')
      print(prediction.rows); // [
                              //   (1),
                              //   (0),
                              //   (0),
                              //   (1),
                              //   ...,
                              //   (1),
                              // ]
      // ...
    }

Please note that all the hyperparameters that we used to generate the model are persisted as the model's read-only fields, and we can access them anytime:

    print(classifier.iterationsLimit);
    print(classifier.probabilityThreshold);
    // and so on
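
As an illustration of what a probability threshold means for a binary classifier, here is a pure-Dart sketch. The classify helper, the weights and the features are hypothetical, and the exact way the library applies probabilityThreshold is an assumption:

```dart
import 'dart:math';

double sigmoid(double z) => 1 / (1 + exp(-z));

/// Returns 1 when the predicted probability reaches [threshold], else 0.
/// (Hypothetical helper, not part of ml_algo.)
int classify(List<double> features, List<double> weights, double threshold) {
  var z = 0.0;
  for (var i = 0; i < features.length; i++) {
    z += features[i] * weights[i];
  }
  return sigmoid(z) >= threshold ? 1 : 0;
}

void main() {
  final features = [1.0, 2.0];
  final weights = [0.5, -0.25];

  // The weighted sum is 0, so the predicted probability is exactly 0.5.
  print(classify(features, weights, 0.5)); // 1
  print(classify(features, weights, 0.6)); // 0
}
```

Raising the threshold makes the classifier more conservative about predicting the positive class.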

All the code for a desktop application:

    import 'package:ml_algo/ml_algo.dart';
    import 'package:ml_dataframe/ml_dataframe.dart';
    import 'package:ml_preprocessing/ml_preprocessing.dart';

    void main() async {
      // Another option - to use a toy dataset:
      // final samples = getPimaIndiansDiabetesDataFrame();
      final samples = await fromCsv('datasets/pima_indians_diabetes_database.csv', headerExists: true);
      final targetColumnName = 'Outcome';
      final splits = splitData(samples, [0.7]);
      final validationData = splits[0];
      final testData = splits[1];
      final validator = CrossValidator.kFold(validationData, numberOfFolds: 5);
      final createClassifier = (DataFrame samples) =>
        LogisticRegressor(
          samples,
          targetColumnName,
        );
      final scores = await validator.evaluate(createClassifier, MetricType.accuracy);
      final accuracy = scores.mean();

      print('accuracy on k fold validation: ${accuracy.toStringAsFixed(2)}');

      final testSplits = splitData(testData, [0.8]);
      final classifier = createClassifier(testSplits[0]);
      final finalScore = classifier.assess(testSplits[1], MetricType.accuracy);

      print(finalScore.toStringAsFixed(2));

      await classifier.saveAsJson('diabetes_classifier.json');
    }

All the code for a flutter application:

    import 'package:flutter/services.dart' show rootBundle;
    import 'package:ml_algo/ml_algo.dart';
    import 'package:ml_dataframe/ml_dataframe.dart';
    import 'package:ml_preprocessing/ml_preprocessing.dart';

    void main() async {
      final rawCsvContent = await rootBundle.loadString('assets/datasets/pima_indians_diabetes_database.csv');
      // Another option - to use a toy dataset:
      // final samples = getPimaIndiansDiabetesDataFrame();
      final samples = DataFrame.fromRawCsv(rawCsvContent);
      final targetColumnName = 'Outcome';
      final splits = splitData(samples, [0.7]);
      final validationData = splits[0];
      final testData = splits[1];
      final validator = CrossValidator.kFold(validationData, numberOfFolds: 5);
      final createClassifier = (DataFrame samples) =>
        LogisticRegressor(
          samples,
          targetColumnName,
        );
      final scores = await validator.evaluate(createClassifier, MetricType.accuracy);
      final accuracy = scores.mean();

      print('accuracy on k fold validation: ${accuracy.toStringAsFixed(2)}');

      final testSplits = splitData(testData, [0.8]);
      final classifier = createClassifier(testSplits[0]);
      final finalScore = classifier.assess(testSplits[1], MetricType.accuracy);

      print(finalScore.toStringAsFixed(2));

      await classifier.saveAsJson('diabetes_classifier.json');
    }

Linear regression

Let's try to predict house prices using linear regression and the famous Boston Housing dataset. The dataset contains 13 independent variables and 1 dependent variable - medv, which is the target one (you can find the dataset in e2e/_datasets/housing.csv).

Again, first we need to download the file and create a dataframe. The dataset is headless; we may either use an autoheader or provide our own header. Let's use an autoheader in our example:

For a desktop application:

Just provide a proper path to your downloaded file and use the function-factory fromCsv from the ml_dataframe package to read the file:

    final samples = await fromCsv('datasets/housing.csv', headerExists: false, columnDelimiter: ' ');

For a flutter application:

You need to add the dataset to the Flutter assets by adding the following config to pubspec.yaml:

    flutter:
      assets:
        - assets/datasets/housing.csv

You need to create the assets directory in the file system and put the dataset's file there. After that you can access the dataset:

    import 'package:flutter/services.dart' show rootBundle;
    import 'package:ml_dataframe/ml_dataframe.dart';

    void main() async {
      final rawCsvContent = await rootBundle.loadString('assets/datasets/housing.csv');
      final samples = DataFrame.fromRawCsv(rawCsvContent, fieldDelimiter: ' ');
    }

Prepare the dataset for training and testing

Data in this file is represented by 505 records and 13 features. The 14th column is a target. Since we use an autoheader, the target's name is autogenerated and it is col_13. Let's store it in a variable:

    final targetName = 'col_13';

Then let's shuffle the data:

    final shuffledSamples = samples.shuffle();

Now it's time to prepare data splits. Let's split the shuffled data into train and test subsets using the library's splitData function:

    final splits = splitData(shuffledSamples, [0.8]);
    final trainData = splits[0];
    final testData = splits[1];

splitData accepts a DataFrame instance as the first argument and a list of ratios as the second one. Now we have 80% of our data as a train set and 20% as a test set.

Let's train the model:

    final model = LinearRegressor(trainData, targetName);

By default, LinearRegressor uses a closed-form solution to train the model. One can also use a different solution type, e.g. the stochastic gradient descent algorithm:

    final model = LinearRegressor.SGD(
      trainData,
      targetName,
      iterationLimit: 90,
    );

or linear regression based on coordinate descent with Lasso regularization:

    final model = LinearRegressor.lasso(
      trainData,
      targetName,
      iterationLimit: 90,
    );
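
For reference, the closed-form solution for linear regression mentioned above is usually the normal equation. Whether LinearRegressor computes exactly this expression is an assumption; the symbols are not taken from this README: X is the feature matrix (with a column of ones for the intercept), y is the target vector and \hat{w} is the vector of fitted coefficients.

```latex
\hat{w} = (X^\top X)^{-1} X^\top y
```

The gradient-based and coordinate descent variants approximate a minimiser iteratively instead, which is why they take an iteration limit.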

Next, we should evaluate performance of our model:

    final error = model.assess(testData, MetricType.mape);

    print(error);
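
MAPE stands for mean absolute percentage error. Here is a minimal pure-Dart sketch of the metric. The mape function below is hypothetical and returns a fraction; whether the library reports a fraction or a percentage is an assumption here:

```dart
/// Mean absolute percentage error, returned as a fraction (0.1 means 10%).
/// Assumes that no actual value is zero.
/// (Hypothetical helper, not part of ml_algo.)
double mape(List<double> predicted, List<double> actual) {
  var sum = 0.0;
  for (var i = 0; i < actual.length; i++) {
    sum += ((actual[i] - predicted[i]) / actual[i]).abs();
  }
  return sum / actual.length;
}

void main() {
  // Both predictions are off by 10% of the actual value.
  print(mape([18.0, 33.0], [20.0, 30.0]).toStringAsFixed(2)); // 0.10
}
```

The lower the value, the closer the predicted prices are to the real ones in relative terms.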

If we are fine with the error, we can save the model for future use:

await model.saveAsJson('housing_model.json');

Later we may use our trained model for prediction:

    import 'dart:io';
    import 'package:ml_algo/ml_algo.dart';
    import 'package:ml_dataframe/ml_dataframe.dart';

    void main() async {
      final file = File('housing_model.json');
      final encodedModel = await file.readAsString();
      final model = LinearRegressor.fromJson(encodedModel);
      final unlabelledData = await fromCsv('some_unlabelled_data.csv');
      final prediction = model.predict(unlabelledData);

      print(prediction.header);
      print(prediction.rows);
    }

All the code for a desktop application:

    import 'package:ml_algo/ml_algo.dart';
    import 'package:ml_dataframe/ml_dataframe.dart';

    void main() async {
      final samples = (await fromCsv('datasets/housing.csv', headerExists: false, columnDelimiter: ' '))
        .shuffle();
      final targetName = 'col_13';
      final splits = splitData(samples, [0.8]);
      final trainData = splits[0];
      final testData = splits[1];
      final model = LinearRegressor(trainData, targetName);
      final error = model.assess(testData, MetricType.mape);

      print(error);

      await model.saveAsJson('housing_model.json');
    }

All the code for a flutter application:

    import 'package:flutter/services.dart' show rootBundle;
    import 'package:ml_algo/ml_algo.dart';
    import 'package:ml_dataframe/ml_dataframe.dart';

    void main() async {
      final rawCsvContent = await rootBundle.loadString('assets/datasets/housing.csv');
      final samples = DataFrame.fromRawCsv(rawCsvContent, fieldDelimiter: ' ')
        .shuffle();
      final targetName = 'col_13';
      final splits = splitData(samples, [0.8]);
      final trainData = splits[0];
      final testData = splits[1];
      final model = LinearRegressor(trainData, targetName);
      final error = model.assess(testData, MetricType.mape);

      print(error);

      await model.saveAsJson('housing_model.json');
    }

Decision tree-based classification

Let's try to classify data from the well-known Iris dataset using a non-linear algorithm - decision trees.

First, you need to download the data and place it in a proper place in your file system. To do so, follow the instructions given in the Logistic regression section. Or you may use the getIrisDataFrame function that returns a ready to use DataFrame instance filled with the Iris dataset.

After loading the data, we need to preprocess it. We should drop the Id column since it carries no useful information. Also, we need to encode the 'Species' column - originally, it contains 3 repeated string labels; to feed it to the classifier we need to convert the labels into numbers:

    import 'package:ml_algo/ml_algo.dart';
    import 'package:ml_dataframe/ml_dataframe.dart';
    import 'package:ml_preprocessing/ml_preprocessing.dart';

    void main() async {
      final samples = getIrisDataFrame()
        .shuffle()
        .dropSeries(names: ['Id']);

      final pipeline = Pipeline(samples, [
        toIntegerLabels(
          columnNames: ['Species'], // Here we convert strings from 'Species' column into numbers
        ),
      ]);

      // Apply the fitted pipeline to get the data with encoded labels
      final processed = pipeline.process(samples);
    }
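
To show what converting labels into numbers amounts to, here is a pure-Dart sketch of integer label encoding. The helper toIntegerCodes is hypothetical, and the concrete numbers that toIntegerLabels assigns to each species are an assumption:

```dart
/// Maps each distinct label to an integer in order of first appearance.
/// (Hypothetical helper, not part of ml_preprocessing.)
List<int> toIntegerCodes(List<String> labels) {
  final codes = <String, int>{};
  return labels
      .map((label) => codes.putIfAbsent(label, () => codes.length))
      .toList();
}

void main() {
  final species = [
    'Iris-setosa',
    'Iris-versicolor',
    'Iris-setosa',
    'Iris-virginica',
  ];

  print(toIntegerCodes(species)); // [0, 1, 0, 2]
}
```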

Next, let's create a model:

    final model = DecisionTreeClassifier(
      processed,
      'Species',
      minError: 0.3,
      minSamplesCount: 5,
      maxDepth: 4,
    );

As you can see, we specified 3 hyperparameters: minError, minSamplesCount and maxDepth. Let's look at the parameters in more detail:

minError. A minimum error on a tree node. If the error is less than or equal to the value, the node is considered a leaf.
minSamplesCount. A minimum number of samples on a node. If the number of samples is less than or equal to the value, the node is considered a leaf.
maxDepth. A maximum depth of the resulting decision tree. Once the tree reaches the maxDepth, all the level's nodes are considered leaves.

All the parameters serve as stopping criteria for the tree building algorithm.
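
The three stopping criteria can be sketched as a single predicate. This is only an illustration of the descriptions above, not the library's tree builder: isLeaf is a hypothetical function, and treating "reaches the maxDepth" as depth >= maxDepth is an assumption.

```dart
/// Returns true when a node should not be split any further.
/// (Hypothetical helper, not part of ml_algo.)
bool isLeaf({
  required double error,
  required int samplesCount,
  required int depth,
  required double minError,
  required int minSamplesCount,
  required int maxDepth,
}) =>
    error <= minError || samplesCount <= minSamplesCount || depth >= maxDepth;

void main() {
  // The error is already below minError, so the node becomes a leaf.
  print(isLeaf(
    error: 0.25,
    samplesCount: 40,
    depth: 2,
    minError: 0.3,
    minSamplesCount: 5,
    maxDepth: 4,
  )); // true

  // None of the criteria is met, so the node can be split further.
  print(isLeaf(
    error: 0.5,
    samplesCount: 40,
    depth: 2,
    minError: 0.3,
    minSamplesCount: 5,
    maxDepth: 4,
  )); // false
}
```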

Now we have a ready to use model. As usual, we can save the model to a JSON file:

await model.saveAsJson('path/to/json/file.json');

Unlike other models, in the case of a decision tree, we can visualise the algorithm result - we can save the model as an SVG file:

await model.saveAsSvg('path/to/svg/file.svg');

Once we have saved it, we can open the file in any image viewer, e.g. in a web browser. An example of the resulting SVG image:

KDTree-based data retrieval

Let's take a look at another field of machine learning - data retrieval. The field is represented by a family of algorithms; one of them is KDTree, which is exposed by the library.

KDTree is an algorithm that divides the whole search space into partitions in the form of a binary tree, which makes it efficient to retrieve data.

Let's retrieve some data points through a kd-tree built on the Iris dataset.

First, we need to prepare the data. To do so, we need to load the dataset. For this purpose, we may use the getIrisDataFrame function from ml_dataframe. The function returns a DataFrame instance prefilled with the Iris data:

    import 'package:ml_algo/ml_algo.dart';
    import 'package:ml_dataframe/ml_dataframe.dart';

    void main() {
      final originalData = getIrisDataFrame();
    }

Since the dataset contains the Id column, which carries no useful information, and the Species column, which contains text data, we need to drop these columns:

    import 'package:ml_algo/ml_algo.dart';
    import 'package:ml_dataframe/ml_dataframe.dart';

    void main() {
      final originalData = getIrisDataFrame();
      final data = originalData.dropSeries(names: ['Id', 'Species']);
    }

Next, we can build the tree:

    import 'package:ml_algo/ml_algo.dart';
    import 'package:ml_dataframe/ml_dataframe.dart';

    void main() {
      final originalData = getIrisDataFrame();
      final data = originalData.dropSeries(names: ['Id', 'Species']);
      final tree = KDTree(data);
    }

And query nearest neighbours for an arbitrary point. Let's say we want to find the 5 nearest neighbours of the point [6.5, 3.01, 4.5, 1.5]:

    import 'package:ml_algo/ml_algo.dart';
    import 'package:ml_dataframe/ml_dataframe.dart';
    import 'package:ml_linalg/vector.dart';

    void main() {
      final originalData = getIrisDataFrame();
      final data = originalData.dropSeries(names: ['Id', 'Species']);
      final tree = KDTree(data);
      final neighbourCount = 5;
      final point = Vector.fromList([6.5, 3.01, 4.5, 1.5]);
      final neighbours = tree.query(point, neighbourCount);

      print(neighbours);
    }

The last instruction prints the following:

    ((Index: 75, Distance: 0.17349341930302867), (Index: 51, Distance: 0.21470911402365767), (Index: 65, Distance: 0.26095956499211426), (Index: 86, Distance: 0.29681616124778537), (Index: 56, Distance: 0.4172527193942372))

The nearest point has index 75 in the original data. Let's check the record at that index:

    import 'package:ml_dataframe/ml_dataframe.dart';

    void main() {
      final originalData = getIrisDataFrame();

      print(originalData.rows.elementAt(75));
    }

It prints the following:

(76, 6.6, 3.0, 4.4, 1.4, Iris-versicolor)

Remember, we dropped the Id and Species columns, which are the very first and the very last elements in the output. The remaining elements, 6.6, 3.0, 4.4, 1.4, look quite similar to our target point - 6.5, 3.01, 4.5, 1.5 - so the query result makes sense.
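
We can double-check the reported distance by hand. The sketch below assumes that the tree measures Euclidean distance (the README does not state which distance KDTree uses by default); euclideanDistance is a hypothetical helper written in plain Dart:

```dart
import 'dart:math';

/// Euclidean distance between two points of the same dimension.
/// (Hypothetical helper, not part of ml_algo.)
double euclideanDistance(List<double> a, List<double> b) {
  var sum = 0.0;
  for (var i = 0; i < a.length; i++) {
    final diff = a[i] - b[i];
    sum += diff * diff;
  }
  return sqrt(sum);
}

void main() {
  final query = [6.5, 3.01, 4.5, 1.5];
  final nearest = [6.6, 3.0, 4.4, 1.4];

  print(euclideanDistance(query, nearest).toStringAsFixed(4)); // 0.1735
}
```

The squared differences are 0.01, 0.0001, 0.01 and 0.01, and the square root of their sum, 0.0301, rounds to 0.1735, which agrees with the first distance in the query output.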

If you want to use KDTree outside the ml_algo ecosystem, meaning you don't want to use the ml_linalg and ml_dataframe packages in your application, you may import only the KDTree library and use the fromIterable constructor and the queryIterable method to perform the query:

    import 'package:ml_algo/kd_tree.dart';

    void main() async {
      final tree = KDTree.fromIterable([
        // some data here
      ]);
      final neighbourCount = 5;
      final neighbours = tree.queryIterable([/* some point here */], neighbourCount);

      print(neighbours);
    }

As usual, we can persist our tree by saving it to a JSON file:

    import 'dart:convert'; // needed for jsonDecode
    import 'dart:io';
    import 'package:ml_algo/ml_algo.dart';
    import 'package:ml_dataframe/ml_dataframe.dart';

    // main has to be async because the body uses await
    void main() async {
      final originalData = getIrisDataFrame();
      final data = originalData.dropSeries(names: ['Id', 'Species']);
      final tree = KDTree(data);

      // ...

      await tree.saveAsJson('path/to/json/file.json');

      // ...

      final file = await File('path/to/json/file.json').readAsString();
      final encodedTree = jsonDecode(file) as Map<String, dynamic>;
      final restoredTree = KDTree.fromJson(encodedTree);

      print(restoredTree);
    }

Models retraining

Someday our previously shining model may degrade in terms of prediction accuracy - in this case, we can retrain it. Retraining simply means re-running the same learning algorithm that was used to generate our current model, keeping the same hyperparameters but using a new dataset with the same features:

    import 'dart:io';
    import 'package:ml_algo/ml_algo.dart';
    import 'package:ml_dataframe/ml_dataframe.dart';

    void main() async {
      final fileName = 'diabetes_classifier.json';
      final file = File(fileName);
      final encodedModel = await file.readAsString();
      final classifier = LogisticRegressor.fromJson(encodedModel);

      // ...
      // here we do something and realize that our classifier performance is not so good
      // ...

      final newData = await fromCsv('path/to/dataset/with/new/data/to/retrain/the/classifier');
      final retrainedClassifier = classifier.retrain(newData);
    }

The workflow with other predictors (SoftmaxRegressor, DecisionTreeClassifier and so on) is quite similar to the one described above for LogisticRegressor; feel free to experiment with other models.

A couple of words about linear models which use gradient optimisation methods

Sometimes you may get NaN or Infinity as the value of your score, or it may be equal to some inconceivable value (extremely big or extremely low). To prevent this, you need to find a proper value of the initial learning rate. You may also choose between the following learning rate strategies: constant, timeBased, stepBased and exponential:

    final createClassifier = (DataFrame samples) =>
      LogisticRegressor(
        ...,
        initialLearningRate: 1e-5,
        learningRateType: LearningRateType.timeBased,
        ...,
      );
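
To give a feeling for how these strategies differ, here is a pure-Dart sketch of four common textbook schedules. The exact formulas and decay parameters used by the library are not stated in this README, so the functions, the decay, drop and dropEvery parameters and their values below are assumptions:

```dart
import 'dart:math';

/// The rate never changes.
double constantRate(double initial, int iteration) => initial;

/// The rate shrinks proportionally to 1 / (1 + decay * iteration).
double timeBasedRate(double initial, double decay, int iteration) =>
    initial / (1 + decay * iteration);

/// The rate is multiplied by [drop] once every [dropEvery] iterations.
double stepBasedRate(
        double initial, double drop, int dropEvery, int iteration) =>
    initial * pow(drop, iteration ~/ dropEvery);

/// The rate decays exponentially with the iteration number.
double exponentialRate(double initial, double decay, int iteration) =>
    initial * exp(-decay * iteration);

void main() {
  const initial = 0.1;

  for (final iteration in [0, 10, 100]) {
    print('iteration $iteration: '
        'constant=${constantRate(initial, iteration)}, '
        'timeBased=${timeBasedRate(initial, 0.1, iteration)}, '
        'stepBased=${stepBasedRate(initial, 0.5, 10, iteration)}, '
        'exponential=${exponentialRate(initial, 0.1, iteration)}');
  }
}
```

All the decaying schedules start from the initial learning rate and then reduce it, which lets the optimizer take large steps early and smaller, safer steps later.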

Helpful articles on algorithms standing behind the library

Contacts

If you have questions, feel free to text me on

About

gyrdym.github.io/ml_algo/
