LAMA - automatic model creation framework
LightAutoML (LAMA) is an AutoML framework by Sber AI Lab.
It provides automatic model creation for the following tasks:
- binary classification
- multiclass classification
- regression
The current version of the package handles datasets that have independent samples in each row, i.e. each row is an object with its specific features and target. Multitable datasets and sequences are a work in progress :)
Note: we use the AutoWoE library to automatically create interpretable models.
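The "independent samples" format can be illustrated with a tiny made-up pandas frame (column names here echo the Titanic example used later; the data itself is invented):

```python
import pandas as pd

# Each row is an independent object: its own features and its own target.
df = pd.DataFrame({
    'age': [22, 38, 26],          # feature
    'fare': [7.25, 71.28, 7.92],  # feature
    'Survived': [0, 1, 1],        # target
})

# LightAutoML-style role mapping: name the target column;
# remaining columns are treated as features.
roles = {'target': 'Survived'}
print(df.shape)  # (3, 3)
```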
Authors: Alexander Ryzhkov, Anton Vakhrushev, Dmitry Simakov, Vasilii Bunakov, Rinchin Damdinov, Pavel Shvets, Alexander Kirilin.
Documentation of LightAutoML is available here; you can also generate it.
The full GPU pipeline for LightAutoML is currently available for developer testing (still in progress). The code and tutorials are available here.
- Installing LightAutoML from PyPI
- Quick tour
- Resources
- Contributing to LightAutoML
- License
- For developers
- Support and feature requests
To install the LAMA framework on your machine from PyPI, execute the following commands:
```bash
# Install base functionality:
pip install -U lightautoml

# For partial installation use the corresponding option.
# Extra dependencies: [nlp, cv, report]
# Or you can use 'all' to install everything
pip install -U lightautoml[nlp]
```
Additionally, run the following commands to enable pdf report generation:
```bash
# MacOS
brew install cairo pango gdk-pixbuf libffi

# Debian / Ubuntu
sudo apt-get install build-essential libcairo2 libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 libffi-dev shared-mime-info

# Fedora
sudo yum install redhat-rpm-config libffi-devel cairo pango gdk-pixbuf2

# Windows
# follow this tutorial: https://weasyprint.readthedocs.io/en/stable/install.html#windows
```
Let's solve the popular Kaggle Titanic competition below. There are two main ways to solve machine learning problems using LightAutoML:
- Use a ready-made preset for tabular data:
```python
import pandas as pd
from sklearn.metrics import f1_score

from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task

df_train = pd.read_csv('../input/titanic/train.csv')
df_test = pd.read_csv('../input/titanic/test.csv')

automl = TabularAutoML(
    task=Task(
        name='binary',
        metric=lambda y_true, y_pred: f1_score(y_true, (y_pred > 0.5) * 1)
    )
)
oof_pred = automl.fit_predict(df_train, roles={'target': 'Survived', 'drop': ['PassengerId']})
test_pred = automl.predict(df_test)

pd.DataFrame({
    'PassengerId': df_test.PassengerId,
    'Survived': (test_pred.data[:, 0] > 0.5) * 1
}).to_csv('submit.csv', index=False)
```
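The preset returns class-1 probabilities for a binary task, so the custom metric above converts them into hard labels at a 0.5 threshold before calling `f1_score`. That conversion in isolation, with made-up values:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.6, 0.4, 0.3])  # predicted probabilities of class 1

# (y_prob > 0.5) * 1 turns probabilities into 0/1 labels
y_pred = (y_prob > 0.5) * 1
print(y_pred.tolist())           # [0, 1, 1, 0, 0]
print(f1_score(y_true, y_pred))  # 0.8
```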
The LightAutoML framework has a lot of ready-to-use parts and extensive customization options; to learn more, check out the resources section.
- Tabular Playground Series April 2021 competition solution
- Titanic competition solution (80% accuracy)
- Titanic 12-code-lines competition solution (78% accuracy)
- House prices competition solution
- Natural Language Processing with Disaster Tweets solution
- Tabular Playground Series March 2021 competition solution
- Tabular Playground Series February 2021 competition solution
- Interpretable WhiteBox solution
- Custom ML pipeline elements inside existing ones
Google Colab tutorials and other examples:
- Tutorial_1_basics.ipynb - get started with LightAutoML on tabular data.
- Tutorial_2_WhiteBox_AutoWoE.ipynb - creating interpretable models.
- Tutorial_3_sql_data_source.ipynb - shows how to use LightAutoML presets (both standalone and time-utilized variants) for solving ML tasks on tabular data from an SQL database instead of CSV.
- Tutorial_4_NLP_Interpretation.ipynb - example of using the TabularNLPAutoML preset and LimeTextExplainer.
- Tutorial_5_uplift.ipynb - shows how to use LightAutoML for an uplift modeling task.
- Tutorial_6_custom_pipeline.ipynb - shows how to create your own pipeline from specified blocks: pipelines for feature generation and feature selection, ML algorithms, hyperparameter optimization, etc.
- Tutorial_7_ICE_and_PDP_interpretation.ipynb - shows how to obtain local and global interpretation of model results using the ICE and PDP approaches.
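The PDP idea from the last tutorial can be illustrated without LightAutoML: the partial dependence of a feature is the model's average prediction over the data with that feature forced to each grid value. A minimal numpy sketch with a hypothetical linear model standing in for a trained one:

```python
import numpy as np

def model_predict(X):
    # hypothetical trained model: depends on both columns
    return 2.0 * X[:, 0] + 0.5 * X[:, 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

def partial_dependence(X, feature, grid):
    pdp = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v                    # force the feature to the grid value
        pdp.append(model_predict(Xv).mean())  # average prediction over the data
    return np.array(pdp)

grid = np.array([-1.0, 0.0, 1.0])
pdp = partial_dependence(X, feature=0, grid=grid)
# For a linear model, the PDP of feature 0 is linear with slope 2
print(pdp[2] - pdp[1])  # 2.0
```

ICE curves are the same computation kept per-row instead of averaged.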
Note 1: for production use there is no need for the profiler (which increases run time and memory consumption), so please do not turn it on; it is off by default.
Note 2: to take a look at the report after the run, please comment out the last line of the demo, which contains the report deletion command.
LightAutoML crash courses:
Video guides:
- (Russian) LightAutoML webinar for the Sberloga community (Alexander Ryzhkov, Dmitry Simakov)
- (Russian) LightAutoML hands-on tutorial in Kaggle Kernels (Alexander Ryzhkov)
- (English) Automated Machine Learning with LightAutoML: theory and practice (Alexander Ryzhkov)
- (English) LightAutoML framework general overview, benchmarks and advantages for business (Alexander Ryzhkov)
- (English) LightAutoML practical guide - ML pipeline presets overview (Dmitry Simakov)
Papers:
- Anton Vakhrushev, Alexander Ryzhkov, Dmitry Simakov, Rinchin Damdinov, Maxim Savchenko, Alexander Tuzhilin. "LightAutoML: AutoML Solution for a Large Financial Services Ecosystem". arXiv:2109.01528, 2021.
Articles about LightAutoML:
If you are interested in contributing to LightAutoML, please read the Contributing Guide to get started.
This project is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.
First of all, you need to install git and poetry.
```bash
# Load LAMA source code
git clone https://github.com/sberbank-ai-lab/LightAutoML.git
cd LightAutoML/

# !!! Choose only one item !!!

# 1. Global installation: don't create a virtual environment
poetry config virtualenvs.create false --local

# 2. Recommended: create a virtual environment inside your project directory
poetry config virtualenvs.in-project true

# For more information read the poetry docs

# Install LAMA
poetry lock
poetry install
```
- Use LightAutoML as a framework of blocks to build your own custom pipeline (imports and the `N_THREADS`, `N_FOLDS`, `RANDOM_STATE` constants below are filled in for completeness; the constant values are examples):

```python
import pandas as pd

from lightautoml.automl.base import AutoML
from lightautoml.ml_algo.boost_lgbm import BoostLGBM
from lightautoml.ml_algo.tuning.optuna import OptunaTuner
from lightautoml.pipelines.features.lgb_pipeline import LGBSimpleFeatures
from lightautoml.pipelines.ml.base import MLPipeline
from lightautoml.pipelines.selection.importance_based import (
    ImportanceCutoffSelector,
    ModelBasedImportanceEstimator,
)
from lightautoml.reader.base import PandasToPandasReader
from lightautoml.tasks import Task

N_THREADS = 4      # example value
N_FOLDS = 5        # example value
RANDOM_STATE = 42  # example value

df_train = pd.read_csv('../input/titanic/train.csv')
df_test = pd.read_csv('../input/titanic/test.csv')

# define that the machine learning problem is binary classification
task = Task('binary')
reader = PandasToPandasReader(task, cv=N_FOLDS, random_state=RANDOM_STATE)

# create a feature selector
model0 = BoostLGBM(
    default_params={'learning_rate': 0.05, 'num_leaves': 64, 'seed': 42, 'num_threads': N_THREADS}
)
pipe0 = LGBSimpleFeatures()
mbie = ModelBasedImportanceEstimator()
selector = ImportanceCutoffSelector(pipe0, model0, mbie, cutoff=0)

# build the first level pipeline for AutoML
pipe = LGBSimpleFeatures()
# stop after 20 iterations or after 30 seconds
params_tuner1 = OptunaTuner(n_trials=20, timeout=30)
model1 = BoostLGBM(
    default_params={'learning_rate': 0.05, 'num_leaves': 128, 'seed': 1, 'num_threads': N_THREADS}
)
model2 = BoostLGBM(
    default_params={'learning_rate': 0.025, 'num_leaves': 64, 'seed': 2, 'num_threads': N_THREADS}
)
pipeline_lvl1 = MLPipeline(
    [(model1, params_tuner1), model2],
    pre_selection=selector,
    features_pipeline=pipe,
    post_selection=None,
)

# build the second level pipeline for AutoML
pipe1 = LGBSimpleFeatures()
model = BoostLGBM(
    default_params={'learning_rate': 0.05, 'num_leaves': 64, 'max_bin': 1024, 'seed': 3, 'num_threads': N_THREADS},
    freeze_defaults=True,
)
pipeline_lvl2 = MLPipeline([model], pre_selection=None, features_pipeline=pipe1, post_selection=None)

# build the AutoML pipeline
automl = AutoML(reader, [[pipeline_lvl1], [pipeline_lvl2]], skip_conn=False)

# train AutoML and get predictions
oof_pred = automl.fit_predict(df_train, roles={'target': 'Survived', 'drop': ['PassengerId']})
test_pred = automl.predict(df_test)

pd.DataFrame({
    'PassengerId': df_test.PassengerId,
    'Survived': (test_pred.data[:, 0] > 0.5) * 1
}).to_csv('submit.csv', index=False)
```
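Conceptually, the two-level AutoML above is stacking: level-1 models produce out-of-fold predictions, and the level-2 model trains on those predictions rather than on the raw features (with `skip_conn=False`, raw features are not passed through to level 2). A minimal scikit-learn sketch of the same idea, with synthetic data and stand-in models:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# level 1: out-of-fold probabilities from two base models
level1 = [RandomForestClassifier(random_state=1), GradientBoostingClassifier(random_state=2)]
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method='predict_proba')[:, 1] for m in level1
])

# level 2: trained only on level-1 predictions (no skip connection)
level2 = LogisticRegression().fit(oof, y)
print(oof.shape)  # (300, 2)
```

Out-of-fold predictions are used so the level-2 model never sees level-1 outputs produced on the rows they were trained on, which would leak labels.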
Seek prompt advice in the Slack community or Telegram group.
Open bug reports and feature requests on GitHub issues.