salesforce/TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning

TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library written in Scala that runs on top of Apache Spark. It was developed with a focus on accelerating machine learning developer productivity through machine learning automation, and an API that enforces compile-time type-safety, modularity, and reuse. Through automation, it achieves accuracies close to hand-tuned models with an almost 100x reduction in time.

Use TransmogrifAI if you need a machine learning library to:

Build production ready machine learning applications in hours, not months
Build machine learning models without getting a Ph.D. in machine learning
Build modular, reusable, strongly typed machine learning workflows

To understand the motivation behind TransmogrifAI check out these:

Open Sourcing TransmogrifAI: Automated Machine Learning for Structured Data, a blog post by @snabar
Meet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions, a talk by @tovbinm
Low Touch Machine Learning, a talk by @leahmcguire

Skip to Quick Start and Documentation.

Predicting Titanic Survivors with TransmogrifAI

The Titanic dataset is an often-cited dataset in the machine learning community. The goal is to build a machine-learned model that predicts survivors from the Titanic passenger manifest. Here is how you would build the model using TransmogrifAI:

```scala
import com.salesforce.op._
import com.salesforce.op.readers._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.classification._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

implicit val spark = SparkSession.builder.config(new SparkConf()).getOrCreate()
import spark.implicits._

// Read Titanic data as a DataFrame
val passengersData = DataReaders.Simple.csvCase[Passenger](path = pathToData).readDataset().toDF()

// Extract response and predictor Features
val (survived, predictors) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = "survived")

// Automated feature engineering
val featureVector = predictors.transmogrify()

// Automated feature validation and selection
val checkedFeatures = survived.sanityCheck(featureVector, removeBadFeatures = true)

// Automated model selection
val pred = BinaryClassificationModelSelector().setInput(survived, checkedFeatures).getOutput()

// Setting up a TransmogrifAI workflow and training the model
val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(pred).train()

println("Model summary:\n" + model.summaryPretty())
```

Model summary:

```
Evaluated Logistic Regression, Random Forest models with 3 folds and AuPR metric.
Evaluated 3 Logistic Regression models with AuPR between [0.6751930383321765, 0.7768725281794376]
Evaluated 16 Random Forest models with AuPR between [0.7781671467343991, 0.8104798040316159]

Selected model Random Forest classifier with parameters:
|-----------------------|--------------|
| Model Param           |     Value    |
|-----------------------|--------------|
| modelType             | RandomForest |
| featureSubsetStrategy |         auto |
| impurity              |         gini |
| maxBins               |           32 |
| maxDepth              |           12 |
| minInfoGain           |        0.001 |
| minInstancesPerNode   |           10 |
| numTrees              |           50 |
| subsamplingRate       |          1.0 |
|-----------------------|--------------|

Model evaluation metrics:
|-------------|--------------------|---------------------|
| Metric Name | Hold Out Set Value |  Training Set Value |
|-------------|--------------------|---------------------|
| Precision   |               0.85 |   0.773851590106007 |
| Recall      | 0.6538461538461539 |  0.6930379746835443 |
| F1          | 0.7391304347826088 |  0.7312186978297163 |
| AuROC       | 0.8821603927986905 |  0.8766642291593114 |
| AuPR        | 0.8225075757571668 |   0.850331080886535 |
| Error       | 0.1643835616438356 | 0.19682151589242053 |
| TP          |               17.0 |               219.0 |
| TN          |               44.0 |               438.0 |
| FP          |                3.0 |                64.0 |
| FN          |                9.0 |                97.0 |
|-------------|--------------------|---------------------|

Top model insights computed using correlation:
|-----------------------|----------------------|
| Top Positive Insights |      Correlation     |
|-----------------------|----------------------|
| sex = "female"        |   0.5177801026737666 |
| cabin = "OTHER"       |   0.3331391338844782 |
| pClass = 1            |   0.3059642953159715 |
|-----------------------|----------------------|
| Top Negative Insights |      Correlation     |
|-----------------------|----------------------|
| sex = "male"          |  -0.5100301587292186 |
| pClass = 3            |  -0.5075774968534326 |
| cabin = null          | -0.31463114463832633 |
|-----------------------|----------------------|

Top model insights computed using CramersV:
|-----------------------|----------------------|
|      Top Insights     |       CramersV       |
|-----------------------|----------------------|
| sex                   |    0.525557139885501 |
| embarked              |  0.31582347194683386 |
| age                   |  0.21582347194683386 |
|-----------------------|----------------------|
```
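To help read the evaluation table, the following sketch recomputes Precision, Recall, F1 and Error from the TP/TN/FP/FN counts in the summary. It uses the standard textbook definitions and plain Scala only; the `SummaryMetrics` object and `Confusion` class are hypothetical names introduced for illustration and are not part of the TransmogrifAI API.

```scala
object SummaryMetrics {
  // Confusion-matrix counts for one data split (hypothetical helper type).
  final case class Confusion(tp: Double, tn: Double, fp: Double, fn: Double) {
    def precision: Double = tp / (tp + fp)
    def recall: Double = tp / (tp + fn)
    // Harmonic mean of precision and recall.
    def f1: Double = 2 * precision * recall / (precision + recall)
    // Fraction of misclassified rows.
    def error: Double = (fp + fn) / (tp + tn + fp + fn)
  }

  // Counts copied from the model summary above.
  val holdOut = Confusion(tp = 17.0, tn = 44.0, fp = 3.0, fn = 9.0)
  val training = Confusion(tp = 219.0, tn = 438.0, fp = 64.0, fn = 97.0)

  def main(args: Array[String]): Unit = {
    for ((name, c) <- Seq("hold-out" -> holdOut, "training" -> training)) {
      println(s"$name: precision=${c.precision} recall=${c.recall} f1=${c.f1} error=${c.error}")
    }
  }
}
```

For the hold-out set this gives a precision of 17 / (17 + 3) = 0.85 and an error of (3 + 9) / 73, matching the values in the table up to floating-point rounding.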

While this may seem a bit too magical, TransmogrifAI also gives those who want more control the flexibility to specify every feature being extracted and every algorithm being applied in the ML pipeline. Visit our docs site for full documentation, getting started, examples, FAQ and other information.
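As a rough illustration of that extra control, the sketch below declares each raw feature explicitly, derives a few features by hand, and restricts model selection to logistic regression. It is modeled on the Titanic example in the repository's helloworld folder, but treat it as a sketch under assumptions: the `Passenger` field layout, the `TitanicExplicit` object name, and the `csvFilePath` argument are illustrative and may not match your data or library version.

```scala
import com.salesforce.op._
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._
import com.salesforce.op.readers.DataReaders
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelSelector
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelsToTry._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Assumed record layout for the Titanic CSV (illustrative).
case class Passenger(
  id: Int, survived: Int, pClass: Option[Int], name: Option[String],
  sex: Option[String], age: Option[Double], sibSp: Option[Int], parCh: Option[Int],
  ticket: Option[String], fare: Option[Double], cabin: Option[String], embarked: Option[String]
)

object TitanicExplicit {
  def main(args: Array[String]): Unit = {
    val csvFilePath = args(0) // placeholder: path to the Titanic CSV file
    implicit val spark = SparkSession.builder
      .config(new SparkConf().setAppName("TitanicExplicit")).getOrCreate()
    import spark.implicits._ // encoders for the Passenger case class

    // Raw features, each with an explicit type and extractor
    val survived = FeatureBuilder.RealNN[Passenger].extract(_.survived.toRealNN).asResponse
    val pClass = FeatureBuilder.PickList[Passenger].extract(_.pClass.map(_.toString).toPickList).asPredictor
    val sex = FeatureBuilder.PickList[Passenger].extract(_.sex.map(_.toString).toPickList).asPredictor
    val age = FeatureBuilder.Real[Passenger].extract(_.age.toReal).asPredictor
    val sibSp = FeatureBuilder.Integral[Passenger].extract(_.sibSp.toIntegral).asPredictor
    val parCh = FeatureBuilder.Integral[Passenger].extract(_.parCh.toIntegral).asPredictor

    // Hand-written derived features
    val familySize = sibSp + parCh + 1
    val pivotedSex = sex.pivot()
    val normedAge = age.fillMissingWithMean().zNormalize()

    // Vectorize the chosen features, then validate them against the response
    val features = Seq(pClass, age, sibSp, parCh, familySize, pivotedSex, normedAge).transmogrify()
    val checkedFeatures = survived.sanityCheck(features)

    // Restrict model selection to a single algorithm
    val prediction = BinaryClassificationModelSelector.withTrainValidationSplit(
      modelTypesToUse = Seq(OpLogisticRegression)
    ).setInput(survived, checkedFeatures).getOutput()

    // Read the CSV through a typed reader and train the workflow
    val reader = DataReaders.Simple.csvCase[Passenger](path = Option(csvFilePath), key = _.id.toString)
    val model = new OpWorkflow().setResultFeatures(survived, prediction).setReader(reader).train()

    println("Model summary:\n" + model.summaryPretty())
    spark.stop()
  }
}
```

The earlier snippet also refers to a `Passenger` type and a `pathToData` value that it does not define; a case class along these lines and an `Option[String]` path are one plausible way to fill them in.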

Adding TransmogrifAI into your project

You can simply add TransmogrifAI as a regular dependency to an existing project. Start by picking a TransmogrifAI version that matches your project dependencies from the version matrix below (if not sure, take the stable version):

TransmogrifAI Version	Spark Version	Scala Version	Java Version
0.7.1 (unreleased, master), 0.7.0 (stable)	2.4	2.11	1.8
0.6.1, 0.6.0, 0.5.3, 0.5.2, 0.5.1, 0.5.0	2.3	2.11	1.8
0.4.0, 0.3.4	2.2	2.11	1.8
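If you prefer to keep this lookup in code, the matrix can be encoded as a small map keyed by Spark's major.minor version. This is only a sketch in plain Scala: the `VersionMatrix` object and its methods are hypothetical helpers, and choosing the first released version listed for each Spark line is an assumption rather than an official recommendation.

```scala
object VersionMatrix {
  // Spark major.minor -> released TransmogrifAI versions, in the order listed in the matrix above.
  // The unreleased 0.7.1 (master) is deliberately left out.
  val bySpark: Map[String, Seq[String]] = Map(
    "2.4" -> Seq("0.7.0"),
    "2.3" -> Seq("0.6.1", "0.6.0", "0.5.3", "0.5.2", "0.5.1", "0.5.0"),
    "2.2" -> Seq("0.4.0", "0.3.4")
  )

  // "2.4.5" -> "2.4"
  def majorMinor(version: String): String =
    version.split('.').take(2).mkString(".")

  // First listed release for the given Spark version, if the matrix covers it.
  def newestFor(sparkVersion: String): Option[String] =
    bySpark.get(majorMinor(sparkVersion)).flatMap(_.headOption)
}
```

Inside a Spark application, the running version is available as `org.apache.spark.SPARK_VERSION`, so `VersionMatrix.newestFor(org.apache.spark.SPARK_VERSION)` would return `None` for any Spark line the matrix does not list.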

For Gradle, add the following to build.gradle:

```groovy
repositories {
    jcenter()
    mavenCentral()
}
dependencies {
    // TransmogrifAI core dependency
    compile 'com.salesforce.transmogrifai:transmogrifai-core_2.11:0.7.0'

    // TransmogrifAI pretrained models, e.g. OpenNLP POS/NER models etc. (optional)
    // compile 'com.salesforce.transmogrifai:transmogrifai-models_2.11:0.7.0'
}
```

For SBT, add the following to build.sbt:

```scala
scalaVersion := "2.11.12"

resolvers += Resolver.jcenterRepo

// TransmogrifAI core dependency
libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-core" % "0.7.0"

// TransmogrifAI pretrained models, e.g. OpenNLP POS/NER models etc. (optional)
// libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-models" % "0.7.0"
```

Then import TransmogrifAI into your code:

```scala
// TransmogrifAI functionality: feature types, feature builders, feature dsl, readers, aggregators etc.
import com.salesforce.op._
import com.salesforce.op.aggregators._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.readers._

// Spark enrichments (optional)
import com.salesforce.op.utils.spark.RichDataset._
import com.salesforce.op.utils.spark.RichRDD._
import com.salesforce.op.utils.spark.RichRow._
import com.salesforce.op.utils.spark.RichMetadata._
import com.salesforce.op.utils.spark.RichStructType._
```

Quick Start and Documentation

Visit our docs site for full documentation, getting started, examples, FAQ and other information.

See the scaladoc for the programming API.

Authors

Kevin Moore @jauntbox
Kin Fai Kan @kinfaikan
Leah McGuire @leahmcguire
Matthew Tovbin @tovbinm
Max Ovsiankin @maxov
Michael Loh @mikeloh77
Michael Weil @michaelweilsalesforce
Shubha Nabar @snabar
Vitaly Gordon @vitalyg
Vlad Patryshev @vpatryshev

Internal Contributors (prior to release)

Chris Rupley @crupley
Chris Wu @cjwooo
Eric Wayman @ericwayman
Felipe Oliveira @feliperazeek
Gera Shegalov @gerashegalov
Jean-Marc Soumet @ajmssc
Marco Vivero @marcovivero
Mario Rodriguez @mrodriguezsfiq
Mayukh Bhaowal @mayukhb
Minh-An Quinn @minhanquinn
Nicolas Drizard @nicodri
Oleg Gusak @ogusak
Patrick Framption @tricktrap
Ryle Goehausen @ryleg
Sanmitra Ijeri @sanmitra
Sky Chen @almandsky
Sophie Xiaodan Sun @sxd929
Till Bergmann @tillbe
Xiaoqian Liu @wingsrc

License

BSD 3-Clause © Salesforce.com, Inc.

transmogrif.ai