amidst/toolbox

A Java Toolbox for Scalable Probabilistic Machine Learning

License

Apache-2.0 license


AMIDST Toolbox (http://www.amidsttoolbox.com)

v.0.7.2

Description

Probabilistic Machine Learning

The AMIDST Toolbox allows you to model your problem using a flexible probabilistic language based on graphical models. You then fit your model to data using a Bayesian approach that handles modeling uncertainty.
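To make the idea of a Bayesian fit concrete, here is a minimal sketch in plain Java that does not use the AMIDST API. It assumes the simplest possible model, a single binary variable with a Beta prior, and shows how observed data turns the prior into a posterior whose variance quantifies the remaining uncertainty. The class and method names are hypothetical and chosen for illustration only.

```java
/** Beta-Bernoulli conjugate update: a hypothetical stand-in, not AMIDST code. */
final class BetaBernoulliSketch {

    /** Returns {alpha, beta} of the Beta posterior after observing 0/1 outcomes. */
    static double[] posterior(double alpha, double beta, int[] outcomes) {
        for (int x : outcomes) {
            if (x == 1) {
                alpha += 1.0;   // one more observed success
            } else {
                beta += 1.0;    // one more observed failure
            }
        }
        return new double[] {alpha, beta};
    }

    /** Posterior mean of the success probability. */
    static double mean(double[] ab) {
        return ab[0] / (ab[0] + ab[1]);
    }

    /** Posterior variance: it shrinks as more data is observed. */
    static double variance(double[] ab) {
        double n = ab[0] + ab[1];
        return ab[0] * ab[1] / (n * n * (n + 1.0));
    }

    public static void main(String[] args) {
        // Uniform prior Beta(1, 1), then four observations.
        double[] post = posterior(1.0, 1.0, new int[] {1, 1, 0, 1});
        System.out.println("posterior mean     = " + mean(post));
        System.out.println("posterior variance = " + variance(post));
    }
}
```

The point of the sketch is that the result of learning is a distribution over parameters rather than a single value, which is what lets a Bayesian approach express how sure it is about the model.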

Multi-core and distributed processing

AMIDST provides tailored parallel (powered by Java 8 Streams) and distributed (powered by Flink or Spark) implementations of Bayesian parameter learning for batch and streaming data. This processing is based on flexible and scalable message passing algorithms.
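As an illustration of how Java 8 Streams can parallelize parameter learning, the following sketch computes the sufficient statistics of a Gaussian (count, sum, sum of squares) with a parallel stream and a mergeable accumulator. It is a plain-Java example under our own assumptions, not the AMIDST implementation; `ParallelStatsSketch` and `Stats` are hypothetical names.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** Parallel sufficient statistics with Java 8 Streams (illustrative only). */
final class ParallelStatsSketch {

    /** Mutable accumulator holding count, sum and sum of squares. */
    static final class Stats {
        long n;
        double sum;
        double sumSq;

        void accept(double x) {
            n++;
            sum += x;
            sumSq += x * x;
        }

        /** Merges the statistics computed by another worker thread. */
        void combine(Stats other) {
            n += other.n;
            sum += other.sum;
            sumSq += other.sumSq;
        }

        double mean() {
            return sum / n;
        }

        /** Maximum likelihood (biased) variance estimate. */
        double variance() {
            double m = mean();
            return sumSq / n - m * m;
        }
    }

    /** Each thread fills its own Stats; the partial results are then combined. */
    static Stats fit(double[] data) {
        return Arrays.stream(data)
                .parallel()
                .collect(Stats::new, Stats::accept, Stats::combine);
    }

    public static void main(String[] args) {
        double[] data = IntStream.rangeClosed(1, 1000).asDoubleStream().toArray();
        Stats s = fit(data);
        System.out.println("mean = " + s.mean() + ", variance = " + s.variance());
    }
}
```

Because the accumulator only needs an associative `combine` step, the same pattern carries over from threads on one machine to workers in a cluster.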

# Features

Probabilistic Graphical Models: Specify your model using probabilistic graphical models with latent variables and temporal dependencies. AMIDST contains a large list of predefined latent variable models.

Scalable inference: Perform inference on your probabilistic models with powerful approximate and scalable algorithms.
Data Streams: Update your models when new data is available. This makes our toolbox appropriate for learning from (massive) data streams.
Large-scale Data: Use your defined models to process massive data sets in a distributed computer cluster using Apache Flink or (soon) Apache Spark.
Extensible: Code your models or algorithms within AMIDST and expand the toolbox functionalities. This makes it a flexible toolbox for researchers performing their experimentation in machine learning.
Interoperability: Leverage existing functionalities and algorithms by interfacing to other software tools such as Hugin, MOA, Weka, R, etc.
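The "Data Streams" feature relies on the fact that a Bayesian posterior can serve as the prior for the next batch. The sketch below shows this for one deliberately simple case: a Gaussian with known noise variance and an unknown mean. It is written in plain Java under those assumptions and does not call the AMIDST API; all names are hypothetical.

```java
/** Batch-by-batch Bayesian update of a Gaussian mean (illustrative only). */
final class StreamingGaussianMeanSketch {
    private double mean;                 // current posterior mean of the unknown mean
    private double precision;            // current posterior precision (1 / variance)
    private final double noiseVariance;  // assumed known observation noise

    StreamingGaussianMeanSketch(double priorMean, double priorVariance, double noiseVariance) {
        this.mean = priorMean;
        this.precision = 1.0 / priorVariance;
        this.noiseVariance = noiseVariance;
    }

    /** Folds a new batch into the posterior; old data is never revisited. */
    void update(double[] batch) {
        double sum = 0.0;
        for (double x : batch) {
            sum += x;
        }
        double newPrecision = precision + batch.length / noiseVariance;
        mean = (precision * mean + sum / noiseVariance) / newPrecision;
        precision = newPrecision;
    }

    double getMean() {
        return mean;
    }

    double getVariance() {
        return 1.0 / precision;
    }

    public static void main(String[] args) {
        StreamingGaussianMeanSketch model = new StreamingGaussianMeanSketch(0.0, 100.0, 1.0);
        model.update(new double[] {1.0, 2.0});  // first batch arrives
        model.update(new double[] {3.0});       // a later batch arrives
        System.out.println("mean = " + model.getMean() + ", variance = " + model.getVariance());
    }
}
```

Updating in two batches gives the same posterior as processing all three points at once, which is why this style of learning suits data that arrives continuously.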

# Simple Code Example

Fitting a model with local data

    // Load the data
    String filename = "./data.arff";
    DataStream<DataInstance> data = DataStreamLoader.open(filename);

    // Learn the model
    Model model = new CustomGaussianMixture(data.getAttributes());
    model.updateModel(data);
    System.out.println(model.getModel());

    // Save with .bn format
    BayesianNetworkWriter.save(model.getModel(), "./example.bn");
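For readers who want to see what fitting a Gaussian mixture involves, here is a self-contained sketch for one-dimensional data. Note the swap in technique: AMIDST learns through its own Bayesian (variational) engine, whereas this example uses plain expectation-maximization because it fits in a few lines. It is not numerically hardened (no log-space arithmetic), and the class name and initial values are hypothetical.

```java
import java.util.Arrays;

/** Plain EM for a 1-D Gaussian mixture (illustrative, not the AMIDST algorithm). */
final class GaussianMixtureEmSketch {
    final double[] weight;
    final double[] mean;
    final double[] var;

    GaussianMixtureEmSketch(double[] initialMeans) {
        int k = initialMeans.length;
        mean = initialMeans.clone();
        weight = new double[k];
        var = new double[k];
        Arrays.fill(weight, 1.0 / k);  // start with equal mixing weights
        Arrays.fill(var, 1.0);         // and unit variances
    }

    static double density(double x, double m, double v) {
        double d = x - m;
        return Math.exp(-0.5 * d * d / v) / Math.sqrt(2.0 * Math.PI * v);
    }

    void fit(double[] data, int iterations) {
        int k = mean.length;
        double[][] resp = new double[data.length][k];
        for (int it = 0; it < iterations; it++) {
            // E-step: responsibility of each component for each point.
            for (int i = 0; i < data.length; i++) {
                double total = 0.0;
                for (int c = 0; c < k; c++) {
                    resp[i][c] = weight[c] * density(data[i], mean[c], var[c]);
                    total += resp[i][c];
                }
                for (int c = 0; c < k; c++) {
                    resp[i][c] /= total;
                }
            }
            // M-step: re-estimate weight, mean and variance per component.
            for (int c = 0; c < k; c++) {
                double nc = 0.0;
                double sum = 0.0;
                for (int i = 0; i < data.length; i++) {
                    nc += resp[i][c];
                    sum += resp[i][c] * data[i];
                }
                mean[c] = sum / nc;
                double sq = 0.0;
                for (int i = 0; i < data.length; i++) {
                    double d = data[i] - mean[c];
                    sq += resp[i][c] * d * d;
                }
                var[c] = Math.max(sq / nc, 1e-6);  // floor avoids a collapsed component
                weight[c] = nc / data.length;
            }
        }
    }

    public static void main(String[] args) {
        double[] data = {-0.5, -0.2, 0.0, 0.3, 0.4, 9.5, 9.8, 10.0, 10.1, 10.6};
        GaussianMixtureEmSketch gmm = new GaussianMixtureEmSketch(new double[] {1.0, 8.0});
        gmm.fit(data, 20);
        System.out.println("means   = " + Arrays.toString(gmm.mean));
        System.out.println("weights = " + Arrays.toString(gmm.weight));
    }
}
```

Unlike this point-estimate sketch, a Bayesian treatment also keeps track of uncertainty about the mixture parameters.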

Fitting a model with distributed data

    // Load the data
    String filename = "hdfs://dataDistributed.arff";
    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataFlink<DataInstance> data = DataFlinkLoader.loadDataFromFolder(env, filename, false);

    // Learn the model
    Model model = new CustomGaussianMixture(data.getAttributes());
    model.updateModel(data);
    System.out.println(model.getModel());

    // Save with .bn format
    BayesianNetworkWriter.save(model.getModel(), "./example.bn");

# Real-World Use Cases

Risk prediction in credit operations

AMIDST Toolbox has been used to track concept drift and do risk prediction in credit operations. As data is collected continuously and reported on a daily basis, this gives rise to a streaming data classification problem. This work has been performed in collaboration with one of our partners, the Spanish bank BCC. It is expected to go into production at the beginning of 2017.

Recognition of traffic maneuvers

AMIDST Toolbox has been used to prototype models for early recognition of traffic maneuver intentions. Similarly to the previous case, data is continuously collected by on-board car sensors, giving rise to a large and quickly evolving data stream. This work has been performed in collaboration with one of our partners, DAIMLER.

Documentation

Getting Started! explains how to install the AMIDST toolbox, how this toolbox makes use of the new functional-style programming features of Java 8, and why it is based on a module-based architecture.
Toolbox Functionalities describes the main functionalities (i.e., data streams, BNs, DBNs, static and dynamic learning and inference engines, etc.) of the AMIDST toolbox.

Bayesian networks: Code Examples includes a list of source code examples explaining how to use some functionalities of the AMIDST toolbox.

Dynamic Bayesian networks: Code Examples includes some source code examples of functionalities related to Dynamic Bayesian networks.

FlinkLink: Code Examples includes some source code examples of functionalities related to the module that integrates Apache Flink with AMIDST.

SparkLink includes some source code examples of functionalities related to the module that integrates Apache Spark with AMIDST.
API JavaDoc of the AMIDST toolbox.

Scalability

Multi-Core Scalability using Java 8 Streams

Scalability is a main concern for the AMIDST toolbox. Java 8 streams are used to provide parallel implementations of our learning algorithms. If more computation capacity is needed to process data, AMIDST users can use more CPU cores. As an example, the following figure shows how the data processing capacity of our toolbox increases with the number of CPU cores when learning a probabilistic model (including a class variable C, two latent variables (dashed nodes), and multinomial (blue nodes) and Gaussian (green nodes) observable variables) using AMIDST's learning engine. As can be seen, using our variational learning engine, the AMIDST toolbox is able to process data in the order of gigabytes (GB) per hour, depending on the number of available CPU cores, with large and complex PGMs with latent variables. Note that these experiments were carried out on an Ubuntu Linux server with an x86_64 architecture and 32 cores. The size of the processed data set was measured according to Weka's ARFF format.

Distributed Scalability using Apache Flink

If your data is really big and cannot be stored on a single laptop, you can also learn your probabilistic model on it by using the AMIDST distributed learning engine, which is based on a novel and state-of-the-art distributed message passing scheme implemented on top of Apache Flink. As detailed in this paper, we were able to perform inference in a billion-node (i.e. 10^9) probabilistic model in an Amazon cluster with 2, 4, 8 and 16 nodes, each node containing 8 processing units. The following figure shows the scalability of our approach under these settings.

Spark Link Module on AMIDST

This module integrates the functionality of the AMIDST toolbox with the Apache Spark platform.

The following functionality is already implemented in the sparklink module:

Data Sources integration: Reading and writing data from SparkSQL on AMIDST
Distributed Sampling of Bayesian Networks
Parametric learning from distributed data (Maximum Likelihood)
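The last item, maximum likelihood learning from distributed data, follows a simple pattern: each partition computes local counts, the counts are merged, and the estimate is read off the merged totals. The sketch below mimics that pattern in plain Java for a single multinomial variable, with in-memory arrays standing in for cluster partitions. It does not use the Spark or sparklink APIs, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Maximum likelihood from partitioned data via mergeable counts (illustrative). */
final class PartitionedMaximumLikelihoodSketch {

    /** Local step: counts how often each state occurs within one partition. */
    static long[] localCounts(int[] partition, int numStates) {
        long[] counts = new long[numStates];
        for (int state : partition) {
            counts[state]++;
        }
        return counts;
    }

    /** Reduce step: adds two count vectors without mutating either input. */
    static long[] merge(long[] a, long[] b) {
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
        return out;
    }

    /** Maximum likelihood estimate: relative frequencies of the merged counts. */
    static double[] estimate(List<int[]> partitions, int numStates) {
        long[] total = partitions.parallelStream()
                .map(p -> localCounts(p, numStates))
                .reduce(new long[numStates], PartitionedMaximumLikelihoodSketch::merge);
        long n = Arrays.stream(total).sum();
        if (n == 0) {
            throw new IllegalArgumentException("no observations in any partition");
        }
        double[] probs = new double[numStates];
        for (int i = 0; i < numStates; i++) {
            probs[i] = (double) total[i] / n;
        }
        return probs;
    }

    public static void main(String[] args) {
        List<int[]> partitions = Arrays.asList(
                new int[] {0, 1, 1},
                new int[] {2, 1},
                new int[] {0});
        System.out.println(Arrays.toString(estimate(partitions, 3)));
    }
}
```

Only the small count vectors travel between workers, never the raw data, which is what makes this kind of learning practical on a cluster.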

More information here

Publications & Use-Cases

The following repository, https://github.com/amidst/toolbox-usecases, contains the source code and details about the publications and use-cases using the AMIDST toolbox.

Upcoming Developments

The AMIDST toolbox is an expanding project, and upcoming developments include, for instance, the ongoing integration of the toolbox in Spark to enlarge its scalability capacities. In addition, a new link to R is still in progress, which will expand the AMIDST user base.

Contributing to AMIDST

AMIDST is an open source toolbox and the end-users are encouraged to upload their contributions (which may include basic contributions, major extensions, and/or use-cases) following the indications given in this link.

# Acknowledgements and License

This software was developed as part of the AMIDST project. AMIDST has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 619209.

This software is distributed under the Apache License, Version 2.0.

About

A Java Toolbox for Scalable Probabilistic Machine Learning

www.amidsttoolbox.com

Releases: 19 (latest: v0.7.2, Sep 4, 2018)
