probcomp/PClean


License

Apache-2.0 license

PClean

PClean: A Domain-Specific Probabilistic Programming Language for Bayesian Data Cleaning

Warning: This is a rapidly evolving research prototype.

PClean was created at the MIT Probabilistic Computing Project.

If you use PClean in your research, please cite our 2021 AISTATS paper:

PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming. Lew, A. K.; Agrawal, M.; Sontag, D.; and Mansinghka, V. K. (2021, March). In International Conference on Artificial Intelligence and Statistics (pp. 1927-1935). PMLR.

Using PClean

To use PClean, create a Julia file with the following structure:

```julia
using PClean
using DataFrames: DataFrame
import CSV

# Load data
data = CSV.File(filepath) |> DataFrame

# Define PClean model
PClean.@model MyModel begin
    @class ClassName1 begin
        ...
    end

    ...

    @class ClassNameN begin
        ...
    end
end

# Align column names of CSV with variables in the model.
# Format is ColumnName CleanVariable DirtyVariable, or, if
# there is no corruption for a certain variable, one can omit
# the DirtyVariable.
query = @query MyModel.ClassNameN [
  HospitalName hosp.name             observed_hosp_name
  Condition    metric.condition.desc observed_condition
  ...
]

# Configure observed dataset
observations = [ObservedDataset(query, data)]

# Configuration
config = PClean.InferenceConfig(1, 2; use_mh_instead_of_pg=true)

# SMC initialization
state = initialize_trace(observations, config)

# Rejuvenation sweeps
run_inference!(state, config)

# Evaluate accuracy, if ground truth is available
# (here, filepath should point to the ground-truth CSV)
ground_truth = CSV.File(filepath) |> DataFrame
results = evaluate_accuracy(data, ground_truth, state, query)

# Can print results.f1, results.precision, results.accuracy, etc.
println(results)

# Even without ground truth, can save the entire latent database to CSV files:
PClean.save_results(dir, dataset_name, state, observations)
```

Then, from this directory, run the Julia file.

JULIA_PROJECT=. julia my_file.jl

To learn how to write a PClean model, see our paper, but note the surface syntax changes described below.

Differences from the paper

As a DSL embedded into Julia, our implementation of the PClean language differs in surface syntax from the stand-alone syntax presented in our paper:

(1) Instead of `latent class C ... end`, we write `@class C begin ... end`.

(2) Instead of `subproblem begin ... end`, inference hints are given using ordinary Julia `begin ... end` blocks.

(3) Instead of `parameter x ~ d(...)`, we use `@learned x :: D{...}`. The set of distributions `D` for parameters is somewhat restricted.

(4) Instead of `x ~ d(...) preferring E`, we write `x ~ d(..., E)`.

(5) Instead of `observe x as y, ... from C`, write `@query ModelName.C [x y; ...]`. Clauses of the form `x z y` are also allowed, and tell PClean that the model variable `C.z` represents a clean version of `x`, whose observed (dirty) version is modeled as `C.y`. This is used when automatically reconstructing a clean, flat dataset.
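The clause list in item (5) is ordinary Julia bracket syntax, which is what lets a macro such as `@query` read it. The sketch below uses only plain Julia (no PClean) to show how two-entry and three-entry clauses appear in the parsed expression. It is not PClean's implementation, and the `State hosp.state` clause is a hypothetical addition for illustration.

```julia
# Plain Julia, no PClean required: inspect how the bracket syntax parses.
# The first two clauses reuse the names from the usage example; the third
# (State / hosp.state) is a hypothetical clause with no dirty variable.
clauses = :([
    HospitalName hosp.name             observed_hosp_name
    Condition    metric.condition.desc observed_condition
    State        hosp.state
])

# A multi-row bracket literal parses to a :vcat of :row expressions.
rows = clauses.args

for row in rows
    column = row.args[1]
    if length(row.args) == 3
        # Form `x z y`: column, clean variable, dirty variable.
        println(column, ": clean = ", row.args[2], ", dirty = ", row.args[3])
    else
        # Form `x y`: column and the model variable it is observed as.
        println(column, ": modeled by ", row.args[2])
    end
end
```

Because the parser does not require rows of equal length, clauses with and without a dirty variable can be mixed in one list.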

The names of built-in distributions may also be different, e.g. `AddTypos` instead of `typos`, and `ProportionsParameter` instead of `dirichlet`.
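As a rough illustration of differences (1) through (5), a small model in the embedded syntax might look like the sketch below. The model, class, and variable names are hypothetical. `StringPrior` and `ChooseProportionally`, along with their argument orders, are assumptions not confirmed by this README, so check them against the repository's `src` and `experiments` directories before relying on them.

```julia
using PClean

# Hypothetical candidate values; in practice these would likely be
# collected from the columns of the dirty table.
cities = ["Boston", "Cambridge", "Somerville"]
states = ["MA", "NH"]

PClean.@model CityModel begin
    # (1) `@class C begin ... end` replaces `latent class C ... end`.
    @class Place begin
        # (3) `@learned` replaces `parameter`; `ProportionsParameter`
        # is the embedded name for the paper's `dirichlet`.
        @learned state_proportions::ProportionsParameter
        # Assumed distribution name and argument order.
        state ~ ChooseProportionally(states, state_proportions)
        # (4) Preferred values are passed as a trailing argument instead
        # of `preferring`. `StringPrior` and its arguments are assumed.
        name ~ StringPrior(3, 30, cities)
    end

    @class Obs begin
        # (2) An ordinary `begin ... end` block serves as an inference hint.
        begin
            place ~ Place
            # `AddTypos` is the embedded name for the paper's `typos`.
            observed_city ~ AddTypos(place.name)
        end
    end
end

# (5) Column name, clean variable, dirty variable; the State clause
# omits the dirty variable because no corruption is modeled for it.
query = @query CityModel.Obs [
    City  place.name  observed_city
    State place.state
]
```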
