A domain-specific probabilistic programming language for scalable Bayesian data cleaning
probcomp/PClean
PClean: A Domain-Specific Probabilistic Programming Language for Bayesian Data Cleaning
Warning: This is a rapidly evolving research prototype.
PClean was created at the MIT Probabilistic Computing Project.
If you use PClean in your research, please cite our 2021 AISTATS paper:
PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming. Lew, A. K.; Agrawal, M.; Sontag, D.; and Mansinghka, V. K. (2021, March). In International Conference on Artificial Intelligence and Statistics (pp. 1927-1935). PMLR. (pdf)
To use PClean, create a Julia file with the following structure:
```julia
using PClean
using DataFrames: DataFrame
import CSV

# Load data
data = CSV.File(filepath) |> DataFrame

# Define PClean model
PClean.@model MyModel begin
    @class ClassName1 begin
        ...
    end

    ...

    @class ClassNameN begin
        ...
    end
end

# Align column names of CSV with variables in the model.
# Format is ColumnName CleanVariable DirtyVariable, or, if
# there is no corruption for a certain variable, one can omit
# the DirtyVariable.
query = @query MyModel.ClassNameN [
    HospitalName hosp.name observed_hosp_name
    Condition metric.condition.desc observed_condition
    ...
]

# Configure observed dataset
observations = [ObservedDataset(query, data)]

# Configuration
config = PClean.InferenceConfig(1, 2; use_mh_instead_of_pg=true)

# SMC initialization
state = initialize_trace(observations, config)

# Rejuvenation sweeps
run_inference!(state, config)

# Evaluate accuracy, if ground truth is available
# (DataFrame comes from DataFrames, not from CSV)
ground_truth = CSV.File(filepath) |> DataFrame
results = evaluate_accuracy(data, ground_truth, state, query)

# Can print results.f1, results.precision, results.accuracy, etc.
println(results)

# Even without ground truth, can save the entire latent database to CSV files:
PClean.save_results(dir, dataset_name, state, observations)
```
Then, from this directory, run the Julia file.
```shell
JULIA_PROJECT=. julia my_file.jl
```

To learn to write a PClean model, see our paper, but note the surface syntax changes described below.
Because PClean is implemented as a DSL embedded in Julia, its surface syntax differs in a few ways from the stand-alone syntax presented in our paper:
(1) Instead of `latent class C ... end`, we write `@class C begin ... end`.
(2) Instead of `subproblem begin ... end`, inference hints are given using ordinary Julia `begin ... end` blocks.
(3) Instead of `parameter x ~ d(...)`, we use `@learned x :: D{...}`. The set of distributions `D` for parameters is somewhat restricted.
(4) Instead of `x ~ d(...) preferring E`, we write `x ~ d(..., E)`.
(5) Instead of `observe x as y, ... from C`, write `@query ModelName.C [x y; ...]`. Clauses of the form `x z y` are also allowed, and tell PClean that the model variable `C.z` represents a clean version of `x`, whose observed (dirty) version is modeled as `C.y`. This is used when automatically reconstructing a clean, flat dataset.
The names of built-in distributions may also be different, e.g. `AddTypos` instead of `typos`, and `ProportionsParameter` instead of `dirichlet`.
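
To make the five differences above concrete, here is a minimal sketch of a small model written in the embedded syntax. It is an illustration, not code taken from this README. The model name `CountyModel`, the class and variable names, and the `states` and `counties` lists are all hypothetical. `AddTypos` and `ProportionsParameter` are named above, but `ChooseProportionally`, `StringPrior`, and the argument orders shown for all three distributions are assumptions that should be checked against the PClean source before use.

```julia
using PClean

# Hypothetical candidate values, for illustration only.
states = ["MA", "NY", "CA"]
counties = ["Middlesex", "Suffolk", "Kings", "Alameda"]

PClean.@model CountyModel begin
    # (1) `@class` replaces `latent class`.
    @class County begin
        # (3) `@learned` replaces `parameter`; the parameter's
        # distribution is written as a type annotation.
        @learned state_proportions::ProportionsParameter
        # Assumed distribution name and argument order.
        state ~ ChooseProportionally(states, state_proportions)
        # (4) Preferred values are passed as the final argument.
        # Assumed signature: (min_length, max_length, preferred_values).
        name ~ StringPrior(3, 35, counties)
    end

    @class Obs begin
        # (2) An ordinary `begin ... end` block acts as an inference hint.
        begin
            county ~ County
            # Assumed signature: (clean_string, max_typos).
            county_name ~ AddTypos(county.name, 2)
        end
    end
end

# (5) `@query` replaces `observe ... from`. The first row has no dirty
# variable; the second maps a column to a clean and a dirty variable.
query = @query CountyModel.Obs [
    State   county.state
    County  county.name  county_name
]
```

The resulting `query` would then be passed to `ObservedDataset(query, data)` as in the usage template, where `data` is expected to have `State` and `County` columns.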