A domain-specific probabilistic programming language for scalable Bayesian data cleaning

License

NotificationsYou must be signed in to change notification settings

probcomp/PClean

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Build Status

PClean: A Domain-Specific Probabilistic Programming Language for Bayesian Data Cleaning

Warning: This is a rapidly evolving research prototype.

PClean was created at the MIT Probabilistic Computing Project.

If you use PClean in your research, please cite our 2021 AISTATS paper:

PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming. Lew, A. K.; Agrawal, M.; Sontag, D.; and Mansinghka, V. K. (2021, March). In International Conference on Artificial Intelligence and Statistics (pp. 1927-1935). PMLR.

Using PClean

To use PClean, create a Julia file with the following structure:

```julia
using PClean
using DataFrames: DataFrame
import CSV

# Load data
data = CSV.File(filepath) |> DataFrame

# Define PClean model
PClean.@model MyModel begin
    @class ClassName1 begin
        ...
    end

    ...

    @class ClassNameN begin
        ...
    end
end

# Align column names of CSV with variables in the model.
# Format is ColumnName CleanVariable DirtyVariable, or, if
# there is no corruption for a certain variable, one can omit
# the DirtyVariable.
query = @query MyModel.ClassNameN [
  HospitalName hosp.name             observed_hosp_name
  Condition    metric.condition.desc observed_condition
  ...
]

# Configure observed dataset
observations = [ObservedDataset(query, data)]

# Configuration
config = PClean.InferenceConfig(1, 2; use_mh_instead_of_pg=true)

# SMC initialization
state = initialize_trace(observations, config)

# Rejuvenation sweeps
run_inference!(state, config)

# Evaluate accuracy, if ground truth is available
# (DataFrame is already imported from DataFrames above)
ground_truth = CSV.File(filepath) |> DataFrame
results = evaluate_accuracy(data, ground_truth, state, query)

# Can print results.f1, results.precision, results.accuracy, etc.
println(results)

# Even without ground truth, can save the entire latent database to CSV files:
PClean.save_results(dir, dataset_name, state, observations)
```

Then, from this directory, run the Julia file.

JULIA_PROJECT=. julia my_file.jl

To learn to write a PClean model, see our paper, but note the surface syntax changes described below.
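The script above prints fields such as `results.f1` and `results.precision`, but this README does not define them. As a rough guide only, the sketch below shows one common convention from the data-cleaning literature: precision is the fraction of changed cells that now match the ground truth, and recall is the fraction of erroneous cells that were correctly repaired. `repair_metrics` and the sample vectors are hypothetical, and PClean's `evaluate_accuracy` may compute its numbers differently.

```julia
# Hypothetical helper for illustration; this is NOT PClean's evaluate_accuracy.
# Each argument is a vector of cell values in the same order.
function repair_metrics(dirty, cleaned, truth)
    changed = cleaned .!= dirty                 # cells the cleaner modified
    errors  = dirty .!= truth                   # cells that were wrong to begin with
    correct = changed .& (cleaned .== truth)    # modifications that match the truth

    prec = count(correct) / count(changed)      # NaN if nothing was changed
    rec  = count(correct) / count(errors)       # NaN if there were no errors
    f1   = 2 * prec * rec / (prec + rec)
    return (precision = prec, recall = rec, f1 = f1)
end

# Made-up example: two errors, two changes, one correct repair.
dirty   = ["a", "bx", "c",  "d"]
truth   = ["a", "b",  "c",  "dd"]
cleaned = ["a", "b",  "cc", "d"]

m = repair_metrics(dirty, cleaned, truth)
println(m)
```

Under this convention, a cleaner that changes nothing has undefined precision, so comparisons are only meaningful when at least one cell was modified.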

Differences from the paper

As a DSL embedded into Julia, our implementation of the PClean language has some differences, in terms of surface syntax, from the stand-alone syntax presented in our paper:

(1) Instead of `latent class C ... end`, we write `@class C begin ... end`.

(2) Instead of `subproblem begin ... end`, inference hints are given using ordinary Julia `begin ... end` blocks.

(3) Instead of `parameter x ~ d(...)`, we use `@learned x :: D{...}`. The set of distributions `D` for parameters is somewhat restricted.

(4) Instead of `x ~ d(...) preferring E`, we write `x ~ d(..., E)`.

(5) Instead of `observe x as y, ... from C`, write `@query ModelName.C [x y; ...]`. Clauses of the form `x z y` are also allowed, and tell PClean that the model variable `C.z` represents a clean version of `x`, whose observed (dirty) version is modeled as `C.y`. This is used when automatically reconstructing a clean, flat dataset.
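To make the two clause shapes in item (5) concrete, here is a small sketch in plain Julia of how a bracketed clause list parses into rows of two or three entries. It does not use PClean and is not how `@query` is implemented; `describe_clauses` is a hypothetical helper, and the `State hosp.state` row is an invented example of a clause with no dirty variable.

```julia
# Hypothetical helper (not part of PClean): label the entries of each clause
# in a query-style bracket expression.
function describe_clauses(ex::Expr)
    # Julia parses a multi-row bracket as :vcat of :row expressions;
    # a single-row bracket parses as :hcat, so wrap it to treat it as one row.
    rows = ex.head == :vcat ? ex.args : [ex]
    map(rows) do row
        parts = string.(row.args)
        if length(parts) == 2
            # ColumnName ModelVariable: no corruption modeled for this column
            (column = parts[1], clean = parts[2], dirty = nothing)
        elseif length(parts) == 3
            # ColumnName CleanVariable DirtyVariable
            (column = parts[1], clean = parts[2], dirty = parts[3])
        else
            error("expected 2 or 3 entries per clause, got $(length(parts))")
        end
    end
end

# Quote the bracket so it is inspected as syntax, never evaluated.
clauses = describe_clauses(:([
    HospitalName hosp.name observed_hosp_name
    State        hosp.state
]))

for c in clauses
    println(c)
end
```

Inside brackets, a newline separates rows just as `;` does, which is why the multi-line layout in the usage example and the `[x y; ...]` form are interchangeable.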

The names of built-in distributions may also be different, e.g. `AddTypos` instead of `typos`, and `ProportionsParameter` instead of `dirichlet`.
