sdv-dev/SDV

Synthetic data generation for tabular data

This repository is part of The Synthetic Data Vault Project, a project from DataCebo.

Overview

The Synthetic Data Vault (SDV) is a Python library designed to be your one-stop shop for creating tabular synthetic data. The SDV uses a variety of machine learning algorithms to learn patterns from your real data and emulate them in synthetic data.

Features

🧠 Create synthetic data using machine learning. The SDV offers multiple models, ranging from classical statistical methods (GaussianCopula) to deep learning methods (CTGAN). Generate data for single tables, multiple connected tables or sequential tables.

📊 Evaluate and visualize data. Compare the synthetic data to the real data against a variety of measures. Diagnose problems and generate a quality report to get more insights.

🔄 Preprocess, anonymize and define constraints. Control data processing to improve the quality of synthetic data, choose from different types of anonymization and define business rules in the form of logical constraints.

Important Links
Tutorials	Get some hands-on experience with the SDV. Launch the tutorial notebooks and run the code yourself.
📖Docs	Learn how to use the SDV library with user guides and API references.
📙Blog	Get more insights about using the SDV, deploying models and our synthetic data community.
Community	Join our Slack workspace for announcements and discussions.
💻Website	Check out the SDV website for more information about the project.

Install

The SDV is publicly available under the Business Source License. Install SDV using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.

pip install sdv

conda install -c pytorch -c conda-forge sdv

Getting Started

Load a demo dataset to get started. This dataset is a single table describing guests staying at a fictional hotel.

from sdv.datasets.demo import download_demo
real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests'
)

The demo also includes metadata, a description of the dataset, including the data types in each column and the primary key (guest_email).
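To make the idea of metadata more concrete, here is a minimal sketch that models a table description as a plain Python dictionary and runs two sanity checks against it. The dictionary layout, the type labels, the sample rows and the helper functions `check_primary_key` and `undeclared_columns` are all assumptions made for illustration. This is not SDV's actual metadata format or API.

```python
# Hypothetical, simplified stand-in for table metadata (not SDV's real schema).
table_description = {
    "primary_key": "guest_email",
    "columns": {
        "guest_email": "email",        # assumed type label
        "room_type": "categorical",    # assumed type label
        "room_rate": "numerical",      # assumed type label
    },
}

# Made-up rows, used only to exercise the checks below.
rows = [
    {"guest_email": "a@example.com", "room_type": "BASIC", "room_rate": 120.0},
    {"guest_email": "b@example.com", "room_type": "SUITE", "room_rate": 310.5},
]


def check_primary_key(rows, description):
    """Return True if every row has a distinct, non-missing primary key."""
    key = description["primary_key"]
    values = [row.get(key) for row in rows]
    return None not in values and len(set(values)) == len(values)


def undeclared_columns(rows, description):
    """Return sorted column names found in the rows but missing from the description."""
    declared = set(description["columns"])
    seen = {name for row in rows for name in row}
    return sorted(seen - declared)


print(check_primary_key(rows, table_description))
print(undeclared_columns(rows, table_description))
```

Keeping the description separate from the data is what lets a tool treat a column such as the primary key differently from an ordinary numerical or categorical column.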

Synthesizing Data

Next, we can create an SDV synthesizer, an object that you can use to create synthetic data. It learns patterns from the real data and replicates them to generate synthetic data. Let's use the GaussianCopulaSynthesizer.

from sdv.single_table import GaussianCopulaSynthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=real_data)

And now the synthesizer is ready to create synthetic data!

synthetic_data = synthesizer.sample(num_rows=500)
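For intuition about what "learning patterns and replicating them" can mean, the sketch below walks through the general Gaussian copula idea on two numeric columns using only the standard library: map each column to normal scores, measure their correlation, draw correlated normal samples, and map them back to observed values. It is a toy illustration under stated assumptions, not how GaussianCopulaSynthesizer is implemented. The column values are made up, and `fit_and_sample` and the other helper names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def to_normal_scores(values):
    """Map each value to a standard-normal score via its rank (empirical CDF)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly between 0 and 1.
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def empirical_quantile(sorted_values, p):
    """Return the observed value at probability p (simple rank lookup)."""
    index = min(int(p * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]


def fit_and_sample(xs, ys, num_rows, seed=0):
    """Fit a two-column toy copula and draw num_rows synthetic pairs."""
    rho = pearson(to_normal_scores(xs), to_normal_scores(ys))
    sorted_x, sorted_y = sorted(xs), sorted(ys)
    rng = random.Random(seed)
    samples = []
    for _ in range(num_rows):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second normal draw whose correlation with z1 is rho.
        noise = math.sqrt(max(0.0, 1.0 - rho ** 2))
        z2 = rho * z1 + noise * rng.gauss(0.0, 1.0)
        samples.append((
            empirical_quantile(sorted_x, STD_NORMAL.cdf(z1)),
            empirical_quantile(sorted_y, STD_NORMAL.cdf(z2)),
        ))
    return rho, samples


# Made-up values standing in for two numeric columns.
room_rate = [80, 95, 110, 120, 150, 180, 210, 260]
amenities_fee = [5, 8, 7, 12, 15, 14, 22, 30]

rho, synthetic = fit_and_sample(room_rate, amenities_fee, num_rows=500)
print(f"learned correlation of normal scores: {rho:.2f}")
print(synthetic[:3])
```

Because this toy version samples back from the observed values, it can only reproduce values it has already seen. A real synthesizer fits distributions to each column instead, which is one reason to rely on the library rather than a sketch like this.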

The synthetic data will have the following properties:

Sensitive columns are fully anonymized. The email, billing address and credit card number columns contain new data so you don't expose the real values.
Other columns follow statistical patterns. For example, the proportion of room types, the distribution of check-in dates and the correlations between room rate and room type are preserved.
Keys and other relationships are intact. The primary key (guest email) is unique for each row. If you have multiple tables, the connection between primary and foreign keys makes sense.
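The three properties above can be spot-checked with a few lines of plain Python. The sketch below works on lists of dictionaries rather than on SDV objects. The rows are made up, and the helpers `primary_key_is_unique`, `leaked_values` and `category_proportions` are hypothetical names introduced for illustration, not part of the SDV API.

```python
from collections import Counter


def primary_key_is_unique(rows, key):
    """Return True if no two rows share the same key value."""
    values = [row[key] for row in rows]
    return len(values) == len(set(values))


def leaked_values(real_rows, synthetic_rows, column):
    """Return sorted real values of a column that reappear in the synthetic rows."""
    real = {row[column] for row in real_rows}
    synthetic = {row[column] for row in synthetic_rows}
    return sorted(real & synthetic)


def category_proportions(rows, column):
    """Return the share of rows that fall into each category of a column."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


# Made-up rows for illustration only.
real_rows = [
    {"guest_email": "ana@example.com", "room_type": "BASIC"},
    {"guest_email": "bo@example.com", "room_type": "BASIC"},
    {"guest_email": "cy@example.com", "room_type": "SUITE"},
    {"guest_email": "di@example.com", "room_type": "BASIC"},
]
synthetic_rows = [
    {"guest_email": "x1@example.org", "room_type": "BASIC"},
    {"guest_email": "x2@example.org", "room_type": "SUITE"},
    {"guest_email": "x3@example.org", "room_type": "BASIC"},
    {"guest_email": "x4@example.org", "room_type": "BASIC"},
]

print(primary_key_is_unique(synthetic_rows, "guest_email"))
print(leaked_values(real_rows, synthetic_rows, "guest_email"))
print(category_proportions(synthetic_rows, "room_type"))
```

Checks like these are a quick sanity pass. They complement a fuller quality report and do not replace it.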

Evaluating Synthetic Data

The SDV library allows you to evaluate the synthetic data by comparing it to the real data. Get started by generating a quality report.

from sdv.evaluation.single_table import evaluate_quality
quality_report = evaluate_quality(
    real_data,
    synthetic_data,
    metadata
)

Generating report ...
(1/2) Evaluating Column Shapes: |████████████████| 9/9 [00:00<00:00, 1133.09it/s]|
Column Shapes Score: 89.11%
(2/2) Evaluating Column Pair Trends: |██████████████████████████████████████████| 36/36 [00:00<00:00, 502.88it/s]|
Column Pair Trends Score: 88.3%
Overall Score (Average): 88.7%

This object computes an overall quality score on a scale of 0 to 100% (100 being the best) as well as detailed breakdowns. For more insights, you can also visualize the synthetic vs. real data.

from sdv.evaluation.single_table import get_column_plot
fig = get_column_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    column_name='amenities_fee',
    metadata=metadata
)
fig.show()
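The report output above labels the overall score as an average. As a quick worked check, the sketch below averages the two property scores printed in that output and confirms they are consistent with the reported overall value. The function name `overall_score` is hypothetical, and the assumption that the overall score is a plain arithmetic mean comes only from the "(Average)" label in the output.

```python
def overall_score(property_scores):
    """Return the arithmetic mean of the given percentage scores."""
    return sum(property_scores) / len(property_scores)


# Scores copied from the report output shown above.
column_shapes = 89.11
column_pair_trends = 88.3

overall = overall_score([column_shapes, column_pair_trends])
print(f"{overall:.1f}%")  # prints 88.7%
```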

What's Next?

Using the SDV library, you can synthesize single table, multi table and sequential data. You can also customize the full synthetic data workflow, including preprocessing, anonymization and adding constraints.

To learn more, visit the SDV Demo page.

Credits

Thank you to our team of contributors who have built and maintained the SDV ecosystem over the years!

View Contributors

Citation

If you use SDV for your research, please cite the following paper:

Neha Patki, Roy Wedge, Kalyan Veeramachaneni. The Synthetic Data Vault. IEEE DSAA 2016.

@inproceedings{
    SDV,
    title={The Synthetic data vault},
    author={Patki, Neha and Wedge, Roy and Veeramachaneni, Kalyan},
    booktitle={IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
    year={2016},
    pages={399-410},
    doi={10.1109/DSAA.2016.49},
    month={Oct}
}

The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi table and time series data.
📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.