SDG is a specialized framework designed to generate high-quality structured tabular data.


🚀 Synthetic Data Generator

Switch Language: Simplified Chinese (简体中文) | Latest API Docs | Roadmap | Join WeChat Group

Colab Examples: LLM: Data Synthesis | LLM: Off-Table Inference | Billion-Level-Data supported CTGAN

The Synthetic Data Generator (SDG) is a specialized framework designed to generate high-quality structured tabular data.

Synthetic data does not contain any sensitive information, yet it retains the essential characteristics of the original data, making it exempt from privacy regulations such as GDPR and ADPPA.

High-quality synthetic data can be safely utilized across various domains including data sharing, model training and debugging, system development and testing, etc.

We are excited to have you here and look forward to your contributions. Get started with the project through this Contributing Overview Guide!

💥 News

Our current key achievements and timelines are as follows:

🔥 Nov 21, 2024: 1) Model Integration - We've integrated the GaussianCopula model into our Data Processor System. Check out the code example in this PR; 2) Synthetic Quality - We implemented automatic detection of relationships between data columns and allowed relationships to be specified manually, which improved the quality of synthetic data (Code Example); 3) Performance Enhancement - We significantly reduced the memory usage of GaussianCopula when handling discrete data, enabling training on thousands of categorical data entries with a 2C4G setup!

🔥 May 30, 2024: The Data Processor module was officially merged. This module will: 1) help SDG convert the format of some data columns (such as Datetime columns) before they are fed into the model (so that they are not treated as discrete types), and convert the model-generated data back into the original format; 2) perform more customized pre-processing and post-processing on various data types; 3) easily deal with problems such as null values in the original data; 4) support the plug-in system.
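To illustrate the datetime round trip described in item 1, here is a minimal pandas sketch. It is not the Data Processor's actual code: the helper names `datetime_to_numeric` and `numeric_to_datetime`, the `signup` column, and the choice of Unix seconds as the numeric encoding are all assumptions made for this example.

```python
import pandas as pd


def datetime_to_numeric(df, column, fmt="%Y-%m-%d"):
    """Return a copy of df with `column` encoded as integer Unix seconds.

    Hypothetical helper: a numeric encoding lets a model treat the column
    as continuous instead of as a set of discrete strings.
    """
    out = df.copy()
    parsed = pd.to_datetime(out[column], format=fmt)
    out[column] = (parsed - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
    return out


def numeric_to_datetime(df, column, fmt="%Y-%m-%d"):
    """Reverse step: turn Unix seconds back into formatted date strings."""
    out = df.copy()
    out[column] = pd.to_datetime(out[column], unit="s").dt.strftime(fmt)
    return out


# Toy data, invented for this example.
original = pd.DataFrame({"signup": ["2024-05-30", "2024-11-21"], "age": [31, 45]})
encoded = datetime_to_numeric(original, "signup")
restored = numeric_to_datetime(encoded, "signup")
print(encoded)
print(restored)
```

In a real pipeline the reverse step would be applied to the model's sampled output rather than to the encoded input, so that synthetic rows come back in the original date format.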

🔥 Feb 20, 2024: A single-table data synthesis model based on LLM was added. View the colab examples: LLM: Data Synthesis and LLM: Off-Table Feature Inference.

🔧 Feb 7, 2024: We improved sdgx.data_models.metadata to support metadata descriptions for single tables and multiple tables, multiple data types, and automatic data type inference. View the colab example: SDG Single-Table Metadata.
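As a rough idea of what automatic data type inference can look like, the sketch below maps pandas dtypes to coarse type labels. It does not use or reproduce `sdgx.data_models.metadata`; the function `infer_column_types`, the label names, and the toy table are assumptions for illustration only.

```python
import pandas as pd


def infer_column_types(df):
    """Map each column name to a coarse type label (hypothetical labels)."""
    types = {}
    for name in df.columns:
        col = df[name]
        if pd.api.types.is_bool_dtype(col):
            types[name] = "bool"
        elif pd.api.types.is_integer_dtype(col):
            types[name] = "int"
        elif pd.api.types.is_float_dtype(col):
            types[name] = "float"
        elif pd.api.types.is_datetime64_any_dtype(col):
            types[name] = "datetime"
        else:
            # Anything else (for example strings) is treated as discrete.
            types[name] = "discrete"
    return types


# Toy table, invented for this example.
df = pd.DataFrame(
    {
        "age": [39, 50],
        "fnlwgt": [77516.0, 83311.0],
        "workclass": ["State-gov", "Private"],
        "joined": pd.to_datetime(["2024-02-07", "2024-02-20"]),
    }
)
inferred = infer_column_types(df)
print(inferred)
```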

🔶 Dec 20, 2023: v0.1.0 released. It includes a CTGAN model that supports billion-level data processing. View our benchmark against SDV, where SDG achieved lower memory consumption and avoided crashing during training. For specific usage, view the colab example: Billion-Level-Data supported CTGAN.

🔆 Aug 10, 2023: First line of SDG code committed.

🎉 LLM-integrated synthetic data generation

LLMs have long been used to understand and generate various types of data. LLMs also have some capability in tabular data generation, as well as some abilities that traditional methods (GAN-based or statistical) cannot achieve.

Our sdgx.models.LLM.single_table.gpt.SingleTableGPTModel implements two new features:

Synthetic data generation without Data

No training data is required; synthetic data can be generated based on metadata alone. See our colab example.
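One way to picture metadata-only generation is as a prompt assembled from column names and types. The sketch below only shows that idea and is not how `SingleTableGPTModel` builds its prompts: `build_generation_prompt`, the prompt wording, and the column dictionary are hypothetical.

```python
def build_generation_prompt(columns, n_rows):
    """Assemble a text prompt asking an LLM for rows matching the given schema.

    `columns` maps column names to type labels. The wording is an assumption
    made for this example, not a prompt taken from SDG.
    """
    lines = [f"Generate {n_rows} rows of realistic tabular data as CSV.", "Columns:"]
    for name, dtype in columns.items():
        lines.append(f"- {name} ({dtype})")
    lines.append("Return only the CSV, with a header row.")
    return "\n".join(lines)


# Hypothetical metadata for a small table.
columns = {"age": "int", "workclass": "category", "education": "category"}
prompt = build_generation_prompt(columns, n_rows=5)
print(prompt)
```

The resulting string would then be sent to whichever LLM client is in use; that call is left out here because it depends on the provider.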

Off-Table feature inference

Infer new column data based on the existing data in the table and the knowledge the LLM has acquired. See our colab example.
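The off-table flow can be sketched as two steps: serialize the existing rows into a prompt, then attach the values the model returns as a new column. Everything below is an assumption for illustration (the helper names, the prompt wording, the `occupation` column, and the hard-coded stand-in reply); it does not call an LLM or reproduce SDG's implementation.

```python
import io

import pandas as pd


def build_inference_prompt(df, new_column):
    """Serialize the table and ask for one inferred value per row."""
    table = df.to_csv(index=False)
    return (
        f"Given the table below, infer a value of '{new_column}' for each row.\n"
        f"Return CSV with a single column named '{new_column}'.\n\n{table}"
    )


def attach_inferred_column(df, response_csv, new_column):
    """Parse a CSV reply and add it to a copy of df as a new column."""
    inferred = pd.read_csv(io.StringIO(response_csv))
    if len(inferred) != len(df):
        raise ValueError("response must contain one value per input row")
    out = df.copy()
    out[new_column] = inferred[new_column].to_numpy()
    return out


# Toy table, invented for this example.
df = pd.DataFrame({"education": ["Bachelors", "HS-grad"], "hoursperweek": [40, 35]})
prompt = build_inference_prompt(df, "occupation")

# Stand-in for a model reply; a real reply would come from an LLM call.
fake_response = "occupation\nEngineer\nClerk\n"
result = attach_inferred_column(df, fake_response, "occupation")
print(result)
```

Checking the row count before attaching the column guards against replies that drop or add rows, which is a common failure mode when parsing free-form model output.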

💫 Why SDG?

Technological advancements:
- Supports a wide range of statistical data synthesis algorithms; an LLM-based synthetic data generation model is also integrated;
- Optimized for big data, effectively reducing memory consumption;
- Continuously tracks the latest advances in academia and industry, and introduces support for excellent algorithms and models in a timely manner.

Privacy enhancements:
- SDG supports differential privacy, anonymization and other methods to enhance the security of synthetic data.

Easy to extend:
- Supports expansion of models, data processing, data connectors, etc. in the form of plug-in packages.
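For readers unfamiliar with the differential privacy mentioned under privacy enhancements, the sketch below shows the textbook Laplace mechanism applied to a counting query. It illustrates the general concept only and is not SDG's privacy code; `laplace_count`, the `ages` list, and the chosen `epsilon` are assumptions for this example.

```python
import numpy as np


def laplace_count(values, predicate, epsilon, rng):
    """Return a noisy count of the values that satisfy `predicate`.

    A counting query has sensitivity 1 (adding or removing one record changes
    the count by at most 1), so Laplace noise with scale 1 / epsilon gives
    epsilon-differential privacy for this single query.
    """
    true_count = sum(1 for v in values if predicate(v))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise


rng = np.random.default_rng(0)
ages = [23, 35, 41, 52, 67, 29]  # toy data, invented for this example
noisy = laplace_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng)
print(f"noisy count of ages >= 40: {noisy:.2f}")
```

A smaller `epsilon` means more noise and stronger privacy, while a larger `epsilon` keeps the answer closer to the true count of 3.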

🌀 Quick Start

Pre-built image

You can use pre-built images to quickly experience the latest features.

docker pull idsteam/sdgx:latest

Install from PyPI

pip install sdgx

Local Install (Recommended)

Use SDG by installing it through the source code.

git clone git@github.com:hitsz-ids/synthetic-data-generator.git
cd synthetic-data-generator
pip install .
# Or install from git
pip install git+https://github.com/hitsz-ids/synthetic-data-generator.git
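After any of the install routes above, a quick check that the package is visible to the current Python environment can save debugging time. This small sketch uses only the standard library; the helper name `installed_version` is made up for the example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("sdgx")
print(f"sdgx version: {version}" if version else "sdgx is not installed")
```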

Quick Demo of Single Table Data Generation and Metric

Demo code

from sdgx.data_connectors.csv_connector import CsvConnector
from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
from sdgx.synthesizer import Synthesizer
from sdgx.utils import download_demo_data

# This will download demo data to ./dataset
dataset_csv = download_demo_data()

# Create data connector for csv file
data_connector = CsvConnector(path=dataset_csv)

# Initialize synthesizer, use CTGAN model
synthesizer = Synthesizer(
    model=CTGANSynthesizerModel(epochs=1),  # For quick demo
    data_connector=data_connector,
)

# Fit the model
synthesizer.fit()

# Sample
sampled_data = synthesizer.sample(1000)
print(sampled_data)

Comparison

Real data are as follows:

>>> data_connector.read()
       age         workclass  fnlwgt  education  ...  capitalloss  hoursperweek  native-country  class
0        2         State-gov   77516  Bachelors  ...            0             2   United-States  <=50K
1        3  Self-emp-not-inc   83311  Bachelors  ...            0             0   United-States  <=50K
2        2           Private  215646    HS-grad  ...            0             2   United-States  <=50K
3        3           Private  234721       11th  ...            0             2   United-States  <=50K
4        1           Private  338409  Bachelors  ...            0             2            Cuba  <=50K
...    ...               ...     ...        ...  ...          ...           ...             ...    ...
48837    2           Private  215419  Bachelors  ...            0             2   United-States  <=50K
48838    4               NaN  321403    HS-grad  ...            0             2   United-States  <=50K
48839    2           Private  374983  Bachelors  ...            0             3   United-States  <=50K
48840    2           Private   83891  Bachelors  ...            0             2   United-States  <=50K
48841    1      Self-emp-inc  182148  Bachelors  ...            0             3   United-States   >50K

[48842 rows x 15 columns]

Synthetic data are as follows:

>>> sampled_data
     age  workclass  fnlwgt     education  ...  capitalloss  hoursperweek  native-country  class
0      1        NaN   28219  Some-college  ...            0             2     Puerto-Rico  <=50K
1      2    Private  250166       HS-grad  ...            0             2   United-States   >50K
2      2    Private   50304       HS-grad  ...            0             2   United-States  <=50K
3      4    Private   89318     Bachelors  ...            0             2     Puerto-Rico   >50K
4      1    Private  172149     Bachelors  ...            0             3   United-States  <=50K
..   ...        ...     ...           ...  ...          ...           ...             ...    ...
995    2        NaN  208938     Bachelors  ...            0             1   United-States  <=50K
996    2    Private  166416     Bachelors  ...            2             2   United-States  <=50K
997    2        NaN  336022       HS-grad  ...            0             1   United-States  <=50K
998    3    Private  198051       Masters  ...            0             2   United-States   >50K
999    1        NaN   41973       HS-grad  ...            0             2   United-States  <=50K

[1000 rows x 15 columns]
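Beyond eyeballing the two printouts, a simple numeric check is to compare the category frequencies of a column in the real and synthetic tables. The sketch below computes the total variation distance with pandas. It is an illustrative stand-in, not SDG's built-in metric API; the function name and the toy `workclass` values are assumptions.

```python
import pandas as pd


def total_variation_distance(real, synthetic):
    """Total variation distance between two categorical columns.

    0.0 means identical category frequencies, 1.0 means no overlap.
    Missing values are counted as their own "<missing>" category.
    """
    p = real.fillna("<missing>").value_counts(normalize=True)
    q = synthetic.fillna("<missing>").value_counts(normalize=True)
    categories = p.index.union(q.index)
    p = p.reindex(categories, fill_value=0.0)
    q = q.reindex(categories, fill_value=0.0)
    return 0.5 * float((p - q).abs().sum())


# Toy columns standing in for data_connector.read() and sampled_data.
real_col = pd.Series(["Private", "Private", "State-gov", "Private"])
synth_col = pd.Series(["Private", "State-gov", "State-gov", "Self-emp-inc"])

tvd = total_variation_distance(real_col, synth_col)
print(f"TVD for workclass: {tvd:.2f}")
```

With the demo above, the same function could be applied column by column, for example to `workclass` or `education`, to see which categorical columns drift most after a short training run such as `epochs=1`.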

👩‍🎓 Related Work

🤝 Join Community

The SDG project was initiated by the Institute of Data Security, Harbin Institute of Technology. If you are interested in our project, you are welcome to join our community. We welcome organizations, teams, and individuals who share our commitment to data protection and security through open source:

- Read CONTRIBUTING before drafting a pull request.
- Submit an issue (see View Good First Issue) or submit a Pull Request.
- Join our WeChat Group through the QR code.

📄 License

The SDG open source project is released under the Apache-2.0 license; please refer to the LICENSE file.

