awslabs/python-deequ


Python API for Deequ


PyDeequ

PyDeequ is a Python API for Deequ, a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. PyDeequ is written to support usage of Deequ in Python.

Deequ has four main components:

Metrics Computation:
- Profiles leverage Analyzers to analyze each column of a dataset.
- Analyzers serve as a foundational module that computes metrics for data profiling and validation at scale.
Constraint Suggestion:
- Specify rules for various groups of Analyzers to be run over a dataset to return a collection of constraints suggested to run in a Verification Suite.
Constraint Verification:
- Perform data validation on a dataset with respect to various constraints set by you.
Metrics Repository:
- Allows for persistence and tracking of Deequ runs over time.

🎉 Announcements 🎉

- NEW!!! The 1.4.0 release of Python Deequ has been published to PyPI: https://pypi.org/project/pydeequ/. This release adds support for Spark 3.5.0.
- The latest version of Deequ, 2.0.7, is made available with Python Deequ 1.3.0.
- The 1.1.0 release of Python Deequ has been published to PyPI: https://pypi.org/project/pydeequ/. This release brings many recent upgrades, including support up to Spark 3.3.0! Any feedback is welcome through GitHub issues.
- With PyDeequ v0.1.8+, we now officially support Spark 3! Just make sure you have an environment variable SPARK_VERSION to specify your Spark version!
- We've released a blog post on integrating PyDeequ with AWS, leveraging services such as AWS Glue, Athena, and SageMaker! Check it out: Monitor data quality in your data lake using PyDeequ and AWS Glue.
- Check out the PyDeequ Release Announcement Blogpost, with a tutorial walkthrough of the Amazon Reviews dataset!
- Join the PyDeequ community on PyDeequ Slack to chat with the devs!

Quickstart

The following will get you started with some basic usage. For more in-depth examples, take a look in the tutorials/ directory for executable Jupyter notebooks of each module. For documentation on supported interfaces, view the documentation.

Installation

You can install PyDeequ via pip.

pip install pydeequ
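The announcements above mention that PyDeequ expects a SPARK_VERSION environment variable. A minimal sketch of setting it from Python before PyDeequ is imported follows; the helper name `ensure_spark_version` and the default value "3.3" are assumptions for illustration, so substitute the Spark version you actually run.

```python
import os


def ensure_spark_version(default: str = "3.3") -> str:
    """Set SPARK_VERSION if it is not already set and return the effective value.

    Assumption: "3.3" is only a placeholder default; use your own Spark version.
    """
    return os.environ.setdefault("SPARK_VERSION", default)


spark_version = ensure_spark_version()
print(f"SPARK_VERSION={spark_version}")

# Import pydeequ only after this point, so that it sees the variable.
```

An existing value is left untouched, so exporting SPARK_VERSION in the shell still takes precedence over the default in the sketch.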

Set up a PySpark session

from pyspark.sql import SparkSession, Row
import pydeequ

spark = (SparkSession
    .builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

df = spark.sparkContext.parallelize([
    Row(a="foo", b=1, c=5),
    Row(a="bar", b=2, c=6),
    Row(a="baz", b=3, c=None)]).toDF()

Analyzers

from pydeequ.analyzers import *

analysisResult = AnalysisRunner(spark) \
                    .onData(df) \
                    .addAnalyzer(Size()) \
                    .addAnalyzer(Completeness("b")) \
                    .run()

analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()
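To make the two metrics in the Analyzers example concrete, here is a plain-Python sketch of what Size and Completeness("b") measure on the three sample rows: the row count, and the fraction of non-null values in a column. It is not how Deequ computes them (Deequ runs on Spark), and the helper names `size` and `completeness` are hypothetical.

```python
# Plain-Python illustration of the Size and Completeness metrics.
# It only mirrors the metric definitions on the quickstart's sample rows.

rows = [
    {"a": "foo", "b": 1, "c": 5},
    {"a": "bar", "b": 2, "c": 6},
    {"a": "baz", "b": 3, "c": None},
]


def size(rows):
    """Number of rows in the dataset."""
    return len(rows)


def completeness(rows, column):
    """Fraction of rows whose value in `column` is not null."""
    if not rows:
        # Assumption for this sketch only: report 0.0 for an empty dataset.
        return 0.0
    non_null = sum(1 for row in rows if row[column] is not None)
    return non_null / len(rows)


print(size(rows))               # 3
print(completeness(rows, "b"))  # 1.0
print(completeness(rows, "c"))  # 0.6666666666666666
```

Column "c" contains one null among three rows, which is why its completeness is two thirds while "b" is fully complete.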

Profile

from pydeequ.profiles import *

result = ColumnProfilerRunner(spark) \
    .onData(df) \
    .run()

for col, profile in result.profiles.items():
    print(profile)

Constraint Suggestions

from pydeequ.suggestions import *

suggestionResult = ConstraintSuggestionRunner(spark) \
             .onData(df) \
             .addConstraintRule(DEFAULT()) \
             .run()

# Constraint Suggestions in JSON format
print(suggestionResult)

Constraint Verification

from pydeequ.checks import *
from pydeequ.verification import *

check = Check(spark, CheckLevel.Warning, "Review Check")

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        check.hasSize(lambda x: x >= 3) \
        .hasMin("b", lambda x: x == 0) \
        .isComplete("c")  \
        .isUnique("a")  \
        .isContainedIn("a", ["foo", "bar", "baz"]) \
        .isNonNegative("b")) \
    .run()

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()
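As a rough guide to reading the check results, the sketch below evaluates the same six constraints on the three sample rows in plain Python. It approximates the intent of each constraint and is not Deequ's implementation; the dictionary keys are just labels chosen for this illustration. On this data, two of the six constraints would be expected to fail: the minimum of "b" is 1 rather than 0, and "c" contains a null.

```python
# Plain-Python approximation of the six constraints in the check above,
# evaluated on the quickstart's sample rows (no Spark involved).

rows = [
    {"a": "foo", "b": 1, "c": 5},
    {"a": "bar", "b": 2, "c": 6},
    {"a": "baz", "b": 3, "c": None},
]

values_a = [row["a"] for row in rows]
values_b = [row["b"] for row in rows]

expected = {
    "hasSize(>= 3)": len(rows) >= 3,
    "hasMin('b', == 0)": min(values_b) == 0,
    "isComplete('c')": all(row["c"] is not None for row in rows),
    "isUnique('a')": len(set(values_a)) == len(values_a),
    "isContainedIn('a', ...)": all(v in {"foo", "bar", "baz"} for v in values_a),
    "isNonNegative('b')": all(v >= 0 for v in values_b),
}

for name, ok in expected.items():
    print(f"{name}: {'Success' if ok else 'Failure'}")
```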

Repository

Save to a Metrics Repository by adding the useRepository() and saveOrAppendResult() calls to your Analysis Runner.

from pydeequ.repository import *
from pydeequ.analyzers import *

metrics_file = FileSystemMetricsRepository.helper_metrics_file(spark, 'metrics.json')
repository = FileSystemMetricsRepository(spark, metrics_file)
key_tags = {'tag': 'pydeequ hello world'}
resultKey = ResultKey(spark, ResultKey.current_milli_time(), key_tags)

analysisResult = AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(ApproxCountDistinct('b')) \
    .useRepository(repository) \
    .saveOrAppendResult(resultKey) \
    .run()

To load previous runs, use the repository object to load the earlier results back in.

result_metrep_df = repository.load() \
    .before(ResultKey.current_milli_time()) \
    .forAnalyzers([ApproxCountDistinct('b')]) \
    .getSuccessMetricsAsDataFrame()
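The load example passes the current time to before(), which selects every stored run. To restrict the query to older runs, a different cutoff can be computed in plain Python. The sketch below assumes that result keys are epoch timestamps in milliseconds, as the name current_milli_time suggests; the helper `millis_days_ago` is hypothetical and not part of PyDeequ.

```python
import time

MILLIS_PER_DAY = 24 * 60 * 60 * 1000


def millis_days_ago(days, now_ms=None):
    """Epoch-millisecond timestamp `days` days before now (or before `now_ms`)."""
    if now_ms is None:
        now_ms = int(round(time.time() * 1000))
    return now_ms - days * MILLIS_PER_DAY


# Cutoff one week in the past.
cutoff = millis_days_ago(7)
print(cutoff)

# With the quickstart objects in scope, the cutoff could take the place of
# ResultKey.current_milli_time() in the load call:
#   repository.load().before(cutoff)...
```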

Wrapping up

After you've run your jobs with PyDeequ, be sure to shut down your Spark session to prevent any hanging processes.

spark.sparkContext._gateway.shutdown_callback_server()
spark.stop()
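If a job raises before reaching those two calls, the session is left running. One possible way to guard against that is to wrap the shutdown in a context manager, sketched below. The name `closing_spark` is hypothetical, and a mock object stands in for the Spark session so that the sketch runs without Spark installed.

```python
from contextlib import contextmanager
from unittest.mock import MagicMock


@contextmanager
def closing_spark(spark):
    """Yield `spark`, then always run the two shutdown calls shown above."""
    try:
        yield spark
    finally:
        spark.sparkContext._gateway.shutdown_callback_server()
        spark.stop()


# Stand-in for a real SparkSession (assumption: used here only for the demo).
fake_spark = MagicMock()

try:
    with closing_spark(fake_spark):
        raise ValueError("simulated job failure")
except ValueError:
    pass

# The shutdown calls ran even though the body raised.
fake_spark.stop.assert_called_once()
```

With a real session, the PyDeequ jobs would go inside the `with closing_spark(spark):` block in place of the simulated failure.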

Contributing

Please refer to the contributing doc for how to contribute to PyDeequ.

License

This library is licensed under the Apache 2.0 License.

Contributing Developer Setup

- Setup SDKMAN
- Setup Java
- Setup Apache Spark
- Install Poetry
- Run tests locally

Setup SDKMAN

SDKMAN is a tool for managing parallel versions of multiple Software Development Kits on any Unix-based system. It provides a convenient command line interface for installing, switching, removing and listing candidates. SDKMAN! installs smoothly on Mac OSX, Linux, WSL, Cygwin, etc., and supports Bash and ZSH shells. See the documentation on the SDKMAN! website.

Open your favourite terminal and enter the following:

$ curl -s https://get.sdkman.io | bash

If the environment needs tweaking for SDKMAN to be installed, the installer will prompt you accordingly and ask you to restart.

Next, open a new terminal or enter:

$ source "$HOME/.sdkman/bin/sdkman-init.sh"

Lastly, run the following code snippet to ensure that installation succeeded:

$ sdk version

Setup Java

Install Java. Open your favourite terminal and enter the following:

List the AdoptOpenJDK OpenJDK versions:

$ sdk list java

To install Java 11:

$ sdk install java 11.0.10.hs-adpt

To install Java 8:

$ sdk install java 8.0.292.hs-adpt

Setup Apache Spark

Install Apache Spark. Open your favourite terminal and enter the following:

List the Apache Spark versions:

$ sdk list spark

To install Spark 3:

$ sdk install spark 3.0.2

Poetry

Poetry Commands

poetry install

poetry update

# --tree: List the dependencies as a tree.
# --latest (-l): Show the latest version.
# --outdated (-o): Show the latest version but only for packages that are outdated.
poetry show -o

Running Tests Locally

Take a look at the tests in tests/dataquality and tests/jobs.

$ poetry run pytest

Running Tests Locally (Docker)

If you have issues installing the dependencies listed above, another way to run the tests and verify your changes is through Docker. There is a Dockerfile that will install the required dependencies and run the tests in a container.

docker build . -t spark-3.3-docker-test
docker run spark-3.3-docker-test


Releases

v1.4.0 (latest), Jul 3, 2024
