Luanee/pandera-report

Pandera Report for row-based reporting by using the power of pandera.

License

MIT license

Pandera Extension for row-based reporting

🚀 Description

pandera provides a flexible and expressive API for performing data validation on dataframe-like objects to make data processing pipelines more readable and robust.

If you have to report potential quality issues resulting from dataframe validation via pandera, then pandera-report is your friend. Based on the validation issues that pandera reports, your original dataframe is extended with these issues on a row-level basis.

With pandera-report, you can:

- Integrate seamlessly with the pandera library to get enhanced data validation capabilities without interfering with pandera's own functionality.
- Enrich your data with information about why specific rows failed validation.

⚡ Setup

Using pip:

pip install pandera-report

Using poetry:

poetry add pandera-report

Quick start

The following example is taken from the pandera documentation and shows the definition of a DataFrameSchema that yields a valid result for the provided dataframe.

```python
import pandas as pd
import pandera as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a series as input and
        # outputs a boolean or boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

validated_df = schema(df)
print(validated_df)

#     column1  column2  column3
#  0        1     -1.3  value_1
#  1        4     -1.4  value_2
#  2        0     -2.9  value_3
#  3       10    -10.1  value_2
#  4        9    -20.4  value_1
```

To use the pandera-report functionality with the same schema and dataframe, you can do this:

```python
# assumed import path (the original snippet omits the import)
from pandera_report import DataFrameValidator

validator = DataFrameValidator()  # default is quality_report=True, lazy=True
print(validator.validate(schema, df))

#     column1  column2  column3 quality_issues quality_status
#  0        1     -1.3  value_1           None          Valid
#  1        4     -1.4  value_2           None          Valid
#  2        0     -2.9  value_3           None          Valid
#  3       10    -10.1  value_2           None          Valid
#  4        9    -20.4  value_1           None          Valid
```

As you can see, the result is the same, extended with the information that every row of the dataframe passed validation. This extension can also be deactivated for the case where everything is 100% valid.

But what if the dataframe contains data quality issues? pandera will raise SchemaErrors or SchemaError (depending on whether validation is lazy). Let's see what pandera-report does if we change the dataframe so that it violates the schema definition:

```python
# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value1"],
})

validator = DataFrameValidator()
print(validator.validate(schema, df))

#     column1  column2  column3                              quality_issues quality_status
#  0        1     -1.3  value_1                                        None          Valid
#  1        4     -1.4  value_2                                        None          Valid
#  2        0     -2.9  value_3                                        None          Valid
#  3       10    -10.1  value_2                                        None          Valid
#  4        9    -20.4   value1  Column <column3>: str_startswith('value_')        Invalid
```

Why is this useful? Quite simply, it becomes particularly interesting when you are not the one who has to prepare a valid file so that it can be processed into a valid DataFrame in the end.
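As a small illustration of that workflow, the sketch below splits a reported dataframe into rows that can be processed and rows that go back to whoever prepared the file. The frame is built by hand to mirror the output shown above rather than produced by pandera-report, and the variable names are illustrative.

```python
import pandas as pd

# hand-built stand-in for the validator output shown above
report = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value1"],
    "quality_issues": [None, None, None, None,
                       "Column <column3>: str_startswith('value_')"],
    "quality_status": ["Valid", "Valid", "Valid", "Valid", "Invalid"],
})

# rows to hand back, including the reason each one failed
invalid_rows = report[report["quality_status"] == "Invalid"]

# rows that can be processed further, without the reporting columns
valid_rows = report[report["quality_status"] == "Valid"].drop(
    columns=["quality_issues", "quality_status"]
)

# CSV text that could be sent to the person who prepared the file
feedback = invalid_rows.to_csv(index=False)
print(feedback)
```

Because the issue text travels with the row, the file's author can see what to fix without reading a stack trace.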


Releases: 3

Latest release: v0.1.2 (Sep 27, 2024)
