Pandera Report for row-based reporting by using the power of pandera.
Luanee/pandera-report
pandera provides a flexible and expressive API for performing data validation on dataframe-like objects to make data processing pipelines more readable and robust.
If you have to report potential quality issues resulting from dataframe validation via `pandera`, then `pandera-report` is your friend. Based on the validation issues that pandera reports, your original dataframe is extended with these issues at row level.
With `pandera-report`, you can:
- Integrate seamlessly with the `pandera` library to get enhanced data validation capabilities without interfering with the pandera functionality.
- Enrich your data with information about why specific rows failed validation.
Using pip:

```bash
pip install pandera-report
```

Using poetry:

```bash
poetry add pandera-report
```
The following example is taken from the `pandera` documentation and shows the definition of a DataFrameSchema that yields a valid result for the provided dataframe.
```python
import pandas as pd
import pandera as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a series as input and
        # outputs a boolean or boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2),
    ]),
})

validated_df = schema(df)
print(validated_df)

#    column1  column2  column3
# 0        1     -1.3  value_1
# 1        4     -1.4  value_2
# 2        0     -2.9  value_3
# 3       10    -10.1  value_2
# 4        9    -20.4  value_1
```
To use the `pandera-report` functionality for the same schema and dataframe, you can do this:
```python
# Import path assumed from the package name; the original snippet omits it.
from pandera_report import DataFrameValidator

validator = DataFrameValidator()  # default is quality_report=True, lazy=True
print(validator.validate(schema, df))

#    column1  column2  column3 quality_issues quality_status
# 0        1     -1.3  value_1           None          Valid
# 1        4     -1.4  value_2           None          Valid
# 2        0     -2.9  value_3           None          Valid
# 3       10    -10.1  value_2           None          Valid
# 4        9    -20.4  value_1           None          Valid
```
You see?! Same result, but extended with the information that every row of the dataframe passed validation. These extra columns can also be deactivated for the case that everything is 100% valid.
But what if the dataframe contains data quality issues? `pandera` will raise SchemaErrors or SchemaError (depending on the laziness). Let's see what `pandera-report` does if we change the dataframe so that it violates the schema definition:
```python
# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value1"],
})

validator = DataFrameValidator()
print(validator.validate(schema, df))

#    column1  column2  column3                               quality_issues quality_status
# 0        1     -1.3  value_1                                         None          Valid
# 1        4     -1.4  value_2                                         None          Valid
# 2        0     -2.9  value_3                                         None          Valid
# 3       10    -10.1  value_2                                         None          Valid
# 4        9    -20.4   value1  Column <column3>: str_startswith('value_')        Invalid
```
Why is this useful? Quite simply, it becomes particularly interesting when you are not the one who has to prepare a valid file so that it can be processed into a valid DataFrame in the end.
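To make the row-level idea concrete, the following is a rough sketch of how failure information could be attached to a dataframe using pandas alone. It is not the implementation of `pandera-report`: the helper `attach_issues` is hypothetical, and the hand-built `failure_cases` frame only mimics the `column`, `check` and `index` columns that pandera exposes on lazy validation errors.

```python
import pandas as pd


def attach_issues(df, failure_cases):
    """Hypothetical helper: add quality_issues / quality_status columns."""
    # One message per failure, in the style "Column <name>: check".
    messages = "Column <" + failure_cases["column"] + ">: " + failure_cases["check"]
    # Combine all messages that belong to the same row index.
    issues_by_row = messages.groupby(failure_cases["index"]).agg("; ".join).to_dict()

    issues = [issues_by_row.get(i) for i in df.index]
    report = df.copy()
    report["quality_issues"] = issues
    report["quality_status"] = ["Invalid" if issue else "Valid" for issue in issues]
    return report


df = pd.DataFrame(
    {"column3": ["value_1", "value_2", "value_3", "value_2", "value1"]}
)
# Hand-built stand-in for the failure cases of a lazy validation run.
failure_cases = pd.DataFrame({
    "column": ["column3"],
    "check": ["str_startswith('value_')"],
    "index": [4],
})

report = attach_issues(df, failure_cases)
print(report)
```

A report like this can be handed back to whoever prepares the file, since each invalid row carries its own explanation.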