Library for exploring and validating machine learning data
tensorflow/data-validation
TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).
TF Data Validation includes:
- Scalable calculation of summary statistics of training and test data.
- Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets).
- Automated data-schema generation to describe expectations about data like required values, ranges, and vocabularies.
- A schema viewer to help you inspect the schema.
- Anomaly detection to identify anomalies, such as missing features, out-of-range values, or wrong feature types, to name a few.
- An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.
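The capabilities above correspond to a small number of TFDV calls. The following is a minimal sketch, not a complete workflow: the function name `find_anomalies` and the CSV paths passed to it are hypothetical, and it assumes both files share the same header row.

```python
def find_anomalies(train_csv, eval_csv):
    """Infer a schema from training data and check evaluation data against it.

    Returns a dict mapping feature name to a human-readable anomaly description.
    """
    # Imported lazily so this module can be loaded without TFDV installed.
    import tensorflow_data_validation as tfdv

    # Summary statistics for each dataset.
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)

    # Schema describing the expectations derived from the training statistics.
    schema = tfdv.infer_schema(statistics=train_stats)

    # Compare the evaluation statistics with the inferred schema.
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)

    # anomaly_info maps each anomalous feature to details about the problem.
    return {
        name: info.description
        for name, info in anomalies.anomaly_info.items()
    }
```

In a notebook, `tfdv.visualize_statistics`, `tfdv.display_schema`, and `tfdv.display_anomalies` render the statistics, schema, and anomalies viewers for the same objects.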
For instructions on using TFDV, see the get started guide and try out the example notebook. Some of the techniques implemented in TFDV are described in a technical paper published in SysML'19.
Caution: TFDV may be backwards incompatible before version 1.0.
The recommended way to install TFDV is using the PyPI package:
pip install tensorflow-data-validation
This is the recommended way to build TFDV under Linux, and is continuously tested at Google.
Please first install docker and docker-compose by following the directions: docker; docker-compose.
git clone https://github.com/tensorflow/data-validation
cd data-validation
Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.
When building on Python 2, make sure to strip the Python types in the source code using the following commands:
pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/
Then, run the following at the project root:
sudo docker-compose build manylinux2010
sudo docker-compose run -e PYTHON_VERSION=${PYTHON_VERSION} manylinux2010
where PYTHON_VERSION is one of {27, 35, 36, 37}.
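If the Docker build is scripted, the version tag can be validated before anything is launched. The helper below is a hypothetical sketch (the names `SUPPORTED_TAGS` and `docker_build_commands` are not part of TFDV); it only assembles the two commands shown above and does not run them.

```python
# Version tags accepted by the Docker build, as listed above.
SUPPORTED_TAGS = ("27", "35", "36", "37")


def docker_build_commands(python_version):
    """Return the two docker-compose invocations as argument lists."""
    if python_version not in SUPPORTED_TAGS:
        raise ValueError(
            "PYTHON_VERSION must be one of {}, got {!r}".format(
                ", ".join(SUPPORTED_TAGS), python_version
            )
        )
    return [
        ["sudo", "docker-compose", "build", "manylinux2010"],
        [
            "sudo", "docker-compose", "run",
            "-e", "PYTHON_VERSION={}".format(python_version),
            "manylinux2010",
        ],
    ]


if __name__ == "__main__":
    # Print the commands for Python 3.7 without executing them.
    for command in docker_build_commands("37"):
        print(" ".join(command))
```

Each returned list can be passed to `subprocess.run(command, check=True)` from the project root.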
A wheel will be produced under dist/.
pip install dist/*.whl
To compile and use TFDV, you need to set up some prerequisites.
If NumPy is not installed on your system, install it now by following these directions.
If Bazel is not installed on your system, install it now by following these directions.
git clone https://github.com/tensorflow/data-validation
cd data-validation
Note that these instructions will install the latest master branch of TensorFlow Data Validation. If you want to install a specific branch (such as a release branch), pass -b <branchname> to the git clone command.
When building on Python 2, make sure to strip the Python types in the source code using the following commands:
pip install strip-hints
python tensorflow_data_validation/tools/strip_type_hints.py tensorflow_data_validation/
TFDV uses Bazel to build the pip package from source. Before invoking the following commands, make sure the python in your $PATH is the target version and has NumPy installed.
bazel run -c opt --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0 tensorflow_data_validation:build_pip_package
Note that we are assuming here that dependent packages (e.g. PyArrow) are built with a GCC older than 5.1 and use the flag D_GLIBCXX_USE_CXX11_ABI=0 to be compatible with the old std::string ABI.
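The interpreter requirement for the Bazel build can be checked with a short script run by the same python that is on $PATH. This is a sketch under assumptions: the function name `check_build_interpreter` is hypothetical, and the expected version is whatever target you are building for. It is written to run on both Python 2 and Python 3.

```python
import sys


def check_build_interpreter(expected_major_minor):
    """Return a list of problems with the current interpreter (empty if none)."""
    problems = []

    # Compare the running interpreter with the intended target version.
    actual = tuple(sys.version_info[:2])
    expected = tuple(expected_major_minor)
    if actual != expected:
        problems.append(
            "Python {}.{} is running, but {}.{} was expected".format(
                actual[0], actual[1], expected[0], expected[1]
            )
        )

    # The build needs NumPy to be importable by this interpreter.
    try:
        import numpy  # noqa: F401
    except ImportError:
        problems.append("NumPy is not installed for this interpreter")

    return problems


if __name__ == "__main__":
    # Example target: Python 3.7. Adjust to the version you are building for.
    for problem in check_build_interpreter((3, 7)):
        print(problem)
```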
You can find the generated .whl file in the dist subdirectory.
pip install dist/*.whl
TFDV is tested on the following 64-bit operating systems:
- macOS 10.12.6 (Sierra) or later.
- Ubuntu 16.04 or later.
- Windows 7 or later.
TFDV requires TensorFlow but does not depend on the tensorflow PyPI package. See the TensorFlow install guides for instructions on how to get started with TensorFlow.
Apache Beam is required; it's the way that efficient distributed computation is supported. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow. TFDV is designed to be extensible for other Apache Beam runners.
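Switching from local mode to Dataflow is done through Beam pipeline options. The sketch below is illustrative only: the helper names are hypothetical, the project, region, and bucket values are placeholders supplied by the caller, and a real Dataflow job typically needs further options (for example, making the TFDV package available to the workers) that are omitted here.

```python
def dataflow_flags(project, region, temp_location):
    """Build Beam command-line flags that select the Dataflow runner."""
    return [
        "--runner=DataflowRunner",
        "--project={}".format(project),
        "--region={}".format(region),
        "--temp_location={}".format(temp_location),
    ]


def generate_stats_with_flags(data_location, output_path, flags):
    """Compute statistics over TFRecord files using the given Beam flags."""
    # Imported lazily so this module can be loaded without Beam or TFDV.
    from apache_beam.options.pipeline_options import PipelineOptions
    import tensorflow_data_validation as tfdv

    return tfdv.generate_statistics_from_tfrecord(
        data_location=data_location,
        output_path=output_path,
        pipeline_options=PipelineOptions(flags),
    )
```

Leaving `pipeline_options` unset keeps the default local execution.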
Apache Arrow is also required. TFDV uses Arrow to represent data internally in order to make use of vectorized numpy functions.
The following table shows the package versions that are compatible with each other. This is determined by our testing framework, but other untested combinations may also work.
tensorflow-data-validation | tensorflow | apache-beam[gcp] | pyarrow |
---|---|---|---|
GitHub master | nightly (1.x/2.x) | 2.17.0 | 0.15.0 |
0.21.1 | 1.15 / 2.1 | 2.17.0 | 0.15.0 |
0.21.0 | 1.15 / 2.1 | 2.17.0 | 0.15.0 |
0.15.0 | 1.15 / 2.0 | 2.16.0 | 0.14.0 |
0.14.1 | 1.14 | 2.14.0 | 0.14.0 |
0.14.0 | 1.14 | 2.14.0 | 0.14.0 |
0.13.1 | 1.13 | 2.11.0 | n/a |
0.13.0 | 1.13 | 2.11.0 | n/a |
0.12.0 | 1.12 | 2.10.0 | n/a |
0.11.0 | 1.11 | 2.8.0 | n/a |
0.9.0 | 1.9 | 2.6.0 | n/a |
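When pinning dependencies in a script, the released rows of the table can be encoded as a lookup. This is a hypothetical convenience sketch (the names `COMPATIBLE_VERSIONS` and `pinned_requirements` are not part of TFDV); the TensorFlow column is kept as text because some rows list two acceptable versions, and `None` stands for the n/a entries.

```python
# (tensorflow, apache-beam[gcp], pyarrow) per released TFDV version,
# copied from the table above. None means "n/a".
COMPATIBLE_VERSIONS = {
    "0.21.1": ("1.15 / 2.1", "2.17.0", "0.15.0"),
    "0.21.0": ("1.15 / 2.1", "2.17.0", "0.15.0"),
    "0.15.0": ("1.15 / 2.0", "2.16.0", "0.14.0"),
    "0.14.1": ("1.14", "2.14.0", "0.14.0"),
    "0.14.0": ("1.14", "2.14.0", "0.14.0"),
    "0.13.1": ("1.13", "2.11.0", None),
    "0.13.0": ("1.13", "2.11.0", None),
    "0.12.0": ("1.12", "2.10.0", None),
    "0.11.0": ("1.11", "2.8.0", None),
    "0.9.0": ("1.9", "2.6.0", None),
}


def pinned_requirements(tfdv_version):
    """Return pip requirement strings for a tested TFDV release.

    TensorFlow is not pinned because the table may list two versions.
    """
    _tensorflow, beam, pyarrow = COMPATIBLE_VERSIONS[tfdv_version]
    requirements = [
        "tensorflow-data-validation=={}".format(tfdv_version),
        "apache-beam[gcp]=={}".format(beam),
    ]
    if pyarrow is not None:
        requirements.append("pyarrow=={}".format(pyarrow))
    return requirements
```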
Please direct any questions about working with TF Data Validation to Stack Overflow using the tensorflow-data-validation tag.