rapidsai/gpu-bdb
gpu-bdb is derived from TPCx-BB. Any results based on gpu-bdb are considered unofficial results and, per TPC policy, cannot be compared against official TPCx-BB results.
The GPU Big Data Benchmark (gpu-bdb) is a RAPIDS-library-based benchmark for enterprises that includes 30 queries representing real-world ETL & ML workflows at various "scale factors": SF1000 is 1 TB of data, SF10000 is 10 TB. Each "query" is in fact a model workflow that can include SQL, user-defined functions, careful sub-setting and aggregation, and machine learning.
We provide a conda environment definition specifying all RAPIDS dependencies needed to run our query implementations. To install and activate it:
```bash
CONDA_ENV="rapids-gpu-bdb"
conda env create --name $CONDA_ENV -f gpu-bdb/conda/rapids-gpu-bdb.yml
conda activate rapids-gpu-bdb
```
This repository includes a small local module containing utility functions for running the queries. You can install it with the following:
```bash
cd gpu-bdb/gpu_bdb
python -m pip install .
```
This will install a package named `bdb-tools` into your Conda environment. It should look like this:
```bash
conda list | grep bdb
bdb-tools                 0.2                      pypi_0    pypi
```
Note that this Conda environment needs to be replicated or installed manually on all nodes, which will allow starting one `dask-cuda-worker` per node.
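A minimal sketch of one way to replicate the environment, assuming passwordless SSH and a hypothetical `NODES` variable listing your worker hostnames (the environment file path matches the install step above):

```bash
# Hypothetical sketch: create the same Conda environment on every worker node.
for node in $NODES; do
  ssh "$node" conda env create --name rapids-gpu-bdb -f gpu-bdb/conda/rapids-gpu-bdb.yml
done
```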
Queries 10, 18, and 19 depend on two static files (negativeSentiment.txt and positiveSentiment.txt). As we cannot redistribute those files, you should download the TPCx-BB toolkit and extract them to your data directory on your shared filesystem:
```bash
jar xf bigbenchqueriesmr.jar
cp tpcx-bb1.3.1/distributions/Resources/io/bigdatabenchmark/v1/queries/q10/*.txt ${DATA_DIR}/sentiment_files/
```
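A quick check (hedged, assuming `DATA_DIR` is set as above) that both files landed where the queries expect them:

```bash
ls ${DATA_DIR}/sentiment_files/
# expected: negativeSentiment.txt  positiveSentiment.txt
```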
For Query 27, we rely on spaCy. To download the necessary language model after activating the Conda environment:
```bash
python -m spacy download en_core_web_sm
```
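An optional sanity check (our suggestion, not part of the benchmark) that the model loads in the active environment:

```bash
# Raises an error if the model is missing from this environment.
python -c "import spacy; spacy.load('en_core_web_sm')"
```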
We use the `dask-scheduler` and `dask-cuda-worker` command line interfaces to start a Dask cluster. We provide a `cluster_configuration` directory with a bash script to help you set up an NVLink-enabled cluster using UCX.

Before running the script, you'll need to make changes specific to your environment in `cluster_configuration/cluster-startup.sh`:
- Update `GPU_BDB_HOME=...` to the location of this repository on disk.
- Update `CONDA_ENV_PATH=...` to refer to your conda environment path.
- Update `CONDA_ENV_NAME=...` to refer to the name of the conda environment you created, perhaps using the `yml` files provided in this repository.
- Update `INTERFACE=...` to refer to the relevant network interface present on your cluster.
- Update `CLUSTER_MODE="TCP"` to refer to your communication method, either "TCP" or "NVLINK". You can also configure this as an environment variable.
- You may also need to change the `LOCAL_DIRECTORY` and `WORKER_DIR` depending on your filesystem. Make sure that these point to a location to which you have write access and that `LOCAL_DIRECTORY` is accessible from all nodes. For illustration, see the placeholder values below.
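Every value in this sketch is a placeholder for your environment; only the variable names come from `cluster-startup.sh`:

```bash
# Illustrative placeholder values only -- substitute your own paths and interface.
GPU_BDB_HOME=/home/user/gpu-bdb
CONDA_ENV_PATH=/home/user/miniconda3/envs/rapids-gpu-bdb
CONDA_ENV_NAME=rapids-gpu-bdb
INTERFACE=ib0
CLUSTER_MODE="NVLINK"
LOCAL_DIRECTORY=/shared/dask-local
WORKER_DIR=/tmp/dask-workers
```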
To start up the cluster on your scheduler node, please run the following from `gpu_bdb/cluster_configuration/`. This will spin up a scheduler and one Dask worker per GPU.

```bash
DASK_JIT_UNSPILL=True CLUSTER_MODE=NVLINK bash cluster-startup.sh SCHEDULER
```
Note: Don't use `DASK_JIT_UNSPILL` when running BlazingSQL queries.
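For example, a TCP cluster suitable for BlazingSQL queries could be started without that variable:

```bash
# TCP mode, no JIT unspilling -- appropriate for BlazingSQL queries.
CLUSTER_MODE=TCP bash cluster-startup.sh SCHEDULER
```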
Then run the following on every other node from `gpu_bdb/cluster_configuration/`:

```bash
bash cluster-startup.sh
```

This will spin up one Dask worker per GPU. If you are running on a single node, you will only need to run `bash cluster-startup.sh SCHEDULER`.
If you are using a Slurm cluster, please adapt the example Slurm setup in `gpu_bdb/benchmark_runner/slurm/`, which uses `gpu_bdb/cluster_configuration/cluster-startup-slurm.sh`.
To run a query, starting from the repository root, go to the query-specific subdirectory. For example, to run q07:

```bash
cd gpu_bdb/queries/q07/
```
The queries assume that they can attach to a running Dask cluster. The cluster address and other benchmark configuration live in a YAML file (`gpu_bdb/benchmark_runner/benchmark_config.yaml`). You will need to fill this out as appropriate if you are not using the Slurm cluster configuration.
```bash
conda activate rapids-gpu-bdb
python gpu_bdb_query_07.py --config_file=../../benchmark_runner/benchmark_config.yaml
```
To profile a gpu-bdb query with NSYS, change `start_local_cluster` in `benchmark_config.yaml` to `True` and run:
```bash
nsys profile -t cuda,nvtx python gpu_bdb_query_07_dask_sql.py --config_file=../../benchmark_runner/benchmark_config.yaml
```
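If you prefer to flip that flag from the shell, a hedged one-liner, assuming the key appears in the file exactly as `start_local_cluster: False`:

```bash
# In-place edit; adjust the match if the YAML formats the value differently.
sed -i 's/start_local_cluster: False/start_local_cluster: True/' ../../benchmark_runner/benchmark_config.yaml
```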
Note: There is no need to start workers with `cluster-startup.sh`, as a `LocalCUDACluster` is started by the `attach_to_cluster` API.
This repository includes optional performance-tracking automation using Google Sheets. To enable logging query runtimes, on the client node:
```bash
export GOOGLE_SHEETS_CREDENTIALS_PATH=<path to creds.json>
```
Then configure the `--sheet` and `--tab` arguments in `benchmark_config.yaml`.
The included `benchmark_runner.py` script will run all queries sequentially. Configuration for this type of end-to-end run is specified in `benchmark_runner/benchmark_config.yaml`.
To run all queries, cd to `gpu_bdb/` and run:

```bash
python benchmark_runner.py --config_file benchmark_runner/benchmark_config.yaml
```
By default, this will run each Dask query five times and, if BlazingSQL queries are enabled in `benchmark_config.yaml`, each BlazingSQL query five times. You can control the number of repeats by changing the `N_REPEATS` variable in the script.
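For example, a quick way to locate that variable before editing it (a generic shell sketch, not a repository-provided command):

```bash
# Show the line number(s) where the repeat count is defined.
grep -n "N_REPEATS" benchmark_runner.py
```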
BlazingSQL implementations of all queries are included. BlazingSQL currently supports communication via TCP. To run BlazingSQL queries, please follow the instructions above to create a cluster using `CLUSTER_MODE=TCP`.
The RAPIDS queries expect Apache Parquet formatted data. We provide a script which can be used to convert bigBench dataGen's raw CSV files to optimally sized Parquet partitions.
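As a rough illustration only (not the provided script), converting a single table might look like the following with dask_cudf; the paths, delimiter, and partition sizing here are all assumptions:

```bash
python - <<'EOF'
# Hypothetical sketch: convert one table's raw CSVs to Parquet with dask_cudf.
import dask_cudf

ddf = dask_cudf.read_csv("data/raw/item/*.csv", sep="|")  # assumed layout/delimiter
ddf = ddf.repartition(partition_size="128MB")  # aim for uniformly sized partitions
ddf.to_parquet("data/parquet/item/")
EOF
```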