oneCCL Bindings for Pytorch*
This repository holds PyTorch bindings maintained by Intel® for the Intel® oneAPI Collective Communications Library (oneCCL).
PyTorch is an open-source machine learning framework.
Intel® oneCCL (collective communications library) is a library for efficient distributed deep learning training that implements collectives such as allreduce, allgather, and alltoall. For more information on oneCCL, please refer to the oneCCL documentation.
The oneccl_bindings_for_pytorch module implements the PyTorch C10D ProcessGroup API and can be dynamically loaded as an external ProcessGroup. It currently works only on Linux.
The table below shows which functions are available for use with CPU / Intel dGPU tensors.
Function | CPU | GPU |
---|---|---|
send | × | √ |
recv | × | √ |
broadcast | √ | √ |
all_reduce | √ | √ |
reduce | √ | √ |
all_gather | √ | √ |
gather | √ | √ |
scatter | √ | √ |
reduce_scatter | √ | √ |
all_to_all | √ | √ |
barrier | √ | √ |
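The table above can be encoded as a small lookup so that a script fails early when it is about to call a primitive that is not available for its device. This is only a sketch: `SUPPORTED` and `is_supported` are hypothetical names, not part of oneccl_bindings_for_pytorch, and the data is copied by hand from the table.

```python
# Hand-copied from the capability table above; NOT an API of the bindings.
_OPS = [
    "send", "recv", "broadcast", "all_reduce", "reduce", "all_gather",
    "gather", "scatter", "reduce_scatter", "all_to_all", "barrier",
]
_CPU_UNSUPPORTED = {"send", "recv"}  # the two rows marked "×" for CPU

SUPPORTED = {
    op: {"cpu": op not in _CPU_UNSUPPORTED, "gpu": True} for op in _OPS
}


def is_supported(op: str, device: str) -> bool:
    """Return True if the table lists `op` as available for `device`."""
    try:
        return SUPPORTED[op][device.lower()]
    except KeyError:
        raise ValueError(f"unknown op/device: {op!r}/{device!r}") from None
```

For example, `is_supported("send", "cpu")` is false, while every listed primitive is reported as available for GPU tensors.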
We recommend Anaconda as the Python package management system. The following are the branches (tags) of oneccl_bindings_for_pytorch and the PyTorch versions they support.
Usage details can be found in the README of the corresponding branch.
Python 3.8 or later and a C++17 compiler
PyTorch v2.3.1
The following build options are supported in Intel® oneCCL Bindings for PyTorch*.
Build Option | Default Value | Description |
---|---|---|
COMPUTE_BACKEND | N/A | Set oneCCL COMPUTE_BACKEND; set to dpcpp and use the DPC++ compiler to enable support for Intel XPU |
USE_SYSTEM_ONECCL | OFF | Use oneCCL library in system |
CCL_PACKAGE_NAME | oneccl-bind-pt | Set wheel name |
ONECCL_BINDINGS_FOR_PYTORCH_BACKEND | cpu | Set backend |
CCL_SHA_VERSION | False | Add git head sha version into wheel name |
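The build options above are passed as environment variables to `python setup.py install`, as the install commands in this README show. The sketch below assembles such an environment from Python. `BUILD_DEFAULTS`, `build_env`, and `build_command_line` are hypothetical helper names introduced for illustration; only the option names and defaults come from the table.

```python
import os
import shlex

# Option names and defaults copied from the table; None stands for "N/A".
BUILD_DEFAULTS = {
    "COMPUTE_BACKEND": None,
    "USE_SYSTEM_ONECCL": "OFF",
    "CCL_PACKAGE_NAME": "oneccl-bind-pt",
    "ONECCL_BINDINGS_FOR_PYTORCH_BACKEND": "cpu",
    "CCL_SHA_VERSION": "False",
}


def build_env(overrides, base_env=None):
    """Return a copy of the environment with the given build options set."""
    unknown = set(overrides) - set(BUILD_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown build option(s): {sorted(unknown)}")
    env = dict(os.environ if base_env is None else base_env)
    env.update({name: str(value) for name, value in overrides.items()})
    return env


def build_command_line(overrides):
    """Render the equivalent shell command, e.g. for logging."""
    prefix = " ".join(
        f"{name}={shlex.quote(str(value))}" for name, value in overrides.items()
    )
    return f"{prefix} python setup.py install".strip()
```

A build could then be started from the repository root with `subprocess.run([sys.executable, "setup.py", "install"], env=build_env({"COMPUTE_BACKEND": "dpcpp"}), check=True)`.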
The following launch options are supported in Intel® oneCCL Bindings for PyTorch*.
Launch Option | Default Value | Description |
---|---|---|
ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE | 0 | Set verbose level in oneccl_bindings_for_pytorch |
ONECCL_BINDINGS_FOR_PYTORCH_ENV_WAIT_GDB | 0 | Set to 1 to make oneccl_bindings_for_pytorch wait for GDB to attach |
TORCH_LLM_ALLREDUCE | 0 | Set to 1 to enable this prototype feature, which provides better scale-up performance by enabling optimized collective algorithms in oneCCL and asynchronous execution in torch-ccl. It requires XeLink to be enabled for cross-card communication. |
CCL_BLOCKING_WAIT | 0 | Set to 1 to enable this prototype feature, which controls whether collective execution on XPU is host-blocking or non-blocking. |
CCL_SAME_STREAM | 0 | Set to 1 to enable this prototype feature, which allows a computation stream to be used as the communication stream to minimize stream synchronization overhead. |
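As a rough illustration, the launch options can also be set from Python instead of the shell. The helper below is a sketch under two assumptions: the options are read from the process environment, and setting them before importing oneccl_bindings_for_pytorch is early enough (the table does not say when each one is read). `LAUNCH_DEFAULTS` and `apply_launch_options` are hypothetical names.

```python
import os

# Option names and defaults copied from the launch option table.
LAUNCH_DEFAULTS = {
    "ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE": "0",
    "ONECCL_BINDINGS_FOR_PYTORCH_ENV_WAIT_GDB": "0",
    "TORCH_LLM_ALLREDUCE": "0",
    "CCL_BLOCKING_WAIT": "0",
    "CCL_SAME_STREAM": "0",
}


def apply_launch_options(options, environ=None):
    """Write integer-valued launch options into `environ` (os.environ by default)."""
    environ = os.environ if environ is None else environ
    for name, value in options.items():
        if name not in LAUNCH_DEFAULTS:
            raise KeyError(f"unknown launch option: {name}")
        environ[name] = str(int(value))  # True -> "1", 2 -> "2"
    return environ
```

Calling `apply_launch_options({"CCL_BLOCKING_WAIT": 1})` at the top of a script, before the imports, has the same effect as exporting `CCL_BLOCKING_WAIT=1` in the shell for that process.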
Clone oneccl_bindings_for_pytorch.

    git clone https://github.com/intel/torch-ccl.git && cd torch-ccl
    git submodule sync
    git submodule update --init --recursive
Install oneccl_bindings_for_pytorch.

    # for CPU Backend Only
    python setup.py install

    # for XPU Backend: use DPC++ Compiler to enable support for Intel XPU
    # build with oneCCL from third party
    COMPUTE_BACKEND=dpcpp python setup.py install

    # build with oneCCL from basekit
    export INTELONEAPIROOT=${HOME}/intel/oneapi
    USE_SYSTEM_ONECCL=ON COMPUTE_BACKEND=dpcpp python setup.py install
Wheel files are available for the following Python versions. Please always use the latest release to get started.
Extension Version | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 | Python 3.11 |
---|---|---|---|---|---|---|
2.3.100 | √ | √ | √ | √ | ||
2.1.400 | √ | √ | √ | √ | ||
2.1.300 | √ | √ | √ | √ | ||
2.1.200 | √ | √ | √ | √ | ||
2.1.100 | √ | √ | √ | √ | ||
2.0.100 | √ | √ | √ | √ | ||
1.13 | √ | √ | √ | √ | ||
1.12.100 | √ | √ | √ | √ | ||
1.12.0 | √ | √ | √ | √ | ||
1.11.0 | √ | √ | √ | √ | ||
1.10.0 | √ | √ | √ | √ |
python -m pip install oneccl_bind_pt==2.3.100 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
Note: Please set a proxy or change the URL to https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/ if you run into connection issues.
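The choice between the two index URLs can be made explicit in a small helper. This is a sketch only: `pip_install_args` is a hypothetical function, and the version and URLs are the ones quoted above.

```python
import sys

# Base of the two index URLs quoted in this README.
INDEX_BASE = "https://pytorch-extension.intel.com/release-whl/stable/xpu"


def pip_install_args(version="2.3.100", region="us"):
    """Build the argument list for installing the prebuilt wheel."""
    if region not in ("us", "cn"):
        raise ValueError(f"unsupported region: {region!r}")
    return [
        sys.executable, "-m", "pip", "install",
        f"oneccl_bind_pt=={version}",
        "--extra-index-url", f"{INDEX_BASE}/{region}/",
    ]
```

The list can be passed to `subprocess.run(pip_install_args(region="cn"), check=True)` when the default URL is unreachable.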
- If oneccl_bindings_for_pytorch is built without oneCCL and uses the oneCCL installed in the system, dynamically link oneCCL from the oneAPI basekit (recommended usage):

      source $basekit_root/ccl/latest/env/vars.sh

  Note: Make sure you have installed basekit when using Intel® oneCCL Bindings for Pytorch* on Intel® GPUs.
- If oneccl_bindings_for_pytorch is built with oneCCL from third party or installed from a prebuilt wheel:
  Dynamically link oneCCL and Intel MPI libraries:

      source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/setvars.sh

  Dynamically link oneCCL only (not including Intel MPI):

      source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/vars.sh
Note: Please import torch and intel_extension_for_pytorch before importing oneccl_bindings_for_pytorch.
example.py

    import os

    import torch
    import intel_extension_for_pytorch
    import oneccl_bindings_for_pytorch
    import torch.nn.parallel
    import torch.distributed as dist

    ...

    # Rendezvous settings; rank and world size are taken from the MPI launcher
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    os.environ['RANK'] = str(os.environ.get('PMI_RANK', 0))
    os.environ['WORLD_SIZE'] = str(os.environ.get('PMI_SIZE', 1))

    backend = 'ccl'
    dist.init_process_group(backend, ...)
    my_rank = dist.get_rank()
    my_size = dist.get_world_size()
    print("my rank = %d my size = %d" % (my_rank, my_size))

    ...

    model = torch.nn.parallel.DistributedDataParallel(model, ...)

    ...
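The environment handling in the example above can be factored into one function. The sketch below mirrors the example's logic: it copies `PMI_RANK` and `PMI_SIZE` (which the example reads, and which are presumably set by the MPI launcher) into the `RANK` and `WORLD_SIZE` variables, falling back to a single process when they are absent. `export_rank_env` is a hypothetical name, and the address and port are the example's own values.

```python
import os


def export_rank_env(environ=None, master_addr="127.0.0.1", master_port="29500"):
    """Populate the rendezvous variables and return (rank, world_size)."""
    environ = os.environ if environ is None else environ
    environ["MASTER_ADDR"] = master_addr
    environ["MASTER_PORT"] = str(master_port)
    # Without an MPI launcher, fall back to rank 0 of a 1-process group.
    environ["RANK"] = str(environ.get("PMI_RANK", 0))
    environ["WORLD_SIZE"] = str(environ.get("PMI_SIZE", 1))
    return int(environ["RANK"]), int(environ["WORLD_SIZE"])
```

Calling `export_rank_env()` before `dist.init_process_group('ccl')` keeps the launch script itself free of environment bookkeeping.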
(oneccl_bindings_for_pytorch is built without oneCCL; use oneCCL and, if needed, MPI from the system)

    source $basekit_root/ccl/latest/env/vars.sh
    source $basekit_root/mpi/latest/env/vars.sh

    mpirun -n <N> -ppn <PPN> -f <hostfile> python example.py
To debug the performance of communication primitives, PyTorch's Autograd profiler can be used to inspect the time spent inside oneCCL calls.
Example:
profiling.py

    import torch.nn.parallel
    import torch.distributed as dist
    import oneccl_bindings_for_pytorch
    import os

    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    os.environ['RANK'] = str(os.environ.get('PMI_RANK', 0))
    os.environ['WORLD_SIZE'] = str(os.environ.get('PMI_SIZE', 1))

    backend = 'ccl'
    dist.init_process_group(backend)
    my_rank = dist.get_rank()
    my_size = dist.get_world_size()
    print("my rank = %d my size = %d" % (my_rank, my_size))

    x = torch.ones([2, 2])
    y = torch.ones([4, 4])
    with torch.autograd.profiler.profile(record_shapes=True) as prof:
        for _ in range(10):
            dist.all_reduce(x)
            dist.all_reduce(y)
    dist.barrier()
    print(prof.key_averages(group_by_input_shape=True).table(sort_by="self_cpu_time_total"))
mpirun -n 2 -l python profiling.py
    [0] my rank = 0 my size = 2
    [0] -----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------
    [0] Name                                                     Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls          Input Shapes
    [0] -----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------
    [0] oneccl_bindings_for_pytorch::allreduce                       91.41%     297.900ms        91.41%     297.900ms      29.790ms            10              [[2, 2]]
    [0] oneccl_bindings_for_pytorch::wait::cpu::allreduce             8.24%      26.845ms         8.24%      26.845ms       2.684ms            10      [[2, 2], [2, 2]]
    [0] oneccl_bindings_for_pytorch::wait::cpu::allreduce             0.30%     973.651us         0.30%     973.651us      97.365us            10      [[4, 4], [4, 4]]
    [0] oneccl_bindings_for_pytorch::allreduce                        0.06%     190.254us         0.06%     190.254us      19.025us            10              [[4, 4]]
    [0] -----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------
    [0] Self CPU time total: 325.909ms
    [0]
    [1] my rank = 1 my size = 2
    [1] -----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------
    [1] Name                                                     Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls          Input Shapes
    [1] -----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------
    [1] oneccl_bindings_for_pytorch::allreduce                       96.03%     318.551ms        96.03%     318.551ms      31.855ms            10              [[2, 2]]
    [1] oneccl_bindings_for_pytorch::wait::cpu::allreduce             3.62%      12.019ms         3.62%      12.019ms       1.202ms            10      [[2, 2], [2, 2]]
    [1] oneccl_bindings_for_pytorch::allreduce                        0.33%       1.082ms         0.33%       1.082ms     108.157us            10              [[4, 4]]
    [1] oneccl_bindings_for_pytorch::wait::cpu::allreduce             0.02%      56.505us         0.02%      56.505us       5.651us            10      [[4, 4], [4, 4]]
    [1] -----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------
    [1] Self CPU time total: 331.708ms
    [1]
For point-to-point communication, calling dist.send/recv directly after initializing the process group in the launch script triggers a runtime error. In the current implementation, all ranks of the group are expected to participate in this call to create communicators, whereas dist.send/recv involves only a pair of ranks. As a result, dist.send/recv should be used after a collective call, which ensures that all ranks participate. A solution that supports calling dist.send/recv directly after initializing the process group is still under investigation.
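The ordering described above can be sketched as follows. The function is hypothetical and takes the distributed module as a parameter so that the call order is easy to read and check; in a real script `dist` would be `torch.distributed` with the process group already initialized. Using `all_reduce` on a scratch tensor as the preceding collective is an assumption for illustration, since the text only requires that some collective runs first.

```python
def p2p_after_collective(dist, payload, scratch, src=0, dst=1):
    """Send `payload` from rank `src` to rank `dst` after a collective call.

    Must be called on EVERY rank of the group, not only on src and dst,
    because the collective below needs all ranks to participate.
    """
    # Assumption: any collective works here; all_reduce is used as an example.
    dist.all_reduce(scratch)

    rank = dist.get_rank()
    if rank == src:
        dist.send(payload, dst=dst)
    elif rank == dst:
        dist.recv(payload, src=src)
    # Ranks other than src and dst only take part in the collective.
    return payload
```

In a launch script this would be called as `p2p_after_collective(torch.distributed, tensor, torch.zeros(1))` on all ranks, with the tensors placed on the device used for communication.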