Description
Code Version
Latest development branch commit: 4d5afa8
Observation
CPU usage of gprMax drops below 80% when the grid size is less than 1000x1000x1000 and the number of OpenMP threads exceeds 16. The following image illustrates the effective CPU usage on a 64-core system with a grid size of 200x200x200.
Profiling
During the PML update phase, numerous "bubbles" (marked in white) are observed, indicating that the CPU is neither occupied with OpenMP synchronization (indicated in red) nor engaged in calculations (shown in green).
Upon closer inspection, each PML layer update kernel can be divided into three segments, separated by OpenMP barriers. In the initial segment, the functions `PyGILState_Ensure` and `PyEval_SaveThread` are executed (the white portion). The execution time of these functions varies significantly between threads. Threads that complete these operations early enter a spinning state (the green portion) until all threads reach the barrier, after which the actual computation commences. In the final segment, `PyEval_SaveThread` and `drop_gil` are executed.
The Cause
The Cython code for a PML kernel can be found here: `gprMax/gprMax/cython/pml_updates_magnetic_HORIPML.pyx`, lines 69 to 90 in 4d5afa8:
```cython
for i in prange(0, nx, nogil=True, schedule='static', num_threads=nthreads):
    ii = xf - (i + 1)
    RA01 = RA[0, i] - 1
    RB0 = RB[0, i]
    RE0 = RE[0, i]
    RF0 = RF[0, i]
    for j in range(0, ny):
        jj = j + ys
        for k in range(0, nz):
            kk = k + zs
            # Hy
            materialHy = ID[4, ii, jj, kk]
            dEz = (Ez[ii + 1, jj, kk] - Ez[ii, jj, kk]) / dx
            Hy[ii, jj, kk] = (Hy[ii, jj, kk] + updatecoeffsH[materialHy, 4] *
                              (RA01 * dEz + RB0 * Phi1[0, i, j, k]))
            Phi1[0, i, j, k] = RE0 * Phi1[0, i, j, k] - RF0 * dEz
            # Hz
            materialHz = ID[5, ii, jj, kk]
            dEy = (Ey[ii + 1, jj, kk] - Ey[ii, jj, kk]) / dx
            Hz[ii, jj, kk] = (Hz[ii, jj, kk] - updatecoeffsH[materialHz, 4] *
                              (RA01 * dEy + RB0 * Phi2[0, i, j, k]))
            Phi2[0, i, j, k] = RE0 * Phi2[0, i, j, k] - RF0 * dEy
```
Below is a simplified excerpt of the generated C code:
```c
#pragma omp parallel num_threads(nthreads)
{
    PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
    Py_BEGIN_ALLOW_THREADS
    #pragma omp for
    for (int i = 0; i < nx; i++) {
        // Main loop
    }
    Py_END_ALLOW_THREADS
    __Pyx_PyGILState_Release(__pyx_gilstate_save);
}
```
Each thread acquires the GIL and saves its Python thread state before the worksharing loop, then restores and releases it afterwards. This appears unnecessary, since the kernel performs no Python interaction at all.
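For context, `Py_BEGIN_ALLOW_THREADS` / `Py_END_ALLOW_THREADS` are CPython macros that expand to a thread-state save/restore pair, which is where the `PyEval_SaveThread` calls in the profile come from. With the macros expanded, each thread effectively does the following (a sketch: the expansion comes from CPython's C API, the rest mirrors the simplified excerpt above):

```c
#pragma omp parallel num_threads(nthreads)
{
    /* Acquire the GIL and ensure this OS thread has a Python thread state.
       Threads serialize here, because only one of them can hold the GIL
       at a time. */
    PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();

    /* Expansion of Py_BEGIN_ALLOW_THREADS: save the thread state and
       release the GIL again before the numerical loop. */
    PyThreadState *_save = PyEval_SaveThread();

    #pragma omp for
    for (int i = 0; i < nx; i++) {
        // Main loop: pure C array arithmetic, no Python objects involved.
    }

    /* Expansion of Py_END_ALLOW_THREADS: re-acquire the GIL and restore the
       saved thread state, followed by the matching release. */
    PyEval_RestoreThread(_save);
    __Pyx_PyGILState_Release(__pyx_gilstate_save);
}
```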
Benchmark
After manually removing all the GIL-related code from the generated C file and changing the OpenMP schedule to dynamic, the "bubbles" disappeared and CPU usage increased. Consequently, gprMax achieves improved scalability.
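For illustration, the hand-edited parallel region could look roughly like the sketch below (my reconstruction based on the simplified excerpt above, not the actual patched file; `schedule(dynamic)` stands in for the schedule change):

```c
#pragma omp parallel num_threads(nthreads)
{
    /* No per-thread GIL acquisition or thread-state save/restore:
       the region goes straight into the worksharing loop. */
    #pragma omp for schedule(dynamic)
    for (int i = 0; i < nx; i++) {
        // Main loop
    }
}
```

With the Python thread-state calls gone, threads no longer serialize on the GIL before and after the loop, which is presumably why the white regions in the profile vanish.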
Solutions Attempted
The following methods have been tried without success, as Cython continues to generate the same C code:
Wrapping the loop in an explicit `with nogil:` block:

```cython
with nogil:
    for i in prange(0, nx, nogil=False, schedule='static', num_threads=nthreads):
```

Declaring the function itself as `nogil`:

```cython
cpdef void order1_xminus(...) noexcept nogil:
    for i in prange(0, nx, nogil=False, schedule='static', num_threads=nthreads):
```