
High GIL Overhead in PML Update Kernel #429

Open
@esp0r

Description


Code Version

Latest development branch commit: 4d5afa8

Observation

CPU usage of gprMax drops below 80% when the grid size is less than 1000x1000x1000 and the number of OpenMP threads exceeds 16. The following image illustrates the effective CPU usage on a 64-core system with a grid size of 200x200x200.

[Image: effective CPU usage on a 64-core system, 200x200x200 grid]
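For reference, this kind of "effective CPU usage" figure can be reproduced by sampling the simulation process with psutil while a model runs. The sketch below is hypothetical and not part of the original report; the gprMax invocation and the model file name are assumptions:

```python
# measure_cpu.py -- hypothetical sketch for reproducing the measurement.
import subprocess
import psutil

def effective_cpu(cmd, interval=0.5):
    """Average per-process CPU usage, normalised to 0-100% of all cores."""
    proc = subprocess.Popen(cmd)
    ps = psutil.Process(proc.pid)
    samples = []
    while proc.poll() is None:
        try:
            # Per-process cpu_percent can reach ncores*100; divide by the
            # core count to get "effective CPU usage" as a percentage.
            samples.append(ps.cpu_percent(interval=interval) / psutil.cpu_count())
        except psutil.NoSuchProcess:
            break
    return sum(samples) / max(len(samples), 1)

if __name__ == '__main__':
    # Command line and model file are assumptions, not from the report.
    usage = effective_cpu(['python', '-m', 'gprMax', 'model.in'])
    print(f'average effective CPU usage: {usage:.1f}%')
```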

Profiling

During the PML update phase, numerous "bubbles" (marked in white) are observed: periods in which the CPU is neither occupied with OpenMP synchronization (red) nor engaged in computation (green).

[Image: profiler timeline of the PML update phase]

Upon closer inspection, each PML layer update kernel can be divided into three segments separated by OpenMP barriers. In the initial segment, the functions PyGILState_Ensure and PyEval_SaveThread are executed (the white portion). The execution time of these functions varies significantly between threads; threads that complete these operations early enter a spinning state (the green portion) until all threads reach the barrier, after which the actual computation commences. In the final segment, PyEval_SaveThread and drop_gil are executed.

[Image: PML layer update kernel, three segments separated by OpenMP barriers]

The Cause

The Cython code for a PML kernel is as follows:

```cython
for i in prange(0, nx, nogil=True, schedule='static', num_threads=nthreads):
    ii = xf - (i + 1)
    RA01 = RA[0, i] - 1
    RB0 = RB[0, i]
    RE0 = RE[0, i]
    RF0 = RF[0, i]
    for j in range(0, ny):
        jj = j + ys
        for k in range(0, nz):
            kk = k + zs
            # Hy
            materialHy = ID[4, ii, jj, kk]
            dEz = (Ez[ii + 1, jj, kk] - Ez[ii, jj, kk]) / dx
            Hy[ii, jj, kk] = (Hy[ii, jj, kk] + updatecoeffsH[materialHy, 4] *
                              (RA01 * dEz + RB0 * Phi1[0, i, j, k]))
            Phi1[0, i, j, k] = RE0 * Phi1[0, i, j, k] - RF0 * dEz
            # Hz
            materialHz = ID[5, ii, jj, kk]
            dEy = (Ey[ii + 1, jj, kk] - Ey[ii, jj, kk]) / dx
            Hz[ii, jj, kk] = (Hz[ii, jj, kk] - updatecoeffsH[materialHz, 4] *
                              (RA01 * dEy + RB0 * Phi2[0, i, j, k]))
            Phi2[0, i, j, k] = RE0 * Phi2[0, i, j, k] - RF0 * dEy
```

Below is a simplified excerpt of the generated C code:

```c
#pragma omp parallel num_threads(nthreads)
{
    PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
    Py_BEGIN_ALLOW_THREADS
    #pragma omp for
    for (int i = 0; i < nx; i++) {
        // Main loop
    }
    Py_END_ALLOW_THREADS
    __Pyx_PyGILState_Release(__pyx_gilstate_save);
}
```

Each thread saves its Python context on entry to the parallel region and restores it on exit. This behavior seems unnecessary, since the kernel does not require any Python interaction.
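This does not appear to be specific to the gprMax kernels. As a hypothetical minimal reproducer (not from this report), compiling a module like the following with OpenMP and inspecting the generated .c file should show the same per-thread PyGILState_Ensure / Py_BEGIN_ALLOW_THREADS pattern, even though the loop body never touches a Python object:

```cython
# reproducer.pyx -- hypothetical minimal example, not from the issue.
# Build with OpenMP enabled (e.g. extra_compile_args=['-fopenmp'],
# extra_link_args=['-fopenmp'] in setup.py).
from cython.parallel import prange

def sum_squares(long n, int nthreads):
    cdef long total = 0
    cdef long i
    # Pure C arithmetic (a prange reduction); Cython still emits the GIL
    # state setup/teardown around the generated "#pragma omp parallel".
    for i in prange(n, nogil=True, schedule='static', num_threads=nthreads):
        total += i * i
    return total
```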

Benchmark

After manually removing all the GIL-related code from the generated C file and changing the OpenMP schedule to dynamic, the "bubbles" disappeared and CPU usage increased. Consequently, gprMax achieved improved scalability.

[Image: improved CPU usage after removing the GIL-related code]

[Image: scalability comparison before and after the manual edit]
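The schedule change corresponds to schedule='dynamic' in prange, but the GIL-code removal currently has to be done on the generated C. The manual edit could in principle be automated with a post-processing step; the sketch below is hypothetical and fragile: it assumes each GIL call sits on its own line, as in the excerpt above, and it should only be applied to the PML kernel module, since other generated code may genuinely need the GIL:

```python
# strip_gil.py -- hypothetical post-processing sketch, not a robust fix.
# Removes per-thread GIL setup/teardown lines from a Cython-generated
# C file, assuming each call sits on its own line.
import sys

GIL_MARKERS = ('__Pyx_PyGILState_Ensure', '__Pyx_PyGILState_Release',
               'Py_BEGIN_ALLOW_THREADS', 'Py_END_ALLOW_THREADS')

def strip_gil(path):
    with open(path) as f:
        lines = f.readlines()
    kept = [ln for ln in lines if not any(m in ln for m in GIL_MARKERS)]
    with open(path, 'w') as f:
        f.writelines(kept)
    print(f'{path}: removed {len(lines) - len(kept)} GIL-related line(s)')

if __name__ == '__main__':
    strip_gil(sys.argv[1])
```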

Solutions Attempted

The following methods have been tried without success, as Cython continues to generate the same C code:

```cython
with nogil:
    for i in prange(0, nx, nogil=False, schedule='static', num_threads=nthreads):
        ...
```

```cython
cpdef void order1_xminus(...) noexcept nogil:
    for i in prange(0, nx, nogil=False, schedule='static', num_threads=nthreads):
        ...
```
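Whether a given Cython-level change actually altered the output can be checked mechanically by counting the GIL-related calls in the regenerated C file. A small sketch (the file path is whatever .c file Cython produced for the PML module):

```python
# check_gil.py -- hypothetical sketch to confirm whether regenerated C
# still contains the per-thread GIL handling described above.
import sys

CALLS = ('__Pyx_PyGILState_Ensure', 'Py_BEGIN_ALLOW_THREADS',
         'Py_END_ALLOW_THREADS', '__Pyx_PyGILState_Release')

def report(path):
    with open(path) as f:
        text = f.read()
    for call in CALLS:
        print(f'{call}: {text.count(call)} occurrence(s)')

if __name__ == '__main__':
    report(sys.argv[1])
```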
