Description
Code Version
Latest development branch commit: 4d5afa8
Observation
CPU usage of gprMax drops below 80% when the grid size is less than 1000x1000x1000 and the number of OpenMP threads exceeds 16. The following image illustrates the effective CPU usage on a 64-core system with a grid size of 200x200x200.
Profiling
During the PML update phase, numerous "bubbles" (marked in white) are observed, indicating that the CPU is neither occupied with OpenMP synchronization (indicated in red) nor engaged in calculations (shown in green).
Upon closer inspection, each PML layer update kernel can be divided into three segments, separated by OpenMP barriers. In the initial segment, the functions `PyGILState_Ensure` and `PyEval_SaveThread` are executed (the white portion). The execution time of these functions varies significantly between threads. Threads that complete these operations early enter a spinning state (the green portion) until all threads reach the barrier, after which the actual computation commences. In the final segment, `PyEval_SaveThread` and `drop_gil` are executed.
The Cause
The Cython code for a PML kernel can be found here: `gprMax/gprMax/cython/pml_updates_magnetic_HORIPML.pyx`, lines 69 to 90 in 4d5afa8:
```cython
for i in prange(0, nx, nogil=True, schedule='static', num_threads=nthreads):
    ii = xf - (i + 1)
    RA01 = RA[0, i] - 1
    RB0 = RB[0, i]
    RE0 = RE[0, i]
    RF0 = RF[0, i]
    for j in range(0, ny):
        jj = j + ys
        for k in range(0, nz):
            kk = k + zs
            # Hy
            materialHy = ID[4, ii, jj, kk]
            dEz = (Ez[ii + 1, jj, kk] - Ez[ii, jj, kk]) / dx
            Hy[ii, jj, kk] = (Hy[ii, jj, kk] + updatecoeffsH[materialHy, 4] *
                              (RA01 * dEz + RB0 * Phi1[0, i, j, k]))
            Phi1[0, i, j, k] = RE0 * Phi1[0, i, j, k] - RF0 * dEz
            # Hz
            materialHz = ID[5, ii, jj, kk]
            dEy = (Ey[ii + 1, jj, kk] - Ey[ii, jj, kk]) / dx
            Hz[ii, jj, kk] = (Hz[ii, jj, kk] - updatecoeffsH[materialHz, 4] *
                              (RA01 * dEy + RB0 * Phi2[0, i, j, k]))
            Phi2[0, i, j, k] = RE0 * Phi2[0, i, j, k] - RF0 * dEy
```
Below is a simplified excerpt of the generated C code:
```c
#pragma omp parallel num_threads(nthreads)
{
    PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
    Py_BEGIN_ALLOW_THREADS
    #pragma omp for
    for (int i = 0; i < nx; i++) {
        // Main loop
    }
    Py_END_ALLOW_THREADS
    __Pyx_PyGILState_Release(__pyx_gilstate_save);
}
```
Each thread acquires the GIL and saves its Python thread state before the worksharing loop, then restores and releases it afterwards. This appears unnecessary, since the kernel performs no Python interaction at all.
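For context, `Py_BEGIN_ALLOW_THREADS` / `Py_END_ALLOW_THREADS` are CPython macros that expand to a thread-state save/restore pair, which is where the `PyEval_SaveThread` calls in the profile come from. With the macros expanded, each thread effectively does the following (a sketch: the expansion comes from CPython's C API, the rest mirrors the simplified excerpt above):

```c
#pragma omp parallel num_threads(nthreads)
{
    /* Acquire the GIL and ensure this OS thread has a Python thread state.
       Threads serialize here, because only one of them can hold the GIL
       at a time. */
    PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();

    /* Expansion of Py_BEGIN_ALLOW_THREADS: save the thread state and
       release the GIL again before the numerical loop. */
    PyThreadState *_save = PyEval_SaveThread();

    #pragma omp for
    for (int i = 0; i < nx; i++) {
        // Main loop: pure C array arithmetic, no Python objects involved.
    }

    /* Expansion of Py_END_ALLOW_THREADS: re-acquire the GIL and restore the
       saved thread state, followed by the matching release. */
    PyEval_RestoreThread(_save);
    __Pyx_PyGILState_Release(__pyx_gilstate_save);
}
```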
Benchmark
After manually removing all the GIL-related code from the generated C file and changing the OpenMP schedule to dynamic, the "bubbles" disappeared and CPU usage increased. Consequently, gprMax achieves improved scalability.
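For illustration, the hand-edited parallel region could look roughly like the sketch below (my reconstruction based on the simplified excerpt above, not the actual patched file; `schedule(dynamic)` stands in for the schedule change):

```c
#pragma omp parallel num_threads(nthreads)
{
    /* No per-thread GIL acquisition or thread-state save/restore:
       the region goes straight into the worksharing loop. */
    #pragma omp for schedule(dynamic)
    for (int i = 0; i < nx; i++) {
        // Main loop
    }
}
```

With the Python thread-state calls gone, threads no longer serialize on the GIL before and after the loop, which is presumably why the white regions in the profile vanish.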
Solutions Attempted
The following methods have been tried without success, as Cython continues to generate the same C code:
Wrapping the loop in an explicit `with nogil:` block:

```cython
with nogil:
    for i in prange(0, nx, nogil=False, schedule='static', num_threads=nthreads):
```

Declaring the function itself as `nogil`:

```cython
cpdef void order1_xminus(...) noexcept nogil:
    for i in prange(0, nx, nogil=False, schedule='static', num_threads=nthreads):
```