NotificationsYou must be signed in to change notification settings
Fork41
Star164

[refer #PGPRO-6599]: Avoid race conditions while processing PROFILE_R…#52

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Merged

shinderuk merged 1 commit intomasterfromPGPRO-6599

Aug 31, 2022

Merged

[refer #PGPRO-6599]: Avoid race conditions while processing PROFILE_R…#52

shinderuk merged 1 commit intomasterfromPGPRO-6599

Aug 31, 2022

Conversation

Copy link

Contributor

rzharkov commentedJul 4, 2022

…EQUEST and

PROFILE_RESET requests.

After initialization of "request" variable in collector.c main loop, another
client backend could change the value of "collector_hdr->request" variable.
Changing this value from "PROFILE_RESET" to "PROFILE_REQUEST" could cause
deadlock: "collector" processed "PROFILE_RESET" query while client backend
waits data from the "collector_mq" shared memory message queue. Now we read
and write "collector_hdr->request" variable only using "PGWS_COLLECTOR_LOCK"
lock.

Removed obsolete "read_current_wait()" function definition from
pg_wait_sampling.h.

tags: pg_wait_sampling

Copy link

Collaborator

maksm90 commentedAug 10, 2022•
edited
Loading

Changing this value from "PROFILE_RESET" to "PROFILE_REQUEST" could cause
deadlock: "collector" processed "PROFILE_RESET" query while client backend
waits data from the "collector_mq" shared memory message queue.

Not deadlock but infinite waiting of requested profile backend process caused by lostPROFILE_REQUEST request (ifreset of request variable happened after setting ofPROFILE_REQUEST value).

In general, your proposed solution (moving of request value saving under collector lock) resolves your problem. But we might end up to lostPROFILE_RESET request whenPROFILE_RESET and successive some another signal arrive before collector acquires its lock. And root cause I see in asynchronous nature ofPROFILE_RESET request: client backend just send signal (set latch) andrelease queue lock allowing other clients to perform other requests.

You might rewrite patch to include some feedback mechanics for backend performingPROFILE_RESET request, e.g., via waiting around collector's message queue, leaving the initial state of codebase (before your patch) unchanged. Or you might stick out for your solution - I figure losing ofPROFILE_RESET signal as non-critical and anyway I plan to rewrite a whole mechanics of IPC communication soon. But in this case the moving of request variable reading under collector lock is enough, other changes that incorporates this variable setting on backend's side under lock looks redundant.

Copy link

ContributorAuthor

rzharkov commentedAug 11, 2022

Making PROFILE_RESET request processing synchronous was not the goal of these patches. But this synchronization could be easily added.
There were two problems:

deadlock (stuck, enter inconsistent state with denial of service, etc.) while processing script below

psql -c "CREATE EXTENSION pg_wait_sampling"
for j in {1..100}; do
echo "iteration $j"
for i in {1..10}; do
echo "
SELECT * FROM pg_wait_sampling_get_profile();
SELECT pg_wait_sampling_reset_profile();
" | psql >psql$i.log &
done
wait
done

Coredumps while parallel processing of two "queries" regression tests with "aqo" module loaded.
aqo+pgws.txt

Copy link

Collaborator

maksm90 commentedAug 11, 2022

deadlock (stuck, enter inconsistent state with denial of service, etc.) while processing script below
psql -c "CREATE EXTENSION pg_wait_sampling"
for j in {1..100}; do
echo "iteration $j"
for i in {1..10}; do
echo "
SELECT * FROM pg_wait_sampling_get_profile();
SELECT pg_wait_sampling_reset_profile();
" | psql >psql$i.log &
done
wait
done

Yes, deadlock in your presented above scenario occurs when successive queriespg_wait_sampling_get_profile() try to acquire queue lock and get stuck due to another backend process that waits on message queue already holds this lock and collector that had lostPROFILE_REQUEST signal waits for new signals.

As I have mentioned, makingpg_wait_sampling_reset_profile() synchronous resolves this and more race condition problems. But you might leave your solution (reorder collector lock acquiring andrequest variable saving on collector side) - the losing ofPROFILE_RESET signals is not critical on current stage.

Coredumps while parallel processing of two "queries" regression tests with "aqo" module loaded.
aqo+pgws.txt

I cannot find coredump(

Copy link

Collaborator

maksm90 commentedAug 15, 2022

@rzharkov could you make necessary changes - removeLockRelease() moving inreceive_array() andpg_wait_sampling_reset_profile() functions? This moving is redundant as it doesn't protectcollector_hdr->request shared variable but just is used to wait for collector completes its communication work.

Copy link

ContributorAuthor

rzharkov commentedAug 15, 2022

Of course, I'll do it. The task is stuck, because my reviewer is on vacation :)

maksm90 mentioned this pull request

Aug 15, 2022

Synchronization with collector may be lost if backend is terminated#48

Closed

Copy link

Contributor

shinderuk commentedAug 30, 2022

could you make necessary changes - remove LockRelease() moving in receive_array() and pg_wait_sampling_reset_profile() functions? This moving is redundant as it doesn't protect collector_hdr->request shared variable but just is used to wait for collector completes its communication work.

No, it's not redundant. Consider the following interleaving:

S1 executed pg_wait_sampling_reset_profile and setcollector_hdr->request toPROFILE_RESET.
S2 is executing receive_array. It acquires the queue lock, acquires and immediately releases the collector lock...
Now collector acquires the collector lock and readsPROFILE_RESET from S1.
S2 continues and setscollector_hdr->request toPROFILE_REQUEST.
collector continues and overwritescollector_hdr->request withNO_REQUEST andPROFILE_REQUEST is lost.

The following test script hangs sooner or later ifcollector_hdr->request is not protected with the lock in receive_array:

for i in `seq 1000`; doecho $ifor j in `seq 10`; dopsql <<EOF >/dev/null &select * from pg_wait_sampling_profile;select pg_wait_sampling_reset_profile();EOFdonewaitdone

For pg_wait_sampling_reset_profile it doesn't matter, but it's better to keep both functions in sync.

Copy link

Contributor

shinderuk commentedAug 30, 2022•
edited
Loading

+extern const LOCKTAG   queueTag;+extern const LOCKTAG   collectorTag;...-extern void init_lock_tag(LOCKTAG *tag, uint32 lock);

Instead of init_lock_tag that clashes with the like named symbol in aqo, we now have two other unprefixed external symbols that may clash with other modules. I think we should prefix all external symbols with "pgws_" to be safe. And it's better to do this in a separate PR. (Roman, I'm sorry for asking you to fix both problems together. Now I see that they are unrelated.) I will split the patch.

Copy link

Collaborator

maksm90 commentedAug 30, 2022

Consider the following interleaving:

1. S1 executed pg_wait_sampling_reset_profile and set `collector_hdr->request` to `PROFILE_RESET`.2. S2 is executing receive_array. It acquires the queue lock, acquires and immediately releases the collector lock...3. Now collector acquires the collector lock and reads `PROFILE_RESET` from S1.4. S2 continues and sets `collector_hdr->request` to `PROFILE_REQUEST`.5. collector continues and overwrites `collector_hdr->request` with `NO_REQUEST` and `PROFILE_REQUEST` is lost.

Hmm, agreed. Free interference ofcollector_hdr->request setting insidereceive_array() function and non-atomic manipulation with this variable inside collector imposes concurrency issues.

Approved this fix.

@rzharkov could you rewrite your commit message so that schedules causing concurrency issue would be exposes in step-by-step manner. And this message have to include two schedules (your provided and one from@shinderuk) to motivate proposed changes: reorder request keeping after collector lock acquiring in collector process and setting request variable in regular backends under this lock.

maksm90 self-requested a review

August 30, 2022 15:46

maksm90 approved these changes

Aug 30, 2022

View reviewed changes

PGPRO-6599: Avoid race when accessing the request shared variable.

2d38a79

Consider the following sequence of events: 1. Session 1 calls pg_wait_sampling_reset_profile and sets the request shared variable to PROFILE_RESET. 2. The collector reads request and saves PROFILE_RESET to a local variable. 3. Session 2 queries pg_wait_sampling_profile, which sets request to PROFILE_REQUEST and waits for the collector in shm_mq_receive. 4. The collector continues and clears shared request, thus dropping PROFILE_REQUEST from Session 2. 5. Session 2 waits indefinitely in shm_mq_receive.A similar example with query cancellation: 1. Session 1 queries pg_wait_sampling_history and sets request to HISTORY_REQUEST. 2. Session 1 cancels the query while waiting for the collector. 3. The collector reads request and saves HISTORY_REQUEST to a local variable. 4. Session 2 queries pg_wait_sampling_profile, sets request to PROFILE_REQUEST and waits for the collector. 5. The collector continues and responds to HISTORY_REQUEST. 6. Session 2 receives history data and renders them as profile data returning invalid counts.These interleavings are avoided by acquiring the collector lock before readingrequest from shared memory in the collector. But we also need to hold thecollector lock when we set request in receive_array in a backend. Otherwise,the following interleaving is possible: 1. Session 1 calls pg_wait_sampling_reset_profile and sets request to PROFILE_RESET. 2. Session 2 queries pg_wait_sampling_profile, acquires and releases the collector lock. 3. The collector acquires the lock, reads request and saves PROFILE_RESET to a local variable. 4. Session 2 sets request to PROFILE_REQUEST. 5. The collector clears request, and PROFILE_REQUEST is lost. 6. Session 2 waits indefinitely in shm_mq_receive.Same for the second example above. This patch, however, doesn't prevent loosingPROFILE_RESET requests: 1. Session 1 calls pg_wait_sampling_reset_profile and sets request to PROFILE_RESET. 2. Session 2 queries pg_wait_sampling_profile before the collector reads request. 3. The collector reads PROFILE_REQUEST, while PROFILE_RESET is lost.To fix this, we could make pg_wait_sampling_reset_profile wait for thecollector, but we decided not to, as loosing a PROFILE_RESET isn't critical.Resolves#48.Author: Roman ZharkovReported-By: Alexander LakhinReviewed-By: Maksim Milyutin, Sergey Shinderuk

shinderuk force-pushed thePGPRO-6599 branch fromfe400d0 to2d38a79Compare

August 31, 2022 12:45

Copy link

Contributor

shinderuk commentedAug 31, 2022

@maksm90 I dropped the changes related to init_lock_tag from the patch and rewrote the commit message. Please take a look. Are you satisfied with the commit message?

Copy link

Collaborator

maksm90 commentedAug 31, 2022

I dropped the changes related to init_lock_tag from the patch

It looked as harmless refactoring. I would accepted patch with this fix.

I dropped the changes related to init_lock_tag from the patch and rewrote the commit message.

All right! Approved.

shinderuk merged commit4c1ae7a intomaster

Aug 31, 2022

shinderuk deleted the PGPRO-6599 branch

August 31, 2022 13:41

Labels

None yet

Movatterモバイル変換

[refer #PGPRO-6599]: Avoid race conditions while processing PROFILE_R…#52

[refer #PGPRO-6599]: Avoid race conditions while processing PROFILE_R…#52

Uh oh!

Conversation

rzharkov commentedJul 4, 2022

Uh oh!

maksm90 commentedAug 10, 2022• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

rzharkov commentedAug 11, 2022

Uh oh!

maksm90 commentedAug 11, 2022

Uh oh!

maksm90 commentedAug 15, 2022

Uh oh!

rzharkov commentedAug 15, 2022

Uh oh!

shinderuk commentedAug 30, 2022

Uh oh!

shinderuk commentedAug 30, 2022• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

maksm90 commentedAug 30, 2022

Uh oh!

shinderuk commentedAug 31, 2022

Uh oh!

maksm90 commentedAug 31, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

maksm90 commentedAug 10, 2022•
edited
Loading

shinderuk commentedAug 30, 2022•
edited
Loading