Commit ee32782

Fix postmaster state machine to handle dead_end child crashes better.
A report from Alvaro Herrera shows that if we're in PM_STARTUP state, and we spawn a dead_end child to reject some incoming connection request, and that child dies with an unexpected exit code, the postmaster does not respond well. We correctly send SIGQUIT to the startup process, but then:

* if the startup process exits with nonzero exit code, as expected, we thought that that indicated a crash and aborted startup.

* if the startup process exits with zero exit code, which is possible due to the inherent race condition, we'd advance to PM_RUN state which is fine --- but the code forgot that AbortStartTime would be nonzero in this situation. We'd either die on the Asserts saying that it was zero, or perhaps misbehave later on. (A quick look suggests that the only misbehavior might be busy-waiting due to DetermineSleepTime doing the wrong thing.)

To fix the first point, adjust the state-machine logic to recognize that a nonzero exit code is expected after sending SIGQUIT, and have it transition to a state where we can restart the startup process. To fix the second point, change the Asserts to clear the variable rather than just claiming it should be clear already.

Perhaps we could improve this further by not treating a crash of a dead_end child as a reason for panic'ing the database. However, since those child processes are connected to shared memory, that seems a bit risky. There are few good reasons for a dead_end child to report failure anyway (the cause of this in Alvaro's report is quite unclear). On balance, therefore, a minimal fix seems best.

This is an oversight in commit 45811be. While that was back-patched, I'm hesitant to back-patch this change. The lack of reasons for a dead_end child to fail suggests that the case should be very rare in the field, which squares with the lack of reports; so it seems like this might not be worth the risk of introducing new issues. In any case we can let it bake awhile in HEAD before considering a back-patch.

Discussion: https://postgr.es/m/20190615160950.GA31378@alvherre.pgsql
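The two fixes described above can be sketched as a toy model of the startup-process branch of reaper(). This is a simplified illustration and not the postmaster code itself: the `Toy*` types, the `toy_reap_startup` function, and the `TOY_PM_ABORTED` state (standing in for the postmaster exiting) are hypothetical names invented for this sketch, and crash handling for other children is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the postmaster's state variables.
 * TOY_PM_ABORTED models "log the failure and exit the postmaster". */
typedef enum
{
    TOY_PM_STARTUP,
    TOY_PM_WAIT_BACKENDS,
    TOY_PM_RUN,
    TOY_PM_ABORTED
} ToyPMState;

typedef enum
{
    TOY_STARTUP_NOT_RUNNING,
    TOY_STARTUP_RUNNING,
    TOY_STARTUP_SIGNALED,
    TOY_STARTUP_CRASHED
} ToyStartupStatus;

typedef struct
{
    ToyPMState       pm_state;
    ToyStartupStatus startup_status;
    long             abort_start_time;  /* nonzero once SIGQUIT was sent */
    bool             fatal_error;
} ToyPostmaster;

/* React to the startup process exiting with the given exit status. */
void
toy_reap_startup(ToyPostmaster *pm, int exitstatus)
{
    assert(pm != NULL);

    /* Unexpected failure during startup: give up, unless we caused the
     * failure ourselves by sending SIGQUIT (first fix). */
    if (pm->pm_state == TOY_PM_STARTUP &&
        pm->startup_status != TOY_STARTUP_SIGNALED &&
        exitstatus != 0)
    {
        pm->pm_state = TOY_PM_ABORTED;
        return;
    }

    if (exitstatus != 0)
    {
        if (pm->startup_status == TOY_STARTUP_SIGNALED)
        {
            /* Died on our SIGQUIT: move to a state that permits a restart. */
            pm->startup_status = TOY_STARTUP_NOT_RUNNING;
            if (pm->pm_state == TOY_PM_STARTUP)
                pm->pm_state = TOY_PM_WAIT_BACKENDS;
        }
        else
            pm->startup_status = TOY_STARTUP_CRASHED;
        return;
    }

    /* Clean exit, possibly because SIGQUIT arrived too late: clear the
     * abort timer instead of asserting that it is zero (second fix). */
    pm->startup_status = TOY_STARTUP_NOT_RUNNING;
    pm->fatal_error = false;
    pm->abort_start_time = 0;
    pm->pm_state = TOY_PM_RUN;
}
```

In this model, a signaled startup process that exits nonzero leaves the state at `TOY_PM_WAIT_BACKENDS`, while one that wins the race and exits zero reaches `TOY_PM_RUN` with the timer cleared.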
1 parent 348778d commit ee32782

1 file changed: +19 -4 lines changed

src/backend/postmaster/postmaster.c

Lines changed: 19 additions & 4 deletions
@@ -2920,7 +2920,9 @@ reaper(SIGNAL_ARGS)
              * during PM_STARTUP is treated as catastrophic. There are no
              * other processes running yet, so we can just exit.
              */
-            if (pmState == PM_STARTUP && !EXIT_STATUS_0(exitstatus))
+            if (pmState == PM_STARTUP &&
+                StartupStatus != STARTUP_SIGNALED &&
+                !EXIT_STATUS_0(exitstatus))
             {
                 LogChildExit(LOG, _("startup process"),
                              pid, exitstatus);
@@ -2937,11 +2939,24 @@ reaper(SIGNAL_ARGS)
              * then we previously sent the startup process a SIGQUIT; so
              * that's probably the reason it died, and we do want to try to
              * restart in that case.
+             *
+             * This stanza also handles the case where we sent a SIGQUIT
+             * during PM_STARTUP due to some dead_end child crashing: in that
+             * situation, if the startup process dies on the SIGQUIT, we need
+             * to transition to PM_WAIT_BACKENDS state which will allow
+             * PostmasterStateMachine to restart the startup process. (On the
+             * other hand, the startup process might complete normally, if we
+             * were too late with the SIGQUIT. In that case we'll fall
+             * through and commence normal operations.)
              */
             if (!EXIT_STATUS_0(exitstatus))
             {
                 if (StartupStatus == STARTUP_SIGNALED)
+                {
                     StartupStatus = STARTUP_NOT_RUNNING;
+                    if (pmState == PM_STARTUP)
+                        pmState = PM_WAIT_BACKENDS;
+                }
                 else
                     StartupStatus = STARTUP_CRASHED;
                 HandleChildCrash(pid, exitstatus,
@@ -2954,7 +2969,7 @@ reaper(SIGNAL_ARGS)
              */
             StartupStatus = STARTUP_NOT_RUNNING;
             FatalError = false;
-            Assert(AbortStartTime == 0);
+            AbortStartTime = 0;
             ReachedNormalRunning = true;
             pmState = PM_RUN;

@@ -3504,7 +3519,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
     if (pid == StartupPID)
     {
         StartupPID = 0;
-        StartupStatus = STARTUP_CRASHED;
+        /* Caller adjusts StartupStatus, so don't touch it here */
     }
     else if (StartupPID != 0 && take_action)
     {
@@ -5100,7 +5115,7 @@ sigusr1_handler(SIGNAL_ARGS)
     {
         /* WAL redo has started. We're out of reinitialization. */
         FatalError = false;
-        Assert(AbortStartTime == 0);
+        AbortStartTime = 0;

         /*
          * Crank up the background tasks. It doesn't matter if this fails,
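
The commit message suggests that a stale nonzero AbortStartTime might cause busy-waiting in DetermineSleepTime. The sketch below shows one way that could happen, under an assumption the diff does not show: that the sleep time is computed as a countdown from the moment the abort began and is clamped at zero. The function `toy_sleep_secs` and both constants are hypothetical and are not taken from postmaster.c.

```c
#include <assert.h>

/* Hypothetical constants, chosen only for illustration. */
#define TOY_KILL_AFTER_SECS  5   /* grace period after SIGQUIT */
#define TOY_IDLE_SLEEP_SECS  60  /* normal sleep when nothing is pending */

/* Return how long the main loop should sleep, in seconds.
 * abort_start_time == 0 means no abort is in progress. */
long
toy_sleep_secs(long abort_start_time, long now)
{
    assert(now >= abort_start_time);

    if (abort_start_time != 0)
    {
        /* Count down the remaining grace period, never going negative. */
        long left = TOY_KILL_AFTER_SECS - (now - abort_start_time);

        return left > 0 ? left : 0;
    }
    return TOY_IDLE_SLEEP_SECS;
}
```

Under that assumption, a timer left set long after the abort ended yields a sleep of zero on every call, so the loop spins. Resetting the variable to zero on the transition to normal running, as the patch does, restores the idle sleep.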
