chore: prevent nil derefs in non-critical paths #11411


Merged
Emyrk merged 3 commits into main from stevenmasley/time_panics
Jan 4, 2024

Conversation

@Emyrk
Member

Similar PR here: https://github.com/coder/coder/pull/9892/files

Fixes: #11308

I could not reproduce this, so it must just be flaky. This obviously sidesteps the problem, because ideally we would track down why the times are nil. However, all this does is print a message with a duration to the UI. The duration is not served in any stats, or used in anything besides the build logs.

So I think this solution is fine just to prevent these panics in the future.
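
For context, the crash class being prevented is a nil *time.Time dereference when computing a stage duration. A minimal, self-contained reproduction (the struct and field names here are illustrative, not the actual SDK types):

```go
package main

import (
	"fmt"
	"time"
)

// agentTimes mimics the shape of the SDK fields involved: optional timestamps
// are pointers, so they can be nil when the server never populated them.
type agentTimes struct {
	StartedAt *time.Time
	ReadyAt   *time.Time
}

func main() {
	ready := time.Now()
	agent := agentTimes{ReadyAt: &ready} // StartedAt was never set

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered from:", r) // nil pointer dereference
		}
	}()

	// Dereferencing the nil StartedAt pointer panics; this is the class of
	// crash the PR guards against when printing stage durations.
	fmt.Println(agent.ReadyAt.Sub(*agent.StartedAt))
}
```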

@mafredri
Member

This is an interesting one. I was quite sure that this shouldn't happen, since the API is supposed to give us guarantees about which fields are available based on which lifecycle state we're in. So I did some digging, and quite a bit of head scratching later, I do think we have a bug, but it's not here.

If we look at the agent logs, it's reporting in as it should:

2023-12-21T11:27:49.6967595Z     t.go:84: 2023-12-21 11:26:24.479 [debu]  agent: set lifecycle state  current={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}  last={"state":"created","changed_at":"0001-01-01T00:00:00Z"}
2023-12-21T11:27:49.6971361Z     t.go:84: 2023-12-21 11:26:24.480 [debu]  agent: reporting lifecycle state  payload={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}
2023-12-21T11:27:49.6974135Z     t.go:84: 2023-12-21 11:26:24.480 [debu]  agent: sdk request  method=POST  url=http://localhost:64379/api/v2/workspaceagents/me/report-lifecycle  body=""
2023-12-21T11:27:49.6977363Z     t.go:84: 2023-12-21 11:26:24.480 [debu]  agent: set lifecycle state  current={"state":"ready","changed_at":"2023-12-21T11:26:24.480422Z"}  last={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}
2023-12-21T11:27:49.7020898Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  agent: sdk response  method=POST  url=http://localhost:64379/api/v2/workspaceagents/me/report-lifecycle  status=204  body=""  trace_id=""  span_id=""
2023-12-21T11:27:49.7026281Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  agent: reporting lifecycle state  payload={"state":"ready","changed_at":"2023-12-21T11:26:24.480422Z"}
2023-12-21T11:27:49.7027697Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  agent: sdk request  method=POST  url=http://localhost:64379/api/v2/workspaceagents/me/report-lifecycle  body=""
2023-12-21T11:27:49.7039717Z     t.go:84: 2023-12-21 11:26:24.487 [debu]  agent: sdk response  method=POST  url=http://localhost:64379/api/v2/workspaceagents/me/report-lifecycle  status=204  body=""  trace_id=""  span_id=""

Likewise, the API is seeing what we expect it to see:

2023-12-21T11:27:49.7003352Z     t.go:84: 2023-12-21 11:26:24.482 [debu]  workspace agent state report  workspace_agent_id=ff758f63-c35e-4767-a37e-52f9c937f835  workspace_id=c18b2771-df4b-42d5-a3f1-a014910197b8  payload={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}  request_id=3ce9c96a-feb5-4f09-803a-d3ec49016807
2023-12-21T11:27:49.7013125Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  POST  host=localhost:64379  path=/api/v2/workspaceagents/me/report-lifecycle  proto=HTTP/1.1  remote_addr=127.0.0.1  start="2023-12-21T11:26:24.4822116Z"  took=2.1672ms  status_code=204  latency_ms=2  request_id=3ce9c96a-feb5-4f09-803a-d3ec49016807
2023-12-21T11:27:49.7034778Z     t.go:84: 2023-12-21 11:26:24.485 [debu]  workspace agent state report  workspace_agent_id=e16a4d48-6057-425b-afcb-2d5ca5660e49  workspace_id=c18b2771-df4b-42d5-a3f1-a014910197b8  payload={"state":"ready","changed_at":"2023-12-21T11:26:24.480422Z"}  request_id=db18b534-cd87-48b5-96d8-fc919bb72a8e
2023-12-21T11:27:49.7037500Z     t.go:84: 2023-12-21 11:26:24.486 [debu]  POST  host=localhost:64379  path=/api/v2/workspaceagents/me/report-lifecycle  proto=HTTP/1.1  remote_addr=127.0.0.1  start="2023-12-21T11:26:24.4855906Z"  took=1.1765ms  status_code=204  latency_ms=1  request_id=db18b534-cd87-48b5-96d8-fc919bb72a8e

However, a bit later it seems like either the started_at field has not been updated, or it has been overwritten by stale data:

2023-12-21T11:27:49.7086907Z     t.go:84: 2023-12-21 11:26:24.574 [debu]  accepting agent details  request_id=1afb922e-aa48-4a48-b427-36854b703ada  agent={"id":"e16a4d48-6057-425b-afcb-2d5ca5660e49","created_at":"2023-12-21T11:26:24.483817Z","updated_at":"2023-12-21T11:26:24.483817Z","name":"example","first_connected_at":{"Time":"0001-01-01T00:00:00Z","Valid":false},"last_connected_at":{"Time":"0001-01-01T00:00:00Z","Valid":false},"disconnected_at":{"Time":"0001-01-01T00:00:00Z","Valid":false},"resource_id":"b0af077e-bc12-403b-acd7-bee021dcbac7","auth_token":"9d39a8d6-0d75-456e-90f1-546b6575a011","auth_instance_id":{"String":"","Valid":false},"architecture":"","environment_variables":{"RawMessage":null,"Valid":false},"operating_system":"","instance_metadata":{"RawMessage":null,"Valid":false},"resource_metadata":{"RawMessage":null,"Valid":false},"directory":"","version":"","last_connected_replica_id":null,"connection_timeout_seconds":0,"troubleshooting_url":"","motd_file":"","lifecycle_state":"ready","expanded_directory":"","logs_length":0,"logs_overflowed":false,"started_at":{"Time":"0001-01-01T00:00:00Z","Valid":false},"ready_at":{"Time":"2023-12-21T11:26:24.480422Z","Valid":true},"subsystems":null,"display_apps":null,"api_version":""}

As you can see, it's null/invalid even though the agent pushed the update and coderd saw it, before the update for ready.

Since this is the in-memory database, we can assume that the HTTP 204 response means the data was most definitely written. So how did we lose it? We even send lifecycle updates synchronously, so there should be no risk of start/ready concurrently competing for the update.


As for this PR, I'm not opposed to handling this so that we can guarantee there are no panics, but I wouldn't want to silently pass it off as 0, either. We could at least indicate to the user that it's a non-normal state. Maybe returning -1 from the safe func and then printing [unknown], or similar?
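
A rough sketch of that suggestion, assuming hypothetical helper names (the merged PR ultimately logs through the stage writer instead):

```go
package main

import (
	"fmt"
	"time"
)

// safeDurationOrUnknown is a hypothetical variant of the safe helper: a
// missing timestamp is signalled with -1 rather than silently reported as 0.
func safeDurationOrUnknown(a, b *time.Time) time.Duration {
	if a == nil || b == nil {
		return -1
	}
	return a.Sub(*b)
}

// formatDuration renders the -1 sentinel as "[unknown]" so the abnormal state
// stays visible to the user in the build log.
func formatDuration(d time.Duration) string {
	if d < 0 {
		return "[unknown]"
	}
	return d.Round(time.Millisecond).String()
}

func main() {
	now := time.Now()
	fmt.Println(formatDuration(safeDurationOrUnknown(&now, nil))) // [unknown]
}
```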


@Emyrk
Member (Author)

Trying to see if we just forgot to do some assignment in dbmem...

Found something unrelated: #11412

@Emyrk
Member (Author)

A duration of <0 gets overwritten here:

if duration < 0 {
	duration = 0
}

If we want to fail non-silently, we can write some line to the stage writer?

@mafredri
Member

A duration of <0 gets overwritten here:

if duration < 0 {
	duration = 0
}

Oh yeah, that was to protect against us occasionally showing negative millis.

If we want to fail non-silently, we can write some line to the stage writer?

Sure 👍🏻

Emyrk reacted with thumbs up emoji

@Emyrk
Member (Author)

@mafredri I added the logging of nil times.

@Emyrk
Member (Author)

	}
}
- sw.Complete(stage, agent.LastConnectedAt.Sub(disconnectedAt))
+ sw.Complete(stage, safeDuration(sw, agent.LastConnectedAt, agent.DisconnectedAt))
Member

This failed with the new log change because the logic changed here: we handled the nil time by checking the status before, but now we log instead.

Member

Or wait, that doesn't sound quite right.

Member

OK, third comment's the charm. You need to keep the handling at disconnectedAt := *agent.DisconnectedAt because we want the cached disconnect value after we've reconnected.

Member

Pushed a fix.

// A panic has been observed in a test.
func safeDuration(sw *stageWriter, a, b *time.Time) time.Duration {
	if a == nil || b == nil {
		if sw != nil {
Member

No need for sw to be an optional param here, IMO 🤔 (i.e. no need to check).

Member (Author)

I just really didn't want to accidentally introduce a new place to deref a nil, lol.
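
Putting the pieces of the thread together, a self-contained sketch of what the helper plus its logging could look like; the stageWriter here is a stand-in with only a Log method, not the actual cliui type, and the exact warning text is an assumption rather than the merged code:

```go
package main

import (
	"fmt"
	"io"
	"os"
	"time"
)

// stageWriter is a stand-in for the CLI's stage writer; only the single method
// used by this sketch is defined.
type stageWriter struct{ w io.Writer }

// Log writes one informational line for the current stage.
func (s *stageWriter) Log(line string) { fmt.Fprintln(s.w, line) }

// safeDuration computes a.Sub(*b) but tolerates nil timestamps: it logs a
// warning to the stage writer (if one is provided) and falls back to 0 so the
// caller never dereferences a nil *time.Time.
func safeDuration(sw *stageWriter, a, b *time.Time) time.Duration {
	if a == nil || b == nil {
		if sw != nil {
			sw.Log("WARNING: a timestamp was unexpectedly nil, reporting a zero duration")
		}
		return 0
	}
	return a.Sub(*b)
}

func main() {
	sw := &stageWriter{w: os.Stdout}
	now := time.Now()
	fmt.Println(safeDuration(sw, &now, nil)) // logs the warning, prints 0s
}
```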

@Emyrk merged commit c6366e5 into main Jan 4, 2024
@Emyrk deleted the stevenmasley/time_panics branch January 4, 2024 20:51
@github-actions bot locked and limited conversation to collaborators Jan 4, 2024
@spikecurtis
Contributor

This smells like DB corruption caused by a read-modify-update cycle that isn't protected by a transaction.

@spikecurtis
Contributor

Ok, I have a theory:

check out the timing of the API logs:

2023-12-21T11:27:49.7003352Z     t.go:84: 2023-12-21 11:26:24.482 [debu]  workspace agent state report  workspace_agent_id=ff758f63-c35e-4767-a37e-52f9c937f835  workspace_id=c18b2771-df4b-42d5-a3f1-a014910197b8  payload={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}  request_id=3ce9c96a-feb5-4f09-803a-d3ec49016807
2023-12-21T11:27:49.7013125Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  POST  host=localhost:64379  path=/api/v2/workspaceagents/me/report-lifecycle  proto=HTTP/1.1  remote_addr=127.0.0.1  start="2023-12-21T11:26:24.4822116Z"  took=2.1672ms  status_code=204  latency_ms=2  request_id=3ce9c96a-feb5-4f09-803a-d3ec49016807
2023-12-21T11:27:49.7034778Z     t.go:84: 2023-12-21 11:26:24.485 [debu]  workspace agent state report  workspace_agent_id=e16a4d48-6057-425b-afcb-2d5ca5660e49  workspace_id=c18b2771-df4b-42d5-a3f1-a014910197b8  payload={"state":"ready","changed_at":"2023-12-21T11:26:24.480422Z"}  request_id=db18b534-cd87-48b5-96d8-fc919bb72a8e
2023-12-21T11:27:49.7037500Z     t.go:84: 2023-12-21 11:26:24.486 [debu]  POST  host=localhost:64379  path=/api/v2/workspaceagents/me/report-lifecycle  proto=HTTP/1.1  remote_addr=127.0.0.1  start="2023-12-21T11:26:24.4855906Z"  took=1.1765ms  status_code=204  latency_ms=1  request_id=db18b534-cd87-48b5-96d8-fc919bb72a8e

The API call to set the lifecycle to "starting" begins at 2023-12-21T11:26:24.4822116Z

However, the "created_at" time of the agent row is: 2023-12-21T11:26:24.483817Z, that is about a millisecond after the "starting" lifecycle.

Then finally the "ready" lifecycle update comes in at 2023-12-21T11:26:24.4855906Z

So, the "starting" lifecycle update was processedbefore the agent row was created, which explains why the data is missing.

But, the "starting" lifecycle update was successful, which doesn't make sense if the agent row didn't exist. However, the agent doesn't actually know it's own ID, it uses/api/v2/workspaceagents/me/report-lifecycle to report lifecycle, and coderd looks up the agent ID using the auth token. In the test that failed, we create the workspace, then stop it before running the SSH command which then starts a new build. Crucially, we hard-code the authentication token in the template version, which is a limitation of the Echo provisioner. So, when the "starting" lifecycle change was POSTed, Coderd found the agent from the previous build and set the time there.

Then later, when "ready" was posted, the new agent row existed, and so Coderd found the new agent (it always looks for the latest matching build for the token).


Some take-aways:

  1. The Coder API doesn't enforce that "ready" comes after "starting" in the lifecycle; we rely on the Agent doing the right thing. Bugs in the Agent shouldn't create panics elsewhere, so the API needs to enforce the rules about consistent state (e.g. that "ready" implies started_at is valid, and/or that "ready" comes after "starting"). Also, the Agent and its token are under end-user control.

  2. Token uniqueness with respect to agents is assumed by the way our API works (i.e. paths like /api/v2/workspaceagents/me/report-lifecycle), but the Echo provisioner doesn't create unique tokens and we don't enforce uniqueness in the database. @johnstcn once tried to enforce this at the database layer, found a bunch of tests broke, and gave up. But this bug is evidence that there are lurking issues with this oversight.

  3. The in-mem database doesn't enforce uniqueness of agent IDs. I don't think that contributed directly to this bug, but I noticed it and went down a rabbit hole wondering if we were inserting the same agent ID more than once. This is probably endemic to many other types of data in the fake database (a rough sketch of such a uniqueness check follows below).
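
As a rough illustration of takeaway 3 (toy types only, not dbmem's actual structures or method names), a fake-database insert could reject duplicate agent IDs and duplicate auth tokens under the store's lock:

```go
package fakedb

import (
	"fmt"
	"sync"
)

// fakeAgent is a toy stand-in for a workspace agent row.
type fakeAgent struct {
	ID        string
	AuthToken string
}

// fakeStore is a toy stand-in for the in-memory test database.
type fakeStore struct {
	mu     sync.Mutex
	agents []fakeAgent
}

// insertAgent enforces uniqueness of agent IDs and auth tokens instead of
// silently storing duplicates, so provisioner/template bugs surface in tests.
func (s *fakeStore) insertAgent(a fakeAgent) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, existing := range s.agents {
		if existing.ID == a.ID {
			return fmt.Errorf("agent ID %s already exists", a.ID)
		}
		if existing.AuthToken == a.AuthToken {
			return fmt.Errorf("auth token already in use by agent %s", existing.ID)
		}
	}
	s.agents = append(s.agents, a)
	return nil
}
```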

@spikecurtis
Contributor

Oh, also, https://github.com/coder/coder/blame/58873fa7e2e90cb026938b4cb4a05bcc7007b3bf/coderd/workspaceagents.go#L1755

does suffer from read-modify-update outside a transaction. With a well-behaved agent, calls to this API are serialized, but again, a buggy agent shouldn't cause panics in other components. And a poorly made template could create two instances of the same agent with the same token, so the assumption that calls are serialized is dangerous.
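
Sketching the general fix for the read-modify-update issue with a hypothetical store interface (coderd's real querier and transaction helpers have different signatures), the read and the write would live inside one transaction so concurrent lifecycle reports cannot interleave:

```go
package lifecycle

import (
	"fmt"
	"time"
)

// agentStore is a hypothetical interface; InTx is assumed to run fn atomically
// so that two reports for the same agent serialize against each other.
type agentStore interface {
	InTx(fn func(tx agentStore) error) error
	GetAgentLifecycle(agentID string) (state string, startedAt *time.Time, err error)
	UpdateAgentLifecycle(agentID, state string, startedAt, readyAt *time.Time) error
}

// reportLifecycle applies one lifecycle update atomically and enforces the
// state machine, so a buggy or malicious agent cannot leave started_at unset
// while ready_at is populated.
func reportLifecycle(db agentStore, agentID, newState string, changedAt time.Time) error {
	return db.InTx(func(tx agentStore) error {
		_, startedAt, err := tx.GetAgentLifecycle(agentID)
		if err != nil {
			return err
		}
		var readyAt *time.Time
		switch newState {
		case "starting":
			startedAt = &changedAt
		case "ready":
			if startedAt == nil {
				return fmt.Errorf("agent %s reported ready before starting", agentID)
			}
			readyAt = &changedAt
		}
		return tx.UpdateAgentLifecycle(agentID, newState, startedAt, readyAt)
	})
}
```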


Reviewers

@mafredri approved these changes

Assignees

@Emyrk


Development

Successfully merging this pull request may close these issues.

test flake: TestSSH/StartStoppedWorkspace

3 participants

@Emyrk, @mafredri, @spikecurtis
