chore: prevent nil derefs in non-critical paths #11411


Merged
Emyrk merged 3 commits into main from stevenmasley/time_panics
Jan 4, 2024

Conversation

@Emyrk
Member

Similar PR here: https://github.com/coder/coder/pull/9892/files

Fixes: #11308

I could not reproduce this, so it must just be flaky. This obviously sidesteps the problem, because ideally we would track down why the times are nil. However, all this does is print a message with a duration to the UI. The duration is not served in any stats, or used in anything besides the build logs.

So I think this solution is fine just to prevent these panics in the future.
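
For context, the crash class being prevented is a nil *time.Time dereference when computing a stage duration. A minimal, self-contained reproduction (the struct and field names here are illustrative, not the actual SDK types):

```go
package main

import (
	"fmt"
	"time"
)

// agentTimes mimics the shape of the SDK fields involved: optional timestamps
// are pointers, so they can be nil when the server never populated them.
type agentTimes struct {
	StartedAt *time.Time
	ReadyAt   *time.Time
}

func main() {
	ready := time.Now()
	agent := agentTimes{ReadyAt: &ready} // StartedAt was never set

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered from:", r) // nil pointer dereference
		}
	}()

	// Dereferencing the nil StartedAt pointer panics; this is the class of
	// crash the PR guards against when printing stage durations.
	fmt.Println(agent.ReadyAt.Sub(*agent.StartedAt))
}
```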

@mafredri
Member

This is an interesting one. I was quite sure that this shouldn't happen, since the API is supposed to give us guarantees about which fields are available based on which lifecycle state we're in. So I did some digging, and quite a bit of head scratching later, I do think we have a bug, but it's not here.

If we look at the agent logs, it's reporting in as it should:

2023-12-21T11:27:49.6967595Z     t.go:84: 2023-12-21 11:26:24.479 [debu]  agent: set lifecycle state  current={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}  last={"state":"created","changed_at":"0001-01-01T00:00:00Z"}
2023-12-21T11:27:49.6971361Z     t.go:84: 2023-12-21 11:26:24.480 [debu]  agent: reporting lifecycle state  payload={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}
2023-12-21T11:27:49.6974135Z     t.go:84: 2023-12-21 11:26:24.480 [debu]  agent: sdk request  method=POST  url=http://localhost:64379/api/v2/workspaceagents/me/report-lifecycle  body=""
2023-12-21T11:27:49.6977363Z     t.go:84: 2023-12-21 11:26:24.480 [debu]  agent: set lifecycle state  current={"state":"ready","changed_at":"2023-12-21T11:26:24.480422Z"}  last={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}
2023-12-21T11:27:49.7020898Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  agent: sdk response  method=POST  url=http://localhost:64379/api/v2/workspaceagents/me/report-lifecycle  status=204  body=""  trace_id=""  span_id=""
2023-12-21T11:27:49.7026281Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  agent: reporting lifecycle state  payload={"state":"ready","changed_at":"2023-12-21T11:26:24.480422Z"}
2023-12-21T11:27:49.7027697Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  agent: sdk request  method=POST  url=http://localhost:64379/api/v2/workspaceagents/me/report-lifecycle  body=""
2023-12-21T11:27:49.7039717Z     t.go:84: 2023-12-21 11:26:24.487 [debu]  agent: sdk response  method=POST  url=http://localhost:64379/api/v2/workspaceagents/me/report-lifecycle  status=204  body=""  trace_id=""  span_id=""

Likewise, the API is seeing what we expect it to see:

2023-12-21T11:27:49.7003352Z     t.go:84: 2023-12-21 11:26:24.482 [debu]  workspace agent state report  workspace_agent_id=ff758f63-c35e-4767-a37e-52f9c937f835  workspace_id=c18b2771-df4b-42d5-a3f1-a014910197b8  payload={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}  request_id=3ce9c96a-feb5-4f09-803a-d3ec49016807
2023-12-21T11:27:49.7013125Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  POST  host=localhost:64379  path=/api/v2/workspaceagents/me/report-lifecycle  proto=HTTP/1.1  remote_addr=127.0.0.1  start="2023-12-21T11:26:24.4822116Z"  took=2.1672ms  status_code=204  latency_ms=2  request_id=3ce9c96a-feb5-4f09-803a-d3ec49016807
2023-12-21T11:27:49.7034778Z     t.go:84: 2023-12-21 11:26:24.485 [debu]  workspace agent state report  workspace_agent_id=e16a4d48-6057-425b-afcb-2d5ca5660e49  workspace_id=c18b2771-df4b-42d5-a3f1-a014910197b8  payload={"state":"ready","changed_at":"2023-12-21T11:26:24.480422Z"}  request_id=db18b534-cd87-48b5-96d8-fc919bb72a8e
2023-12-21T11:27:49.7037500Z     t.go:84: 2023-12-21 11:26:24.486 [debu]  POST  host=localhost:64379  path=/api/v2/workspaceagents/me/report-lifecycle  proto=HTTP/1.1  remote_addr=127.0.0.1  start="2023-12-21T11:26:24.4855906Z"  took=1.1765ms  status_code=204  latency_ms=1  request_id=db18b534-cd87-48b5-96d8-fc919bb72a8e

However, a bit later it seems like either the started_at field has not been updated, or it has been overwritten by stale data:

2023-12-21T11:27:49.7086907Z     t.go:84: 2023-12-21 11:26:24.574 [debu]  accepting agent details  request_id=1afb922e-aa48-4a48-b427-36854b703ada  agent={"id":"e16a4d48-6057-425b-afcb-2d5ca5660e49","created_at":"2023-12-21T11:26:24.483817Z","updated_at":"2023-12-21T11:26:24.483817Z","name":"example","first_connected_at":{"Time":"0001-01-01T00:00:00Z","Valid":false},"last_connected_at":{"Time":"0001-01-01T00:00:00Z","Valid":false},"disconnected_at":{"Time":"0001-01-01T00:00:00Z","Valid":false},"resource_id":"b0af077e-bc12-403b-acd7-bee021dcbac7","auth_token":"9d39a8d6-0d75-456e-90f1-546b6575a011","auth_instance_id":{"String":"","Valid":false},"architecture":"","environment_variables":{"RawMessage":null,"Valid":false},"operating_system":"","instance_metadata":{"RawMessage":null,"Valid":false},"resource_metadata":{"RawMessage":null,"Valid":false},"directory":"","version":"","last_connected_replica_id":null,"connection_timeout_seconds":0,"troubleshooting_url":"","motd_file":"","lifecycle_state":"ready","expanded_directory":"","logs_length":0,"logs_overflowed":false,"started_at":{"Time":"0001-01-01T00:00:00Z","Valid":false},"ready_at":{"Time":"2023-12-21T11:26:24.480422Z","Valid":true},"subsystems":null,"display_apps":null,"api_version":""}

As you can see, it's null/invalid even though the agent pushed the update and coderd saw it, before the update for ready.

Since this is the in-memory database, we can assume that the HTTP 204 response means the data was most definitely written. So how did we lose it? We even send lifecycle updates synchronously, so there should be no risk of start/ready concurrently competing for the update.


As for this PR, I'm not opposed to handling this so that we can guarantee there are no panics, but I wouldn't want to silently pass it off as 0, either. We could at least indicate to the user that it's a non-normal state. Maybe returning -1 from the safe func and then printing [unknown], or similar?
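
A rough sketch of that suggestion, assuming hypothetical helper names (the merged PR ultimately logs through the stage writer instead):

```go
package main

import (
	"fmt"
	"time"
)

// safeDurationOrUnknown is a hypothetical variant of the safe helper: a
// missing timestamp is signalled with -1 rather than silently reported as 0.
func safeDurationOrUnknown(a, b *time.Time) time.Duration {
	if a == nil || b == nil {
		return -1
	}
	return a.Sub(*b)
}

// formatDuration renders the -1 sentinel as "[unknown]" so the abnormal state
// stays visible to the user in the build log.
func formatDuration(d time.Duration) string {
	if d < 0 {
		return "[unknown]"
	}
	return d.Round(time.Millisecond).String()
}

func main() {
	now := time.Now()
	fmt.Println(formatDuration(safeDurationOrUnknown(&now, nil))) // [unknown]
}
```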


@Emyrk
Member (Author)

Trying to see if we just forgot to do some assignment in dbmem...

Found something unrelated: #11412

@Emyrk
Member (Author)

A duration of <0 gets overwritten here:

if duration < 0 {
	duration = 0
}

If we want to fail non-silently, we can write some line to the stage writer?

@mafredri
Member

A duration of <0 gets overwritten here:

if duration < 0 {
	duration = 0
}

Oh yeah, that was to protect against us occasionally showing negative millis.

If we want to fail non-silently, we can write some line to the stage writer?

Sure 👍🏻

Emyrk reacted with thumbs up emoji

@Emyrk
Member (Author)

@mafredri I added the logging of nil times.

@Emyrk
Member (Author)

	}
}
- sw.Complete(stage, agent.LastConnectedAt.Sub(disconnectedAt))
+ sw.Complete(stage, safeDuration(sw, agent.LastConnectedAt, agent.DisconnectedAt))
Member

This failed with the new log change because the logic changed here: we handled the nil time by checking the status before, but now we log instead.

Member

Or wait, that doesn't sound quite right.

Member

OK, third comment's the charm. You need to keep the handling at disconnectedAt := *agent.DisconnectedAt because we want the cached disconnect value after we've reconnected.

Member

Pushed a fix.

// A panic has been observed in a test.
func safeDuration(sw *stageWriter, a, b *time.Time) time.Duration {
	if a == nil || b == nil {
		if sw != nil {
Member

No need for sw to be an optional param here, IMO 🤔 (i.e. no need to check).

Member (Author)

I just really didn't want to accidentally introduce a new place to deref a nil, lol.
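
Putting the pieces of the thread together, a self-contained sketch of what the helper plus its logging could look like; the stageWriter here is a stand-in with only a Log method, not the actual cliui type, and the exact warning text is an assumption rather than the merged code:

```go
package main

import (
	"fmt"
	"io"
	"os"
	"time"
)

// stageWriter is a stand-in for the CLI's stage writer; only the single method
// used by this sketch is defined.
type stageWriter struct{ w io.Writer }

// Log writes one informational line for the current stage.
func (s *stageWriter) Log(line string) { fmt.Fprintln(s.w, line) }

// safeDuration computes a.Sub(*b) but tolerates nil timestamps: it logs a
// warning to the stage writer (if one is provided) and falls back to 0 so the
// caller never dereferences a nil *time.Time.
func safeDuration(sw *stageWriter, a, b *time.Time) time.Duration {
	if a == nil || b == nil {
		if sw != nil {
			sw.Log("WARNING: a timestamp was unexpectedly nil, reporting a zero duration")
		}
		return 0
	}
	return a.Sub(*b)
}

func main() {
	sw := &stageWriter{w: os.Stdout}
	now := time.Now()
	fmt.Println(safeDuration(sw, &now, nil)) // logs the warning, prints 0s
}
```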

@Emyrk merged commit c6366e5 into main Jan 4, 2024
@Emyrk deleted the stevenmasley/time_panics branch January 4, 2024 20:51
@github-actions bot locked and limited conversation to collaborators Jan 4, 2024
@spikecurtis
Contributor

This smells like DB corruption caused by a read-modify-update cycle that isn't protected by a transaction.

@spikecurtis
Contributor

Ok, I have a theory:

check out the timing of the API logs:

2023-12-21T11:27:49.7003352Z     t.go:84: 2023-12-21 11:26:24.482 [debu]  workspace agent state report  workspace_agent_id=ff758f63-c35e-4767-a37e-52f9c937f835  workspace_id=c18b2771-df4b-42d5-a3f1-a014910197b8  payload={"state":"starting","changed_at":"2023-12-21T11:26:24.479014Z"}  request_id=3ce9c96a-feb5-4f09-803a-d3ec49016807
2023-12-21T11:27:49.7013125Z     t.go:84: 2023-12-21 11:26:24.484 [debu]  POST  host=localhost:64379  path=/api/v2/workspaceagents/me/report-lifecycle  proto=HTTP/1.1  remote_addr=127.0.0.1  start="2023-12-21T11:26:24.4822116Z"  took=2.1672ms  status_code=204  latency_ms=2  request_id=3ce9c96a-feb5-4f09-803a-d3ec49016807
2023-12-21T11:27:49.7034778Z     t.go:84: 2023-12-21 11:26:24.485 [debu]  workspace agent state report  workspace_agent_id=e16a4d48-6057-425b-afcb-2d5ca5660e49  workspace_id=c18b2771-df4b-42d5-a3f1-a014910197b8  payload={"state":"ready","changed_at":"2023-12-21T11:26:24.480422Z"}  request_id=db18b534-cd87-48b5-96d8-fc919bb72a8e
2023-12-21T11:27:49.7037500Z     t.go:84: 2023-12-21 11:26:24.486 [debu]  POST  host=localhost:64379  path=/api/v2/workspaceagents/me/report-lifecycle  proto=HTTP/1.1  remote_addr=127.0.0.1  start="2023-12-21T11:26:24.4855906Z"  took=1.1765ms  status_code=204  latency_ms=1  request_id=db18b534-cd87-48b5-96d8-fc919bb72a8e

The API call to set the lifecycle to "starting" begins at 2023-12-21T11:26:24.4822116Z

However, the "created_at" time of the agent row is: 2023-12-21T11:26:24.483817Z, that is about a millisecond after the "starting" lifecycle.

Then finally the "ready" lifecycle update comes in at 2023-12-21T11:26:24.4855906Z

So, the "starting" lifecycle update was processedbefore the agent row was created, which explains why the data is missing.

But, the "starting" lifecycle update was successful, which doesn't make sense if the agent row didn't exist. However, the agent doesn't actually know it's own ID, it uses/api/v2/workspaceagents/me/report-lifecycle to report lifecycle, and coderd looks up the agent ID using the auth token. In the test that failed, we create the workspace, then stop it before running the SSH command which then starts a new build. Crucially, we hard-code the authentication token in the template version, which is a limitation of the Echo provisioner. So, when the "starting" lifecycle change was POSTed, Coderd found the agent from the previous build and set the time there.

Then later, when "ready" was posted, the new agent row existed, and so Coderd found the new agent (it always looks for the latest matching build for the token).


Some take-aways:

  1. The Coder API doesn't enforce that "ready" comes after "starting" in the lifecycle; we rely on the Agent doing the right thing. Bugs in the Agent shouldn't create panics elsewhere, so the API needs to enforce the rules about consistent state (e.g. that "ready" implies started_at is valid, and/or that "ready" comes after "starting"). Also, the Agent and its token are under end-user control.

  2. Token uniqueness with respect to agents is assumed by the way our API works (i.e. paths like /api/v2/workspaceagents/me/report-lifecycle), but the Echo provisioner doesn't create unique tokens and we don't enforce uniqueness in the database. @johnstcn once tried to enforce this at the database layer, found a bunch of tests broke, and gave up. But this bug is evidence that there are lurking issues with this oversight.

  3. The in-mem database doesn't enforce uniqueness of agent IDs. I don't think that contributed directly to this bug, but I noticed it and went down a rabbit hole wondering if we were inserting the same agent ID more than once. This is probably endemic to many other types of data in the fake database (a rough sketch of such a uniqueness check follows below).
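
As a rough illustration of takeaway 3 (toy types only, not dbmem's actual structures or method names), a fake-database insert could reject duplicate agent IDs and duplicate auth tokens under the store's lock:

```go
package fakedb

import (
	"fmt"
	"sync"
)

// fakeAgent is a toy stand-in for a workspace agent row.
type fakeAgent struct {
	ID        string
	AuthToken string
}

// fakeStore is a toy stand-in for the in-memory test database.
type fakeStore struct {
	mu     sync.Mutex
	agents []fakeAgent
}

// insertAgent enforces uniqueness of agent IDs and auth tokens instead of
// silently storing duplicates, so provisioner/template bugs surface in tests.
func (s *fakeStore) insertAgent(a fakeAgent) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, existing := range s.agents {
		if existing.ID == a.ID {
			return fmt.Errorf("agent ID %s already exists", a.ID)
		}
		if existing.AuthToken == a.AuthToken {
			return fmt.Errorf("auth token already in use by agent %s", existing.ID)
		}
	}
	s.agents = append(s.agents, a)
	return nil
}
```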

@spikecurtis
Contributor

Oh, also, https://github.com/coder/coder/blame/58873fa7e2e90cb026938b4cb4a05bcc7007b3bf/coderd/workspaceagents.go#L1755

does suffer from read-modify-update outside a transaction. With a well-behaved agent, calls to this API are serialized, but again, a buggy agent shouldn't cause panics in other components. And a poorly made template could create two instances of the same agent with the same token, so the assumption that calls are serialized is dangerous.
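
Sketching the general fix for the read-modify-update issue with a hypothetical store interface (coderd's real querier and transaction helpers have different signatures), the read and the write would live inside one transaction so concurrent lifecycle reports cannot interleave:

```go
package lifecycle

import (
	"fmt"
	"time"
)

// agentStore is a hypothetical interface; InTx is assumed to run fn atomically
// so that two reports for the same agent serialize against each other.
type agentStore interface {
	InTx(fn func(tx agentStore) error) error
	GetAgentLifecycle(agentID string) (state string, startedAt *time.Time, err error)
	UpdateAgentLifecycle(agentID, state string, startedAt, readyAt *time.Time) error
}

// reportLifecycle applies one lifecycle update atomically and enforces the
// state machine, so a buggy or malicious agent cannot leave started_at unset
// while ready_at is populated.
func reportLifecycle(db agentStore, agentID, newState string, changedAt time.Time) error {
	return db.InTx(func(tx agentStore) error {
		_, startedAt, err := tx.GetAgentLifecycle(agentID)
		if err != nil {
			return err
		}
		var readyAt *time.Time
		switch newState {
		case "starting":
			startedAt = &changedAt
		case "ready":
			if startedAt == nil {
				return fmt.Errorf("agent %s reported ready before starting", agentID)
			}
			readyAt = &changedAt
		}
		return tx.UpdateAgentLifecycle(agentID, newState, startedAt, readyAt)
	})
}
```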


Reviewers

@mafredri approved these changes

Assignees

@Emyrk


Development

Successfully merging this pull request may close these issues.

test flake: TestSSH/StartStoppedWorkspace

3 participants

@Emyrk, @mafredri, @spikecurtis
