This change runs the downloaded binary with a --version flag and checks whether the command responds with text that matches Coder. This is not the strictest of checks, but it's the most pragmatic in terms of backwards-compatibility (i.e. if we added a new --verify command, some agents would not have this yet by definition).

We considered using SHA1 hash comparisons but that was a heavier lift and gets us effectively to the same point.
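
The check described above can be sketched roughly as follows. This is a minimal illustration, not the script from the PR: the helper name verify_agent_binary is hypothetical, and the error messages and exit code are modelled on the Linux test trace quoted later in this conversation.

```shell
#!/usr/bin/env sh
# Sketch only: verify_agent_binary is a hypothetical helper name.
# It runs "<binary> --version", keeps the first line of output,
# and checks that the line mentions "Coder".
verify_agent_binary() {
	binary="$1"
	output=$("$binary" --version | head -n1)
	if ! echo "$output" | grep -q Coder; then
		echo >&2 "ERROR: Downloaded agent binary returned unexpected version output"
		echo >&2 "$binary --version output: \"$output\""
		return 2
	fi
}
```

A caller would run something like `verify_agent_binary ./coder || exit 2` before starting the agent, so that a bad download stops the script instead of being executed.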

github-actions bot assigned dannykopping

May 15, 2024

dannykopping force-pushed the dk/verify-agent branch from 8eb3abe to 60d4e14

May 15, 2024 15:40

dannykopping changed the title ~~Throw an error if agent init script fails to download valid binary~~ WIP: Throw an error if agent init script fails to download valid binary

May 15, 2024

dannykopping added 5 commits

May 16, 2024 09:55

Update bootstrap scripts to check for executable correctness

6705c9e

Signed-off-by: Danny Kopping <danny@coder.com>

Add comment to more easily find string replacements

d11c2d3

Signed-off-by: Danny Kopping <danny@coder.com>

Appease shellcheck

b63b479

Signed-off-by: Danny Kopping <danny@coder.com>

Make lint script more portable

5438b65

Signed-off-by: Danny Kopping <danny@coder.com>

Add tests

8f08e00

Signed-off-by: Danny Kopping <danny@coder.com>

dannykopping force-pushed the dk/verify-agent branch from 60d4e14 to 8f08e00

May 16, 2024 08:00

dannykopping changed the title ~~WIP: Throw an error if agent init script fails to download valid binary~~ fix: throw an error if agent init script fails to download valid binary

May 16, 2024

dannykopping commented

May 16, 2024

provisionersdk/scripts/bootstrap_windows.ps1 (resolved)

scripts/check_site_icons.sh (resolved)

dannykopping marked this pull request as ready for review

May 16, 2024 08:23

dannykopping requested review from mafredri, johnstcn and kylecarbs

May 16, 2024 08:24

johnstcn reviewed

May 16, 2024

provisionersdk/scripts/bootstrap_linux.sh (resolved)

provisionersdk/scripts/bootstrap_linux.sh (outdated, resolved)

Use more expressive error, double-quote output

70e3091

Signed-off-by: Danny Kopping <danny@coder.com>

mafredri reviewed

May 16, 2024

provisionersdk/scripts/bootstrap_darwin.sh

		export CODER_AGENT_URL="${ACCESS_URL}"
		exec ./$BINARY_NAME agent

		output=$(./${BINARY_NAME} --version | head -n1)

mafredri (Member), May 16, 2024

I can't recall now, but I worry that ${} could be interpreted as a Terraform variable here. This is good practice, but I think we should avoid it in the bootstrap scripts.

dannykopping (Contributor, Author), May 16, 2024

It's a good call-out, but BINARY_NAME is not replaced by the provider, it seems:
https://github.com/coder/terraform-provider-coder/blob/7815596401d6e69aebb4ceefe1e84369cb63c4ac/provider/agent.go#L345-L376

IMHO I think we should use a different template replacement syntax than Bash's variable expansion, to make it very clear that these are replaced by a script and not accepted into the script as env vars.
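
To illustrate the suggestion, here is a minimal sketch of a template placeholder syntax that cannot be confused with shell variable expansion. It assumes the provider fills in values by plain string replacement (as the linked agent.go lines suggest). The @@ACCESS_URL@@ syntax and the render_template helper are hypothetical, and sed merely stands in for the provider's replacement step.

```shell
#!/usr/bin/env sh
# Sketch only: @@ACCESS_URL@@ is a hypothetical placeholder syntax, and
# sed stands in for the provider's string replacement.
render_template() {
	# $1: template text, $2: access URL to substitute
	printf '%s\n' "$1" | sed "s|@@ACCESS_URL@@|$2|g"
}

# Single quotes keep ${BINARY_NAME} literal, as it would be in the script file.
template='export CODER_AGENT_URL="@@ACCESS_URL@@"
output=$(./${BINARY_NAME} --version | head -n1)'

render_template "$template" "https://coder.example.com/"
```

With this approach the placeholder is replaced before the script runs, while ${BINARY_NAME} passes through untouched and is expanded by the shell at run time.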

provisionersdk/scripts/bootstrap_windows.ps1 (resolved)

dannykopping requested a review from johnstcn

May 16, 2024 09:44

johnstcn approved these changes

May 16, 2024

mafredri approved these changes

May 16, 2024

mafredri (Member) left a comment

I would like to see that some manual e2e tests are performed against example templates, at least Docker and Windows just so we're sure we don't break anything. I'm worried that since these are rarely touched there's a high possibility of breaking fringe use-cases.

provisionersdk/scripts/bootstrap_linux.sh

		export CODER_AGENT_URL="${ACCESS_URL}"
		exec ./$BINARY_NAME agent

		output=$(./${BINARY_NAME} --version | head -n1)

mafredri (Member), May 16, 2024

I'd still like to see stderr redirected here so we can give a sensible output. Thoughts?

❯ output=$(echo '<html>' >hi; chmod +x hi; ./hi --version); declare -p output
./hi: line 1: syntax error near unexpected token `newline'
./hi: line 1: `<html>'
typeset output=''
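
A minimal sketch of the stderr redirection suggested in this comment, assuming a POSIX shell. The file name fake-agent is a hypothetical stand-in for a bad download such as an HTML error page. The exact error text captured depends on the shell that runs it.

```shell
#!/usr/bin/env sh
# Sketch only: "fake-agent" stands in for a bad download (an HTML error page).
workdir=$(mktemp -d)
printf '<html>\n' >"$workdir/fake-agent"
chmod +x "$workdir/fake-agent"

# Without 2>&1 the shell's error goes to the terminal and $output stays empty.
# With it, the first line of the error is captured and can be reported.
output=$("$workdir/fake-agent" --version 2>&1 | head -n1)

if ! echo "$output" | grep -q Coder; then
	echo "ERROR: Downloaded agent binary returned unexpected version output"
	echo "fake-agent --version output: \"$output\""
fi
rm -rf "$workdir"
```

Capturing stderr means the error message shown to the user includes the reason the binary failed to run, rather than an empty string.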

github-actions bot added the stale label

May 24, 2024

mtojek removed the stale label

May 24, 2024

dannykopping (Contributor, Author) commented May 24, 2024

> I would like to see that some manual e2e tests are performed against example templates, at least Docker and Windows just so we're sure we don't break anything. I'm worried that since these are rarely touched there's a high possibility of breaking fringe use-cases.

Absolutely agreed; I will get around to testing this as soon as I have a couple hours to focus.

Alternatively, @mtojek offered a hand and might pick this up.

mtojek (Member) commented May 24, 2024

I tested the PR and have some observations:

To refresh the init script I had to re-push the template version (Docker template). Is this inevitable?
UI does not indicate what is wrong (see below). Did I mess up something?

dannykopping changed the title ~~fix: throw an error if agent init script fails to download valid binary~~ fix: error out if agent init script fails to download a valid binary

May 30, 2024

Merge branch 'main' of github.com:/coder/coder into dk/verify-agent

3e121e0

dannykopping commented

May 30, 2024

scripts/deploy-pr.sh Outdated

		# get branch name and pr number
		branchName=$(gh pr view --json headRefName | jq -r .headRefName)
		prNumber=$(gh pr view --json number | jq -r .number)
		info=$(gh pr status --repo=coder/coder --json headRefName,number --jq '.createdBy[0]')

dannykopping (Contributor, Author), May 30, 2024

This didn't support forks before, but you cannot specify --repo in pr view, so I changed it to pr status.

dannykopping (Contributor, Author) commented May 30, 2024

@mtojek thanks for taking a look! I'm going to look into this today and answer your questions.

dannykopping force-pushed the dk/verify-agent branch from 36ad275 to 7f4de67

May 30, 2024 06:34

dannykopping mentioned this pull request

May 30, 2024

chore: modify preview deployment script to work with forks #13404

Closed

dannykopping force-pushed the dk/verify-agent branch from 7f4de67 to 3e121e0

May 30, 2024 08:03

dannykopping mentioned this pull request

May 30, 2024

In-repo clone of #13280 #13408

Closed

dannykopping (Contributor, Author) commented May 30, 2024

Created #13408 so I can deploy this in the preview environment.

dannykopping (Contributor, Author) commented May 30, 2024

Testing Linux

Using the preview environment, I spun up a workspace with the kubernetes template.
I shelled into the coder pod and replaced coder-linux-amd64 with the following shell script:

coder-54c9c56d67-7ddbc:~/.cache/coder/site/bin$ cat coder-linux-amd64
#!/usr/bin/env bash
echo "I am not the agent you are loooking for"

The agent produced this:

+ curl -fsSL --compressed https://pr13408.test.cdr.dev/bin/coder-linux-amd64 -o coder
+ break
+ chmod +x coder
+ [ -n  ]
+ export CODER_AGENT_AUTH=token
+ export CODER_AGENT_URL=https://pr13408.test.cdr.dev/
+ ./coder --version
+ head -n1
+ output=I am not the agent you are loooking for
+ echo I am not the agent you are loooking for
+ grep -q Coder
+ echo ERROR: Downloaded agent binary returned unexpected version output
ERROR: Downloaded agent binary returned unexpected version output
+ echo coder --version output: "I am not the agent you are loooking for"
coder --version output: "I am not the agent you are loooking for"
+ exit 2
+ waitonexit
+ echo === Agent script exited with non-zero code (2). Sleeping 24h to preserve logs...
=== Agent script exited with non-zero code (2). Sleeping 24h to preserve logs...
+ sleep 86400
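
The tail of this trace (exit 2, then waitonexit, then a long sleep) is the behaviour of an exit handler. A minimal sketch of that pattern follows, assuming it is implemented with a shell EXIT trap. The wrapper run_with_exit_handler and the SLEEP_SECONDS variable are hypothetical and exist only so the sketch can be tried without waiting 24 hours.

```shell
#!/usr/bin/env sh
# Sketch only: run_with_exit_handler and SLEEP_SECONDS are hypothetical names.
# The trace above shows the real script sleeping for 86400 seconds.
SLEEP_SECONDS="${SLEEP_SECONDS:-86400}"

waitonexit() {
	code=$? # exit status of the script at the time the trap fires
	if [ "$code" -ne 0 ]; then
		echo "=== Agent script exited with non-zero code ($code). Sleeping to preserve logs..."
		sleep "$SLEEP_SECONDS"
	fi
}

# Runs a command in a subshell with the handler installed, so the
# handler fires when that subshell exits.
run_with_exit_handler() {
	(
		trap waitonexit EXIT
		"$@"
		exit "$?"
	)
}
```

For example, `SLEEP_SECONDS=1 run_with_exit_handler sh -c 'exit 2'` prints the message, pauses briefly, and returns status 2, while a successful command produces no output.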

@bpmct what do you think we should do here? I don't think we currently stream the agent logs to the workspace detail page, only the provisioner logs, so I'm not sure how we'll display a specific error in this case.

I will continue testing on both Mac (Darwin) and Windows to ensure the changes to the scripts work correctly.

johnstcn (Member) commented May 30, 2024

Failed workspace agent bootstrapping has historically been one of the more difficult issues to troubleshoot, and tends to require manually inspecting the execution environment.

A number of scenarios can cause bootstrapping a workspace to fail, including but not limited to:

Init script fails because of bad syntax (developer error, or possibly a very strange execution environment)
Init script fails due to missing dependencies (e.g. wget, curl)
Init script fails to download agent (DNS resolution failure etc.)
Init script fails to execute agent (missing libs, bad arch, etc.)

The error you caused above would fall under the last category. At this point, we should have reasonable confidence that we can connect to the control plane, and we have all the dependencies needed to download the agent binary. If executing the binary fails, we could potentially do a best-effort curl -XPOST back to the control plane to send some troubleshooting information. I would consider it outside of the scope of this PR though.
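
A rough sketch of what such a best-effort report could look like. Everything specific here is an assumption: the helper name report_bootstrap_failure, the endpoint path, and the form field are hypothetical, and nothing in this thread indicates that the control plane exposes such an endpoint.

```shell
#!/usr/bin/env sh
# Sketch only: the endpoint path and the "message" field are hypothetical.
report_bootstrap_failure() {
	access_url="$1" # e.g. the URL the agent binary was downloaded from
	message="$2"    # troubleshooting text, such as the --version output
	# Best effort: short timeout, output discarded, and a failed request
	# never changes the outcome of the calling script.
	curl -fsS --max-time 10 -X POST \
		--data-urlencode "message=$message" \
		"${access_url%/}/hypothetical/bootstrap-failure" >/dev/null 2>&1 || true
}
```

The `|| true` is the important design choice: reporting is purely advisory, so a DNS failure or a missing endpoint must not mask the original bootstrap error.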

dannykopping (Contributor, Author) commented May 30, 2024

That's a cool idea, @johnstcn 👍
Agreed it's out of scope for this PR. It's definitely in scope for the attached issue this PR is trying to fix, though, so I think we can merge this regardless once the testing is complete and once we've agreed on the mechanism forward we can address that.

dannykopping (Contributor, Author) commented May 30, 2024

I've tried my best to set up the preview environment to test on Windows (see this thread), but it's quickly turning into more trouble than it's worth IMHO.

I'm going to merge this PR and test in dogfood.
If there are any problems there I'll revert.

dannykopping changed the title ~~fix: error out if agent init script fails to download a valid binary~~ fix: return error if agent init script fails to download valid binary

May 30, 2024

dannykopping (Contributor, Author) commented May 30, 2024

Not sure why https://github.com/coder/coder/pull/13280/checks?check_run_id=25592484459 is failing, because the title does match the regex...

dannykopping merged commit 59ab505 into coder:main

May 30, 2024

dannykopping deleted the dk/verify-agent branch

May 30, 2024 11:33

github-actions bot locked and limited conversation to collaborators

May 30, 2024

dannykopping (Contributor, Author) commented May 30, 2024

> I tested the PR and have some observations:
> To refresh the init script I had to re-push the template version (Docker template). Is this inevitable?
> UI does not indicate what is wrong (see below). Did I mess up something?

@mtojek to answer your questions:

> To refresh the init script I had to re-push the template version (Docker template). Is this inevitable?

Nope, I tested with the same template and the changes were present.

> UI does not indicate what is wrong (see below). Did I mess up something?

See #13280 (comment).

dannykopping (Contributor, Author) commented May 30, 2024

Testing Darwin

Pretty much the same as Linux...

Testing Windows

Setting the coder-windows-amd64.exe file to a simple helloworld.exe file leads to this outcome:

We deploy our Windows VMs in GCP, and the init script output goes to one of the serial ports which GCP keeps the logs of (the screenshot above).

I used the following command to retrieve those logs:

$ gcloud compute --project=<project> instances get-serial-port-output coder-danny-windows-rdp --zone=europe-west4-b --port=1

All looks fine to me 👍
This didn't require any changes from the workspace owner or the admin.


fix: return error if agent init script fails to download valid binary #13280


dannykopping commented May 15, 2024 (edited)
