- Notifications
You must be signed in to change notification settings - Fork924
fix: return error if agent init script fails to download valid binary#13280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Signed-off-by: Danny Kopping <danny@coder.com>
@@ -31,4 +31,12 @@ fi | |||
export CODER_AGENT_AUTH="${AUTH_TYPE}" | |||
export CODER_AGENT_URL="${ACCESS_URL}" | |||
exec ./$BINARY_NAME agent | |||
output=$(./${BINARY_NAME} --version | head -n1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I can't recall now, but I worry if${}
can be interpreted as a terraform variable here. This is good practice but I think we should avoid it in the bootstrap scripts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
It's a good call-out, butBINARY_NAME
is not replaced by the provider, it seems:
https://github.com/coder/terraform-provider-coder/blob/7815596401d6e69aebb4ceefe1e84369cb63c4ac/provider/agent.go#L345-L376
IMHO I think we should use a different template replacement syntax than Bash's variable expansion, to make it very clear that these are replaced by a script and not accepted into the script as env vars.
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I would like to see that some manual e2e tests are performed against example templates, at least Docker and Windows just so we're sure we don't break anything. I'm worried that since these are rarely touched there's a high possibility of breaking fringe use-cases.
@@ -86,4 +86,12 @@ fi | |||
export CODER_AGENT_AUTH="${AUTH_TYPE}" | |||
export CODER_AGENT_URL="${ACCESS_URL}" | |||
exec ./$BINARY_NAME agent | |||
output=$(./${BINARY_NAME} --version | head -n1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I'd still like to see stderr redirected here so we can give a sensible output. Thoughts?
❯ output=$(echo '<html>' >hi; chmod +x hi; ./hi --version); declare -p output./hi: line 1: syntax error near unexpected token `newline'./hi: line 1: `<html>'typeset output=''
Absolutely agreed; I will get around to testing this as soon as I have a couple hours to focus. Alternatively@mtojek offered a hand and he might pick this up. |
scripts/deploy-pr.sh Outdated
@@ -71,8 +71,9 @@ fi | |||
gh_auth | |||
# get branch name and pr number | |||
branchName=$(gh pr view --json headRefName | jq -r .headRefName) | |||
prNumber=$(gh pr view --json number | jq -r .number) | |||
info=$(gh pr status --repo=coder/coder --json headRefName,number --jq '.createdBy[0]') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This didn't support forks before, but you cannot specify--repo
inpr view
so I changed it topr status
.
@mtojek thanks for taking a look! I'm going to look into this today and answer your questions. |
Created#13408 so I can deploy this in the preview environment. |
Testing LinuxUsing the preview environment, I spun up a workspace with the kubernetes template. coder-54c9c56d67-7ddbc:~/.cache/coder/site/bin$ cat coder-linux-amd64#!/usr/bin/env bashecho"I am not the agent you are loooking for" The agent produced this: + curl -fsSL --compressed https://pr13408.test.cdr.dev/bin/coder-linux-amd64 -o coder+break+ chmod +x coder+ [-n ]+export CODER_AGENT_AUTH=token+export CODER_AGENT_URL=https://pr13408.test.cdr.dev/+ ./coder --version+ head -n1+ output=I am not the agent you are loookingfor+echo I am not the agent you are loookingfor+ grep -q Coder+echo ERROR: Downloaded agent binary returned unexpected version outputERROR: Downloaded agent binary returned unexpected version output+echo coder --version output:"I am not the agent you are loooking for"coder --version output:"I am not the agent you are loooking for"+exit 2+ waitonexit+echo === Agent script exited with non-zero code (2). Sleeping 24h to preserve logs...=== Agent script exited with non-zero code (2). Sleeping 24h to preserve logs...+ sleep 86400 @bpmct what do you think we should do here? I don't think we currently stream the agent logs to the workspace detail page, only the provisioner logs, so I'm not sure how we'll display a specific error in this case. I will continue testing on both Mac (Darwin) and Windows to ensure the changes to the scripts work correctly. |
Troubleshooting failed workspace agent bootstrapping has historically been one of the more difficult issues to troubleshoot, and tends to require manually inspecting the execution environment. There are a number of scenarios that can cause a bootstrapping a workspace to fail, including but not limited to:
The above error you caused would fall under the last category. At this point, we should have reasonable confidence that we can connect to the control plane, and we have all the dependencies needed to download the agent binary. If executing the binary fails, we could potentially do a best-effort |
That's a cool idea@johnstcn 👍 |
I've tried my best to set up the preview environment to test on Windows (seethis thread), but it's quickly turning into more trouble than it's worth IMHO. I'm going to merge this PR and test in dogfood. |
Not sure whyhttps://github.com/coder/coder/pull/13280/checks?check_run_id=25592484459 is failing, because the titledoes match the regex... |
59ab505
intocoder:mainUh oh!
There was an error while loading.Please reload this page.
@mtojek to answer your questions:
Nope, I tested with the same template and the changes were present.
See#13280 (comment). |
Testing DarwinPretty much the same as Linux... Testing WindowsSetting the We deploy our Windows VMs in GCP, and the init script output goes to one of the serial ports which GCP keeps the logs of (the screenshot above). I used the following command to retrieve those logs: $ gcloud compute --project=<project> instances get-serial-port-output coder-danny-windows-rdp --zone=europe-west4-b --port=1 All looks fine to me 👍 |
Uh oh!
There was an error while loading.Please reload this page.
Partially addresses#6711
This change runs the downloaded binary with a
--version
flag and checks if the command responds with text which matchesCoder
. This is not the strictest of checks, but it's the most pragmatic in terms of backwards-compatibility (i.e. if we added a new--verify
command, some agents would not have this yet by definition).We considered using SHA1 hash comparisons but that was a heavier lift and gets us effectively to the same point.