Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

fix: return error if agent init script fails to download valid binary#13280

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
dannykopping merged 7 commits intocoder:mainfromdannykopping:dk/verify-agent
May 30, 2024

Conversation

dannykopping
Copy link
Contributor

@dannykoppingdannykopping commentedMay 15, 2024
edited
Loading

Partially addresses#6711

This change runs the downloaded binary with a--version flag and checks if the command responds with text which matchesCoder. This is not the strictest of checks, but it's the most pragmatic in terms of backwards-compatibility (i.e. if we added a new--verify command, some agents would not have this yet by definition).

We considered using SHA1 hash comparisons but that was a heavier lift and gets us effectively to the same point.

@dannykoppingdannykopping changed the titleThrow an error if agent init script fails to download valid binaryWIP: Throw an error if agent init script fails to download valid binaryMay 15, 2024
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
@dannykoppingdannykopping changed the titleWIP: Throw an error if agent init script fails to download valid binaryfix: throw an error if agent init script fails to download valid binaryMay 16, 2024
@dannykoppingdannykopping marked this pull request as ready for reviewMay 16, 2024 08:23
Signed-off-by: Danny Kopping <danny@coder.com>
@@ -31,4 +31,12 @@ fi

export CODER_AGENT_AUTH="${AUTH_TYPE}"
export CODER_AGENT_URL="${ACCESS_URL}"
exec ./$BINARY_NAME agent

output=$(./${BINARY_NAME} --version | head -n1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I can't recall now, but I worry if${} can be interpreted as a terraform variable here. This is good practice but I think we should avoid it in the bootstrap scripts.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It's a good call-out, butBINARY_NAME is not replaced by the provider, it seems:
https://github.com/coder/terraform-provider-coder/blob/7815596401d6e69aebb4ceefe1e84369cb63c4ac/provider/agent.go#L345-L376

IMHO I think we should use a different template replacement syntax than Bash's variable expansion, to make it very clear that these are replaced by a script and not accepted into the script as env vars.

johnstcn and mafredri reacted with thumbs up emoji
Copy link
Member

@mafredrimafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I would like to see that some manual e2e tests are performed against example templates, at least Docker and Windows just so we're sure we don't break anything. I'm worried that since these are rarely touched there's a high possibility of breaking fringe use-cases.

@@ -86,4 +86,12 @@ fi

export CODER_AGENT_AUTH="${AUTH_TYPE}"
export CODER_AGENT_URL="${ACCESS_URL}"
exec ./$BINARY_NAME agent

output=$(./${BINARY_NAME} --version | head -n1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I'd still like to see stderr redirected here so we can give a sensible output. Thoughts?

❯ output=$(echo '<html>' >hi; chmod +x hi; ./hi --version); declare -p output./hi: line 1: syntax error near unexpected token `newline'./hi: line 1: `<html>'typeset output=''

@github-actionsgithub-actionsbot added the staleThis issue is like stale bread. labelMay 24, 2024
@mtojekmtojek removed the staleThis issue is like stale bread. labelMay 24, 2024
@dannykopping
Copy link
ContributorAuthor

I would like to see that some manual e2e tests are performed against example templates, at least Docker and Windows just so we're sure we don't break anything. I'm worried that since these are rarely touched there's a high possibility of breaking fringe use-cases.

Absolutely agreed; I will get around to testing this as soon as I have a couple hours to focus.

Alternatively@mtojek offered a hand and he might pick this up.

@mtojek
Copy link
Member

I tested the PR and have some observations:

  1. To refresh the init script I had to re-push the template version (Docker template). Is this inevitable?
  2. UI does not indicate what is wrong (see below). Did I mess up something?
Screenshot 2024-05-24 at 14 04 53Screenshot 2024-05-24 at 14 03 49

@dannykoppingdannykopping changed the titlefix: throw an error if agent init script fails to download valid binaryfix: error out if agent init script fails to download a valid binaryMay 30, 2024
@@ -71,8 +71,9 @@ fi
gh_auth

# get branch name and pr number
branchName=$(gh pr view --json headRefName | jq -r .headRefName)
prNumber=$(gh pr view --json number | jq -r .number)
info=$(gh pr status --repo=coder/coder --json headRefName,number --jq '.createdBy[0]')
Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This didn't support forks before, but you cannot specify--repo inpr view so I changed it topr status.

@dannykopping
Copy link
ContributorAuthor

@mtojek thanks for taking a look! I'm going to look into this today and answer your questions.

@dannykopping
Copy link
ContributorAuthor

Created#13408 so I can deploy this in the preview environment.

@dannykopping
Copy link
ContributorAuthor

Testing Linux

Using the preview environment, I spun up a workspace with the kubernetes template.
I shelled into thecoder pod and replacedcoder-linux-amd64 with the following shell script:

coder-54c9c56d67-7ddbc:~/.cache/coder/site/bin$ cat coder-linux-amd64#!/usr/bin/env bashecho"I am not the agent you are loooking for"

The agent produced this:

+ curl -fsSL --compressed https://pr13408.test.cdr.dev/bin/coder-linux-amd64 -o coder+break+ chmod +x coder+ [-n  ]+export CODER_AGENT_AUTH=token+export CODER_AGENT_URL=https://pr13408.test.cdr.dev/+ ./coder --version+ head -n1+ output=I am not the agent you are loookingfor+echo I am not the agent you are loookingfor+ grep -q Coder+echo ERROR: Downloaded agent binary returned unexpected version outputERROR: Downloaded agent binary returned unexpected version output+echo coder --version output:"I am not the agent you are loooking for"coder --version output:"I am not the agent you are loooking for"+exit 2+ waitonexit+echo === Agent script exited with non-zero code (2). Sleeping 24h to preserve logs...=== Agent script exited with non-zero code (2). Sleeping 24h to preserve logs...+ sleep 86400

image

@bpmct what do you think we should do here? I don't think we currently stream the agent logs to the workspace detail page, only the provisioner logs, so I'm not sure how we'll display a specific error in this case.

I will continue testing on both Mac (Darwin) and Windows to ensure the changes to the scripts work correctly.

@johnstcn
Copy link
Member

Troubleshooting failed workspace agent bootstrapping has historically been one of the more difficult issues to troubleshoot, and tends to require manually inspecting the execution environment.

There are a number of scenarios that can cause a bootstrapping a workspace to fail, including but not limited to:

  1. Init script fails because of bad syntax (developer error, or possibly avery strange execution environment)
  2. Init scripts fails due to missing dependencies (e.g.wget,curl)
  3. Init script fails to download agent (DNS resolution failure etc.)
  4. Init script fails to execute agent (missing libs, bad arch, etc.)

The above error you caused would fall under the last category. At this point, we should have reasonable confidence that we can connect to the control plane, and we have all the dependencies needed to download the agent binary. If executing the binary fails, we could potentially do a best-effortcurl -XPOST back to the control plane to send some troubleshooting information. I would consider it outside of the scope of this PR though.

@dannykopping
Copy link
ContributorAuthor

That's a cool idea@johnstcn 👍
Agreed it's out of scope for this PR. It's definitely in scope for the attached issue this PR is trying to fix, though, so I think we can merge this regardless once the testing is complete and once we've agreed on the mechanism forward we can address that.

@dannykopping
Copy link
ContributorAuthor

I've tried my best to set up the preview environment to test on Windows (seethis thread), but it's quickly turning into more trouble than it's worth IMHO.

I'm going to merge this PR and test in dogfood.
If there are any problems there I'll revert.

@dannykoppingdannykopping changed the titlefix: error out if agent init script fails to download a valid binaryfix: return error if agent init script fails to download valid binaryMay 30, 2024
@dannykopping
Copy link
ContributorAuthor

Not sure whyhttps://github.com/coder/coder/pull/13280/checks?check_run_id=25592484459 is failing, because the titledoes match the regex...

@dannykoppingdannykopping merged commit59ab505 intocoder:mainMay 30, 2024
55 of 56 checks passed
@dannykoppingdannykopping deleted the dk/verify-agent branchMay 30, 2024 11:33
@github-actionsgithub-actionsbot locked and limited conversation to collaboratorsMay 30, 2024
@dannykopping
Copy link
ContributorAuthor

I tested the PR and have some observations:

  1. To refresh the init script I had to re-push the template version (Docker template). Is this inevitable?
  2. UI does not indicate what is wrong (see below). Did I mess up something?

@mtojek to answer your questions:

  1. To refresh the init script I had to re-push the template version (Docker template). Is this inevitable?

Nope, I tested with the same template and the changes were present.

  1. UI does not indicate what is wrong (see below). Did I mess up something?

See#13280 (comment).

@dannykopping
Copy link
ContributorAuthor

Testing Darwin

Pretty much the same as Linux...

Testing Windows

Setting thecoder-windows-amd64.exe file to a simplehelloworld.exe file leads to this outcome:

image

We deploy our Windows VMs in GCP, and the init script output goes to one of the serial ports which GCP keeps the logs of (the screenshot above).

I used the following command to retrieve those logs:

$ gcloud compute --project=<project> instances get-serial-port-output coder-danny-windows-rdp --zone=europe-west4-b --port=1

All looks fine to me 👍
This didn't require any changes from the workspace owner or the admin.

Sign up for freeto subscribe to this conversation on GitHub. Already have an account?Sign in.
Reviewers

@mafredrimafredrimafredri approved these changes

@johnstcnjohnstcnjohnstcn approved these changes

@kylecarbskylecarbsAwaiting requested review from kylecarbs

Assignees

@dannykoppingdannykopping

Labels
None yet
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

4 participants
@dannykopping@mtojek@johnstcn@mafredri

[8]ページ先頭

©2009-2025 Movatter.jp