Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

refactor: PTY & SSH#7100

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
spikecurtis merged 22 commits intomainfromspike/pty-ssh
Apr 24, 2023
Merged

refactor: PTY & SSH#7100

spikecurtis merged 22 commits intomainfromspike/pty-ssh
Apr 24, 2023

Conversation

spikecurtis
Copy link
Contributor

@spikecurtisspikecurtis commentedApr 12, 2023
edited
Loading

Our agentssh package handles running commands in a pseudoterminal, and we've had several problems with not returning command output.

The implementation prior to this PR seems to behave how we want it to, (at least on Linux), but it

  1. uses needlessly complex concurrency and synchronization
  2. incorrectly explainswhy it works
  3. doesn't fix the problem on Windows

This leaves us in a precarious situation where if someone tries to fix or enhance this function, they'll misunderstand how and why the current code works.

This PR attempts to address these problems.

This was the subject of severalfixes:#6777#6833 and#6862

The initial bug was that the SSH handler runsio.Copy() in a goroutine to copy the command output to the SSH session, but we didn't wait for this copy to finish before closing down the session, causing a race that sometimes lost output.

However, simply waiting for theio.Copy() to finish would result in a deadlock because of another, much more subtle bug in the way pseudoterminals are handled by the OS.

Linux/macOS

The current code makes an unnecessary call to the low-level syscalldup to duplicate a file descriptor. I'll try to explain what was happening and why it was bad:

On POSIX systems, the way pseudo-TTYs work is that open a pair of files, which I'll refer to as thepty and thetty to match the terminology in our code (Linux itself calls themptm andpts). Thetty acts just like a "real" teletype device for any program interacting with it --- and when we launch commands, we pass thetty asall three of the stdio file descriptors (stdin, stdout, stderr) to the child process.

After forking/exec'ing to start the child process this is what things looked like:

go processPTY:  - pty  ------------------------------> pty <--- linux kernel ----> tty <---- child process  - tty  ------------------------------------------------------------^

Notice that our go process kept thetty file open. What happens is when you read from thepty, and there is no data available to read, it blocks until either

  1. there is data to read
  2. all file descriptors to thetty are closed

We kept an open file descriptor to thetty so even after the child process exited, we never hit condition 2, until we calledClose() on ourPTY object, which then closed bothpty andtty at the same time.

The current code gets around this by callingdup on thepty file descriptor, so we end up with two references to the PTY.

go processSSH Handler:  - stdout ------------------------------+PTY:                                     V  - pty  ------------------------------> pty <--- linux kernel ----> tty <---- child process  - tty  ------------------------------------------------------------^

The explanation given for this is that it allows us to closestdin andstdout separately, but that's a misunderstanding. The child process interacts only with thetty for bothstdin andstdout --- duplicating thepty doesn't change this.

However, thisdoes allow us to break the deadlock, because callingClose() on thePTY closestty, but since we have the duplicated file, we can still read the output as long as the child process is still alive or there is unread buffered data in the kernel. But, this is needlessly wasteful of file descriptors (which are a limited kernel resource), and essentially uses a kernel primitive to accomplish the task of closingtty without closingpty.

The essential features of this PR on Linux/macOS are to

  1. Close thetty immediately after creating the child process - this allows us to safely read all the output from the child process without fear of deadlocking.
  2. Fix the SSH handler to simply callio.Copy() to copy output, then grab the exit code and return --- no extra file descriptors, wait groups, or synchronization.

Windows

Instead of a single read-write file descriptor, Windows uses a pair of pipes (which are unidirectional). On Windows, though, the pseudoconsole is hosted by a separate process, called the Console Host (conhost.exe), rather than being handled entirely in-kernel like on Linux/macOS. The lifecycle of the pseudoterminal is explicit via a Handle returned on creation. So on Windows, we create 5 handles:

  1. Input Read Pipe
  2. Input Write Pipe
  3. Output Read Pipe
  4. Output Write Pipe
  5. Console

In the current implementation, we keep all 5 handles open until the PTY is closed, at which point we close all 5. This then causes a race for a reader trying to read all the data before the pipe gets closed.

Docs for PseudoConsoles explain that once you pass the Input Read Pipe and Output Write Pipe to the pseudoconsole host, you should close them.

On Windows we also have to close the Pseudoconsole itself to unblock our output reads, since the Pseudoconsole is a separate process from the child application, and it has an explicit lifetime. So, in this PR, we also add code that closes the Pseudoconsole as soon as the child application exits (for PTYs attached to commands). Crucially, we leave the Output Read Pipe open, so that we can still read any data buffered in the OS.

Other features

  • split thePTY interface in 2, since there are still use cases in the code where we create a PTY but don't start a child process. So, we have one interface when we are interacting with a child, and a different interface when we hold both ends of the PTY -- this is done for safety so that you can't attempt to read from thetty after we've closed it.
  • add a unit test to check that we clean up orphaned processes when the SSH client disconnects
  • add a unit test to check for truncated output from commands started in a PTY

Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
@spikecurtisspikecurtis changed the titlePTY & SSH fixesrefactor: PTY & SSHApr 12, 2023
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
@spikecurtisspikecurtis marked this pull request as ready for reviewApril 19, 2023 07:13
Copy link
Member

@mafredrimafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thanks for digging into this! I was aware of the method by which it worked but seems did a poor job explaining it... not to mention being brave enough to venture this deep into PTYs. 😄

I did a little testing locally and couldn't immediately find any issues with this approach, so I'm hoping there won't be any edge-cases with processes closing their inputs/outputs while still running.

Had a few remarks but all-in-all, looks good, good job!

// may have already closed the PseudoConsole when the command exited, so that
// output reads can get to EOF. In that case, we don't need to close it
// again here.
if p.console != windows.InvalidHandle {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Assigned both here and in waitInternal, probably needs protection.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Protected by thecloseMutex

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Ah right, thep.pw threw me off. I noticedResize is not protected though!

func (p *PTY) Close() error {
p.t.Helper()
pErr := p.PTY.Close()
eErr := p.outExpecter.close("close")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

A lot of the ceremony inclose is due to closingpty, which is no longer done there. We should probably simplify it and make it less spammy. (The theory was that we might be hanging in some of the cleanup, resulting in weird flakes, but that has not been the case.)

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

TheoutExpecter.close() doesn't close the pty itself, but it does monitor the cleanup of the various goroutines we created to log the output and run expectations against it. I've left that part intact as my change is meant to leaveptytest functionally equivalent to what it was before.

b := make([]byte, 1)
go func() {
_, err := r.Read(b)
readErrs <- err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change
readErrs<-err
select {
casereadErrs<-err:
default:
}

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't think this is needed.

The whole purpose of this little goroutine is to allow me to callRead() but still be able to timeout in this function if the context expires. So I read in a goroutine, thenselect whether the read completed before the context expires. Thechan error is buffered, so even if the context expires, the goroutine will be able to finish onceRead() returns. Note that an expired context callsreturn so there can only be one read goroutine started at a time.

mafredri reacted with thumbs up emoji
Writer: p.tty,
}
}

func (p *otherPty) OutputReader() io.Reader {
return &ptmReader{p.pty}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It's unfortunate we can't pass the*os.File here sinceio.Copy can sometimes do some slightly more performant operations with it. (That's more of a nice-to-have though, so this is fine.)

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't think that applies for copyingfrom an*os.File.io.Copy() looks forWriteTo() on the source,which*os.File does not implement.

It does apply when you are copyingto an*os.File, since it implementsReadFrom() ---InputWriter() still returns the*os.File directly, so we can take advantage of that optimization.

mafredri reacted with thumbs up emoji
type readNopCloser struct{ io.Reader }
// ptySession is the interface to the ssh.Session that startPTYSession uses
// we use an interface here so that we can fake it in tests.
type ptySession interface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Is there a reason you chose a stripped down interface vs usingssh.Session directly? I see the test but we also have a fake context there with all the methods so I'm not seeing the benefit per-se. I feel this creates a bit of needless indirection. Not critical to change now, I'll see if I make some changes down the line as I do some refactoring for the session handling anyway.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

ssh.Session is an even biggerinterface!

I couldn't see any way to directly instantiate any of the concrete implementations in likegliderlabs.

My first stab at the test I wanted to do created the whole server like the other ssh tests do, but I found the even after callingClose() on the network connection, the session context in the handler wasn't closed quickly. So, this construction narrows down the bigssh.Session interface into just what we need, and allows me to more precisely control when the context expires in tests.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It's a bit annoying that I have to mock out the context methods, but unfortunately,ssh.Session is defined to have a methodContext() that returnsssh.Context rather thancontext.Context.

Unfortunately the go type-checker isn't smart enough to understand thatContext() ssh.Contextmust also satisfyContext() context.Context sincecontext.Context is a strict subset ofssh.Context. Alas.

// if we already have an error from the command, prefer that error
// but if the command succeeded and closing the PseudoConsole fails
// then record that error so that we have a chance to see it
p.cmdErr = err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It'd be nice to log the error whether it's kept or discarded

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It would be nice, but I don't have a logger, and I'd have to change all the init signatures to get one, so I decided on this.

Another alternative I considered is panicking---something pretty messed up at the OS level is happening if we get an error here. We're pretty much guaranteed to crash the agent if we panic, so it's likely we'd get told about it.

Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Copy link
Member

@mafredrimafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The windows Resize method still needs protection on readingp.console and I'm not sold on theoutExpecter naming (as commented), but won't block the PR for that.

Once the the resize is fixed and the test pass, LGTM.

Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Copy link
Member

@kylecarbskylecarbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Feel free to defer my reviews to Dean and Mathias.

I'mextremely happy you dug into this!

Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
@spikecurtisspikecurtis merged commitdaee91c intomainApr 24, 2023
@spikecurtisspikecurtis deleted the spike/pty-ssh branchApril 24, 2023 10:53
@github-actionsgithub-actionsbot locked and limited conversation to collaboratorsApr 24, 2023
Sign up for freeto subscribe to this conversation on GitHub. Already have an account?Sign in.
Reviewers

@mafredrimafredrimafredri approved these changes

@kylecarbskylecarbskylecarbs left review comments

@deansheatherdeansheatherdeansheather approved these changes

Assignees

@spikecurtisspikecurtis

Labels
None yet
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

4 participants
@spikecurtis@mafredri@kylecarbs@deansheather

[8]ページ先頭

©2009-2025 Movatter.jp