NotificationsYou must be signed in to change notification settings
Fork1.1k
Star11.4k

refactor: PTY & SSH#7100

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Merged

spikecurtis merged 22 commits intomainfromspike/pty-ssh

Apr 24, 2023

Merged

refactor: PTY & SSH#7100

spikecurtis merged 22 commits intomainfromspike/pty-ssh

Apr 24, 2023

Conversation

Copy link

Contributor

spikecurtis commentedApr 12, 2023•
edited
Loading

Our agentssh package handles running commands in a pseudoterminal, and we've had several problems with not returning command output.

The implementation prior to this PR seems to behave how we want it to, (at least on Linux), but it

uses needlessly complex concurrency and synchronization
incorrectly explainswhy it works
doesn't fix the problem on Windows

This leaves us in a precarious situation where if someone tries to fix or enhance this function, they'll misunderstand how and why the current code works.

This PR attempts to address these problems.

This was the subject of severalfixes:#6777 #6833 and#6862

The initial bug was that the SSH handler runsio.Copy() in a goroutine to copy the command output to the SSH session, but we didn't wait for this copy to finish before closing down the session, causing a race that sometimes lost output.

However, simply waiting for theio.Copy() to finish would result in a deadlock because of another, much more subtle bug in the way pseudoterminals are handled by the OS.

Linux/macOS

The current code makes an unnecessary call to the low-level syscalldup to duplicate a file descriptor. I'll try to explain what was happening and why it was bad:

On POSIX systems, the way pseudo-TTYs work is that open a pair of files, which I'll refer to as thepty and thetty to match the terminology in our code (Linux itself calls themptm andpts). Thetty acts just like a "real" teletype device for any program interacting with it --- and when we launch commands, we pass thetty asall three of the stdio file descriptors (stdin, stdout, stderr) to the child process.

After forking/exec'ing to start the child process this is what things looked like:

go processPTY:  - pty  ------------------------------> pty <--- linux kernel ----> tty <---- child process  - tty  ------------------------------------------------------------^

Notice that our go process kept thetty file open. What happens is when you read from thepty, and there is no data available to read, it blocks until either

there is data to read
all file descriptors to thetty are closed

We kept an open file descriptor to thetty so even after the child process exited, we never hit condition 2, until we calledClose() on ourPTY object, which then closed bothpty andtty at the same time.

The current code gets around this by callingdup on thepty file descriptor, so we end up with two references to the PTY.

go processSSH Handler:  - stdout ------------------------------+PTY:                                     V  - pty  ------------------------------> pty <--- linux kernel ----> tty <---- child process  - tty  ------------------------------------------------------------^

The explanation given for this is that it allows us to closestdin andstdout separately, but that's a misunderstanding. The child process interacts only with thetty for bothstdin andstdout --- duplicating thepty doesn't change this.

However, thisdoes allow us to break the deadlock, because callingClose() on thePTY closestty, but since we have the duplicated file, we can still read the output as long as the child process is still alive or there is unread buffered data in the kernel. But, this is needlessly wasteful of file descriptors (which are a limited kernel resource), and essentially uses a kernel primitive to accomplish the task of closingtty without closingpty.

The essential features of this PR on Linux/macOS are to

Close thetty immediately after creating the child process - this allows us to safely read all the output from the child process without fear of deadlocking.
Fix the SSH handler to simply callio.Copy() to copy output, then grab the exit code and return --- no extra file descriptors, wait groups, or synchronization.

Windows

Instead of a single read-write file descriptor, Windows uses a pair of pipes (which are unidirectional). On Windows, though, the pseudoconsole is hosted by a separate process, called the Console Host (conhost.exe), rather than being handled entirely in-kernel like on Linux/macOS. The lifecycle of the pseudoterminal is explicit via a Handle returned on creation. So on Windows, we create 5 handles:

Input Read Pipe
Input Write Pipe
Output Read Pipe
Output Write Pipe
Console

In the current implementation, we keep all 5 handles open until the PTY is closed, at which point we close all 5. This then causes a race for a reader trying to read all the data before the pipe gets closed.

Docs for PseudoConsoles explain that once you pass the Input Read Pipe and Output Write Pipe to the pseudoconsole host, you should close them.

On Windows we also have to close the Pseudoconsole itself to unblock our output reads, since the Pseudoconsole is a separate process from the child application, and it has an explicit lifetime. So, in this PR, we also add code that closes the Pseudoconsole as soon as the child application exits (for PTYs attached to commands). Crucially, we leave the Output Read Pipe open, so that we can still read any data buffered in the OS.

Other features

split thePTY interface in 2, since there are still use cases in the code where we create a PTY but don't start a child process. So, we have one interface when we are interacting with a child, and a different interface when we hold both ends of the PTY -- this is done for safety so that you can't attempt to read from thetty after we've closed it.
add a unit test to check that we clean up orphaned processes when the SSH client disconnects
add a unit test to check for truncated output from commands started in a PTY

spikecurtis added2 commits

April 11, 2023 15:00

Add ssh tests for longoutput, orphan

0075d7d

Signed-off-by: Spike Curtis <spike@coder.com>

PTY/SSH tests & improvements

a491d4f

Signed-off-by: Spike Curtis <spike@coder.com>

github-actionsbot assignedspikecurtis

Apr 12, 2023

spikecurtis added2 commits

April 12, 2023 15:14

Fix some tests

2cf357a

Signed-off-by: Spike Curtis <spike@coder.com>

Fix linting

28c0646

Signed-off-by: Spike Curtis <spike@coder.com>

spikecurtis changed the title~~PTY & SSH fixes~~refactor: PTY & SSH

Apr 12, 2023

spikecurtis added11 commits

April 12, 2023 12:08

fmt

872e357

Signed-off-by: Spike Curtis <spike@coder.com>

Fix windows test

3f21e30

Signed-off-by: Spike Curtis <spike@coder.com>

Windows copy test

e83ff6e

Signed-off-by: Spike Curtis <spike@coder.com>

WIP Windows pty handling

b610579

Signed-off-by: Spike Curtis <spike@coder.com>

Fix truncation tests

90bfe94

Signed-off-by: Spike Curtis <spike@coder.com>

Appease linter/fmt

d6e131c

Signed-off-by: Spike Curtis <spike@coder.com>

Fix typo

2c9c6ef

Signed-off-by: Spike Curtis <spike@coder.com>

Rework truncation test to not assume OS buffers

e39e885

Signed-off-by: Spike Curtis <spike@coder.com>

Disable orphan test on Windows --- uses sh

8ec3d1f

Signed-off-by: Spike Curtis <spike@coder.com>

agent_test running SSH in pty use ptytest.Start

df424e6

Signed-off-by: Spike Curtis <spike@coder.com>

More detail about closing pseudoconsole on windows

50e3fec

Signed-off-by: Spike Curtis <spike@coder.com>

spikecurtis marked this pull request as ready for review

April 19, 2023 07:13

spikecurtis requested review fromdeansheather,kylecarbs andmafredri

April 19, 2023 07:13

mafredri reviewed

Apr 19, 2023

View reviewed changes

Copy link

Member

mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thanks for digging into this! I was aware of the method by which it worked but seems did a poor job explaining it... not to mention being brave enough to venture this deep into PTYs. 😄

I did a little testing locally and couldn't immediately find any issues with this approach, so I'm hoping there won't be any edge-cases with processes closing their inputs/outputs while still running.

Had a few remarks but all-in-all, looks good, good job!

pty/pty_windows.goShow resolvedHide resolved

pty/pty_windows.go OutdatedShow resolvedHide resolved

pty/pty_windows.go

		// may have already closed the PseudoConsole when the command exited, so that
		// output reads can get to EOF. In that case, we don't need to close it
		// again here.
		ifp.console!=windows.InvalidHandle {

Copy link

Member

mafredriApr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Assigned both here and in waitInternal, probably needs protection.

Copy link

ContributorAuthor

spikecurtisApr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Protected by thecloseMutex

Copy link

Member

mafredriApr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Ah right, thep.pw threw me off. I noticedResize is not protected though!

pty/ptytest/ptytest.go OutdatedShow resolvedHide resolved

pty/ptytest/ptytest.go Outdated

		func (p*PTY)Close()error {
		p.t.Helper()
		pErr:=p.PTY.Close()
		eErr:=p.outExpecter.close("close")

Copy link

Member

mafredriApr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

A lot of the ceremony inclose is due to closingpty, which is no longer done there. We should probably simplify it and make it less spammy. (The theory was that we might be hanging in some of the cleanup, resulting in weird flakes, but that has not been the case.)

Copy link

ContributorAuthor

spikecurtisApr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

TheoutExpecter.close() doesn't close the pty itself, but it does monitor the cleanup of the various goroutines we created to log the output and run expectations against it. I've left that part intact as my change is meant to leaveptytest functionally equivalent to what it was before.

pty/start_test.go OutdatedShow resolvedHide resolved

pty/start_test.go

		b:=make([]byte,1)
		gofunc() {
		_,err:=r.Read(b)
		readErrs<-err

Copy link

Member

mafredriApr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Suggested change

	readErrs<-err
	select {
	casereadErrs<-err:
	default:
	}

Copy link

ContributorAuthor

spikecurtisApr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't think this is needed.

The whole purpose of this little goroutine is to allow me to callRead() but still be able to timeout in this function if the context expires. So I read in a goroutine, thenselect whether the read completed before the context expires. Thechan error is buffered, so even if the context expires, the goroutine will be able to finish onceRead() returns. Note that an expired context callsreturn so there can only be one read goroutine started at a time.

pty/pty_other.go

		}

		func (p*otherPty)OutputReader() io.Reader {
		return&ptmReader{p.pty}

Copy link

Member

mafredriApr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It's unfortunate we can't pass the*os.File here sinceio.Copy can sometimes do some slightly more performant operations with it. (That's more of a nice-to-have though, so this is fine.)

Copy link

ContributorAuthor

spikecurtisApr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't think that applies for copyingfrom an*os.File.io.Copy() looks forWriteTo() on the source,which*os.File does not implement.

It does apply when you are copyingto an*os.File, since it implementsReadFrom() ---InputWriter() still returns the*os.File directly, so we can take advantage of that optimization.

agent/agentssh/agentssh.go

		typereadNopCloserstruct{ io.Reader }
		// ptySession is the interface to the ssh.Session that startPTYSession uses
		// we use an interface here so that we can fake it in tests.
		typeptySessioninterface {

Copy link

Member

mafredriApr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Is there a reason you chose a stripped down interface vs usingssh.Session directly? I see the test but we also have a fake context there with all the methods so I'm not seeing the benefit per-se. I feel this creates a bit of needless indirection. Not critical to change now, I'll see if I make some changes down the line as I do some refactoring for the session handling anyway.

Copy link

ContributorAuthor

spikecurtisApr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

ssh.Session is an even biggerinterface!

I couldn't see any way to directly instantiate any of the concrete implementations in likegliderlabs.

My first stab at the test I wanted to do created the whole server like the other ssh tests do, but I found the even after callingClose() on the network connection, the session context in the handler wasn't closed quickly. So, this construction narrows down the bigssh.Session interface into just what we need, and allows me to more precisely control when the context expires in tests.

Copy link

ContributorAuthor

spikecurtisApr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It's a bit annoying that I have to mock out the context methods, but unfortunately,ssh.Session is defined to have a methodContext() that returnsssh.Context rather thancontext.Context.

Unfortunately the go type-checker isn't smart enough to understand thatContext() ssh.Contextmust also satisfyContext() context.Context sincecontext.Context is a strict subset ofssh.Context. Alas.

deansheather approved these changes

Apr 19, 2023

View reviewed changes

agent/agent_test.go OutdatedShow resolvedHide resolved

go.mod OutdatedShow resolvedHide resolved

pty/pty_other.go OutdatedShow resolvedHide resolved

pty/pty_windows.go

		// if we already have an error from the command, prefer that error
		// but if the command succeeded and closing the PseudoConsole fails
		// then record that error so that we have a chance to see it
		p.cmdErr=err

Copy link

Member

deansheatherApr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It'd be nice to log the error whether it's kept or discarded

Copy link

ContributorAuthor

spikecurtisApr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It would be nice, but I don't have a logger, and I'd have to change all the init signatures to get one, so I decided on this.

Another alternative I considered is panicking---something pretty messed up at the OS level is happening if we get an error here. We're pretty much guaranteed to crash the agent if we panic, so it's likely we'd get told about it.

pty/start_test.goShow resolvedHide resolved

spikecurtis added2 commits

April 20, 2023 11:07

Code review fixes

c09083e

Signed-off-by: Spike Curtis <spike@coder.com>

Rearrange ptytest method order

439107d

Signed-off-by: Spike Curtis <spike@coder.com>

spikecurtis requested a review frommafredri

April 20, 2023 11:15

mafredri reviewed

Apr 20, 2023

View reviewed changes

pty/start_windows.go OutdatedShow resolvedHide resolved

mafredri approved these changes

Apr 20, 2023

View reviewed changes

Copy link

Member

mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The windows Resize method still needs protection on readingp.console and I'm not sold on theoutExpecter naming (as commented), but won't block the PR for that.

Once the the resize is fixed and the test pass, LGTM.

spikecurtis added2 commits

April 21, 2023 05:21

Protect pty.Resize on windows from races

c6a3229

Signed-off-by: Spike Curtis <spike@coder.com>

Fix windows bugs

aa94546

Signed-off-by: Spike Curtis <spike@coder.com>

kylecarbs reviewed

Apr 21, 2023

View reviewed changes

Copy link

Member

kylecarbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Feel free to defer my reviews to Dean and Mathias.

I'mextremely happy you dug into this!

pty/pty.goShow resolvedHide resolved

spikecurtis added3 commits

April 24, 2023 06:06

PTY doesn't extend PTYCmd

0f07cb9

Signed-off-by: Spike Curtis <spike@coder.com>

merge branch 'main' ofhttps://github.com/coder/coderinto spike/pty-ssh

eaabb3a

Fix windows types

50060d8

Signed-off-by: Spike Curtis <spike@coder.com>

spikecurtis merged commitdaee91c intomain

Apr 24, 2023

spikecurtis deleted the spike/pty-ssh branch

April 24, 2023 10:53

github-actionsbot locked and limited conversation to collaborators

Apr 24, 2023

Labels

None yet

Movatterモバイル変換

refactor: PTY & SSH#7100

refactor: PTY & SSH#7100

Uh oh!

Conversation

spikecurtis commentedApr 12, 2023• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Linux/macOS

Windows

Other features

Uh oh!

mafredri left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mafredri left a comment

Choose a reason for hiding this comment

Uh oh!

kylecarbs left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

spikecurtis commentedApr 12, 2023•
edited
Loading