fix: Disable compression for websocket dRPC transport #145


Conversation

@bryphe-coder (Contributor) commented Feb 2, 2022 (edited)

There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing.

This is just tracking some experimentation to fix that race condition.

## Run results:

- Run 1: peer test failure
- Run 2: peer test failure
- Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45

  ```
  status code 412: The provided project history is running. Wait for it to complete importing!
  ```

- Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176

  ```
  workspacehistory_test.go:122:
      Error Trace: workspacehistory_test.go:122
      Error:       Condition never satisfied
      Test:        TestWorkspaceHistory/CreateHistory
  ```

- Run 5: peer failure
- Run 6: Pass ✅
- Run 7: Peer failure

## Open Questions:

### Is `dRPC` or `websocket` at fault for the data race?

It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]. This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it.

From `dRPC`'s perspective, the call stack looks like this:

- [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253)
  - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65)
    - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128)
      - with the data race finally happening here:

```go
// AppendFrame appends a marshaled form of the frame to the provided buffer.
func AppendFrame(buf []byte, fr Frame) []byte {
	...
	out := buf
	out = append(out, control)   // <---------
```

This should be fine, since `dRPC` created this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame.

Once `dRPC` is done writing, it _hangs onto the buffer and resets it here_: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73

However... the websocket implementation, once it gets the buffer, runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where we get our race.

In the case where the `byte[]` isn't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly).
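To make the failure mode concrete, here is a minimal, hypothetical Go sketch of that pattern - a writer that reuses its buffer, feeding a sink that mutates the slice in place. None of these names come from the actual coder/coder or drpc code; running it with `go run -race` should flag the conflicting accesses:

```go
package main

import (
	"fmt"
	"sync"
)

// reusingWriter mimics dRPC's writer: it appends each frame into the same
// backing buffer, hands that slice to a sink, then reuses the buffer.
type reusingWriter struct {
	buf  []byte
	sink func([]byte)
}

func (w *reusingWriter) write(payload []byte) {
	w.buf = append(w.buf[:0], payload...) // reset length, reuse backing array
	w.sink(w.buf)                         // the sink may keep mutating this slice!
}

func main() {
	var wg sync.WaitGroup
	w := &reusingWriter{
		sink: func(b []byte) {
			wg.Add(1)
			go func() {
				defer wg.Done()
				// Mimics a compress-in-place step like statelessDeflate:
				// it rewrites the buffer after write() has already returned.
				for i := range b {
					b[i] ^= 0xFF
				}
			}()
		},
	}
	w.write([]byte("frame-1"))
	w.write([]byte("frame-2")) // races with the goroutine still mutating frame-1
	wg.Wait()
	fmt.Println("run with `go run -race` to see the conflicting writes")
}
```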

### Why does cloning on `Read` fail?

We get a bunch of errors like:

```
2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0
2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF
2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF
2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0
```

# UPDATE:

We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operation would no longer be run. Trying that out now:

- Run 1: ✅
- Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338
- Run 3: ✅
- Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168
- Run 5: ✅
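For reference, disabling compression with the `nhooyr.io/websocket` library referenced above looks roughly like the sketch below. The handler wiring here is illustrative, not the actual diff in this PR; the relevant piece is the `CompressionMode` option:

```go
package main

import (
	"net/http"

	"nhooyr.io/websocket"
)

func main() {
	http.HandleFunc("/drpc", func(w http.ResponseWriter, r *http.Request) {
		conn, err := websocket.Accept(w, r, &websocket.AcceptOptions{
			// Disabling compression means the websocket never runs the
			// in-place deflate over buffers that dRPC is still reusing.
			CompressionMode: websocket.CompressionDisabled,
		})
		if err != nil {
			return
		}
		defer conn.Close(websocket.StatusInternalError, "")
		// ... hand conn off to the dRPC transport here ...
	})
	_ = http.ListenAndServe(":8080", nil)
}
```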

@codecov codecov bot commented Feb 2, 2022 (edited)

Codecov Report

Merging #145 (eef4d3a) into provisionerdaemon (2a76b60) will decrease coverage by 0.21%.
The diff coverage is 100.00%.

Impacted file tree graph

```diff
@@                  Coverage Diff                  @@
##           provisionerdaemon     #145      +/-   ##
=====================================================
- Coverage              68.04%   67.83%   -0.22%
=====================================================
  Files                    101      101
  Lines                   5095     5098       +3
  Branches                  68       68
=====================================================
- Hits                    3467     3458       -9
- Misses                  1324     1333       +9
- Partials                 304      307       +3
```
| Flag | Coverage Δ |
|------|------------|
| unittest-go-macos-latest | 64.59% <100.00%> (-0.22%) ⬇️ |
| unittest-go-ubuntu-latest | 66.89% <100.00%> (-0.07%) ⬇️ |
| unittest-go-windows-latest | 64.30% <100.00%> (-0.08%) ⬇️ |
| unittest-js | 64.92% <ø> (ø) |
| Impacted Files | Coverage Δ |
|----------------|------------|
| coderd/provisionerdaemons.go | 48.56% <100.00%> (-0.31%) ⬇️ |
| peerbroker/dial.go | 76.19% <0.00%> (-4.77%) ⬇️ |
| peer/channel.go | 84.75% <0.00%> (-2.44%) ⬇️ |
| peer/conn.go | 80.36% <0.00%> (-0.26%) ⬇️ |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2a76b60...eef4d3a. Read the comment docs.

@bryphe-coder self-assigned this Feb 2, 2022
Comment on lines 59 to 61:

```go
cpy := make([]byte, len(p))
copy(cpy, p)
return c.conn.Write(cpy)
```

You can do one-line slice copies as well!

Suggested change:

```diff
-cpy := make([]byte, len(p))
-copy(cpy, p)
-return c.conn.Write(cpy)
+return c.conn.Write(append(([]byte)(nil), p...))
```
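Both forms allocate a fresh slice, so the websocket layer can mutate its copy without racing against the caller's buffer. A tiny standalone check (illustrative only, not code from this PR):

```go
package main

import (
	"bytes"
	"fmt"
)

func main() {
	p := []byte("payload")

	// Two-line form from the original diff.
	cpy := make([]byte, len(p))
	copy(cpy, p)

	// One-line form from the suggestion.
	oneLiner := append([]byte(nil), p...)

	fmt.Println(bytes.Equal(cpy, oneLiner)) // true: same contents
	fmt.Println(&p[0] != &oneLiner[0])      // true: distinct backing arrays
}
```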

@bryphe-coder changed the title from "experiment: Clone bytes on write between provisionerdaemon <-> websocket" to "fix: Disable compression for websocket dRPC transport" on Feb 2, 2022
@bryphe-coder marked this pull request as ready for review on February 2, 2022 21:31
@bryphe-coder merged commit 03ed951 into provisionerdaemon on Feb 2, 2022
@bryphe-coder deleted the bryphe/provisionerdaemon/experiment-race branch on February 2, 2022 21:32
kylecarbs added a commit that referenced this pull request on Feb 3, 2022:
* feat: Add history middleware parameters

  These will be used for streaming logs, checking status, and other operations related to workspace and project history.

* refactor: Move all HTTP routes to top-level struct

  Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming.

* Add provisioner daemon routes
* Add periodic updates
* Skip pubsub if short
* Return jobs with WorkspaceHistory
* Add endpoints for extracting singular history
* The full end-to-end operation works
* fix: Disable compression for websocket dRPC transport (#145)
* fix: Remove race condition with acquiredJobDone channel (#148)

  Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83

  __Issue:__ There is a race on the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs.

  __Fix:__ Instead of recreating the channel every time, we can use a `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up.

* fix: Bump up workspace history timeout (#149)

  This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32

  Looking at the timing of the test:

  ```
  t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply
  t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running
  t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running
  t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running
  t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running
  t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running
  t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running
  workspacehistory_test.go:122:
      Error Trace: workspacehistory_test.go:122
      Error:       Condition never satisfied
      Test:        TestWorkspaceHistory/CreateHistory
  ```

  It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout.

  Note that in the setup for this test there is a similar project history wait that waits for 15s, so I borrowed that here.

  In the future, we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that.

Co-authored-by: Bryan <bryan@coder.com>
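The `sync.WaitGroup` pattern described in the #148 fix above can be sketched roughly as follows. These are hypothetical names, not the actual provisionerd code: close() waits for any in-flight job instead of waiting on a channel that acquireJob may concurrently recreate:

```go
package main

import "sync"

type provisioner struct {
	mu          sync.Mutex
	closed      bool
	activeJobWg sync.WaitGroup
}

func (p *provisioner) acquireJob(run func()) {
	p.mu.Lock()
	if p.closed {
		p.mu.Unlock()
		return
	}
	p.activeJobWg.Add(1) // register the job before releasing the lock
	p.mu.Unlock()

	go func() {
		defer p.activeJobWg.Done()
		run()
	}()
}

func (p *provisioner) close() {
	p.mu.Lock()
	p.closed = true // no new jobs can be acquired past this point
	p.mu.Unlock()
	p.activeJobWg.Wait() // wait for the current job to wrap up
}

func main() {
	p := &provisioner{}
	p.acquireJob(func() { /* do work */ })
	p.close()
}
```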

Reviewers

@kylecarbs approved these changes
@coadler left review comments


3 participants

@bryphe-coder @coadler @kylecarbs
