This repository was archived by the owner on Jan 23, 2023. It is now read-only.
dotnet/corefx (Public archive)

Optimize DeflateStream.CopyToAsync #13184

Merged

Conversation

@stephentoub (Member)

Today, CopyToAsync on DeflateStream is the basic read/write loop. However, all CopyToAsync really needs to do on DeflateStream is transform the source data (via a synchronous operation) and pass it along to the destination. This means that we can improve its implementation by effectively having the source stream CopyToAsync to the destination stream, albeit with that transformation added in the middle. Doing this has three key benefits:

  • It eliminates the read operations and the associated copy, allowing the source to effectively push directly to the inflater inside DeflateStream.
  • It takes advantage of any optimizations the source stream has in its CopyToAsync implementation, e.g. MemoryStream simply passing its whole internal buffer, NetworkStream using a SocketAsyncEventArgs (Two Socket/NetworkStream-related optimizations #12664), FileStream reusing its async state (Optimize overlapped I/O FileStream.CopyToAsync implementation on Windows #11569), etc.
  • CopyToAsync is typically used when no other reading has been done from the stream. Since this change causes CopyToAsync to bypass the internal buffer used by DeflateStream before the inflater itself, we can avoid allocating that buffer at all if CopyToAsync is used by itself. This avoids an 8K allocation per DeflateStream instance, and even if we switch to pulling this buffer from ArrayPool, it still avoids needing to rent/return with the ArrayPool in these cases.
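The wrapping-Stream approach described above can be sketched as follows. This is a hypothetical, simplified standalone version, not the actual CopyToAsyncStream from this PR; the name TransformingWriteStream and the transform delegate (which stands in for the synchronous inflation step) are illustrative:

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch: wrap the destination in a Stream whose WriteAsync applies a
// synchronous transform (inflation, in DeflateStream's case) and forwards the result,
// then let the source stream's own CopyToAsync drive the copy by pushing into it.
public sealed class TransformingWriteStream : Stream
{
    private readonly Stream _destination;
    private readonly Func<byte[], int, int, byte[]> _transform; // stands in for the inflater

    public TransformingWriteStream(Stream destination, Func<byte[], int, int, byte[]> transform)
    {
        _destination = destination;
        _transform = transform;
    }

    public override Task WriteAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken)
    {
        // Transform the chunk the source just pushed, then pass it along.
        byte[] output = _transform(buffer, offset, count);
        return _destination.WriteAsync(output, 0, output.Length, cancellationToken);
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        byte[] output = _transform(buffer, offset, count);
        _destination.Write(output, 0, output.Length);
    }

    // The source's CopyToAsync only needs the write side; everything else is disabled.
    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => true;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() => _destination.Flush();
    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}
```

The source stream then drives the whole copy, e.g. `await source.CopyToAsync(new TransformingWriteStream(destination, transform));`, so any CopyToAsync optimization the source has is used automatically.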

(I also noticed an unnecessary allocation in Inflater, which I removed.)

The net effect of this change is a sizeable boost in throughput for many scenarios (I didn't find any scenarios where it regressed). Here are a few examples. In this table, "MS" is a memory stream, so "MS to MS" is using a DeflateStream created around a compressed MemoryStream and using CopyToAsync on the DeflateStream to decompress it to another MemoryStream. "CMS" is a "CustomMemoryStream", which is simply a type that derives from MemoryStream and doesn't add or override any logic; this derivation defeats some of the CopyToAsync optimization in MemoryStream, so this is essentially the worst case for this DeflateStream optimization. And "FS" is FileStream, where a "(true)" suffix means useAsync==true (in which case FileStream has its own CopyToAsync implementation) and a "(false)" suffix means useAsync==false (in which case it essentially delegates back to the base Stream's CopyToAsync). "Len" is the length of the original random input in bytes, and "Iters" is the number of iterations (the number of times positions were reset and CopyToAsync was called).

| Test | Before (s) | After (s) |
| --- | --- | --- |
| MS to MS (Len=1024 Iters=1024000) | 2.767149 | 1.601566 |
| MS to MS (Len=1048576 Iters=1000) | 0.419615 | 0.139993 |
| MS to MS (Len=10485760 Iters=100) | 0.358287 | 0.223764 |
| CMS to CMS (Len=1024 Iters=1024000) | 2.832110 | 1.963891 |
| CMS to CMS (Len=1048576 Iters=1000) | 0.271432 | 0.202536 |
| CMS to CMS (Len=10485760 Iters=100) | 0.347356 | 0.316338 |
| FS(True) to FS(True) (Len=1024 Iters=102400) | 10.315941 | 6.909713 |
| FS(True) to FS(True) (Len=1048576 Iters=100) | 0.940410 | 0.124353 |
| FS(True) to FS(True) (Len=10485760 Iters=10) | 1.000786 | 0.152991 |
| FS(True) to FS(False) (Len=1024 Iters=102400) | 9.936322 | 6.967522 |
| FS(True) to FS(False) (Len=1048576 Iters=100) | 0.705946 | 0.085895 |
| FS(True) to FS(False) (Len=10485760 Iters=10) | 0.925196 | 0.159865 |
| FS(False) to FS(True) (Len=1024 Iters=102400) | 8.630263 | 6.699364 |
| FS(False) to FS(True) (Len=1048576 Iters=100) | 0.880577 | 0.116051 |
| FS(False) to FS(True) (Len=10485760 Iters=10) | 0.862978 | 0.142838 |
| FS(False) to FS(False) (Len=1024 Iters=102400) | 5.260478 | 4.388572 |
| FS(False) to FS(False) (Len=1048576 Iters=100) | 0.346608 | 0.078440 |
| FS(False) to FS(False) (Len=10485760 Iters=10) | 0.432393 | 0.157312 |

Fixes https://github.com/dotnet/corefx/issues/11571
Contributes to https://github.com/dotnet/corefx/issues/12549
cc: @ianhays, @JeremyKuhne, @geoffkizer, @benaadams, @davidfowl

(@ianhays, please pay careful attention to my use of the inflater. I essentially copied what was in ReadAsync and refactored it, and I want to make sure I'm using it correctly / didn't misunderstand anything about its behavior.)

return new CopyToAsyncStream(this, destination, ArrayPool<byte>.Shared.Rent(bufferSize), cancellationToken).CopyFromSourceToDestination();
}

private sealed class CopyToAsyncStream : Stream
Contributor

What is the purpose of having this be a stream and not just some helper function? Is this CopyToAsyncStream getting returned somewhere that I don't see?

@stephentoub (Member, Author), Oct 31, 2016 (edited)

It's getting passed as the destination argument to the wrapped Stream's CopyToAsync:
https://github.com/dotnet/corefx/pull/13184/files#diff-35ce8d1b7a1f37dccabcc9a644648242R738

}
}

public override async Task WriteAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken)
Contributor

Where is this being called? The only write I see is on the destination stream on line 732.

Member (Author)

This instance gets passed as the destination argument to the wrapped stream's CopyToAsync, so that wrapped stream's CopyToAsync pushes data to this instance via this WriteAsync. That's why this is a Stream ;)

Contributor

nvm, copyTo on 738

Contributor

Beat me by 8 seconds on the response :)

Member (Author)

😄

if (bytesRead > _arrayPoolBufferHighWaterMark) _arrayPoolBufferHighWaterMark = bytesRead;
await _destination.WriteAsync(_arrayPoolBuffer, 0, bytesRead, cancellationToken).ConfigureAwait(false);
}
else break;
Contributor

Instead of breaking here, we should first check if _deflateStream._inflater.Finished == false, like we do in ReadAsync, to handle the case where bytesRead == 0 but the inflation is not yet complete. This can happen in the rare case where the header is exactly the size of the inflation buffer.

Member (Author)

Interesting. Ok, I'll fix that. Thanks.

Member (Author)

Spoke about this offline with @ianhays; we think it's fine as-is, but he's putting a test case together.

// a buffer for storing the inflated output and passing along to the destination stream, and whatever
// buffer the source stream creates in its CopyToAsync implementation. As such, we halve the buffer size,
// to split it across the two.
bufferSize = Math.Max(1, bufferSize / 2);


I understand the reasoning here but I'm not sure it's what a user expects.

I guess the question is, if I specify a buffer of size N, do I expect that N is a bound on the memory used, or do I expect that the copy happens in chunks of N (when possible), regardless of the memory this would use?

Contributor

Another thing to note is that in the CopyTo ArrayPool helper we were using before this PR we did not do any halving of the bufferSize, so we did a ReadAsync(..bufferSize) followed by a WriteAsync(...bufferSize).

Personally I would expect a CopyToAsync(bufferSize: 10) to complete in one Read/Write call when there are only 10 bytes in the source stream.

Member (Author)

Fair question. My primary concern is around the case where the buffer pool is exhausted. If there are plenty of available buffers in the pool, then using 2 instead of 1 at the requested size isn't that big a deal. But if there aren't any buffers in the pool, we're going to now end up allocating 2 buffers where we previously would have only allocated 1. I figured we could at least allocate no more in total size than the user requested in such cases.

I don't have enough data to really argue one way or the other, though. If you think I should remove this and just go with the original bufferSize, I can.

Member (Author)

> Personally I would expect a CopyToAsync(bufferSize: 10) to complete in one Read/Write call when there are only 10 bytes in the source stream.

Given that DeflateStream changes the amount of data as part of the transformation, you generally can't expect 10 bytes of input with a 10 byte buffer will result in a single write to the output stream. But I get the general point.

I'll remove the halving. We can add it later if this looks like an issue.
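The size-changing point above is easy to demonstrate. This is an illustrative standalone demo (the SizeChangeDemo name and the Random seed are assumptions, not from the PR): deflating 10 bytes of random input yields more than 10 bytes of output, because random data is essentially incompressible and deflate adds block framing.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Illustrative demo: deflate changes the data size, so N input bytes do not map to
// N output bytes, and a size-N buffer cannot guarantee a single write on the far
// side of the transformation.
public static class SizeChangeDemo
{
    public static long CompressedLength(byte[] input)
    {
        using (var compressed = new MemoryStream())
        {
            using (var deflate = new DeflateStream(compressed, CompressionMode.Compress, leaveOpen: true))
            {
                deflate.Write(input, 0, input.Length);
            } // disposing flushes the final deflate block
            return compressed.Length;
        }
    }

    public static void Main()
    {
        var input = new byte[10];
        new Random(42).NextBytes(input); // random bytes are essentially incompressible

        // Block framing plus near-verbatim storage makes the output longer than the input.
        Console.WriteLine($"{input.Length} bytes in -> {CompressedLength(input)} bytes out");
    }
}
```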

@ianhays (Contributor) left a comment

These changes look great, @stephentoub! Minus the one edge case around the Finished check, this LGTM.

bufferSize = Math.Max(1, bufferSize / 2);

// Do the copy
return new CopyToAsyncStream(this, destination, ArrayPool<byte>.Shared.Rent(bufferSize), cancellationToken).CopyFromSourceToDestination();


Very minor nit: seems like it would be better to rent the ArrayPool in CopyToAsyncStream; that way it's more cleanly encapsulated if we ever wanted to change to a different buffer pool implementation.

Member (Author)

Ok. We can rent it in the CopyToAsyncStream ctor, so I'll change it to do that. If we instead rent it in CopyFromSourceToDestination, which would lead to the best symmetry/encapsulation, we'd unfortunately pay a bit more for it, as we'd end up spending a field for each of the bufferSize and the rented byte[].


Couldn't you pass bufferSize to CopyFromSourceToDestination?

Member (Author)

Couldn't you pass bufferSize to CopyFromSourceToDestination?

CopyFromSourceToDestination is an async method. Any arguments to that method will end up on its state machine object.


Ah, good point.

private void IncrementAsyncOperations()
{
    int newCount = Interlocked.Increment(ref _asyncOperations);
    Debug.Assert(newCount == 1, $"DeflateStream was misused, with multiple pending async operations");


I do wonder why this isn't an exception, if we're going to specifically check for it. (Or if we only want to check for this in debug, then the count should be debug-only too.)

Member (Author)

I wonder that as well, but I didn't want to mess with that as part of this change; I just encapsulated the check into its own method so that we can more easily do such things in the future (it also helps work around a compiler warning regarding passing a ref on a MarshalByRefObject). There are earlier checks for whether _asyncOperations != 0, throwing an exception in such cases... I don't know why those checks aren't just based on these atomic operations so as to avoid race conditions, but such a change now would be a small breaking change, as it could impact which exceptions get thrown in very niche corner cases. Probably not something we should care too much about, but definitely something to be left for later.
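For context on the atomic-versus-racy distinction discussed here, a minimal sketch of an Interlocked-based guard (AsyncOperationGuard is an illustrative name, not corefx code): only the caller that performs the 0 -> 1 transition proceeds, whereas a plain read of the counter followed by a separate increment leaves a window where two concurrent operations could both pass the check.

```csharp
using System;
using System.Threading;

// Illustrative only: a guard that allows at most one pending async operation,
// enforced atomically rather than with a separate read-then-increment.
public sealed class AsyncOperationGuard
{
    private int _asyncOperations;

    public void Enter()
    {
        // Atomic: exactly one caller can ever observe the 0 -> 1 transition.
        if (Interlocked.Increment(ref _asyncOperations) != 1)
        {
            Interlocked.Decrement(ref _asyncOperations); // undo our increment
            throw new InvalidOperationException("Only one pending async operation is allowed.");
        }
    }

    public void Exit() => Interlocked.Decrement(ref _asyncOperations);
}
```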

@geoffkizer

Overall this looks great; some minor nits above.

@stephentoub (Member, Author)

@dotnet-bot Test Innerloop OSX Release Build and Test please (#12753)

@stephentoub (Member, Author)

Thanks for the reviews, @ianhays and @geoffkizer.

@stephentoub merged commit ac0f1c9 into dotnet:master on Nov 1, 2016
@davidfowl (Member)

👏

@stephentoub deleted the deflatestream_copytoasync branch on November 2, 2016
@karelz modified the milestone: 1.2.0, Dec 3, 2016
Reviewers

@ianhays approved these changes
@geoffkizer left review comments

Assignees

@stephentoub

Milestone

2.0.0

Successfully merging this pull request may close these issues:

Investigate optimizing DeflateStream.CopyToAsync

6 participants

@stephentoub @geoffkizer @davidfowl @ianhays @karelz @dnfclas
