- Notifications
You must be signed in to change notification settings - Fork5.2k
Optimize percent-encoded UTF8 processing in Uri#32552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Optimize percent-encoded UTF8 processing in Uri#32552
Uh oh!
There was an error while loading.Please reload this page.
Conversation
src/libraries/System.Private.Uri/src/System/PercentEncodingHelper.cs OutdatedShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
| "http://con.tosoco.ntosoc.com/abcdefghi/jk"+"%C8%F3%B7%A2%B7%BF%B2%FA", | ||
| resultUri.ToString()); | ||
| resultUri.ToString(), | ||
| StringComparer.OrdinalIgnoreCase); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This is an example of where the output would be different - the casing from the input would be preserved instead of upper-casing the hex chars
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
| { | ||
| value=(value|0x20)-'a'+10; | ||
| } | ||
| elseif((value-'8')<=('9'-'8')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
You could use your helpers in the other file for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I manually inlined it here, as we are only interested in non-ascii ones
| { | ||
| value=((value<<4)+second)-'0'; | ||
| } | ||
| elseif((uint)((second-'A')&~0x20)<=('F'-'A')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
And other places in here.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
src/libraries/System.Private.Uri/src/System/PercentEncodingHelper.cs OutdatedShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
GrabYourPitchforks commentedFeb 20, 2020
PR generally looks good, thanks! :) Waiting for the re-introduction of the removed tests in the meantime. |
MihaZupan commentedFeb 20, 2020 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
I should note that the current behavior of upper-casing hex is only done for non-ascii characters. If we only have Ascii, the input-casing is preserved, so the behavior is the same for Ascii and non-ascii after this change. |
Uh oh!
There was an error while loading.Please reload this page.
| dest.Append(pInput[i++]); | ||
| dest.Append(pInput[i++]); | ||
| dest.Append(pInput[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
EachAppend will perform a bounds check. Can you use theAppend(char *, int) overload here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
That overload will do a Span slice as well. I was thinking of addingAppend(char, char) andAppend(char, char, char) overloads as they can have a measurable perf impact (that would be part of a separate PR adressing#22903).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Append(char, char) andAppend(char, char, char) seem quite awkward API choices, IMHO. Maybe make theAppend(char *, int) overload not create a Span slice?
src/libraries/System.Private.Uri/tests/FunctionalTests/PercentEncodingHelperTests.cs OutdatedShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
src/libraries/System.Private.Uri/tests/FunctionalTests/PercentEncodingHelperTests.cs OutdatedShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
src/libraries/System.Private.Uri/tests/FunctionalTests/PercentEncodingHelperTests.cs OutdatedShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
MihaZupan commentedMar 13, 2020
@stephentoub@GrabYourPitchforks @dotnet/ncl |
src/libraries/System.Private.Uri/src/System/PercentEncodingHelper.cs OutdatedShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
src/libraries/System.Private.Uri/src/System/PercentEncodingHelper.cs OutdatedShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
| if(Rune.DecodeFromUtf8(fourByteSpan.Slice(0,bytesLeftInBuffer),outRunerune,outbytesConsumed)==OperationStatus.Done) | ||
| { | ||
| Debug.Assert(bytesConsumed>=2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
It is theoretically possible to violate this assertion. I'm having trouble following the logic of this method so I don't know if an assertion violation will end up possibly buffer overrunning or otherwise having a negative effect.
Entry to method:char* input = "%FA%FB%00"fourByteBuffer = 0bytesLeftInBuffer = 0totalCharsConsumed = 0charsToCopy = 0bytesConsumed = 0<after a few rounds of ReadByteFromInput>fourByteBuffer = 0xFBFA0000 (buffer = [ 00 00 FA FB ])bytesLeftInBuffer = 2<go to NoMoreOrInvalidInput>fourByteBuffer = 0x0000FBFA (buffer = [ FA FB 00 00 ])bytesLeftInBuffer = 2<go to DecodeRune>DecodeFromUtf8 = InvalidDatabytesConsumed = 1charsToCopy = 3<go to AfterDecodeRune>bytesLeftInBuffer = 1totalCharsConsumed = 3## meanwhile, another thread changes 'input' to read "%FA%FB%FC" ##<go to RefillBuffer>i = 3 + (1* 3) = 6<go to ReadByteFromInput>fourByteBuffer = 0xFC0000FB (buffer = [ FB 00 00 FC ])bytesLeftInBuffer = 2<go to NoMoreOrInvalidInput>## recall: bytesConsumed is still 1 from earlier DecodeRune callfourByteBuffer = 0x00FC0000 (buffer = [ 00 00 FC 00 ])bytesLeftInBuffer = 2<go to DecodeRune>DecodeFromUtf8 = DonebytesConsumed = 1Debug.Assert(bytesConsumed >= 2); // assertion would fail but not present in release branchdest.Append(input + 3 - 3, 3); // copy 3 charscharsToCopy = 0charsToCopy = 3<go to AfterDecodeRune>...
MihaZupanMay 12, 2020 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I believe there would be no negative side-effects in this case. While the assert would fail if present, we rely onRune.DecodeFromUtf8 to always consume at least 1 byte in each iteration, so theloop will eventually exit.bytesConsumed being 1 instead of>= 2 will not impact memory accesses into the input.
I added a comment to this assert indicating the scenario under which it may fail.
In general, is the scenario of a string instance being modified after creation something that we should be/are considering? I would be surprised if there aren't other APIs across runtime that make the assumption of string immutability with worse side-effects.
MihaZupan commentedMay 12, 2020
The recent changes improve the throughput by another 10-20% depending on input.
|
stephentoub commentedJun 20, 2020
@MihaZupan, what's the next step here? Are you waiting for reviews? Is it ready to merge? |
stephentoub commentedSep 14, 2020
@GrabYourPitchforks, I assume no response means you're good with this now? Thanks. |
stephentoub commentedOct 5, 2020
@MihaZupan, can you rebase this to resolve the conflicts? Thanks. |
70606d0 to2a78dc8Comparedanmoseley commentedOct 12, 2020
Noticed this is the oldest Microsoft PR in the repo now. @GrabYourPitchforks any remaining feedback or is this ready to go? |
danmoseley commentedOct 12, 2020
@scalablecory is your feedback addressed? Trying to get our 90th percentile PR age down.. |
MihaZupan commentedDec 7, 2020
Test failures are unrelated. Sent an email to@GrabYourPitchforks about finalizing the review here, otherwise I will be closing the PR until we can get someone to take a look at it. |
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
src/libraries/System.Private.Uri/src/System/PercentEncodingHelper.cs OutdatedShow resolvedHide resolved
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
When unescaping percent-encoded non-ascii we currently:
This PR changes it into performing a single pass, writing to the ValueStringBuilder without temporary buffers.
Currently there is a behavioral change where before all hex characters would be upper-cased, now their input-casing is preserved. Keeping the old behavior is a trivial change with a bit of a perf penalty.
I should note that the current behavior of upper-casing hex is only done for non-ascii characters. If we only have Ascii, the input-casing is preserved, so the behavior is the same for Ascii and non-ascii after this change.
Perf goes up significantly whenever this unescaping path is hit
(The allocation win is hit whenever there is a single non-ascii char in the input)
Updated benchmarks:#32552 (comment)