Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Optimize percent-encoded UTF8 processing in Uri#32552

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged

Conversation

@MihaZupan
Copy link
Member

@MihaZupanMihaZupan commentedFeb 19, 2020
edited
Loading

When unescaping percent-encoded non-ascii we currently:

  1. allocate a byte[] buffer
  2. decode the entire hex encoded uri into bytes
  3. allocate a char[] buffer
  4. convert the bytes into chars via Utf8Encoding
  5. analyze both buffers to see if any characters/bytes were skipped by converting chars to Runes to Utf8 bytes and comparing

This PR changes it into performing a single pass, writing to the ValueStringBuilder without temporary buffers.

Currently there is a behavioral change where before all hex characters would be upper-cased, now their input-casing is preserved. Keeping the old behavior is a trivial change with a bit of a perf penalty.

I should note that the current behavior of upper-casing hex is only done for non-ascii characters. If we only have Ascii, the input-casing is preserved, so the behavior is the same for Ascii and non-ascii after this change.

Perf goes up significantly whenever this unescaping path is hit
(The allocation win is hit whenever there is a single non-ascii char in the input)

MethodToolchainMeanRatioGen 0Allocated
NewUri_Chinese\clean\CoreRun.exe11,644.9 ns1.571.28175384 B
NewUri_Chinese\new\CoreRun.exe7,422.8 ns1.000.2136920 B
UnescapeDataString_Chinese\clean\CoreRun.exe9,514.7 ns2.241.09864664 B
UnescapeDataString_Chinese\new\CoreRun.exe4,245.9 ns1.000.0763344 B
UnescapeDataString_Chinese_Short\clean\CoreRun.exe1,402.5 ns3.030.1545656 B
UnescapeDataString_Chinese_Short\new\CoreRun.exe462.7 ns1.000.014864 B
UnescapeDataString_Emoji\clean\CoreRun.exe53,014.0 ns2.629.521540072 B
UnescapeDataString_Emoji\new\CoreRun.exe20,259.3 ns1.000.94604024 B

Updated benchmarks:#32552 (comment)

itsamelambda reacted with thumbs up emoji
@MihaZupanMihaZupan added NO-MERGEThe PR is not ready for merge yet (see discussion for detailed reasons) area-System.Net labelsFeb 19, 2020
@MihaZupanMihaZupan added this to the5.0 milestoneFeb 19, 2020
"http://con.tosoco.ntosoc.com/abcdefghi/jk"+"%C8%F3%B7%A2%B7%BF%B2%FA",
resultUri.ToString());
resultUri.ToString(),
StringComparer.OrdinalIgnoreCase);
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This is an example of where the output would be different - the casing from the input would be preserved instead of upper-casing the hex chars

GrabYourPitchforks reacted with thumbs up emoji
{
value=(value|0x20)-'a'+10;
}
elseif((value-'8')<=('9'-'8'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

You could use your helpers in the other file for this.

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I manually inlined it here, as we are only interested in non-ascii ones

{
value=((value<<4)+second)-'0';
}
elseif((uint)((second-'A')&~0x20)<=('F'-'A'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

And other places in here.

@GrabYourPitchforks
Copy link
Member

Removed these UTF8SequenceTests as the method doesn't exist anymore. Will add similar edge-case utf8 tests targeting PercentEncodingHelper to this PR later

PR generally looks good, thanks! :)

Waiting for the re-introduction of the removed tests in the meantime.

@MihaZupan
Copy link
MemberAuthor

MihaZupan commentedFeb 20, 2020
edited
Loading

I should note that the current behavior of upper-casing hex is only done for non-ascii characters. If we only have Ascii, the input-casing is preserved, so the behavior is the same for Ascii and non-ascii after this change.

Comment on lines +116 to +117
dest.Append(pInput[i++]);
dest.Append(pInput[i++]);
dest.Append(pInput[i]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

EachAppend will perform a bounds check. Can you use theAppend(char *, int) overload here?

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

That overload will do a Span slice as well. I was thinking of addingAppend(char, char) andAppend(char, char, char) overloads as they can have a measurable perf impact (that would be part of a separate PR adressing#22903).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Append(char, char) andAppend(char, char, char) seem quite awkward API choices, IMHO. Maybe make theAppend(char *, int) overload not create a Span slice?

@MihaZupanMihaZupan removed the NO-MERGEThe PR is not ready for merge yet (see discussion for detailed reasons) labelFeb 22, 2020
@MihaZupan
Copy link
MemberAuthor

@stephentoub@GrabYourPitchforks @dotnet/ncl
Any objections against merging this as-is? Otherwise please approve.


if(Rune.DecodeFromUtf8(fourByteSpan.Slice(0,bytesLeftInBuffer),outRunerune,outbytesConsumed)==OperationStatus.Done)
{
Debug.Assert(bytesConsumed>=2);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It is theoretically possible to violate this assertion. I'm having trouble following the logic of this method so I don't know if an assertion violation will end up possibly buffer overrunning or otherwise having a negative effect.

Entry to method:char* input = "%FA%FB%00"fourByteBuffer = 0bytesLeftInBuffer = 0totalCharsConsumed = 0charsToCopy = 0bytesConsumed = 0<after a few rounds of ReadByteFromInput>fourByteBuffer = 0xFBFA0000 (buffer = [ 00 00 FA FB ])bytesLeftInBuffer = 2<go to NoMoreOrInvalidInput>fourByteBuffer = 0x0000FBFA (buffer = [ FA FB 00 00 ])bytesLeftInBuffer = 2<go to DecodeRune>DecodeFromUtf8 = InvalidDatabytesConsumed = 1charsToCopy = 3<go to AfterDecodeRune>bytesLeftInBuffer = 1totalCharsConsumed = 3## meanwhile, another thread changes 'input' to read "%FA%FB%FC" ##<go to RefillBuffer>i = 3 + (1* 3) = 6<go to ReadByteFromInput>fourByteBuffer = 0xFC0000FB (buffer = [ FB 00 00 FC ])bytesLeftInBuffer = 2<go to NoMoreOrInvalidInput>## recall: bytesConsumed is still 1 from earlier DecodeRune callfourByteBuffer = 0x00FC0000 (buffer = [ 00 00 FC 00 ])bytesLeftInBuffer = 2<go to DecodeRune>DecodeFromUtf8 = DonebytesConsumed = 1Debug.Assert(bytesConsumed >= 2); // assertion would fail but not present in release branchdest.Append(input + 3 - 3, 3); // copy 3 charscharsToCopy = 0charsToCopy = 3<go to AfterDecodeRune>...

Copy link
MemberAuthor

@MihaZupanMihaZupanMay 12, 2020
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I believe there would be no negative side-effects in this case. While the assert would fail if present, we rely onRune.DecodeFromUtf8 to always consume at least 1 byte in each iteration, so theloop will eventually exit.bytesConsumed being 1 instead of>= 2 will not impact memory accesses into the input.
I added a comment to this assert indicating the scenario under which it may fail.

In general, is the scenario of a string instance being modified after creation something that we should be/are considering? I would be surprised if there aren't other APIs across runtime that make the assumption of string immutability with worse side-effects.

@MihaZupan
Copy link
MemberAuthor

The recent changes improve the throughput by another 10-20% depending on input.

MethodToolchainMeanRatioGen 0Allocated
UnescapeDataString_Chineese\base\CoreRun.exe9.373 us2.621.11394664 B
UnescapeDataString_Chineese\new\CoreRun.exe3.586 us1.000.0801344 B
UnescapeDataString_Emoji\base\CoreRun.exe55.043 us2.949.521540074 B
UnescapeDataString_Emoji\new\CoreRun.exe18.711 us1.000.94604025 B

@stephentoub
Copy link
Member

@MihaZupan, what's the next step here? Are you waiting for reviews? Is it ready to merge?

@karelzkarelz modified the milestones:5.0.0,6.0.0Aug 31, 2020
@stephentoub
Copy link
Member

@GrabYourPitchforks, I assume no response means you're good with this now? Thanks.

@stephentoub
Copy link
Member

@MihaZupan, can you rebase this to resolve the conflicts? Thanks.

@MihaZupanMihaZupanforce-pushed theuri-cleanup-percent-encoded-utf8 branch from70606d0 to2a78dc8CompareOctober 6, 2020 10:46
@danmoseley
Copy link
Member

Noticed this is the oldest Microsoft PR in the repo now.

@GrabYourPitchforks any remaining feedback or is this ready to go?

@danmoseley
Copy link
Member

@scalablecory is your feedback addressed? Trying to get our 90th percentile PR age down..

@MihaZupan
Copy link
MemberAuthor

Test failures are unrelated.

Sent an email to@GrabYourPitchforks about finalizing the review here, otherwise I will be closing the PR until we can get someone to take a look at it.

@MihaZupanMihaZupan merged commite53e543 intodotnet:masterDec 14, 2020
@ghostghost locked asresolvedand limited conversation to collaboratorsJan 13, 2021
Sign up for freeto subscribe to this conversation on GitHub. Already have an account?Sign in.

Reviewers

@stephentoubstephentoubstephentoub left review comments

@GrabYourPitchforksGrabYourPitchforksGrabYourPitchforks approved these changes

+3 more reviewers

@lpereiralpereiralpereira left review comments

@scalablecoryscalablecoryscalablecory left review comments

@gfoidlgfoidlgfoidl left review comments

Reviewers whose approvals may not affect merge requirements

Assignees

@MihaZupanMihaZupan

Projects

None yet

Milestone

6.0.0

Development

Successfully merging this pull request may close these issues.

8 participants

@MihaZupan@GrabYourPitchforks@stephentoub@danmoseley@lpereira@scalablecory@gfoidl@karelz

[8]ページ先頭

©2009-2025 Movatter.jp