- Notifications
You must be signed in to change notification settings - Fork5.2k
Description
EDITED BY@stephentoub on 10/26/2022 to add revised API shape:
namespaceSystem{publicstaticclassIndexOfAnyValues{publicstaticIndexOfAnyValues<T>Create<T>(ReadOnlySpan<T>values)whereT:IEquatable<T>?;// in the future, could add an `IndexOfAnyValues<char> Create(ReadOnlySpan<string> values) overload}publicclassIndexOfAnyValues<T>whereT:IEquatable<T>?{// no public/protected members, including ctors}publicstaticpartialclassMemoryExtensions{// These are identical to the existing overloads, just with IndexOfAnyValues<T> instead of ReadOnlySpan<T> for the values parameter typepublicstaticintIndexOfAny<T>(thisReadOnlySpan<T>span,IndexOfAnyValues<T>values)whereT:IEquatable<T>?;publicstaticintIndexOfAnyExcept<T>(thisReadOnlySpan<T>span,IndexOfAnyValues<T>values)whereT:IEquatable<T>?;publicstaticintLastIndexOfAny<T>(thisReadOnlySpan<T>span,IndexOfAnyValues<T>values)whereT:IEquatable<T>?;publicstaticintLastIndexOfAnyExcept<T>(thisReadOnlySpan<T>span,IndexOfAnyValues<T>values)whereT:IEquatable<T>?;// ... repeat for Span overloads}}
API Usage
internalstaticclassHttpRuleParser{privatestaticreadonlyIndexOfAnyValues<char>s_tokenChars=IndexOfAnyValues.Create("!#$%&'*+-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ^_`abcdefghijklmnopqrstuvwxyz|~");publicstaticboolIsToken(ReadOnlySpan<char>input)=>input.IndexOfAnyExcept(s_tokenChars)<0;}
EDITED BY@MihaZupan on 10/24/2022 to add proposed API shape:
API Proposal
namespaceSystem{publicstaticpartialclassMemoryExtensions{publicstaticintIndexOfAny(thisReadOnlySpan<byte>span,IndexOfAnyAsciiValuesasciiValues);publicstaticintIndexOfAnyExcept(thisReadOnlySpan<byte>span,IndexOfAnyAsciiValuesasciiValues);publicstaticintLastIndexOfAny(thisReadOnlySpan<byte>span,IndexOfAnyAsciiValuesasciiValues);publicstaticintLastIndexOfAnyExcept(thisReadOnlySpan<byte>span,IndexOfAnyAsciiValuesasciiValues);publicstaticintIndexOfAny(thisReadOnlySpan<char>span,IndexOfAnyAsciiValuesasciiValues);publicstaticintIndexOfAnyExcept(thisReadOnlySpan<char>span,IndexOfAnyAsciiValuesasciiValues);publicstaticintLastIndexOfAny(thisReadOnlySpan<char>span,IndexOfAnyAsciiValuesasciiValues);publicstaticintLastIndexOfAnyExcept(thisReadOnlySpan<char>span,IndexOfAnyAsciiValuesasciiValues);// ... repeat for Span overloads}}namespaceSystem.Buffers.Text{publicsealedclassIndexOfAnyAsciiValues{publicIndexOfAnyAsciiValues(ReadOnlySpan<char>asciiValues);// No other public members}}
API Usage
internalstaticclassHttpRuleParser{privatestaticreadonlyIndexOfAnyAsciiValuess_tokenChars=new("!#$%&'*+-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ^_`abcdefghijklmnopqrstuvwxyz|~");publicstaticboolIsToken(ReadOnlySpan<byte>input)=>input.IndexOfAnyExcept(s_tokenChars)<0;}
Alternative Designs
Static members onAscii instead of extension methods inMemoryExtensions.
Based on feedback from API review on 2022-10-25.
namespaceSystem.Buffers.Text;publicstaticclassAscii{publicstaticintIndexOfAny(ReadOnlySpan<byte>span,IndexOfAnyAsciiValuesvalues);publicstaticintIndexOfAny(ReadOnlySpan<char>span,IndexOfAnyAsciiValuesvalues);// ...}publicsealedclassIndexOfAnyAsciiValues{publicIndexOfAnyAsciiValues(ReadOnlySpan<char>asciiValues);}
Original post:
Earlier in the .NET 7 release, we combined the best of string.IndexOfAny and MemoryExtensions.IndexOfAny to provide a single implementation used by both. As part of this, whereas <= 5 characters are vectorized, > 5 characters use a "probabilistic map":
runtime/src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.cs
Lines 1140 to 1167 ind58efd1
| privatestaticunsafeintIndexOfAnyProbabilistic(refcharsearchSpace,intsearchSpaceLength,refcharvalues,intvaluesLength) | |
| { | |
| Debug.Assert(searchSpaceLength>=0); | |
| Debug.Assert(valuesLength>=0); | |
| ReadOnlySpan<char>valuesSpan=newReadOnlySpan<char>(refvalues,valuesLength); | |
| ProbabilisticMapmap=default; | |
| uint*charMap=(uint*)↦ | |
| ProbabilisticMap.Initialize(charMap,valuesSpan); | |
| refcharcur=refsearchSpace; | |
| while(searchSpaceLength!=0) | |
| { | |
| intch=cur; | |
| if(ProbabilisticMap.IsCharBitSet(charMap,(byte)ch)&& | |
| ProbabilisticMap.IsCharBitSet(charMap,(byte)(ch>>8))&& | |
| ProbabilisticMap.SpanContains(valuesSpan,(char)ch)) | |
| { | |
| return(int)((nint)Unsafe.ByteOffset(refsearchSpace,refcur)/sizeof(char)); | |
| } | |
| searchSpaceLength--; | |
| cur=refUnsafe.Add(refcur,1); | |
| } | |
| return-1; | |
| } |
essentially a Bloom filter to quickly screen out characters that are definitely not in the set so that the more expensive check need only be done if it's likely the character is in the set. While faster than comparing every input character against every character in the set, we still have to walk the input character by character.
@MihaZupan'sxoofx/markdig@da756f4 is an implementation ofhttp://0x80.pl/articles/simd-byte-lookup.html#universal-algorithm, and highlights that it's possible to vectorize this operation instead. It would require startup overhead of determining whether all of the set characters are ASCII and generating a bitmap from them, but we're already doing such a loop in order to build the probabilistic map:
runtime/src/libraries/System.Private.CoreLib/src/System/ProbabilisticMap.cs
Lines 31 to 67 ind58efd1
| publicstaticunsafevoidInitialize(uint*charMap,ReadOnlySpan<char>values) | |
| { | |
| #ifDEBUG | |
| for(inti=0;i<Size;i++) | |
| { | |
| Debug.Assert(charMap[i]==0,"Expected charMap to be zero-initialized."); | |
| } | |
| #endif | |
| boolhasAscii=false; | |
| uint*charMapLocal=charMap;// https://github.com/dotnet/runtime/issues/9040 | |
| for(inti=0;i<values.Length;++i) | |
| { | |
| intc=values[i]; | |
| // Map low bit | |
| SetCharBit(charMapLocal,(byte)c); | |
| // Map high bit | |
| c>>=8; | |
| if(c==0) | |
| { | |
| hasAscii=true; | |
| } | |
| else | |
| { | |
| SetCharBit(charMapLocal,(byte)c); | |
| } | |
| } | |
| if(hasAscii) | |
| { | |
| // Common to search for ASCII symbols. Just set the high value once. | |
| charMapLocal[0]|=1u; | |
| } | |
| } |
We could simply augment that to also build up the ASCII bitmap, and if after initialization determine that every character was indeed ASCII, take the vectorized path of using that ASCII bitmap, otherwise fall back to using the probabilistic map as we do today.