[String] a new component for object-oriented strings management with an abstract unit system #33553
stof commented Sep 11, 2019
No way to unignore case? And no way to know whether the current object is ignoring case? This makes the API unusable for code that wants to treat the string case-sensitively while accepting an external string object. Also, should we merge this new component in 4.4, which would mean that its first release is already non-experimental? We are not allowed to have experimental components in LTS versions, per our LTS policy.
nicolas-grekas commented Sep 11, 2019 • edited
That's something we need to decide indeed. On my side, I think we can make it non-experimental.
stof commented Sep 11, 2019
And what happens for all methods accepting a string as argument, when passing non-UTF-8 strings to the method on a […] Regarding the naming, should […] Note that these comments are based purely on your PR description. I haven't looked at the code yet.
nicolas-grekas commented Sep 11, 2019 • edited
An […]
I think UTF-8 is more common vocabulary. The previous PR used […]
stof commented Sep 11, 2019
@nicolas-grekas but the whole component is about UTF-8 strings. AFAICT, even `BinaryString` is not meant to operate on other encodings, as it does not validate that the string is valid UTF-8 before converting it to other implementations.
nicolas-grekas commented Sep 11, 2019 • edited
It does, dunno why you think otherwise. If you try to convert a random binary string to UTF-8/Grapheme, you'll get an […]
Thus the name of the component.
fabpot commented Sep 11, 2019
To me, this should go as experimental in 5.0.
drupol commented Sep 11, 2019
Definitely supporting this :-) Nice!
javiereguiluz commented Sep 11, 2019 • edited by nicolas-grekas
Sorry to sound naive, but I can't find, in this pull request or the previous one, a brief explanation of when and where developers should use this. Why/when should we use these classes/methods instead of the normal str_ PHP functions or the mb_str UTF-8 functions? Thanks!
Note: I'm not questioning this ... I just want to know where this fits for Symfony developers and Symfony itself. Thanks a lot!
edit: see #33553 (comment)
javiereguiluz commented Sep 11, 2019
For your consideration, we could turn these 4 methods:

```php
function ensureLeft(string $prefix): self
function ensureRight(string $suffix): self
function padLeft(int $length, string $padStr = ' '): self
function padRight(int $length, string $padStr = ' '): self
```

Into these 2 methods if we change the order of the arguments:

```php
function padLeft(string $padStr = '', int $length = null): self
function padRight(string $padStr = ' ', int $length = null): self
```

Example:

```php
// BEFORE
$s1 = u('lorem')->ensureLeft('abc');   // $s1 = 'abclorem'
$s2 = u('lorem')->ensureRight('abc');  // $s2 = 'loremabc'
$s3 = u('lorem')->padLeft(8, 'abc');   // $s3 = 'abcabcablorem'
$s4 = u('lorem')->padRight(8, 'abc');  // $s4 = 'loremabcabcab'

// AFTER
$s1 = u('lorem')->padLeft('abc');      // $s1 = 'abclorem'
$s2 = u('lorem')->padRight('abc');     // $s2 = 'loremabc'
$s3 = u('lorem')->padLeft('abc', 8);   // $s3 = 'abcabcablorem'
$s4 = u('lorem')->padRight('abc', 8);  // $s4 = 'loremabcabcab'
```
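For readers weighing this proposal, here is a minimal plain-PHP sketch of how "ensure" and "pad" usually differ. It assumes that "ensure" means "add the prefix only if it is missing" and that `$length` is the total target length, as in `str_pad()`. The free functions `ensureStart()` and `padStart()` are hypothetical, byte-oriented stand-ins and not the component's code.

```php
<?php
// Hypothetical helpers for illustration only; byte-oriented for brevity.

// Adds $prefix only when the string does not already start with it.
function ensureStart(string $s, string $prefix): string
{
    return strncmp($s, $prefix, strlen($prefix)) === 0 ? $s : $prefix.$s;
}

// Pads on the left until the string reaches a total of $length bytes.
function padStart(string $s, int $length, string $padStr = ' '): string
{
    return str_pad($s, $length, $padStr, STR_PAD_LEFT);
}

echo ensureStart('lorem', 'abc'), "\n";    // abclorem
echo ensureStart('abclorem', 'abc'), "\n"; // abclorem (prefix already present)
echo padStart('lorem', 10, 'abc'), "\n";   // abcablorem (pad string is cut to fit)
echo padStart('abclorem', 8, 'abc'), "\n"; // abclorem (already 8 bytes long)
```

Under these assumptions the two operations answer different questions (is the prefix there? is the string long enough?), which may be why folding them into one method can read as surprising.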
nicolas-grekas commented Sep 11, 2019
All the time would be fine. E.g. […] More specifically, I've observed ppl randomly add an […]
nicolas-grekas commented Sep 11, 2019
This would be totally unexpected to me. I've seen no other libraries have this API and I'm not sure it works actually.
Absolutely! That's a critical design concern, not just an implementation detail :) I added a note about it in the description. Thanks for asking.
azjezz left a comment
This is great. I believe this would make it easier for developers to deal with string encoding. Just a few notes about method naming :)
javiereguiluz commented Sep 11, 2019
@nicolas-grekas thanks for the explanation. It's perfectly clear now! Another question: some methods are called "left", "right" instead of "prefix/suffix" or "start/end". What happens when the text is Arabic/Persian/Hebrew and uses right-to-left direction? For example, […]
leofeyer commented Sep 11, 2019
We have been using […]
Devristo commented Sep 11, 2019
It looks amazing. I am curious how it would work together with the rest of the ecosystem. Let's say compatibility with Doctrine, intl, symfony/validator, etc.? I am sure it will take time before it trickles down to other components, but the future seems bright ;)
azjezz left a comment
I suggest adding `AbstractString::contains(string ...$needles): bool`, which returns true if the string contains at least one of the needles.

```php
if ($text->contains(...$blacklisted)) {
    echo 'nope!';
}
```
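A rough sketch of what such a variadic check could look like in plain PHP follows. The function name `containsAny()` is hypothetical, it works on bytes rather than on the component's string objects, and skipping empty needles is an assumption made here rather than anything stated in the suggestion.

```php
<?php
// Hypothetical free function sketching the suggested contains(string ...$needles).
function containsAny(string $haystack, string ...$needles): bool
{
    foreach ($needles as $needle) {
        // Empty needles are skipped; this choice is an assumption of the sketch.
        if ($needle !== '' && strpos($haystack, $needle) !== false) {
            return true;
        }
    }

    return false;
}

$blacklisted = ['spam', 'scam'];
var_dump(containsAny('no spam please', ...$blacklisted)); // bool(true)
var_dump(containsAny('hello world', ...$blacklisted));    // bool(false)
```

Returning on the first hit keeps the check cheap when the list of needles is long.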
xabbuh left a comment
I am not finished reviewing this PR, but here are some ideas I got so far.
```php
    return \strlen($this->string) - \strlen($suffix) === ($this->ignoreCase ? strripos($this->string, $suffix) : strrpos($this->string, $suffix));
}

public function equalsTo($string): bool
```

Suggested change:

```diff
-public function equalsTo($string): bool
+/**
+ * @param AbstractString|string|string[] $string
+ */
+public function equalsTo($string): bool
```

Can be useful for autocompletion, static analysis and so on. Other methods could benefit from this doc.
nicolas-grekas Sep 25, 2019 • edited
Any object that implements `__toString()` is allowed actually. That's what `string` means already to me. What's the relation with autocompletion?
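As background for this reply, the snippet below shows standard PHP behaviour rather than anything specific to the component: in the default (coercive) typing mode, a `string` parameter accepts any object that implements `__toString()`. The class `Name` and the function `shout()` are hypothetical names used only for illustration.

```php
<?php
// No declare(strict_types=1) here: coercive typing is what allows the conversion.

final class Name // hypothetical class for illustration
{
    private $value;

    public function __construct(string $value)
    {
        $this->value = $value;
    }

    public function __toString(): string
    {
        return $this->value;
    }
}

function shout(string $s): string
{
    return strtoupper($s);
}

echo shout(new Name('symfony')), "\n"; // SYMFONY
```

With `declare(strict_types=1)` in the calling file, the same call would raise a `TypeError`, which is one reason a docblock listing the accepted types can still help static analysis.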
fabpot commented Sep 26, 2019
Thank you @nicolas-grekas.
…anagement with an abstract unit system (nicolas-grekas, hhamon, gharlan)

This PR was merged into the 5.0-dev branch.

Discussion
----------

[String] a new component for object-oriented strings management with an abstract unit system

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | no
| New feature?  | yes
| Deprecations? | no
| Tickets       | -
| License       | MIT
| Doc PR        | -

This is a reboot of #22184 (thanks @hhamon for working on it) and a generalization of my previous work on the topic ([patchwork/utf8](https://github.com/tchwork/utf8)). Unlike existing libraries (including `patchwork/utf8`), this component provides a unified API for the 3 unit systems of strings: bytes, code points and grapheme clusters.

The unified API is defined by the `AbstractString` class. It has 2 direct child classes: `BinaryString` and `AbstractUnicodeString`, itself extended by `Utf8String` and `GraphemeString`.

All objects are immutable and provide clear edge-case semantics, using exceptions and/or (nullable) types!

Two helper functions are provided to create such strings:

```php
new GraphemeString('foo') == u('foo'); // when dealing with Unicode, prefer grapheme units
new BinaryString('foo') == b('foo');
```

`GraphemeString` is the most linguistic-friendly variant of them, which means it's the one ppl should use most of the time *when dealing with written text*.

Future ideas:
 - improve tests
 - add more docblocks (only where they'd add value!)
 - consider adding more methods in the string API (`is*()?`, `*Encode()`?, etc.)
 - first class Emoji support
 - merge the Inflector component into this one
 - use `width()` to improve `truncate()` and `wordwrap()`
 - move method `slug()` to a dedicated locale-aware service class
 - propose your ideas (send PRs after merge)

Out of (current) scope:
 - what [intl](https://php.net/intl) provides (collations, transliterations, confusables, segmentation, etc)

Here is the unified API I'm proposing in this PR, borrowed from looking at many existing libraries, but also Java, Python, JavaScript and Go.

```php
function __construct(string $string = '');
static function unwrap(array $values): array;
static function wrap(array $values): array;
function after($needle, bool $includeNeedle = false, int $offset = 0): self;
function afterLast($needle, bool $includeNeedle = false, int $offset = 0): self;
function append(string ...$suffix): self;
function before($needle, bool $includeNeedle = false, int $offset = 0): self;
function beforeLast($needle, bool $includeNeedle = false, int $offset = 0): self;
function camel(): self;
function chunk(int $length = 1): array;
function collapseWhitespace(): self;
function endsWith($suffix): bool;
function ensureEnd(string $suffix): self;
function ensureStart(string $prefix): self;
function equalsTo($string): bool;
function folded(): self;
function ignoreCase(): self;
function indexOf($needle, int $offset = 0): ?int;
function indexOfLast($needle, int $offset = 0): ?int;
function isEmpty(): bool;
function join(array $strings): self;
function jsonSerialize(): string;
function length(): int;
function lower(): self;
function match(string $pattern, int $flags = 0, int $offset = 0): array;
function padBoth(int $length, string $padStr = ' '): self;
function padEnd(int $length, string $padStr = ' '): self;
function padStart(int $length, string $padStr = ' '): self;
function prepend(string ...$prefix): self;
function repeat(int $multiplier): self;
function replace(string $from, string $to): self;
function replaceMatches(string $fromPattern, $to): self;
function slice(int $start = 0, int $length = null): self;
function snake(): self;
function splice(string $replacement, int $start = 0, int $length = null): self;
function split(string $delimiter, int $limit = null, int $flags = null): array;
function startsWith($prefix): bool;
function title(bool $allWords = false): self;
function toBinary(string $toEncoding = null): BinaryString;
function toGrapheme(): GraphemeString;
function toUtf8(): Utf8String;
function trim(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self;
function trimEnd(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self;
function trimStart(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self;
function truncate(int $length, string $ellipsis = ''): self;
function upper(): self;
function width(bool $ignoreAnsiDecoration = true): int;
function wordwrap(int $width = 75, string $break = "\n", bool $cut = false): self;
function __clone();
function __toString(): string;
```

`AbstractUnicodeString` adds these:

```php
static function fromCodePoints(int ...$codes): self;
function ascii(array $rules = []): self;
function codePoint(int $index = 0): ?int;
function folded(bool $compat = true): parent;
function normalize(int $form = self::NFC): self;
function slug(string $separator = '-'): self;
```

and `BinaryString`:

```php
static function fromRandom(int $length = 16): self;
function byteCode(int $index = 0): ?int;
function isUtf8(): bool;
function toUtf8(string $fromEncoding = null): Utf8String;
function toGrapheme(string $fromEncoding = null): GraphemeString;
```

Case insensitive operations are done with the `ignoreCase()` method.
e.g. `b('abc')->ignoreCase()->indexOf('B')` will return `1`.

For reference, CLDR transliterations (used in the `ascii()` method) are defined here:
https://github.com/unicode-org/cldr/tree/master/common/transforms

Commits
-------

dd8745a [String] add more tests
82a0095 [String] add tests
012e92a [String] a new component for object-oriented strings management with an abstract unit system
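To make the three unit systems named in the description concrete, here is a small plain-PHP sketch that counts the same string in bytes, code points and grapheme clusters. It deliberately uses `strlen()` and PCRE instead of the component's classes, so it only illustrates the concept and says nothing about how the component computes `length()`.

```php
<?php
// "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
$s = "e\u{0301}";

$bytes      = strlen($s);                  // counts bytes
$codePoints = preg_match_all('/./su', $s); // counts Unicode code points
$graphemes  = preg_match_all('/\X/u', $s); // counts extended grapheme clusters

printf("%d bytes, %d code points, %d grapheme cluster(s)\n", $bytes, $codePoints, $graphemes);
// 3 bytes, 2 code points, 1 grapheme cluster(s)
```

The same text therefore has three different "lengths", which is the ambiguity that a choice between `BinaryString`, `Utf8String` and `GraphemeString` makes explicit.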
nicolas-grekas commented Sep 26, 2019
Thank you everyone for the reviews, it's been invaluable!
…tring (nicolas-grekas)

This PR was merged into the 5.0-dev branch.

Discussion
----------

[String] renamed core classes to Byte/CodePoint/UnicodeString

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | no
| New feature?  | no
| Deprecations? | no
| Tickets       | -
| License       | MIT
| Doc PR        | -

In #33553 there have been discussions about the naming of the classes - eg. "what's a grapheme", "why `Utf8String`", "lowercase on binary is weird", etc.

What about these names? Would they get your votes *vs* the current ones?
- `BinaryString` -> `ByteString`
- `Utf8String` -> `CodePointString`
- `GraphemeString` -> `UnicodeString`

Commits
-------

63c105d [String] renamed core classes to Byte/CodePoint/UnicodeString
MaPePeR commented Nov 21, 2019
I think this answer is incomplete, because there is nothing that stops someone from calling a function like […] This might also happen unintentionally when someone intends to do a lot of case insensitive operations, so they do […] You might say "Use the code in a wrong way and you will get wrong results", but I still think that this behavior is kind of odd. Maybe have the […]
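The concern raised here, and earlier in the thread about not being able to tell whether an object is ignoring case, can be sketched with a hypothetical minimal class. `TinyString` and `hasUpperB()` are invented for illustration; the sketch assumes the flag set by `ignoreCase()` persists on the returned object, which may not match how the merged component behaves.

```php
<?php
// Hypothetical minimal class; NOT the component's implementation.
final class TinyString
{
    private $string;
    private $ignoreCase = false;

    public function __construct(string $string)
    {
        $this->string = $string;
    }

    // Returns a copy whose flag stays set for every later call (the assumption discussed above).
    public function ignoreCase(): self
    {
        $copy = clone $this;
        $copy->ignoreCase = true;

        return $copy;
    }

    public function indexOf(string $needle): ?int
    {
        $i = $this->ignoreCase ? stripos($this->string, $needle) : strpos($this->string, $needle);

        return false === $i ? null : $i;
    }
}

// A callee written with case-sensitive matching in mind.
function hasUpperB(TinyString $s): bool
{
    return null !== $s->indexOf('B');
}

$s = (new TinyString('abc'))->ignoreCase();
var_dump($s->indexOf('B'));                 // int(1)
var_dump(hasUpperB($s));                    // bool(true), although 'abc' has no upper-case B
var_dump(hasUpperB(new TinyString('abc'))); // bool(false)
```

Under that assumption the callee cannot detect or undo the caller's choice, which is the behaviour described as odd in this comment.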

[EDIT: classes have been renamed in #33816]

This is a reboot of #22184 (thanks @hhamon for working on it) and a generalization of my previous work on the topic (patchwork/utf8). Unlike existing libraries (including `patchwork/utf8`), this component provides a unified API for the 3 unit systems of strings: bytes, code points and grapheme clusters.

The unified API is defined by the `AbstractString` class. It has 2 direct child classes: `BinaryString` and `AbstractUnicodeString`, itself extended by `Utf8String` and `GraphemeString`.

All objects are immutable and provide clear edge-case semantics, using exceptions and/or (nullable) types!

Two helper functions are provided to create such strings: `u()`, which creates a `GraphemeString`, and `b()`, which creates a `BinaryString`.

`GraphemeString` is the most linguistic-friendly variant of them, which means it's the one ppl should use most of the time *when dealing with written text*.

Future ideas:
 - improve tests
 - add more docblocks (only where they'd add value!)
 - consider adding more methods in the string API (`is*()?`, `*Encode()`?, etc.)
 - first class Emoji support
 - merge the Inflector component into this one
 - use `width()` to improve `truncate()` and `wordwrap()`
 - move method `slug()` to a dedicated locale-aware service class (see "[String] Introduce a locale-aware Slugger in the String component" #33768)
 - propose your ideas (send PRs after merge)

Out of (current) scope:
 - what intl provides (collations, transliterations, confusables, segmentation, etc)

The unified API I'm proposing in this PR is borrowed from looking at many existing libraries, but also Java, Python, JavaScript and Go. `AbstractUnicodeString` and `BinaryString` each add a few methods of their own; the full signature lists are quoted in the merge commit message above.

Case insensitive operations are done with the `ignoreCase()` method.
e.g. `b('abc')->ignoreCase()->indexOf('B')` will return `1`.

For reference, CLDR transliterations (used in the `ascii()` method) are defined here:
https://github.com/unicode-org/cldr/tree/master/common/transforms