[Utf8] New component with Bytes, CodePoints and Graphemes implementations of string objects#22184

Closed

nicolas-grekas wants to merge 5 commits into symfony:4.4 from hhamon:utf8

Member

nicolas-grekas commented Mar 27, 2017 (edited)

Q	A
Branch?	master
Bug fix?	no
New feature?	yes
BC breaks?	no
Deprecations?	no
Tests pass?	yes
Fixed tickets	-
License	MIT
Doc PR	-

[edit: continued in #33553]

This is a port of tchwork/utf8 to Symfony.
tchwork/utf8 has 7M downloads on packagist, and I'd be really happy to maintain it under the Symfony umbrella.

It provides 3 classes that wrap PHP strings into objects, and deal with the 3 usual unit spaces of strings: bytes, utf8 chars and grapheme clusters.

All 3 classes implement the GenericStringInterface, so that one can type hint any of them, then select the appropriate unit system to work with (see above) using "converter" methods. GenericStringInterface is annotated @final to tag it as not implementable by userland, which allows us to change it and add more methods later if we want, without being blocked by our BC promise.

In order to help the implementation, the component has a PHP 7.0 requirement. It'd be nice if this could be accepted as such - this helps a lot to make the code clean.

Test coverage is at 100%.

A big thank-you to @hhamon, who did the port.

(for cross ref, here is a related package: https://packagist.org/packages/danielstjules/stringy)

nicolas-grekas added this to the 3.x milestone

Mar 27, 2017

carsonbot added the Status: Needs Review and Feature labels

Mar 27, 2017

javiereguiluz reviewed

Mar 27, 2017

hhamon force-pushed the utf8 branch 4 times, most recently from b807183 to 9edd61a

March 27, 2017 18:51

Fleshgrinder reviewed

Mar 27, 2017

Fleshgrinder left a comment

The inheritance does not really make sense. Programming against the GenericStringInterface does not provide me with any confidence over the kind of string I will receive, in other words, I could program against literal PHP strings as well.

What actually would make sense is that the UTF-8 variations extend Bytes, while implementing a common UTF8String interface.

This approach would also be extensible for the future, e.g. to add an ASCIIString that extends Bytes and implements the UTF8String interface as well, since any valid ASCII string is valid UTF-8.

To put it differently: anyone capable of processing bytes is capable of processing UTF-8, anyone capable of processing UTF-8 is capable of processing ASCII, … you may continue this chain until you reach a pure Latin Alphabet (e.g. [a-z]).
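For readers following along, the hierarchy proposed in this comment could look roughly like the sketch below. It is only an illustration under assumptions: the class bodies, the validation checks and the `byteLength()` helper are hypothetical, and none of this is code from the patch (where the classes are final and do not extend each other).

```php
<?php

// Hypothetical sketch of the proposed hierarchy; not code from the patch.
interface UTF8String
{
}

class Bytes
{
    protected $value;

    public function __construct(string $value)
    {
        $this->value = $value;
    }

    public function length(): int
    {
        return strlen($this->value); // always counts bytes
    }
}

class CodePoints extends Bytes implements UTF8String
{
    public function __construct(string $value)
    {
        // preg_match() with the "u" modifier returns false on invalid UTF-8.
        if (1 !== preg_match('//u', $value)) {
            throw new InvalidArgumentException('Invalid UTF-8 string.');
        }
        parent::__construct($value);
    }
}

class ASCIIString extends Bytes implements UTF8String
{
    public function __construct(string $value)
    {
        // Any byte above 0x7F is outside ASCII.
        if (1 === preg_match('/[\x80-\xFF]/', $value)) {
            throw new InvalidArgumentException('Non-ASCII byte found.');
        }
        parent::__construct($value);
    }
}

// Code written against Bytes accepts every subtype in the chain.
function byteLength(Bytes $s): int
{
    return $s->length();
}
```

Note that in this sketch `length()` keeps returning bytes for every subtype, which is the contract question raised in the reply further down.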

src/Symfony/Component/Utf8/Utf8Trait.php

		}
		} else {
		throw new InvalidArgumentException('Pattern replacement must be a valid string or array of strings.');
		}

Fleshgrinder Mar 27, 2017

Same problem as before, second argument can be an array only if first argument is an array.

Member, Author

nicolas-grekas Mar 27, 2017

nope, see above

Fleshgrinder Mar 27, 2017

hmmm…

Member, Author

nicolas-grekas commented Mar 27, 2017 (edited)

> Programming against the GenericStringInterface does not provide me with any confidence over the kind of string I will receive, in other words, I could program against literal PHP strings as well.

PHP doesn't provide generic programming (such as C++ "templates"), so this is the way we found to simulate it.
If one wants contractual type safety, one should just type hint the proper class (e.g. Bytes). Classes are final for this purpose. If one wants to code some generic algorithm that doesn't care about the specific unit system, then that's when the interface should be used. And when one type-hints against the interface but still wants control over the unit system, then the toCodePoints, toBytes and toGraphemes methods are provided for this purpose.
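A minimal sketch of that usage pattern follows. Only the names `GenericStringInterface`, `Bytes`, `CodePoints`, `length`, `toBytes` and `toCodePoints` come from this thread; the reduced interface, the method bodies and the two helper functions are assumptions made for illustration (`toGraphemes` is left out to keep it short).

```php
<?php

// Hypothetical, heavily reduced version of the interface discussed here.
interface GenericStringInterface
{
    public function length(): int;
    public function toBytes(): Bytes;
    public function toCodePoints(): CodePoints;
}

final class Bytes implements GenericStringInterface
{
    private $string;

    public function __construct(string $string) { $this->string = $string; }

    public function length(): int { return strlen($this->string); }
    public function toBytes(): Bytes { return $this; }
    public function toCodePoints(): CodePoints { return new CodePoints($this->string); }
}

final class CodePoints implements GenericStringInterface
{
    private $string;

    public function __construct(string $string) { $this->string = $string; }

    // With the "u" modifier, "." matches one code point.
    public function length(): int { return (int) preg_match_all('/./su', $this->string); }
    public function toBytes(): Bytes { return new Bytes($this->string); }
    public function toCodePoints(): CodePoints { return $this; }
}

// Generic algorithm: works in whatever unit system the caller picked.
function isShort(GenericStringInterface $s, int $max): bool
{
    return $s->length() <= $max;
}

// Generic signature, but the unit system is forced explicitly to bytes.
function fitsInColumn(GenericStringInterface $s, int $maxBytes): bool
{
    return $s->toBytes()->length() <= $maxBytes;
}
```

Type hinting `Bytes` or `CodePoints` directly instead of the interface gives the contractual guarantee mentioned above, since the classes are final.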

> What actually would make sense is that the UTF-8 variations extend Bytes, while implementing a common UTF8String interface.

Certainly not: a UTF-8 string is not an instanceof bytes, for the purpose of this component. The return values of e.g. the length methods of the corresponding classes do not adhere to the same contract: one returns a "bytes" unit, the other a "code point" unit. Even if the interfaces look the same, they are not.
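To make the unit difference concrete, here is a small sketch using only plain PHP functions rather than the component's classes (the variable names are illustrative). The same string has a different length in each unit system:

```php
<?php

// "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
$string = "e\u{301}";

// Bytes: "e" is 1 byte, U+0301 is encoded on 2 bytes in UTF-8.
$bytes = strlen($string);

// Code points: with the "u" modifier, "." matches one code point.
$codePoints = preg_match_all('/./su', $string);

// Grapheme clusters: \X matches one extended grapheme cluster.
$graphemes = preg_match_all('/\X/u', $string);

printf("%d bytes, %d code points, %d grapheme\n", $bytes, $codePoints, $graphemes);
// prints "3 bytes, 2 code points, 1 grapheme"
```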

> To put it differently: anyone capable of processing bytes is capable of processing UTF-8, anyone capable of processing UTF-8 is capable of processing ASCII, … you may continue this chain until you reach a pure Latin Alphabet (e.g. [a-z]).

That is generic programming: ignoring the type of things to do similar operations on objects. GenericStringInterface is provided exactly for this purpose. But as far as the type system is concerned, the three kinds of strings provided here are not and should not be "instanceof" each other.

nicolas-grekas commented

Mar 27, 2017

Fleshgrinder commented Mar 27, 2017 (edited)

I can treat any and every UTF-8 string as a series of bytes. The current implementation of Bytes assumes an ASCII-encoded string; this is most certainly not always the case, since it can have any encoding. PHP’s string type is already generic, so wrapping it only exchanges a well-known API for a new one. I thought that this was actually meant to provide more control over a string’s content, as well as confidence that the content is of a certain character set (I count a byte stream as a kind of character set, it just provides the least confidence over what I am dealing with; which is actually not really interesting in the first place, and my inheritance chain would actually start with UTF-8). Seems like I am wrong here, and this is just an OO-flavored utility implementation.

If this is considered to be useful, then it’s fine with me.

I should probably add that I truly like the initiative and effort. String handling is very complicated, and I have often thought about creating a similar thing. I mean, I would not take the time to review this and give constructive feedback if I considered it crap. So, please feel encouraged and not discouraged by all my comments. 🐱

hhamon force-pushed the utf8 branch 5 times, most recently from 624ee76 to 8f52667

March 28, 2017 08:30

stof reviewed

Mar 28, 2017

geoffrey-brier reviewed

Mar 29, 2017

ro0NL reviewed

Mar 30, 2017

Contributor

ro0NL left a comment

👍 cool stuff.

in general there's a lot of the same-ish code.. could it be further simplified with a common mb_ trait using either 8bit or UTF-8?

hhamon force-pushed the utf8 branch 10 times, most recently from a4fe716 to 551d790

March 31, 2017 08:39

hhamon force-pushed the utf8 branch from 9a12070 to 679106f

April 3, 2017 07:03

nicolas-grekas self-assigned this

Apr 10, 2017

Contributor

soullivaneuh commented May 4, 2017 (edited)

So if I understood the component correctly, we have to instantiate an object containing the string to do UTF-8-safe manipulation and comparison?

Why not have static methods, like the voku/portable-utf8 package does?

BTW, maybe this was already discussed, and I'd be glad to have a thread link in that case, but why not require and use a library like voku/portable-utf8 or yours instead of creating a new component?

Maybe I'm too curious, but I'd like some clarification. 👼 😉

Member

stof commented May 4, 2017

@soullivaneuh this is an API upgrade of an existing package from @nicolas-grekas, meant to bring this package into the Symfony ecosystem (so that it is maintained by the core team rather than @nicolas-grekas alone)

Contributor

hhamon commented May 4, 2017

@soullivaneuh because the package from voku doesn't let you manipulate strings in byte, code point or grapheme units. Depending on your domain-specific context, you'll choose one of the 3 implementations: Bytes for simple strings and fast string manipulation, CodePoints for simple UTF-8 strings, and Graphemes when you need to deal with advanced character maps that include combined characters.
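As an illustration of why the choice of unit matters for combined characters, the sketch below takes a prefix of the same string in each unit, using plain PHP functions instead of the component's classes. The `firstUnits()` helper is hypothetical and only exists for this example.

```php
<?php

// "café" spelled with a combining accent: "cafe" followed by U+0301.
$word = "cafe\u{301}";

// Hypothetical helper: split $s into units matched by $pattern, keep the first $n.
function firstUnits(string $pattern, string $s, int $n): string
{
    preg_match_all($pattern, $s, $matches);

    return implode('', array_slice($matches[0], 0, $n));
}

// Byte unit: cuts the 2-byte accent in half and leaves invalid UTF-8.
$byBytes = substr($word, 0, 5);

// Code point unit: valid UTF-8, but the accent is separated from its letter.
$byCodePoints = firstUnits('/./su', $word, 4);

// Grapheme unit: "e" and its accent stay together as one cluster.
$byGraphemes = firstUnits('/\X/u', $word, 4);
```

Here `$byCodePoints` is plain "cafe", while `$byGraphemes` still contains the full accented word.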

Contributor

soullivaneuh commented May 5, 2017

@stof That indeed seems legit, I came to the same conclusion after posting this question. 😉

> voku doesn't allow to manipulate strings as bytes, codepoints or graphemes units.

Indeed, maybe that is why the package's class name is UTF8. :trollface:

But still. Why class instantiation for string manipulation? It looks a little bit too heavy if I just want the length of a string and a substring.

Maybe it will be clearer for me once the related documentation is available. 😉

Fleshgrinder commented May 5, 2017

The reason for using the type system is usually to be able to use the type system. Util classes do not give you any kind of security. If I need a valid UTF-8 string I should be able to type hint that to you. Using string basically tells you, well, nothing.

That being said, I still don't like the implementation, sorry.

nicolas-grekas modified the milestones: 3.4, 4.1

Sep 28, 2017

nicolas-grekas and others added 2 commits

December 12, 2017 21:42

[Utf8] Create the component

cc1b8bc

[Utf8] added Bytes, CodePoints and Graphemes implementations

8be629d

hhamon force-pushed the utf8 branch from 679106f to 8be629d

December 12, 2017 20:44

hhamon added 3 commits

December 12, 2017 22:03

Add PHP 7.1 support

572fbc9

Fixes CS

493509d

Drop HHVM support

6bc725a

nicolas-grekas modified the milestones: 4.1, 4.2

Apr 20, 2018

nicolas-grekas added Status: Needs Work and removed Status: Needs Review labels

Jun 6, 2018

nicolas-grekas mentioned this pull request

Jun 8, 2018

Grapheme cluster boundary support twigphp/Twig#2703

Closed

Member

fabpot commented Mar 24, 2019

@nicolas-grekas What about this one? Does it make sense to finish it and merge it?

Member, Author

nicolas-grekas commented Apr 6, 2019

I really want to finish it :)

nicolas-grekas changed the base branch from master to 4.4

June 2, 2019 20:05

Member, Author

nicolas-grekas commented Sep 4, 2019

I'm closing here so we can keep the discussion relevant to the attached patch.
I'm going to open a new PR soon, I'll post the ref here so that everyone interested can join it.

nicolas-grekas closed this

Sep 4, 2019

nicolas-grekas mentioned this pull request

Sep 11, 2019

[String] a new component for object-oriented strings management with an abstract unit system #33553

Merged

Member, Author

nicolas-grekas commented Sep 11, 2019

Continued in #33553

fabpot added a commit that referenced this pull request

Sep 26, 2019

feature #33553 [String] a new component for object-oriented strings m…

5d154fb

…anagement with an abstract unit system (nicolas-grekas, hhamon, gharlan)

This PR was merged into the 5.0-dev branch.

Discussion
----------

[String] a new component for object-oriented strings management with an abstract unit system

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | no
| New feature?  | yes
| Deprecations? | no
| Tickets       | -
| License       | MIT
| Doc PR        | -

This is a reboot of #22184 (thanks @hhamon for working on it) and a generalization of my previous work on the topic ([patchwork/utf8](https://github.com/tchwork/utf8)). Unlike existing libraries (including `patchwork/utf8`), this component provides a unified API for the 3 unit systems of strings: bytes, code points and grapheme clusters.

The unified API is defined by the `AbstractString` class. It has 2 direct child classes: `BinaryString` and `AbstractUnicodeString`, itself extended by `Utf8String` and `GraphemeString`.

All objects are immutable and provide clear edge-case semantics, using exceptions and/or (nullable) types!

Two helper functions are provided to create such strings:

```php
new GraphemeString('foo') == u('foo'); // when dealing with Unicode, prefer grapheme units
new BinaryString('foo') == b('foo');
```

`GraphemeString` is the most linguistic-friendly variant of them, which means it's the one ppl should use most of the time *when dealing with written text*.

Future ideas:
 - improve tests
 - add more docblocks (only where they'd add value!)
 - consider adding more methods in the string API (`is*()?`, `*Encode()`?, etc.)
 - first class Emoji support
 - merge the Inflector component into this one
 - use `width()` to improve `truncate()` and `wordwrap()`
 - move method `slug()` to a dedicated locale-aware service class
 - propose your ideas (send PRs after merge)

Out of (current) scope:
 - what [intl](https://php.net/intl) provides (collations, transliterations, confusables, segmentation, etc)

Here is the unified API I'm proposing in this PR, borrowed from looking at many existing libraries, but also Java, Python, JavaScript and Go.

```php
function __construct(string $string = '');
static function unwrap(array $values): array
static function wrap(array $values): array
function after($needle, bool $includeNeedle = false, int $offset = 0): self;
function afterLast($needle, bool $includeNeedle = false, int $offset = 0): self;
function append(string ...$suffix): self;
function before($needle, bool $includeNeedle = false, int $offset = 0): self;
function beforeLast($needle, bool $includeNeedle = false, int $offset = 0): self;
function camel(): self;
function chunk(int $length = 1): array;
function collapseWhitespace(): self
function endsWith($suffix): bool;
function ensureEnd(string $suffix): self;
function ensureStart(string $prefix): self;
function equalsTo($string): bool;
function folded(): self;
function ignoreCase(): self;
function indexOf($needle, int $offset = 0): ?int;
function indexOfLast($needle, int $offset = 0): ?int;
function isEmpty(): bool;
function join(array $strings): self;
function jsonSerialize(): string;
function length(): int;
function lower(): self;
function match(string $pattern, int $flags = 0, int $offset = 0): array;
function padBoth(int $length, string $padStr = ' '): self;
function padEnd(int $length, string $padStr = ' '): self;
function padStart(int $length, string $padStr = ' '): self;
function prepend(string ...$prefix): self;
function repeat(int $multiplier): self;
function replace(string $from, string $to): self;
function replaceMatches(string $fromPattern, $to): self;
function slice(int $start = 0, int $length = null): self;
function snake(): self;
function splice(string $replacement, int $start = 0, int $length = null): self;
function split(string $delimiter, int $limit = null, int $flags = null): array;
function startsWith($prefix): bool;
function title(bool $allWords = false): self;
function toBinary(string $toEncoding = null): BinaryString;
function toGrapheme(): GraphemeString;
function toUtf8(): Utf8String;
function trim(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self;
function trimEnd(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self;
function trimStart(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self;
function truncate(int $length, string $ellipsis = ''): self;
function upper(): self;
function width(bool $ignoreAnsiDecoration = true): int;
function wordwrap(int $width = 75, string $break = "\n", bool $cut = false): self;
function __clone();
function __toString(): string;
```

`AbstractUnicodeString` adds these:

```php
static function fromCodePoints(int ...$codes): self;
function ascii(array $rules = []): self;
function codePoint(int $index = 0): ?int;
function folded(bool $compat = true): parent;
function normalize(int $form = self::NFC): self;
function slug(string $separator = '-'): self;
```

and `BinaryString`:

```php
static function fromRandom(int $length = 16): self;
function byteCode(int $index = 0): ?int;
function isUtf8(): bool;
function toUtf8(string $fromEncoding = null): Utf8String;
function toGrapheme(string $fromEncoding = null): GraphemeString;
```

Case insensitive operations are done with the `ignoreCase()` method.
e.g. `b('abc')->ignoreCase()->indexOf('B')` will return `1`.

For reference, CLDR transliterations (used in the `ascii()` method) are defined here:
https://github.com/unicode-org/cldr/tree/master/common/transforms

Commits
-------

dd8745a [String] add more tests
82a0095 [String] add tests
012e92a [String] a new component for object-oriented strings management with an abstract unit system