
[Routing] Add seamless support for unicode requirements #19604


Merged

fabpot merged 1 commit into symfony:master from nicolas-grekas:utf8-routes-ter

Aug 25, 2016


Member

nicolas-grekas commented Aug 12, 2016

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | no
| New feature?  | yes
| BC breaks?    | no
| Deprecations? | yes
| Tests pass?   | yes
| Fixed tickets | #3629, #5236, #19562
| License       | MIT
| Doc PR        | symfony/symfony-docs#6890

This PR adds unicode support to route matching and generation by automatically adding the `u` modifier to regexps that use either unicode characters or unicode enabled character classes (e.g. `\p...`, `\x{...}`, `\X`).

As a side note, if one wants to match a single unicode character (vs a single byte), one should use `\PM` or `\X` instead of `.`, or set the `unicode` parameter to true.
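The byte-versus-character distinction for `.` can be seen with plain `preg_match()`, independent of the Routing component. This is a minimal sketch; the sample string and variable names are chosen here for illustration only.

```php
<?php
// "é" encoded as UTF-8 is two bytes: 0xC3 0xA9.
$e = "\xC3\xA9";

// Without the u modifier, "." matches one byte, so a lone "." cannot span "é".
$byteMode = preg_match('/^.$/', $e);   // 0

// With the u modifier, "." matches one UTF-8 character.
$utf8Mode = preg_match('/^.$/u', $e);  // 1

// In byte mode, two dots are needed to cover the two bytes.
$twoBytes = preg_match('/^..$/', $e);  // 1

var_dump($byteMode, $utf8Mode, $twoBytes);
```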

carsonbot added Status: Needs Review Routing Feature labels

Aug 12, 2016

nicolas-grekas mentioned this pull request

Aug 12, 2016

[Routing] Support UTF-8 in paths and parameters #19562

Closed

nicolas-grekas force-pushed the utf8-routes-ter branch 7 times, most recently from 6385b48 to 2098128

August 16, 2016 06:58

Member, Author

nicolas-grekas commented Aug 18, 2016

ping @Tobion

Contributor

c960657 commented Aug 19, 2016

I like how this approach is simpler than my own PR #19562. I think UTF-8 is the only reasonable encoding to use for URLs (because this is what browsers use when pretty-printing the URLs in the location field), and I assume most people will use UTF-8 internally in their data and source code, so I like how this patch favours UTF-8 while still allowing people to do otherwise if they want.

I think the auto-detection of UTF-8 patterns is a bit too magic. E.g. whether `.` matches a byte or a character is determined by whether the pattern contains some of the triggering characters elsewhere. So I am wondering whether we can make a more explicit but backwards-compatible way of enabling/disabling UTF-8.

Here are two examples that are explicit. They are not native PCRE syntax but reuse syntax used in PCRE.

* PCRE allows enabling UTF-8 mode from within the pattern using `(*UTF8)` (mentioned here). This is only valid at the beginning of the pattern, but we could reuse the syntax for triggering UTF-8 and just strip it when concatenating the pattern.
* PCRE allows setting internal options using `(?`, e.g. `(?i)` and `(?-i)` will enable and disable case-insensitivity, respectively. This does not work for the `u` modifier, but perhaps we could reuse the syntax and strip it. This syntax also allows us to make UTF-8 on by default in Symfony 4, so that it has to be explicitly disabled.

According to the manual, using `\w` is faster than `\pL`, so it would be nice if one could trigger UTF-8 mode without using `\pL`.
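The first idea above (reuse the `(*UTF8)` syntax as a marker and strip it before the pattern is concatenated) could look roughly like this. `splitUtf8Marker()` is a hypothetical helper written for illustration, not a Symfony API, and the `#^...$#` wrapping is only an assumption about how a compiler might assemble the final regex.

```php
<?php
/**
 * Hypothetical helper, not part of Symfony: detects a leading "(*UTF8)"
 * marker in a requirement and strips it.
 *
 * @return array [bare pattern, whether UTF-8 mode was requested]
 */
function splitUtf8Marker(string $requirement): array
{
    $marker = '(*UTF8)';
    if (strncmp($requirement, $marker, strlen($marker)) === 0) {
        return [substr($requirement, strlen($marker)), true];
    }

    return [$requirement, false];
}

// The marker is removed from the requirement and turned into the u modifier
// on the assembled regex (the delimiters here are an assumption).
[$pattern, $utf8] = splitUtf8Marker('(*UTF8).');
$regex = '#^' . $pattern . '$#' . ($utf8 ? 'u' : '');

echo $regex, "\n";
```

Because the marker is stripped before concatenation, it can sit at the start of any requirement even though native PCRE only accepts `(*UTF8)` at the very beginning of the whole pattern.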

Member, Author

nicolas-grekas commented Aug 19, 2016

I didn't know about the `(*UTF8)` prefix, thanks for the link! I don't think we should target adding the `u` flag by default in 4.0: most of the time, URLs are plain ASCII strings. But I agree with you: having magic behavior for `.` is an issue, and recommending `\X` or `\PM` to opt into unicode comes with a perf overhead.
Since unicode should really be enabled at the route level (not at the requirement level), I think we could add a conventional prefix to routes for enabling unicode. I propose `*` for now.
The current magic that detects unicode could be turned into a deprecation (an exception in 4.0) for warning the user when one uses unicode characters/properties while the `*` prefix is missing.

nicolas-grekas force-pushed the utf8-routes-ter branch from 2098128 to de3a063

August 19, 2016 08:34

Member, Author

nicolas-grekas commented Aug 19, 2016

The prefix idea doesn't play well with route prefixing as done by loaders...
The PR now handles a `*` prefix to enable unicode at the requirements level.
See test cases also.

Member

stof commented Aug 19, 2016

Another solution is to use a route option to opt in unicode mode. Route options are precisely meant to give hints to the route compiler.
This would be easier to explain and remember than the fact of adding a magical `(*UTF8)` needing to be stripped.

Member, Author

nicolas-grekas commented Aug 19, 2016

@stof but that would be a DX nightmare: if you'd use unicode, you'd need to repeat yourself thousands of times. The current way is a bit magic, but really seamless: if you use any unicode chars or any unicode property, unicode is enabled. There is only one special case: the `*` prefix to requirements to force unicode matching for `.` when needed.

Member, Author

nicolas-grekas commented Aug 19, 2016

Doc PR added: symfony/symfony-docs#6890

nicolas-grekas force-pushed the utf8-routes-ter branch from de3a063 to 3f0718b

August 19, 2016 09:26

Member

fabpot commented Aug 19, 2016

I don't really like the `*` convention, but I don't have any better idea, so 👍

Member

stof commented Aug 19, 2016

@fabpot what about my proposal of using a route option?

@nicolas-grekas we could keep the autodetection of unicode. The option would be used instead of using a magic `*` (the other question is whether setting the option to false explicitly should disable the autodetection).

Member

fabpot commented Aug 19, 2016

@stof Indeed, I think an option is better than an obscure convention.

nicolas-grekas force-pushed the utf8-routes-ter branch from 3f0718b to 7d6c262

August 19, 2016 16:28

Member, Author

nicolas-grekas commented Aug 19, 2016

updated

nicolas-grekas force-pushed the utf8-routes-ter branch 2 times, most recently from c12f06a to d52bda3

August 19, 2016 16:36

Contributor

c960657 commented Aug 23, 2016

The patch uses the term “Unicode” in variable names etc., but wouldn't it be more correct to use “UTF-8” (the specific encoding we are supporting)? “Unicode” is a much wider concept. Note that the Unicode character properties escape sequences, `\p{xx}` etc., work even in non-UTF-8 mode.

Also, I think we disagree a bit on what exactly this feature does:

    /**
     * Returns the unicode enforcement status.
     *
     * @return bool Whether unicode matching is enforced or not
     */
    public function getUnicode()

In my eyes, this flag enables UTF-8 support for regular expressions, i.e. basically adds the `u` modifier, nothing else. Do you agree, or could you elaborate what "unicode enforcement" means?

c960657 reviewed Aug 23, 2016

src/Symfony/Component/Routing/CHANGELOG.md


		* Added support for `bool`, `int`, `float`, `string`, `list` and `map` defaults in XML configurations.
		* Added support for unicode requirements

Contributor

c960657 commented Aug 23, 2016

Unicode is a proper noun, so it should be capitalized.

c960657 reviewed Aug 23, 2016

src/Symfony/Component/Routing/RouteCollection.php (outdated)

		*
		* Existing settings will be overridden.
		*
		* @param string $unicode Whether unicode matching is enforced or not

Contributor

c960657 commented Aug 23, 2016

@param bool

nicolas-grekas force-pushed the utf8-routes-ter branch 2 times, most recently from 8276d4b to a1d4bc4

August 24, 2016 05:43

Member, Author

nicolas-grekas commented Aug 24, 2016

> The patch uses the term “Unicode” in variable names etc., but wouldn't it be more correct to use “UTF-8”

I hesitated on this, and chose Unicode to match what the PCRE doc uses. On the web, Unicode is a synonym for UTF-8. Nobody uses any other Unicode encoding there. And the PCRE doc talks about "unicode properties", the "unicode" flag, etc. I thought it'd be better to make the vocabularies match.

> Unicode character properties escape sequences, \p{xx} etc., works even in non-UTF-8-mode.

Yes, yet it's undocumented (in the PHP doc at least, where on the contrary it's specifically documented under the "Unicode properties" chapter), and thus nobody knows what it does, esp. when considering high-ASCII chars.

> this flag [...] adds the u modifier – nothing else.

Yes when given `true`. But when given `false`, it doesn't "remove" the `u` modifier. Instead, it then relies on detecting a Unicode char or property in the pattern to decide whether to add the `u` modifier anyway.
Thus setting this flag to true "enforces" the `u` modifier, whereas setting it to false "turns on" auto-detection.
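The true/false semantics described here can be sketched as a small decision function. This is an illustration only: `shouldAddUtf8Modifier()` and its deliberately naive detection regex are assumptions made for the example, not the code merged in this PR.

```php
<?php
/**
 * Illustrative sketch; the function name and the detection regex are
 * assumptions, not the implementation from this PR.
 */
function shouldAddUtf8Modifier(string $requirement, bool $utf8Flag): bool
{
    if ($utf8Flag) {
        return true; // flag set to true: the u modifier is enforced
    }

    // Flag set to false: fall back to auto-detection. Look for non-ASCII
    // bytes, or for the escapes \p..., \P..., \X and \x{...}.
    // Naive on purpose: an escaped backslash followed by "p" would also match.
    return preg_match('/[\x80-\xFF]|\\\\(?:[pPX]|x\{)/', $requirement) === 1;
}

var_dump(shouldAddUtf8Modifier('.', true));     // enforced
var_dump(shouldAddUtf8Modifier('.', false));    // nothing detected
var_dump(shouldAddUtf8Modifier('\pL+', false)); // detected via \p
```

Note that passing `false` never strips the modifier; it only means the decision is left to detection.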

Contributor

c960657 commented Aug 24, 2016

> I hesitated on this, and chose Unicode to match what the PCRE doc uses

In most places, the PCRE man pages refer to "UTF-8 mode" or "UTF mode" (the library also supports UTF-16 and UTF-32).

> Yes, yet it's undocumented (in the PHP doc at least, where on the contrary it's specifically documented under the "Unicode properties" chapter), and thus nobody knows what it does, esp. when considering high-ASCII chars.

On the man page it says:
"When in 8-bit non-UTF-8 mode, these sequences are of course limited to testing characters whose codepoints are less than 256, but they do work in this mode."

Member, Author

nicolas-grekas commented Aug 24, 2016

I never read the man page of pcre but I read many times the one on php.net. I guess I'm more like the usual PHP user on this one :)

nicolas-grekas force-pushed the utf8-routes-ter branch from a1d4bc4 to 02b1214

August 24, 2016 20:42

Contributor

c960657 commented Aug 25, 2016

> I never read the man page of pcre but I read many times the one on php.net.

I'm not sure which parts of the PHP manual you are referring to. I couldn't find any occurrences of the phrase "Unicode mode" outside user comments. "UTF-8 mode" occurs in a few places, e.g. in the description of PREG_BAD_UTF8_OFFSET_ERROR.

Unicode mode:
https://www.google.dk/search?q=%22unicode+mode%22+site%3Aphp.net+inurl%3Amanual

UTF-8 mode:
https://www.google.dk/search?q=%22utf-8+mode%22+site%3Aphp.net+inurl%3Amanual

Additionally, the word "Unicode" is used in connection with character classes, but as mentioned above they work with and without UTF-8 mode.
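That `\p` escapes are accepted without UTF-8 mode can be checked with plain `preg_match()`. The sketch below is only an illustration with sample input chosen here; it also shows why the mode still matters, since a UTF-8 encoded "é" is seen as two 8-bit units unless the `u` modifier is present.

```php
<?php
// No u modifier: the subject is read as a sequence of 8-bit units,
// yet \pL is still a valid escape.
$ascii = preg_match('/^\pL$/', 'a');             // 1: "a" is a letter

// UTF-8 "é" is two bytes (0xC3 0xA9), so a single \pL cannot cover it here.
$utf8Bytes = preg_match('/^\pL$/', "\xC3\xA9");  // 0

// With the u modifier the same two bytes decode to one letter.
$utf8Mode = preg_match('/^\pL$/u', "\xC3\xA9");  // 1

var_dump($ascii, $utf8Bytes, $utf8Mode);
```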

nicolas-grekas force-pushed the utf8-routes-ter branch from 02b1214 to 8f62888

August 25, 2016 08:09

Member, Author

nicolas-grekas commented Aug 25, 2016

@c960657 thanks for your input. I've just renamed unicode to utf8 everywhere.

I've also added a deprecation targeting the current magic: I propose to encourage people to explicitly enable the utf8 flag; throw a deprecation when they don't, and throw a LogicException in 4.0.
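The transition path proposed here (explicit flag preferred, deprecation notice when auto-detection kicks in) might be sketched as follows. `resolveUtf8()`, the detection rule and the message text are all hypothetical and written for illustration; they are not taken from the patch.

```php
<?php
/**
 * Hypothetical sketch of the proposed transition; names, detection rule and
 * message are assumptions, not the code from this PR.
 */
function resolveUtf8(string $requirement, bool $utf8Option): bool
{
    if ($utf8Option) {
        return true; // explicit opt-in: nothing to warn about
    }

    // Simplified detection: any non-ASCII byte in the requirement.
    if (preg_match('/[\x80-\xFF]/', $requirement) === 1) {
        // Silenced notice, so only registered error handlers see it.
        // Under the proposal, 4.0 would throw a LogicException here instead.
        @trigger_error(
            'Using UTF-8 requirements without setting the "utf8" option is deprecated.',
            E_USER_DEPRECATED
        );

        return true;
    }

    return false;
}
```

A caller that sets the option explicitly never reaches the detection branch, which is the behaviour the deprecation is meant to encourage.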

nicolas-grekas added the Deprecation label

Aug 25, 2016

Contributor

c960657 commented Aug 25, 2016

Sounds like a good solution :)

stof reviewed Aug 25, 2016

src/Symfony/Component/Routing/Annotation/Route.php (outdated)

		$this->utf8 = $utf8;
		}

		public function isUtf8()

Member

stof commented Aug 25, 2016

Do we need a new property on the route? I suggested using options, which are already meant to provide hints to the route compiler.

Member, Author

nicolas-grekas commented Aug 25, 2016

That would complicate the transition: options are validated and you can't add an unsupported key (no forward compat). This means it wouldn't be possible to write routes in a forward/backward compatible manner in Yml/Xml/Annotations.

nicolas-grekas force-pushed the utf8-routes-ter branch 2 times, most recently from 21c3992 to 70c40f0

August 25, 2016 09:02

nicolas-grekas reviewed Aug 25, 2016

src/Symfony/Component/Routing/Route.php

		* Available options:
		*
		* * compiler_class: A class name able to compile this route instance (RouteCompiler by default)
		* * utf8: Whether UTF-8 matching is enforced or not

Member, Author

nicolas-grekas commented Aug 25, 2016

utf8 is now an option on the route. ping @stof (I don't know where I got this idea that it wasn't possible...)

nicolas-grekas force-pushed the utf8-routes-ter branch from 70c40f0 to e5cb1f4

August 25, 2016 09:16

Member

stof commented Aug 25, 2016

👍

[Routing] Add seamless support for unicode requirements

a829d34

nicolas-grekas force-pushed the utf8-routes-ter branch from e5cb1f4 to a829d34

August 25, 2016 09:23

nicolas-grekas mentioned this pull request

Aug 25, 2016

[Routing] Add doc about unicode requirements symfony/symfony-docs#6890

Merged

Member

fabpot commented Aug 25, 2016

Thank you @nicolas-grekas.

fabpot merged commit a829d34 into symfony:master

Aug 25, 2016

fabpot added a commit that referenced this pull request

Aug 25, 2016

feature #19604 [Routing] Add seamless support for unicode requirements (nicolas-grekas)

98051e9

This PR was merged into the 3.2-dev branch.

Commits: a829d34 [Routing] Add seamless support for unicode requirements

nicolas-grekas deleted the utf8-routes-ter branch

September 1, 2016 07:52

xabbuh added a commit to symfony/symfony-docs that referenced this pull request

Sep 21, 2016

feature #6890 [Routing] Add doc about unicode requirements (nicolas-grekas)

e9ce9ec

This PR was merged into the master branch.

Ref. symfony/symfony#19604

Commits: 75ed392 Add doc about unicode requirements

fabpot mentioned this pull request

Oct 27, 2016

Release v3.2.0-BETA1 #20317

Merged

Labels

Deprecation Feature Routing Status: Needs Review
