Add Unicode character escape \u{H} to OCaml string literals. #1232
Conversation
I'm not 100% sure about this change, essentially because I'm not clear about what the overall plan is for Unicode in OCaml. Fundamentally, it is not clear to me what the type
With the addition of the There are features of
This patch and some other recent changes seem to be heading for a world where My personal preference would probably be to add a There are other options that are also perfectly reasonable. For example, add
Thanks for your feedback @lpw25. First I will note that some of your interpretations don't really make sense to me: there is no such thing as a "UTF-8 character". So, as this may change the perspective a bit, I will slightly tweak the points of yours that interest me and change them to:
Now for me it is very clear that the interpretation of In particular this is needed for interoperability reasons, e.g. all the OS APIs should be allowed to take/return byte strings for filenames; on Windows this will be valid UTF-8 after #1200, but for POSIX APIs this could be an arbitrary sequence of bytes as it all depends on the file system you are interacting with. Now regarding this proposal, and in the light of (my redefinition of) 4 and 8, I personally don't see it as changing the meaning of strings interpreted as 2: the escape sequences simply expand to a sequence of bytes, which is consistent with 2. It only helps those who need to specify them in programs as 4 or 8. Regarding a new While in #1091 point 6. I justify why I intentionally left corresponding decoding APIs. I think that leaving aside decoding
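To make the "escapes simply expand to bytes" point concrete, here is a minimal sketch (the literal is mine, chosen only for illustration):

```ocaml
(* The escape below expands, at compile time, to the UTF-8 encoding of
   U+1F42B, i.e. the four bytes "\xF0\x9F\x90\xAB". The resulting value
   is still an ordinary byte string (interpretation 2). *)
let s = "\u{1F42B}"
let () = assert (String.length s = 4)
let () = assert (s = "\xF0\x9F\x90\xAB")
```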
How do I write "unicode character 0x1234 followed by a
That's not the syntax this PR implements. The answer to your question is
Ah, indeed, I missed the braces, so my question is moot. Perhaps because I've been spending too much time reading the C standards lately, I was expecting
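For the record, a small sketch of how the braced form sidesteps the question above (assuming the question was about a following digit; the values are illustrative only):

```ocaml
(* U+1234 immediately followed by the ASCII digit '5': the closing brace
   delimits the escape, so no zero padding or second escape form is needed. *)
let s = "\u{1234}5"
```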
Indeed my original comment was a bit imprecise.
That's certainly a reasonable meaning to give it, although it does then become a bit odd that we have functions which just assume such a sequence is ASCII -- e.g. the printing functions. This could probably be explained as a hangover from C, which also confuses its ASCII string type with its array-of-bytes type.
For me, this doesn't follow. I think the need for these interpretations justifies the existence of conversion functions into types that properly represent them (e.g.
It's not clear from what I wrote, but that is also what I had in mind.
This is a good thing in my opinion. Assuming that all filenames are valid utf8 is exactly the kind of subtle bug that I suspect is quite common in OCaml at the moment, and will continue to be without a type level distinction between these encodings. With a
I certainly agree we shouldn't rush the design for such a thing. However, I think that we should already be discussing whether we intend to make such a thing in the future. I think that decision is required to clarify our intended design and would make it easier to produce good decisions on existing proposals.
Do you think this module is useful after the addition of a
I could certainly buy that argument for its inclusion, something like "it is a convenient short-hand to make it easy to produce a UTF-8 encoded string directly rather than via
In principle, I agree that relying more on the type system to distinguish between byte arrays and text (= sequences of Unicode scalar values) would be good. In particular, it would allow decoupling the representation of text from a specific encoding, which would be good for JavaScript backends (which could implement text with native JS strings). This is coherent with the proposal on "unsigned boxed integers" (where the idea is to avoid mixing different "types" just because they happen to share a common representation). That said, I don't see a realistic migration path here, at least for the stdlib. Alternative "standard" libraries which put less emphasis on keeping backward compatibility could indeed decide to keep "bytes" and export a new "text" type (or rename it to "string") implemented either as UTF-8 encoded buffers (bytes) or as arrays of scalar values (int array), with only proper operations exposed. So I'm in favor of the change, which provides a useful feature for the de facto situation that the "string" type is (sometimes) used to represent UTF-8 encoded text. I've no opinion on the proposed syntax, though (and did not review the PR).
Using braces to allow eliding leading zeros is not consistent with existing syntax for escapes ( |
@lpw25 I will respond to your message later as it is an interesting discussion to be had.
You cannot really do this without introducing two escape syntaxes like in C (i.e. Personally I don't want to think about the value of my character or the context in which it occurs to know which escape I should use; it artificially divides the Unicode code space. C's escape design is basically a remnant of the wide chars design and I'd rather not use it (note that C's The
First if you want to copy-paste to/from C you'd need
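A short sketch of the "don't divide the code space" point (values chosen for illustration, not from the thread):

```ocaml
(* With the variable-width \u{H} form the same notation covers the whole
   scalar value range; C99/Go would need \u00E4 for the first and
   \U0001F42B for the second. *)
let a_umlaut = "\u{E4}"     (* U+00E4, encodes to 2 UTF-8 bytes *)
let camel    = "\u{1F42B}"  (* U+1F42B, encodes to 4 UTF-8 bytes *)
```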
manual/manual/refman/lex.etex Outdated
table given above for character literals, or an Unicode character
escape sequence.
An Unicode character escape sequence is substituted by the UTF-8
A native English speaker would know better, but I would have written "A Unicode".
I was always tormented by that. Now I think I eventually understood the rule. The rule is not about the vowel as a letter but as a sound, and apparently the "u" here is not a vowel sound in English (e.g. a university, not an university). So I think you are right, I'd still like a native English speaker to confirm.
I would say "A Unicode".
manual/manual/refman/lex.etex Outdated
escape sequence.
An Unicode character escape sequence is substituted by the UTF-8
encoding of the Unicode character specified using 1 to 6 hexadecimal
Unicode character? Why not refer directly to a Unicode scalar value?
parsing/lexer.mll Outdated
begin
  Buffer.reset uchar_utf_8_enc;
  Buffer.add_utf_8_uchar uchar_utf_8_enc u;
  store_string (Buffer.contents uchar_utf_8_enc);
Not directly related to this PR, but does anyone know why lexer.mll doesn't simply use a Buffer.t instead of its own version? This would simplify the chunk above.
Probably for historical reasons: lexer.ml predates Buffer by a number of years.
parsing/lexer.mll Outdated
raise
  (Error
     (Illegal_escape
        (Lexing.lexeme lexbuf ^ ", " ^ Printf.sprintf "%X" cp ^
What's the point of showing both the lexbuf content (\u{hhhh}) and the parsed cp?
The lexbuf content is what the user wrote, which may have for example leading zeroes. The parsed cp is what the program understood. Showing how the program interpreted your input sometimes leads to easier diagnostics.
parsing/lexer.mll Outdated
@@ -627,6 +662,10 @@ and string = parse
  | '\\' 'x' ['0'-'9' 'a'-'f' 'A'-'F'] ['0'-'9' 'a'-'f' 'A'-'F']
    { store_escaped_char lexbuf (char_for_hexadecimal_code lexbuf 2);
      string lexbuf }
  | '\\' 'u' '{'
    hex_digit hex_digit? hex_digit? hex_digit? hex_digit? hex_digit? '}'
What about hex_digit+ and an explicit error message if more than 6 digits? Otherwise the lexer falls into the generic "Illegal backslash" warning.
You might also want to allow '_' as for other integer literals (of non-fixed length).
Will do the hex_digit+.
Regarding _, I'd rather not: this would diverge from the standard \u{h} notation (which I would like to stress I didn't invent out of thin air).
parsing/lexer.mll Outdated
let rec hex_num_value acc i =
  if i > epos then acc else
  let digit = Char.code (Lexing.lexeme_char lexbuf i) in
  let value =
I suggest introducing a val hex_digit : char -> int and using it also in char_for_hexadecimal. Or perhaps directly val hex_from_lexbuf : lexbuf -> int (* pos *) -> int (* len *) -> int.
Looking at this file there are many redundancies that could benefit from being factored out (e.g. the hex digits are repeated all over the place). I can certainly do this one in a separate commit.
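For what it's worth, a minimal sketch of the kind of shared helper suggested above (name hypothetical):

```ocaml
(* Value of a single hexadecimal digit character; this is the piece that
   char_for_hexadecimal_code and hex_num_value could both reuse. *)
let hex_digit_value c = match c with
  | '0' .. '9' -> Char.code c - Char.code '0'
  | 'a' .. 'f' -> Char.code c - Char.code 'a' + 10
  | 'A' .. 'F' -> Char.code c - Char.code 'A' + 10
  | _ -> invalid_arg "hex_digit_value"
```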
Updated the PR to take Alain's review (thanks!) into account.
@lpw25 raises good questions which I try to answer below.
That's a bit of a nitpick, but since the same printing functions are used to output on streams opened in binary mode, I wouldn't say that those assume sequences of US-ASCII bytes.
I agree with this. But I also argue that this PR does not define a new way of operating on such interpretations. It gives a new, more convenient (you can already handle what it does via I'm not advocating for adding new text (i.e. non-interpretation-2) operations on I do however advocate support for defining (e.g. via literals), encoding and decoding the interpretations 2, 4, and 8 of
I'm sure most OCaml programs at the moment do not make that kind of assumption. Most of the time when you use the OS interface you get bytes from the OS and at most you remove/replace/append a few US-ASCII bytes and treat the rest as opaque. If this is already too much for you, since this is all baked by
So let us assume such a Unicode text type is added to the stdlib and the language. We would get a new data type and a new syntax for literals of this type. The only way I see this PR interfering with such a prospect is consistency in the syntax adopted for escaping Unicode scalar values in the new literals and in the
To be honest I'm not sure I'm convinced by the addition of such a module at all (which is now implemented in this branch of Uutf). If it were only up to me, and we had algebraic effects, I think I would only provide the Uutf folding functions in
It can't be the first or second argument since that situation doesn't exist; if it did I would however say that the second (consistency between string and text lits) would be a reasonable justification in my opinion (see however [1] below). I wouldn't say it's the third either. First, it's not a foolproof way to create UTF-8 encoded strings since string literals remain fundamentally 2. Second, this doesn't say that UTF-8 is the primary representation for text in OCaml as this doesn't preclude the introduction of a text type and associated literal syntax in the future. However what it does clearly say is that interpretation 8 is prevalent in OCaml and by this PR definitively favoured by the language over the UTF-16 encodings (which are perfectly allowed to live in I can perfectly accept the argument that this PR is just convenience (write I hope the above clears things up a little bit and will allow the language designers to better assess the meaning and cost/benefit of this proposal. [1] I could mention
I don't mind introducing features to deal with the current state of affairs as long as it's possible to give a good explanation/narrative for those features after the state of affairs changes. At this point I'm fairly convinced that this proposal meets that standard: it improves things now and if/when something like a
I don't know enough about Unicode in the wild to have an opinion about this. Are there specific pros/cons for choosing UTF-8 over UTF-16, beyond the fact that more people use UTF-8 in the OCaml world?
Whilst obviously difficult, I think it is achievable. One option would be to take a similar approach to the Looking over the stdlib, the main things which need changing are in
From a storage size perspective UTF-8 is fine for most Latin scripts but a bit more wasteful for East Asian scripts; the exact converse is true for UTF-16. Now what follows may well be the point of view of a westerner so take it with a grain of salt. I still have the impression that UTF-8 took over a bit more of the world due to the US-ASCII and 8-bit compatibility story, especially at the Another random reference is the W3C (which may also be a westerner's point of view for that matter) that advises in its high-level documentation to use UTF-8 for documents served on the web. That same page indicates that UTF-8 + ASCII (which is UTF-8 in disguise) accounts for 80% of webpage encodings. Another page has some (unsourced) stats on the presence of UTF-16 on the web: less than 0.01%. This is not to say that UTF-16 is dead in the water, you will still find it tucked in some binary formats, I mentioned that when
A nice summary of UTF-8 advantages: https://research.swtch.com/utf8
It is, but note that the very first sentence of this webpage is wrong...
UTF-16 is used at least in the Windows API, and as the natural encoding for text in Java, .NET, JavaScript. Note that these languages all use their usual String type (which is really just a sequence of 16-bit integers, in the same way that OCaml strings are sequences of 8-bit integers) also to represent filenames. Under Unix at least, this means they must make an assumption about the encoding of filenames on the file system.
Would you duplicate output_string to output_string/output_text, or declare that output_string operates on text and people should use output_bytes to work with binary / explicitly encoded data? Same question for int_of_string / string_of_int? I think external "standard" libraries would be a good place to experiment with such a design. Did Core/Base do something like that?
Anyway, nobody seems to be opposed to the current proposal, in principle. Considering it impacts the language syntax, it would be good if another core developer could explicitly approve the PR before it is merged. The choice of syntax seems less consensual. @xavierleroy, @nojb?
Personally, the proposed syntax seems fine; my only point was that it is inconsistent with the existing escape syntax (decimal, hex, octal), but OTOH it has other advantages (as discussed above).
I support this proposal but at this point I'm not 100% convinced by the proposed syntax. Could we please check what other languages are using, just to make sure that we are not gratuitously different?
This is a bit outside my area of expertise and should be carefully considered by the people in charge of the evolution of the syntax of the language, as I'm not sure I fully grasp all the consequences...
It also firmly promotes the idea that string values, if they are to be interpreted as text, are UTF-8 encoded. But with #1200 in the pipeline this is really becoming a fact. The implementation is enabled by the recent addition of Buffer.add_utf_8_uchar in #1091.

The syntax of string literals is changed to add the \u{H} escape sequence, which replaces the escape by the UTF-8 encoding of the Unicode scalar value denoted by the hexadecimal number H made of one to six lower or uppercase hexadecimal digits.

The syntax chosen is the one usually used for Unicode character escapes with a variable number of hex digits. See wikipedia and for example the notation used in the standard for Unicode regular expression syntax or in rust. I think it's a much better notation than for example the C99 or go escape sequences, which have two fixed-width notations: either \uHHHH or \UHHHHHHHH.
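As an illustration of the variable-width form described above (examples mine, not part of the PR text):

```ocaml
(* One to six hex digits, lower or upper case, leading zeros optional;
   each escape is replaced by the UTF-8 encoding of the scalar value. *)
let newline = "\u{A}"       (* same bytes as "\n", 1 byte  *)
let e_acute = "\u{00E9}"    (* U+00E9,             2 bytes *)
let camel   = "\u{1F42B}"   (* U+1F42B,            4 bytes *)
```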
Backward compatibility
Unfortunately, due to OCaml's lax policy about escape sequences, this change can break programs that trigger warning 14 on compilation. More precisely, programs which have literals with "\u" as a substring (the backslash being unescaped) will now either fail, or the string literal will be silently compiled to another representation in case the literal has a subsequence that respects the \u{H} syntax.
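A hypothetical illustration of the kind of program affected (the literal is invented for the example):

```ocaml
(* Before this change, \u was an unknown escape: the literal below
   triggered warning 14 (illegal backslash) and the backslash was kept
   literally, so s contained the 11 bytes of "dir\\u{66}oo". With \u{H}
   accepted, it now silently compiles to the 6-byte string "dirfoo". *)
let s = "dir\u{66}oo"
```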