Add Unicode character escape \u{H} to OCaml string literals. #1232
Conversation
I'm not 100% sure about this change, essentially because I'm not clear about what the overall plan is for Unicode in OCaml. Fundamentally, it is not clear to me what the type
With the addition of the There are features of
This patch and some other recent changes seem to be heading for a world where My personal preference would probably be to add a There are other options that are also perfectly reasonable. For example, add
Thanks for your feedback @lpw25. First I will note that some of your interpretations don't really make sense to me: there is no such thing as a "UTF-8 character". So, as this may change the perspective a bit, I will slightly tweak the points of yours that interest me and change them to:
Now for me it is very clear that the interpretation of In particular this is needed for interoperability reasons, e.g. all the OS APIs should be allowed to take/return byte strings for filenames; on Windows this will be valid UTF-8 after #1200, but for POSIX APIs this could be an arbitrary sequence of bytes as it all depends on the file system you are interacting with. Now regarding this proposal, and in the light of (my redefinition of) 4 and 8, I personally don't see it as changing the meaning of strings interpreted as 2: the escape sequences simply expand to a sequence of bytes, which is consistent with 2. It only helps those who need to specify them in programs as 4 or 8. Regarding a new While in #1091 point 6. I justify why I intentionally left corresponding decoding APIs. I think that leaving aside decoding
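To make the "escapes simply expand to bytes" point concrete, here is a minimal sketch (the literal is mine, chosen only for illustration):

```ocaml
(* The escape below expands, at compile time, to the UTF-8 encoding of
   U+1F42B, i.e. the four bytes "\xF0\x9F\x90\xAB". The resulting value
   is still an ordinary byte string (interpretation 2). *)
let s = "\u{1F42B}"
let () = assert (String.length s = 4)
let () = assert (s = "\xF0\x9F\x90\xAB")
```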
How do I write "unicode character 0x1234 followed by a
That's not the syntax this PR implements. The answer to your question is
Ah, indeed, I missed the braces, so my question is moot. Perhaps because I've been spending too much time reading the C standards lately, I was expecting
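For the record, a small sketch of how the braced form sidesteps the question above (assuming the question was about a following digit; the values are illustrative only):

```ocaml
(* U+1234 immediately followed by the ASCII digit '5': the closing brace
   delimits the escape, so no zero padding or second escape form is needed. *)
let s = "\u{1234}5"
```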
Indeed my original comment was a bit imprecise.
That's certainly a reasonable meaning to give it, although it does then become a bit odd that we have functions which just assume such a sequence is ASCII -- e.g. the printing functions. This could probably be explained as a hangover from C, which also confuses its ASCII string type with its array-of-bytes type.
For me, this doesn't follow. I think the need for these interpretations justifies the existence of conversion functions into types that properly represent them (e.g.
It's not clear from what I wrote, but that is also what I had in mind.
This is a good thing in my opinion. Assuming that all filenames are valid utf8 is exactly the kind of subtle bug that I suspect is quite common in OCaml at the moment, and will continue to be without a type level distinction between these encodings. With a
I certainly agree we shouldn't rush the design for such a thing. However, I think that we should already be discussing whether we intend to make such a thing in the future. I think that decision is required to clarify our intended design and would make it easier to produce good decisions on existing proposals.
Do you think this module is useful after the addition of a
I could certainly buy that argument for its inclusion, something like "it is a convenient short-hand to make it easy to produce a UTF-8 encoded string directly rather than via
In principle, I agree that relying more on the type system to distinguish between byte arrays and text (= sequences of Unicode scalar values) would be good. In particular, it would allow decoupling the representation of text from a specific encoding, which would be good for JavaScript backends (which could implement text with native JS strings). This is coherent with the proposal on "unsigned boxed integers" (where the idea is to avoid mixing different "types" just because they happen to share a common representation). That said, I don't see a realistic migration path here, at least for the stdlib. Alternative "standard" libraries which put less emphasis on keeping backward compatibility could indeed decide to keep "bytes" and export a new "text" type (or rename it to "string") implemented either as UTF-8 encoded buffers (bytes) or as arrays of scalar values (int array), with only proper operations exposed. So I'm in favor of the change, which provides a useful feature for the de facto situation that the "string" type is (sometimes) used to represent UTF-8 encoded text. I've no opinion on the proposed syntax, though (and did not review the PR).
Using braces to allow eliding leading zeros is not consistent with existing syntax for escapes ( |
@lpw25 I will respond to your message later as it is an interesting discussion to be had.
You cannot really do this without introducing two escape syntaxes like in C (i.e. Personally I don't want to think about the value of my character or the context in which it occurs to know which escape I should use; it artificially divides the Unicode code space. C's escape design is basically a remnant of the wide chars design and I'd rather not use it (note that C's The
First if you want to copy-paste to/from C you'd need
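A short sketch of the "don't divide the code space" point (values chosen for illustration, not from the thread):

```ocaml
(* With the variable-width \u{H} form the same notation covers the whole
   scalar value range; C99/Go would need \u00E4 for the first and
   \U0001F42B for the second. *)
let a_umlaut = "\u{E4}"     (* U+00E4, encodes to 2 UTF-8 bytes *)
let camel    = "\u{1F42B}"  (* U+1F42B, encodes to 4 UTF-8 bytes *)
```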
manual/manual/refman/lex.etex Outdated
table given above for character literals, or an Unicode character
escape sequence.
An Unicode character escape sequence is substituted by the UTF-8
A native English speaker would know better, but I would have written "A Unicode".
I was always tormented by that. Now I think I eventually understood the rule. The rule is not about the vowel as a letter but as a sound, and apparently the "u" here is not a vowel sound in English (e.g. a university, not an university). So I think you are right, I'd still like a native English speaker to confirm.
I would say "A Unicode".
manual/manual/refman/lex.etex Outdated
escape sequence.
An Unicode character escape sequence is substituted by the UTF-8
encoding of the Unicode character specified using 1 to 6 hexadecimal
Unicode character? Why not refer directly to a Unicode scalar value?
parsing/lexer.mll Outdated
begin
  Buffer.reset uchar_utf_8_enc;
  Buffer.add_utf_8_uchar uchar_utf_8_enc u;
  store_string (Buffer.contents uchar_utf_8_enc);
Not directly related to this PR, but does anyone know why lexer.mll doesn't simply use a Buffer.t instead of its own version? This would simplify the chunk above.
Probably for historical reasons: lexer.ml predates Buffer by a number of years.
parsing/lexer.mll Outdated
raise
  (Error
     (Illegal_escape
        (Lexing.lexeme lexbuf ^ ", " ^ Printf.sprintf "%X" cp ^
What's the point of showing both the lexbuf content (\u{hhhh}) and the parsed cp?
The lexbuf content is what the user wrote, which may have for example leading zeroes. The parsed cp is what the program understood. Showing how the program interpreted your input sometimes leads to easier diagnostics.
parsing/lexer.mll Outdated
@@ -627,6 +662,10 @@ and string = parse
  | '\\' 'x' ['0'-'9' 'a'-'f' 'A'-'F'] ['0'-'9' 'a'-'f' 'A'-'F']
    { store_escaped_char lexbuf (char_for_hexadecimal_code lexbuf 2);
      string lexbuf }
  | '\\' 'u' '{'
    hex_digit hex_digit? hex_digit? hex_digit? hex_digit? hex_digit? '}'
What about hex_digit+ and an explicit error message if more than 6 digits? Otherwise the lexer falls into the generic "Illegal backslash" warning.
You might also want to allow '_' as for other integer literals (of non-fixed length).
Will do the hex_digit+.
Regarding _, I'd rather not: this would diverge from the standard \u{h} notation (which I would like to stress I didn't invent out of thin air).
parsing/lexer.mll Outdated
let rec hex_num_value acc i =
  if i > epos then acc else
  let digit = Char.code (Lexing.lexeme_char lexbuf i) in
  let value =
I suggest introducing a val hex_digit : char -> int and using it also in char_for_hexadecimal. Or perhaps directly val hex_from_lexbuf : lexbuf -> int (* pos *) -> int (* len *) -> int.
Looking at this file there are many redundancies that could benefit from being factored out (e.g. the hex digits are repeated all over the place). I can certainly do this one in a separate commit.
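For what it's worth, a minimal sketch of the kind of shared helper suggested above (name hypothetical):

```ocaml
(* Value of a single hexadecimal digit character; this is the piece that
   char_for_hexadecimal_code and hex_num_value could both reuse. *)
let hex_digit_value c = match c with
  | '0' .. '9' -> Char.code c - Char.code '0'
  | 'a' .. 'f' -> Char.code c - Char.code 'a' + 10
  | 'A' .. 'F' -> Char.code c - Char.code 'A' + 10
  | _ -> invalid_arg "hex_digit_value"
```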
Updated the PR to take Alain's review (thanks!) into account.
@lpw25 raises good questions which I try to answer below.
That's a bit of a nitpick, but since the same printing functions are used to output on streams opened in binary mode, I wouldn't say that those assume sequences of US-ASCII bytes.
I agree with this. But I also argue that this PR does not define a new way of operating on such interpretations. It gives a new, more convenient (you can already handle what it does via I'm not advocating for adding new text (i.e. non-interpretation-2) operations on I do however advocate support for defining (e.g. via literals), encoding and decoding the interpretations 2, 4, and 8 of
I'm sure most OCaml programs at the moment do not make that kind of assumption. Most of the time when you use the OS interface you get bytes from the OS and at most you remove/replace/append a few US-ASCII bytes and treat the rest as opaque. If this is already too much for you, since this is all baked by
So let us assume such a Unicode text type is added to the stdlib and the language. We would get a new data type and a new syntax for literals of this type. The only way I see this PR interfering with such a prospect is consistency in the syntax adopted for escaping Unicode scalar values in the new literals and in the
To be honest I'm not sure I'm convinced by the addition of such a module at all (which is now implemented in this branch of Uutf). If it were only up to me, and we had algebraic effects, I think I would only provide the Uutf folding functions in
It can't be the first or second argument since that situation doesn't exist; if it did I would however say that the second (consistency between string and text lits) would be a reasonable justification in my opinion (see however [1] below). I wouldn't say it's the third either. First, it's not a foolproof way to create UTF-8 encoded strings since string literals remain fundamentally 2. Second, this doesn't say that UTF-8 is the primary representation for text in OCaml as this doesn't preclude the introduction of a text type and associated literal syntax in the future. However what it does clearly say is that interpretation 8 is prevalent in OCaml and by this PR definitively favoured by the language over the UTF-16 encodings (which are perfectly allowed to live in I can perfectly accept the argument that this PR is just convenience (write I hope the above clears things up a little bit and will allow the language designers to better assess the meaning and cost/benefit of this proposal. [1] I could mention
I don't mind introducing features to deal with the current state of affairs as long as it's possible to give a good explanation/narrative for those features after the state of affairs changes. At this point I'm fairly convinced that this proposal meets that standard: it improves things now and if/when something like a
I don't know enough about Unicode in the wild to have an opinion about this. Are there specific pros/cons for choosing UTF-8 over UTF-16, beyond the fact that more people use UTF-8 in the OCaml world?
Whilst obviously difficult, I think it is achievable. One option would be to take a similar approach to the Looking over the stdlib, the main things which need changing are in
From a storage size perspective UTF-8 is fine for most Latin scripts but a bit more wasteful for East Asian scripts; the exact converse is true for UTF-16. Now what follows may well be the point of view of a westerner so take it with a grain of salt. I still have the impression that UTF-8 took over a bit more of the world due to the US-ASCII and 8-bit compatibility story, especially at the Another random reference is the W3C (which may also be a westerner's point of view for that matter) that advises in its high-level documentation to use UTF-8 for documents served on the web. That same page indicates that UTF-8 + ASCII (which is UTF-8 in disguise) accounts for 80% of webpage encodings. Another page has some (unsourced) stats on the presence of UTF-16 on the web: less than 0.01%. This is not to say that UTF-16 is dead in the water, you will still find it tucked in some binary formats, I mentioned that when
A nice summary of UTF-8 advantages: https://research.swtch.com/utf8
It is, but note that the very first sentence of this webpage is wrong...
UTF-16 is used at least in the Windows API, and as the natural encoding for text in Java, .NET, JavaScript. Note that these languages all use their usual String type (which is really just a sequence of 16-bit integers, in the same way that OCaml strings are sequences of 8-bit integers) also to represent filenames. Under Unix at least, this means they must make an assumption about the encoding of filenames on the file system.
Would you duplicate output_string to output_string/output_text, or declare that output_string operates on text and people should use output_bytes to work with binary / explicitly encoded data? Same question for int_of_string / string_of_int? I think external "standard" libraries would be a good place to experiment with such a design. Did Core/Base do something like that?
Anyway, nobody seems to be opposed to the current proposal, in principle. Considering it impacts the language syntax, it would be good if another core developer could explicitly approve the PR before it is merged. The choice of syntax seems less consensual. @xavierleroy, @nojb?
Personally, the proposed syntax seems fine; my only point was that it is inconsistent with the existing escape syntax (decimal, hex, octal), but OTOH it has other advantages (as discussed above).
I support this proposal but at this point I'm not 100% convinced by the proposed syntax. Could we please check what other languages are using, just to make sure that we are not gratuitously different?
This is a bit outside my area of expertise and should be carefully considered by the people in charge of the evolution of the syntax of the language, as I'm not sure I fully grasp all the consequences...
It also firmly promotes the idea that string values, if they are to be interpreted as text, are UTF-8 encoded. But with #1200 in the pipeline this is really becoming a fact. The implementation is enabled by the recent addition of Buffer.add_utf_8_uchar in #1091.

The syntax of string literals is changed to add the \u{H} escape sequence, which replaces the escape by the UTF-8 encoding of the Unicode scalar value denoted by the hexadecimal number H made of one to six lower or uppercase hexadecimal digits.

The syntax chosen is the one usually used for Unicode character escapes with a variable number of hex digits. See wikipedia and for example the notation used in the standard for Unicode regular expression syntax or in rust. I think it's a much better notation than for example the C99 or go escape sequences, which have two fixed-width notations: either \uHHHH or \UHHHHHHHH.
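As an illustration of the variable-width form described above (examples mine, not part of the PR text):

```ocaml
(* One to six hex digits, lower or upper case, leading zeros optional;
   each escape is replaced by the UTF-8 encoding of the scalar value. *)
let newline = "\u{A}"       (* same bytes as "\n", 1 byte  *)
let e_acute = "\u{00E9}"    (* U+00E9,             2 bytes *)
let camel   = "\u{1F42B}"   (* U+1F42B,            4 bytes *)
```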
Backward compatibility
Unfortunately, due to OCaml's lax policy about escape sequences, this change can break programs that trigger warning 14 on compilation. More precisely, programs which have literals with "\u" as a substring (the backslash being unescaped) will now either fail, or the string literal will be silently compiled to another representation in case the literal has a subsequence that respects the \u{H} syntax.
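A hypothetical illustration of the kind of program affected (the literal is invented for the example):

```ocaml
(* Before this change, \u was an unknown escape: the literal below
   triggered warning 14 (illegal backslash) and the backslash was kept
   literally, so s contained the 11 bytes of "dir\\u{66}oo". With \u{H}
   accepted, it now silently compiles to the 6-byte string "dirfoo". *)
let s = "dir\u{66}oo"
```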