NotificationsYou must be signed in to change notification settings
Fork1.1k
Star5.7k

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

One more Windows Unicode PR: do not use %S#1398

Merged

gasche merged 2 commits intoocaml:trunkfromnojb:do_not_use_%S

Oct 8, 2017

Merged

One more Windows Unicode PR: do not use %S#1398

gasche merged 2 commits intoocaml:trunkfromnojb:do_not_use_%S

Oct 8, 2017

Conversation

Copy link

Contributor

nojb commentedOct 4, 2017

This PR addresses the following small issue left over from#1200. In that PR there are a handful of uses of theprintf format specifier%S to print wide strings (wchar_t *), mostly in calls tocaml_gc_message to print filename arguments (see e.g.here andhere).

The specifier%S is bothPOSIX and conforms to theUnix specification. Unfortunately it is effectively broken under Windows (in spite of"supporting" it). The details are a little long but see thispost for some good pointers.

Hence we would like to get rid of the%S specifiers. The only issue is that if we convert the wide strings into usual strings with our existing functions then we must not forget tofree them, etc. Also it would be nicer not to allocate when printing GC debug messages.

This PR defines a functionwin_utf8_string_of_utf16 inwin32.c which converts as needed but uses a fixed-length statically-allocated buffer to store the resulting UTF-8 string. This is not an issue because in each use of the function the result of doing the conversion does not need to be stored anywhere (and possible truncation is OK).

We also use this function for some arguments tocaml_fatal_error_arg which were being "leaked" (just before the program ends).

Together with this change, we also add a call toSetConsoleOutputCP in the differentcaml_{main,startup*} functions so that, under Windows with Unicode mode enabled, the console code page is set to UTF-8, saving the user from having to do it. This helps to print things correctly on the Windows Console (e.g. if you run your programs from the Command Prompt).

Lastly, and for the record, as far as I know, the most "native" solution to these printing issues would be to use the Windows callWriteConsole (like we do inheadernt.c) to print wide strings, which would not require us to do any conversion. However, it is not straightforward to integrate this function in ourprintf-based functionscaml_gc_message,caml_fatal_error_arg, andcaml_fatal_error_arg2.

@dra27

nojb force-pushed thedo_not_use_%S branch from9006c4e todab0a91Compare

October 4, 2017 22:20

gasche reviewed

Oct 5, 2017

View reviewed changes

Copy link

Member

gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

If the function may truncate its input (and has a special memory-ownship behavior), it should be clear in the name. Currently the name suggests a normal conversion function, that (1) creates a new caller-owned value and (2) is injective.

I'm not fully convinced that this solution is an improvement over the direct style of having a conversion call before the print and a free call after the print. This is a burden at the callsite, but it is a normal kind of burden for C programmers, whereas you alleviate the issue by introducing a convenient but unusual way to shoot oneself in the foot (if someday the usage is extended to a case where truncation is problematic; and truncation has proved problematic in other places of the runtime before, eg. executable_name).

One aspect of the explicit-free solution that is unsatisfying is that error functions (caml_fatal_error_arg) never return and would thus never freed the copied strings (which is inconvenient if you look for leaks using valgrind, etc.). But sincebf70eb9 we explicitly trackcaml_stat_alloc memory, so using a static allocation should not be an issue there.

byterun/win32.c Outdated


		CAMLexport char * win_utf8_string_of_utf16(const wchar_t * s)
		{
		static char buf[1024];

Copy link

Member

gascheOct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Is this use of global state correct in a multicore runtime? (Would the static buffer be copied by each runtime instance, which is fine, or is there a risk of race to a single buffer?)

Copy link

Member

dra27Oct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This was an alarm bell for me - I think@stedolan will confirm that we should be avoiding adding static buffers unless absolutely necessary.

Copy link

Contributor

stedolanOct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Well spotted, and yes, this would be bad in multicore. (I have some hacky scripts to detect such buffers, but they only work on ELF binaries. I'm not sure how to prevent them sneaking into windows-only codepaths, other than by vigilance)

byterun/caml/osdeps.h Outdated

		The returned string points to a fixed-length, statically-allocated buffer so
		make sure to copy it if it needs to be preserved. Also, due to the fixed
		length of the buffer, the result may be truncated. */
		extern char * win_utf8_string_of_utf16(const wchar_t *s);

Copy link

Member

gascheOct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The function name in the documentation comment does not match the exposed name.

Copy link

Member

dra27Oct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The function no longer matches the semantics of the comment, either!

Copy link

ContributorAuthor

nojb commentedOct 5, 2017

Thanks for the review@gasche! I agree that my approach may be more complex/error-prone than needed. I will amend by explicitly allocating and freeing the converted string at thecaml_gc_message callsites (and do not worry about freeing in thecaml_fatal_error_arg calls).

Copy link

Member

dra27 commentedOct 5, 2017

Just about to review this too -@gasche, please could we hold off beta2 until a decision is taken on this one. The code page setting part of this GPR I would very much like to be in (I'm preparing a larger post on the Unicode alterations for Windows)

Copy link

Member

gasche commentedOct 5, 2017

please could we hold off beta2 until a decision is taken on this one

sure

(I have to discuss it with Damien first but I'm thinking of trying a "beta early, beta often" approach this time, which means that I could not wait and just do another beta when you ask for one.)

Copy link

Contributor

alainfrisch commentedOct 5, 2017

What's the current behavior and how bad it is? If it's only about displaying garbage for non-ASCII characters in filenames in case of runtime debug messages or fatal errors, this does not sound a critical bug, and it might be better to avoid any risk of regression or rushed solution for 4.06.

dra27 reviewed

Oct 5, 2017

View reviewed changes

Copy link

Member

dra27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The previous code you showed me had some changes in the debugger - have you intentionally reverted those?

asmrun/startup.c Outdated

		@@ -121,6 +125,10 @@ value caml_startup_common(char_os **argv, int pooling)
		if (!caml_startup_aux(pooling))
		return Val_unit;

		#if defined(_WIN32) && WINDOWS_UNICODE
		SetConsoleOutputCP(CP_UTF8);

Copy link

Member

dra27Oct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It would be worth a comment noting that this affects all output functions (i.e. both stdout and stderr), I think.

Two possible ideas:

Selecting CP_UTF8 only works if a TrueType font is selected (legacy Raster Fonts won't work, which is still the default case pre Windows 8, IIRC). You can detect this using this function

intconsole_supports_unicode (void) {CONSOLE_FONT_INFOEXfontInfo;fontInfo.cbSize=sizeof(fontInfo);if (GetCurrentConsoleFontEx(GetStdHandle(STD_OUTPUT_HANDLE), FALSE,&fontInfo)&& !(fontInfo.FontFamily&TMPF_TRUETYPE)) {return (GetCurrentConsoleFontEx(GetStdHandle(STD_ERROR_HANDLE), FALSE,&fontInfo)&& (fontInfo.FontFamily&TMPF_TRUETYPE));  }else {return1;  }}

Having done that, it might be worth the addition of a function inSys which exposes this (a bit likeUnix.has_symlink)

byterun/caml/misc.h Outdated

		@@ -230,6 +231,7 @@ typedef char char_os;
		#define caml_stat_strdup_to_os caml_stat_strdup
		#define caml_stat_strdup_of_os caml_stat_strdup
		#define caml_copy_string_of_os caml_copy_string
		#define caml_utf8_string_of_os(s) s

Copy link

Member

dra27Oct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I think convention says thes should be(s) to ensure that you don't change the meaning of an expression, although I think you'd have to be doing something seriously contrived to manage that with a string pointer.

byterun/caml/osdeps.h Outdated

		The returned string points to a fixed-length, statically-allocated buffer so
		make sure to copy it if it needs to be preserved. Also, due to the fixed
		length of the buffer, the result may be truncated. */
		extern char * win_utf8_string_of_utf16(const wchar_t *s);

Copy link

Member

dra27Oct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The function no longer matches the semantics of the comment, either!

byterun/dynlink.c Outdated

		@@ -130,14 +130,13 @@ static void open_shared_lib(char_os * name)
		void * handle;

		realname = caml_search_dll_in_path(&caml_shared_libs_path, name);
		caml_gc_message(0x100, "Loading shared library %"
		ARCH_CHARNATSTR_PRINTF_FORMAT "\n", realname);
		caml_gc_message(0x100, "Loading shared library %s\n", caml_utf8_string_of_os(realname));

Copy link

Member

dra27Oct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I agree with@gasche's concern about using a static buffer. You cover nearly all cases of the strdup with either caml_fatal_arg (which will automatically clean up caml_stat_strdup anyway) or caml_gc_message. You could introduce caml_gc_message_free which strictly takes a string parameter and frees it after displaying the message.

byterun/startup.c Outdated

		@@ -326,6 +326,10 @@ CAMLexport void caml_main(char_os **argv)
		if (!caml_startup_aux(/* pooling */ caml_cleanup_on_exit))
		return;

		#if defined(_WIN32) && WINDOWS_UNICODE
		SetConsoleOutputCP(CP_UTF8);

Copy link

Member

dra27Oct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

See similar comment from asmrun/startup.c

byterun/startup.c Outdated

		@@ -452,6 +456,10 @@ CAMLexport value caml_startup_code_exn(
		if (!caml_startup_aux(pooling))
		return Val_unit;

		#if defined(_WIN32) && WINDOWS_UNICODE
		SetConsoleOutputCP(CP_UTF8);

Copy link

Member

dra27Oct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

And again

byterun/sys.c Outdated

		@@ -647,7 +647,7 @@ void caml_load_plugin(char_os *plugin)
		}
		} else {
		fprintf(stderr, "Cannot load C plugin %s\nReason: %s\n",
		caml_stat_strdup_of_os(plugin), caml_dlerror());
		caml_utf8_string_of_os(plugin), caml_dlerror());

Copy link

Member

dra27Oct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This would require a manual free if caml_utf8_string_of_os starts allocating.

Copy link

ContributorAuthor

nojb commentedOct 5, 2017

The current behavior is that it will either display garbage or even nothing at all (depending on Console code page, etc) for non-ASCII characters in those places where%S is used (runtime debug messages). ASCII characters should come out fine.

Copy link

Member

dra27 commentedOct 5, 2017

TheSetConsoleOutputCP part is about displaying anything at all - without it, you get raw UTF-8 which from a PR (by which I mean "Public Relations") perspective makes it look like we're not properly supporting Unicode. The code to enable it is correctly guarded by theWINDOWS_UNICODE, so legacy mode can continue "unaffected". I can imagine we may end up getting various reports along the lines of "OCaml isn't displaying foreign characters correctly" which get the response "Please run chcp 65001" otherwise!

Copy link

Member

dra27 commentedOct 5, 2017

Note that runningSetConsoleOutputCP affects your process only - the console returns to whichever codepage the user had previously selected when your process terminates.

Copy link

Contributor

alainfrisch commentedOct 5, 2017

Note that running SetConsoleOutputCP affects your process only

What about the scenario where the OCaml runtime is embedded in a host application? Could this negatively affect outputs of this application?

Copy link

Member

dra27 commentedOct 5, 2017

Compiled in legacy mode, no difference at all; compiled with WINDOWS_UNICODE=1, the Console would start correctly displaying UTF-8 sequences as the characters they represent - such an application would definitely have been displaying garbage before. It definitely only affects the console, not any file translations.

Copy link

Member

dra27 commentedOct 5, 2017

(it also doesn't affect input processing, which is SetConsoleCP)

Copy link

ContributorAuthor

nojb commentedOct 5, 2017

I pushed one commit changing to explicitly allocation/deallocation around the fewcaml_gc_message calls. Altogether it is much simpler and we do not even have to introduce a new function.

@dra27:

I will add some comments to theSetConsoleOutputCP calls.
What about input? Should we also callSetConsoleCP?
What is the point of checking whether a Unicode font is available? To print a message to the user? Or simply to expose this to the OCaml programmer?

gasche reviewed

Oct 5, 2017

View reviewed changes

byterun/dynlink.c

		void * handle;

		realname = caml_search_dll_in_path(&caml_shared_libs_path, name);
		caml_gc_message(0x100, "Loading shared library %"
		ARCH_CHARNATSTR_PRINTF_FORMAT "\n", realname);

Copy link

Member

gascheOct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Unrelated to the PR, but am I correct thatARCH_CHARNATSTR_PRINTF_FORMAT is a new 4.06 thing that escaped the bigos rename? (Will this PR end up removing it?)

Copy link

ContributorAuthor

nojbOct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Yes, and yes.

Copy link

Member

gasche commentedOct 5, 2017

I have the feeling that we are converging on the matter of how printf stuff should handle the string conversion, but that the ConsoleCP stuff requires another level of expertise and is somewhat orthogonal (although in the end we need to have both for proper Unicode printing, if I understand correctly).@nojb, you may consider splitting the PR in two parts -- but feel free to keep only one if that's easier for you.

dra27 reviewed

Oct 5, 2017

View reviewed changes

byterun/win32.c Outdated

		CAMLexport int win_wide_char_to_multi_byte(const wchar_t s, int slen, char out, int outlen)
		{
		int retcode = win_wide_char_to_multi_byte_noexc(s, slen, out, outlen);

		if (retcode == 0)

Copy link

Member

dra27Oct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This code is almost certainly now eliminated, but this should beif (retcode == 0 && slen != 0)

Copy link

ContributorAuthor

nojbOct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Indeed, this was the cause for a previous AppVeyor failure, but is now gone.

Copy link

Member

dra27Oct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

(that was causing your previous AppVeyor failure, since the building of dynlink includes-ccopt "" so you were hitting the case of an empty string being passed towin_wide_char_to_multi_byte)

Copy link

Member

dra27 commentedOct 5, 2017

@nojb - cosmetically, it would be that we only tell the console to switch to UTF-8 output if it can actually have a chance of displaying the symbols. It would certainly be useful to display that to the user. Cosmetically, setting CP_UTF8 when a truetype font is not selected for the console should cause no change in the output (you just get garbage UTF-8). It certainly doesn'thave to be done in this PR at all, therefore.

@gasche - SetConsoleCP only affects input, we get proper Unicode output on the console without doing that.

Copy link

Member

dra27 commentedOct 5, 2017

I don't~~fully~~ understand exactly how these things work with non-Latin keyboards but, for example, if on my Windows box I add Russian keyboard layout, switch the code page to 1250 and have a simple C program which callsgets and displays the output and press the letters ASD on my QWERTY keyboard, then the C program gets three ??? characters. If I alter the program to run SetConsoleCP(CP_UTF8) and repeat the experiment, thengets returns an empty string.

That experiment is enough to suggest that we shouldn't change the way input is being handled - SetConsoleOutputCP is only ever going to affect what is displayed to the user, it's not going (potentially) to change the semantic behaviour of programs.

Copy link

ContributorAuthor

nojb commentedOct 5, 2017

@alainfrisch Re embedded runtime situation, my understanding is as follows:

The call toSetConsoleOutputCP fixes the way the bytes output by your process (be it OCaml code or some other code) get interpreted by the Console, which is natively Unicode. So if you are emitting bytes (say, by using C'sprintf), you have to make sure these bytes have the right encoding. But if you are using C# or other .NET things which work with Unicode strings to write to the Console then no translation is actually necessary and things should just work.
The call toSetConsoleOutputCP will not have any effect whatsoever if standard output/error is redirected (in which they are not really going to the Console but to a file, another process, etc...)

@dra27 do you agree?

Copy link

Member

dra27 commentedOct 5, 2017

@nojb - yes, that's my understanding of it.

Copy link

ContributorAuthor

nojb commentedOct 5, 2017

@dra27 I added a comment to theSetConsoleOutputCP calls. I would prefer to leave the other bits (figuring out if the font supports Unicode and possibly exposing this inSys) for later.

I think we addressed the other points as well, so in principle I consider this PR ready to be merged.

Copy link

Member

gasche commentedOct 5, 2017

Would you rebase the PR to remove some back-and-forth, but preferably keep the console-setting and %S-business in separate commits?

@dra27: do you approve the current state? Do you think we should cherry-pick in 4.06?

Do not use %S

d0f68c9

nojb force-pushed thedo_not_use_%S branch from30a978f to0e8afd5Compare

October 5, 2017 12:26

Copy link

ContributorAuthor

nojb commentedOct 5, 2017

Rebased.

Copy link

Member

gasche commentedOct 5, 2017

I approve the first and third commits :-)

Copy link

Member

dra27 commentedOct 5, 2017

I approve the second commit :-) Yes, I think we should cherry-pick this to 4.06 - to a certain extent, it's all part of the "big picture" of Windows Unicode support.

gasche approved these changes

Oct 5, 2017

View reviewed changes

dra27 added the suspended label

Oct 5, 2017

Copy link

Member

dra27 commentedOct 5, 2017

Still investigating some issues (especially aroundMPR#6925) with regard to theSetConsoleOutputCP call.

Copy link

Member

gasche commentedOct 5, 2017•
edited
Loading

@nojb I would be able to merge the %S change in 4.06 quickly if you send a separate PR (I'll let@dra27 deal with the console business). We need it at some point anyway: that the ARCHCHARNAT must not be released.

gasche added this to the4.06.0 milestone

Oct 5, 2017

Copy link

ContributorAuthor

nojb commentedOct 5, 2017

OK, for simplicity I will simply remove commit 2 from this PR, and we can open a new PR for it as needed.

Add Changes entry

f9290fc

nojb force-pushed thedo_not_use_%S branch from0e8afd5 tof9290fcCompare

October 5, 2017 14:59

nojb changed the title~~One more Windows Unicode PR: do not use %S, set Console code page to UTF-8~~One more Windows Unicode PR: do not use %S

Oct 5, 2017

Copy link

ContributorAuthor

nojb commentedOct 5, 2017

OK, done (and renamed PR for accuracy).

dra27 removed the suspended label

Oct 5, 2017

dra27 mentioned this pull request

Oct 6, 2017

Windows Console Unicode Support#1408

Open

gasche merged commit6aae29d intoocaml:trunk

Oct 8, 2017

gasche added a commit that referenced this pull request

Oct 8, 2017

Merge pull request#1398from nojb/do_not_use_%S

d8e0921

One more Windows Unicode PR: do not use %S(cherry picked from commit6aae29d)

Copy link

Member

gasche commentedOct 8, 2017

Merged in trunk and cherry-picked in 4.06 (d8e0921 ). (There was really no other choice than cherry-picking as this commit removes an API that we don't want to release.)

Labels

None yet

5 participants

Movatterモバイル変換

One more Windows Unicode PR: do not use %S#1398

One more Windows Unicode PR: do not use %S#1398

Conversation

nojb commentedOct 4, 2017

gasche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nojb commentedOct 5, 2017

dra27 commentedOct 5, 2017

gasche commentedOct 5, 2017

alainfrisch commentedOct 5, 2017

dra27 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nojb commentedOct 5, 2017

dra27 commentedOct 5, 2017

dra27 commentedOct 5, 2017

alainfrisch commentedOct 5, 2017

dra27 commentedOct 5, 2017

dra27 commentedOct 5, 2017

nojb commentedOct 5, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gasche commentedOct 5, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dra27 commentedOct 5, 2017

dra27 commentedOct 5, 2017

nojb commentedOct 5, 2017

dra27 commentedOct 5, 2017

nojb commentedOct 5, 2017

gasche commentedOct 5, 2017

nojb commentedOct 5, 2017

gasche commentedOct 5, 2017

dra27 commentedOct 5, 2017

dra27 commentedOct 5, 2017

gasche commentedOct 5, 2017• editedLoading

nojb commentedOct 5, 2017

nojb commentedOct 5, 2017

gasche commentedOct 8, 2017

gasche commentedOct 5, 2017•
edited
Loading