JEP 400: UTF-8 by Default

Authors	Alan Bateman, Naoto Sato
Owner	Naoto Sato
Type	Feature
Scope	SE
Status	Closed / Delivered
Release	18
Component	core-libs / java.nio.charsets
Discussion	core dash libs dash dev at openjdk dot java dot net
Effort	XS
Duration	XS
Reviewed by	Alex Buckley, Brian Goetz
Endorsed by	Brian Goetz
Created	2017/08/31 13:16
Updated	2025/04/14 07:28
Issue	8187041

Summary

Specify UTF-8 as the default charset of the standard Java APIs. With this change, APIs that depend upon the default charset will behave consistently across all implementations, operating systems, locales, and configurations.

Goals

Make Java programs more predictable and portable when their code relies on the default charset.
Clarify where the standard Java API uses the default charset.
Standardize on UTF-8 throughout the standard Java APIs, except for console I/O.

Non-Goals

It is not a goal to define new standard Java APIs or supported JDK APIs, although this effort may identify opportunities where new convenience methods might make existing APIs more approachable or easier to use.
There is no intent to deprecate or remove standard Java APIs that rely on the default charset rather than taking an explicit charset parameter.

Motivation

Standard Java APIs for reading and writing files and for processing text allow a charset to be passed as an argument. A charset governs the conversion between raw bytes and the 16-bit char values of the Java programming language. Supported charsets include, for example, US-ASCII, UTF-8, and ISO-8859-1.

If a charset argument is not passed, then standard Java APIs typically use the default charset. The JDK chooses the default charset at startup based upon the run-time environment: the operating system, the user's locale, and other factors.

Because the default charset is not the same everywhere, APIs that use the default charset pose many non-obvious hazards, even to experienced developers.

Consider an application that creates a java.io.FileWriter without passing a charset, and then uses it to write some text to a file. The resulting file will contain a sequence of bytes encoded using the default charset of the JDK running the application. A second application, run on a different machine or by a different user on the same machine, creates a java.io.FileReader without passing a charset and uses it to read the bytes in that file. The resulting text contains a sequence of characters decoded using the default charset of the JDK running the second application. If the default charset differs between the JDK of the first application and the JDK of the second application, then the resulting text may be silently corrupted or incomplete, since the FileReader cannot tell that it decoded the text using the wrong charset relative to the FileWriter. Here is an example of this hazard, where a Japanese text file encoded in UTF-8 on macOS is corrupted when read on Windows in US-English or Japanese locales:

java.io.FileReader(“hello.txt”) -> “こんにちは” (macOS)
java.io.FileReader(“hello.txt”) -> “ã?“ã‚“ã?«ã?¡ã? ” (Windows (en-US))
java.io.FileReader(“hello.txt”) -> “縺ォ縺。縺ッ” (Windows (ja-JP))
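The corruption shown here can be reproduced in memory, without files or a second machine. The following is a minimal sketch, not part of the proposal: it encodes the greeting as UTF-8 and decodes the bytes with a different charset. ISO-8859-1 stands in for a single-byte Windows code page because every Java platform is required to support it; the class and method names are illustrative, and the string is written with Unicode escapes so that the example does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    /** Encodes text with one charset and decodes the resulting bytes with another. */
    static String transcode(String text, Charset writerCharset, Charset readerCharset) {
        byte[] bytes = text.getBytes(writerCharset);   // what a writer would store
        return new String(bytes, readerCharset);       // what a reader would see
    }

    public static void main(String[] args) {
        String greeting = "\u3053\u3093\u306b\u3061\u306f";  // the Japanese greeting from the example
        // Same charset on both sides: the text survives.
        System.out.println(transcode(greeting, StandardCharsets.UTF_8, StandardCharsets.UTF_8));
        // Mismatched charsets: each UTF-8 byte is decoded as a separate, wrong character.
        System.out.println(transcode(greeting, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

No exception is thrown in the mismatched case, which is why this kind of corruption tends to go unnoticed until someone reads the text.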

Developers familiar with such hazards can use methods and constructors that take a charset argument explicitly. However, having to pass an argument prevents methods and constructors from being used via method references (::) in stream pipelines.
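As a small illustration of that trade-off, the sketch below encodes a list of strings twice: once with a method reference to the no-argument String.getBytes(), which uses the default charset, and once with a lambda that passes UTF-8 explicitly. The class and method names are invented for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

final class MethodReferenceDemo {
    /** Concise, but the resulting bytes depend on the default charset. */
    static List<byte[]> encodeWithDefault(List<String> lines) {
        return lines.stream()
                    .map(String::getBytes)
                    .collect(Collectors.toList());
    }

    /** Predictable everywhere, but the charset argument forces a lambda. */
    static List<byte[]> encodeWithUtf8(List<String> lines) {
        return lines.stream()
                    .map(s -> s.getBytes(StandardCharsets.UTF_8))
                    .collect(Collectors.toList());
    }
}
```

When the default charset is UTF-8, the two methods produce the same bytes, so the shorter form is no longer a portability hazard.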

Developers sometimes attempt to configure the default charset by setting the system property file.encoding on the command line (i.e., java -Dfile.encoding=...), but this has never been supported. Furthermore, attempting to set the property programmatically (i.e., System.setProperty(...)) after the Java runtime has started does not work.

Not all standard Java APIs defer to the JDK's choice of default charset. For example, the methods in java.nio.file.Files that read or write files without a Charset argument are specified to always use UTF-8. The fact that newer APIs default to using UTF-8 while older APIs default to using the default charset is a hazard for applications that use a mix of APIs.
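The behavior of the java.nio.file.Files methods can be observed with a short sketch. It writes text with Files.writeString (available since Java 11) without a charset argument and returns the raw bytes of the file, which should be the UTF-8 encoding whatever the default charset is. The class name, method name, and temporary-file prefix are illustrative.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class FilesUtf8Demo {
    /** Writes text via Files.writeString with no charset argument and returns the file's bytes. */
    static byte[] writeAndReadBytes(String text) {
        try {
            Path tmp = Files.createTempFile("jep400-", ".txt");
            try {
                Files.writeString(tmp, text);     // specified to encode as UTF-8
                return Files.readAllBytes(tmp);   // raw bytes, no decoding involved
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```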

The entire Java ecosystem would benefit if the default charset were specified to be the same everywhere. Applications that are not concerned with portability will see little impact, while applications that embrace portability by passing charset arguments will see no impact. UTF-8 has long been the most common charset on the World Wide Web. UTF-8 is standard for the XML and JSON files processed by vast numbers of Java programs, and Java's own APIs increasingly favor UTF-8 in, e.g., the NIO API and for property files. It therefore makes sense to specify UTF-8 as the default charset for all Java APIs.

We recognize that this change could have a widespread compatibility impact on programs that migrate to JDK 18. For this reason, it will always be possible to recover the pre-JDK 18 behavior, where the default charset is environment-dependent.

Description

In JDK 17 and earlier, the default charset is determined when the Java runtime starts. On macOS, it is UTF-8 except in the POSIX C locale. On other operating systems, it depends upon the user's locale and the default encoding, e.g., on Windows, it is a codepage-based charset such as windows-1252 or windows-31j. The method java.nio.charset.Charset.defaultCharset() returns the default charset. A quick way to see the default charset of the current JDK is with the following command:

java -XshowSettings:properties -version 2>&1 | grep file.encoding
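The same information is available programmatically. This minimal sketch reports the result of Charset.defaultCharset() alongside the run-time value of the file.encoding property; the class and method names are illustrative.

```java
import java.nio.charset.Charset;

final class ShowDefaultCharset {
    /** Returns a one-line summary of the default charset and the file.encoding property. */
    static String describe() {
        return "default charset = " + Charset.defaultCharset().name()
             + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```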

Several standard Java APIs use the default charset, including:

In the java.io package, InputStreamReader, FileReader, OutputStreamWriter, FileWriter, and PrintStream define constructors to create readers, writers, and print streams that encode or decode using the default charset.
In the java.util package, Formatter and Scanner define constructors whose results use the default charset.
In the java.net package, URLEncoder and URLDecoder define deprecated methods that use the default charset.
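Each of the java.io classes listed here also offers a constructor that takes an explicit charset. As a sketch of that alternative, the helper below round-trips text through OutputStreamWriter and InputStreamReader with the charset passed in, so the result does not depend on the default. The class and method names are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

final class ExplicitCharsetStreams {
    /** Encodes text through an OutputStreamWriter that uses the given charset. */
    static byte[] encode(String text, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();   // the writer has been closed, so all bytes are flushed
    }

    /** Decodes bytes through an InputStreamReader that uses the given charset. */
    static String decode(byte[] bytes, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), cs)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```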

We propose to change the specification of Charset.defaultCharset() to say that the default charset is UTF-8 unless configured otherwise by an implementation-specific means. (See below for how to configure the JDK.) The UTF-8 charset is specified by RFC 2279; the transformation format upon which it is based is specified in Amendment 2 of ISO 10646-1 and is also described in the Unicode Standard. It is not to be confused with Modified UTF-8.

We will update the specifications of all standard Java APIs that use the default charset to cross-reference Charset.defaultCharset(). Those APIs include the ones listed above, but not System.out and System.err, whose charset will be as specified by Console.charset().

The `file.encoding` and `native.encoding` system properties

As envisaged by the specification of Charset.defaultCharset(), the JDK will allow the default charset to be configured to something other than UTF-8. We will revise the treatment of the system property file.encoding so that setting it on the command line is the supported means of configuring the default charset. We will specify this in an implementation note of System.getProperties() as follows:

If file.encoding is set to "COMPAT" (i.e., java -Dfile.encoding=COMPAT), then the default charset will be the charset chosen by the algorithm in JDK 17 and earlier, based on the user's operating system, locale, and other factors. The value of file.encoding will be set to the name of that charset.
If file.encoding is set to "UTF-8" (i.e., java -Dfile.encoding=UTF-8), then the default charset will be UTF-8. This no-op value is defined in order to preserve the behavior of existing command lines.
The treatment of values other than "COMPAT" and "UTF-8" is not specified. They are not supported, but if such a value worked in JDK 17 then it will likely continue to work in JDK 18.

Prior to deploying on a JDK where UTF-8 is the default charset, developers are strongly encouraged to check for charset issues by starting the Java runtime with java -Dfile.encoding=UTF-8 ... on their current JDK (8-17).

JDK 17 introduced the native.encoding system property as a standard way for programs to obtain the charset chosen by the JDK's algorithm, regardless of whether the default charset is actually configured to be that charset. In JDK 18, if file.encoding is set to COMPAT on the command line, then the run-time value of file.encoding will be the same as the run-time value of native.encoding; if file.encoding is set to UTF-8 on the command line, then the run-time value of file.encoding may differ from the run-time value of native.encoding.
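A program can use this relationship to tell whether its default charset currently matches the environment-derived one. The sketch below is one possible check, not an API of the JDK: the class and method names are hypothetical, and a missing or unrecognized native.encoding value is simply treated as "no match".

```java
import java.nio.charset.Charset;

final class NativeEncodingCheck {
    /**
     * Returns true if the given native.encoding value names the given default charset.
     * A null, illegal, or unsupported name yields false.
     */
    static boolean defaultIsNative(String nativeEncoding, Charset defaultCharset) {
        if (nativeEncoding == null) {
            return false;   // property absent, e.g. on releases before JDK 17
        }
        try {
            return Charset.forName(nativeEncoding).equals(defaultCharset);
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and UnsupportedCharsetException.
            return false;
        }
    }

    public static void main(String[] args) {
        String nativeEncoding = System.getProperty("native.encoding");
        System.out.println(defaultIsNative(nativeEncoding, Charset.defaultCharset()));
    }
}
```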

In Risks and Assumptions below, we discuss how to mitigate the possible incompatibilities that arise from this change to file.encoding, as well as the native.encoding system property and recommendations for applications.

There are three charset-related system properties used internally by the JDK. They remain unspecified and unsupported, but are documented here for completeness:

sun.stdout.encoding and sun.stderr.encoding: the names of the charsets used for the standard output stream (System.out) and standard error stream (System.err), and in the java.io.Console API.
sun.jnu.encoding: the name of the charset used by the implementation of java.nio.file when encoding or decoding filename paths, as opposed to file contents. On macOS its value is "UTF-8"; on other platforms it is typically the default charset.

Source file encoding

The Java language allows source code to express Unicode characters in a UTF-16 encoding, and this is unaffected by the choice of UTF-8 for the default charset. However, the javac compiler is affected because it assumes that .java source files are encoded with the default charset, unless configured otherwise by the -encoding option. If source files were saved with a non-UTF-8 encoding and compiled with an earlier JDK, then recompiling on JDK 18 or later may cause problems. For example, if a non-UTF-8 source file has string literals that contain non-ASCII characters, then those literals may be misinterpreted by javac in JDK 18 or later unless -encoding is used.

Prior to compiling on a JDK where UTF-8 is the default charset, developers are strongly encouraged to check for charset issues by compiling with javac -encoding UTF-8 ... on their current JDK (8-17). Alternatively, developers who prefer to save source files with a non-UTF-8 encoding can prevent javac from assuming UTF-8 by setting the -encoding option to the value of the native.encoding system property on JDK 17 and later.
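Compiling with -encoding UTF-8 is one way to surface problems; another is to scan the source tree for files that are not valid UTF-8 before migrating. The following sketch shows the core of such a check using a strict CharsetDecoder. It is an illustration rather than a javac feature, and the class and method names are hypothetical. Note that pure ASCII content always passes, and that a legacy-encoded file could in rare cases happen to be valid UTF-8, so a passing result is a strong hint rather than a proof.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Validator {
    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)        // fail instead of substituting U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A caller would typically pass the result of Files.readAllBytes for each .java file and report the paths for which the method returns false.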

The legacy `default` charset

In JDK 17 and earlier, the name default is recognized as an alias for the US-ASCII charset. That is, Charset.forName("default") produces the same result as Charset.forName("US-ASCII"). The default alias was introduced in JDK 1.5 to ensure that legacy code which used sun.io converters could migrate to the java.nio.charset framework introduced in JDK 1.4.

It would be extremely confusing for JDK 18 to preserve default as an alias for US-ASCII when the default charset is specified to be UTF-8. It would also be confusing for default to mean US-ASCII when the user configures the default charset to its pre-JDK 18 value by setting -Dfile.encoding=COMPAT on the command line. Redefining default to be an alias not for US-ASCII but rather for the default charset (whether UTF-8 or user-configured) would cause subtle behavioral changes in the (few) programs that call Charset.forName("default").

We believe that continuing to recognize default in JDK 18 would be prolonging a poor decision. It is not defined by the Java SE Platform, nor is it recognized by IANA as the name or alias of any character set. In fact, for ASCII-based network protocols, IANA encourages use of the canonical name US-ASCII rather than just ASCII or obscure aliases such as ANSI_X3.4-1968. Plainly, use of the JDK-specific alias default runs counter to that advice. Java programs can use the constant StandardCharsets.US_ASCII to make their intent clear, rather than passing a string to Charset.forName(...).

Accordingly, in JDK 18, Charset.forName("default") will throw an UnsupportedCharsetException. This will give developers a chance to detect use of the idiom and migrate either to US-ASCII or to the result of Charset.defaultCharset().
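For code that receives charset names from configuration files and may still encounter the legacy alias, one migration option is a small lookup wrapper that maps the name to US-ASCII explicitly, preserving the pre-JDK 18 meaning on every release. This is a sketch under that assumption; the class and method names are hypothetical, and an application that actually wants the default charset would return Charset.defaultCharset() instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class LegacyCharsetNames {
    /** Resolves a charset name, mapping the legacy "default" alias to US-ASCII explicitly. */
    static Charset resolve(String name) {
        // Charset names are case-insensitive, so the alias is compared the same way.
        if ("default".equalsIgnoreCase(name)) {
            return StandardCharsets.US_ASCII;   // what the alias meant in JDK 17 and earlier
        }
        return Charset.forName(name);           // may throw for illegal or unsupported names
    }
}
```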

Testing

Significant testing is required to understand the extent of the compatibility impact of this change. Testing by developers or organizations with geographically diverse user populations will be needed.
Developers can check for issues with an existing JDK release by running with -Dfile.encoding=UTF-8 in advance of any early-access or GA release with this change.

Risks and Assumptions

We assume that applications in many environments will see no impact from Java's choice of UTF-8:

On macOS, the default charset has been UTF-8 for several releases, except when configured to use the POSIX C locale.
In many Linux distributions, though not all, the default charset is UTF-8, so no change will be discernible in those environments.
Many server applications are already started with -Dfile.encoding=UTF-8, so they will not experience any change.

In other environments, the risk of changing the default charset to UTF-8 after more than 20 years may be significant. The most obvious risk is that applications which implicitly depend on the default charset (e.g., by not passing an explicit charset argument to APIs) will behave incorrectly when processing data produced when the default charset was unspecified. A further risk is that data corruption may silently occur. We expect the main impact will be to users of Windows in Asian locales, and possibly some server environments in Asian and other locales. Possible scenarios include:

If an application that has been running for years with windows-31j as the default charset is upgraded to a JDK release that uses UTF-8 as the default charset, then it will experience problems when reading files that are encoded in windows-31j. In this case, the application code could be changed to pass the windows-31j charset when opening such files. If the code cannot be changed, then starting the Java runtime with -Dfile.encoding=COMPAT will force the default charset to be windows-31j until the application is updated or the files are converted to UTF-8.
In environments where several JDK versions are in use, users might not be able to exchange file data. If, e.g., one user uses an older JDK release where windows-31j is the default and another uses a newer JDK where UTF-8 is the default, then text files created by the first user might not be readable by the second. In this case the user on the older JDK release could specify -Dfile.encoding=UTF-8 when starting applications, or the user on the newer release could specify -Dfile.encoding=COMPAT.
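Where converting existing files to UTF-8 is the chosen remedy, the conversion itself is short. The sketch below is one possible approach, with hypothetical class and method names: it decodes the bytes with the legacy charset named by the caller and re-encodes them as UTF-8. It assumes the caller knows the legacy charset; nothing here detects it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class LegacyFileConverter {
    /** Decodes bytes with the legacy charset and re-encodes the text as UTF-8. */
    static byte[] toUtf8(byte[] legacyBytes, Charset legacyCharset) {
        // Note: new String(...) silently substitutes a replacement character for malformed input.
        return new String(legacyBytes, legacyCharset).getBytes(StandardCharsets.UTF_8);
    }

    /** Reads source in the legacy charset and writes target in UTF-8. */
    static void convertFile(Path source, Path target, Charset legacyCharset) {
        try {
            Files.write(target, toUtf8(Files.readAllBytes(source), legacyCharset));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the scenario described here, the legacy charset would be obtained with Charset.forName("windows-31j"), which is available on JDK builds that include the extended charsets.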

Where application code can be changed, we recommend changing it to pass a charset argument to constructors. If an application has no particular preference among charsets, and is satisfied with the traditional environment-driven selection for the default charset, then the following code can be used on all Java releases to obtain the charset determined from the environment:

import java.io.FileReader;
import java.nio.charset.Charset;

String encoding = System.getProperty("native.encoding");  // Populated on Java 17 and later
Charset cs = (encoding != null) ? Charset.forName(encoding) : Charset.defaultCharset();
var reader = new FileReader("file.txt", cs);  // This constructor requires Java 11 or later

If neither application code nor Java startup can be changed, then it will be necessary to inspect the application code to determine manually whether it will run compatibly on JDK 18.

Alternatives

Preserve the status quo — This does not eliminate the hazards described above.
Deprecate all methods in the Java API that use the default charset — This would encourage developers to use constructors and methods that take a charset parameter, but the resulting code would be more verbose.
Specify UTF-8 as the default charset without providing any means to change it — The compatibility impact of this change would be too high.
