JEP 400: UTF-8 by Default

Authors: Alan Bateman, Naoto Sato
Owner: Naoto Sato
Type: Feature
Scope: SE
Status: Closed / Delivered
Release: 18
Component: core-libs / java.nio.charsets
Discussion: core dash libs dash dev at openjdk dot java dot net
Effort: XS
Duration: XS
Reviewed by: Alex Buckley, Brian Goetz
Endorsed by: Brian Goetz
Created: 2017/08/31 13:16
Updated: 2025/04/14 07:28
Issue: 8187041

Summary

Specify UTF-8 as the default charset of the standard Java APIs. With this change, APIs that depend upon the default charset will behave consistently across all implementations, operating systems, locales, and configurations.

Goals

Non-Goals

Motivation

Standard Java APIs for reading and writing files and for processing text allow a charset to be passed as an argument. A charset governs the conversion between raw bytes and the 16-bit char values of the Java programming language. Supported charsets include, for example, US-ASCII, UTF-8, and ISO-8859-1.

If a charset argument is not passed, then standard Java APIs typically use the default charset. The JDK chooses the default charset at startup based upon the run-time environment: the operating system, the user's locale, and other factors.

Because the default charset is not the same everywhere, APIs that use the default charset pose many non-obvious hazards, even to experienced developers.

Consider an application that creates a java.io.FileWriter without passing a charset, and then uses it to write some text to a file. The resulting file will contain a sequence of bytes encoded using the default charset of the JDK running the application. A second application, run on a different machine or by a different user on the same machine, creates a java.io.FileReader without passing a charset and uses it to read the bytes in that file. The resulting text contains a sequence of characters decoded using the default charset of the JDK running the second application. If the default charset differs between the JDK of the first application and the JDK of the second application, then the resulting text may be silently corrupted or incomplete, since the FileReader cannot tell that it decoded the text using the wrong charset relative to the FileWriter. Here is an example of this hazard, where a Japanese text file encoded in UTF-8 on macOS is corrupted when read on Windows in US-English or Japanese locales:

java.io.FileReader(“hello.txt”) -> “こんにちは” (macOS)
java.io.FileReader(“hello.txt”) -> “ã?“ã‚“ã?«ã?¡ã? ” (Windows (en-US))
java.io.FileReader(“hello.txt”) -> “縺ォ縺。縺ッ” (Windows (ja-JP))
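The mismatch can be reproduced in memory without two machines. The following is a minimal sketch, not part of the JEP: the class and method names are hypothetical, and ISO-8859-1 is an assumed stand-in for a legacy single-byte Windows codepage because it is available on every Java implementation. The text is written with Unicode escapes so that the example does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encodes text with one charset and decodes the bytes with another,
    // which is what happens when a writer and a reader disagree.
    static String transcode(String text, Charset writerCharset, Charset readerCharset) {
        byte[] bytes = text.getBytes(writerCharset);
        return new String(bytes, readerCharset);
    }

    public static void main(String[] args) {
        // "konnichiwa" in hiragana, five characters
        String original = "\u3053\u3093\u306b\u3061\u306f";
        // Assumption: ISO-8859-1 stands in for a legacy single-byte codepage.
        String garbled = transcode(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(original.equals(garbled)); // false
        System.out.println(garbled.length());         // 15: one char per UTF-8 byte
    }
}
```

No exception is thrown at any point, which is why the corruption described above is silent.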

Developers familiar with such hazards can use methods and constructors that take a charset argument explicitly. However, having to pass an argument prevents methods and constructors from being used via method references (::) in stream pipelines.
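As an illustration of the method-reference point, the sketch below uses String.getBytes, which has both a no-argument form (default charset) and a form that takes a Charset. The class and method names are hypothetical and chosen only for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

public class MethodRefDemo {
    // Method reference form: concise, but String.getBytes() uses the default charset.
    static List<byte[]> encodeWithDefault(List<String> words) {
        return words.stream()
                .map(String::getBytes)
                .collect(Collectors.toList());
    }

    // Explicit charset: portable, but the extra argument forces a lambda.
    static List<byte[]> encodeWithUtf8(List<String> words) {
        return words.stream()
                .map(s -> s.getBytes(StandardCharsets.UTF_8))
                .collect(Collectors.toList());
    }
}
```

When the default charset is UTF-8, the two methods produce the same bytes, so the shorter form no longer trades away portability.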

Developers sometimes attempt to configure the default charset by setting the system property file.encoding on the command line (i.e., java -Dfile.encoding=...), but this has never been supported. Furthermore, attempting to set the property programmatically (i.e., System.setProperty(...)) after the Java runtime has started does not work.

Not all standard Java APIs defer to the JDK's choice of default charset. For example, the methods in java.nio.file.Files that read or write files without a Charset argument are specified to always use UTF-8. The fact that newer APIs default to using UTF-8 while older APIs default to using the default charset is a hazard for applications that use a mix of APIs.
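The behavior of the newer APIs can be checked directly. The following sketch (hypothetical class and method names; it requires Java 11 or later for Files.writeString) writes text without a charset argument and compares the bytes on disk with the UTF-8 encoding of the same text.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class FilesUtf8Demo {
    // Writes text with Files.writeString (no charset argument) and reports whether
    // the bytes on disk are exactly the UTF-8 encoding of that text.
    static boolean writesUtf8(String text) {
        try {
            Path tmp = Files.createTempFile("jep400-", ".txt");
            try {
                Files.writeString(tmp, text);
                byte[] onDisk = Files.readAllBytes(tmp);
                return Arrays.equals(onDisk, text.getBytes(StandardCharsets.UTF_8));
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writesUtf8("\u3053\u3093\u306b\u3061\u306f"));
    }
}
```

The same file read back through a java.io.FileReader created without a charset would be decoded with the default charset instead, which is the mixed-API hazard described above.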

The entire Java ecosystem would benefit if the default charset were specified to be the same everywhere. Applications that are not concerned with portability will see little impact, while applications that embrace portability by passing charset arguments will see no impact. UTF-8 has long been the most common charset on the World Wide Web. UTF-8 is standard for the XML and JSON files processed by vast numbers of Java programs, and Java's own APIs increasingly favor UTF-8 in, e.g., the NIO API and for property files. It therefore makes sense to specify UTF-8 as the default charset for all Java APIs.

We recognize that this change could have a widespread compatibility impact on programs that migrate to JDK 18. For this reason, it will always be possible to recover the pre-JDK 18 behavior, where the default charset is environment-dependent.

Description

In JDK 17 and earlier, the default charset is determined when the Java runtime starts. On macOS, it is UTF-8 except in the POSIX C locale. On other operating systems, it depends upon the user's locale and the default encoding, e.g., on Windows, it is a codepage-based charset such as windows-1252 or windows-31j. The method java.nio.charset.Charset.defaultCharset() returns the default charset. A quick way to see the default charset of the current JDK is with the following command:

java -XshowSettings:properties -version 2>&1 | grep file.encoding
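The same information is available from inside a program. A minimal sketch, with a hypothetical class name, that reports the value returned by Charset.defaultCharset() next to the run-time value of the file.encoding property:

```java
import java.nio.charset.Charset;

public class ShowDefaultCharset {
    // Returns a one-line report of the default charset and the file.encoding property.
    static String report() {
        return "default charset = " + Charset.defaultCharset().name()
                + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Running it on the same machine under different JDK releases or locales is a quick way to see whether an application has been relying on an environment-dependent default.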

Several standard Java APIs use the default charset, including:

We propose to change the specification of Charset.defaultCharset() to say that the default charset is UTF-8 unless configured otherwise by an implementation-specific means. (See below for how to configure the JDK.) The UTF-8 charset is specified by RFC 2279; the transformation format upon which it is based is specified in Amendment 2 of ISO 10646-1 and is also described in the Unicode Standard. It is not to be confused with Modified UTF-8.

We will update the specifications of all standard Java APIs that use the default charset to cross-reference Charset.defaultCharset(). Those APIs include the ones listed above, but not System.out and System.err, whose charset will be as specified by Console.charset().

The file.encoding and native.encoding system properties

As envisaged by the specification of Charset.defaultCharset(), the JDK will allow the default charset to be configured to something other than UTF-8. We will revise the treatment of the system property file.encoding so that setting it on the command line is the supported means of configuring the default charset. We will specify this in an implementation note of System.getProperties() as follows:

Prior to deploying on a JDK where UTF-8 is the default charset, developers are strongly encouraged to check for charset issues by starting the Java runtime with java -Dfile.encoding=UTF-8 ... on their current JDK (8-17).

JDK 17 introduced the native.encoding system property as a standard way for programs to obtain the charset chosen by the JDK's algorithm, regardless of whether the default charset is actually configured to be that charset. In JDK 18, if file.encoding is set to COMPAT on the command line, then the run-time value of file.encoding will be the same as the run-time value of native.encoding; if file.encoding is set to UTF-8 on the command line, then the run-time value of file.encoding may differ from the run-time value of native.encoding.

In Risks and Assumptions below, we discuss how to mitigate the possible incompatibilities that arise from this change to file.encoding, as well as the native.encoding system property and recommendations for applications.
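A program can compare the two at run time. The following is a sketch under stated assumptions: the class and method names are hypothetical, and native.encoding is treated as possibly absent because it is only populated on JDK 17 and later.

```java
import java.nio.charset.Charset;
import java.util.Optional;

public class NativeEncodingCheck {
    // Resolves the native.encoding property to a Charset, if present and supported.
    static Optional<Charset> nativeCharset() {
        String name = System.getProperty("native.encoding"); // null before JDK 17
        if (name == null || name.isEmpty() || !Charset.isSupported(name)) {
            return Optional.empty();
        }
        return Optional.of(Charset.forName(name));
    }

    // True when the default charset in effect is the environment-derived one.
    static boolean defaultMatchesNative() {
        return nativeCharset()
                .map(cs -> cs.equals(Charset.defaultCharset()))
                .orElse(false);
    }

    public static void main(String[] args) {
        System.out.println("native charset: " + nativeCharset());
        System.out.println("default matches native: " + defaultMatchesNative());
    }
}
```

On JDK 18 started with -Dfile.encoding=COMPAT the second line would be expected to report true; with the UTF-8 default it reports true only where the environment's own encoding is also UTF-8.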

There are three charset-related system properties used internally by the JDK. They remain unspecified and unsupported, but are documented here for completeness:

Source file encoding

The Java language allows source code to express Unicode characters in a UTF-16 encoding, and this is unaffected by the choice of UTF-8 for the default charset. However, the javac compiler is affected because it assumes that .java source files are encoded with the default charset, unless configured otherwise by the -encoding option. If source files were saved with a non-UTF-8 encoding and compiled with an earlier JDK, then recompiling on JDK 18 or later may cause problems. For example, if a non-UTF-8 source file has string literals that contain non-ASCII characters, then those literals may be misinterpreted by javac in JDK 18 or later unless -encoding is used.

Prior to compiling on a JDK where UTF-8 is the default charset, developers are strongly encouraged to check for charset issues by compiling with javac -encoding UTF-8 ... on their current JDK (8-17). Alternatively, developers who prefer to save source files with a non-UTF-8 encoding can prevent javac from assuming UTF-8 by setting the -encoding option to the value of the native.encoding system property on JDK 17 and later.

The legacy default charset

In JDK 17 and earlier, the name default is recognized as an alias for the US-ASCII charset. That is, Charset.forName("default") produces the same result as Charset.forName("US-ASCII"). The default alias was introduced in JDK 1.5 to ensure that legacy code which used sun.io converters could migrate to the java.nio.charset framework introduced in JDK 1.4.

It would be extremely confusing for JDK 18 to preserve default as an alias for US-ASCII when the default charset is specified to be UTF-8. It would also be confusing for default to mean US-ASCII when the user configures the default charset to its pre-JDK 18 value by setting -Dfile.encoding=COMPAT on the command line. Redefining default to be an alias not for US-ASCII but rather for the default charset (whether UTF-8 or user-configured) would cause subtle behavioral changes in the (few) programs that call Charset.forName("default").

We believe that continuing to recognize default in JDK 18 would be prolonging a poor decision. It is not defined by the Java SE Platform, nor is it recognized by IANA as the name or alias of any character set. In fact, for ASCII-based network protocols, IANA encourages use of the canonical name US-ASCII rather than just ASCII or obscure aliases such as ANSI_X3.4-1968; plainly, use of the JDK-specific alias default goes counter to that advice. Java programs can use the enum constant StandardCharsets.US_ASCII to make their intent clear, rather than passing a string to Charset.forName(...).

Accordingly, in JDK 18, Charset.forName("default") will throw an UnsupportedCharsetException. This will give developers a chance to detect use of the idiom and migrate to either US-ASCII or to the result of Charset.defaultCharset().
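Where charset names arrive as strings (for example from configuration files), one possible migration is to intercept the legacy alias before it reaches Charset.forName. The helper below is a hypothetical sketch, not an API of the JDK; it assumes the application wants to keep the JDK 17 meaning of the alias, US-ASCII. An application that instead meant "whatever the default is" would return Charset.defaultCharset() at that point.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LegacyDefaultAlias {
    // Hypothetical helper: resolves a charset name while making the JDK 17 meaning
    // of the legacy "default" alias (US-ASCII) explicit on every JDK release.
    static Charset resolve(String name) {
        if ("default".equalsIgnoreCase(name)) {
            return StandardCharsets.US_ASCII;
        }
        // May throw UnsupportedCharsetException or IllegalCharsetNameException.
        return Charset.forName(name);
    }

    public static void main(String[] args) {
        System.out.println(resolve("default")); // US-ASCII
        System.out.println(resolve("UTF-8"));   // UTF-8
    }
}
```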

Testing

Risks and Assumptions

We assume that applications in many environments will see no impact from Java's choice of UTF-8:

In other environments, the risk of changing the default charset to UTF-8 after more than 20 years may be significant. The most obvious risk is that applications which implicitly depend on the default charset (e.g., by not passing an explicit charset argument to APIs) will behave incorrectly when processing data produced when the default charset was unspecified. A further risk is that data corruption may silently occur. We expect the main impact will be to users of Windows in Asian locales, and possibly some server environments in Asian and other locales. Possible scenarios include:

Where application code can be changed, then we recommend it is changed to pass a charset argument to constructors. If an application has no particular preference among charsets, and is satisfied with the traditional environment-driven selection for the default charset, then the following code can be used on all Java releases to obtain the charset determined from the environment:

import java.io.FileReader;
import java.nio.charset.Charset;

String encoding = System.getProperty("native.encoding");  // Populated on JDK 17 and later
Charset cs = (encoding != null) ? Charset.forName(encoding) : Charset.defaultCharset();
var reader = new FileReader("file.txt", cs);

If neither application code nor Java startup can be changed, then it will be necessary to inspect the application code to determine manually whether it will run compatibly on JDK 18.

Alternatives
