Simple, compact charset detection for Java 8+
The state-of-the-art character set detection library for Java is icu4j. However, the icu4j JAR file is about 13MB. This is a hefty price to pay for programs that only require charset detection! There should be a smaller option of the same quality.
The chardet4j library pulls the `CharsetDetector` feature from icu4j and repackages it as this standalone library. This allows programs to make good use of this important feature without bloating their JARs. At the time of this writing, the chardet4j JAR comes in around 85KB. There are no dependencies.
This library also implements some other important components of character set detection and decoding, namely byte order mark handling.
The library assists the user with detecting character set encodings for byte streams and decoding them into character streams. It offers specific abstractions for byte order marks (BOMs) and specific methods for identifying and decoding character encodings for byte arrays and input streams.
The library uses the following algorithm to determine the character encoding of binary data:
- Check for a BOM. If one is present, then trust it, and use the corresponding charset to decode the data.
- Use a battery of bespoke character set detectors to guess which charset is most likely. Users may provide a declared encoding, which provides a boost to the given charset in this estimation process. If a charset is identified with sufficient confidence, then use it to decode the data.
- Otherwise, fall back to the default charset to decode the data, if one is given.
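For example, the first two steps can be observed directly with the detection methods described later in this document. This is only a sketch: it assumes the `Chardet` entry point lives in a `com.sigpwned.chardet4j` package (an assumption based on the Maven coordinates below), and the exact results depend on the detectors' confidence.

```java
import com.sigpwned.chardet4j.Chardet;

import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class DetectionSketch {
    public static void main(String[] args) {
        // Step 1: a UTF-8 BOM (EF BB BF) is trusted immediately
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        Optional<String> fromBom = Chardet.detectCharsetName(withBom);

        // Step 2: without a BOM, the statistical detectors inspect the bytes
        byte[] withoutBom = "héllo wörld".getBytes(StandardCharsets.UTF_8);
        Optional<String> fromStats = Chardet.detectCharsetName(withoutBom);

        System.out.println(fromBom);   // likely Optional[UTF-8]
        System.out.println(fromStats); // likely Optional[UTF-8]
    }
}
```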
The library can be found in Maven Central with the following coordinates:
```xml
<dependency>
    <groupId>com.sigpwned</groupId>
    <artifactId>chardet4j</artifactId>
    <version>75.1.2</version>
</dependency>
```
It is compatible with Java versions 8 and later. chardet4j has no dependencies.
The `$major.$minor.$patch` version of the library is determined by the underlying icu4j version and the local release version. The `$major` and `$minor` are taken from the icu4j version, and `$patch` is the release number of this library for the icu4j version, starting with 0. For example, version 75.1.2 is built on icu4j 75.1 and is the third chardet4j release for that icu4j version.
To decode an `InputStream` to a `Reader` by detecting its character set:
```java
try (Reader chars = Chardet.decode(bytes, StandardCharsets.UTF_8)) {
    // Process chars here
}
```
Charset detection is important when dealing with content of unknown provenance, like content downloaded from the internet or text files uploaded by users. In such cases, users often have a declared encoding, typically from a content type. The name of the declared encoding can be provided as a hint to charset detection:
```java
try (Reader chars = Chardet.decode(bytes, declaredEncoding, StandardCharsets.UTF_8)) {
    // Process chars here
}
```
Byte arrays can be converted directly to Strings as well:
```java
String chars = Chardet.decode(bytes, declaredEncoding, StandardCharsets.UTF_8);
```
Users only interested in detection can detect the charset directly, or by name in case the detected charset is not supported by the JVM:
```java
// Throws an UnsupportedCharsetException if the charset is not supported by the JVM
Optional<Charset> maybeCharset = Chardet.detectCharset(bytes, declaredEncoding);

// Never throws
Optional<String> maybeCharsetName = Chardet.detectCharsetName(bytes, declaredEncoding);
```
The following are more sophisticated use cases and edge cases that most users will not need to worry about.
The easiest way to work with byte order marks directly is with the `BomAwareInputStream` class:
```java
try (BomAwareInputStream bomed = BomAwareInputStream.detect(in)) {
    if (bomed.bom().isPresent()) {
        // A BOM was detected in this byte stream, and can be accessed using
        // bomed.bom()
    } else {
        // No BOM was detected in this byte stream.
    }
}
```
It is not typically required to work with BOMs directly, but it can be useful when creating a custom decode pipeline.
The easiest way to determine which character encoding is in use is with the `DecodedInputStreamReader` class:
```java
try (DecodedInputStreamReader chars = Chardet.decode(bytes, StandardCharsets.UTF_8)) {
    // The charset that was detected and is being used to decode the given byte
    // stream can be accessed using chars.charset()
    Charset charset = chars.charset();
}
```
The Java Standard only requires that distributions support the standard charsets ISO-8859-1, US-ASCII, UTF-8, UTF-16BE, and UTF-16LE. This library detects those charsets and many more besides, so there is a possibility that the detected charset is not supported by the current JVM.
Users are unlikely to hit this situation in the wild, since (a) Java generally supports almost all of the charsets this library detects, and (b) the unsupported charsets are scarce in the wild, and getting more scarce every year.
Regardless, there are a couple ways to manage this situation.
The library throws an `UnsupportedCharsetException` when the detected charset is not supported by the current JVM. Users are free to catch this exception and handle it as desired.
```java
try (Reader chars = Chardet.decode(bytes, StandardCharsets.UTF_8)) {
    // Process chars here
} catch (UnsupportedCharsetException e) {
    // The charset was detected, but is not supported by the current JVM. There are a
    // few ways this is typically handled:
    //
    // - Propagate as an IOException, since the content cannot be decoded properly
    // - Ignore the error and use a default charset
}
```
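For example, the first of those strategies is just a matter of rethrowing. This is a sketch, assuming the surrounding method declares `throws IOException`:

```java
try (Reader chars = Chardet.decode(bytes, StandardCharsets.UTF_8)) {
    // Process chars here
} catch (UnsupportedCharsetException e) {
    // Surface the problem as an IOException so callers can treat it like any
    // other unreadable input
    throw new IOException("detected charset is not supported by this JVM", e);
}
```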
Rather than working with charsets, work with charset names instead. This will never throw an exception.
```java
Optional<String> maybeCharsetName = Chardet.detectCharsetName(bytes);
if (maybeCharsetName.isPresent()) {
    // The charset was detected successfully, and the name can be accessed using
    // maybeCharsetName.get()
} else {
    // The charset could not be detected
}
```
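If a `Charset` object is needed afterwards, the detected name can be resolved defensively with the standard `Charset.isSupported` and `Charset.forName` methods. Continuing the snippet above (a sketch, with UTF-8 as an arbitrary fallback):

```java
// Resolve the detected name only if this JVM supports it; otherwise fall back
// to UTF-8
Charset charset = maybeCharsetName
    .filter(Charset::isSupported)
    .map(Charset::forName)
    .orElse(StandardCharsets.UTF_8);
```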
Users who wish to add new charsets to the JVM should follow the instructions on the `CharsetProvider` class. The library will automatically pick up any such new charsets.
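For illustration only, a hypothetical provider could make the detected-but-often-unsupported name ISO-8859-8-I resolvable by delegating to the JVM's built-in ISO-8859-8, which shares its byte-to-character mapping (the two differ only in implied bidirectional ordering). This sketch is not part of chardet4j; it would be registered by listing the class name in `META-INF/services/java.nio.charset.spi.CharsetProvider`:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

public class Iso88598IProvider extends CharsetProvider {
    // The built-in charset we delegate to for actual decoding and encoding
    private static final Charset DELEGATE = Charset.forName("ISO-8859-8");

    // A charset registered under the canonical name ISO-8859-8-I
    private static final Charset ISO_8859_8_I = new Charset("ISO-8859-8-I", new String[0]) {
        @Override
        public boolean contains(Charset cs) {
            return DELEGATE.contains(cs);
        }

        @Override
        public CharsetDecoder newDecoder() {
            return DELEGATE.newDecoder();
        }

        @Override
        public CharsetEncoder newEncoder() {
            return DELEGATE.newEncoder();
        }
    };

    @Override
    public Iterator<Charset> charsets() {
        return Collections.<Charset>singleton(ISO_8859_8_I).iterator();
    }

    @Override
    public Charset charsetForName(String charsetName) {
        // Per the CharsetProvider contract, return null for unknown names
        return ISO_8859_8_I.name().equalsIgnoreCase(charsetName) ? ISO_8859_8_I : null;
    }
}
```

Whether delegation like this is appropriate depends on the application; the point is only that any charset registered through the SPI becomes visible to the library's detection and decoding methods.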
The following configuration variables are available to customize the working of the library.
One way the library detects character encodings is by analyzing the leading bytes of a binary file. The more data the library analyzes, the more accurate the estimates will be, but the longer it will take. By default, this value is 8192 bytes, or 8KiB. Users can change this value by setting the `chardet4j.detect.bufsize` system property. For example, to set this value to 16KiB, use:
```sh
java -Dchardet4j.detect.bufsize=16384 ...
```
Adjusting the buffer size can be useful when dealing with particularly large files where detection accuracy or performance might be a concern.
The chardet4j library, and Java in general, support the following character encodings at the following levels:
| Name | Standard | ICU4J | BOM | Laptop |
|--------------|---|----|---|---|
| Big5 | | ✔ | | ✔ |
| EUC-JP | | ✔ | | ✔ |
| EUC-KR | | ✔ | | ✔ |
| GB18030 | | ✔ | ✔ | ✔ |
| ISO-2022-CN | | ✔ | | ✔ |
| ISO-2022-JP | | ✔ | | ✔ |
| ISO-2022-KR | | ✔ | | ✔ |
| ISO-8859-1 | ✔ | ✔ | | ✔ |
| ISO-8859-2 | | ✔ | | ✔ |
| ISO-8859-5 | | ✔ | | ✔ |
| ISO-8859-6 | | ✔ | | ✔ |
| ISO-8859-7 | | ✔ | | ✔ |
| ISO-8859-8 | | ✔ | | ✔ |
| ISO-8859-8-I | | ✔ | | |
| ISO-8859-9 | | ✔ | | ✔ |
| KOI8-R | | ✔ | | ✔ |
| Shift_JIS | | ✔ | | ✔ |
| US-ASCII | ✔ | ✔* | | ✔ |
| UTF-1 | | | ✔ | |
| UTF-16BE | ✔ | ✔ | ✔ | ✔ |
| UTF-16LE | ✔ | ✔ | ✔ | ✔ |
| UTF-32BE | | ✔ | ✔ | ✔ |
| UTF-32LE | | ✔ | ✔ | ✔ |
| UTF-8 | ✔ | ✔ | ✔ | ✔ |
| UTF-EBCDIC | | | ✔ | |
| windows-1250 | | ✔ | | ✔ |
| windows-1251 | | ✔ | | ✔ |
| windows-1252 | | ✔ | | ✔ |
| windows-1253 | | ✔ | | ✔ |
| windows-1254 | | ✔ | | ✔ |
| windows-1255 | | ✔ | | ✔ |
| windows-1256 | | ✔ | | ✔ |
Notes:

\* ICU4J detects US-ASCII as ISO-8859-1, a superset of US-ASCII
The support levels have the following meanings:
- Standard -- The Java Standard requires that all JVMs support this character encoding
- ICU4J -- The ICU4J project has a bespoke charset recognizer for this character encoding
- BOM -- The character encoding can be detected by Byte Order Mark
- Laptop -- The character set is supported by `java version "1.8.0_321"` on my laptop (Obviously, this test is completely unscientific. If you have a better suggestion, please open an issue!)
The icu library is released under the ICU license. The chardet4j library is released under the Apache license. For more details, see the LICENSE file.