hsivonen/encoding_rs

A Gecko-oriented implementation of the Encoding Standard in Rust

encoding_rs

encoding_rs is an implementation of the (non-JavaScript parts of) the Encoding Standard written in Rust.

The Encoding Standard defines the Web-compatible set of character encodings, which means this crate can be used to decode Web content. encoding_rs is used in Gecko starting with Firefox 56. Due to the notable overlap between the legacy encodings on the Web and the legacy encodings used on Windows, this crate may be of use for non-Web-related situations as well; see below for links to adjacent crates.

Additionally, the `mem` module provides various operations for dealing with in-RAM text (as opposed to data that's coming from or going to an IO boundary). The `mem` module is a module instead of a separate crate due to internal implementation detail efficiencies.

Functionality

Due to the Gecko use case, encoding_rs supports decoding to and encoding from UTF-16 in addition to supporting the usual Rust use case of decoding to and encoding from UTF-8. Additionally, the API has been designed to be FFI-friendly to accommodate the C++ side of Gecko.

Specifically, encoding_rs does the following:

Decodes a stream of bytes in an Encoding Standard-defined character encoding into valid aligned native-endian in-RAM UTF-16 (units of `u16` / `char16_t`).
Encodes a stream of potentially-invalid aligned native-endian in-RAM UTF-16 (units of `u16` / `char16_t`) into a sequence of bytes in an Encoding Standard-defined character encoding as if the lone surrogates had been replaced with the REPLACEMENT CHARACTER before performing the encode. (Gecko's UTF-16 is potentially invalid.)
Decodes a stream of bytes in an Encoding Standard-defined character encoding into valid UTF-8.
Encodes a stream of valid UTF-8 into a sequence of bytes in an Encoding Standard-defined character encoding. (Rust's UTF-8 is guaranteed-valid.)
Does the above in streaming (input and output split across multiple buffers) and non-streaming (whole input in a single buffer and whole output in a single buffer) variants.
Avoids copying (borrows) when possible in the non-streaming cases when decoding to or encoding from UTF-8.
Resolves textual labels that identify character encodings in protocol text into type-safe objects representing those encodings conceptually.
Maps the type-safe encoding objects onto strings suitable for returning from `document.characterSet`.
Validates UTF-8 (in common instruction set scenarios a bit faster for Web workloads than the standard library; hopefully will get upstreamed some day) and ASCII.
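
The "as if the lone surrogates had been replaced with the REPLACEMENT CHARACTER" behavior for potentially-invalid UTF-16 can be pictured with a small sketch. It uses only the Rust standard library, not encoding_rs, and `utf16_to_string_lossy` is a hypothetical helper name chosen for illustration. It says nothing about how the crate achieves the same observable result internally.

```rust
/// Hypothetical helper: turns potentially-invalid UTF-16 into valid UTF-8,
/// substituting U+FFFD for every lone surrogate.
fn utf16_to_string_lossy(units: &[u16]) -> String {
    char::decode_utf16(units.iter().copied())
        // `decode_utf16` yields Err for each unpaired surrogate.
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}

fn main() {
    // 'a', a lone high surrogate, 'b'
    let input = [0x0061, 0xD800, 0x0062];
    let text = utf16_to_string_lossy(&input);
    assert_eq!(text, "a\u{FFFD}b");
    println!("{}", text);
}
```

A well-formed surrogate pair passes through unchanged; only unpaired surrogates are replaced.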

Additionally, `encoding_rs::mem` does the following:

Checks if a byte buffer contains only ASCII.
Checks if a potentially-invalid UTF-16 buffer contains only Basic Latin (ASCII).
Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16 buffer contains only Latin1 code points (below U+0100).
Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16 buffer or a code point or a UTF-16 code unit can trigger right-to-left behavior (suitable for checking if the Unicode Bidirectional Algorithm can be optimized out).
Combined versions of the above two checks.
Converts valid UTF-8, potentially-invalid UTF-8 and Latin1 to UTF-16.
Converts potentially-invalid UTF-16 and Latin1 to UTF-8.
Converts UTF-8 and UTF-16 to Latin1 (if in range).
Finds the first invalid code unit in a buffer of potentially-invalid UTF-16.
Makes a mutable buffer of potentially-invalid UTF-16 contain valid UTF-16.
Copies ASCII from one buffer to another up to the first non-ASCII byte.
Converts ASCII to UTF-16 up to the first non-ASCII byte.
Converts UTF-16 to ASCII up to the first non-Basic Latin code unit.
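
To make the Latin1-related items above concrete, here is a minimal std-only sketch of what "Latin1" means in this list: one byte per code point, for code points below U+0100. The function names are hypothetical and are not the `encoding_rs::mem` API; the crate's own functions work on caller-provided buffers and are optimized, which this sketch is not.

```rust
/// Latin1 (one byte per code point, U+0000..=U+00FF) to UTF-8.
fn latin1_to_utf8(bytes: &[u8]) -> String {
    // Each Latin1 byte value equals its Unicode code point.
    bytes.iter().map(|&b| b as char).collect()
}

/// True if every code point in `s` is below U+0100.
fn only_latin1(s: &str) -> bool {
    s.chars().all(|c| (c as u32) < 0x100)
}

/// UTF-8 to Latin1, or `None` if some code point is out of range.
fn utf8_to_latin1_checked(s: &str) -> Option<Vec<u8>> {
    s.chars()
        .map(|c| if (c as u32) < 0x100 { Some(c as u8) } else { None })
        .collect()
}

fn main() {
    let bytes = [0x63, 0x61, 0x66, 0xE9]; // "café" in Latin1
    let text = latin1_to_utf8(&bytes);
    assert_eq!(text, "café");
    assert!(only_latin1(&text));
    assert_eq!(utf8_to_latin1_checked(&text), Some(bytes.to_vec()));
    // U+20AC EURO SIGN is outside the Latin1 range.
    assert_eq!(utf8_to_latin1_checked("€"), None);
}
```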

Integration with `std::io`

Notably, the above feature list doesn't include the capability to wrap a `std::io::Read`, decode it into UTF-8 and present the result via `std::io::Read`. The `encoding_rs_io` crate provides that capability.
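
For readers unfamiliar with the pattern, the following std-only sketch shows what "wrap a `std::io::Read`, decode it into UTF-8 and present the result via `std::io::Read`" looks like for the simplest possible case: Latin1 in the sense of one byte per code point below U+0100. `Latin1ToUtf8` is a hypothetical type written for this illustration; it is not part of encoding_rs or `encoding_rs_io` and does not reflect how those crates are implemented.

```rust
use std::io::{self, Read};

/// Hypothetical adapter: reads Latin1 bytes from `inner`, yields UTF-8 bytes.
struct Latin1ToUtf8<R: Read> {
    inner: R,
    /// Second byte of a two-byte UTF-8 sequence that did not fit last time.
    pending: Option<u8>,
}

impl<R: Read> Latin1ToUtf8<R> {
    fn new(inner: R) -> Self {
        Latin1ToUtf8 { inner, pending: None }
    }
}

impl<R: Read> Read for Latin1ToUtf8<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        if out.is_empty() {
            return Ok(0);
        }
        // Flush a leftover trail byte first; `Read` permits short reads.
        if let Some(b) = self.pending.take() {
            out[0] = b;
            return Ok(1);
        }
        // Each input byte becomes at most two output bytes, so read at most
        // half the output capacity (but at least one byte to make progress).
        let mut src = [0u8; 256];
        let want = std::cmp::max(1, out.len() / 2).min(src.len());
        let n = self.inner.read(&mut src[..want])?;
        let mut written = 0;
        for &b in &src[..n] {
            if b < 0x80 {
                out[written] = b;
                written += 1;
            } else {
                out[written] = 0xC0 | (b >> 6);
                written += 1;
                let trail = 0x80 | (b & 0x3F);
                if written < out.len() {
                    out[written] = trail;
                    written += 1;
                } else {
                    // Only reachable when `out` holds a single byte.
                    self.pending = Some(trail);
                }
            }
        }
        Ok(written)
    }
}

fn main() -> io::Result<()> {
    let mut reader = Latin1ToUtf8::new(&b"caf\xE9"[..]);
    let mut text = String::new();
    reader.read_to_string(&mut text)?;
    assert_eq!(text, "café");
    Ok(())
}
```

A real adapter for the Encoding Standard encodings also has to carry multi-byte decoder state across reads, which is the part the dedicated crate takes care of.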

`no_std` Environment

The crate works in a `no_std` environment. By default, the `alloc` feature, which assumes that an allocator is present, is enabled. For a no-allocator environment, the default features (i.e. `alloc`) can be turned off. This makes the part of the API that returns `Vec`/`String`/`Cow` unavailable.

Decoding Email

For decoding character encodings that occur in email, use the `charset` crate instead of using this one directly. (It wraps this crate and adds UTF-7 decoding.)

Windows Code Page Identifier Mappings

For mappings to and from Windows code page identifiers, use the `codepage` crate.

DOS Encodings

This crate does not support single-byte DOS encodings that aren't required by the Web Platform, but the `oem_cp` crate does.

Preparing Text for the Encoders

Normalizing text into Unicode Normalization Form C prior to encoding text into a legacy encoding minimizes unmappable characters. Text can be normalized to Unicode Normalization Form C using the `icu_normalizer` crate.

The exception is windows-1258, which after normalizing to Unicode Normalization Form C requires tone marks to be decomposed in order to minimize unmappable characters. Vietnamese tone marks can be decomposed using the `detone` crate.

Licensing

TL;DR: `(Apache-2.0 OR MIT) AND BSD-3-Clause` for the code and data combination.

Please see the file named COPYRIGHT.

The non-test code that isn't generated from the WHATWG data in this crate is under Apache-2.0 OR MIT. Test code is under CC0.

This crate contains code/data generated from WHATWG-supplied data. The WHATWG upstream changed its license for portions of specs incorporated into source code from CC0 to BSD-3-Clause between the initial release of this crate and the present version of this crate. The in-source licensing legends have been updated for the parts of the generated code that have changed since the upstream license change.

Documentation

Generated API documentation is available online.

There is a long-form write-up about the design and internals of the crate.

C and C++ bindings

An FFI layer for encoding_rs is available as a separate crate. The crate comes with a demo C++ wrapper using the C++ standard library and GSL types.

The bindings for the `mem` module are in the `encoding_c_mem` crate.

For the Gecko context, there's a C++ wrapper using the MFBT/XPCOM types.

There's a write-up about the C++ wrappers.

Sample programs

Optional features

There are currently these optional cargo features:

`simd-accel`

Enables SIMD acceleration using the nightly-dependent `portable_simd` standard library feature.

This is an opt-in feature, because enabling this feature opts out of Rust's guarantees of future compilers compiling old code (aka. "stability story").

Currently, this has not been tested to be an improvement except for these targets, and enabling the `simd-accel` feature is expected to break the build on other targets:

x86_64
i686
aarch64
thumbv7neon

If you use nightly Rust, you use targets whose first component is one of the above, and you are prepared to have to revise your configuration when updating Rust, you should enable this feature. Otherwise, please do not enable this feature.

Used by Firefox.

`serde`

Enables support for serializing and deserializing `&'static Encoding`-typed struct fields using Serde.

Not used by Firefox.

`fast-legacy-encode`

A catch-all option for enabling the fastest legacy encode options. Does not affect decode speed or UTF-8 encode speed.

At present, this option is equivalent to enabling the following options:

fast-hangul-encode
fast-hanja-encode
fast-kanji-encode
fast-gb-hanzi-encode
fast-big5-hanzi-encode

Adds 176 KB to the binary size.

Not used by Firefox.

`fast-hangul-encode`

Changes encoding precomposed Hangul syllables into EUC-KR from binary search over the decode-optimized tables to lookup by index making Korean plain-text encode about 4 times as fast as without this option.

Adds 20 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

`fast-hanja-encode`

Changes encoding of Hanja into EUC-KR from linear search over the decode-optimized table to lookup by index. Since Hanja is practically absent in modern Korean text, this option doesn't affect performance in the common case and mainly makes sense if you want to make your application resilient against denial of service by someone intentionally feeding it a lot of Hanja to encode into EUC-KR.

Adds 40 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

`fast-kanji-encode`

Changes encoding of Kanji into Shift_JIS, EUC-JP and ISO-2022-JP from linear search over the decode-optimized tables to lookup by index making Japanese plain-text encode to legacy encodings 30 to 50 times as fast as without this option (about 2 times as fast as with `less-slow-kanji-encode`).

Takes precedence over `less-slow-kanji-encode`.

Adds 36 KB to the binary size (24 KB compared to `less-slow-kanji-encode`).

Does not affect decode speed.

Not used by Firefox.

`less-slow-kanji-encode`

Makes JIS X 0208 Level 1 Kanji (the most common Kanji in Shift_JIS, EUC-JP and ISO-2022-JP) encode less slow (binary search instead of linear search) making Japanese plain-text encode to legacy encodings 14 to 23 times as fast as without this option.

Adds 12 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

`fast-gb-hanzi-encode`

Changes encoding of Hanzi in the CJK Unified Ideographs block into GBK and gb18030 from linear search over a part of the decode-optimized tables followed by a binary search over another part of the decode-optimized tables to lookup by index making Simplified Chinese plain-text encode to the legacy encodings 100 to 110 times as fast as without this option (about 2.5 times as fast as with `less-slow-gb-hanzi-encode`).

Takes precedence over `less-slow-gb-hanzi-encode`.

Adds 36 KB to the binary size (24 KB compared to `less-slow-gb-hanzi-encode`).

Does not affect decode speed.

Not used by Firefox.

`less-slow-gb-hanzi-encode`

Makes GB2312 Level 1 Hanzi (the most common Hanzi in gb18030 and GBK) encode less slow (binary search instead of linear search) making Simplified Chinese plain-text encode to the legacy encodings about 40 times as fast as without this option.

Adds 12 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

`fast-big5-hanzi-encode`

Changes encoding of Hanzi in the CJK Unified Ideographs block into Big5 from linear search over a part of the decode-optimized tables to lookup by index making Traditional Chinese plain-text encode to Big5 105 to 125 times as fast as without this option (about 3 times as fast as with `less-slow-big5-hanzi-encode`).

Takes precedence over `less-slow-big5-hanzi-encode`.

Adds 40 KB to the binary size (20 KB compared to `less-slow-big5-hanzi-encode`).

Does not affect decode speed.

Not used by Firefox.

`less-slow-big5-hanzi-encode`

Makes Big5 Level 1 Hanzi (the most common Hanzi in Big5) encode less slow (binary search instead of linear search) making Traditional Chinese plain-text encode to Big5 about 36 times as fast as without this option.

Adds 20 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

Performance goals

For decoding to UTF-16, the goal is to perform at least as well as Gecko's old uconv. For decoding to UTF-8, the goal is to perform at least as well as rust-encoding. These goals have been achieved.

Encoding to UTF-8 should be fast. (UTF-8 to UTF-8 encode should be equivalent to `memcpy` and UTF-16 to UTF-8 should be fast.)

Speed is a non-goal when encoding to legacy encodings. By default, encoding to legacy encodings should not be optimized for speed at the expense of code size as long as form submission and URL parsing in Gecko don't become noticeably too slow in real-world use.

In the interest of binary size, by default, encoding_rs does not have encode-specific data tables beyond 32 bits of encode-specific data for each single-byte encoding. Therefore, encoders search the decode-optimized data tables. This is a linear search in most cases. As a result, by default, encode to legacy encodings varies from slow to extremely slow relative to other libraries. Still, with realistic workloads, this seemed fast enough not to be user-visibly slow on Raspberry Pi 3 (which stood in for a phone for testing) in the Web-exposed encoder use cases.
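
The trade-off between searching a decode-optimized table and shipping an extra encode-oriented table can be sketched with a toy single-byte encoding. Everything here is hypothetical: the four-entry tables are made-up illustration data, not a real Encoding Standard index, and the functions are not the crate's code. The sketch only shows why the default costs no extra data but scales with table length, while lookup by index costs extra bytes in the binary.

```rust
/// Toy decode-optimized table: index = byte - 0x80.
/// Illustrative data only, NOT a real Encoding Standard index.
const TOY_DECODE: [char; 4] = ['γ', 'α', 'δ', 'β'];

/// Toy encode-oriented table: index = code point - U+03B1 (α).
/// This is the kind of extra data a "fast" option has to ship.
const TOY_ENCODE: [u8; 4] = [0x81, 0x83, 0x80, 0x82];

fn decode_byte(b: u8) -> Option<char> {
    if b < 0x80 {
        Some(b as char)
    } else {
        TOY_DECODE.get((b - 0x80) as usize).copied()
    }
}

/// Encode by linearly searching the decode table: no extra data,
/// but the cost grows with the length of the table.
fn encode_linear(c: char) -> Option<u8> {
    if c.is_ascii() {
        return Some(c as u8);
    }
    TOY_DECODE.iter().position(|&x| x == c).map(|i| 0x80 + i as u8)
}

/// Encode by direct index into the encode-oriented table: constant time.
fn encode_indexed(c: char) -> Option<u8> {
    if c.is_ascii() {
        return Some(c as u8);
    }
    let offset = (c as u32).checked_sub(0x03B1)? as usize;
    TOY_ENCODE.get(offset).copied()
}

fn main() {
    for b in 0x80..=0x83u8 {
        let c = decode_byte(b).unwrap();
        // Both strategies must round-trip to the same byte.
        assert_eq!(encode_linear(c), Some(b));
        assert_eq!(encode_indexed(c), Some(b));
    }
    // 'ω' is unmappable in this toy encoding.
    assert_eq!(encode_linear('ω'), None);
    assert_eq!(encode_indexed('ω'), None);
}
```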

See the cargo features above for optionally making CJK legacy encode fast.

A framework for measuring performance is available separately.

Rust Version Compatibility

It is a goal to support the latest stable Rust, the latest nightly Rust and the version of Rust that's used for Firefox Nightly.

At this time, there is no firm commitment to support a version older than what's required by Firefox, and there is no commitment to treat MSRV changes as semver-breaking, because this crate depends on `cfg-if`, which doesn't appear to treat MSRV changes as semver-breaking, so it would be useless for this crate to treat MSRV changes as semver-breaking.

As of 2024-11-01, MSRV appears to be Rust 1.40.0 for using the crate and 1.42.0 for doc tests to pass without errors about the global allocator. With the `simd-accel` feature, the MSRV is even higher.

Compatibility with rust-encoding

A compatibility layer that implements the rust-encoding API on top of encoding_rs is provided as a separate crate (cannot be uploaded to crates.io). The compatibility layer was originally written with the assumption that Firefox would need it, but it is not currently used in Firefox.

Regenerating Generated Code

To regenerate the generated code:

Have Python 2 installed.
Clone https://github.com/hsivonen/encoding_c next to the `encoding_rs` directory.
Clone https://github.com/hsivonen/codepage next to the `encoding_rs` directory.
Clone https://github.com/whatwg/encoding next to the `encoding_rs` directory.
Checkout revision `1d519bf8e5555cef64cf3a712485f41cd1a6a990` of the `encoding` repo. (Note: `f381389` was the revision of `encoding` used from before the `encoding` repo license change.)
With the `encoding_rs` directory as the working directory, run `python generate-encoding-data.py`.

Roadmap

Design the low-level API.
Provide Rust-only convenience features.
Provide an stl/gsl-flavored C++ API.
Implement all decoders and encoders.
Add unit tests for all decoders and encoders.
Finish BOM sniffing variants in Rust-only convenience features.
Document the API.
Publish the crate on crates.io.
Create a solution for measuring performance.
Accelerate ASCII conversions using SSE2 on x86.
Accelerate ASCII conversions using ALU register-sized operations on non-x86 architectures (process a `usize` instead of `u8` at a time).
Split FFI into a separate crate so that the FFI doesn't interfere with LTO in pure-Rust usage.
Compress CJK indices by making use of sequential code points as well as Unicode-ordered parts of indices.
Make lookups by label or name use binary search that searches from the end of the label/name to the start.
Make labels with non-ASCII bytes fail fast.
~~Parallelize UTF-8 validation using Rayon.~~ (This turned out to be a pessimization in the ASCII case due to memory bandwidth reasons.)
Provide an XPCOM/MFBT-flavored C++ API.
Investigate accelerating single-byte encode with a single fast-tracked range per encoding.
Replace uconv with encoding_rs in Gecko.
Implement the rust-encoding API in terms of encoding_rs.
Add SIMD acceleration for Aarch64.
Investigate the use of NEON on 32-bit ARM.
~~Investigate Björn Höhrmann's lookup table acceleration for UTF-8 as adapted to Rust in rust-encoding.~~
Add actually fast CJK encode options.
~~Investigate Bob Steagall's lookup table acceleration for UTF-8.~~
Provide a build mode that works without `alloc` (with lesser API surface).
Migrate to `std::simd` ~~once it is stable and declare 1.0.~~
Migrate `unsafe` slice access by larger types than `u8`/`u16` to `align_to`.
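
The roadmap item about processing a `usize` instead of a `u8` at a time can be illustrated with a safe, std-only ASCII check. This is a minimal sketch of the general word-at-a-time idea under the assumption that unaligned chunked reads are acceptable; `is_ascii_wordwise` is a hypothetical name, and the crate's actual implementation (which deals with alignment and uses `unsafe`) is not shown here.

```rust
use std::convert::TryInto;
use std::mem::size_of;

/// Number of bytes in one ALU register-sized word.
const WORD: usize = size_of::<usize>();

/// 0x80 repeated in every byte of a word: 0x8080...80.
const HIGH_BITS: usize = usize::MAX / 0xFF * 0x80;

/// Hypothetical helper: checks for ASCII one `usize` at a time.
fn is_ascii_wordwise(bytes: &[u8]) -> bool {
    let mut chunks = bytes.chunks_exact(WORD);
    for chunk in &mut chunks {
        // Reassemble the chunk as a native-endian word.
        let word = usize::from_ne_bytes(chunk.try_into().unwrap());
        // Any byte >= 0x80 has its high bit set.
        if word & HIGH_BITS != 0 {
            return false;
        }
    }
    // Handle the tail that is shorter than one word byte by byte.
    chunks.remainder().iter().all(|&b| b < 0x80)
}

fn main() {
    assert!(is_ascii_wordwise(b"plain ASCII text that spans several words"));
    assert!(!is_ascii_wordwise("caf\u{e9}".as_bytes()));
    println!("ok");
}
```

The mask test replaces `WORD` separate byte comparisons with a single AND, which is the point of the technique on targets without SIMD.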

Release Notes

0.8.35

Implement changes for GB18030-2022. (Intentionally not treated as a semver break in practice even if this could be argued to be a breaking change in theory.)

0.8.34

Use the `portable_simd` nightly feature of the standard library instead of the `packed_simd` crate. Only affects the `simd-accel` optional nightly feature.
Internal documentation improvements and minor code improvements around `unsafe`.
Added `rust-version` to `Cargo.toml`.

0.8.33

Use `packed_simd` instead of `packed_simd_2` again now that updates are back under the `packed_simd` name. Only affects the `simd-accel` optional nightly feature.

0.8.32

Removed `build.rs`. (This removal should resolve false positives reported by some antivirus products. This may break some build configurations that have opted out of Rust's guarantees against future build breakage.)
Internal change to what API is used for reinterpreting the lane configuration of SIMD vectors.
Documentation improvements.

0.8.31

Use SPDX with parentheses now that crates.io supports parentheses.

0.8.30

Update the licensing information to take into account the WHATWG data license change.

0.8.29

Make the parts that use an allocator optional.

0.8.28

Fix error in Serde support introduced as part of `no_std` support.

0.8.27

Make the crate work in a `no_std` environment (with `alloc`).

0.8.26

Fix oversights in edition 2018 migration that broke the `simd-accel` feature.

0.8.25

Do pointer alignment checks in a way where intermediate steps aren't defined to be Undefined Behavior.
Update the `packed_simd` dependency to `packed_simd_2`.
Update the `cfg-if` dependency to 1.0.
Address warnings that have been introduced by newer Rust versions along the way.
Update to edition 2018, since even prior to 1.0 `cfg-if` updated to edition 2018 without a semver break.

0.8.24

Avoid computing an intermediate (not dereferenced) pointer value in a manner designated as Undefined Behavior when computing pointer alignment.

0.8.23

Remove year from copyright notices. (No features or bug fixes.)

0.8.22

Formatting fix and new unit test. (No features or bug fixes.)

0.8.21

Fixed a panic with invalid UTF-16[BE|LE] input at the end of the stream.

0.8.20

Make `Decoder::latin1_byte_compatible_up_to` return `None` in more cases to make the method actually useful. While this could be argued to be a breaking change due to the bug fix changing semantics, it does not break callers that had to handle the `None` case in a reasonable way anyway.

0.8.19

Removed a bunch of bound checks in `convert_str_to_utf16`.
Added `mem::convert_utf8_to_utf16_without_replacement`.

0.8.18

Added `mem::utf8_latin1_up_to` and `mem::str_latin1_up_to`.
Added `Decoder::latin1_byte_compatible_up_to`.

0.8.17

Update `bincode` (dev dependency) version requirement to 1.0.

0.8.16

Switch from the `simd` crate to `packed_simd`.

0.8.15

Adjust documentation for `simd-accel` (README-only release).

0.8.14

Made UTF-16 to UTF-8 encode conversion fill the output buffer as closely as possible.

0.8.13

Made the UTF-8 to UTF-16 decoder compare the number of code units written with the length of the right slice (the output slice) to fix a panic introduced in 0.8.11.

0.8.12

Removed the `clippy::` prefix from clippy lint names.

0.8.11

Changed minimum Rust requirement to 1.29.0 (for the ability to refer to the interior of a `static` when defining another `static`).
Explicitly aligned the lookup tables for single-byte encodings and UTF-8 to cache lines in the hope of freeing up one cache line for other data. (Perhaps the tables were already aligned and this is placebo.)
Added 32 bits of encode-oriented data for each single-byte encoding. The change was performance-neutral for non-Latin1-ish Latin legacy encodings, improved Latin1-ish and Arabic legacy encode speed somewhat (new speed is 2.4x the old speed for German, 2.3x for Arabic, 1.7x for Portuguese and 1.4x for French) and improved non-Latin1, non-Arabic legacy single-byte encode a lot (7.2x for Thai, 6x for Greek, 5x for Russian, 4x for Hebrew).
Added compile-time options for fast CJK legacy encode options (at the cost of binary size (up to 176 KB) and run-time memory usage). These options still retain the overall code structure instead of rewriting the CJK encoders totally, so the speed isn't as good as what could be achieved by using even more memory / making the binary even larger.
Made UTF-8 decode and validation faster.
Added method `is_single_byte()` on `Encoding`.
Added `mem::decode_latin1()` and `mem::encode_latin1_lossy()`.

0.8.10

Disabled a unit test that tests a panic condition when the assertion being tested is disabled.

0.8.9

Made `--features simd-accel` work with stable-channel compiler to simplify the Firefox build system.

0.8.8

Made the `is_foo_bidi()` functions not treat U+FEFF (ZERO WIDTH NO-BREAK SPACE aka. BYTE ORDER MARK) as right-to-left.
Made the `is_foo_bidi()` functions report `true` if the input contains Hebrew presentation forms (which are right-to-left but not in a right-to-left-roadmapped block).

0.8.7

Fixed a panic in the UTF-16LE/UTF-16BE decoder when decoding to UTF-8.

0.8.6

Temporarily removed the debug assertion added in version 0.8.5 from `convert_utf16_to_latin1_lossy`.

0.8.5

If debug assertions are enabled but fuzzing isn't enabled, lossy conversions to Latin1 in the `mem` module assert that the input is in the range U+0000...U+00FF (inclusive).
In the `mem` module, provide conversions from Latin1 and UTF-16 to UTF-8 that can deal with insufficient output space. The idea is to use them first with an allocation rounded up to jemalloc bucket size and do the worst-case allocation only if the jemalloc rounding up was insufficient as the first guess.
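
The "first guess, then worst case" idea from the 0.8.5 notes can be sketched as follows. This is a std-only illustration with hypothetical function names, not the `mem` module's API or code, and the `first_guess` parameter merely stands in for an allocation rounded up to an allocator bucket size.

```rust
/// Converts as much of `src` (Latin1) as fits into `dst` (UTF-8).
/// Returns (bytes read from `src`, bytes written to `dst`).
fn latin1_to_utf8_partial(src: &[u8], dst: &mut [u8]) -> (usize, usize) {
    let mut written = 0;
    for (read, &b) in src.iter().enumerate() {
        let needed = if b < 0x80 { 1 } else { 2 };
        if dst.len() - written < needed {
            // Stop before a code point that would not fit completely.
            return (read, written);
        }
        if b < 0x80 {
            dst[written] = b;
        } else {
            dst[written] = 0xC0 | (b >> 6);
            dst[written + 1] = 0x80 | (b & 0x3F);
        }
        written += needed;
    }
    (src.len(), written)
}

/// Tries a modest first allocation, then falls back to the worst case
/// (two UTF-8 bytes per remaining Latin1 byte) only if needed.
fn latin1_to_string(src: &[u8], first_guess: usize) -> String {
    let mut buf = vec![0u8; first_guess];
    let (read, written) = latin1_to_utf8_partial(src, &mut buf);
    buf.truncate(written);
    if read < src.len() {
        let rest = &src[read..];
        let mut tail = vec![0u8; rest.len() * 2];
        let (_, tail_written) = latin1_to_utf8_partial(rest, &mut tail);
        tail.truncate(tail_written);
        buf.extend_from_slice(&tail);
    }
    String::from_utf8(buf).expect("output is valid UTF-8 by construction")
}

fn main() {
    // The first guess of 4 bytes is too small for 1 + 2 + 2 + 2 bytes.
    assert_eq!(latin1_to_string(b"a\xE9\xE9\xE9", 4), "aééé");
}
```

For mostly-ASCII input the first guess usually suffices, so the doubled worst-case allocation is avoided in the common case.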

0.8.4

Fix SSE2-specific, `simd-accel`-specific memory corruption introduced in version 0.8.1 in conversions between UTF-16 and Latin1 in the `mem` module.

0.8.3

Removed an `#[inline(never)]` annotation that was not meant for release.

0.8.2

Made non-ASCII UTF-16 to UTF-8 encode faster by manually omitting bound checks and manually adding branch prediction annotations.

0.8.1

Tweaked loop unrolling and memory alignment for SSE2 conversions between UTF-16 and Latin1 in the `mem` module to increase the performance when converting long buffers.

0.8.0

Changed the minimum supported version of Rust to 1.21.0 (semver breaking change).
Flipped around the defaults vs. optional features for controlling the size vs. speed trade-off for Kanji and Hanzi legacy encode (semver breaking change).
Added NEON support on ARMv7.
SIMD-accelerated x-user-defined to UTF-16 decode.
Made UTF-16LE and UTF-16BE decode a lot faster (including SIMD acceleration).

0.7.2

Add the `mem` module.
Refactor SIMD code which can affect performance outside the `mem` module.

0.7.1

When encoding from invalid UTF-16, correctly handle U+DC00 followed by another low surrogate.

0.7.0

Make `replacement` a label of the replacement encoding. (Spec change.)
Remove `Encoding::for_name()`. (`Encoding::for_label(foo).unwrap()` is now close enough after the above label change.)
Remove the `parallel-utf8` cargo feature.
Add optional Serde support for `&'static Encoding`.
Performance tweaks for ASCII handling.
Performance tweaks for UTF-8 validation.
SIMD support on aarch64.

0.6.11

Make `Encoder::has_pending_state()` public.
Update the `simd` crate dependency to 0.2.0.

0.6.10

Reserve enough space for NCRs when encoding to ISO-2022-JP.
Correct max length calculations for multibyte decoders.
Correct max length calculations before BOM sniffing has been performed.
Correctly calculate max length when encoding from UTF-16 to GBK.

0.6.9

Don't prepend anything when gb18030 range decode fails. (Spec change.)

0.6.8

Correctly handle the case where the first buffer contains potentially partial BOM and the next buffer is the last buffer.
Decode byte `7F` correctly in ISO-2022-JP.
Make UTF-16 to UTF-8 encode write closer to the end of the buffer.
Implement `Hash` for `Encoding`.

0.6.7

Map half-width katakana to full-width katakana in ISO-2022-JP encoder. (Spec change.)
Give `InputEmpty` correct precedence over `OutputFull` when encoding with replacement and the output buffer passed in is too short or the remaining space in the output buffer is too small after a replacement.

0.6.6

Correct max length calculation when a partial BOM prefix is part of the decoder's state.

0.6.5

Correct max length calculation in various encoders.
Correct max length calculation in the UTF-16 decoder.
Derive `PartialEq` and `Eq` for the `CoderResult`, `DecoderResult` and `EncoderResult` types.

0.6.4

Avoid panic when encoding with replacement and the destination buffer is too short to hold one numeric character reference.

0.6.3

Add support for 32-bit big-endian hosts. (For real this time.)

0.6.2

Fix a panic from subslicing with bad indices in `Encoder::encode_from_utf16`. (Due to an oversight, it lacked the fix that `Encoder::encode_from_utf8` already had.)
Micro-optimize error status accumulation in non-streaming case.

0.6.1

Avoid panic near integer overflow in a case that's unlikely to actually happen.
Address Clippy lints.

0.6.0

Make the methods for computing worst-case buffer size requirements check for integer overflow.
Upgrade rayon to 0.7.0.
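
As an illustration of an overflow-checked worst-case size computation, the sketch below uses the general bound that one UTF-16 code unit never needs more than three UTF-8 bytes (a surrogate pair is two units and four bytes). `max_utf8_len_for_utf16` is a hypothetical name, and the factor is a generic bound rather than the exact formula used by the crate's methods.

```rust
/// Worst-case UTF-8 length for `utf16_len` UTF-16 code units,
/// or `None` if the multiplication would overflow `usize`.
fn max_utf8_len_for_utf16(utf16_len: usize) -> Option<usize> {
    utf16_len.checked_mul(3)
}

fn main() {
    assert_eq!(max_utf8_len_for_utf16(10), Some(30));
    // An absurd length is reported as `None` instead of wrapping around.
    assert_eq!(max_utf8_len_for_utf16(usize::MAX), None);
}
```

Returning `Option` forces the caller to decide what to do about an unrepresentable size before allocating.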

0.5.1

Reorder methods for better documentation readability.
Add support for big-endian hosts. (Only 64-bit case actually tested.)
Optimize the ALU (non-SIMD) case for 32-bit ARM instead of x86_64.

0.5.0

Avoid allocating excessively long buffers in non-streaming decode.
Fix the behavior of ISO-2022-JP and replacement decoders near the end of the output buffer.
Annotate the result structs with `#[must_use]`.

0.4.0

Split FFI into a separate crate.
Performance tweaks.
CJK binary size and encoding performance changes.
Parallelize UTF-8 validation in the case of long buffers (with optional feature `parallel-utf8`).
Borrow even with ISO-2022-JP when possible.

0.3.2

Fix moving pointers to alignment in ALU-based ASCII acceleration.
Fix errors in documentation and improve documentation.

0.3.1

Fix UTF-8 to UTF-16 decode for byte sequences beginning with 0xEE.
Make UTF-8 to UTF-8 decode SSE2-accelerated when feature `simd-accel` is used.
When decoding and encoding ASCII-only input from or to an ASCII-compatible encoding using the non-streaming API, return a borrow of the input.
Make encode from UTF-16 to UTF-8 faster.

0.3

Change the references to the instances of `Encoding` from `const` to `static` to make the referents unique across crates that use the references.
Introduce non-reference-typed `FOO_INIT` instances of `Encoding` to allow foreign crates to initialize `static` arrays with references to `Encoding` instances even under Rust's constraints that prohibit the initialization of `&'static Encoding`-typed array items with `&'static Encoding`-typed `statics`.
Document that the above two points will be reverted if Rust changes `const` to work so that cross-crate usage keeps the referents unique.
Return `Cow`s from Rust-only non-streaming methods for encode and decode.
`Encoding::for_bom()` returns the length of the BOM.
ASCII-accelerated conversions for encodings other than UTF-16LE, UTF-16BE, ISO-2022-JP and x-user-defined.
Add SSE2 acceleration behind the `simd-accel` feature flag. (Requires nightly Rust.)
Fix panic with long bogus labels.
Map `0xCA` to U+05BA in windows-1255. (Spec change.)
Correct the end of the Shift_JIS EUDC range. (Spec change.)

0.2.4

Polish FFI documentation.

0.2.3

Fix UTF-16 to UTF-8 encode.

0.2.2

Add `Encoder.encode_from_utf8_to_vec_without_replacement()`.

0.2.1

Add `Encoding.is_ascii_compatible()`.
Add `Encoding::for_bom()`.
Make `==` for `Encoding` use name comparison instead of pointer comparison, because uses of the encoding constants in different crates result in different addresses and the constant cannot be turned into statics without breaking other things.

0.2.0

The initial release.
