TextEncoder: encodeInto() method
Baseline: Widely available
This feature is well established and works across many devices and browser versions. It’s been available across browsers since April 2021.
Note: This feature is available in Web Workers.
The TextEncoder.encodeInto() method takes a string to encode and a destination Uint8Array to put the resulting UTF-8 encoded text into, and returns an object indicating the progress of the encoding. This is potentially more performant than the encode() method, especially when the target buffer is a view into a Wasm heap.
Syntax
encodeInto(string, uint8Array)
Parameters
string
A string containing the text to encode.
uint8Array
A Uint8Array object instance to place the resulting UTF-8 encoded text into.
Return value
An object, which contains two members:
read
The number of UTF-16 code units from the source that have been converted to UTF-8. This may be less than string.length if uint8Array did not have enough space.
written
The number of bytes modified in the destination Uint8Array. The bytes written are guaranteed to form complete UTF-8 byte sequences.
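As a minimal sketch of how read and written behave when the destination is too small (the sample strings and buffer sizes are illustrative choices, not part of the API):

```javascript
const encoder = new TextEncoder();

// "h" takes 1 byte in UTF-8 and "é" (U+00E9) takes 2, so a 2-byte buffer
// has room for "h" but not for a complete "é" sequence.
const small = new Uint8Array(2);
const partial = encoder.encodeInto("h\u00e9llo", small);
console.log(partial.read, partial.written); // 1 1

// A character outside the BMP (here U+1F600) is 2 UTF-16 code units
// and 4 UTF-8 bytes.
const emoji = encoder.encodeInto("\u{1F600}", new Uint8Array(4));
console.log(emoji.read, emoji.written); // 2 4
```

In the first call the second byte of the buffer is left untouched, because writing half of a two-byte sequence would not form complete UTF-8.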
Encode into a specific position
encodeInto() always puts its output at the start of the array. However, it is sometimes useful to make the output start at a particular index. The solution is TypedArray.prototype.subarray():
const encoder = new TextEncoder();

function encodeIntoAtPosition(string, u8array, position) {
  return encoder.encodeInto(
    string,
    position ? u8array.subarray(position | 0) : u8array,
  );
}

const u8array = new Uint8Array(8);
encodeIntoAtPosition("hello", u8array, 2);
console.log(u8array.join()); // 0,0,104,101,108,108,111,0
Buffer sizing
To convert a JavaScript string s, the output space needed for full conversion is never less than s.length bytes and never greater than s.length * 3 bytes. The exact UTF-8-to-UTF-16 length ratio for your string depends on the language you are working with:
- For basic English text that uses mostly ASCII characters, the ratio is close to 1.
- For text in scripts using characters U+0080 to U+07FF, which includes Greek, Cyrillic, Hebrew, Arabic, etc., the ratio is about 2.
- For text in scripts using characters U+0800 to U+FFFF, which includes Chinese, Japanese, Korean, etc., the ratio is about 3.
- It's not common for entire scripts to be written in non-BMP characters (although they do exist). These characters are usually math symbols, emojis, historical scripts, etc. The ratio for these characters is 2, because they take 4 bytes in UTF-8 and 2 code units in UTF-16.
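The ratios listed above can be checked with a small sketch. The helper name utf8Ratio and the sample strings are assumptions made for illustration:

```javascript
const encoder = new TextEncoder();

// Hypothetical helper: UTF-8 byte length divided by UTF-16 code unit length.
function utf8Ratio(s) {
  return encoder.encode(s).length / s.length;
}

console.log(utf8Ratio("hello")); // 1 (ASCII: 1 byte per code unit)
console.log(utf8Ratio("привет")); // 2 (Cyrillic: 2 bytes per code unit)
console.log(utf8Ratio("日本語")); // 3 (CJK: 3 bytes per code unit)
console.log(utf8Ratio("\u{1F600}")); // 2 (non-BMP: 4 bytes per 2 code units)
```

Real text usually mixes these ranges (spaces, digits, and punctuation are ASCII), so measured ratios tend to fall between the values shown.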
If the output allocation (typically within the Wasm heap) is expected to be short-lived, it makes sense to allocate s.length * 3 bytes for the output, in which case the first conversion attempt is guaranteed to convert the whole string.
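A minimal sketch of the worst-case allocation follows. The helper name encodeWorstCase is hypothetical, and a plain Uint8Array stands in for a view into the Wasm heap:

```javascript
const encoder = new TextEncoder();

// Hypothetical helper: allocate s.length * 3 bytes so one call always suffices.
function encodeWorstCase(s) {
  const buffer = new Uint8Array(s.length * 3);
  const { read, written } = encoder.encodeInto(s, buffer);
  // read equals s.length here, because the buffer cannot be too small.
  return { read, bytes: buffer.subarray(0, written) };
}

const sample = "日本語 text";
const result = encodeWorstCase(sample);
console.log(result.read === sample.length); // true
console.log(result.bytes.length); // 14 (3 * 3 + 1 + 4)
```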
For example, if your text is primarily English, it is unlikely that long text will exceed s.length * 2 bytes in length. Thus, a more optimistic approach might be to allocate s.length * 2 + 5 bytes, and perform reallocation in the rare circumstance that the optimistic prediction was wrong.
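The optimistic approach might look like the following sketch. The helper name encodeOptimistic is hypothetical, and for simplicity the fallback allocates a fresh worst-case buffer and re-encodes from the start rather than growing the existing allocation:

```javascript
const encoder = new TextEncoder();

// Hypothetical helper: try s.length * 2 + 5 bytes first, and fall back to
// the worst case of s.length * 3 only if the string did not fit.
function encodeOptimistic(s) {
  let buffer = new Uint8Array(s.length * 2 + 5);
  let { read, written } = encoder.encodeInto(s, buffer);
  let reallocated = false;
  if (read < s.length) {
    // Rare path: the prediction was wrong, so redo the conversion with a
    // buffer that is guaranteed to be large enough.
    buffer = new Uint8Array(s.length * 3);
    ({ read, written } = encoder.encodeInto(s, buffer));
    reallocated = true;
  }
  return { bytes: buffer.subarray(0, written), reallocated };
}

const english = encodeOptimistic("mostly ASCII text");
const cjk = encodeOptimistic("日本語".repeat(10));
console.log(english.reallocated); // false
console.log(cjk.reallocated); // true (90 bytes needed, 65 allocated)
```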
If the output is expected to be long-lived, it makes sense to compute the minimum allocation roundUpToBucketSize(s.length) and the maximum allocation size s.length * 3, and to choose a threshold t (as a tradeoff between memory usage and speed) such that if roundUpToBucketSize(s.length) + t >= s.length * 3, you allocate for s.length * 3. Otherwise, first allocate for roundUpToBucketSize(s.length) and convert. If the read item in the return dictionary is s.length, the conversion is done. If not, reallocate the target buffer to written + (s.length - read) * 3 and then convert the rest by taking a substring of s starting from index read and a subbuffer of the target buffer starting from index written.
Above, roundUpToBucketSize() is a function that rounds up to the allocator bucket size. For example, if your Wasm allocator is known to use power-of-two buckets, roundUpToBucketSize() should return the argument if it is a power of two or the next power of two otherwise. If the behavior of the Wasm allocator is unknown, roundUpToBucketSize() should be an identity function.
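The long-lived strategy described above can be sketched as follows. This assumes a power-of-two allocator, uses a plain Uint8Array in place of a Wasm heap allocation, and emulates reallocation by copying into a new array; the helper name encodeLongLived and the threshold value in the usage are hypothetical:

```javascript
const encoder = new TextEncoder();

// Assumed allocator behavior: power-of-two buckets.
function roundUpToBucketSize(n) {
  let size = 1;
  while (size < n) size *= 2;
  return size;
}

// Hypothetical helper implementing the two-step strategy with threshold t.
function encodeLongLived(s, t) {
  const min = roundUpToBucketSize(s.length);
  const max = s.length * 3;
  if (min + t >= max) {
    // The worst case is close enough to the minimum: allocate it directly.
    const whole = new Uint8Array(max);
    return whole.subarray(0, encoder.encodeInto(s, whole).written);
  }
  const buffer = new Uint8Array(min);
  const first = encoder.encodeInto(s, buffer);
  if (first.read === s.length) return buffer.subarray(0, first.written);
  // Reallocate to written + (s.length - read) * 3, keeping the bytes so far.
  const grown = new Uint8Array(first.written + (s.length - first.read) * 3);
  grown.set(buffer.subarray(0, first.written));
  const rest = encoder.encodeInto(
    s.substring(first.read),
    grown.subarray(first.written),
  );
  return grown.subarray(0, first.written + rest.written);
}

const text = "abc" + "日本語".repeat(20);
const out = encodeLongLived(text, 16);
console.log(new TextDecoder().decode(out) === text); // true
```

Taking the substring at index read is safe because read only counts code units that were fully converted, so it never falls in the middle of a surrogate pair.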
If the behavior of your allocator is unknown, you might want to have up to two reallocation steps and make the first reallocation step multiply the remaining unconverted length by two instead of three. However, in that case, it makes sense not to implement the usual multiplying by two of the already written buffer length, because in such a case if a second reallocation happened, it would always over-allocate compared to the original length times three.
The above advice assumes that you don't need to allocate space for a zero terminator. That is, on the Wasm side you are working with Rust strings or a non-zero-terminating C++ class. If you are working with C++ std::string, even though the logical length is shown to you, you need to take the extra terminator byte into account when rounding up to the allocator bucket size. See the next section about C strings.
No zero-termination
If the input string contains the character U+0000, encodeInto() will write a 0x00 byte in the output. encodeInto() does not write a C-style 0x00 sentinel byte after the logical output.
If your Wasm program uses C strings, it's your responsibility to write the 0x00 sentinel, and you can't prevent your Wasm program from seeing a logically truncated string if the JavaScript string contained U+0000. Observe:
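const encoder = new TextEncoder();

function encodeIntoWithSentinel(string, u8array, position) {
  // Write into the part of the buffer that starts at `position` (default 0).
  const target = position ? u8array.subarray(position | 0) : u8array;
  const stats = encoder.encodeInto(string, target);
  // Append the 0x00 sentinel right after the encoded bytes, if there is room.
  // Indexing `target` keeps the sentinel in the right place when `position`
  // is non-zero.
  if (stats.written < target.length) target[stats.written] = 0;
  return stats;
}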
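To illustrate the truncation, here is a minimal sketch in which a hypothetical readCString() helper stands in for C code that reads bytes up to the first 0x00. The buffer is pre-filled with 0xFF so that any zero byte in it comes from the encoded string or from the sentinel written by the caller:

```javascript
const encoder = new TextEncoder();
const decoder = new TextDecoder();

// Hypothetical stand-in for C code: read bytes up to the first 0x00.
function readCString(u8array) {
  const end = u8array.indexOf(0);
  return decoder.decode(u8array.subarray(0, end === -1 ? u8array.length : end));
}

const buffer = new Uint8Array(16).fill(0xff);
const { written } = encoder.encodeInto("ab\u0000cd", buffer);
const byteAfterOutput = buffer[written]; // still 0xff: no sentinel was written
buffer[written] = 0; // the caller writes the sentinel

console.log(written); // 5
console.log(readCString(buffer)); // ab
```

All five bytes are in the buffer, but a reader that stops at the first zero byte only sees the two characters before the embedded U+0000.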
Examples
Encoding into a buffer
<p class="source">This is a sample paragraph.</p>
<p class="result"></p>
const sourcePara = document.querySelector(".source");
const resultPara = document.querySelector(".result");
const string = sourcePara.textContent;

const textEncoder = new TextEncoder();
// The sample text is ASCII-only, so string.length bytes are enough here.
const utf8 = new Uint8Array(string.length);

const encodedResults = textEncoder.encodeInto(string, utf8);
// `read` counts UTF-16 code units; `written` counts UTF-8 bytes.
resultPara.textContent +=
  `Code units read: ${encodedResults.read}` +
  ` | Bytes written: ${encodedResults.written}` +
  ` | Encoded result: ${utf8}`;
Specifications
Specification |
---|
Encoding # ref-for-dom-textencoder-encodeinto① |
See also
- The TextEncoder interface it belongs to.
- TextEncoder.encode()