polygonplanet/encoding.js
Convert and detect character encoding in JavaScript.
- Features
- Installation
- Supported encodings
- Example usage
- Demo
- API
  - detect : Detects character encoding
  - convert : Converts character encoding
    - Specify conversion options to the argument `to` as an object
    - Specify the return type by the `type` option
    - Specify handling for unrepresentable characters
      - Replacing characters with HTML entities when they cannot be represented
      - Ignoring characters when they cannot be represented
      - Throwing an Error when they cannot be represented
    - Specify BOM in UTF-16
  - urlEncode : Encodes to percent-encoded string
  - urlDecode : Decodes from percent-encoded string
  - base64Encode : Encodes to Base64 formatted string
  - base64Decode : Decodes from Base64 formatted string
  - codeToString : Converts character code array to string
  - stringToCode : Converts string to character code array
  - Japanese Zenkaku/Hankaku conversion
- Other examples
- Contributing
- License
encoding.js is a JavaScript library for converting and detecting character encodings, supporting both Japanese character encodings (Shift_JIS, EUC-JP, ISO-2022-JP) and Unicode formats (UTF-8, UTF-16).
Since JavaScript string values are internally encoded as UTF-16 code units (ref: ECMAScript® 2019 Language Specification - 6.1.4 The String Type), they cannot directly hold text in other character encodings. encoding.js works around this limitation by treating encoded data as arrays of numbers instead of strings, which enables conversion between different character sets.
Each character encoding is represented as an array of numbers corresponding to character code values; for example, `[130, 160]` represents "あ" in Shift_JIS.
The methods that take arrays of character codes also accept TypedArray objects, such as `Uint8Array`, and `Buffer` in Node.js.
Numeric arrays of character codes can be converted to strings using methods such as `Encoding.codeToString`. However, due to the JavaScript specification mentioned above, data in some character encodings may not be handled properly when converted directly to strings.
If you prefer to use strings instead of numeric arrays, you can convert them to percent-encoded strings, such as `'%82%A0'`, using `Encoding.urlEncode` and `Encoding.urlDecode` for passing to other resources. Similarly, `Encoding.base64Encode` and `Encoding.base64Decode` allow for encoding and decoding to and from Base64, which can then be passed as strings.
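To make the distinction between code units and encoded bytes concrete, here is a minimal sketch that uses only standard JavaScript APIs (`charCodeAt` and `TextEncoder`), not encoding.js itself. The helper names `toCodeUnits` and `toUtf8Bytes` are hypothetical and exist only for this illustration.

```javascript
// Sketch only: standard JavaScript APIs, not part of encoding.js.

// Collect the UTF-16 code units of a string (each value is 0-65535).
function toCodeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i));
  }
  return units;
}

// Encode a string as UTF-8 bytes (each value is 0-255).
function toUtf8Bytes(str) {
  return Array.from(new TextEncoder().encode(str));
}

console.log(toCodeUnits('\u3042')); // [12354]          ('あ' as one UTF-16 code unit)
console.log(toUtf8Bytes('\u3042')); // [227, 129, 130]  ('あ' as three UTF-8 bytes)
```

The same character therefore has different numeric representations depending on the encoding, which is why the library works with arrays of numbers.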
encoding.js is published under the package name `encoding-japanese` on npm.

    npm install --save encoding-japanese

    // Using ES modules
    import Encoding from 'encoding-japanese';

    // Using CommonJS
    const Encoding = require('encoding-japanese');

TypeScript type definitions for encoding.js are available at `@types/encoding-japanese` (thanks to @rhysd).

    npm install --save-dev @types/encoding-japanese

To use encoding.js in a browser environment, you can either install it via npm or download it directly from the release list. The package includes both `encoding.js` and `encoding.min.js`.
Note: Cloning the repository via `git clone` might give you access to the `master` (or `main`) branch, which could still be in a development state.

    <!-- To include the full version -->
    <script src="encoding.js"></script>

    <!-- Or, to include the minified version for production -->
    <script src="encoding.min.js"></script>

When the script is loaded, the object `Encoding` is defined in the global scope (i.e., `window.Encoding`).
You can use encoding.js (package name: `encoding-japanese`) directly from a CDN via a script tag:

    <script src="https://unpkg.com/encoding-japanese@2.2.0/encoding.min.js"></script>

In this example we use unpkg, but you can use any CDN that provides npm packages, for example cdnjs or jsDelivr.
| Value in encoding.js | detect() | convert() | MIME Name (Note) |
|---|---|---|---|
| ASCII | ✓ | | US-ASCII (Code point range: 0-127) |
| BINARY | ✓ | | (Binary string. Code point range: 0-255) |
| EUCJP | ✓ | ✓ | EUC-JP |
| JIS | ✓ | ✓ | ISO-2022-JP |
| SJIS | ✓ | ✓ | Shift_JIS |
| UTF8 | ✓ | ✓ | UTF-8 |
| UTF16 | ✓ | ✓ | UTF-16 |
| UTF16BE | ✓ | ✓ | UTF-16BE (big-endian) |
| UTF16LE | ✓ | ✓ | UTF-16LE (little-endian) |
| UTF32 | ✓ | | UTF-32 |
| UNICODE | ✓ | ✓ | (JavaScript string. *See About `UNICODE` below) |
In encoding.js, `UNICODE` is defined as the internal character encoding that JavaScript strings (JavaScript string objects) can handle directly.
As mentioned in the Features section, JavaScript strings are internally encoded using UTF-16 code units. This means that other character encodings cannot be directly handled without conversion. Therefore, when converting to a character encoding that is properly representable in JavaScript, you should specify `UNICODE`.
(Note: Even if the HTML file's encoding is UTF-8, you should specify `UNICODE` instead of `UTF8` when processing the encoding in JavaScript.)
When using `Encoding.convert`, if you specify a character encoding other than `UNICODE` (such as `UTF8` or `SJIS`), the values in the returned character code array will range from `0-255`. However, if you specify `UNICODE`, the values will range from `0-65535`, which corresponds to the range of values returned by `String.prototype.charCodeAt()` (code units).
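The two value ranges can be illustrated with a small sketch in plain JavaScript. The helpers `isByteArray` and `isCodeUnitArray` are hypothetical names introduced here, not part of the encoding.js API. The example also shows that a character outside the Basic Multilingual Plane occupies two code units (a surrogate pair) in a JavaScript string.

```javascript
// Sketch only: hypothetical helpers, not part of encoding.js.

// True when every value fits in one byte (0-255), as in SJIS or UTF8 arrays.
function isByteArray(codes) {
  return Array.from(codes).every((c) => Number.isInteger(c) && c >= 0 && c <= 0xff);
}

// True when every value is a valid UTF-16 code unit (0-65535), as in UNICODE arrays.
function isCodeUnitArray(codes) {
  return Array.from(codes).every((c) => Number.isInteger(c) && c >= 0 && c <= 0xffff);
}

const sushi = '\u{1F363}'; // the sushi emoji, a single code point above U+FFFF
const units = [sushi.charCodeAt(0), sushi.charCodeAt(1)];

console.log(sushi.length);           // 2 (two code units)
console.log(units);                  // [55356, 57187] (high and low surrogate)
console.log(isByteArray(units));     // false
console.log(isCodeUnitArray(units)); // true
```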
Convert character encoding from JavaScript string (`UNICODE`) to `SJIS`.

    const unicodeArray = Encoding.stringToCode('こんにちは'); // Convert string to code array
    const sjisArray = Encoding.convert(unicodeArray, {
      to: 'SJIS',
      from: 'UNICODE'
    });
    console.log(sjisArray);
    // [130, 177, 130, 241, 130, 201, 130, 191, 130, 205] ('こんにちは' array in SJIS)

Convert character encoding from `SJIS` to `UNICODE`.

    const sjisArray = [
      130, 177, 130, 241, 130, 201, 130, 191, 130, 205
    ]; // 'こんにちは' array in SJIS
    const unicodeArray = Encoding.convert(sjisArray, {
      to: 'UNICODE',
      from: 'SJIS'
    });
    const str = Encoding.codeToString(unicodeArray); // Convert code array to string
    console.log(str); // 'こんにちは'

Detect character encoding.

    const data = [
      227, 129, 147, 227, 130, 147, 227, 129, 171, 227, 129, 161, 227, 129, 175
    ]; // 'こんにちは' array in UTF-8
    const detectedEncoding = Encoding.detect(data);
    console.log(`Character encoding is ${detectedEncoding}`); // 'Character encoding is UTF8'

(Node.js) Example of reading a text file written in `SJIS`.

    const fs = require('fs');
    const Encoding = require('encoding-japanese');

    const sjisBuffer = fs.readFileSync('./sjis.txt');
    const unicodeArray = Encoding.convert(sjisBuffer, {
      to: 'UNICODE',
      from: 'SJIS'
    });
    console.log(Encoding.codeToString(unicodeArray));
- Playground for testing character encoding conversion and detection
- Test run for reading sample files and converting character encodings
- Demo for converting and detecting character encoding by specifying a file
- detect
- convert
- urlEncode
- urlDecode
- base64Encode
- base64Decode
- codeToString
- stringToCode
- Japanese Zenkaku/Hankaku conversion
Encoding.detect (data, [encodings])
Detects the character encoding of the given data.
- data (Array<number>|TypedArray|Buffer|string) : The code array or string whose character encoding is to be detected.
- [encodings] (string|Array<string>|Object) : (Optional) A specific character encoding, or an array of encodings, to limit the detection. Detection is automatic if this argument is omitted or `AUTO` is specified. Supported encoding values can be found in the "Supported encodings" section.
(string|boolean) : Returns a string representing the detected encoding (e.g., `SJIS`, `UTF8`) listed in the "Supported encodings" section, or `false` if the encoding cannot be detected. If the `encodings` argument is provided, it returns the name of the detected encoding if the `data` matches any of the specified encodings, or `false` otherwise.
Example of detecting character encoding.

    const sjisArray = [130, 168, 130, 205, 130, 230]; // 'おはよ' array in SJIS
    const detectedEncoding = Encoding.detect(sjisArray);
    console.log(`Encoding is ${detectedEncoding}`); // 'Encoding is SJIS'

Example of using the `encodings` argument to specify the character encoding to be detected. This returns the detected encoding as a string if the data matches the specified encoding, or `false` otherwise:

    const sjisArray = [130, 168, 130, 205, 130, 230]; // 'おはよ' array in SJIS
    const detectedEncoding = Encoding.detect(sjisArray, 'SJIS');
    if (detectedEncoding) {
      console.log('Encoding is SJIS');
    } else {
      console.log('Encoding does not match SJIS');
    }

Example of specifying multiple encodings:

    const sjisArray = [130, 168, 130, 205, 130, 230]; // 'おはよ' array in SJIS
    const detectedEncoding = Encoding.detect(sjisArray, ['UTF8', 'SJIS']);
    if (detectedEncoding) {
      console.log(`Encoding is ${detectedEncoding}`); // 'Encoding is SJIS'
    } else {
      console.log('Encoding does not match UTF8 and SJIS');
    }
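For intuition about what byte-level detection involves, the sketch below checks whether a byte array is structurally valid UTF-8. This is not the detection algorithm used by encoding.js; `looksLikeUtf8` is a hypothetical helper, and it is a simplified structural check only (it does not reject every overlong or surrogate sequence, and pure ASCII input also passes).

```javascript
// Sketch only: a simplified UTF-8 structure check, not encoding.js's detector.
function looksLikeUtf8(bytes) {
  let i = 0;
  while (i < bytes.length) {
    const b = bytes[i];
    let extra; // number of continuation bytes expected after the lead byte
    if (b <= 0x7f) extra = 0;
    else if (b >= 0xc2 && b <= 0xdf) extra = 1;
    else if (b >= 0xe0 && b <= 0xef) extra = 2;
    else if (b >= 0xf0 && b <= 0xf4) extra = 3;
    else return false; // not a valid lead byte
    if (i + extra >= bytes.length) return false; // sequence is truncated
    for (let k = 1; k <= extra; k++) {
      // Continuation bytes must have the bit pattern 10xxxxxx
      if ((bytes[i + k] & 0xc0) !== 0x80) return false;
    }
    i += extra + 1;
  }
  return true;
}

// 'こんにちは' in UTF-8
const utf8Sample = [227, 129, 147, 227, 130, 147, 227, 129, 171, 227, 129, 161, 227, 129, 175];
// 'おはよ' in SJIS
const sjisSample = [130, 168, 130, 205, 130, 230];

console.log(looksLikeUtf8(utf8Sample)); // true
console.log(looksLikeUtf8(sjisSample)); // false (130 is not a valid UTF-8 lead byte)
```

A real detector has to weigh several candidate encodings against each other, which is why limiting the candidates with the `encodings` argument can help on short inputs.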
Encoding.convert (data, to, [from])
Converts the character encoding of the given data.
- data (Array<number>|TypedArray|Buffer|string) : The code array or string whose character encoding is to be converted.
- to (string|Object) : The character encoding name of the conversion destination as a string, or conversion options as an object.
- [from] (string|Array<string>) : (Optional) The character encoding name of the conversion source as a string, or an array of encoding names. Detection is automatic if this argument is omitted or `AUTO` is specified. Supported encoding values can be found in the "Supported encodings" section.
(Array<number>|TypedArray|string) : Returns a numeric character code array of the converted character encoding if `data` is an array or a buffer, or returns the converted string if `data` is a string.
Example of converting a character code array to Shift_JIS from UTF-8:

    const utf8Array = [227, 129, 130]; // 'あ' in UTF-8
    const sjisArray = Encoding.convert(utf8Array, 'SJIS', 'UTF8');
    console.log(sjisArray); // [130, 160] ('あ' in SJIS)

A TypedArray such as `Uint8Array`, or a `Buffer` in Node.js, can be converted in the same way:

    const utf8Array = new Uint8Array([227, 129, 130]);
    const sjisArray = Encoding.convert(utf8Array, 'SJIS', 'UTF8');

Convert character encoding by auto-detecting the encoding name of the source:

    // The character encoding is automatically detected when the argument `from` is omitted
    const utf8Array = [227, 129, 130];
    let sjisArray = Encoding.convert(utf8Array, 'SJIS');
    // Or explicitly specify 'AUTO' to auto-detect
    sjisArray = Encoding.convert(utf8Array, 'SJIS', 'AUTO');
You can pass the second argument `to` as an object to improve readability. The options described below, such as `type`, `fallback`, and `bom`, can only be specified in this object form.

    const utf8Array = [227, 129, 130];
    const sjisArray = Encoding.convert(utf8Array, {
      to: 'SJIS',
      from: 'UTF8'
    });

`convert` returns an array by default, but you can change the return type by specifying the `type` option. If the argument `data` is passed as a string and the `type` option is not specified, then `type: 'string'` is assumed (the result is returned as a string).

    const sjisArray = [130, 168, 130, 205, 130, 230]; // 'おはよ' array in SJIS
    const unicodeString = Encoding.convert(sjisArray, {
      to: 'UNICODE',
      from: 'SJIS',
      type: 'string' // Specify 'string' to return as string
    });
    console.log(unicodeString); // 'おはよ'

The following `type` options are supported.
- string : Return as a string.
- arraybuffer : Return as an ArrayBuffer (actually returns a `Uint16Array` for historical reasons).
- array : Return as an Array. (default)
`type: 'string'` can be used as a shorthand for converting a code array to a string, as performed by `Encoding.codeToString`.
Note: Specifying `type: 'string'` may not handle conversions properly, except when converting to `UNICODE`.
With the `fallback` option, you can specify how to handle characters that cannot be represented in the target encoding. The `fallback` option supports the following values:
- html-entity : Replace characters with HTML entities (decimal HTML numeric character references).
- html-entity-hex : Replace characters with HTML entities (hexadecimal HTML numeric character references).
- ignore : Ignore characters that cannot be represented.
- error : Throw an error if any character cannot be represented.
Characters that cannot be represented in the target character set are replaced with '?' (U+003F) by default, but by specifying `html-entity` as the `fallback` option, you can replace them with HTML entities (numeric character references), such as `&#127843;`.
Example of specifying the `{ fallback: 'html-entity' }` option:

    const unicodeArray = Encoding.stringToCode('寿司🍣ビール🍺');
    // No fallback specified
    let sjisArray = Encoding.convert(unicodeArray, {
      to: 'SJIS',
      from: 'UNICODE'
    });
    console.log(sjisArray); // Converted to a code array of '寿司?ビール?'

    // Specify `fallback: html-entity`
    sjisArray = Encoding.convert(unicodeArray, {
      to: 'SJIS',
      from: 'UNICODE',
      fallback: 'html-entity'
    });
    console.log(sjisArray); // Converted to a code array of '寿司&#127843;ビール&#127866;'

Example of specifying the `{ fallback: 'html-entity-hex' }` option:

    const unicodeArray = Encoding.stringToCode('ホッケの漢字は𩸽');
    const sjisArray = Encoding.convert(unicodeArray, {
      to: 'SJIS',
      from: 'UNICODE',
      fallback: 'html-entity-hex'
    });
    console.log(sjisArray); // Converted to a code array of 'ホッケの漢字は&#x29e3d;'
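The numeric character references produced by these fallbacks are derived from the Unicode code point of the character. The sketch below shows that derivation in plain JavaScript. The helpers `toNumericReference` and `replaceUnrepresentable` are hypothetical, and the `isBmp` predicate is only a stand-in for "representable in the target encoding": a real Shift_JIS check requires the encoding's mapping table, which this sketch does not include.

```javascript
// Sketch only: hypothetical helpers, not the encoding.js implementation.

// Build a decimal or hexadecimal numeric character reference for one character.
function toNumericReference(char, hex = false) {
  const codePoint = char.codePointAt(0);
  return hex ? `&#x${codePoint.toString(16)};` : `&#${codePoint};`;
}

// Replace each character rejected by `isRepresentable` with a numeric reference.
function replaceUnrepresentable(str, isRepresentable, hex = false) {
  let out = '';
  for (const char of str) { // for...of iterates by code point, not by code unit
    out += isRepresentable(char) ? char : toNumericReference(char, hex);
  }
  return out;
}

// Stand-in predicate (assumption): treat only code points up to U+FFFF as representable.
const isBmp = (char) => char.codePointAt(0) <= 0xffff;

console.log(replaceUnrepresentable('寿司\u{1F363}ビール\u{1F37A}', isBmp));
// '寿司&#127843;ビール&#127866;'
console.log(toNumericReference('\u{29E3D}', true)); // '&#x29e3d;'
```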
By specifying `ignore` as the `fallback` option, characters that cannot be represented in the target encoding are ignored.
Example of specifying the `{ fallback: 'ignore' }` option:

    const unicodeArray = Encoding.stringToCode('寿司🍣ビール🍺');
    // No fallback specified
    let sjisArray = Encoding.convert(unicodeArray, {
      to: 'SJIS',
      from: 'UNICODE'
    });
    console.log(sjisArray); // Converted to a code array of '寿司?ビール?'

    // Specify `fallback: ignore`
    sjisArray = Encoding.convert(unicodeArray, {
      to: 'SJIS',
      from: 'UNICODE',
      fallback: 'ignore'
    });
    console.log(sjisArray); // Converted to a code array of '寿司ビール'

If you need an error to be thrown when a character cannot be represented in the target character encoding, specify `error` as the `fallback` option. This causes an exception to be thrown.
Example of specifying the `{ fallback: 'error' }` option:

    const unicodeArray = Encoding.stringToCode('おにぎり🍙ラーメン🍜');
    try {
      const sjisArray = Encoding.convert(unicodeArray, {
        to: 'SJIS',
        from: 'UNICODE',
        fallback: 'error' // Specify 'error' to throw an exception
      });
    } catch (e) {
      console.error(e); // Error: Character cannot be represented: [240, 159, 141, 153]
    }
You can add a BOM (byte order mark) by specifying the `bom` option when converting to `UTF16`. The default is no BOM.

    const utf8Array = [227, 129, 130]; // 'あ' in UTF-8
    const utf16Array = Encoding.convert(utf8Array, {
      to: 'UTF16',
      from: 'UTF8',
      bom: true // Specify to add the BOM
    });

`UTF16` byte order is big-endian by default. If you want to convert as little-endian, specify the `{ bom: 'LE' }` option.

    const utf16leArray = Encoding.convert(utf8Array, {
      to: 'UTF16',
      from: 'UTF8',
      bom: 'LE' // Specify to add the BOM as little-endian
    });

If you do not need a BOM, use `UTF16BE` or `UTF16LE`. `UTF16BE` is big-endian, `UTF16LE` is little-endian, and neither has a BOM.

    const utf16beArray = Encoding.convert(utf8Array, {
      to: 'UTF16BE',
      from: 'UTF8'
    });
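To show what the byte order and the BOM look like at the byte level, here is a minimal sketch that serializes a JavaScript string to UTF-16 bytes by hand. `toUtf16Bytes` and its `littleEndian` and `bom` options are hypothetical names for this illustration, not the encoding.js API.

```javascript
// Sketch only: a hand-written UTF-16 serializer, not part of encoding.js.
function toUtf16Bytes(str, { littleEndian = false, bom = false } = {}) {
  const bytes = [];
  // Append one 16-bit code unit in the chosen byte order.
  const pushUnit = (unit) => {
    const hi = unit >> 8;
    const lo = unit & 0xff;
    if (littleEndian) {
      bytes.push(lo, hi);
    } else {
      bytes.push(hi, lo);
    }
  };
  if (bom) pushUnit(0xfeff); // the BOM is U+FEFF written in the same byte order
  for (let i = 0; i < str.length; i++) {
    pushUnit(str.charCodeAt(i));
  }
  return bytes;
}

console.log(toUtf16Bytes('\u3042'));                                    // [48, 66]
console.log(toUtf16Bytes('\u3042', { bom: true }));                     // [254, 255, 48, 66]
console.log(toUtf16Bytes('\u3042', { bom: true, littleEndian: true })); // [255, 254, 66, 48]
```

A reader can tell the byte order from the first two bytes: `254, 255` indicates big-endian and `255, 254` indicates little-endian.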
Encoding.urlEncode (data)
Encodes a numeric character code array into a percent-encoded string formatted as a URI component in `%xx` format.
urlEncode escapes all characters except the following, just like `encodeURIComponent()`.

    A-Z a-z 0-9 - _ . ! ~ * ' ( )

- data (Array<number>|TypedArray|Buffer|string) : The numeric character code array or string that will be encoded into a percent-encoded URI component.
(string) : Returns a percent-encoded string formatted as a URI component in `%xx` format.
Example of URL encoding a Shift_JIS array:

    const sjisArray = [130, 168, 130, 205, 130, 230]; // 'おはよ' array in SJIS
    const encoded = Encoding.urlEncode(sjisArray);
    console.log(encoded); // '%82%A8%82%CD%82%E6'
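The percent-encoding rule described above can be sketched in a few lines of plain JavaScript. `percentEncodeBytes` is a hypothetical helper written for this illustration, and it assumes every input value is a byte in the range 0-255.

```javascript
// Sketch only: hypothetical helper, assumes each value is a byte (0-255).

// Characters left unescaped, matching the list used by encodeURIComponent().
const UNRESERVED = /[A-Za-z0-9\-_.!~*'()]/;

function percentEncodeBytes(bytes) {
  let out = '';
  for (const b of bytes) {
    const ch = String.fromCharCode(b);
    if (b < 0x80 && UNRESERVED.test(ch)) {
      out += ch; // unreserved ASCII passes through unchanged
    } else {
      out += '%' + b.toString(16).toUpperCase().padStart(2, '0');
    }
  }
  return out;
}

console.log(percentEncodeBytes([130, 168, 130, 205, 130, 230])); // '%82%A8%82%CD%82%E6'
console.log(percentEncodeBytes([97, 32, 98]));                   // 'a%20b'
```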
Encoding.urlDecode (string)
Decodes a percent-encoded string formatted as a URI component in `%xx` format to a numeric character code array.
- string (string) : The string to decode.
(Array<number>) : Returns a numeric character code array.
Example of decoding a percent-encoded Shift_JIS string:

    const encoded = '%82%A8%82%CD%82%E6'; // 'おはよ' encoded as percent-encoded SJIS string
    const sjisArray = Encoding.urlDecode(encoded);
    console.log(sjisArray); // [130, 168, 130, 205, 130, 230]
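The reverse direction can be sketched the same way. `percentDecodeToBytes` is a hypothetical helper, not the library's implementation; it assumes the input consists of ASCII characters and `%xx` escapes, and it copies a `%` that is not followed by two hex digits through as a literal character.

```javascript
// Sketch only: hypothetical helper, assumes ASCII input with %xx escapes.
function percentDecodeToBytes(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    const hex = str.slice(i + 1, i + 3);
    if (str[i] === '%' && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16)); // decode the two hex digits into one byte
      i += 2;                        // skip the digits just consumed
    } else {
      bytes.push(str.charCodeAt(i)); // literal character
    }
  }
  return bytes;
}

console.log(percentDecodeToBytes('%82%A8%82%CD%82%E6')); // [130, 168, 130, 205, 130, 230]
console.log(percentDecodeToBytes('a%20b'));              // [97, 32, 98]
```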
Encoding.base64Encode (data)
Encodes a numeric character code array into a Base64 encoded string.
- data (Array<number>|TypedArray|Buffer|string) : The numeric character code array or string to encode.
(string) : Returns a Base64 encoded string.
Example of Base64 encoding a Shift_JIS array:

    const sjisArray = [130, 168, 130, 205, 130, 230]; // 'おはよ' array in SJIS
    const encodedStr = Encoding.base64Encode(sjisArray);
    console.log(encodedStr); // 'gqiCzYLm'

Encoding.base64Decode (string)
Decodes a Base64 encoded string to a numeric character code array.
- string (string) : The Base64 encoded string to decode.
(Array<number>) : Returns a Base64 decoded numeric character code array.
Example of `base64Encode` and `base64Decode`:

    const sjisArray = [130, 177, 130, 241, 130, 201, 130, 191, 130, 205]; // 'こんにちは' array in SJIS
    const encodedStr = Encoding.base64Encode(sjisArray);
    console.log(encodedStr); // 'grGC8YLJgr+CzQ=='

    const decodedArray = Encoding.base64Decode(encodedStr);
    console.log(decodedArray); // [130, 177, 130, 241, 130, 201, 130, 191, 130, 205]
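For comparison, the same byte-to-Base64 round trip can be done in Node.js with the built-in `Buffer`, without encoding.js. This sketch is Node.js only (browsers do not provide `Buffer`), and the helper names `bytesToBase64` and `base64ToBytes` are hypothetical.

```javascript
// Sketch only: Node.js built-in Buffer, not part of encoding.js.

// Encode an array of bytes (0-255) as a Base64 string.
function bytesToBase64(bytes) {
  return Buffer.from(bytes).toString('base64');
}

// Decode a Base64 string back into a plain array of bytes.
function base64ToBytes(base64) {
  return Array.from(Buffer.from(base64, 'base64'));
}

const sjisBytes = [130, 168, 130, 205, 130, 230]; // 'おはよ' in SJIS
const base64 = bytesToBase64(sjisBytes);
console.log(base64);                // 'gqiCzYLm'
console.log(base64ToBytes(base64)); // [130, 168, 130, 205, 130, 230]
```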
Encoding.codeToString (code)
Converts a numeric character code array to a string.
- code (Array<number>|TypedArray|Buffer) : The numeric character code array to convert.
(string) : Returns a converted string.
Example of converting a character code array to a string:

    const sjisArray = [130, 168, 130, 205, 130, 230]; // 'おはよ' array in SJIS
    const unicodeArray = Encoding.convert(sjisArray, {
      to: 'UNICODE',
      from: 'SJIS'
    });
    const unicodeStr = Encoding.codeToString(unicodeArray);
    console.log(unicodeStr); // 'おはよ'
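Converting a code unit array to a string in plain JavaScript is usually done with `String.fromCharCode`. The sketch below is one common way to do it, not necessarily how encoding.js does it; `codeUnitsToString` and its `chunkSize` parameter are hypothetical. It processes the array in chunks because passing a very large array to `String.fromCharCode.apply` at once can exceed the engine's argument limit.

```javascript
// Sketch only: hypothetical helper, not the encoding.js implementation.
function codeUnitsToString(codes, chunkSize = 8192) {
  let out = '';
  for (let i = 0; i < codes.length; i += chunkSize) {
    // slice.call also works on TypedArray and Buffer inputs and yields a plain Array
    const chunk = Array.prototype.slice.call(codes, i, i + chunkSize);
    out += String.fromCharCode.apply(null, chunk);
  }
  return out;
}

console.log(codeUnitsToString([12362, 12399, 12424]));                  // 'おはよ'
console.log(codeUnitsToString(new Uint16Array([12362, 12399, 12424]))); // 'おはよ'
```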
Encoding.stringToCode (string)
Converts a string to a numeric character code array.
- string (string) : The string to convert.
(Array<number>) : Returns a numeric character code array converted from the string.
Example of converting a string to a character code array:

    const unicodeArray = Encoding.stringToCode('おはよ');
    console.log(unicodeArray); // [12362, 12399, 12424]
The following methods convert Japanese full-width (zenkaku) and half-width (hankaku) characters, and are suitable for use with `UNICODE` strings or numeric character code arrays of `UNICODE`.
Each method returns a converted string if the argument `data` is a string, or a numeric character code array if the argument `data` is a code array.
- Encoding.toHankakuCase (data) : Converts full-width (zenkaku) symbols and alphanumeric characters to their half-width (hankaku) equivalents.
- Encoding.toZenkakuCase (data) : Converts half-width (hankaku) symbols and alphanumeric characters to their full-width (zenkaku) equivalents.
- Encoding.toHiraganaCase (data) : Converts full-width katakana to full-width hiragana.
- Encoding.toKatakanaCase (data) : Converts full-width hiragana to full-width katakana.
- Encoding.toHankanaCase (data) : Converts full-width katakana to half-width katakana.
- Encoding.toZenkanaCase (data) : Converts half-width katakana to full-width katakana.
- Encoding.toHankakuSpace (data) : Converts the em space (U+3000) to the single space (U+0020).
- Encoding.toZenkakuSpace (data) : Converts the single space (U+0020) to the em space (U+3000).

- data (Array<number>|TypedArray|Buffer|string) : The string or numeric character code array to convert.
(Array<number>|string) : Returns a converted string or numeric character code array.
Example of converting zenkaku and hankaku strings:

    console.log(Encoding.toHankakuCase('ａｂｃＤＥＦ１２３＠！＃＊＝')); // 'abcDEF123@!#*='
    console.log(Encoding.toZenkakuCase('abcDEF123@!#*=')); // 'ａｂｃＤＥＦ１２３＠！＃＊＝'
    console.log(Encoding.toHiraganaCase('アイウエオァィゥェォヴボポ')); // 'あいうえおぁぃぅぇぉゔぼぽ'
    console.log(Encoding.toKatakanaCase('あいうえおぁぃぅぇぉゔぼぽ')); // 'アイウエオァィゥェォヴボポ'
    console.log(Encoding.toHankanaCase('アイウエオァィゥェォヴボポ')); // 'ｱｲｳｴｵｧｨｩｪｫｳﾞﾎﾞﾎﾟ'
    console.log(Encoding.toZenkanaCase('ｱｲｳｴｵｧｨｩｪｫｳﾞﾎﾞﾎﾟ')); // 'アイウエオァィゥェォヴボポ'
    console.log(Encoding.toHankakuSpace('あいうえお　abc　123')); // 'あいうえお abc 123'
    console.log(Encoding.toZenkakuSpace('あいうえお abc 123')); // 'あいうえお　abc　123'

Example of converting zenkaku and hankaku code arrays:

    const unicodeArray = Encoding.stringToCode('ａｂｃ１２３！＃　あいうアイウ　ABCｱｲｳ');
    console.log(Encoding.codeToString(Encoding.toHankakuCase(unicodeArray)));
    // 'abc123!#　あいうアイウ　ABCｱｲｳ'
    console.log(Encoding.codeToString(Encoding.toZenkakuCase(unicodeArray)));
    // 'ａｂｃ１２３！＃　あいうアイウ　ＡＢＣｱｲｳ'
    console.log(Encoding.codeToString(Encoding.toHiraganaCase(unicodeArray)));
    // 'ａｂｃ１２３！＃　あいうあいう　ABCｱｲｳ'
    console.log(Encoding.codeToString(Encoding.toKatakanaCase(unicodeArray)));
    // 'ａｂｃ１２３！＃　アイウアイウ　ABCｱｲｳ'
    console.log(Encoding.codeToString(Encoding.toHankanaCase(unicodeArray)));
    // 'ａｂｃ１２３！＃　あいうｱｲｳ　ABCｱｲｳ'
    console.log(Encoding.codeToString(Encoding.toZenkanaCase(unicodeArray)));
    // 'ａｂｃ１２３！＃　あいうアイウ　ABCアイウ'
    console.log(Encoding.codeToString(Encoding.toHankakuSpace(unicodeArray)));
    // 'ａｂｃ１２３！＃ あいうアイウ ABCｱｲｳ'
    console.log(Encoding.codeToString(Encoding.toZenkakuSpace(unicodeArray)));
    // 'ａｂｃ１２３！＃　あいうアイウ　ABCｱｲｳ'
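Several of these conversions follow from fixed offsets between Unicode blocks. The sketch below illustrates the idea in plain JavaScript for the simple ranges only; it is not the encoding.js implementation, the function names are hypothetical, and it does not cover half-width katakana or voiced sound marks, which need a lookup table.

```javascript
// Sketch only: offset-based conversions for the simple ranges, not encoding.js itself.

// Fullwidth forms U+FF01-U+FF5E sit exactly 0xFEE0 above ASCII U+0021-U+007E.
function fullwidthToHalfwidth(str) {
  return str.replace(/[\uFF01-\uFF5E]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) - 0xfee0));
}

function halfwidthToFullwidth(str) {
  return str.replace(/[\u0021-\u007E]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) + 0xfee0));
}

// Hiragana U+3041-U+3096 maps to katakana U+30A1-U+30F6 by adding 0x60.
function hiraganaToKatakana(str) {
  return str.replace(/[\u3041-\u3096]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) + 0x60));
}

console.log(fullwidthToHalfwidth('\uFF21\uFF22\uFF23\uFF11')); // 'ABC1' (input is fullwidth "ABC1")
console.log(halfwidthToFullwidth('A1') === '\uFF21\uFF11');    // true
console.log(hiraganaToKatakana('あいう'));                       // 'アイウ'
```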
This example reads a text file encoded in Shift_JIS as binary data, and displays it as a string after converting it to Unicode using `Encoding.convert`.

    (async () => {
      try {
        const response = await fetch('shift_jis.txt');
        const buffer = await response.arrayBuffer();

        // Code array with Shift_JIS file contents
        const sjisArray = new Uint8Array(buffer);

        // Convert encoding to UNICODE (JavaScript Code Units) from Shift_JIS
        const unicodeArray = Encoding.convert(sjisArray, {
          to: 'UNICODE',
          from: 'SJIS'
        });

        // Convert to string from code array for display
        const unicodeString = Encoding.codeToString(unicodeArray);
        console.log(unicodeString);
      } catch (error) {
        console.error('Error loading the file:', error);
      }
    })();

XMLHttpRequest version of this example:

    const req = new XMLHttpRequest();
    req.open('GET', 'shift_jis.txt', true);
    req.responseType = 'arraybuffer';

    req.onload = (event) => {
      const buffer = req.response;
      if (buffer) {
        // Code array with Shift_JIS file contents
        const sjisArray = new Uint8Array(buffer);

        // Convert encoding to UNICODE (JavaScript Code Units) from Shift_JIS
        const unicodeArray = Encoding.convert(sjisArray, {
          to: 'UNICODE',
          from: 'SJIS'
        });

        // Convert to string from code array for display
        const unicodeString = Encoding.codeToString(unicodeArray);
        console.log(unicodeString);
      }
    };

    req.send(null);

This example uses the File API to read the content of a selected file, detects its character encoding, and converts the file content to `UNICODE` from any character encoding such as `Shift_JIS` or `EUC-JP`. The converted content is then displayed in a textarea.

    <input type="file" id="file">
    <div id="encoding"></div>
    <textarea id="content" rows="5" cols="80"></textarea>

    <script>
    function onFileSelect(event) {
      const file = event.target.files[0];

      const reader = new FileReader();
      reader.onload = function(e) {
        const codes = new Uint8Array(e.target.result);

        const detectedEncoding = Encoding.detect(codes);
        const encoding = document.getElementById('encoding');
        encoding.textContent = `Detected encoding: ${detectedEncoding}`;

        // Convert encoding to UNICODE
        const unicodeString = Encoding.convert(codes, {
          to: 'UNICODE',
          from: detectedEncoding,
          type: 'string'
        });
        document.getElementById('content').value = unicodeString;
      };

      reader.readAsArrayBuffer(file);
    }

    document.getElementById('file').addEventListener('change', onFileSelect);
    </script>
We welcome contributions from everyone. For bug reports and feature requests, please create an issue on GitHub.
Before submitting a pull request, please run `npm run test` to ensure there are no errors. We only accept pull requests that pass all tests.
This project is licensed under the terms of the MIT license. See the LICENSE file for details.