chomechome/charamel


Truly Universal Encoding Detection in Python

Charamel is a pure Python universal character encoding detection library that supports all of Python's character encodings. The library is based on machine learning and trained to handle more than 60 languages. All that with no external dependencies. Ain't it sweet? 🍭

Installation

$ pip install charamel

Features

  • 🌈 Powered by machine learning
  • 📦 No dependencies
  • ⚡ Faster than other pure Python libraries
  • 🐍 Supports all 98 Python encodings
  • 🌍 Works on 60+ languages
  • 🔎 97% accuracy

Usage

The API is centered around the Detector class, whose detect method handles basic encoding detection:

>>> from charamel import Detector
>>> detector = Detector()
>>> content = b'El espa\xf1ol o castellano del lat\xedn hablado'
>>> detector.detect(content)
<Encoding.ISO_8859_14: 'iso8859_14'>

This returns the most likely encoding that can decode the byte string. Let's try it out:

>>> from charamel import Encoding
>>> content.decode(Encoding.ISO_8859_14)
'El español o castellano del latín hablado'

To get multiple likely encodings along with confidences in the range [0, 1], use the probe method:

>>> detector.probe(content, top=3)
[(<Encoding.ISO_8859_14: 'iso8859_14'>, 0.9964286725192874),
 (<Encoding.CP_1258: 'cp1258'>, 0.9919203166700203),
 (<Encoding.ISO_8859_3: 'iso8859_3'>, 0.9915028923264849)]
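The three confidences above are very close, so it can be useful to walk the candidate list in order rather than trust only the top entry. Below is a minimal sketch of that idea. The helper `first_decodable` is hypothetical and not part of Charamel; the candidate list is copied from the output above as plain codec names so that the snippet runs with the standard library alone.

```python
# Candidates copied from the probe() output above, as plain codec names.
CANDIDATES = [
    ("iso8859_14", 0.9964286725192874),
    ("cp1258", 0.9919203166700203),
    ("iso8859_3", 0.9915028923264849),
]


def first_decodable(content, candidates):
    """Return (codec_name, text) for the first candidate that decodes cleanly.

    `candidates` is a list of (codec_name, confidence) pairs, highest
    confidence first. Returns None if no candidate can decode `content`.
    Hypothetical helper for illustration; not part of Charamel.
    """
    for name, _confidence in candidates:
        try:
            return name, content.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Undecodable bytes or an unknown codec name: try the next one.
            continue
    return None


content = b'El espa\xf1ol o castellano del lat\xedn hablado'
result = first_decodable(content, CANDIDATES)
print(result)
```

A clean decode does not prove the encoding is the right one, since many single-byte encodings accept any byte sequence. It only rules out candidates that cannot decode the content at all.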

Detector can be configured to use a subset of encodings. Fewer candidate encodings lead to faster detection:

>>> detector = Detector(encodings=[Encoding.UTF_8, Encoding.BIG_5])

Another useful Detector parameter is min_confidence. It controls how conservative the Detector is: the confidence of any encoding returned by the detect and probe methods must be greater than min_confidence:

>>> detector = Detector(min_confidence=0.5)

If no encoding's confidence exceeds min_confidence, detect returns None and probe returns an empty list.
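Because detect can return None, calling code needs a plan for that case. The sketch below shows one possible approach, assuming a UTF-8 fallback with replacement characters is acceptable. `decode_or_fallback` and the two stand-in detectors are hypothetical names used for illustration; with Charamel installed, the `detect` argument could be the detect method of a configured Detector.

```python
def decode_or_fallback(content, detect, fallback="utf-8"):
    """Decode `content` with the detected encoding, or with `fallback`.

    `detect` is any callable that takes bytes and returns a codec name
    or None, for example `Detector(min_confidence=0.5).detect`.
    Hypothetical helper for illustration; not part of Charamel.
    """
    encoding = detect(content)
    if encoding is None:
        # No encoding exceeded min_confidence: substitute U+FFFD for
        # undecodable bytes so the caller still gets text back.
        return content.decode(fallback, errors="replace")
    return content.decode(encoding)


# Stand-in detectors so the sketch runs without Charamel installed.
def confident(_content):
    return "iso8859_14"


def unsure(_content):
    return None


content = b'El espa\xf1ol o castellano del lat\xedn hablado'
print(decode_or_fallback(content, confident))
print(decode_or_fallback(content, unsure))
```

Whether a lossy fallback is appropriate depends on the application. Raising an error or skipping the file may be the better choice when the text must be exact.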

Benchmark

Below is a comparison between Charamel and other available Python encoding detection libraries:

| Detector | Supported Encodings | Sec / File (Mean) | Sec / File (99%) | Sec / File (Max) | KB / Sec | Accuracy | Accuracy on Supported |
|---|---|---|---|---|---|---|---|
| Chardet v3.0.4 | 26 | 0.029259 | 0.416156 | 3.11522 | 220 | 61% | 97% |
| Cchardet v2.1.6 | 40 | 0.000383 | 0.003913 | 0.062855 | 16811 | 67% | 79% |
| Charset-Normalizer v1.3.4 | 89 | 0.126674 | 0.502882 | 1.41848 | 51 | 77% | 78% |
| Charamel v1.0.0 | 98 | 0.009053 | 0.04277 | 0.120667 | 712 | 97% | 97% |
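To put the timing column in perspective, the mean seconds per file can be turned into ratios against Charamel. This is a small arithmetic sketch using only the figures from the table above; the function name `slowdown_vs` is made up for illustration.

```python
# Mean seconds per file, copied from the benchmark table above.
MEAN_SEC_PER_FILE = {
    "Chardet v3.0.4": 0.029259,
    "Cchardet v2.1.6": 0.000383,
    "Charset-Normalizer v1.3.4": 0.126674,
    "Charamel v1.0.0": 0.009053,
}


def slowdown_vs(timings, baseline):
    """Return {name: time / baseline_time}.

    A value above 1 means that detector takes longer per file than the
    baseline; a value below 1 means it is faster.
    """
    base = timings[baseline]
    return {name: seconds / base for name, seconds in timings.items()}


ratios = slowdown_vs(MEAN_SEC_PER_FILE, "Charamel v1.0.0")
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:.2f}x Charamel's mean time per file")
```

On these numbers Chardet takes roughly 3 times and Charset-Normalizer roughly 14 times as long per file as Charamel, while Cchardet, which wraps a native library rather than being pure Python, is considerably faster than all three.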

How to run this benchmark (requires Python 3.6+):

$ git clone git@github.com:chomechome/charamel.git
$ cd charamel
$ pip install "poetry>=1.0.5"
$ make benchmark

It also produces a detailed breakdown for all represented encodings:

* - not officially supported by the detector

| Encoding | Total | Chardet v3.0.4 | Cchardet v2.1.6 | Charset-Normalizer v1.3.4 | Charamel v1.0.0 |
|---|---|---|---|---|---|
| ascii | 8 | 7 (88%) | 8 (100%) | 7 (88%) | 8 (100%) |
| big5 | 33 | 33 (100%) | 33 (100%) | 32 (97%) | 31 (94%) |
| big5hkscs | 9 | 6 (67%) * | 6 (67%) * | 8 (89%) | 9 (100%) |
| cp037 | 14 | 0 (0%) * | 0 (0%) * | 12 (86%) | 14 (100%) |
| cp1006 | 4 | 4 (100%) * | 4 (100%) * | 4 (100%) * | 4 (100%) |
| cp1026 | 14 | 0 (0%) * | 0 (0%) * | 10 (71%) | 14 (100%) |
| cp1125 | 5 | 4 (80%) * | 4 (80%) * | 5 (100%) | 5 (100%) |
| cp1140 | 14 | 0 (0%) * | 0 (0%) * | 12 (86%) | 14 (100%) |
| cp1250 | 23 | 7 (30%) * | 22 (96%) | 11 (48%) | 23 (100%) |
| cp1251 | 45 | 44 (98%) | 45 (100%) | 45 (100%) | 45 (100%) |
| cp1252 | 36 | 36 (100%) | 30 (83%) | 18 (50%) | 36 (100%) |
| cp1253 | 6 | 4 (67%) | 6 (100%) | 6 (100%) | 6 (100%) |
| cp1254 | 16 | 15 (94%) * | 13 (81%) * | 12 (75%) | 16 (100%) |
| cp1255 | 29 | 29 (100%) | 29 (100%) | 29 (100%) | 29 (100%) |
| cp1256 | 8 | 6 (75%) * | 7 (88%) | 8 (100%) | 8 (100%) |
| cp1257 | 13 | 7 (54%) * | 10 (77%) | 6 (46%) | 13 (100%) |
| cp1258 | 15 | 14 (93%) * | 12 (80%) * | 12 (80%) | 15 (100%) |
| cp273 | 14 | 0 (0%) * | 0 (0%) * | 7 (50%) | 14 (100%) |
| cp424 | 4 | 0 (0%) * | 0 (0%) * | 4 (100%) | 4 (100%) |
| cp437 | 11 | 4 (36%) * | 4 (36%) * | 9 (82%) | 11 (100%) |
| cp500 | 14 | 0 (0%) * | 0 (0%) * | 7 (50%) | 14 (100%) |
| cp720 | 6 | 4 (67%) * | 4 (67%) * | 6 (100%) * | 6 (100%) |
| cp737 | 4 | 4 (100%) * | 4 (100%) * | 4 (100%) * | 4 (100%) |
| cp775 | 11 | 4 (36%) * | 4 (36%) * | 8 (73%) | 11 (100%) |
| cp850 | 14 | 4 (29%) * | 4 (29%) * | 11 (79%) | 14 (100%) |
| cp852 | 14 | 4 (29%) * | 12 (86%) | 6 (43%) | 14 (100%) |
| cp855 | 26 | 26 (100%) | 26 (100%) | 26 (100%) | 26 (100%) |
| cp856 | 4 | 4 (100%) * | 4 (100%) * | 4 (100%) * | 4 (100%) |
| cp857 | 14 | 4 (29%) * | 4 (29%) * | 11 (79%) | 14 (100%) |
| cp858 | 14 | 4 (29%) * | 4 (29%) * | 11 (79%) | 14 (100%) |
| cp860 | 7 | 4 (57%) * | 4 (57%) * | 6 (86%) | 7 (100%) |
| cp861 | 9 | 4 (44%) * | 4 (44%) * | 8 (89%) | 9 (100%) |
| cp862 | 4 | 4 (100%) * | 4 (100%) * | 4 (100%) | 4 (100%) |
| cp863 | 7 | 4 (57%) * | 4 (57%) * | 6 (86%) | 7 (100%) |
| cp864 | 4 | 4 (100%) * | 4 (100%) * | 4 (100%) | 4 (100%) |
| cp865 | 12 | 4 (33%) * | 4 (33%) * | 10 (83%) | 12 (100%) |
| cp866 | 23 | 23 (100%) | 23 (100%) | 23 (100%) | 23 (100%) |
| cp869 | 4 | 4 (100%) * | 4 (100%) * | 4 (100%) | 4 (100%) |
| cp874 | 8 | 6 (75%) * | 7 (88%) * | 8 (100%) * | 8 (100%) |
| cp875 | 4 | 0 (0%) * | 0 (0%) * | 3 (75%) * | 4 (100%) |
| cp932 | 11 | 11 (100%) | 8 (73%) * | 11 (100%) | 9 (82%) |
| cp949 | 6 | 6 (100%) * | 6 (100%) | 6 (100%) | 6 (100%) |
| cp950 | 6 | 6 (100%) * | 6 (100%) * | 6 (100%) | 6 (100%) |
| euc_jis_2004 | 29 | 8 (28%) * | 8 (28%) * | 20 (69%) | 29 (100%) |
| euc_jisx0213 | 29 | 8 (28%) * | 8 (28%) * | 20 (69%) | 29 (100%) |
| euc_jp | 56 | 39 (70%) | 38 (68%) | 53 (95%) | 56 (100%) |
| euc_kr | 38 | 38 (100%) | 38 (100%) | 37 (97%) | 38 (100%) |
| gb18030 | 48 | 6 (12%) * | 47 (98%) | 33 (69%) | 48 (100%) |
| gb2312 | 26 | 25 (96%) | 24 (92%) * | 23 (88%) | 26 (100%) |
| gbk | 10 | 5 (50%) * | 9 (90%) * | 9 (90%) | 10 (100%) |
| hz | 6 | 6 (100%) | 6 (100%) | 5 (83%) | 6 (100%) |
| iso2022_jp | 10 | 10 (100%) | 10 (100%) | 9 (90%) | 10 (100%) |
| iso2022_jp_1 | 26 | 8 (31%) * | 8 (31%) * | 25 (96%) | 26 (100%) |
| iso2022_jp_2 | 29 | 8 (28%) * | 8 (28%) * | 28 (97%) | 29 (100%) |
| iso2022_jp_2004 | 21 | 8 (38%) * | 8 (38%) * | 20 (95%) | 21 (100%) |
| iso2022_jp_3 | 21 | 8 (38%) * | 8 (38%) * | 20 (95%) | 21 (100%) |
| iso2022_jp_ext | 26 | 8 (31%) * | 8 (31%) * | 25 (96%) | 26 (100%) |
| iso2022_kr | 8 | 8 (100%) | 8 (100%) | 8 (100%) | 8 (100%) |
| iso8859_10 | 14 | 9 (64%) * | 13 (93%) | 7 (50%) | 14 (100%) |
| iso8859_11 | 9 | 6 (67%) * | 8 (89%) * | 9 (100%) | 8 (89%) |
| iso8859_13 | 16 | 7 (44%) * | 14 (88%) | 6 (38%) | 16 (100%) |
| iso8859_14 | 14 | 14 (100%) * | 11 (79%) * | 12 (86%) | 14 (100%) |
| iso8859_15 | 18 | 14 (78%) * | 14 (78%) | 12 (67%) | 18 (100%) |
| iso8859_16 | 13 | 8 (62%) * | 11 (85%) | 7 (54%) | 13 (100%) |
| iso8859_2 | 28 | 7 (25%) * | 27 (96%) | 16 (57%) | 28 (100%) |
| iso8859_3 | 13 | 10 (77%) * | 10 (77%) | 9 (69%) | 13 (100%) |
| iso8859_4 | 15 | 9 (60%) * | 14 (93%) | 7 (47%) | 15 (100%) |
| iso8859_5 | 39 | 39 (100%) | 39 (100%) | 39 (100%) | 39 (100%) |
| iso8859_6 | 6 | 4 (67%) * | 6 (100%) | 6 (100%) | 6 (100%) |
| iso8859_7 | 17 | 16 (94%) | 17 (100%) | 17 (100%) | 17 (100%) |
| iso8859_8 | 5 | 5 (100%) | 5 (100%) | 4 (80%) | 5 (100%) |
| iso8859_9 | 18 | 14 (78%) * | 15 (83%) | 13 (72%) | 18 (100%) |
| johab | 5 | 4 (80%) * | 4 (80%) * | 5 (100%) | 5 (100%) |
| koi8_r | 26 | 26 (100%) | 26 (100%) | 26 (100%) | 26 (100%) |
| koi8_t | 4 | 4 (100%) * | 4 (100%) * | 4 (100%) * | 4 (100%) |
| koi8_u | 5 | 4 (80%) * | 4 (80%) * | 4 (80%) * | 5 (100%) |
| kz1048 | 5 | 4 (80%) * | 4 (80%) * | 5 (100%) | 5 (100%) |
| latin_1 | 29 | 29 (100%) | 26 (90%) | 24 (83%) | 29 (100%) |
| mac_cyrillic | 25 | 25 (100%) | 25 (100%) | 23 (92%) | 25 (100%) |
| mac_greek | 7 | 4 (57%) * | 4 (57%) * | 6 (86%) | 7 (100%) |
| mac_iceland | 15 | 4 (27%) * | 4 (27%) * | 9 (60%) | 15 (100%) |
| mac_latin2 | 16 | 4 (25%) * | 11 (69%) * | 6 (38%) | 16 (100%) |
| mac_roman | 15 | 4 (27%) * | 4 (27%) * | 9 (60%) | 15 (100%) |
| mac_turkish | 15 | 4 (27%) * | 4 (27%) * | 9 (60%) | 15 (100%) |
| ptcp154 | 5 | 4 (80%) * | 4 (80%) * | 5 (100%) | 5 (100%) |
| shift_jis | 40 | 40 (100%) | 40 (100%) | 38 (95%) | 40 (100%) |
| shift_jis_2004 | 21 | 8 (38%) * | 8 (38%) * | 15 (71%) | 21 (100%) |
| shift_jisx0213 | 21 | 8 (38%) * | 8 (38%) * | 15 (71%) | 21 (100%) |
| tis_620 | 13 | 12 (92%) | 12 (92%) * | 13 (100%) | 13 (100%) |
| utf_16 | 40 | 40 (100%) | 40 (100%) * | 33 (82%) | 40 (100%) |
| utf_16_be | 42 | 0 (0%) * | 0 (0%) | 35 (83%) | 30 (71%) |
| utf_16_le | 43 | 0 (0%) * | 0 (0%) | 35 (81%) | 37 (86%) |
| utf_32 | 42 | 42 (100%) | 42 (100%) * | 22 (52%) | 41 (98%) |
| utf_32_be | 41 | 0 (0%) * | 0 (0%) | 20 (49%) | 27 (66%) |
| utf_32_le | 40 | 0 (0%) * | 0 (0%) | 20 (50%) | 28 (70%) |
| utf_7 | 40 | 4 (10%) * | 4 (10%) * | 20 (50%) | 39 (98%) |
| utf_8 | 101 | 100 (99%) | 100 (99%) | 78 (77%) | 101 (100%) |
| utf_8_sig | 42 | 42 (100%) * | 42 (100%) * | 0 (0%) * | 42 (100%) |
