a Python implementation of the Unicode Collation Algorithm
jtauber/pyuca
This is a Python implementation of the Unicode Collation Algorithm (UCA). It passes 100% of the UCA conformance tests for Unicode 5.2.0 (Python 2.7), Unicode 6.3.0 (Python 3.3+), Unicode 8.0.0 (Python 3.5+), Unicode 9.0.0 (Python 3.6+), and Unicode 10.0.0 (Python 3.7+) with a variable-weighting setting of Non-ignorable.
In short, sorting non-English strings properly.
The core of the algorithm involves multi-level comparison. For example, `café` comes before `caff` because at the primary level, the accent is ignored and the first word is treated as if it were `cafe`. The secondary level (which considers accents) then applies only to words that are equivalent at the primary level.
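The two-level comparison described above can be sketched with the standard library alone. This is a toy illustration, not pyuca's implementation and not the full UCA: the helper name `toy_sort_key` is hypothetical, and it assumes that dropping combining marks after NFD normalization is an acceptable stand-in for real primary weights.

```python
import unicodedata


def toy_sort_key(word):
    """Build a (primary, secondary) key; a toy stand-in for UCA levels."""
    primary = []    # base letters only, accents ignored
    secondary = []  # accents attached to each base letter
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and primary:
            # A combining mark belongs to the preceding base letter.
            secondary[-1] += ch
        else:
            primary.append(ch)
            secondary.append("")
    # Tuples compare element by element, so the secondary level is only
    # consulted when the primary levels are equal.
    return (tuple(primary), tuple(secondary))


words = ["cafe", "caff", "café"]
print(sorted(words, key=toy_sort_key))  # ['cafe', 'café', 'caff']
```

Because the key is a tuple, Python's ordinary tuple comparison gives the "primary first, secondary as tie-breaker" behaviour for free.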
The Unicode Collation Algorithm and pyuca also support contraction and expansion. Contraction is where multiple letters are treated as a single unit. In traditional Spanish ordering, `ch` is treated as a letter coming between `c` and `d` so that, for example, words beginning with `ch` sort after all other words beginning with `c`. Expansion is where a single letter is treated as though it were multiple letters. In German phone-book ordering, `ä` is sorted as if it were `ae`, i.e. after `ad` but before `af`.
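Contraction and expansion can each be sketched in a few lines of plain Python. The names below (`UNITS`, `WEIGHT`, `EXPANSIONS`, `contraction_key`, `expansion_key`) are hypothetical and are not part of pyuca; the sketch assumes lowercase ASCII input plus `ä`, and it hard-codes the two rules from the paragraph above rather than reading them from a collation element table.

```python
import string

# Contraction: treat "ch" as one unit that sorts between "c" and "d".
UNITS = list(string.ascii_lowercase)
UNITS.insert(UNITS.index("c") + 1, "ch")
WEIGHT = {unit: i for i, unit in enumerate(UNITS)}

# Expansion: treat "ä" (U+00E4) as if it were the two letters "ae".
EXPANSIONS = {"\u00e4": "ae"}


def contraction_key(word):
    """Map a word to a list of weights, matching "ch" before single letters."""
    key = []
    i = 0
    while i < len(word):
        unit = "ch" if word.startswith("ch", i) else word[i]
        key.append(WEIGHT[unit])
        i += len(unit)
    return key


def expansion_key(word):
    """Replace each expanding letter by its multi-letter equivalent."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in word)


print(sorted(["chico", "cosa", "dama", "cuna"], key=contraction_key))
# ['cosa', 'cuna', 'chico', 'dama']
print(sorted(["af", "\u00e4", "ad"], key=expansion_key))
# ['ad', 'ä', 'af']
```

In the real algorithm both behaviours come out of the collation element table lookup rather than from separate key functions; they are split here only to keep each idea visible.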
Here is how to use the `pyuca` module.

    pip install pyuca

Usage example:

    from pyuca import Collator
    c = Collator()
    assert sorted(["cafe", "caff", "café"]) == ["cafe", "caff", "café"]
    assert sorted(["cafe", "caff", "café"], key=c.sort_key) == ["cafe", "café", "caff"]
`Collator` can also take an optional filename for specifying a custom collation element table.
You can also import collators for specific Unicode versions, e.g. `from pyuca.collator import Collator_8_0_0`. But just `from pyuca import Collator` will ensure that the collator version matches the version of `unicodedata` provided by the standard library for your version of Python.
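To see which Unicode version your interpreter's `unicodedata` module ships with, and therefore which collator version the plain import is expected to line up with, you can inspect `unicodedata.unidata_version`. The helper `unicode_version_tuple` below is a hypothetical convenience, not part of pyuca.

```python
import unicodedata


def unicode_version_tuple():
    """Return the interpreter's Unicode database version as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


# The exact value depends on the Python version in use.
print(unicodedata.unidata_version)
print(unicode_version_tuple())
```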
Tauber, J. K. (2016). pyuca: a Python implementation of the Unicode Collation Algorithm. The Journal of Open Source Software. DOI: 10.21105/joss.00021
Python code is made available under an MIT license (see `LICENSE`). `allkeys.txt` is made available under the similar license defined in `LICENSE-allkeys`.
If you have any problems, questions or suggestions, it's best to file an issue on GitHub, although you can also contact me at jtauber@jtauber.com.
For more of my work on linguistics and Ancient Greek, see http://jktauber.com/.