
pyuca: Python Unicode Collation Algorithm implementation

This is a Python implementation of the Unicode Collation Algorithm (UCA). It passes 100% of the UCA conformance tests for Unicode 5.2.0 (Python 2.7), Unicode 6.3.0 (Python 3.3+), Unicode 8.0.0 (Python 3.5+), Unicode 9.0.0 (Python 3.6+), and Unicode 10.0.0 (Python 3.7+) with a variable-weighting setting of Non-ignorable.

What do you use it for?

In short, sorting non-English strings properly.

The core of the algorithm involves multi-level comparison. For example, café comes before caff because at the primary level, the accent is ignored and the first word is treated as if it were cafe. The secondary level (which considers accents) then applies only to words that are equivalent at the primary level.
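The idea of multi-level comparison can be sketched with the standard library alone. The snippet below is a toy illustration, not pyuca's implementation: `toy_sort_key` is a hypothetical helper that approximates the primary level by stripping combining marks and uses the marks themselves as a secondary tie-breaker. The real algorithm looks up weights in a collation element table instead.

```python
import unicodedata


def toy_sort_key(word):
    """Two-level key: base letters first, accents only as a tie-breaker."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    # Primary level: the base characters, with accents ignored.
    primary = tuple(ch for ch in decomposed if not unicodedata.combining(ch))
    # Secondary level: the accents, compared only when primaries are equal.
    secondary = tuple(ch for ch in decomposed if unicodedata.combining(ch))
    return (primary, secondary)


# "caf\u00e9" is "café" written with an escape to avoid encoding ambiguity.
words = ["cafe", "caff", "caf\u00e9"]
by_code_point = sorted(words)
by_levels = sorted(words, key=toy_sort_key)
```

Sorting by code point places the accented word last, while the two-level key places it between cafe and caff, because the accent is consulted only after the base letters tie.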

The Unicode Collation Algorithm and pyuca also support contraction and expansion. Contraction is where multiple letters are treated as a single unit. In Spanish, ch is treated as a letter coming between c and d so that, for example, words beginning with ch should sort after all other words beginning with c. Expansion is where a single letter is treated as though it were multiple letters. In German, ä is sorted as if it were ae, i.e. after ad but before af.
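A minimal sketch of both ideas follows, using only the standard library. The helpers `contraction_key` and `expansion_key` are hypothetical and hard-code the two examples above for lowercase input. They are not part of pyuca, which derives this behaviour from the collation element table.

```python
import string

# Toy alphabet with "ch" inserted as its own letter between "c" and "d".
ORDER = list(string.ascii_lowercase)
ORDER.insert(ORDER.index("c") + 1, "ch")
RANK = {unit: position for position, unit in enumerate(ORDER)}


def contraction_key(word):
    """Treat "ch" as a single unit (assumes lowercase ASCII input)."""
    key = []
    i = 0
    while i < len(word):
        if word.startswith("ch", i):
            key.append(RANK["ch"])  # two characters, one collation unit
            i += 2
        else:
            key.append(RANK[word[i]])
            i += 1
    return key


def expansion_key(word):
    """Treat "ä" (U+00E4) as if it were the two letters "ae"."""
    return word.replace("\u00e4", "ae")


spanish = sorted(["chico", "cuna", "dedo", "casa"], key=contraction_key)
german = sorted(["af", "\u00e4", "ad"], key=expansion_key)
```

With the contraction, chico sorts after cuna even though h precedes u. With the expansion, ä lands between ad and af.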

How to use it

Here is how to use the pyuca module.

pip install pyuca

Usage example:

    from pyuca import Collator
    c = Collator()
    assert sorted(["cafe", "caff", "café"]) == ["cafe", "caff", "café"]
    assert sorted(["cafe", "caff", "café"], key=c.sort_key) == ["cafe", "café", "caff"]

Collator can also take an optional filename for specifying a custom collation element table.

You can also import collators for specific Unicode versions, e.g. from pyuca.collator import Collator_8_0_0. But just from pyuca import Collator will ensure that the collator version matches the version of unicodedata provided by the standard library for your version of Python.
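To see which Unicode version your interpreter's unicodedata module is built against, you can inspect `unidata_version` from the standard library. This is only a quick check and does not involve pyuca itself. The variable names are illustrative.

```python
import unicodedata

# Version of the Unicode Character Database bundled with this Python build,
# given as a dotted string of three numbers.
unicode_version = unicodedata.unidata_version
major, minor, patch = (int(part) for part in unicode_version.split("."))
print(f"unicodedata uses Unicode {major}.{minor}.{patch}")
```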

How to cite it

Tauber, J. K. (2016). pyuca: a Python implementation of the Unicode Collation Algorithm. The Journal of Open Source Software. DOI: 10.21105/joss.00021

License

Python code is made available under an MIT license (see LICENSE). allkeys.txt is made available under the similar license defined in LICENSE-allkeys.