
Normality is a Python micro-package that contains a small set of text normalization functions for easier re-use. These functions accept a snippet of Unicode or UTF-8 encoded text and remove various classes of characters, such as diacritics, punctuation etc. This is useful as a preparation to further text analysis.
WARNING: This library works much better when used in combination with pyicu, a Python binding for the International Components for Unicode C library. ICU provides much better text transliteration than the default text-unidecode.
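To see whether the optional ICU binding is present in a given environment, a check along the following lines may help. This is only a sketch: the helper name `has_icu` is made up for illustration, and it assumes that pyicu installs its bindings under the module name `icu`. It says nothing about how normality itself selects its backend.

```python
import importlib.util


def has_icu():
    """Return True if a top-level module named 'icu' can be found."""
    # find_spec returns None when the module is not installed
    return importlib.util.find_spec("icu") is not None


if __name__ == "__main__":
    if has_icu():
        print("pyicu appears to be installed")
    else:
        print("pyicu not found; the text-unidecode fallback would apply")
```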
    # coding: utf-8
    from normality import normalize, slugify, collapse_spaces

    text = normalize('Nie wieder "Grüne Süppchen" kochen!')
    assert text == 'nie wieder grune suppchen kochen'

    slug = slugify('My first blog post!')
    assert slug == 'my-first-blog-post'

    text = 'this\n\n\r\nhas\tlots of\nodd spacing.'
    assert collapse_spaces(text) == 'this has lots of odd spacing.'
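For intuition about what this kind of normalization involves, the steps can be roughly approximated with the standard library alone. The sketch below is not normality's implementation; the function names `strip_diacritics` and `rough_normalize` are hypothetical, and the approach (NFKD decomposition plus filtering by Unicode category) is a simplification chosen for illustration.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def rough_normalize(text):
    """Lowercase, strip diacritics, drop punctuation, collapse whitespace."""
    text = strip_diacritics(text.lower())
    kept = []
    for ch in text:
        # Keep letters (L*) and numbers (N*); turn everything else into a space
        category = unicodedata.category(ch)
        kept.append(ch if category[0] in ("L", "N") else " ")
    # split() without arguments collapses any run of whitespace
    return " ".join("".join(kept).split())


if __name__ == "__main__":
    print(rough_normalize('Nie wieder "Grüne Süppchen" kochen!'))
```

This approach only removes marks that Unicode can decompose. Characters without a decomposition, such as "ß", and text in non-Latin scripts pass through unchanged, which is the gap a transliteration backend is meant to fill.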
normality is open source, licensed under a standard MIT license (included in this repository as LICENSE).