Truecasing, also called capitalization recovery,[1] capitalization correction,[2] or case restoration,[3] is the problem in natural language processing (NLP) of determining the proper capitalization of words where such information is unavailable. This commonly comes up due to the standard practice (in English and many other languages) of automatically capitalizing the first word of a sentence. It can also arise in badly cased or noncased text (for example, all-lowercase or all-uppercase text messages).
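As an illustration of the task, one simple baseline is to record the most frequent casing of each word in correctly cased text, ignoring sentence-initial words (whose capitalization is uninformative), and then apply those casings to uncased input. The Python sketch below assumes this baseline; it is not taken from the cited sources, the names `train_casing_model` and `truecase` are hypothetical, and the tokenization and sentence splitting are deliberately naive.

```python
import re
from collections import Counter, defaultdict

WORD = re.compile(r"\w+")
SENTENCE_END = ".!?"


def train_casing_model(cased_text):
    """Map each lowercased word to its most frequent casing in the text.

    Sentence-initial words are skipped, because they are capitalized
    regardless of whether they are proper nouns.
    """
    counts = defaultdict(Counter)
    sentence_start = True
    for token in re.findall(r"\w+|[.!?]", cased_text):
        if token in SENTENCE_END:
            sentence_start = True
        else:
            if not sentence_start:
                counts[token.lower()][token] += 1
            sentence_start = False
    return {word: forms.most_common(1)[0][0] for word, forms in counts.items()}


def truecase(text, model):
    """Restore casing using the model; unknown words are lowercased."""
    pieces = []
    sentence_start = True
    # Alternating runs of word and non-word characters, so that joining
    # the pieces preserves the original spacing and punctuation.
    for token in re.findall(r"\w+|\W+", text):
        if WORD.fullmatch(token):
            word = model.get(token.lower(), token.lower())
            if sentence_start:
                word = word[0].upper() + word[1:]
                sentence_start = False
            pieces.append(word)
        else:
            if any(ch in SENTENCE_END for ch in token):
                sentence_start = True
            pieces.append(token)
    return "".join(pieces)


corpus = ("We met John in Paris. Later John flew to London. "
          "The trip to Paris was short.")
model = train_casing_model(corpus)
print(truecase("john flew to paris. the trip was short.", model))
# John flew to Paris. The trip was short.
```

A baseline of this kind cannot resolve words whose casing depends on context (for example, a word that is both a common noun and a name), which is one reason statistical approaches are used in practice.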
Truecasing is unnecessary in languages whose scripts do not have a distinction between uppercase and lowercase letters. This includes most languages not written in the Latin, Greek, Cyrillic, or Armenian alphabets, such as Korean, Japanese, Chinese, Thai, Hebrew, Arabic, Hindi, and Georgian.
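Whether a given text contains any cased letters can be checked programmatically. The following Python sketch assumes that comparing each character's lowercase and uppercase forms, using the Unicode case mappings behind Python's built-in string methods, is an adequate test; the function name `has_case_distinction` is hypothetical, and results depend on the Unicode data shipped with the interpreter.

```python
def has_case_distinction(text):
    """Return True if any character in the text has distinct cases."""
    return any(ch.lower() != ch.upper() for ch in text)


print(has_case_distinction("Truecasing"))  # True (Latin)
print(has_case_distinction("Русский"))     # True (Cyrillic)
print(has_case_distinction("한국어"))       # False (Korean)
print(has_case_distinction("עברית"))       # False (Hebrew)
```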
Truecasing aids in other NLP tasks, such as named entity recognition (NER), automatic content extraction (ACE), and machine translation.[4] Proper capitalization allows easier detection of proper nouns, which are the starting points of NER and ACE. Some translation systems use statistical machine learning techniques, which could make use of the information contained in capitalization to increase accuracy.