Truecasing

From Wikipedia, the free encyclopedia

Truecasing, also called capitalization recovery,[1] capitalization correction,[2] or case restoration,[3] is the problem in natural language processing (NLP) of determining the proper capitalization of words where such information is unavailable. This commonly comes up due to the standard practice (in English and many other languages) of automatically capitalizing the first word of a sentence, which hides whether that word would be capitalized elsewhere in a sentence. It can also arise in badly cased or noncased text (for example, all-lowercase or all-uppercase text messages).

Truecasing is unnecessary in languages whose scripts do not have a distinction between uppercase and lowercase letters. This includes all languages not written in the Latin, Greek, Cyrillic or Armenian alphabets, such as Korean, Japanese, Chinese, Thai, Hebrew, Arabic, Hindi, and Georgian.
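
As a rough illustration, a program can check whether a piece of text contains any letters with a case distinction before attempting truecasing. The sketch below assumes Python's built-in Unicode case mappings; the function name has_cased_letters is hypothetical, and the result for a given script depends on the Unicode data shipped with the interpreter.

```python
def has_cased_letters(text: str) -> bool:
    """Return True if any character changes under upper- or lowercasing.

    Hypothetical helper: it relies on Python's Unicode case mappings,
    so results can vary between Unicode database versions.
    """
    return any(ch.lower() != ch.upper() for ch in text)


if __name__ == "__main__":
    for sample in ["Truecasing", "こんにちは", "שלום", "한국어"]:
        print(sample, has_cased_letters(sample))
```

Text for which the check returns False has no capitalization to recover, so a truecasing step could be skipped for it.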

Techniques

  • Neural networks that operate at the word level or the character level have been trained to recover capitalization with greater than 90% accuracy.
  • Sentence segmentation can be used to determine where sentences begin, to implement the rule that the first word of every sentence must be capitalized.
  • Part-of-speech tagging can be used to identify proper nouns (such as Africa, Jupiter, Sarah, or Amazon), which must be capitalized. In some cases, the same word can be used as different parts of speech, and is capitalized differently. For example, Xerox the company, as a noun, is capitalized, but to xerox a document, as a verb, is not capitalized. A xerox, as in the copy of a document, can be recognized by the presence of a determiner, which is not used for proper nouns.
  • Named entity recognition can be used to identify proper nouns, which must be capitalized.
  • A spell checker can be used to identify words that are always capitalized.
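
Two of the rule-based techniques above, capitalizing the first word of each sentence and looking up words that are always capitalized, can be combined in a few lines. The following is a minimal sketch under stated assumptions, not a production truecaser: the lexicon ALWAYS_CAPITALIZED and the function truecase are hypothetical names, sentence boundaries are approximated by the characters ".", "!" and "?", and ambiguous words such as Xerox/xerox are not handled.

```python
import re

# Hypothetical lexicon of words that are always capitalized. A real system
# might derive it from a spell-checker dictionary, a tagged corpus, or NER.
ALWAYS_CAPITALIZED = {
    "africa": "Africa",
    "jupiter": "Jupiter",
    "sarah": "Sarah",
}


def truecase(text: str) -> str:
    """Lowercase the text, restore lexicon words, capitalize sentence starts."""
    # Split into words, punctuation runs, and whitespace runs so that the
    # original spacing is preserved when the tokens are joined again.
    tokens = re.findall(r"\w+|[^\w\s]+|\s+", text.lower())
    out = []
    sentence_start = True
    for tok in tokens:
        if tok.isspace():
            out.append(tok)
        elif not tok[0].isalnum():
            # Punctuation: ., ! or ? is assumed to end a sentence
            # (this naive rule misfires on abbreviations such as "e.g.").
            out.append(tok)
            if any(c in ".!?" for c in tok):
                sentence_start = True
        else:
            word = ALWAYS_CAPITALIZED.get(tok, tok)
            if sentence_start:
                word = word[0].upper() + word[1:]
                sentence_start = False
            out.append(word)
    return "".join(out)


if __name__ == "__main__":
    print(truecase("SARAH FLEW TO AFRICA. she saw jupiter at night!"))
```

A statistical or neural truecaser would replace the fixed lexicon with casing decisions learned from correctly cased text, while the sentence-initial rule could stay the same.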

Applications


Truecasing aids in other NLP tasks, such as named entity recognition (NER), automatic content extraction (ACE), and machine translation.[4] Proper capitalization allows easier detection of proper nouns, which are the starting points of NER and ACE. Some translation systems use statistical machine learning techniques, which could make use of the information contained in capitalization to increase accuracy.


References

  1. Brown, Eric W.; Coden, Anni R. (2002). "Capitalization Recovery for Text". Information Retrieval Techniques for Speech Applications. Lecture Notes in Computer Science. Vol. 2273. pp. 11–22. doi:10.1007/3-540-45637-6_2. ISBN 978-3-540-43156-5.
  2. US patent 7,827,025 B2, Peter K. L. Mau & Dong Yu, "Efficient capitalization through user modeling", issued 2010-11-02, assigned to Microsoft Corporation.
  3. US patent 8,972,855 B2, Zhu Liu; David Gibbon & Behzad Shahraray, "Method and apparatus for providing case restoration", issued 2015-03-03, assigned to AT&T Intellectual Property I, L.P.
  4. Lita, L. V.; Ittycheriah, A.; Roukos, S.; Kambhatla, N. (2003). "tRuEcasIng". Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Sapporo, Japan. pp. 152–159.
Retrieved from "https://en.wikipedia.org/w/index.php?title=Truecasing&oldid=1208685697"