Linguistic categories

From Wikipedia, the free encyclopedia
(Redirected from ISO 12620)
Ontology for descriptive linguistics

Linguistic categories include

The definition of linguistic categories is a major concern of linguistic theory, and thus the definition and naming of categories vary across theoretical frameworks and across grammatical traditions for different languages. The operationalization of linguistic categories in lexicography, computational linguistics, natural language processing, corpus linguistics, and terminology management typically requires resource-, problem- or application-specific definitions of linguistic categories. In cognitive linguistics it has been argued that linguistic categories have a prototype structure like that of the categories of common words in a language.[1]

Linguistic category inventories


To facilitate interoperability between lexical resources, linguistic annotations and annotation tools, and to handle linguistic categories systematically across theoretical frameworks, a number of inventories of linguistic categories have been developed and are in use; examples are given below. The practical objective of such inventories is to perform quantitative evaluation (for language-specific inventories), to train NLP tools, or to facilitate cross-linguistic evaluation, querying or annotation of language data. At a theoretical level, the existence of universal categories in human language has been postulated, e.g., in universal grammar, but has also been heavily criticized.

Part-of-Speech tagsets

Main article: Part-of-speech tagging § Tag sets

Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, there are clearly many more categories and sub-categories. For nouns, the plural, possessive, and singular forms can be distinguished. In many languages words are also marked for their case (role as subject, object, etc.), grammatical gender, and so on, while verbs are marked for tense, aspect, and other properties. In some tagging systems, different inflections of the same root word get different parts of speech, resulting in a large number of tags. For example, the Brown Corpus tag set uses NN for singular common nouns, NNS for plural common nouns, and NP for singular proper nouns (see the POS tags used in the Brown Corpus). Other tagging systems use a smaller number of tags and either ignore fine differences or model them as features somewhat independent from part-of-speech.[2]
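The difference between the two designs can be illustrated with a small sketch. It takes the three fine-grained tags named above and splits each into a coarse tag plus a feature. The coarse labels (NOUN, PROPN) and the Number feature are assumptions chosen for illustration, not a mapping prescribed by any of the tag sets discussed here.

```python
# Illustrative mapping only: the fine-grained tags come from the text above,
# while the coarse labels and the Number feature are assumptions of this sketch.
FINE_TO_COARSE = {
    "NN": ("NOUN", {"Number": "Sing"}),    # singular common noun
    "NNS": ("NOUN", {"Number": "Plur"}),   # plural common noun
    "NP": ("PROPN", {"Number": "Sing"}),   # singular proper noun
}


def decompose(tag):
    """Split a fine-grained tag into (coarse tag, feature dict)."""
    coarse, features = FINE_TO_COARSE[tag]
    return coarse, dict(features)  # copy so callers cannot mutate the table


for tag in ("NN", "NNS", "NP"):
    print(tag, decompose(tag))
```

With this decomposition, NN and NNS share one coarse tag and differ only in a feature value, which is how the smaller tag sets keep the inventory compact without discarding the distinction.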

In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English. POS tagging work has been done in a variety of languages, and the set of POS tags used varies greatly with language. Tags usually are designed to include overt morphological distinctions, although this leads to inconsistencies such as case-marking for pronouns but not nouns in English, and much larger cross-language differences. The tag sets for heavily inflected languages such as Greek and Latin can be very large; tagging words in agglutinative languages such as Inuit languages may be virtually impossible. Work on stochastic methods for tagging Koine Greek (DeRose 1990) has used over 1,000 parts of speech and found that about as many words were ambiguous in that language as in English. A morphosyntactic descriptor in the case of morphologically rich languages is commonly expressed using very short mnemonics, such as ncmsan for category = noun, type = common, gender = masculine, number = singular, case = accusative, animate = no.
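A positional descriptor of this kind can be expanded mechanically. The following is a minimal sketch that decodes the example ncmsan. The position order and the value codes are reconstructed solely from that one example and are therefore assumptions; real tag sets define their own tables, and the function name decode_msd is hypothetical.

```python
# Hypothetical positional layout, reconstructed from the single example in the
# text (ncmsan); real tag sets define their own positions and value codes.
CATEGORIES = {"n": "noun"}
NOUN_ATTRIBUTES = [
    ("type", {"c": "common"}),
    ("gender", {"m": "masculine"}),
    ("number", {"s": "singular"}),
    ("case", {"a": "accusative"}),
    ("animate", {"n": "no"}),
]


def decode_msd(msd):
    """Expand a positional descriptor such as 'ncmsan' into named features."""
    if not msd or msd[0] not in CATEGORIES:
        raise ValueError(f"unknown category in {msd!r}")
    features = {"category": CATEGORIES[msd[0]]}
    # Each later character is interpreted according to its position.
    for code, (name, values) in zip(msd[1:], NOUN_ATTRIBUTES):
        if code not in values:
            raise ValueError(f"unknown {name} code {code!r} in {msd!r}")
        features[name] = values[code]
    return features


print(decode_msd("ncmsan"))
```

Note that the letter n means "noun" in the first position and "no" in the last one: a code is only meaningful together with its position, which is what keeps the mnemonics so short.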

The most popular tag set for POS tagging for American English is probably the Penn tag set, developed in the Penn Treebank project.

Multilingual annotation schemes

For Universal Dependencies corpora, see Universal Dependencies.

For Western European languages, cross-linguistically applicable annotation schemes for parts-of-speech, morphosyntax and syntax have been developed with the EAGLES Guidelines. The "Expert Advisory Group on Language Engineering Standards" (EAGLES) was an initiative of the European Commission that ran within the DG XIII Linguistic Research and Engineering programme from 1994 to 1998, coordinated by Consorzio Pisa Ricerche, Pisa, Italy. The EAGLES guidelines provide guidance for markup to be used with text corpora, particularly for identifying features relevant in computational linguistics and lexicography. Numerous companies, research centres, universities and professional bodies across the European Union collaborated to produce the EAGLES Guidelines, which set out recommendations for de facto standards and rules of best practice for:[3]

  • Large-scale language resources (such as text corpora, computational lexicons and speech corpora);
  • Means of manipulating such knowledge, via computational linguistic formalisms, markup languages and various software tools;
  • Means of assessing and evaluating resources, tools and products.

The EAGLES guidelines have also inspired subsequent work for other regions, e.g., Eastern Europe.[4]

A generation later, a similar effort was initiated by the research community under the umbrella of Universal Dependencies. Petrov et al.[5][6] proposed a "universal" but highly reductionist tag set with 12 categories (for example, no subtypes of nouns, verbs, punctuation, etc.; no distinction of "to" as an infinitive marker vs. preposition, hardly a "universal" coincidence). Subsequently, this was complemented with cross-lingual specifications for dependency syntax (Stanford Dependencies)[7] and morphosyntax (Interset interlingua,[8] partially building on the Multext-East/EAGLES tradition) in the context of Universal Dependencies (UD), an international cooperative project to create treebanks of the world's languages with cross-linguistically applicable ("universal") annotations for parts of speech, dependency syntax, and (optionally) morphosyntactic (morphological) features. Core applications are automated text processing in the field of natural language processing (NLP) and research into natural language syntax and grammar, especially within linguistic typology.

The annotation scheme thus has its roots in three related projects: the universal part-of-speech tag set, Stanford Dependencies, and Interset. The UD annotation scheme represents syntax as dependency trees rather than phrase structure trees. As of February 2019, there were just over 100 treebanks of more than 70 languages available in the UD inventory.[9] The project's primary aim is to achieve cross-linguistic consistency of annotation. However, language-specific extensions are permitted for morphological features (individual languages or resources can introduce additional features). In a more restricted form, dependency relations can be extended with a secondary label that accompanies the UD label, e.g., aux:pass for an auxiliary (UD aux) used to mark passive voice.[10]
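Because the secondary label is attached to the universal label with a colon, a tool that only needs the cross-linguistically consistent part can strip it off. The sketch below shows one way to do this; the function names are hypothetical and the code assumes nothing beyond the label format shown in the aux:pass example.

```python
def split_relation(label):
    """Split a relation label into (universal part, optional subtype)."""
    universal, _, subtype = label.partition(":")
    return universal, (subtype or None)


def same_universal_relation(a, b):
    """Compare two labels while ignoring language-specific subtypes."""
    return split_relation(a)[0] == split_relation(b)[0]


print(split_relation("aux:pass"))                  # ('aux', 'pass')
print(split_relation("aux"))                       # ('aux', None)
print(same_universal_relation("aux:pass", "aux"))  # True
```

Comparing only the universal part lets treebanks with and without language-specific subtypes be queried in the same way.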

The Universal Dependencies have inspired similar efforts for the areas of inflectional morphology,[11] frame semantics[12] and coreference.[13] For phrase structure syntax, a comparable effort does not seem to exist, but the specifications of the Penn Treebank have been applied to (and extended for) a broad range of languages,[14] e.g., Icelandic,[15] Old English,[16] Middle English,[17] Middle Low German,[18] Early Modern High German,[19] Yiddish,[20] Portuguese,[21] Japanese,[22] Arabic[23] and Chinese.[24]

Conventions for interlinear glosses

Main articles: Interlinear gloss and List of glossing abbreviations

In linguistics, an interlinear gloss is a gloss (series of brief explanations, such as definitions or pronunciations) placed between lines (inter- + linear), such as between a line of original text and its translation into another language. When glossed, each line of the original text acquires one or more lines of transcription known as an interlinear text or interlinear glossed text (IGT), or interlinear for short. Such glosses help the reader follow the relationship between the source text and its translation, and the structure of the original language. There is no standard inventory for glosses, but common labels are collected in the Leipzig Glossing Rules.[25] Wikipedia also provides a List of glossing abbreviations that draws on this and other sources.
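The word-by-word layout of an interlinear text can be sketched in a few lines. The example below pads each word and its gloss to a common column width so that they line up vertically. The sample sentence is an invented English illustration, the labels DEF, PL and PST are commonly used glossing abbreviations, and the function name align_interlinear is hypothetical.

```python
def align_interlinear(words, glosses):
    """Return two lines in which each word sits above its gloss."""
    if len(words) != len(glosses):
        raise ValueError("each word needs exactly one gloss")
    # Each column is as wide as the longer of the word and its gloss.
    widths = [max(len(w), len(g)) for w, g in zip(words, glosses)]
    top = "  ".join(w.ljust(n) for w, n in zip(words, widths)).rstrip()
    bottom = "  ".join(g.ljust(n) for g, n in zip(glosses, widths)).rstrip()
    return top, bottom


# Invented sample: morpheme boundaries are marked with hyphens in both lines.
words = ["the", "dog-s", "bark-ed"]
glosses = ["DEF", "dog-PL", "bark-PST"]
for line in align_interlinear(words, glosses):
    print(line)
```

Keeping the same number of hyphens in a word and in its gloss preserves the morpheme-by-morpheme correspondence that makes the gloss readable.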

General Ontology for Linguistic Description (GOLD)


GOLD ("General Ontology for Linguistic Description") is an ontology for descriptive linguistics. It gives a formalized account of the most basic categories and relations used in the scientific description of human language, e.g., as a formalization of interlinear glosses. GOLD was first introduced by Farrar and Langendoen (2003).[26] Originally, it was envisioned as a solution to the problem of resolving disparate markup schemes for linguistic data, in particular data from endangered languages. However, GOLD is much more general and can be applied to all languages. In this function, GOLD overlaps with the ISO 12620 Data Category Registry (ISOcat); it is, however, more stringently structured.

GOLD was maintained by the LINGUIST List and others from 2007 to 2010.[27] The RELISH project created a mirror of the 2010 edition of GOLD as a Data Category Selection within ISOcat. As of 2018, GOLD data remained an important terminology hub in the context of the Linguistic Linked Open Data cloud, but because it is no longer actively maintained, its function is increasingly taken over by OLiA (for linguistic annotation, building on GOLD and ISOcat) and lexinfo.net (for dictionary metadata, building on ISOcat).

ISO 12620 (ISO TC37 Data Category Registry, ISOcat)


ISO 12620 is a standard from ISO/TC 37 that defines a Data Category Registry, a registry for linguistic terms used in various fields of translation, computational linguistics and natural language processing, and for mappings both between different terms and between different systems in which the same terms are used.[28][29][30]

An earlier implementation of this standard, ISOcat, provides persistent identifiers and URIs for linguistic categories, including the inventory of the GOLD ontology (see above). The goal of the registry is that new systems can reuse existing terminology, or at least be easily mapped to existing terminology, to aid interoperability.[31] The standard is used by other standards such as the Lexical Markup Framework (ISO 24613:2008), and a number of terminologies have been added to the registry, including the EAGLES guidelines, the National Corpus of Polish, and the TermBase eXchange format from the Localization Industry Standards Association.

However, the 2019 edition, ISO 12620:2019,[32] no longer provides a registry of terms for language technology and is now restricted to terminology resources, hence the revised title "Management of terminology resources – Data category specifications". Accordingly, ISOcat is no longer actively developed.[33] As of May 2020, the successor systems CLARIN Concept Registry[34] and DatCatInfo[35] were emerging.

For linguistic categories relevant to lexical resources, the lexinfo vocabulary represents an established community standard,[36] in particular in connection with the OntoLex vocabulary and machine-readable dictionaries in the context of Linguistic Linked Open Data technologies. Just as the OntoLex vocabulary builds on the Lexical Markup Framework (LMF), lexinfo builds on (the LMF section of) ISOcat.[37] Unlike ISOcat, however, lexinfo is actively maintained and, as of May 2020, was being extended in a community effort.[38]

Ontologies of Linguistic Annotation (OLiA)


Similar in spirit to GOLD, the Ontologies of Linguistic Annotation (OLiA) provide a reference inventory of linguistic categories for syntactic, morphological and semantic phenomena relevant for linguistic annotation and linguistic corpora in the form of an ontology. In addition, they also provide machine-readable annotation schemes for more than 100 languages, linked with the OLiA reference model.[39] The OLiA ontologies represent a major hub of annotation terminology in the (Linguistic) Linked Open Data cloud, with applications for search, retrieval and machine learning over heterogeneously annotated language resources.[37]

In addition to annotation schemes, the OLiA Reference Model is also linked with the EAGLES Guidelines,[40] GOLD,[40] ISOcat,[41] the CLARIN Concept Registry,[42] Universal Dependencies,[43] lexinfo,[43] etc., and thus enables interoperability between these vocabularies. OLiA is being developed as a community project on GitHub.[44]
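The idea of linking scheme-specific tags to a shared reference model can be sketched without any ontology tooling. In the example below, tags from two schemes are linked to common reference categories, so that one query retrieves matching tokens from differently annotated corpora. The reference category names and the sample tokens are placeholders for illustration and are not identifiers taken from OLiA; an actual setup would use the published ontologies.

```python
# Hypothetical links from scheme-specific tags to shared reference categories.
# The tags belong to the named schemes; the category strings are placeholders.
LINKS = {
    ("penn", "NN"): "Noun",
    ("penn", "NNS"): "Noun",
    ("penn", "VBP"): "Verb",
    ("universal", "NOUN"): "Noun",
    ("universal", "VERB"): "Verb",
}


def find(tokens, category):
    """Return word forms whose scheme-specific tag links to `category`."""
    return [
        form
        for form, scheme, tag in tokens
        if LINKS.get((scheme, tag)) == category
    ]


# Two tiny invented corpora that use different tag sets.
corpus = [
    ("dogs", "penn", "NNS"),
    ("bark", "penn", "VBP"),
    ("Hunde", "universal", "NOUN"),
    ("bellen", "universal", "VERB"),
]

print(find(corpus, "Noun"))  # ['dogs', 'Hunde']
```

The query is phrased once against the reference categories, and adding a further scheme only requires new entries in the link table.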

References

  1. John R Taylor (1995) Linguistic Categorization: Prototypes in Linguistic Theory, 2nd ed., ch. 2, p. 21
  2. Universal POS tags
  3. The essentials of EAGLES
  4. Dimitrova, L., Ide, N., Petkevic, V., Erjavec, T., Kaalep, H. J., & Tufis, D. (1998, August). Multext-east: Parallel and comparable corpora and lexicons for six central and eastern european languages. In Proceedings of the 17th international conference on Computational linguistics-Volume 1 (pp. 315-319). Association for Computational Linguistics.
  5. Petrov, Slav; Das, Dipanjan; McDonald, Ryan (11 Apr 2011). "A Universal Part-of-Speech Tagset". arXiv:1104.2086 [cs.CL].
  6. Petrov, Slav (11 Apr 2011). "A Universal Part-of-Speech Tagset". arXiv:1104.2086 [cs.CL].
  7. "Stanford Dependencies". nlp.stanford.edu. The Stanford Natural Language Processing Group. Retrieved 8 May 2020.
  8. "Interset". cuni.cz. Institute of Formal and Applied Linguistics (Czech Republic). Retrieved 8 May 2020.
  9. "Universal Dependencies". universaldependencies.org. Retrieved 2020-05-14.
  10. "aux:pass". universaldependencies.org. Retrieved 2020-05-14.
  11. UniMorph. "UniMorph: Universal Morphological Annotation". UniMorph. Retrieved 2020-05-14.
  12. System-T/UniversalPropositions, System-T, 2020-05-14, retrieved 2020-05-14
  13. Prange, J., Schneider, N., & Abend, O. (2019, August). Semantically Constrained Multilayer Annotation: The Case of Coreference. In Proceedings of the First International Workshop on Designing Meaning Representations (pp. 164-176).
  14. "Penn Parsed Corpora of Historical English: Other Corpora". www.ling.upenn.edu. Retrieved 2020-05-14.
  15. "Icelandic Parsed Historical Corpus (IcePaHC)". www.linguist.is. Retrieved 2020-05-14.
  16. Taylor, Ann; Warner, Anthony; Pintzuk, Susan; Beths, Frank (September 2003). "The York-Toronto-Helsinki Parsed Corpus of Old English prose (YCOE)". Department of Language and Linguistic Science, University of York.
  17. "Penn-Helsinki Parsed Corpus of Middle English 2". www.ling.upenn.edu. Retrieved 2020-05-14.
  18. "Corpus of Historical Low German". www.chlg.ac.uk. Retrieved 2020-05-14.
  19. Light, C., & Wallenberg, J. (2011). On the use of passives across Germanic. Presented at the 13th Meeting of the Diachronic Generative Syntax (DIGS) Conference, DIGS 13, University of Pennsylvania, June 5, 2011.
  20. Beatrice Santorini (1993) The rate of phrase structure change in the history of Yiddish (ftp://babel.ling.upenn.edu/papers/faculty/beatrice%20santorini/santorini-1993.pdf). Language Variation and Change 5, 257-283.
  21. "Tycho Brahe Project". www.tycho.iel.unicamp.br. Retrieved 2020-05-14.
  22. "NPCMJ – Ninjal Parsed Corpus of Modern Japanese". Retrieved 2020-05-14.
  23. "Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) - Linguistic Data Consortium". catalog.ldc.upenn.edu. Retrieved 2020-05-14.
  24. "Penn Chinese Treebank Project". verbs.colorado.edu. Retrieved 2020-05-14.
  25. Comrie, B., Haspelmath, M., & Bickel, B. (2008). The Leipzig Glossing Rules: Conventions for interlinear morpheme-by-morpheme glosses. Department of Linguistics of the Max Planck Institute for Evolutionary Anthropology & the Department of Linguistics of the University of Leipzig. Retrieved January 28, 2010.
  26. Scott Farrar and D. Terence Langendoen (2003) "A linguistic ontology for the Semantic Web." GLOT International. 7 (3), pp. 97-100.
  27. GOLD versions
  28. "ISO 12620:1999 – Computer applications in terminology – Data categories". iso.org. 2011. Retrieved 9 November 2011.
  29. "ISO 12620:2009 – Terminology and other language and content resources – Specification of data categories and management of a Data Category Registry for language resources". iso.org. 2011. Retrieved 9 November 2011.
  30. "ISO 12620:2019 Management of terminology resources – Data category specifications". ISO. Retrieved 20 January 2020.
  31. Bononno, Robert (2011). "Terminology for Translators – an Implementation of ISO 12620". Meta. 45 (4): 646–669. CiteSeerX 10.1.1.136.4771. doi:10.7202/002101ar.
  32. "ISO 12620:2019 Management of terminology resources – Data category specifications". ISO. Retrieved 20 January 2020.
  33. "The Data Category Repository (DCR) has changed address". www.iso.org. Retrieved 2020-05-08.
  34. "CLARIN Concept Registry | CLARIN ERIC". www.clarin.eu. Retrieved 2020-05-08.
  35. "DatCatInfo". www.datcatinfo.net. Retrieved 2020-05-08.
  36. "LexInfo". www.lexinfo.net. Retrieved 2020-05-14.
  37. Cimiano, P., Chiarcos, C., McCrae, J. P., & Gracia, J. (2020). Linguistic Linked Data (pp. 137–160). Springer, Cham.
  38. ontolex/lexinfo, OntoLex Community Group, 2020-03-07, retrieved 2020-05-14
  39. "OLiA ontologies". purl.org/olia. Retrieved 2020-05-14.
  40. Chiarcos, C. (2008). An ontology of linguistic annotations. In LDV Forum (Vol. 23, No. 1, pp. 1-16).
  41. Chiarcos, C. (2010, May). Grounding an ontology of linguistic annotations in the Data Category Registry. In LREC 2010 Workshop on Language Resource and Language Technology Standards (LT&LTS), Valetta, Malta (pp. 37-40).
  42. Rehm, G., Galanis, D., Labropoulou, P., Piperidis, S., Welß, M., Usbeck, R., et al. (2020). Towards an Interoperable Ecosystem of AI and LT Platforms: A Roadmap for the Implementation of Different Levels of Interoperability. arXiv preprint arXiv:2004.08355.
  43. Christian Chiarcos, Maxim Ionov and Christian Fäth (2020), Annotation interoperability in the post-ISOcat era, LREC 2020
  44. acoli-repo/olia, ACoLi, 2020-03-10, retrieved 2020-05-14

Retrieved from "https://en.wikipedia.org/w/index.php?title=Linguistic_categories&oldid=1276295216#ISO_12620_(ISO_TC37_Data_Category_Registry,_ISOcat)"