Language resource

From Wikipedia, the free encyclopedia

Linguistic material used for various types of language research and processing

In linguistics and language technology, a language resource is a "[composition] of linguistic material used in the construction, improvement and/or evaluation of language processing applications, (...) in language and language-mediated research studies and applications."^[1]

According to Bird & Simons (2003),^[2] this includes

data, i.e. "any information that documents or describes a language, such as a published monograph, a computer data file, or even a shoebox full of handwritten index cards. The information could range in content from unanalyzed sound recordings to fully transcribed and annotated texts to a complete descriptive grammar",^[2]
tools, i.e., "computational resources that facilitate creating, viewing, querying, or otherwise using language data",^[2] and
advice, i.e., "any information about what data sources are reliable, what tools are appropriate in a given situation, what practices to follow when creating new data". The latter aspect is usually referred to as "best practices" or "(community) standards".^[2]

In a narrower sense, the term language resource is applied specifically to resources that are available in digital form, "encompassing (a) data sets (textual, multimodal/multimedia and lexical data, grammars, language models, etc.) in machine readable form, and (b) tools/technologies/services used for their processing and management".^[1]

Typology

As of May 2020, no widely used standard typology of language resources has been established (current proposals include the LRE Map,^[3] META-SHARE,^[4] and, for data, the LLOD classification). Important classes of language resources include

data
1. lexical resources, e.g., machine-readable dictionaries,
2. linguistic corpora, i.e., digital collections of natural language data,
3. linguistic databases such as the Cross-Linguistic Linked Data collection,
tools
1. linguistic annotations and tools for creating such annotations in a manual or semiautomated fashion (e.g., tools for annotating interlinear glossed text such as Toolbox and FLEx, or other language documentation tools),
2. applications for search and retrieval over such data (corpus management systems) and for automated annotation (part-of-speech tagging, syntactic parsing, semantic parsing, etc.),
metadata and vocabularies
1. vocabularies, repositories of linguistic terminology and language metadata, e.g., META-SHARE (for language resource metadata),^[4] the ISO 12620 data category registry (for linguistic features, data structures and annotations within a language resource),^[5] or the Glottolog database (identifiers for language varieties and bibliographical database).^[6]
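
The three top-level classes above can be modelled as a small data structure. The following is a minimal sketch rather than an established schema: the names `ResourceClass`, `LanguageResource`, `CATALOG` and `group_by_class`, as well as the subtype labels, are hypothetical and chosen only for illustration. The example resources are the ones named in the list.

```python
from dataclasses import dataclass
from enum import Enum


class ResourceClass(Enum):
    """Top-level classes from the typology above (illustrative only)."""
    DATA = "data"
    TOOL = "tools"
    METADATA = "metadata and vocabularies"


@dataclass(frozen=True)
class LanguageResource:
    name: str
    resource_class: ResourceClass
    subtype: str  # free-text label; no standard value set is assumed


# Example entries taken from the resources named in the list above.
CATALOG = [
    LanguageResource("Cross-Linguistic Linked Data", ResourceClass.DATA, "linguistic database"),
    LanguageResource("Toolbox", ResourceClass.TOOL, "annotation tool"),
    LanguageResource("FLEx", ResourceClass.TOOL, "annotation tool"),
    LanguageResource("Glottolog", ResourceClass.METADATA, "vocabulary"),
]


def group_by_class(resources):
    """Return a dict mapping each ResourceClass to the names filed under it."""
    grouped = {cls: [] for cls in ResourceClass}
    for resource in resources:
        grouped[resource.resource_class].append(resource.name)
    return grouped


if __name__ == "__main__":
    for cls, names in group_by_class(CATALOG).items():
        print(f"{cls.value}: {', '.join(names)}")
```

Because no standard typology has been established, a real catalogue would more likely adopt the categories of one of the proposals mentioned above than a hand-written enumeration like this one.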

Language resource publication, dissemination and creation

A major concern of the language resource community has been to develop infrastructures and platforms to present, discuss and disseminate language resources. Selected contributions in this regard include:

a series of International Conferences on Language Resources and Evaluation (LREC),
the European Language Resources Association (ELRA, EU-based), and the Linguistic Data Consortium (LDC, US-based), which represent commercial hosting and dissemination platforms for language resources,
the Open Languages Archives Community (OLAC), which provides and aggregates language resource metadata,
the Language Resources and Evaluation Journal (LREJ),^[7]
the European Language Grid, a European platform for language technologies (e.g., services), data and resources.

Standards and best practices for language resources are developed by several community groups and standardization efforts, including

ISO Technical Committee 37: Terminology and other language and content resources (ISO/TC 37), developing standards for all aspects of language resources,
W3C Community Group Best Practices for Multilingual Linked Open Data (BPMLOD),^[8] working on best practice recommendations for publishing language resources as Linked Data or in RDF,
W3C Community Group Linked Data for Language Technology (LD4LT),^[9] working on linguistic annotations on the web and language resource metadata,
W3C Community Group Ontology-Lexica (OntoLex),^[10] working on lexical resources,
the Open Linguistics working group of the Open Knowledge Foundation, working on conventions for publishing and linking open language resources, developing the Linguistic Linked Open Data cloud,^[11]
the Text Encoding Initiative (TEI),^[12] working on XML-based specifications for language resources and digitally edited text.
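
To make the idea of publishing language resource metadata as RDF concrete, the sketch below serializes a few Dublin Core terms as N-Triples using only the Python standard library. It is an illustration under stated assumptions, not the method recommended by any of the groups above: the helper names `escape_literal` and `to_ntriples`, the `example.org` URI and the sample record are all hypothetical, and the escaping covers only backslashes, double quotes and newlines.

```python
# Namespace of the Dublin Core terms vocabulary.
DCTERMS = "http://purl.org/dc/terms/"


def escape_literal(value: str) -> str:
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_ntriples(subject_uri: str, metadata: dict) -> str:
    """Emit one triple per metadata field, each with a plain string literal."""
    lines = []
    for term, value in metadata.items():
        lines.append(f'<{subject_uri}> <{DCTERMS}{term}> "{escape_literal(value)}" .')
    return "\n".join(lines)


# Hypothetical record; the URI and values are placeholders, not a real resource.
record = {"title": "Example corpus", "language": "en", "type": "Dataset"}
print(to_ntriples("http://example.org/resource/example-corpus", record))
```

In practice an RDF library and a richer vocabulary would normally replace this hand-written serialization; the sketch only shows the shape of the resulting triples.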

References

1. LD4LT (2020), The Metashare Ontology as Created by the LD4LT Community Group, W3C Community Group Linked Data for Language Technology (LD4LT), Development branch, version of Mar 10, 2020.
2. Bird, Steven; Simons, Gary (2003-11-01). "Extending Dublin Core Metadata to Support the Description and Discovery of Language Resources". Computers and the Humanities. 37 (4): 375–388. arXiv:cs/0308022. Bibcode:2003cs........8022B. doi:10.1023/A:1025720518994. ISSN 1572-8412. S2CID 5969663.
3. Calzolari, N., Del Gratta, R., Francopoulo, G., Mariani, J., Rubino, F., Russo, I., & Soria, C. (2012, May). The LRE Map. Harmonising Community Descriptions of Resources. In LREC (pp. 1084-1089).
4. McCrae, John P.; Labropoulou, Penny; Gracia, Jorge; Villegas, Marta; Rodríguez-Doncel, Víctor; Cimiano, Philipp (2015). "One Ontology to Bind Them All: The META-SHARE OWL Ontology for the Interoperability of Linguistic Datasets on the Web". In Gandon, Fabien; Guéret, Christophe; Villata, Serena; Breslin, John; Faron-Zucker, Catherine; Zimmermann, Antoine (eds.). The Semantic Web: ESWC 2015 Satellite Events. Lecture Notes in Computer Science. Vol. 9341. Cham: Springer International Publishing. pp. 271–282. doi:10.1007/978-3-319-25639-9_42. ISBN 978-3-319-25639-9.
5. Kemps-Snijders, M., Windhouwer, M., Wittenburg, P., & Wright, S. E. (2008). ISOcat: Corralling data categories in the wild. In 6th International Conference on Language Resources and Evaluation (LREC 2008).
6. Nordhoff, Sebastian (2012), Chiarcos, Christian; Nordhoff, Sebastian; Hellmann, Sebastian (eds.), "Linked Data for Linguistic Diversity Research: Glottolog/Langdoc and ASJP Online", Linked Data in Linguistics: Representing and Connecting Language Data and Language Metadata, Springer, pp. 191–200, doi:10.1007/978-3-642-28249-2_18, ISBN 978-3-642-28249-2.
7. "Language Resources and Evaluation". Springer. Retrieved 2020-05-13.
8. "Best Practices for Multilingual Linked Open Data Community Group". www.w3.org. 2 October 2015. Retrieved 2020-05-13.
9. "Linked Data for Language Technology Community Group". www.w3.org. 26 June 2015. Retrieved 2020-05-13.
10. "Ontology-Lexica Community Group". www.w3.org. 10 May 2016. Retrieved 2020-05-13.
11. "Linguistic Linked Open Data".
12. "TEI: Text Encoding Initiative". tei-c.org. Retrieved 2020-05-13.
