Source code and data accompanying the study «Towards a sustainable handling of inter-linear-glossed text in language documentation»
This repository is intended to walk users through the workflow described in the paper.
We assume that users are familiar with the command line on their system, have Python 3.5 or higher installed, and also have the Git version control software on their machine.
In order to run through the workflow, some Python packages must be installed. This is best done in a virtual environment, in order to keep your system's Python installation unaffected. Thus, in an activated virtual environment, with igt-paper/ as working directory, run

$ pip install -e .

This will install the packages listed in setup.py under "install_requires".
The workflow described in this paper requires access to several catalogs:
- Glottolog - to look up language metadata,
- Concepticon - to look up information about lexical concepts, and
- CLTS - for information about transcription systems.
We will "install" these using thecldfbench command, installed with thecldfbench package(see theREADME):
cldfbench catconfig
Since Glottolog requires downloading about 500 MB of data, this may take some time. It also requires about 1.2 GB of free disk space:
```
$ du -sh .config/cldf/*
4,0K    .config/cldf/_catalog.ini
4,0K    .config/cldf/catalog.ini
43M     .config/cldf/clts
122M    .config/cldf/concepticon
994M    .config/cldf/glottolog
```

The Qiang corpus used in this paper comes as a simple, line-based text file: Qiang.txt. First attempts at parsing this file led to the detection of errors in the original digitized version. These were corrected in the file Qiang-2.txt. (Qiang.txt is kept in this repository in case users are interested in inspecting the actual errors.)
Conversion to CLDF is implemented within the cldfbench framework, i.e. by providing some Python code, which is invoked via
$ cldfbench makecldf cldfbench_lapollaqiang.py
This will (re-)create the CLDF dataset in the cldf/ directory. The remainder of the workflow will use cldf/examples.csv as its main input.
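For orientation, the following is a minimal sketch of the structure such a cldfbench dataset module typically has; it is not the actual cldfbench_lapollaqiang.py, and the file path and the omitted parsing logic are placeholders only.

```python
# Hypothetical sketch of a cldfbench dataset module (NOT the actual
# cldfbench_lapollaqiang.py); it only illustrates the hooks that
# `cldfbench makecldf` invokes.
import pathlib

from cldfbench import Dataset as BaseDataset, CLDFSpec


class Dataset(BaseDataset):
    dir = pathlib.Path(__file__).parent
    id = "lapollaqiang"

    def cldf_specs(self):
        # The examples end up in a Generic CLDF dataset under cldf/.
        return CLDFSpec(dir=self.cldf_dir, module="Generic")

    def cmd_makecldf(self, args):
        # Write the parsed examples to an ExampleTable; the actual parsing
        # of the line-based Qiang-2.txt format is omitted here.
        args.writer.cldf.add_component("ExampleTable")
        for igt in []:  # placeholder for rows parsed from Qiang-2.txt
            args.writer.objects["ExampleTable"].append(igt)
```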
The workflow described in the paper is implemented as a dataset-specific command to be run with cldfbench. The code is available in commands/workflow.py.
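As a rough orientation (not the actual content of commands/workflow.py), such a dataset-specific command is a module providing register and run functions, roughly along these lines; the --output option is made up for illustration:

```python
# Rough skeleton of a dataset-specific cldfbench command; the real
# workflow lives in commands/workflow.py and does considerably more.
from pyigt import Corpus

from cldfbench_lapollaqiang import Dataset


def register(parser):
    # Command line options for the sub-command can be declared here.
    parser.add_argument("--output", default="output", help="directory for derived files")


def run(args):
    # Load the CLDF examples as an IGT corpus and run the first checks.
    corpus = Corpus.from_cldf(Dataset().cldf_reader())
    corpus.check_glosses()
    corpus.write_concordance("forms", filename=args.output + "/form-concordance.tsv")
```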
In the following sections, we will walk through this code in an interactive Python session.
You can inspect the IGTs in the dataset using the igt command, installed with pyigt, to get some summary statistics:
```
$ igt stats cldf/cldf-metadata.json
          count
--------  -------
example   1276
word      3954
morpheme  8256

Example properties:
  ID Language_ID Primary_Text Analyzed_Word Gloss Translated_Text Meta_Language_ID Comment Text_ID Sentence_Number Phrase_Number
```
The Text_ID and Gloss properties listed above can be used to filter IGTs for display:
```
$ igt ls cldf/cldf-metadata.json Text_ID=1 Gloss=CSM
Example 1:
zəple: ȵike: peji qeʴlotʂuʁɑ,
zəp-le:       ȵi-ke:        pe-ji       qeʴlotʂu-ʁɑ,
earth-DEF:CL  WH-INDEF:CL   become-CSM  in.the.past-LOC

Example 22:
tɕetɕilɑwu mufů təlɑji,
tɕetɕi-lɑ-wu        mufů   tə-lɑ-ji,
everywhere-LOC-ABL  smoke  DIR-come-CSM

Example 24:
mi luji.
mi      lu-ji.
people  come-CSM
```
For further inspection, we load the data in an interactive Python session:
```python
>>> from pyigt import Corpus
>>> from cldfbench_lapollaqiang import Dataset
>>> texts = Corpus.from_cldf(Dataset().cldf_reader())
```
In order to check the glosses (an essential part of steps 1 and 2 of our workflow), we run
>>> texts.check_glosses()
The output distinguishes errors by levels. An error on the first level means that phrase and gloss are not well aligned, i.e. a phrase has more or fewer elements than its corresponding gloss line. A second-level error refers to mis-alignments between the morphemes of a word and its corresponding gloss.
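Conceptually, the check boils down to comparing the lengths of the aligned tiers; a simplified sketch (not pyigt's actual implementation, and using only "-" as morpheme separator) looks like this:

```python
# Simplified sketch of the two checking levels (not pyigt's actual code):
# level 1 compares words against word glosses, level 2 compares the
# morphemes within a word against the parts of its gloss.
def check_example(analyzed_words, glosses):
    errors = []
    if len(analyzed_words) != len(glosses):
        # Level 1: phrase and gloss line are not aligned.
        errors.append(("level 1", analyzed_words, glosses))
    else:
        for word, gloss in zip(analyzed_words, glosses):
            if len(word.split("-")) != len(gloss.split("-")):
                # Level 2: numbers of morphemes and glossed morphemes differ.
                errors.append(("level 2", word, gloss))
    return errors
```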
Our check yields 13 level 2 errors, where the number of morphemes differs from the number of glossed morphemes:
```
[63:5 : second level 1]
['qu', 'kəpə', 'kəi', 'ʂ,']
['ɦe', 'afraid', 'HABIT', 'NAR', 'LNK']
---
[318:2 : second level 2]
['hɑ', 'lə', 'jə', 'kui', 'tu,']
['DIR', 'come', 'REP', 'LNK']
---
[463:3 : second level 3]
['satʂů', 'le:', 'tʂi', 'le:', 'wu']
['younger', 'sister', 'DEF:CL', 'son', 'DEF:CL', 'AGT']
---
[643:1 : second level 4]
['ɦɑ', 'kə']
['that.manner']
---
[678:1 : second level 5]
['ɦɑ', 'tsəi', 'ŋuəȵi,']
['this.manner', 'TOP']
---
[745:3 : second level 6]
['ɑ', 'χtʂ']
['one.row']
---
[840:1 : second level 7]
['he', 'ɕi', 'kui']
['DIR', 'send']
---
[843:2 : second level 8]
['qɑpə', 'tɕ']
['old.man']
---
[860:2 : second level 9]
['du', 'ɸu', 'ȵi']
['run.away', 'ADV']
---
[886:3 : second level 10]
['ə', 'lɑ', 'kəi', 'tu,']
['DIR', 'come', 'LNK']
---
[928:2 : second level 11]
['ɕtɕə', 'p']
['seven.years']
---
[984:1 : second level 12]
['ɦɑ', 'kə']
['that.manner']
---
[1255:7 : second level 13]
['tɕɑu', 'ʐbə', 'kə', 'ȵi,']
['think.to.oneself', 'INF', 'ADV']
---
```

A Corpus object computes three basic types of concordance upon loading:
- basic concordances that list each morpheme along with its gloss (form),
- grammatical concordances which list only those items deemed to be grammatical (grammar), and
- lexical concordances which are built from items assumed to be purely lexical (lexicon).
Note: Erroneous forms as identified in the previous steps are ignored.
We can write the concordances to files as follows:
```python
>>> texts.write_concordance('forms', filename='output/form-concordance.tsv')
>>> texts.write_concordance('lexicon', filename='output/lexical-concordance.tsv')
>>> texts.write_concordance('grammar', filename='output/grammatical-concordance.tsv')
```
The concordances created above keep a full trace to each word in each original phrase. They can be used in further steps to normalize the data or to link it to reference catalogs.
We can use the concordances to create concept lists, both for grammatical and for lexical entries:
```python
>>> texts.write_concepts('lexicon', filename='output/automated-concepts.tsv')
>>> texts.write_concepts('grammar', filename='output/automated-glosses.tsv')
```
While there is no Grammaticon to which we could link our grammatical concepts, we can add the full names for each grammatical concept from the resource. This has to be done manually, but it does not take much time, and it has also revealed that the abbreviation list in the resource lacks a description for the abbreviation REDUP. The list of grammatical concepts is provided as etc/glosses.tsv.
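To use this list programmatically, it can be read back into a simple lookup table; the following is a minimal sketch assuming a tab-separated file with columns along the lines of GLOSS and NAME (check etc/glosses.tsv for the actual header):

```python
# Minimal sketch for loading the curated gloss list; the column names
# "GLOSS" and "NAME" are assumptions, not necessarily the actual ones.
import csv


def read_glosses(path="etc/glosses.tsv"):
    with open(path, encoding="utf-8", newline="") as f:
        return {row["GLOSS"]: row["NAME"] for row in csv.DictReader(f, delimiter="\t")}


abbreviations = read_glosses()
# e.g. abbreviations.get("CSM") would then yield the full name of that gloss.
```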
For the list of lexical concepts, we can use the Concepticon resource to map our automatically created concept list to the data provided by the Concepticon project. This can be done using the concepticon command, installed with the pyconcepticon package (you may have to look up the path to the Concepticon repository clone via cldfbench catinfo):
$ concepticon --repos=PATH/TO/CLONE/OF/concepticon-data map_concepts output/automated-concepts.tsv
This will yield a longer list as output that needs to be written to a file in order to edit it.
```
$ concepticon --repos=PATH/TO/CLONE/OF/concepticon-data map_concepts output/automated-concepts.tsv > etc/concepts-mapped.tsv
```

As we can see from the output, as many as 421 concepts can be automatically linked, which accounts for about 72% of the list. After manually revising this list (see etc/concepts.tsv), the number of linked items drops a bit, but it still contains a considerable number of glosses that are described well enough to link them to the Concepticon project and thus make them available for other studies.
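The same mapping can also be approached programmatically from Python; the sketch below relies on pyconcepticon's lookup method, whose exact signature and return format should be treated as an assumption and checked against the pyconcepticon documentation:

```python
# Sketch of programmatic concept mapping with pyconcepticon; the lookup()
# call and the example glosses are for illustration only.
from pyconcepticon import Concepticon

api = Concepticon("PATH/TO/CLONE/OF/concepticon-data")
for matches in api.lookup(["hand", "to run away"]):
    # Each item lists candidate Concepticon concept sets for one gloss.
    print(matches)
```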
The transcriptions in the original resource are not necessarily standardized. We can use the pyigt library again to create a first orthography profile, which we can then use to further standardize the data. (Again, you may have to look up the path to the CLTS repository clone via cldfbench catinfo.)
```python
>>> from pyclts import CLTS
>>> texts.get_profile(filename='output/automated-orthograpy.tsv', clts=CLTS('PATH/TO/CLONE/OF/clts'))
```
This will create an initial orthography profile that can be further refined by the users. Our refined version can be found in etc/orthography.tsv.
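Orthography profiles of this kind can also be applied with the segments package to standardize individual transcriptions; a small sketch, assuming the profile has Grapheme and IPA columns and using a string from the examples above as input:

```python
# Sketch of applying the refined profile to segment a transcription; the
# assumption of an "IPA" column in the profile is illustrative only.
from segments import Profile, Tokenizer

tokenizer = Tokenizer(Profile.from_file("etc/orthography.tsv"))
print(tokenizer("qeʴlotʂuʁɑ", column="IPA"))
```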
Once created and manually corrected, we can use our improved transcriptions to search for language-internal cognates. We do this by invoking the following command, which will segment the transcriptions, iterate over all data, compare words which have the same grammatical and lexical gloss, and place them in the same cognate set, labelled as CROSSID, if they are sufficiently similar (i.e. if their distance falls below a certain threshold).
```python
>>> wl = texts.get_wordlist(doculect='Qiang', profile='etc/orthography.tsv')
>>> wl.output(
...     'tsv',
...     filename='qiang-wordlist',
...     prettify=False,
...     ignore='all',
...     subset=True,
...     cols=[h for h in wl.columns])
```
The resulting wordlist qiang-wordlist.tsv can be conveniently inspected with the help of the EDICTOR tool.
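For a quick programmatic look before opening EDICTOR, the wordlist can also be loaded with LingPy; a small sketch, assuming the cross-id column is written as CROSSID as described above:

```python
# Sketch for inspecting the cross-semantic cognate sets in the wordlist;
# assumes a CROSSID column as described above.
from lingpy import Wordlist

wl = Wordlist("qiang-wordlist.tsv")
etd = wl.get_etymdict(ref="crossid")
print("cross-semantic cognate sets:", len(etd))
```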
The concordance browser is created from the assembled data, specifically the wordlist. It consists of a simple HTML frontend, which includes some JavaScript code implementing the viewer's functionality, and the data, again loaded from a JavaScript file, created via
>>> texts.write_app(dest='app')
To open the app, just open the local file app/index.html in your browser.