Source code and data accompanying the study «Towards a sustainable handling of inter-linear-glossed text in language documentation»
This repository is intended to walk users through the workflow described in the paper.
We assume that users are familiar with the command line on their system, have Python 3.5 or higher installed, and also have the Git version control software on their machine.
In order to run through the workflow, some Python packages must be installed. This is best done in a virtual environment, in order to keep your system's Python installation unaffected. Thus, in an activated virtual environment, with igt-paper/ as working directory, run

$ pip install -e .

This will install the packages listed in setup.py under "install_requires".
The workflow described in this paper requires access to several catalogs:
- Glottolog - to look up language metadata,
- Concepticon - to look up information about lexical concepts, and
- CLTS - for information about transcription systems.
We will "install" these using thecldfbench command, installed with thecldfbench package(see theREADME):
cldfbench catconfig
Since Glottolog requires downloading about 500 MB of data, this may take some time. It also requires about 1.2 GB of free disk space:
```
$ du -sh .config/cldf/*
4,0K    .config/cldf/_catalog.ini
4,0K    .config/cldf/catalog.ini
43M     .config/cldf/clts
122M    .config/cldf/concepticon
994M    .config/cldf/glottolog
```

The Qiang corpus used in this paper comes as a simple, line-based text file: Qiang.txt. First attempts at parsing this file led to the detection of errors in the original digitized version. These were corrected in the file Qiang-2.txt. (Qiang.txt is kept in this repository in case users are interested in inspecting the actual errors.)
Conversion to CLDF is implemented within the cldfbench framework, i.e. by providing some Python code, which is invoked via
$ cldfbench makecldf cldfbench_lapollaqiang.py
This will (re-)create the CLDF dataset in the cldf/ directory. The remainder of the workflow will use cldf/examples.csv as its main input.
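For orientation, the following is a minimal sketch of the structure such a cldfbench dataset module typically has; it is not the actual cldfbench_lapollaqiang.py, and the file path and the omitted parsing logic are placeholders only.

```python
# Hypothetical sketch of a cldfbench dataset module (NOT the actual
# cldfbench_lapollaqiang.py); it only illustrates the hooks that
# `cldfbench makecldf` invokes.
import pathlib

from cldfbench import Dataset as BaseDataset, CLDFSpec


class Dataset(BaseDataset):
    dir = pathlib.Path(__file__).parent
    id = "lapollaqiang"

    def cldf_specs(self):
        # The examples end up in a Generic CLDF dataset under cldf/.
        return CLDFSpec(dir=self.cldf_dir, module="Generic")

    def cmd_makecldf(self, args):
        # Write the parsed examples to an ExampleTable; the actual parsing
        # of the line-based Qiang-2.txt format is omitted here.
        args.writer.cldf.add_component("ExampleTable")
        for igt in []:  # placeholder for rows parsed from Qiang-2.txt
            args.writer.objects["ExampleTable"].append(igt)
```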
The workflow described in the paper is implemented as a dataset-specific command to be run with cldfbench. The code is available in commands/workflow.py.
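As a rough orientation (not the actual content of commands/workflow.py), such a dataset-specific command is a module providing register and run functions, roughly along these lines; the --output option is made up for illustration:

```python
# Rough skeleton of a dataset-specific cldfbench command; the real
# workflow lives in commands/workflow.py and does considerably more.
from pyigt import Corpus

from cldfbench_lapollaqiang import Dataset


def register(parser):
    # Command line options for the sub-command can be declared here.
    parser.add_argument("--output", default="output", help="directory for derived files")


def run(args):
    # Load the CLDF examples as an IGT corpus and run the first checks.
    corpus = Corpus.from_cldf(Dataset().cldf_reader())
    corpus.check_glosses()
    corpus.write_concordance("forms", filename=args.output + "/form-concordance.tsv")
```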
In the following sections, we will walk through this code in an interactive Python session.
You can inspect the IGTs in the dataset using the igt command, installed with pyigt, to get some summary statistics:
```
$ igt stats cldf/cldf-metadata.json
          count
--------  -------
example   1276
word      3954
morpheme  8256

Example properties:
  ID Language_ID Primary_Text Analyzed_Word Gloss Translated_Text Meta_Language_ID Comment Text_ID Sentence_Number Phrase_Number
```
The Text_ID and Gloss properties listed above can be used to filter IGTs for display:
```
$ igt ls cldf/cldf-metadata.json Text_ID=1 Gloss=CSM
Example 1:
zəple: ȵike: peji qeʴlotʂuʁɑ,
zəp-le:       ȵi-ke:        pe-ji       qeʴlotʂu-ʁɑ,
earth-DEF:CL  WH-INDEF:CL   become-CSM  in.the.past-LOC

Example 22:
tɕetɕilɑwu mufů təlɑji,
tɕetɕi-lɑ-wu        mufů   tə-lɑ-ji,
everywhere-LOC-ABL  smoke  DIR-come-CSM

Example 24:
mi luji.
mi      lu-ji.
people  come-CSM
```
For further inspection, we load the data in an interactive Python session:
```python
>>> from pyigt import Corpus
>>> from cldfbench_lapollaqiang import Dataset
>>> texts = Corpus.from_cldf(Dataset().cldf_reader())
```
In order to check the glosses (an essential part of steps 1 and 2 of our workflow), we run
>>> texts.check_glosses()
The output distinguishes errors by levels. An error on the first level means that phrase and gloss are not well aligned, i.e. a phrase has more or fewer elements than its corresponding gloss line. A second-level error refers to mis-alignments between the morphemes of a word and its corresponding gloss.
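Conceptually, the check boils down to comparing the lengths of the aligned tiers; a simplified sketch (not pyigt's actual implementation, and using only "-" as morpheme separator) looks like this:

```python
# Simplified sketch of the two checking levels (not pyigt's actual code):
# level 1 compares words against word glosses, level 2 compares the
# morphemes within a word against the parts of its gloss.
def check_example(analyzed_words, glosses):
    errors = []
    if len(analyzed_words) != len(glosses):
        # Level 1: phrase and gloss line are not aligned.
        errors.append(("level 1", analyzed_words, glosses))
    else:
        for word, gloss in zip(analyzed_words, glosses):
            if len(word.split("-")) != len(gloss.split("-")):
                # Level 2: numbers of morphemes and glossed morphemes differ.
                errors.append(("level 2", word, gloss))
    return errors
```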
Our check yields 13 level 2 errors, where the number of morphemes differs from the number of glossed morphemes:
```
[63:5 : second level 1]
['qu', 'kəpə', 'kəi', 'ʂ,']
['ɦe', 'afraid', 'HABIT', 'NAR', 'LNK']
---
[318:2 : second level 2]
['hɑ', 'lə', 'jə', 'kui', 'tu,']
['DIR', 'come', 'REP', 'LNK']
---
[463:3 : second level 3]
['satʂů', 'le:', 'tʂi', 'le:', 'wu']
['younger', 'sister', 'DEF:CL', 'son', 'DEF:CL', 'AGT']
---
[643:1 : second level 4]
['ɦɑ', 'kə']
['that.manner']
---
[678:1 : second level 5]
['ɦɑ', 'tsəi', 'ŋuəȵi,']
['this.manner', 'TOP']
---
[745:3 : second level 6]
['ɑ', 'χtʂ']
['one.row']
---
[840:1 : second level 7]
['he', 'ɕi', 'kui']
['DIR', 'send']
---
[843:2 : second level 8]
['qɑpə', 'tɕ']
['old.man']
---
[860:2 : second level 9]
['du', 'ɸu', 'ȵi']
['run.away', 'ADV']
---
[886:3 : second level 10]
['ə', 'lɑ', 'kəi', 'tu,']
['DIR', 'come', 'LNK']
---
[928:2 : second level 11]
['ɕtɕə', 'p']
['seven.years']
---
[984:1 : second level 12]
['ɦɑ', 'kə']
['that.manner']
---
[1255:7 : second level 13]
['tɕɑu', 'ʐbə', 'kə', 'ȵi,']
['think.to.oneself', 'INF', 'ADV']
---
```

A Corpus object computes three basic types of concordance upon loading:
- basic concordances that list each morpheme along with its gloss (form),
- grammatical concordances which list only those items deemed to be grammatical (grammar), and
- lexical concordances which are built from items assumed to be purely lexical (lexicon).
Note: Erroneous forms as identified in the previous steps are ignored.
We can write the concordances to files as follows:
```python
>>> texts.write_concordance('forms', filename='output/form-concordance.tsv')
>>> texts.write_concordance('lexicon', filename='output/lexical-concordance.tsv')
>>> texts.write_concordance('grammar', filename='output/grammatical-concordance.tsv')
```
The concordances created above keep a full trace to each word in each original phrase. They can be used in further steps to normalize the data or to link it to reference catalogs.
We can use the concordances to create concept lists, both for grammatical and for lexical entries:
```python
>>> texts.write_concepts('lexicon', filename='output/automated-concepts.tsv')
>>> texts.write_concepts('grammar', filename='output/automated-glosses.tsv')
```
While there is no Grammaticon to which we could link our grammatical concepts, we can add the full names for each grammatical concept from the resource. This has to be done manually, but it does not take much time, and it has also revealed that the abbreviation list in the resource lacks a description for the abbreviation REDUP. The list of grammatical concepts is provided as etc/glosses.tsv.
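To use this list programmatically, it can be read back into a simple lookup table; the following is a minimal sketch assuming a tab-separated file with columns along the lines of GLOSS and NAME (check etc/glosses.tsv for the actual header):

```python
# Minimal sketch for loading the curated gloss list; the column names
# "GLOSS" and "NAME" are assumptions, not necessarily the actual ones.
import csv


def read_glosses(path="etc/glosses.tsv"):
    with open(path, encoding="utf-8", newline="") as f:
        return {row["GLOSS"]: row["NAME"] for row in csv.DictReader(f, delimiter="\t")}


abbreviations = read_glosses()
# e.g. abbreviations.get("CSM") would then yield the full name of that gloss.
```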
For the list of lexical concepts, we can use the Concepticon resource to map our automatically created concept list to the data provided by the Concepticon project. This can be done using the concepticon command, installed with the pyconcepticon package (you may have to look up the path to the Concepticon repository clone via cldfbench catinfo):
$ concepticon --repos=PATH/TO/CLONE/OF/concepticon-data map_concepts output/automated-concepts.tsv
This will yield a longer list as output that needs to be written to a file in order to edit it.
```
$ concepticon --repos=PATH/TO/CLONE/OF/concepticon-data map_concepts output/automated-concepts.tsv > etc/concepts-mapped.tsv
```

As we can see from the output, as many as 421 concepts can be automatically linked, which accounts for about 72% of the list. After manually revising this list (see etc/concepts.tsv), the number of linked items drops a bit, but it still contains a considerable number of glosses that are described well enough to link them to the Concepticon project and thus make them available for other studies.
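The same mapping can also be approached programmatically from Python; the sketch below relies on pyconcepticon's lookup method, whose exact signature and return format should be treated as an assumption and checked against the pyconcepticon documentation:

```python
# Sketch of programmatic concept mapping with pyconcepticon; the lookup()
# call and the example glosses are for illustration only.
from pyconcepticon import Concepticon

api = Concepticon("PATH/TO/CLONE/OF/concepticon-data")
for matches in api.lookup(["hand", "to run away"]):
    # Each item lists candidate Concepticon concept sets for one gloss.
    print(matches)
```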
The transcriptions in the original resource are not necessarily standardized. We can use the pyigt library again to create a first orthography profile, which we can then use to further standardize the data. (Again, you may have to look up the path to the CLTS repository clone via cldfbench catinfo.)
```python
>>> from pyclts import CLTS
>>> texts.get_profile(filename='output/automated-orthograpy.tsv', clts=CLTS('PATH/TO/CLONE/OF/clts'))
```
This will create an initial orthography profile that can be further refined by the users. Our refined version can be found in etc/orthography.tsv.
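Orthography profiles of this kind can also be applied with the segments package to standardize individual transcriptions; a small sketch, assuming the profile has Grapheme and IPA columns and using a string from the examples above as input:

```python
# Sketch of applying the refined profile to segment a transcription; the
# assumption of an "IPA" column in the profile is illustrative only.
from segments import Profile, Tokenizer

tokenizer = Tokenizer(Profile.from_file("etc/orthography.tsv"))
print(tokenizer("qeʴlotʂuʁɑ", column="IPA"))
```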
Once created and manually corrected, we can use our improved transcriptions to search for language-internal cognates. We do this by invoking the following command, which will segment the transcriptions, iterate over all data, compare words which have the same grammatical and lexical gloss, and place them in the same cognate set, labelled as CROSSID, if they are sufficiently similar (i.e. if their distance falls below a certain threshold).
```python
>>> wl = texts.get_wordlist(doculect='Qiang', profile='etc/orthography.tsv')
>>> wl.output(
...     'tsv',
...     filename='qiang-wordlist',
...     prettify=False,
...     ignore='all',
...     subset=True,
...     cols=[h for h in wl.columns])
```
The resulting wordlist qiang-wordlist.tsv can be conveniently inspected with the help of the EDICTOR tool.
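For a quick programmatic look before opening EDICTOR, the wordlist can also be loaded with LingPy; a small sketch, assuming the cross-id column is written as CROSSID as described above:

```python
# Sketch for inspecting the cross-semantic cognate sets in the wordlist;
# assumes a CROSSID column as described above.
from lingpy import Wordlist

wl = Wordlist("qiang-wordlist.tsv")
etd = wl.get_etymdict(ref="crossid")
print("cross-semantic cognate sets:", len(etd))
```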
The concordance browser is created from the assembled data, specifically the wordlist. It consists of a simple HTML frontend, which includes some JavaScript code implementing the viewer's functionality, and the data, again loaded from a JavaScript file, created via
>>> texts.write_app(dest='app')
To open the app, just open the local file app/index.html in your browser.