A program for calculating corpora alignments using a pivot language
czcorpus/ictools
This is a faster, less memory-consuming, integrated replacement for the legacy calign.py, compressrng.py, fixgaps.py and transalign.py scripts used to prepare corpora alignment numeric data from lists of structural attribute values mapping between languages. It also fixes some problems with missing ranges for unaligned structures you can encounter when using the scripts above. In addition, it provides an export function for performing reversed operations.
Note: you still need the mkalign tool distributed along with Manatee-open to enable corpora alignments in KonText (or NoSkE).
Ictools provide three operations - import, transalign and export:
The import operation transforms an alignment XML file containing aligned string sentence IDs to a numeric form. It is able to handle non-existing alignments and gaps between ranges (including the last row range, where the structure size is always used to make sure the whole range is filled in).
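The gap handling described above can be pictured with a small Go sketch. It is not the ictools implementation: the `Range` type, the `fillGaps` function and the assumption that input ranges are sorted, non-overlapping and inclusive are all hypothetical, chosen only to show how uncovered positions (including the tail up to the structure size) could get explicit entries.

```go
package main

import "fmt"

// Range is a hypothetical inclusive interval of structure positions.
// Aligned is false for intervals that were added only to fill a gap.
type Range struct {
	From, To int
	Aligned  bool
}

// fillGaps takes sorted, non-overlapping ranges and returns a list that
// covers every position from 0 to structSize-1 without holes.
func fillGaps(ranges []Range, structSize int) []Range {
	out := make([]Range, 0, len(ranges))
	next := 0 // first position not covered so far
	for _, r := range ranges {
		if r.From > next {
			// A gap between the previous range and this one.
			out = append(out, Range{From: next, To: r.From - 1})
		}
		out = append(out, r)
		next = r.To + 1
	}
	// The tail up to the structure size is always filled in.
	if next < structSize {
		out = append(out, Range{From: next, To: structSize - 1})
	}
	return out
}

func main() {
	in := []Range{{0, 1, true}, {4, 5, true}}
	for _, r := range fillGaps(in, 8) {
		fmt.Printf("%d-%d aligned=%v\n", r.From, r.To, r.Aligned)
	}
}
```

With a structure size of 8, the two input ranges are complemented by unaligned fillers for positions 2-3 and 6-7.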
In terms of the input format, a list of <link> elements is expected:
<link type='0-3' xtargets=';cs:Adams-Holisticka_det_k:0:7:1 cs:Adams-Holisticka_det_k:0:7:2 cs:Adams-Holisticka_det_k:0:7:3' status='man' />
Please note that the parser does not care about XML validity (e.g. there is no need for a root element or even a proper nesting of elements).
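Because validity is not required, such input can be read line by line without a full XML parser. The following Go sketch illustrates that idea under stated assumptions: `parseLink` and `attrPattern` are hypothetical names, attributes are assumed to use single quotes as in the sample above, and this is not necessarily how ictools itself parses its input.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// attrPattern matches name='value' pairs inside a single <link .../> line.
var attrPattern = regexp.MustCompile(`(\w+)='([^']*)'`)

// parseLink extracts the attributes of one <link> line and splits xtargets
// into two lists of sentence IDs (the parts before and after ';').
func parseLink(line string) (attrs map[string]string, left, right []string) {
	attrs = make(map[string]string)
	for _, m := range attrPattern.FindAllStringSubmatch(line, -1) {
		attrs[m[1]] = m[2]
	}
	parts := strings.SplitN(attrs["xtargets"], ";", 2)
	left = strings.Fields(parts[0])
	if len(parts) == 2 {
		right = strings.Fields(parts[1])
	}
	return attrs, left, right
}

func main() {
	line := "<link type='0-3' xtargets=';cs:Adams-Holisticka_det_k:0:7:1 " +
		"cs:Adams-Holisticka_det_k:0:7:2 cs:Adams-Holisticka_det_k:0:7:3' status='man' />"
	attrs, left, right := parseLink(line)
	fmt.Println(attrs["type"], len(left), len(right))
}
```

For the sample line, the `type` value `0-3` agrees with the split result: no IDs before the semicolon and three after it.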
In some cases you may want to tweak the line buffer size (the value is in bytes; by default bufio.MaxScanTokenSize = 64 * 1024 is used, which may fail in case of some complex alignments and/or long text identifiers). In case the buffer is too small, ictools will end with a fatal log event, returning a non-zero value to the shell.
ictools -line-buffer 250000 -registry-path /var/local/corpora/registry import ....etc...
Example:
Let's say we have two files with mappings between Polish and Czech (intercorp_pl2cs) and between English and Czech (intercorp_en2cs) where Czech is a pivot.
ictools -registry-path /var/local/corpora/registry import intercorp_v10_pl intercorp_v10_cs s.id /var/local/corpora/aligndef/intercorp_pl2cs > intercorp.pl2cs
ictools -registry-path /var/local/corpora/registry import intercorp_v10_en intercorp_v10_cs s.id /var/local/corpora/aligndef/intercorp_en2cs > intercorp.en2cs
The transalign operation takes two numeric alignments against a common pivot language and generates a new alignment between the two non-pivot languages.
Example:
ictools transalign ./intercorp.pl2cs ./intercorp.en2cs > intercorp.pl2en
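The idea of going through a pivot can be sketched in Go as a simple join on the pivot position. This is a deliberately simplified illustration, not the ictools algorithm: the `Pair` type and the `viaPivot` function are hypothetical, each row is assumed to hold single positions rather than ranges, -1 is assumed to mark "no counterpart", and only rows of the first language are emitted.

```go
package main

import "fmt"

// Pair is a hypothetical single alignment row: a position in a non-pivot
// corpus and the matching position in the pivot corpus (-1 = unaligned).
type Pair struct {
	Pos, Pivot int
}

// viaPivot joins two alignments sharing a pivot and returns rows of
// [position in A, position in B]; -1 marks a missing counterpart in B.
func viaPivot(a2p, b2p []Pair) [][2]int {
	// Index the second alignment by pivot position.
	pivotToB := make(map[int][]int)
	for _, p := range b2p {
		if p.Pivot >= 0 {
			pivotToB[p.Pivot] = append(pivotToB[p.Pivot], p.Pos)
		}
	}
	var out [][2]int
	for _, p := range a2p {
		targets := pivotToB[p.Pivot]
		if p.Pivot < 0 || len(targets) == 0 {
			// Nothing in B shares this pivot position.
			out = append(out, [2]int{p.Pos, -1})
			continue
		}
		for _, b := range targets {
			out = append(out, [2]int{p.Pos, b})
		}
	}
	return out
}

func main() {
	pl2cs := []Pair{{0, 0}, {1, 1}, {2, -1}}
	en2cs := []Pair{{0, 0}, {1, 2}}
	for _, row := range viaPivot(pl2cs, en2cs) {
		fmt.Println(row[0], row[1])
	}
}
```

Here only Polish position 0 and English position 0 share a Czech position, so the other Polish rows come out without an English counterpart.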
The export operation is able to reconstruct the XML-ish source used as an input for the import operation using numeric alignment files as produced by the import -> transalign operations. Any grouped intervals are split back to the original text groups.
Example:
ictools -export-type intercorp export /corpora/registry/intercorp_v12_cs /corpora/registry/intercorp_v12_en s.id /corpora/aligndef/intercorp.cs2en > orig.xml
ICTools comes with manabuild as its dependency. So in case you have ~/go/bin in your $PATH, everything needed to build ictools is:
manabuild
In case Manabuild finds Manatee-open in a non-standard location where the system does not look for libraries, it produces ictools.bin with the actual ICTools binary and ictools, which is a short Bash script that sets LD_LIBRARY_PATH to the path where Manabuild found Manatee and then starts the binary. So in this case, two files must be moved (or copied) to a target installation location (e.g. /usr/local/bin).
Used data files:
- intercorp_pl2cs (size 1.4GB)
- intercorp_pl2en (size 1.5GB)
Used hardware:
- A (a server)
  - CPU: Intel Xeon E5-2640 v3 @ 2.60GHz
  - 64GB RAM
- B (a common Dell desktop)
  - CPU: Intel Core i5-2400 @ 3.10GHz
  - 8GB RAM
Setup | Used program | calign+fixgaps+compress [sec] | transalign [sec] | total [sec] |
---|---|---|---|---|
A | classic scripts | 255 | 191 | 446 |
A | ictools | 164 | 55 | 219 |
B | classic scripts | 312 | DNF (RAM) | DNF |
B | ictools | 175 | 63 | 238 |
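Ictools are approximately twice as fast as the original Python scripts (219 s vs. 446 s in total on setup A; on setup B the classic scripts did not finish because they ran out of RAM).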
In terms of memory usage, no thorough measurements were performed, but according to the top utility, the transalign function in ictools consumes about 30-40% of the memory consumed by the classic scripts. The import function (i.e. calign+fixgaps+compress) in both programs consumes only a little RAM because data read from an input file are (almost) immediately written to the output without any unnecessary memory allocation.
Run
manabuild -no-build
and copy CGO_LDFLAGS=..., CGO_CPPFLAGS=... and CGO_CXXFLAGS=... .
Open the debug environment (left column) and click the "gear" button to edit launch.json. Then set the proper environment variables (just like in the previous paragraph).
{
  "version": "0.2.0",
  "configurations": [
    {
      " .... parts are omitted here ... ": " ...",
      "env": {
        "CGO_LDFLAGS": "...",
        "CGO_CPPFLAGS": "...",
        "CGO_CXXFLAGS": "..."
      },
      " .... parts are omitted here ... ": " ..."
    }
  ]
}
Where the env. variables part is the one copied in the previous step.
manabuild -test