A program for calculating corpora alignments using a pivot language


czcorpus/ictools


This is a faster, less memory-consuming, integrated replacement for the legacy calign.py, compressrng.py, fixgaps.py and transalign.py scripts used to prepare numeric corpora alignment data from lists of structural attribute value mappings between languages. It also fixes some problems with missing ranges for unaligned structures that you can encounter when using those scripts. In addition, it provides an export function for performing the reverse operation.

Note: you still need the mkalign tool distributed along with Manatee-open to enable corpora alignments in KonText (or NoSkE).


Using ictools

Ictools provide three operations - import, transalign and export:

import

The import operation transforms an alignment XML file containing aligned string sentence IDs to a numeric form. It is able to handle non-existing alignments and gaps between ranges (including the last row range, where the structure size is always used to make sure the whole range is filled in).
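The gap handling described above can be illustrated with a minimal sketch. This is not the ictools implementation: the Range type and the fillGaps function are hypothetical names, and the sketch assumes the covered ranges are sorted and non-overlapping. It only shows the idea of closing the trailing gap with the structure size instead of the last input row.

```go
package main

import "fmt"

// Range is an inclusive interval of structure positions (hypothetical type).
type Range struct{ From, To int }

// fillGaps returns the ranges within [0, structSize-1] that are not covered
// by `covered`. The input is assumed to be sorted by From and non-overlapping.
func fillGaps(covered []Range, structSize int) []Range {
	var gaps []Range
	next := 0 // first position not yet accounted for
	for _, r := range covered {
		if r.From > next {
			gaps = append(gaps, Range{next, r.From - 1})
		}
		if r.To+1 > next {
			next = r.To + 1
		}
	}
	// The trailing gap is closed using the structure size,
	// not the last range found in the input.
	if next < structSize {
		gaps = append(gaps, Range{next, structSize - 1})
	}
	return gaps
}

func main() {
	covered := []Range{{0, 2}, {5, 6}}
	fmt.Println(fillGaps(covered, 10)) // [{3 4} {7 9}]
}
```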

In terms of the input format, a list of <link> elements is expected:

<link type='0-3' xtargets=';cs:Adams-Holisticka_det_k:0:7:1 cs:Adams-Holisticka_det_k:0:7:2 cs:Adams-Holisticka_det_k:0:7:3' status='man'/>

Please note that the parser does not care about XML validity (e.g. there is no need for a root element or even a proper nesting of elements).
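Because no valid XML document is required, such input can be processed line by line without an XML parser. The following is a rough sketch of that idea, not the parser ictools actually uses: the Link type, the parseLink function and the Left/Right naming of the two sides of xtargets are assumptions made for illustration, and only single-quoted attributes as in the sample above are handled.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// attrRe matches single-quoted attributes such as type='0-3'.
var attrRe = regexp.MustCompile(`(\w+)='([^']*)'`)

// Link holds one parsed <link> element (hypothetical type).
type Link struct {
	Type  string
	Left  []string // IDs before the ';' in xtargets
	Right []string // IDs after the ';' in xtargets
}

// parseLink extracts attributes from a single line without validating XML.
func parseLink(line string) (Link, bool) {
	attrs := map[string]string{}
	for _, m := range attrRe.FindAllStringSubmatch(line, -1) {
		attrs[m[1]] = m[2]
	}
	xt, ok := attrs["xtargets"]
	if !ok {
		return Link{}, false
	}
	sides := strings.SplitN(xt, ";", 2)
	if len(sides) != 2 {
		return Link{}, false
	}
	return Link{
		Type:  attrs["type"],
		Left:  strings.Fields(sides[0]), // empty when nothing is aligned on this side
		Right: strings.Fields(sides[1]),
	}, true
}

func main() {
	line := `<link type='0-3' xtargets=';cs:Adams-Holisticka_det_k:0:7:1 cs:Adams-Holisticka_det_k:0:7:2 cs:Adams-Holisticka_det_k:0:7:3' status='man'/>`
	l, ok := parseLink(line)
	fmt.Println(ok, l.Type, len(l.Left), len(l.Right)) // true 0-3 0 3
}
```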

In some cases you may want to tweak the line buffer size (the value is in bytes; by default bufio.MaxScanTokenSize = 64 * 1024 is used, which may fail for some complex alignments and/or long text identifiers). If the buffer is too small, ictools ends with a fatal log event and returns a non-zero exit code to the shell.

ictools -line-buffer 250000 -registry-path /var/local/corpora/registry import ....etc...

Example:

Let's say we have two files with mappings between Polish and Czech (intercorp_pl2cs) and between English and Czech (intercorp_en2cs) where Czech is the pivot.

ictools -registry-path /var/local/corpora/registry import intercorp_v10_pl intercorp_v10_cs s.id /var/local/corpora/aligndef/intercorp_pl2cs > intercorp.pl2cs
ictools -registry-path /var/local/corpora/registry import intercorp_v10_en intercorp_v10_cs s.id /var/local/corpora/aligndef/intercorp_en2cs > intercorp.en2cs

transalign

The transalign operation takes two numeric alignments against a common pivot language and generates a new alignment between the two non-pivot languages.

Example:

ictools transalign ./intercorp.pl2cs ./intercorp.en2cs > intercorp.pl2en
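Conceptually, the operation is a join of two alignments on the shared pivot positions. The sketch below is a deliberately simplified illustration and not the ictools algorithm: it assumes one-to-one pairs of single positions instead of ranges, and the Pair type and transalign function are hypothetical names.

```go
package main

import "fmt"

// Pair maps one structure position in a language to a position in the pivot.
type Pair struct{ Lang, Pivot int }

// transalign joins two alignments on their pivot positions and returns
// pairs of (position in the first language, position in the second language).
// Pivot positions present in only one of the inputs produce no output pair.
func transalign(a, b []Pair) [][2]int {
	byPivot := make(map[int][]int)
	for _, p := range a {
		byPivot[p.Pivot] = append(byPivot[p.Pivot], p.Lang)
	}
	var out [][2]int
	for _, q := range b {
		for _, l := range byPivot[q.Pivot] {
			out = append(out, [2]int{l, q.Lang})
		}
	}
	return out
}

func main() {
	pl2cs := []Pair{{0, 0}, {1, 2}, {2, 3}}
	en2cs := []Pair{{0, 0}, {1, 1}, {2, 3}}
	fmt.Println(transalign(pl2cs, en2cs)) // [[0 0] [2 2]]
}
```

A real implementation also has to deal with ranges and with structures that have no counterpart, which this sketch leaves out.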

export

The export operation is able to reconstruct the XML-ish source used as an input for the import operation using numeric alignment files as produced by the import -> transalign operations. Any grouped intervals are split back to the original text groups.

Example:

ictools -export-type intercorp export /corpora/registry/intercorp_v12_cs /corpora/registry/intercorp_v12_en s.id /corpora/aligndef/intercorp.cs2en > orig.xml

How to build ictools

ICTools comes with manabuild as a dependency. If you have ~/go/bin in your $PATH, all that is needed to build ictools is:

manabuild

If Manabuild finds Manatee-open in a non-standard location where the system does not look for libraries, it produces ictools.bin, the actual ICTools binary, and ictools, a short Bash script that sets LD_LIBRARY_PATH to the path where Manabuild found Manatee and then starts the binary. In this case, both files must be moved (or copied) to the target installation location (e.g. /usr/local/bin).

Benchmark

Used data files:

  • intercorp_pl2cs (size 1.4GB)
  • intercorp_pl2en (size 1.5GB)

Used hardware:

  • A (a server)
    • CPU: Intel Xeon E5-2640 v3 @ 2.60GHz
    • 64GB RAM
  • B (a common Dell desktop)
    • CPU: Intel Core i5-2400 @ 3.10GHz
    • 8GB RAM

Setup  Used program     calign+fixgaps+compress [sec]  transalign [sec]  total [sec]
A      classic scripts  255                            191               446
A      ictools          164                            55                219
B      classic scripts  312                            DNF (RAM)         DNF
B      ictools          175                            63                238

Ictools are approximately twice as fast as the original Python scripts.

In terms of memory usage, no thorough measurements were performed, but according to the top utility the transalign function in ictools consumes about 30-40% of the memory consumed by the classic scripts. The import function (i.e. calign+fixgaps+compress) consumes only a little RAM in both programs because data read from an input file are (almost) immediately written to the output without any unnecessary memory allocation.

For developers

Setting up VSCode debugging/testing environment

Run

manabuild -no-build

and copy the CGO_LDFLAGS=..., CGO_CPPFLAGS=... and CGO_CXXFLAGS=... values.

Open the debug environment (left column) and click the "gear" button to edit launch.json. Then set the proper environment variables (just like in the previous paragraph).

{
  "version": "0.2.0",
  "configurations": [
    {
      " ....  parts are omitted here ... ": " ...",
      "env": {
        "CGO_LDFLAGS": "...",
        "CGO_CPPFLAGS": "...",
        "CGO_CXXFLAGS": "..."
      },
      " ....  parts are omitted here ... ": " ...",
    }
  ]
}

Here, the env part holds the variables copied in the previous step.

Running tests

manabuild -test
