

Read and process constituency trees in various formats.
    cargo install lumberjack-utils
    cargo install --git https://github.com/sebpuetz/lumberjack
- Convert treebank in NEGRA export 4 format to bracketed TueBa V2 format
    lumberjack-conversion --input_file treebank.negra --input_format negra \
        --output_format tueba --output_file treebank.tueba --projectivize
- Retain only the root node, NPs and PPs and print to simple bracketed format:
    echo "NP PP" > filter_set.txt
    lumberjack-conversion --input_file treebank.simple --input_format simple \
        --output_format tueba --output_file treebank.filtered \
        --filter filter_set.txt
- Convert a treebank from simple bracketed format to CONLLX format and annotate parent tags of terminals as features:
    lumberjack-conversion --input_file treebank.simple --input_format simple \
        --output_format conllx --output_file treebank.conll --parent
- Modifications in the following order:
  - Reattach all terminals with part-of-speech starting with $ to the root node.
  - Remove all nonterminals except the root, Ss, NPs, PPs and VPs.
  - Assign unique identifiers based on the closest S to terminals.
  - Insert nodes with label "label" above terminals that aren't dominated by NP or PP.
  - Annotate label of parent node on terminals.
  - Print to CONLLX format with annotations.
    echo "S VP NP PP" > filter_set.txt
    echo "NP PP" > insert_set.txt
    echo "S" > id_set.txt
    lumberjack-conversion --input_file treebank.simple --input_format simple \
        --output_format conllx --filter filter_set.txt \
        --insertion_set insert_set.txt --insertion_label label \
        --id_set id_set.txt --reattach '$' \
        --parent parent --output_file treebank.conllx
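The filtering step used in the examples above can be pictured with a small self-contained sketch. It does not use lumberjack's API: the `Node` type and the functions `inner`, `term`, `filter_below`, `filter_tree` and `to_bracketed` are hypothetical names introduced only for illustration. It also assumes that the children of a removed nonterminal are reattached to the nearest retained ancestor, which may differ in detail from what the tool does.

```rust
/// A minimal constituency tree (hypothetical type, not lumberjack's `Tree`).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Inner { label: String, children: Vec<Node> },
    Terminal { pos: String, form: String },
}

/// Convenience constructor for a nonterminal node.
fn inner(label: &str, children: Vec<Node>) -> Node {
    Node::Inner { label: label.to_string(), children }
}

/// Convenience constructor for a terminal node.
fn term(pos: &str, form: &str) -> Node {
    Node::Terminal { pos: pos.to_string(), form: form.to_string() }
}

/// Filter the subtree rooted at `node`. If `node` is a nonterminal whose label
/// is not in `keep`, it is dropped and its (filtered) children are returned in
/// its place, so the word order is preserved.
fn filter_below(node: Node, keep: &[&str]) -> Vec<Node> {
    match node {
        Node::Terminal { .. } => vec![node],
        Node::Inner { label, children } => {
            let kept: Vec<Node> = children
                .into_iter()
                .flat_map(|child| filter_below(child, keep))
                .collect();
            if keep.contains(&label.as_str()) {
                vec![Node::Inner { label, children: kept }]
            } else {
                kept
            }
        }
    }
}

/// Filter a whole tree. The root is always retained, whatever its label.
fn filter_tree(root: Node, keep: &[&str]) -> Node {
    match root {
        Node::Inner { label, children } => Node::Inner {
            label,
            children: children
                .into_iter()
                .flat_map(|child| filter_below(child, keep))
                .collect(),
        },
        terminal => terminal,
    }
}

/// Render a tree in a simple bracketed notation.
fn to_bracketed(node: &Node) -> String {
    match node {
        Node::Terminal { pos, form } => format!("({} {})", pos, form),
        Node::Inner { label, children } => {
            let parts: Vec<String> = children.iter().map(to_bracketed).collect();
            format!("({} {})", label, parts.join(" "))
        }
    }
}
```

With the filter set "NP PP", a tree such as `(ROOT (S (NP (DET the) (N dog)) (VP (V barks))))` would come out of `filter_tree` as `(ROOT (NP (DET the) (N dog)) (V barks))` under these assumptions: S and VP are removed and their children move up to the root.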
- Read and projectivize trees from NEGRA format and print to simple bracketed format:
    use std::fs::File;
    use std::io::BufReader;

    use lumberjack::io::{NegraReader, PTBFormat};
    use lumberjack::Projectivize;

    fn print_negra(path: &str) {
        let file = File::open(path).unwrap();
        let reader = NegraReader::new(BufReader::new(file));
        for tree in reader {
            let mut tree = tree.unwrap();
            // Projectivize the tree before printing it in bracketed format.
            tree.projectivize();
            println!("{}", PTBFormat::Simple.tree_to_string(&tree).unwrap());
        }
    }
- Filter non-terminal nodes from trees in a treebank and print to simple bracketed format:
    use lumberjack::{io::PTBFormat, util::LabelSet, Tree, TreeOps};

    fn filter_nodes(iter: impl Iterator<Item = Tree>, set: LabelSet) {
        for mut tree in iter {
            // The closure decides per nonterminal, based on whether its label matches the set.
            tree.filter_nonterminals(|tree, nt| set.matches(tree[nt].label()))
                .unwrap();
            println!("{}", PTBFormat::Simple.tree_to_string(&tree).unwrap());
        }
    }
- Convert a treebank in simple bracketed format to CONLLX with constituency structure encoded in the features field:
    use conllx::graph::Sentence;
    use lumberjack::io::Encode;
    use lumberjack::{Tree, TreeOps, UnaryChains};

    fn to_conllx(iter: impl Iterator<Item = Tree>) {
        for mut tree in iter {
            // Collapse unary chains, then annotate the structure on the terminals.
            tree.collapse_unary_chains().unwrap();
            tree.annotate_absolute().unwrap();
            println!("{}", Sentence::from(&tree));
        }
    }
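To make the idea of collapsing unary chains more concrete, here is a minimal sketch on a toy tree type. It is not lumberjack's implementation: `Node`, `inner`, `term`, `collapse_chains` and `to_bracketed` are hypothetical names, the `_` delimiter in the usage note is an arbitrary choice, and the sketch assumes that only chains of nonterminals are merged (terminals keep their own part-of-speech tag) and that the root is treated like any other node.

```rust
/// A minimal constituency tree (hypothetical type, not lumberjack's `Tree`).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Inner { label: String, children: Vec<Node> },
    Terminal { pos: String, form: String },
}

/// Convenience constructor for a nonterminal node.
fn inner(label: &str, children: Vec<Node>) -> Node {
    Node::Inner { label: label.to_string(), children }
}

/// Convenience constructor for a terminal node.
fn term(pos: &str, form: &str) -> Node {
    Node::Terminal { pos: pos.to_string(), form: form.to_string() }
}

/// Merge every chain of nonterminals in which each node has exactly one
/// nonterminal child into a single node. The merged label joins the labels
/// of the chain, top to bottom, with `delim`.
fn collapse_chains(node: Node, delim: &str) -> Node {
    match node {
        Node::Terminal { .. } => node,
        Node::Inner { mut label, mut children } => {
            // Absorb the only child for as long as it is itself a nonterminal.
            while children.len() == 1 && matches!(children[0], Node::Inner { .. }) {
                if let Some(Node::Inner { label: child_label, children: grandchildren }) =
                    children.pop()
                {
                    label = format!("{}{}{}", label, delim, child_label);
                    children = grandchildren;
                }
            }
            Node::Inner {
                label,
                children: children
                    .into_iter()
                    .map(|child| collapse_chains(child, delim))
                    .collect(),
            }
        }
    }
}

/// Render a tree in a simple bracketed notation.
fn to_bracketed(node: &Node) -> String {
    match node {
        Node::Terminal { pos, form } => format!("({} {})", pos, form),
        Node::Inner { label, children } => {
            let parts: Vec<String> = children.iter().map(to_bracketed).collect();
            format!("({} {})", label, parts.join(" "))
        }
    }
}
```

Under these assumptions, calling `collapse_chains` with the delimiter `_` on `(S (NP (NX (N dogs))) (VP (V bark)))` yields `(S (NP_NX (N dogs)) (VP (V bark)))`. Joining the labels keeps enough information to restore the chain later by splitting on the delimiter, provided the delimiter does not occur in any label.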