@@ -14,7 +14,7 @@ Download the BERT repository, BERT Japanese pre-trained model, QA pairs in Amaga
The data structure is as follows.
```
data
-├── bert : BERT original repository
+├── bert : the fork of the original BERT repository
├── Japanese_L-12_H-768_A-12_E-30_BPE : BERT Japanese pre-trained model
└── localgovfaq
    ├── qas : QA pairs in Amagasaki City FAQ
The details about localgovFAQ are in [localgovFAQ.md](localgovFAQ.md).

### BERT application for FAQ retrieval
-We should add the following task class to run_classifier.py in the original BERT repository.
-```python
-class CQAProcessor(DataProcessor):
-  """Processor for the CQA data set."""
-
-  def get_train_examples(self, data_dir):
-    """See base class."""
-    return self._create_examples(
-        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
-
-  def get_dev_examples(self, data_dir):
-    """See base class."""
-    return self._create_examples(
-        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
-
-  def get_test_examples(self, data_dir):
-    """See base class."""
-    return self._create_examples(
-        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
-
-  def get_labels(self):
-    """See base class."""
-    return ["0", "1"]
-
-  def _create_examples(self, lines, set_type):
-    """Creates examples for the training and dev sets."""
-    examples = []
-    for (i, line) in enumerate(lines):
-      guid = "%s-%s" % (set_type, i)
-      text_a = tokenization.convert_to_unicode(line[1])
-      text_b = tokenization.convert_to_unicode(line[2])
-      label = tokenization.convert_to_unicode(line[0])
-      examples.append(
-          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
-    return examples
-
-def main(_):
-  tf.logging.set_verbosity(tf.logging.INFO)
-
-  processors = {
-      "cqa": CQAProcessor,
-  }
-```
-
-**For Japanese, we need to comment out `text = self._tokenize_chinese_chars(text)` in tokenization.py in the BERT repository.**

Finetune and evaluate.
```shell
7631make -f Makefile.generate_dataset OUTPUT_DIR=/path/to/data_dir
-make -f Makefile.run_classifier BERT_DATA_DIR=/path/to/data_dir OUTPUT_DIR=/path/to/somewhere
+make -f Makefile.run_classifier BERT_DATA_DIR=/path/to/data_dir OUTPUT_DIR=/path/to/somewhere JAPANESE=true
```
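The TSV files fed to `Makefile.run_classifier` follow the layout that `CQAProcessor._create_examples` above expects: column 0 holds the 0/1 relevance label, column 1 holds `text_a` (the user query), and column 2 holds `text_b` (the FAQ entry). A minimal sketch of that layout, with hypothetical example rows and an in-memory buffer standing in for train.tsv:

```python
import csv
import io

# Hypothetical rows in the layout CQAProcessor._create_examples expects:
# column 0 = label ("1" = relevant, "0" = irrelevant),
# column 1 = text_a (user query), column 2 = text_b (FAQ entry).
rows = [
    ("1", "ごみの収集日はいつですか", "ごみの収集日を教えてください"),
    ("0", "ごみの収集日はいつですか", "住民票の写しはどこで取れますか"),
]

# Write the rows as tab-separated values (stand-in for train.tsv).
buf = io.StringIO()
csv.writer(buf, delimiter="\t").writerows(rows)

# Read them back with csv.reader and a tab delimiter, as BERT's
# DataProcessor._read_tsv does, and split out the three columns.
buf.seek(0)
lines = list(csv.reader(buf, delimiter="\t"))
label, text_a, text_b = lines[0][0], lines[0][1], lines[0][2]
```

Each such line then becomes one `InputExample`, pairing a query with an FAQ entry labeled as relevant (`"1"`) or not (`"0"`).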

An example of the results is below.
@@ -92,8 +47,8 @@ We can get the higher score by using both TSUBAKI and BERT.

We can evaluate the joint model with the command below.
```shell
-python scripts/merge_tsubaki_bert_results.py --bert localgovfaq/samples/bert.txt \
-        --tsubaki localgovfaq/samples/tsubaki.txt \
+python scripts/merge_tsubaki_bert_results.py --bert data/localgovfaq/samples/bert.txt \
+        --tsubaki data/localgovfaq/samples/tsubaki.txt \
        --threshold 0.3 \
        --tsubaki_ratio 10 > /path/to/resultfile.txt
python scripts/calculate_score.py --testset data/localgovfaq/testset_segmentation.txt \