Commit f680466

Author: Sakata Wataru
Commit message: add JAPANESE option

1 parent 6bacb0e · commit f680466

3 files changed: +13 −52 lines changed

Makefile.run_classifier

Lines changed: 8 additions & 1 deletion

````diff
@@ -23,14 +23,21 @@ TEST_W_EVAL := $(OUTPUT_DIR)/test_w_eval.txt
 
 TEST_SET_FILE := data/localgovfaq/testset_segmentation.txt
 
+JAPANESE := false
+BERT_ARGS := --task_name=CQA --do_train=true --do_predict=true --data_dir=$(BERT_DATA_DIR) --vocab_file=$(VOCAB_FILE) --bert_config_file=$(BERT_CONFIG_FILE) --init_checkpoint=$(BERT_MODEL_DIR)/bert_model.ckpt --max_seq_length=128 --train_batch_size=32 --learning_rate=2e-5 --num_train_epochs=3.0 --output_dir=$(OUTPUT_DIR)
+
+ifeq ($(JAPANESE),true)
+BERT_ARGS += --japanese=true --do_lower_case=false
+endif
+
 all: $(TEST_W_EVAL)
 
 # fine tuneing
 $(TEST_RESULT):
 	mkdir -p $(OUTPUT_DIR) && \
 	ln -s $(BERT_CONFIG_FILE) $(OUTPUT_DIR)/bert_config.json && \
 	ln -s $(VOCAB_FILE) $(OUTPUT_DIR)/vocab.txt &&\
-	$(PYTHON_GPU) $(BERT_DIR)/run_classifier.py --task_name=CQA --do_lower_case=false --do_train=true --do_predict=true --data_dir=$(BERT_DATA_DIR) --vocab_file=$(VOCAB_FILE) --bert_config_file=$(BERT_CONFIG_FILE) --init_checkpoint=$(BERT_MODEL_DIR)/bert_model.ckpt --max_seq_length=128 --train_batch_size=32 --learning_rate=2e-5 --num_train_epochs=3.0 --output_dir=$(OUTPUT_DIR) 2>&1 | tee $(OUTPUT_DIR)/log.txt
+	$(PYTHON_GPU) $(BERT_DIR)/run_classifier.py $(BERT_ARGS) 2>&1 | tee $(OUTPUT_DIR)/log.txt
 
 $(TEST_ARRANGED): $(TEST_RESULT)
 	paste $(TEST_RESULT) $(TEST_TSV) | $(PYTHON_CPU) scripts/generate_evaluation_file.py --question_file $(TARGET_QUESTIONS_FILE) > $@
````
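The new `JAPANESE` switch only changes which flags reach the forked run_classifier.py: when `JAPANESE=true` is given on the make command line (a command-line assignment overrides the `JAPANESE := false` default), `--japanese=true --do_lower_case=false` are appended to `BERT_ARGS`. The sketch below is a hypothetical, self-contained illustration of what a `--japanese` flag presumably gates in the fork, based on the note this commit removes from download.sh: Google's original tokenization.py always isolates every CJK ideograph via `_tokenize_chinese_chars`, which destroys pre-segmented Japanese input. All function names here are illustrative; the fork's actual code may differ.

```python
# Hypothetical, self-contained sketch -- NOT code from the ku-nlp/bert fork.
# It illustrates the behaviour a --japanese flag presumably gates: the
# original BasicTokenizer always wraps each CJK ideograph in spaces, which
# splits pre-segmented Japanese into single characters.

def _is_cjk_codepoint(cp):
    # Abbreviated version of the ranges the original _is_chinese_char()
    # checks (common CJK Unified Ideographs blocks only).
    return (0x4E00 <= cp <= 0x9FFF or
            0x3400 <= cp <= 0x4DBF or
            0xF900 <= cp <= 0xFAFF)

def _tokenize_chinese_chars(text):
    # Mirrors the original behaviour: put spaces around every ideograph.
    out = []
    for ch in text:
        out.append(" %s " % ch if _is_cjk_codepoint(ord(ch)) else ch)
    return "".join(out)

def basic_tokenize(text, japanese=False):
    # Assumed gate: skip per-character splitting when --japanese=true, so
    # the Japanese BPE vocabulary sees whole pre-segmented words.
    if not japanese:
        text = _tokenize_chinese_chars(text)
    return text.split()

if __name__ == "__main__":
    query = "図書館 の 場所"  # whitespace-segmented Japanese, as in localgovfaq
    print(basic_tokenize(query))                 # ['図', '書', '館', 'の', '場', '所']
    print(basic_tokenize(query, japanese=True))  # ['図書館', 'の', '場所']
```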

README.md

Lines changed: 4 additions & 49 deletions

````diff
@@ -14,7 +14,7 @@ Download the BERT repository, BERT Japanese pre-trained model, QA pairs in Amaga
 The data structure is below.
 ```
 data
-├── bert : BERT original repository
+├── bert : The forked repository from BERT original repository
 ├── Japanese_L-12_H-768_A-12_E-30_BPE : BERT Japanese pre-trained model
 └── localgovfaq
     ├── qas : QA pairs in Amagasaki City FAQ
@@ -25,56 +25,11 @@ data
 The detail about localgovFAQ is on [localgovFAQ.md](localgovFAQ.md).
 
 ### BERT application for FAQ retrieval
-And we should add the task class to run_classifier.py in the original BERT repository as below.
-```python
-class CQAProcessor(DataProcessor):
-  """Processor for the CoLA data set (GLUE version)."""
-
-  def get_train_examples(self, data_dir):
-    """See base class."""
-    return self._create_examples(
-        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
-
-  def get_dev_examples(self, data_dir):
-    """See base class."""
-    return self._create_examples(
-        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
-
-  def get_test_examples(self, data_dir):
-    """See base class."""
-    return self._create_examples(
-        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
-
-  def get_labels(self):
-    """See base class."""
-    return ["0", "1"]
-
-  def _create_examples(self, lines, set_type):
-    """Creates examples for the training and dev sets."""
-    examples = []
-    for (i, line) in enumerate(lines):
-      guid = "%s-%s" % (set_type, i)
-      text_a = tokenization.convert_to_unicode(line[1])
-      text_b = tokenization.convert_to_unicode(line[2])
-      label = tokenization.convert_to_unicode(line[0])
-      examples.append(
-          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
-    return examples
-
-def main(_):
-  tf.logging.set_verbosity(tf.logging.INFO)
-
-  processors = {
-      "cqa": CQAProcessor,
-  }
-```
-
-**For Japanese, we need to comment out `text = self._tokenize_chinese_chars(text)` in tokenization.py in BERT repository.**
 
 Finetune and evaluate.
 ```shell
 make -f Makefile.generate_dataset OUTPUT_DIR=/path/to/data_dir
-make -f Makefile.run_classifier BERT_DATA_DIR=/path/to/data_dir OUTPUT_DIR=/path/to/somewhere
+make -f Makefile.run_classifier BERT_DATA_DIR=/path/to/data_dir OUTPUT_DIR=/path/to/somewhere JAPANESE=true
 ```
 
 The result example is below.
@@ -92,8 +47,8 @@ We can get the higher score by using both TSUBAKI and BERT.
 
 We can evaluate the joint model by the below command.
 ```shell
-python scripts/merge_tsubaki_bert_results.py --bert localgovfaq/samples/bert.txt \
-                                             --tsubaki localgovfaq/samples/tsubaki.txt \
+python scripts/merge_tsubaki_bert_results.py --bert data/localgovfaq/samples/bert.txt \
+                                             --tsubaki data/localgovfaq/samples/tsubaki.txt \
                                              --threshold 0.3 \
                                              --tsubaki_ratio 10 > /path/to/resultfile.txt
 python scripts/calculate_score.py --testset data/localgovfaq/testset_segmentation.txt \
````
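The path fix above points `--bert`/`--tsubaki` at the sample files shipped under `data/localgovfaq/samples/`. For readers who want the gist of the merge step, the sketch below shows one plausible way per-candidate scores could be combined given only the flag names in the command (`--threshold`, `--tsubaki_ratio`); the weighting rule and the helper name `merge_scores` are assumptions for illustration, not the actual logic of scripts/merge_tsubaki_bert_results.py.

```python
# Hypothetical sketch of a TSUBAKI+BERT score merge -- NOT the actual logic
# of scripts/merge_tsubaki_bert_results.py. The README's claim is only that
# combining TSUBAKI (query-question similarity) with BERT (query-answer
# relevance) scores higher than either alone.

def merge_scores(tsubaki, bert, threshold=0.3, tsubaki_ratio=10):
    """Combine per-candidate scores for one query.

    tsubaki, bert: dicts mapping candidate FAQ id -> score.
    Returns candidate ids sorted best-first.
    """
    merged = {}
    for faq_id in set(tsubaki) | set(bert):
        t = tsubaki.get(faq_id, 0.0)
        b = bert.get(faq_id, 0.0)
        # Assumed rule: trust TSUBAKI (weighted by tsubaki_ratio) only when
        # its lexical-match score clears the threshold; otherwise fall back
        # to the BERT relevance score alone.
        merged[faq_id] = tsubaki_ratio * t + b if t >= threshold else b
    return sorted(merged, key=merged.get, reverse=True)

if __name__ == "__main__":
    tsubaki = {"faq1": 0.8, "faq2": 0.1}
    bert = {"faq1": 0.2, "faq2": 0.9, "faq3": 0.6}
    print(merge_scores(tsubaki, bert))  # ['faq1', 'faq2', 'faq3']
```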

download.sh

Lines changed: 1 addition & 2 deletions

````diff
@@ -1,7 +1,6 @@
 mkdir ./data
 cd data
-git clone https://github.com/google-research/bert.git
-#For Japanese, comment out text = self._tokenize_chinese_chars(text) in tokenization.py
+git clone git@github.com:ku-nlp/bert.git
 curl http://nlp.ist.i.kyoto-u.ac.jp/nl-resource/JapaneseBertPretrainedModel/Japanese_L-12_H-768_A-12_E-30_BPE.zip > Japanese_L-12_H-768_A-12_E-30_BPE.zip
 unzip Japanese_L-12_H-768_A-12_E-30_BPE.zip
 curl https://tulip.kuee.kyoto-u.ac.jp/localgovfaq/localgovfaq.zip > localgovfaq.zip
````

0 commit comments
