@@ -14,7 +14,7 @@ Download the BERT repository, BERT Japanese pre-trained model, QA pairs in Amaga
The data structure is as follows.
```
data
-├── bert : BERT original repository
+├── bert : the fork of the original BERT repository
├── Japanese_L-12_H-768_A-12_E-30_BPE : BERT Japanese pre-trained model
└── localgovfaq
    ├── qas : QA pairs in Amagasaki City FAQ
The details about localgovFAQ are in [localgovFAQ.md](localgovFAQ.md).

### BERT application for FAQ retrieval
-We should add the following task class to run_classifier.py in the original BERT repository.
-```python
-class CQAProcessor(DataProcessor):
-  """Processor for the CQA data set."""
-
-  def get_train_examples(self, data_dir):
-    """See base class."""
-    return self._create_examples(
-        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
-
-  def get_dev_examples(self, data_dir):
-    """See base class."""
-    return self._create_examples(
-        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
-
-  def get_test_examples(self, data_dir):
-    """See base class."""
-    return self._create_examples(
-        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
-
-  def get_labels(self):
-    """See base class."""
-    return ["0", "1"]
-
-  def _create_examples(self, lines, set_type):
-    """Creates examples for the training and dev sets."""
-    examples = []
-    for (i, line) in enumerate(lines):
-      guid = "%s-%s" % (set_type, i)
-      text_a = tokenization.convert_to_unicode(line[1])
-      text_b = tokenization.convert_to_unicode(line[2])
-      label = tokenization.convert_to_unicode(line[0])
-      examples.append(
-          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
-    return examples
-
-def main(_):
-  tf.logging.set_verbosity(tf.logging.INFO)
-
-  processors = {
-      "cqa": CQAProcessor,
-  }
-```
-
-**For Japanese, we need to comment out `text = self._tokenize_chinese_chars(text)` in tokenization.py in the BERT repository.**

Finetune and evaluate.
```shell
7631make -f Makefile.generate_dataset OUTPUT_DIR=/path/to/data_dir
-make -f Makefile.run_classifier BERT_DATA_DIR=/path/to/data_dir OUTPUT_DIR=/path/to/somewhere
+make -f Makefile.run_classifier BERT_DATA_DIR=/path/to/data_dir OUTPUT_DIR=/path/to/somewhere JAPANESE=true
```
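The TSV files fed to `Makefile.run_classifier` follow the layout that `CQAProcessor._create_examples` above expects: column 0 holds the 0/1 relevance label, column 1 holds `text_a` (the user query), and column 2 holds `text_b` (the FAQ entry). A minimal sketch of that layout, with hypothetical example rows and an in-memory buffer standing in for train.tsv:

```python
import csv
import io

# Hypothetical rows in the layout CQAProcessor._create_examples expects:
# column 0 = label ("1" = relevant, "0" = irrelevant),
# column 1 = text_a (user query), column 2 = text_b (FAQ entry).
rows = [
    ("1", "ごみの収集日はいつですか", "ごみの収集日を教えてください"),
    ("0", "ごみの収集日はいつですか", "住民票の写しはどこで取れますか"),
]

# Write the rows as tab-separated values (stand-in for train.tsv).
buf = io.StringIO()
csv.writer(buf, delimiter="\t").writerows(rows)

# Read them back with csv.reader and a tab delimiter, as BERT's
# DataProcessor._read_tsv does, and split out the three columns.
buf.seek(0)
lines = list(csv.reader(buf, delimiter="\t"))
label, text_a, text_b = lines[0][0], lines[0][1], lines[0][2]
```

Each such line then becomes one `InputExample`, pairing a query with an FAQ entry labeled as relevant (`"1"`) or not (`"0"`).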

An example of the results is below.
@@ -92,8 +47,8 @@ We can get the higher score by using both TSUBAKI and BERT.

We can evaluate the joint model with the command below.
```shell
-python scripts/merge_tsubaki_bert_results.py --bert localgovfaq/samples/bert.txt \
-        --tsubaki localgovfaq/samples/tsubaki.txt \
+python scripts/merge_tsubaki_bert_results.py --bert data/localgovfaq/samples/bert.txt \
+        --tsubaki data/localgovfaq/samples/tsubaki.txt \
        --threshold 0.3 \
        --tsubaki_ratio 10 > /path/to/resultfile.txt
python scripts/calculate_score.py --testset data/localgovfaq/testset_segmentation.txt \