chenyuntc/PyTorchTextPublic

NotificationsYou must be signed in to change notification settings
Fork363
Star1.1k

1st Place Solution for Zhihu Machine Learning Challenge . Implementation of various text-classification models.(知乎看山杯第一名解决方案)

License

MIT license

1.1k stars 363 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
checkpoints		checkpoints
data		data
del		del
models		models
notebooks		notebooks
scripts		scripts
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
config.py		config.py
main-all.1.py		main-all.1.py
main-all.py		main-all.py
main.py		main.py
readme-zh.md		readme-zh.md
readme.md		readme.md
rep.py		rep.py
requirements.txt		requirements.txt
test.1.py		test.1.py
test.3.py		test.3.py
看山杯-博士例会分享.pptx		看山杯-博士例会分享.pptx
看山杯init解决方案.pdf		看山杯init解决方案.pdf
说明.md		说明.md

Repository files navigation

中文用户请查看 readme-zh.md

This is the solution forZhihu Machine Learning Challenge 2017. We won the champion out of 963 teams.

1. Setup

install PyTorch frompytorch.org (Python 2, CUDA)
install other depencies:
```
pip2 install -r requirements.txt
```

You may needtf.contrib.keras.preprocessing.sequence.pad_sequences for data preprocessing.

start visdom for visualization:
```
python2 -m visdom.server
```

2. Data Preprocessing

Modify the data path in the related file

2.1 wordvector file -> numpy file

python scripts/data_process/embedding2matrix.py main char_embedding.txt char_embedding.npz python scripts/data_process/embedding2matrix.py main word_embedding.txt word_embedding.npz

2.2 question set -> numpy file

it's memory consuming , make sure you have memory larger than 32G.

python scripts/data_process/question2array.py main question_train_set.txt train.npzpython scripts/data_process/question2array.py main question_eval_set.txt test.npz

2.3 label -> json

python scripts/data_process/label2id.py main question_topic_train_set.txt labels.json

2.4 validation data

python scripts/data_process/get_val.py

3. Training

modifyconfig.py for model path

Path to the models we used:

CNN:models/MultiCNNTextBNDeep.py
RNN（LSTM）:models/LSTMText.py
RCNN:models/RCNN.py
inception:models/CNNText_inception.py
FastText:models/FastText3.py

3.1 Trian model without data augumentation

# LSTM charpython2 main.py main --max_epoch=5 --plot_every=100 --env='lstm_char' --weight=1 --model='LSTMText'  --batch-size=128  --lr=0.001 --lr2=0 --lr_decay=0.5 --decay_every=10000  --type_='char'   --zhuge=True --linear-hidden-size=2000 --hidden-size=256 --kmax-pooling=3   --num-layers=3  --augument=False# LSTM wordpython2 main.py main --max_epoch=5 --plot_every=100 --env='lstm_word' --weight=1 --model='LSTMText'  --batch-size=128  --lr=0.001 --lr2=0.0000 --lr_decay=0.5 --decay_every=10000  --type_='word'   --zhuge=True --linear-hidden-size=2000 --hidden-size=320 --kmax-pooling=2  --augument=False#  RCNN charpython2 main.py main --max_epoch=5 --plot_every=100 --env='rcnn_char' --weight=1 --model='RCNN'  --batch-size=128  --lr=0.001 --lr2=0 --lr_decay=0.5 --decay_every=5000  --title-dim=1024 --content-dim=1024  --type_='char' --zhuge=True --kernel-size=3 --kmax-pooling=2 --linear-hidden-size=2000 --debug-file='/tmp/debugrcnn' --hidden-size=256 --num-layers=3 --augument=False# RCNN wordmain.py main --max_epoch=5 --plot_every=100 --env='RCNN-word' --weight=1 --model='RCNN'  --zhuge=True --num-workers=4 --batch-size=128 --model-path=None --lr2=0 --lr=1e-3 --lr-decay=0.8  --decay-every=5000  --title-dim=1024 --content-dim=512  --kernel-size=3 --debug-file='/tmp/debugrc'  --kmax-pooling=1 --type_='word' --augument=False# CNN word python main.py main --max_epoch=5 --plot_every=100 --env='MultiCNNText' --weight=1 --model='MultiCNNTextBNDeep'  --batch-size=64  --lr=0.001 --lr2=0.000 --lr_decay=0.8 --decay_every=10000  --title-dim=250 --content-dim=250    --weight-decay=0 --type_='word' --debug-file='/tmp/debug'  --linear-hidden-size=2000 --zhuge=True  --augument=False# inception wordpython2 main.py main --max_epoch=5 --plot_every=100 --env='inception-word' --weight=1 --model='CNNText_inception'  --zhuge=True --num-workers=4 --batch-size=512 --model-path=None --lr2=0 --lr=1e-3 --lr-decay=0.8  --decay-every=2500 --title-dim=1200 --content-dim=1200 --type_='word' --augument=False# inception charpython2 main.py main --max_epoch=5 --plot_every=100 --env='inception-char' --weight=1 --model='CNNText_inception'  --zhuge=True --num-workers=4 --batch-size=512 --model-path=None --lr2=0 --lr=1e-3 --lr-decay=0.8  --decay-every=2500 --title-dim=1200 --content-dim=1200 --type_='char'   --augument=False# FastText3 wordpython2 main.py main --max_epoch=5 --plot_every=100 --env='fasttext3-word' --weight=5 --model='FastText3' --zhuge=True --num-workers=4 --batch-size=512  --lr2=1e-4 --lr=1e-3 --lr-decay=0.8  --decay-every=2500 --linear_hidden_size=2000 --type_='word'  --debug-file=/tmp/debugf --augument=False

In most cases, the score could be boosted by finetune. for example:

python2 main.py main --max_epoch=2 --plot_every=100 --env='LSTMText-word-ft' --model='LSTMText'  --zhuge=True --num-workers=4 --batch-size=256 --model-path=None --lr2=5e-5 --lr=5e-5 --decay-every=5000 --type_='word'  --model-path='checkpoints/LSTMText_word_0.409196378421'

3.2 train models with data augumentation

Add--augument in the training command.

3.3 scores

model	score
CNN_word	0.4103
RNN_word	0.4119
RCNN_word	0.4115
Inceptin_word	0.4109
FastText_word	0.4091
RNN_char	0.4031
RCNN_char	0.4037
Inception_char	0.4024
RCNN_word_aug	0.41344
CNN_word_aug	0.41051
RNN_word_aug	0.41368
Incetpion_word_aug	0.41254
FastText3_word_aug	0.40853
CNN_char_aug	0.38738
RCNN_char_aug	0.39854

with model ensemble, it can get up to 0.433.

4 Test and Submit

4.1 Test

model: includeLSTMText,RCNN,MultiCNNTextBNDeep,FastText3,CNNText_inception
model-path: path to the pretrained model
result-path: where to save the model
val: test the val set or the test set..

# LSTMpython2 test.1.py main --model='LSTMText'  --batch-size=512  --model-path='checkpoints/LSTMText_word_0.411994005382' --result-path='/data_ssd/zhihu/result/LSTMText0.4119_word_test.pth'  --val=False --zhuge=Truepython2 test.1.py main --model='LSTMText'  --batch-size=256 --type_=char --model-path='checkpoints/LSTMText_char_0.403192339135' --result-path='/data_ssd/zhihu/result/LSTMText0.4031_char_test.pth'  --val=False --zhuge=True#RCNNpython2 test.1.py main --model='RCNN'  --batch-size=512  --model-path='checkpoints/RCNN_word_0.411511574999' --result-path='/data_ssd/zhihu/result/RCNN_0.4115_word_test.pth'  --val=False --zhuge=Truepython2 test.1.py main --model='RCNN'  --batch-size=512  --model-path='checkpoints/RCNN_char_0.403710422571' --result-path='/data_ssd/zhihu/result/RCNN_0.4037_char_test.pth'  --val=False --zhuge=True# DeepTextpython2 test.1.py main --model='MultiCNNTextBNDeep'  --batch-size=512  --model-path='checkpoints/MultiCNNTextBNDeep_word_0.410330780091' --result-path='/data_ssd/zhihu/result/DeepText0.4103_word_test.pth'  --val=False --zhuge=True# more to go ...

4.2 ensemble

Seenotebooks/val_ensemble.ipynb andnotebooks/test_ensemble.ipynb for more detail

5 Main files

main.py: main(for training)
config.py: config file
test.1.py: for test
data/: for data loader
scripts/: for data preprocessing
utils/ : including calculate score and wrapper for visualization.
models/: models
- models/BasicModel: Base model for models.
- models/MultiCNNTextBNDeep: CNN
- models/LSTMText: RNN
- models/RCNN: RCNN
- models/CNNText_inception Inception
- models/MultiModelALL 和models/MultiModelAll2
- other model
rep.py: code for reproducing.
del/: methods fail or not used.
notebooks/: notebooks.

Pretrained model

https://pan.baidu.com/s/1mjVtJGs passwd: tayb

About

1st Place Solution for Zhihu Machine Learning Challenge . Implementation of various text-classification models.(知乎看山杯第一名解决方案)

biendata.com/competition/zhihu/

Releases

No releases published

Packages

No packages published

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

License

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

中文用户请查看 readme-zh.md

1. Setup

2. Data Preprocessing

2.1 wordvector file -> numpy file

2.2 question set -> numpy file

2.3 label -> json

2.4 validation data

3. Training

3.1 Trian model without data augumentation

3.2 train models with data augumentation

3.3 scores

4 Test and Submit

4.1 Test

4.2 ensemble

5 Main files

Pretrained model

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages

Uh oh!

Languages

Movatterモバイル変換

License

chenyuntc/PyTorchText

Folders and files

Latest commit

History

Repository files navigation

中文用户请查看readme-zh.md

1. Setup

2. Data Preprocessing

2.1 wordvector file -> numpy file

2.2 question set -> numpy file

2.3 label -> json

2.4 validation data

3. Training

3.1 Trian model without data augumentation

3.2 train models with data augumentation

3.3 scores

4 Test and Submit

4.1 Test

4.2 ensemble

5 Main files

Pretrained model

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages0

Uh oh!

Languages

中文用户请查看 readme-zh.md

Packages