# UnihanLM

The official repository for "UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database", AACL-IJCNLP 2020.
The pretrained model is available on the 🤗 Hugging Face Model Hub: https://huggingface.co/microsoft/unihanlm-base
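For quick experimentation, the checkpoint can be loaded with Hugging Face `transformers`. This is a minimal sketch, assuming the checkpoint resolves through the generic `Auto` classes; it is not prescribed by this repository:

```python
# Hedged sketch: load the UnihanLM checkpoint via the generic Auto classes.
# The sample input and usage pattern are illustrative assumptions.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/unihanlm-base")
model = AutoModel.from_pretrained("microsoft/unihanlm-base")

inputs = tokenizer("漢字", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```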
We have made the code to find the Unihan clusters and the cached cluster IDs available here.
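Conceptually, the clusters can be viewed as connected components over the variant relations in the Unihan database. The sketch below is a hedged illustration of that idea, not the repository's actual code: the file name `Unihan_Variants.txt` and its tab-separated format are real, but the choice of variant fields and the union-find helper are assumptions.

```python
# Illustrative sketch: cluster characters as connected components
# over Unihan variant relations (assumed field choice, not the
# repository's exact procedure).
from collections import defaultdict

VARIANT_FIELDS = {"kSemanticVariant", "kSimplifiedVariant",
                  "kTraditionalVariant", "kZVariant"}

def load_edges(path="Unihan_Variants.txt"):
    """Parse lines like 'U+346E\tkSemanticVariant\tU+5915' into edges."""
    edges = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            code, field, value = line.rstrip("\n").split("\t")
            if field not in VARIANT_FIELDS:
                continue
            src = int(code[2:], 16)  # "U+4E0B" -> 0x4E0B
            for target in value.split():
                # Strip source annotations such as "U+4E0B<kMatthews".
                edges.append((src, int(target.split("<")[0][2:], 16)))
    return edges

def cluster(edges):
    """Union-find over the edges; returns lists of clustered characters."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for x in parent:
        groups[find(x)].append(x)
    return [[chr(c) for c in sorted(g)] for g in groups.values()]
```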
Please follow our paper and use the training code from facebookresearch/XLM:

- Preprocess your corpus by replacing every character with the first character of its cluster (see the sketch after this list).
- After cluster-level pretraining, copy the embedding of the first character in each cluster to every other character in the same cluster:

```python
import torch

# Propagate each cluster's shared embedding to all of its members.
with torch.no_grad():
    for cluster in clusters:
        for character_id in cluster[1:]:
            embedding.weight[character_id] = embedding.weight[cluster[0]].detach()
```

- Re-preprocess the corpus with the standard procedure (i.e., without the character replacement above).
- Restart training on the new corpus.
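A hedged sketch of the first step, assuming `clusters` is a list of lists of characters as produced above; the helper names and the mapping-table construction are illustrative, not the repository's exact code:

```python
# Illustrative sketch: rewrite the corpus so every character becomes
# the first character of its cluster (the coarse pretraining stage).
def build_coarse_map(clusters):
    return {ch: cluster[0] for cluster in clusters for ch in cluster}

def coarsen(text, coarse_map):
    # Characters outside any cluster are left unchanged.
    return "".join(coarse_map.get(ch, ch) for ch in text)

# Usage (assumed file names):
# coarse_map = build_coarse_map(clusters)
# with open("corpus.txt") as fin, open("corpus.coarse.txt", "w") as fout:
#     for line in fin:
#         fout.write(coarsen(line, coarse_map))
```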
Note: XLM is released under a CC BY-NC 4.0 license.