Chinese text classification with Keras: long-text and short-sentence classification, multi-label classification, and two-sentence similarity. Provides base classes for building embedding layers (char, word, sentence) and network graphs. Models include FastText, TextCNN, CharCNN, TextRNN, RCNN, DCNN, DPCNN, VDCNN, CRNN, Bert, Xlnet, Albert, Attention, DeepMoji, HAN, CapsuleNet, Transformer-encode, Seq2seq, SWEM, LEAM, TextGCN.
yongzhuo/Keras-TextClassification
Install:
1. `pip install Keras-TextClassification`
2. Download and unzip 'data.rar' from Baidu Pan (https://pan.baidu.com/s/1pIDzGaGXCZ7cjng1XU_kPA, extraction code: w6ps, archive password: 2022), then copy its data directory over the package's data directory in your environment, e.g. '/anaconda/3.5.1/envs/tensorflow13/Lib/site-packages/keras_textclassification/data'. The sketch below shows one way to locate that path.
3. Continue to Train & Usage and Predict & Usage below.
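If you are unsure where the site-packages directory is in your environment, a minimal sketch for finding it (the printed path depends on your install):

```python
# Locate the installed keras_textclassification package so the unpacked
# data/ directory can be copied into it; the printed path depends on
# your Python environment.
import os
import keras_textclassification

print(os.path.dirname(keras_textclassification.__file__))
```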
Models:
- Electra-finetune (todo)
- Albert-finetune
- Xlnet-finetune
- Bert-finetune
- FastText
- TextCNN
- charCNN
- TextRNN
- TextRCNN
- TextDCNN
- TextDPCNN
- TextVDCNN
- TextCRNN
- DeepMoji
- SelfAttention
- HAN
- CapsuleNet
- Transformer-encode
- SWEM
- LEAM
- TextGCN (todo)

Usage:
1. Enter the model directory, e.g. keras_textclassification/m01_FastText.
2. Train: run train.py, e.g. `python train.py`.
3. Predict: run predict.py, e.g. `python predict.py`.
- Note: the default is a random embedding with no pre-training, and the bundled training and validation corpora contain only 100 samples each; see the data section below for the full corpora.
- Examples for bert, word2vec, and random embeddings are under test/; the word2vec (char or word), random-word, and bert (chinese_L-12_H-768_A-12) weights are not fully bundled and must be downloaded.
- multi_multi_class/ demonstrates multi-label classification with text-cnn: labels are converted to multi-hot vectors, and every class above a threshold is predicted (see the sketch after this section).
- sentence_similarity/ demonstrates two-sentence similarity with bert; the data format follows data/sim_webank/.

Test scripts:
- predict_bert_text_cnn.py
- tet_char_bert_embedding.py
- tet_char_xlnet_embedding.py
- tet_char_random_embedding.py
- tet_char_word2vec_embedding.py
- tet_word_random_embedding.py
- tet_word_word2vec_embedding.py

Data download (the GitHub project ships only part of the data; the full set is at https://pan.baidu.com/s/1I3vydhmFEQ9nuPG2fDou8Q, extraction code: rket):
- baidu_qa_2019 (Baidu QA corpus; only the title is used as the classification sample; 17 classes, one of which is the empty string ''; compressed and uploaded)
  - baike_qa_train.csv
  - baike_qa_valid.csv
- byte_multi_news (Toutiao 2018 multi-label news-title corpus, 1070 labels, crawled by fate233: [byte_multi_news](https://github.com/fate233/toutiao-multilevel-text-classfication-dataset))
  - labels.csv
  - train.csv
  - valid.csv
- embeddings
  - chinese_L-12_H-768_A-12/ (Google's pre-trained BERT, compressed and uploaded; keras-bert can also load Baidu's ERNIE (after conversion, https://github.com/ArthurRizar/tensorflow_ernie) and HIT's BERT-wwm (TensorFlow, https://github.com/ymcui/Chinese-BERT-wwm))
  - albert_base_zh/ (ALBERT trained by brightmart, https://github.com/brightmart/albert_zh)
  - chinese_xlnet_base_L-12_H-768_A-12/ (HIT's pre-trained Chinese XLNet, 12 layers, https://github.com/ymcui/Chinese-PreTrained-XLNet)
  - term_char.txt (uploaded, complete in the project; a wiki character dictionary; something like the Xinhua dictionary would also work)
  - term_word.txt (not uploaded, only partial in the project; see the word vectors)
  - w2v_model_merge_short.vec (not uploaded, only partial in the project; word vectors; you can substitute your own)
  - w2v_model_wiki_char.vec (uploaded to Baidu Pan, only partial in the project; character vectors trained on Chinese Wikipedia; you can substitute your own)
- model
  - fast_text/ (storage path for pre-trained models)

Project structure:
- Base classes are provided for networks (graph) and embeddings (char, word, and sentence vectors); the concrete models inherit from them, which keeps the code simple.
- keras_layers holds commonly used layers; conf holds paths for project data and models; data holds datasets and corpora; data_preprocess is the data-preprocessing module.
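For the multi-label setup in multi_multi_class/ mentioned above, here is a minimal sketch of multi-hot encoding and threshold decoding; the label names and the 0.5 threshold are illustrative assumptions, not values fixed by the project:

```python
# Multi-hot label encoding and threshold-based decoding for multi-label
# classification; the labels and threshold here are illustrative only.
import numpy as np

LABELS = ["finance", "sports", "tech"]       # hypothetical label set

def to_multi_hot(sample_labels):
    """Encode a list of label names as a multi-hot vector."""
    return np.array([1.0 if name in sample_labels else 0.0 for name in LABELS])

y_true = to_multi_hot(["finance", "tech"])   # -> array([1., 0., 1.])
y_prob = np.array([0.91, 0.07, 0.63])        # hypothetical sigmoid outputs
y_pred = [name for name, p in zip(LABELS, y_prob) if p >= 0.5]
print(y_true, y_pred)                        # [1. 0. 1.] ['finance', 'tech']
```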
Reference papers:
- FastText: Bag of Tricks for Efficient Text Classification
- TextCNN: Convolutional Neural Networks for Sentence Classification
- charCNN-kim: Character-Aware Neural Language Models
- charCNN-zhang: Character-level Convolutional Networks for Text Classification
- TextRNN: Recurrent Neural Network for Text Classification with Multi-Task Learning
- RCNN: Recurrent Convolutional Neural Networks for Text Classification
- DCNN: A Convolutional Neural Network for Modelling Sentences
- DPCNN: Deep Pyramid Convolutional Neural Networks for Text Categorization
- VDCNN: Very Deep Convolutional Networks for Text Classification
- CRNN: A C-LSTM Neural Network for Text Classification
- DeepMoji: Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm
- SelfAttention: Attention Is All You Need
- HAN: Hierarchical Attention Networks for Document Classification
- CapsuleNet: Dynamic Routing Between Capsules
- Transformer (encode or decode): Attention Is All You Need
- Bert: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Xlnet: XLNet: Generalized Autoregressive Pretraining for Language Understanding
- Albert: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
- RoBERTa: RoBERTa: A Robustly Optimized BERT Pretraining Approach
- ELECTRA: ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
- TextGCN: Graph Convolutional Networks for Text Classification
Related projects:
- Text classification project: https://github.com/mosu027/TextClassification
- Zhihu "Kanshan Cup" text classification: https://github.com/brightmart/text_classification
- Kashgari project: https://github.com/BrikerMan/Kashgari
- Text classification by lpty: https://github.com/lpty/classifier
- Keras text classification: https://github.com/ShawnyXiao/TextClassification-Keras
- Keras text classification: https://github.com/AlexYangLi/TextClassification
- CapsuleNet model: https://github.com/bojone/Capsule
- Transformer model: https://github.com/CyberZHG/keras-transformer
- keras_albert_model: https://github.com/TinkerMob/keras_albert_model
Quick start with the top-level train() API:

```python
from keras_textclassification import train

train(graph='TextCNN',        # required: algorithm name, one of "ALBERT", "BERT", "XLNET", "FASTTEXT",
                              # "TEXTCNN", "CHARCNN", "TEXTRNN", "RCNN", "DCNN", "DPCNN", "VDCNN",
                              # "CRNN", "DEEPMOJI", "SELFATTENTION", "HAN", "CAPSULE", "TRANSFORMER"
      label=17,               # required: number of classes; must be the same for train and dev sets
      path_train_data=None,   # required: training csv with a 'label,ques' header, see keras_textclassification/data
      path_dev_data=None,     # required: dev csv with a 'label,ques' header, see keras_textclassification/data
      rate=1,                 # optional: fraction of the training data to use
      hyper_parameters=None)  # optional: hyperparameters in json format; the default embedding is 'char', 'random'
```
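To illustrate the 'label,ques' csv format that train() expects, a minimal sketch with invented rows (the real corpora live under keras_textclassification/data):

```python
# Write a tiny csv in the 'label,ques' layout described above.
# The two data rows are invented examples, not from the bundled corpora.
import csv

rows = [
    ("label", "ques"),              # required header
    ("教育", "人工智能的课程有哪些?"),   # hypothetical sample: (class, question text)
    ("游戏", "这款游戏好玩吗?"),
]
with open("train.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```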
To cite this work, please refer to this GitHub project, for example with BibTeX:
```
@misc{Keras-TextClassification,
  howpublished = {\url{https://github.com/yongzhuo/Keras-TextClassification}},
  title        = {Keras-TextClassification},
  author       = {Yongzhuo Mo},
  publisher    = {GitHub},
  year         = {2019}
}
```

Hope this helps!