Chinese text classification with Keras: long-text and short-sentence classification, multi-label classification, and two-sentence similarity. Provides base classes for building embedding layers (char, word, sentence) and network graphs. Models include FastText, TextCNN, CharCNN, TextRNN, RCNN, DCNN, DPCNN, VDCNN, CRNN, Bert, Xlnet, Albert, Attention, DeepMoji, HAN, CapsuleNet, Transformer-encode, Seq2seq, SWEM, LEAM, TextGCN.
yongzhuo/Keras-TextClassification
Install:
1. `pip install Keras-TextClassification`
2. Download and unzip 'data.rar' from Baidu Pan (https://pan.baidu.com/s/1pIDzGaGXCZ7cjng1XU_kPA, extraction code: w6ps, archive password: 2022), then copy its data directory over the package's data directory in your environment, e.g. '/anaconda/3.5.1/envs/tensorflow13/Lib/site-packages/keras_textclassification/data'. The sketch below shows one way to locate that path.
3. Continue to Train & Usage and Predict & Usage below.
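If you are unsure where the site-packages directory is in your environment, a minimal sketch for finding it (the printed path depends on your install):

```python
# Locate the installed keras_textclassification package so the unpacked
# data/ directory can be copied into it; the printed path depends on
# your Python environment.
import os
import keras_textclassification

print(os.path.dirname(keras_textclassification.__file__))
```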
Models:
- Electra-finetune (todo)
- Albert-finetune
- Xlnet-finetune
- Bert-finetune
- FastText
- TextCNN
- charCNN
- TextRNN
- TextRCNN
- TextDCNN
- TextDPCNN
- TextVDCNN
- TextCRNN
- DeepMoji
- SelfAttention
- HAN
- CapsuleNet
- Transformer-encode
- SWEM
- LEAM
- TextGCN (todo)

Usage:
1. Enter the model directory, e.g. keras_textclassification/m01_FastText.
2. Train: run train.py, e.g. `python train.py`.
3. Predict: run predict.py, e.g. `python predict.py`.
- Note: the default is a random embedding with no pre-training, and the bundled training and validation corpora contain only 100 samples each; see the data section below for the full corpora.
- Examples for bert, word2vec, and random embeddings are under test/; the word2vec (char or word), random-word, and bert (chinese_L-12_H-768_A-12) weights are not fully bundled and must be downloaded.
- multi_multi_class/ demonstrates multi-label classification with text-cnn: labels are converted to multi-hot vectors, and every class above a threshold is predicted (see the sketch after this section).
- sentence_similarity/ demonstrates two-sentence similarity with bert; the data format follows data/sim_webank/.

Test scripts:
- predict_bert_text_cnn.py
- tet_char_bert_embedding.py
- tet_char_xlnet_embedding.py
- tet_char_random_embedding.py
- tet_char_word2vec_embedding.py
- tet_word_random_embedding.py
- tet_word_word2vec_embedding.py

Data download (the GitHub project ships only part of the data; the full set is at https://pan.baidu.com/s/1I3vydhmFEQ9nuPG2fDou8Q, extraction code: rket):
- baidu_qa_2019 (Baidu QA corpus; only the title is used as the classification sample; 17 classes, one of which is the empty string ''; compressed and uploaded)
  - baike_qa_train.csv
  - baike_qa_valid.csv
- byte_multi_news (Toutiao 2018 multi-label news-title corpus, 1070 labels, crawled by fate233: [byte_multi_news](https://github.com/fate233/toutiao-multilevel-text-classfication-dataset))
  - labels.csv
  - train.csv
  - valid.csv
- embeddings
  - chinese_L-12_H-768_A-12/ (Google's pre-trained BERT, compressed and uploaded; keras-bert can also load Baidu's ERNIE (after conversion, https://github.com/ArthurRizar/tensorflow_ernie) and HIT's BERT-wwm (TensorFlow, https://github.com/ymcui/Chinese-BERT-wwm))
  - albert_base_zh/ (ALBERT trained by brightmart, https://github.com/brightmart/albert_zh)
  - chinese_xlnet_base_L-12_H-768_A-12/ (HIT's pre-trained Chinese XLNet, 12 layers, https://github.com/ymcui/Chinese-PreTrained-XLNet)
  - term_char.txt (uploaded, complete in the project; a wiki character dictionary; something like the Xinhua dictionary would also work)
  - term_word.txt (not uploaded, only partial in the project; see the word vectors)
  - w2v_model_merge_short.vec (not uploaded, only partial in the project; word vectors; you can substitute your own)
  - w2v_model_wiki_char.vec (uploaded to Baidu Pan, only partial in the project; character vectors trained on Chinese Wikipedia; you can substitute your own)
- model
  - fast_text/ (storage path for pre-trained models)

Project structure:
- Base classes are provided for networks (graph) and embeddings (char, word, and sentence vectors); the concrete models inherit from them, which keeps the code simple.
- keras_layers holds commonly used layers; conf holds paths for project data and models; data holds datasets and corpora; data_preprocess is the data-preprocessing module.
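For the multi-label setup in multi_multi_class/ mentioned above, here is a minimal sketch of multi-hot encoding and threshold decoding; the label names and the 0.5 threshold are illustrative assumptions, not values fixed by the project:

```python
# Multi-hot label encoding and threshold-based decoding for multi-label
# classification; the labels and threshold here are illustrative only.
import numpy as np

LABELS = ["finance", "sports", "tech"]       # hypothetical label set

def to_multi_hot(sample_labels):
    """Encode a list of label names as a multi-hot vector."""
    return np.array([1.0 if name in sample_labels else 0.0 for name in LABELS])

y_true = to_multi_hot(["finance", "tech"])   # -> array([1., 0., 1.])
y_prob = np.array([0.91, 0.07, 0.63])        # hypothetical sigmoid outputs
y_pred = [name for name, p in zip(LABELS, y_prob) if p >= 0.5]
print(y_true, y_pred)                        # [1. 0. 1.] ['finance', 'tech']
```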
Reference papers:
- FastText: Bag of Tricks for Efficient Text Classification
- TextCNN: Convolutional Neural Networks for Sentence Classification
- charCNN-kim: Character-Aware Neural Language Models
- charCNN-zhang: Character-level Convolutional Networks for Text Classification
- TextRNN: Recurrent Neural Network for Text Classification with Multi-Task Learning
- RCNN: Recurrent Convolutional Neural Networks for Text Classification
- DCNN: A Convolutional Neural Network for Modelling Sentences
- DPCNN: Deep Pyramid Convolutional Neural Networks for Text Categorization
- VDCNN: Very Deep Convolutional Networks for Text Classification
- CRNN: A C-LSTM Neural Network for Text Classification
- DeepMoji: Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm
- SelfAttention: Attention Is All You Need
- HAN: Hierarchical Attention Networks for Document Classification
- CapsuleNet: Dynamic Routing Between Capsules
- Transformer (encode or decode): Attention Is All You Need
- Bert: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Xlnet: XLNet: Generalized Autoregressive Pretraining for Language Understanding
- Albert: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
- RoBERTa: RoBERTa: A Robustly Optimized BERT Pretraining Approach
- ELECTRA: ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
- TextGCN: Graph Convolutional Networks for Text Classification
Related projects:
- Text classification project: https://github.com/mosu027/TextClassification
- Zhihu "Kanshan Cup" text classification: https://github.com/brightmart/text_classification
- Kashgari project: https://github.com/BrikerMan/Kashgari
- Text classification by lpty: https://github.com/lpty/classifier
- Keras text classification: https://github.com/ShawnyXiao/TextClassification-Keras
- Keras text classification: https://github.com/AlexYangLi/TextClassification
- CapsuleNet model: https://github.com/bojone/Capsule
- Transformer model: https://github.com/CyberZHG/keras-transformer
- keras_albert_model: https://github.com/TinkerMob/keras_albert_model
Quick start with the top-level train() API:

```python
from keras_textclassification import train

train(graph='TextCNN',        # required: algorithm name, one of "ALBERT", "BERT", "XLNET", "FASTTEXT",
                              # "TEXTCNN", "CHARCNN", "TEXTRNN", "RCNN", "DCNN", "DPCNN", "VDCNN",
                              # "CRNN", "DEEPMOJI", "SELFATTENTION", "HAN", "CAPSULE", "TRANSFORMER"
      label=17,               # required: number of classes; must be the same for train and dev sets
      path_train_data=None,   # required: training csv with a 'label,ques' header, see keras_textclassification/data
      path_dev_data=None,     # required: dev csv with a 'label,ques' header, see keras_textclassification/data
      rate=1,                 # optional: fraction of the training data to use
      hyper_parameters=None)  # optional: hyperparameters in json format; the default embedding is 'char', 'random'
```
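To illustrate the 'label,ques' csv format that train() expects, a minimal sketch with invented rows (the real corpora live under keras_textclassification/data):

```python
# Write a tiny csv in the 'label,ques' layout described above.
# The two data rows are invented examples, not from the bundled corpora.
import csv

rows = [
    ("label", "ques"),              # required header
    ("教育", "人工智能的课程有哪些?"),   # hypothetical sample: (class, question text)
    ("游戏", "这款游戏好玩吗?"),
]
with open("train.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```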
To cite this work, please refer to this GitHub project, for example with BibTeX:
```
@misc{Keras-TextClassification,
  howpublished = {\url{https://github.com/yongzhuo/Keras-TextClassification}},
  title        = {Keras-TextClassification},
  author       = {Yongzhuo Mo},
  publisher    = {GitHub},
  year         = {2019}
}
```

Hope this helps!