multiprocess unsupervised chinese_detect_words ngram_combination


yiyepiaoling0715/unsupervised_extract_detect_words


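1. Approach: this project borrows the idea from an earlier blog post that mined new words from Renren (人人网) data, and improves and optimizes it.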

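2. Original approach: segment the documents with jieba, take every 3 adjacent words as a group, compute the left entropy, the right entropy, and the internal cohesion of the two words, derive a score from these values, and select new words by score.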
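The left and right entropy mentioned above can be sketched as follows. This is a minimal illustration, not the code in segment_multi.py: the function name `branch_entropies` is hypothetical, and the input is assumed to be a list of tokens that has already been segmented (for example with `jieba.lcut`).

```python
import math
from collections import Counter, defaultdict

def branch_entropies(tokens, n=2):
    """Return {ngram: (left_entropy, right_entropy)} for n-grams of tokens.

    tokens is assumed to be already segmented, e.g. by jieba.lcut(text).
    """
    left = defaultdict(Counter)   # ngram -> counts of the token just before it
    right = defaultdict(Counter)  # ngram -> counts of the token just after it
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if i > 0:
            left[gram][tokens[i - 1]] += 1
        if i + n < len(tokens):
            right[gram][tokens[i + n]] += 1

    def entropy(counter):
        # Shannon entropy (natural log) of the neighbor distribution.
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log(c / total) for c in counter.values())

    grams = set(left) | set(right)
    return {g: (entropy(left[g]), entropy(right[g])) for g in grams}
```

A candidate whose neighbors vary a lot on both sides (high left and right entropy) behaves like a free-standing unit, which is why these values feed into the score.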

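3. Improvements:

      1. Generalized from combining only two words to combining N adjacent words.
      2. The internal mutual information (the cohesion score) is normalized to its value at a length of one word, so that candidates of different lengths can be compared on the same scale.
      3. Multiprocessing is used to speed up the run.
      4. A filtering step was added, based on stop words, high-frequency common words, and similar lists.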
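The N-word generalization, the length normalization, and the stop-word filter can be sketched together as below. The README does not state the exact normalization, so dividing the pointwise mutual information by the number of words is an assumption made for illustration; the names `ngram_counts`, `normalized_cohesion`, and `candidates` are hypothetical and are not taken from the repository.

```python
import math
from collections import Counter

def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n in the token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def normalized_cohesion(gram, counts, total_tokens):
    """PMI of the n-gram against its single words, divided by its length.

    Dividing by len(gram) is an assumed normalization, not the repo's formula.
    """
    p_gram = counts[gram] / total_tokens
    p_parts = math.prod(counts[(w,)] / total_tokens for w in gram)
    return math.log(p_gram / p_parts) / len(gram)

def candidates(tokens, max_n=3, stopwords=frozenset()):
    """Score multi-word candidates, skipping any that contain a stop word."""
    counts = ngram_counts(tokens, max_n)
    total = len(tokens)
    scored = {}
    for gram in counts:
        if len(gram) < 2 or any(w in stopwords for w in gram):
            continue
        # Keys are joined with "_" to mirror the output format shown in the results.
        scored["_" + "_".join(gram)] = normalized_cohesion(gram, counts, total)
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```

Because counting is a plain sum over the corpus, one plausible way to parallelize it is to count document shards in separate `multiprocessing.Pool` workers and add the resulting `Counter` objects together. Whether the repository does it this way is not stated here.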

4. Entry point: segment_multi.py

Run with: python segment_multi.py

Parameters are set in: configs.py

5. Sample results

('_重大_疾病', 0.017789747314352424)
('_保障_范围', 0.015639743403053734)
('_本_公司', 0.014212133249451173)
('_完全_丧失', 0.013672071599779227)
('_意外_伤害', 0.010722245979224557)
('_明确_诊断', 0.009062853195861094)
('_日常生活_活动', 0.008990786509666062)
('_六项_基本_日常生活', 0.008813957372202039)
('_基本_日常生活', 0.008694797110512052)
('_基本_日常生活_活动', 0.008671016020472998)
('_保险_事故', 0.008504469334120192)
('_六项_基本_日常生活_活动', 0.008471400808888209)
('_能力_完全_丧失', 0.008404916576493579)
('_全部_条件', 0.008136980840438046)
('_无法_独立', 0.008091270307811042)
('_满足_下列_全部_条件', 0.008055553080109046)
('_现金_价值', 0.007895715475057304)
