Simhash computation for Chinese documents
yanyiwu/simhash
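This project computes simhash values for Chinese documents. Simhash is the algorithm Google uses for text deduplication, and it is now widely used in text processing.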
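To make the idea concrete, here is a minimal sketch of how a simhash fingerprint can be built from weighted keywords. It is an illustration under stated assumptions, not this project's code: the names `fnv1a64` and `simhashSketch` are hypothetical, and FNV-1a is used only as a self-contained stand-in for the hash function (the project itself uses jenkins). Each keyword is hashed to 64 bits, every bit position receives a vote of plus or minus the keyword's weight, and the sign of each total decides the corresponding fingerprint bit.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Stand-in 64-bit hash (FNV-1a). Assumption for illustration only:
// the project itself uses a jenkins hash.
inline std::uint64_t fnv1a64(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;                  // FNV-1a 64-bit prime
    }
    return h;
}

// Hypothetical helper: fold (keyword, weight) pairs into one 64-bit fingerprint.
inline std::uint64_t simhashSketch(
    const std::vector<std::pair<std::string, double>>& keywords) {
    double votes[64] = {0.0};
    for (const auto& kw : keywords) {
        assert(kw.second >= 0.0);  // this sketch expects non-negative weights
        const std::uint64_t h = fnv1a64(kw.first);
        for (int bit = 0; bit < 64; ++bit) {
            // A set bit votes +weight, a clear bit votes -weight.
            votes[bit] += ((h >> bit) & 1ULL) ? kw.second : -kw.second;
        }
    }
    std::uint64_t fingerprint = 0;
    for (int bit = 0; bit < 64; ++bit) {
        if (votes[bit] > 0.0) fingerprint |= (1ULL << bit);
    }
    return fingerprint;
}
```

Because heavier keywords cast larger votes, documents that share their most important keywords tend to end up with fingerprints that differ in only a few bits, which is what makes the fingerprint useful for deduplication.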
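- Uses CppJieba as the tokenizer and keyword extractor.
- Uses jenkins as the hash function.
- Header-only (hpp) style: all source code lives in .hpp files, which makes it easy to use. No linking, no harm.
- A by-product of this project, simhash_server, provides a simple simhash HTTP service.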
- Requires g++ (version >= 4.1 recommended) or clang++.
    git clone --recurse-submodules https://github.com/yanyiwu/simhash.git
    cd simhash
    mkdir build
    cd build
    cmake ..
    make
Test

    make test
    Text: "我是蓝翔技工拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上总经理,出任CEO,走上人生巅峰。"
    Keyword sequence: ["蓝翔:11.7392", "CEO:11.7392", "升职:10.8562", "加薪:10.6426", "手扶拖拉机:10.0089"]
    simhash value: 17831459094038722629
    Equality check for the simhash values 100010110110 and 110001110011:
    With the Hamming distance threshold set to 3, isEqual returns: 0
    With the Hamming distance threshold set to 5, isEqual returns: 1
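The equality check in the example can be reproduced with a few lines of bit arithmetic. The sketch below is not the library's implementation; the names `parseBinary`, `hammingDistance` and `withinDistance` are hypothetical, and the "distance <= threshold" rule is an assumption inferred from the output above (distance 5 fails at threshold 3 and passes at threshold 5).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: parse a string of '0'/'1' characters
// (at most 64 of them) into an unsigned 64-bit integer.
inline std::uint64_t parseBinary(const std::string& bits) {
    assert(bits.size() <= 64);
    std::uint64_t value = 0;
    for (char c : bits) {
        assert(c == '0' || c == '1');
        value = (value << 1) | static_cast<std::uint64_t>(c - '0');
    }
    return value;
}

// Number of bit positions in which a and b differ.
inline unsigned hammingDistance(std::uint64_t a, std::uint64_t b) {
    std::uint64_t diff = a ^ b;
    unsigned count = 0;
    while (diff != 0) {
        diff &= diff - 1;  // clear the lowest set bit
        ++count;
    }
    return count;
}

// Assumed rule: two fingerprints count as "equal" when their
// Hamming distance does not exceed the threshold.
inline bool withinDistance(std::uint64_t a, std::uint64_t b,
                           unsigned threshold = 3) {
    return hammingDistance(a, b) <= threshold;
}
```

For the two bit strings in the example, `hammingDistance(parseBinary("100010110110"), parseBinary("110001110011"))` is 5, so `withinDistance` is false with a threshold of 3 and true with a threshold of 5, matching the reported results.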
See the demo for details.
    ./benchmark/benchmarking

The results are as follows:
    Running ./benchmark/benchmarking
    Run on (16 X 2494.14 MHz CPU s)
    CPU Caches:
      L1 Data 32 KiB (x16)
      L1 Instruction 32 KiB (x16)
      L2 Unified 4096 KiB (x16)
      L3 Unified 36608 KiB (x1)
    Load Average: 0.07, 0.04, 0.03
    ***WARNING*** Library was built as DEBUG. Timings may be affected.
    ----------------------------------------------------------------------------------------------
    Benchmark                                                       Time          CPU   Iterations
    ----------------------------------------------------------------------------------------------
    BENCHMARK_Simhasher_extract_text50_top5                     13478 ns     13478 ns        52013
    BENCHMARK_Simhasher_extract_text50_top10                    13843 ns     13843 ns        50833
    BENCHMARK_Simhasher_extract_text50_top15                    13929 ns     13929 ns        49488
    BENCHMARK_Simhasher_extract_text50_top20                    13842 ns     13842 ns        50541
    BENCHMARK_Simhasher_extract_text500_top5                   184074 ns    184067 ns         3775
    BENCHMARK_Simhasher_make_text50_top5                        14457 ns     14457 ns        48341
    BENCHMARK_Simhasher_make_text50_top10                       15170 ns     15169 ns        46203
    BENCHMARK_Simhasher_make_text50_top15                       15585 ns     15585 ns        44903
    BENCHMARK_Simhasher_make_text50_top20                       15743 ns     15742 ns        44466
    BENCHMARK_Simhasher_binaryStringToUint64                    0.000 ns     0.000 ns   1000000000
    BENCHMARK_Simhasher_toBinaryString                           63.9 ns      63.9 ns     10937009
    BENCHMARK_Simhasher_make_from_predefined_keywords5            423 ns       423 ns      1644823
    BENCHMARK_Simhasher_make_from_predefined_keywords10           735 ns       735 ns       950156
    BENCHMARK_Simhasher_make_from_predefined_keywords20          1364 ns      1364 ns       508935
    BENCHMARK_Simhasher_make_from_predefined_keywords50          7876 ns      7875 ns        89006
    BENCHMARK_Simhasher_make_from_predefined_keywords100        21409 ns     21409 ns        32743
    BENCHMARK_Simhasher_make_from_predefined_keywords200        47469 ns     47468 ns        14728
    BENCHMARK_Simhasher_make_from_predefined_keywords500       124316 ns    124314 ns         5627
    BENCHMARK_Simhasher_make_from_predefined_keywords1000      251336 ns    251329 ns         2785
    BENCHMARK_Simhasher_binaryStringToUint64_isEqual            0.000 ns     0.000 ns   1000000000
    BENCHMARK_Simhasher_binaryStringToUint64_isEqual_10k        0.000 ns     0.000 ns   1000000000
    BENCHMARK_Simhasher_binaryStringToUint64_isEqual_1000k      0.000 ns     0.000 ns   1000000000