Simhash computation for Chinese documents


yanyiwu/simhash


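Introduction

This project computes simhash values for Chinese documents. simhash is the algorithm Google uses for text deduplication, and it is now widely used in text processing.

For details, see "simhash算法原理及实现" (The simhash algorithm: principles and implementation).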

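Features

  • Uses CppJieba as the word segmenter and keyword extractor
  • Uses jenkins as the hash function
  • hpp style: all source code lives in .hpp files, which makes it easy to use. No linking, no harm.
  • A by-product of this project, simhash_server, provides a simple simhash HTTP service.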
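The pipeline described above (weighted keywords in, hashed fingerprint out) can be sketched as follows. This is a minimal illustration of the general simhash technique, not this project's implementation: the names `fnv1a64` and `simhashSketch` are hypothetical, FNV-1a stands in for the jenkins hash the project uses, and the keyword weights are assumed to come from a keyword extractor such as CppJieba.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Stand-in 64-bit string hash (FNV-1a). Assumption: the real project
// uses jenkins; any well-mixed 64-bit hash serves for illustration.
inline uint64_t fnv1a64(const std::string& s) {
    uint64_t h = 14695981039346656037ULL;  // FNV-1a offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;  // FNV-1a prime
    }
    return h;
}

// Hypothetical helper: fold (keyword, weight) pairs into a 64-bit fingerprint.
inline uint64_t simhashSketch(
    const std::vector<std::pair<std::string, double>>& keywords) {
    double votes[64] = {0.0};
    for (const auto& kw : keywords) {
        const uint64_t h = fnv1a64(kw.first);
        for (int bit = 0; bit < 64; ++bit) {
            // A set bit votes +weight for that position, a clear bit votes -weight.
            votes[bit] += ((h >> bit) & 1ULL) ? kw.second : -kw.second;
        }
    }
    uint64_t fingerprint = 0;
    for (int bit = 0; bit < 64; ++bit) {
        // A position ends up set only if the weighted votes are positive.
        if (votes[bit] > 0.0) {
            fingerprint |= (1ULL << bit);
        }
    }
    return fingerprint;
}
```

Because every keyword votes on every bit position, documents that share most of their high-weight keywords tend to produce fingerprints that differ in only a few bits, which is what makes a Hamming-distance comparison meaningful.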

Dependencies

  • g++ (version >= 4.1 recommended) or clang++.

Usage

git clone --recurse-submodules https://github.com/yanyiwu/simhash.git
cd simhash
mkdir build
cd build
cmake ..
make

Testing

make test

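Demo

Text: "我是蓝翔技工拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上总经理,出任CEO,走上人生巅峰。"
Keyword sequence: ["蓝翔:11.7392", "CEO:11.7392", "升职:10.8562", "加薪:10.6426", "手扶拖拉机:10.0089"]
simhash value: 17831459094038722629
Equality checks for the simhash values 100010110110 and 110001110011:
With the Hamming distance threshold set to 3, isEqual returns 0
With the Hamming distance threshold set to 5, isEqual returns 1

See the demo for details.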
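The two isEqual results in the demo follow from the Hamming distance between the two values, which is 5. The sketch below reproduces that comparison under stated assumptions: `parseBinary`, `hammingDistance`, and `isNearDuplicate` are hypothetical names chosen for illustration, and the "distance less than or equal to threshold" rule is inferred from the demo output rather than taken from the library source.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: parse a string of '0'/'1' characters
// (most significant bit first) into a 64-bit value.
inline uint64_t parseBinary(const std::string& bits) {
    uint64_t value = 0;
    for (char c : bits) {
        value = (value << 1) | (c == '1' ? 1ULL : 0ULL);
    }
    return value;
}

// Number of bit positions in which the two values differ.
inline unsigned hammingDistance(uint64_t lhs, uint64_t rhs) {
    uint64_t diff = lhs ^ rhs;
    unsigned count = 0;
    while (diff != 0) {
        diff &= diff - 1;  // clear the lowest set bit
        ++count;
    }
    return count;
}

// Assumption: two fingerprints count as equal when their Hamming
// distance does not exceed the threshold, as the demo output suggests.
inline bool isNearDuplicate(uint64_t lhs, uint64_t rhs, unsigned threshold = 3) {
    return hammingDistance(lhs, rhs) <= threshold;
}
```

Calling `isNearDuplicate(parseBinary("100010110110"), parseBinary("110001110011"), 3)` yields false, while the same call with a threshold of 5 yields true, matching the 0 and 1 shown in the demo.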

Benchmark

./benchmark/benchmarking

The results are as follows:

Running ./benchmark/benchmarking
Run on (16 X 2494.14 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 4096 KiB (x16)
  L3 Unified 36608 KiB (x1)
Load Average: 0.07, 0.04, 0.03
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------------------------------
Benchmark                                                       Time             CPU   Iterations
-------------------------------------------------------------------------------------------------
BENCHMARK_Simhasher_extract_text50_top5                     13478 ns        13478 ns        52013
BENCHMARK_Simhasher_extract_text50_top10                    13843 ns        13843 ns        50833
BENCHMARK_Simhasher_extract_text50_top15                    13929 ns        13929 ns        49488
BENCHMARK_Simhasher_extract_text50_top20                    13842 ns        13842 ns        50541
BENCHMARK_Simhasher_extract_text500_top5                   184074 ns       184067 ns         3775
BENCHMARK_Simhasher_make_text50_top5                        14457 ns        14457 ns        48341
BENCHMARK_Simhasher_make_text50_top10                       15170 ns        15169 ns        46203
BENCHMARK_Simhasher_make_text50_top15                       15585 ns        15585 ns        44903
BENCHMARK_Simhasher_make_text50_top20                       15743 ns        15742 ns        44466
BENCHMARK_Simhasher_binaryStringToUint64                    0.000 ns        0.000 ns   1000000000
BENCHMARK_Simhasher_toBinaryString                           63.9 ns         63.9 ns     10937009
BENCHMARK_Simhasher_make_from_predefined_keywords5            423 ns          423 ns      1644823
BENCHMARK_Simhasher_make_from_predefined_keywords10           735 ns          735 ns       950156
BENCHMARK_Simhasher_make_from_predefined_keywords20          1364 ns         1364 ns       508935
BENCHMARK_Simhasher_make_from_predefined_keywords50          7876 ns         7875 ns        89006
BENCHMARK_Simhasher_make_from_predefined_keywords100        21409 ns        21409 ns        32743
BENCHMARK_Simhasher_make_from_predefined_keywords200        47469 ns        47468 ns        14728
BENCHMARK_Simhasher_make_from_predefined_keywords500       124316 ns       124314 ns         5627
BENCHMARK_Simhasher_make_from_predefined_keywords1000      251336 ns       251329 ns         2785
BENCHMARK_Simhasher_binaryStringToUint64_isEqual            0.000 ns        0.000 ns   1000000000
BENCHMARK_Simhasher_binaryStringToUint64_isEqual_10k        0.000 ns        0.000 ns   1000000000
BENCHMARK_Simhasher_binaryStringToUint64_isEqual_1000k      0.000 ns        0.000 ns   1000000000
