


Technical Field
The present invention relates to the technical field of big data processing, and in particular to a text screening method, apparatus, device and storage medium.
Background
When a crawler crawls texts, identical or highly similar texts among the crawled results need to be deduplicated. Most text deduplication generates a "fingerprint" from the URL and places the fingerprints in a set for deduplication. In practice, a text may be reposted by multiple websites, so texts with different fingerprints can have identical content. Moreover, when the crawler fetches an abstract or a summary of a text, such abstract or summary texts are also difficult to deduplicate.
Summary of the Invention
In view of the above, the present invention provides a text screening method, apparatus, device and storage medium, which aim to solve the technical problem that abstract or summary texts are difficult to deduplicate in the prior art.
To achieve the above object, the present invention provides a text screening method, the method comprising:
performing a word segmentation operation on a first text to be screened to obtain a plurality of segmented words, extracting keywords of preset parts of speech from the plurality of segmented words, and assigning an associated weight to each segmented word and each keyword;
calculating a first hash value of each segmented word and each keyword, performing a weighting operation based on the first hash value and the weight of each segmented word to obtain a weight vector of each segmented word, and performing a weighting operation based on the first hash value and the weight of each keyword to obtain a weight vector of each keyword;
accumulating the weight vectors of the segmented words to obtain a first weight vector of the first text, accumulating the weight vectors of the keywords to obtain a second weight vector of the first text, and performing a dimensionality reduction operation on the first weight vector and the second weight vector to obtain a first simhash value and a second simhash value of the first text, respectively;
calculating a first distance value between the first simhash value and a third simhash value of a target text in a preset storage space; when the first distance value is greater than a first preset value, calculating a second distance value between the second simhash value and the third simhash value; and when the second distance value is less than or equal to a second preset value, screening out the first text.
Preferably, extracting keywords of preset parts of speech from the plurality of segmented words comprises:
calculating the word frequency of each segmented word in the first text; calculating an IDF value and a TF value of each segmented word based on the word frequency; multiplying the IDF value of each segmented word by its TF value to obtain a TF-IDF value of each segmented word; and determining whether the first text contains more than a preset number of keywords of the preset parts of speech, and if so, selecting the preset number of keywords of the preset parts of speech based on the TF-IDF values of the segmented words, wherein the keywords of the preset parts of speech include noun keywords and verb keywords.
Preferably, determining whether the first text contains more than a preset number of keywords of the preset parts of speech comprises:
when it is determined that the first text does not contain more than the preset number of keywords of the preset parts of speech, screening out the first text, randomly obtaining a text from the preset storage space as the first text to be screened, and performing the word segmentation operation again.
Preferably, the method further comprises:
when the first distance value is less than or equal to the first preset value, screening out the first text.
Preferably, the method further comprises:
when the second distance value is greater than the second preset value, storing the first text in the text set to which the target text belongs.
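As an illustrative sketch only, the screening decision described above (the two-stage comparison together with the two preferred branches) could be expressed as follows. The use of Hamming distance, the function names and the default thresholds are assumptions made for illustration; the text itself speaks only of a "distance value" and "preset values".

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two simhash values differ."""
    return bin(a ^ b).count("1")


def screen(first_simhash: int, second_simhash: int, target_simhash: int,
           first_preset: int = 3, second_preset: int = 3) -> str:
    """Return "screen_out" or "store" for the first text.

    first_simhash  : simhash built from all segmented words
    second_simhash : simhash built from the keywords only
    target_simhash : third simhash value, taken from the stored target text
    The thresholds are hypothetical defaults, not values from the text.
    """
    first_distance = hamming_distance(first_simhash, target_simhash)
    if first_distance <= first_preset:
        # Close on the full-text simhash: treat as a duplicate.
        return "screen_out"
    second_distance = hamming_distance(second_simhash, target_simhash)
    if second_distance <= second_preset:
        # Far on the full text but close on keywords, e.g. an abstract.
        return "screen_out"
    # Far on both: keep the text in the target text's set.
    return "store"
```

For instance, `screen(0b111111, 0b000001, 0b000000)` fails the first comparison (distance 6) but is still screened out by the keyword comparison (distance 1).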
Preferably, when the first distance value is greater than the first preset value, the method further comprises:
calculating a third distance value between the first simhash value and a fourth simhash value of the target text, and when the third distance value is less than or equal to a third preset value, screening out the first text.
Preferably, performing a word segmentation operation on the first text to be screened to obtain a plurality of segmented words comprises:
matching the read text against a lexicon according to the forward maximum matching method to obtain a first matching result, the first matching result containing a first number of first phrases and a second number of single characters;
matching the read text against the lexicon according to the reverse maximum matching method to obtain a second matching result, the second matching result containing a third number of second phrases and a fourth number of single characters;
if the first number is equal to the third number and the second number is less than or equal to the fourth number, or if the first number is less than the third number, taking the first matching result as the word segmentation result of the first text; if the first number is equal to the third number and the second number is greater than the fourth number, or if the first number is greater than the third number, taking the second matching result as the word segmentation result of the first text.
To achieve the above object, the present invention further provides a text screening apparatus, the text screening apparatus comprising:
an extraction module, configured to perform a word segmentation operation on a first text to be screened to obtain a plurality of segmented words, extract keywords of preset parts of speech from the plurality of segmented words, and assign an associated weight to each segmented word and each keyword;
a weighting module, configured to calculate a first hash value of each segmented word and each keyword, perform a weighting operation based on the first hash value and the weight of each segmented word to obtain a weight vector of each segmented word, and perform a weighting operation based on the first hash value and the weight of each keyword to obtain a weight vector of each keyword;
a dimensionality reduction module, configured to accumulate the weight vectors of the segmented words to obtain a first weight vector of the first text, accumulate the weight vectors of the keywords to obtain a second weight vector of the first text, and perform a dimensionality reduction operation on the first weight vector and the second weight vector to obtain a first simhash value and a second simhash value of the first text, respectively;
a screening module, configured to calculate a first distance value between the first simhash value and a third simhash value of a target text in a preset storage space; when the first distance value is greater than a first preset value, calculate a second distance value between the second simhash value and the third simhash value; and when the second distance value is less than or equal to a second preset value, screen out the first text.
To achieve the above object, the present invention further provides an electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores a program executable by the at least one processor, and the program is executed by the at least one processor so that the at least one processor can perform any step of the text screening method described above.
To achieve the above object, the present invention further provides a computer-readable storage medium. The computer-readable storage medium includes a data storage area and a program storage area; the data storage area stores data created through the use of blockchain nodes, and the program storage area stores a text screening program which, when executed by a processor, implements any step of the text screening method described above.
The text screening method, apparatus, device and storage medium proposed by the present invention perform a word segmentation operation on the first text, extract keywords of preset parts of speech from the first text, and assign an associated weight to each segmented word and each keyword. A hash function and a weighted calculation are used to compute the vector corresponding to the segmented words and the vector corresponding to the keywords, from which the corresponding simhash values are obtained. The distances computed from these simhash values are then compared with preset values, and text similarity is judged from the distances, which improves the recognition of similar texts. When the first text is an abstract or a summary of another text, combining the simhash values obtained from the keywords and from the segmented words allows abstract or summary texts to be deduplicated accurately.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a preferred embodiment of the electronic device of the present invention;
FIG. 2 is a schematic block diagram of a preferred embodiment of the text screening apparatus in FIG. 1;
FIG. 3 is a flowchart of a preferred embodiment of the text screening method of the present invention.
The realization of the object, the functional features and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present invention and not to limit it. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
FIG. 1 is a schematic diagram of a preferred embodiment of the electronic device 1 of the present invention.
The electronic device 1 includes, but is not limited to, a memory 11, a processor 12, a display 13 and a network interface 14. The electronic device 1 connects to a network through the network interface 14 to obtain raw data. The network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, or a telephony network.
The memory 11 includes at least one type of readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, or an optical disc. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk or internal memory of the electronic device 1. In other embodiments, the memory 11 may be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card with which the electronic device 1 is equipped. The memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. In this embodiment, the memory 11 is generally used to store the operating system and the various application software installed on the electronic device 1, for example the program code of the text screening program 10. The memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is generally used to control the overall operation of the electronic device 1, for example to perform control and processing related to data interaction or communication. In this embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example to run the program code of the text screening program 10.
The display 13 may also be called a display screen or a display unit. In some embodiments, the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device, or the like. The display 13 is used to display the information processed in the electronic device 1 and to display a visual working interface, for example the results of data statistics.
The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface), and is generally used to establish a communication connection between the electronic device 1 and other electronic devices.
FIG. 1 shows only the electronic device 1 with the components 11-14 and the text screening program 10, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
Optionally, the electronic device 1 may further include a user interface. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device, or the like. The display, which may also be called a display screen or a display unit, is used to display the information processed in the electronic device 1 and to display a visual user interface.
The electronic device 1 may also include a radio frequency (RF) circuit, sensors, an audio circuit and the like, which are not described further here.
In the above embodiment, when the processor 12 executes the text screening program 10 stored in the memory 11, the following steps may be implemented:
performing a word segmentation operation on a first text to be screened to obtain a plurality of segmented words, extracting keywords of preset parts of speech from the plurality of segmented words, and assigning an associated weight to each segmented word and each keyword;
calculating a first hash value of each segmented word and each keyword, performing a weighting operation based on the first hash value and the weight of each segmented word to obtain a weight vector of each segmented word, and performing a weighting operation based on the first hash value and the weight of each keyword to obtain a weight vector of each keyword;
accumulating the weight vectors of the segmented words to obtain a first weight vector of the first text, accumulating the weight vectors of the keywords to obtain a second weight vector of the first text, and performing a dimensionality reduction operation on the first weight vector and the second weight vector to obtain a first simhash value and a second simhash value of the first text, respectively;
calculating a first distance value between the first simhash value and a third simhash value of a target text in a preset storage space; when the first distance value is greater than a first preset value, calculating a second distance value between the second simhash value and the third simhash value; and when the second distance value is less than or equal to a second preset value, screening out the first text.
The storage device may be the memory 11 of the electronic device 1, or another storage device communicatively connected to the electronic device 1.
For details of the above steps, refer to the descriptions below of the functional block diagram of the text screening apparatus 100 in FIG. 2 and of the flowchart of the text screening method in FIG. 3.
FIG. 2 is a functional block diagram of the text screening apparatus 100 of the present invention.
The text screening apparatus 100 of the present invention may be installed in an electronic device. According to the functions implemented, the text screening apparatus 100 may include an extraction module 110, a weighting module 120, a dimensionality reduction module 130 and a screening module 140. A module in the present invention, which may also be called a unit, is a series of computer program segments that can be executed by the processor of the electronic device to perform a fixed function and that are stored in the memory of the electronic device.
In this embodiment, the functions of the modules/units are as follows:
The extraction module 110 is configured to perform a word segmentation operation on a first text to be screened to obtain a plurality of segmented words, extract keywords of preset parts of speech from the plurality of segmented words, and assign an associated weight to each segmented word and each keyword.
This embodiment describes the solution using the example of a crawler that must deduplicate identical or highly similar crawled texts; it should be understood that the application scenarios of the solution are not limited to this. When a text is crawled, it is necessary to determine whether it is similar or identical to a text that has already been crawled, and if so, the text can be screened out. Specifically, when a first text to be deduplicated is obtained, a word segmentation operation is performed on the first text to obtain a plurality of segmented words, and the keywords of preset parts of speech are extracted from the plurality of segmented words, where the keywords of preset parts of speech may be keywords that are nouns and keywords that are verbs. An associated weight is assigned to each segmented word and each keyword; the weights may be assigned according to the number of occurrences of each segmented word.
For example, the first text contains the sentence "CSDN博客结构之法算法之道的作者July". After segmentation it becomes "CSDN 博客 结构 之 法 算法 之 道 的 作者 July", and each segmented word is then given a weight: CSDN(4) 博客(5) 结构(3) 之(1) 法(2) 算法(3) 之(1) 道(2) 的(1) 作者(5) July(5), where the number in parentheses indicates the importance of the word in the whole sentence; the larger the number, the more important the word.
In one embodiment, extracting keywords of preset parts of speech from the plurality of segmented words comprises:
calculating the word frequency of each segmented word in the first text; calculating an IDF value and a TF value of each segmented word based on the word frequency; multiplying the IDF value of each segmented word by its TF value to obtain a TF-IDF value of each segmented word; and determining whether the first text contains more than a preset number of keywords of the preset parts of speech, and if so, selecting the preset number of keywords of the preset parts of speech based on the TF-IDF values of the segmented words, wherein the keywords of the preset parts of speech include noun keywords and verb keywords.
The number of occurrences of every word in the first text is counted, the IDF (inverse document frequency) value is calculated, and then the TF (term frequency) value of each word in the first text is calculated, where TF = (number of occurrences of the word in the text) / (sum of the numbers of occurrences of all words in the text). Multiplying the IDF value by the TF value gives the TF-IDF value of the word. The TF-IDF value measures how important a word is to the first text: the larger the TF-IDF value, the more important the word and the higher its priority as a keyword, so the words with the highest TF-IDF values can be taken as the keywords of the first text. It is then determined whether the first text contains more than a preset number (for example, 20) of keywords of the preset parts of speech, and if so, the noun keywords and verb keywords with the top 20 TF-IDF values are selected as the keywords of the preset parts of speech of the first text.
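A minimal sketch of this keyword selection is given below. It is not taken from the source: the IDF formula log(N / (1 + df)), the reference corpus, the part-of-speech tags "n" and "v", and all function names are assumptions made for illustration, since the text defines only TF and the product TF × IDF.

```python
import math
from collections import Counter


def tf_idf_scores(tokens, corpus):
    """TF-IDF per distinct token; `corpus` is a list of sets of words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        tf = count / total                        # TF as defined in the text
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log(n_docs / (1 + df))         # assumed IDF formula
        scores[word] = tf * idf
    return scores


def select_keywords(tokens, pos_tags, corpus, preset_number):
    """Return the top `preset_number` noun/verb keywords, or None when the
    text does not contain more than `preset_number` such candidates
    (in which case the text would be screened out)."""
    scores = tf_idf_scores(tokens, corpus)
    candidates = [w for w in scores if pos_tags.get(w) in {"n", "v"}]
    if len(candidates) <= preset_number:
        return None
    candidates.sort(key=lambda w: scores[w], reverse=True)
    return candidates[:preset_number]
```

With tokens `["cat", "cat", "cat", "runs", "the", "dog"]`, where "cat" and "dog" are tagged as nouns and "runs" as a verb, and a preset number of 2, the three candidates exceed the preset number and the two highest-scoring ones are returned.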
Further, when it is determined that the first text does not contain more than the preset number of keywords of the preset parts of speech, the first text is screened out, a text is randomly obtained from the preset storage space as the first text to be screened, and the word segmentation operation is performed again.
When the first text contains fewer than 20 keywords of the preset parts of speech, the first text is deleted, a text is randomly obtained from the preset storage space as the first text to be screened, and the word segmentation operation is performed again. Deleting texts with too few keywords avoids further operations such as hashing and dimensionality reduction on such texts (that is, texts whose features are not distinctive) and speeds up the deduplication of massive amounts of text.
In one embodiment, performing a word segmentation operation on the first text to be screened to obtain a plurality of segmented words comprises:
matching the read text against a lexicon according to the forward maximum matching method to obtain a first matching result, the first matching result containing a first number of first phrases and a second number of single characters;
matching the read text against the lexicon according to the reverse maximum matching method to obtain a second matching result, the second matching result containing a third number of second phrases and a fourth number of single characters;
if the first number is equal to the third number and the second number is less than or equal to the fourth number, or if the first number is less than the third number, taking the first matching result as the word segmentation result of the first text; if the first number is equal to the third number and the second number is greater than the fourth number, or if the first number is greater than the third number, taking the second matching result as the word segmentation result of the first text.
Performing segmentation matching in both the forward and the reverse direction and selecting the matching result with fewer single characters and more phrases as the segmentation result of the sentence improves the accuracy of word segmentation.
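The following sketch illustrates bidirectional maximum matching with the comparison rule as worded in the steps above: the forward result is chosen when it has fewer phrases, or the same number of phrases and no more single characters, and the reverse result is chosen otherwise. The function names, the `max_len` window and the toy lexicon are hypothetical; a real lexicon and window size would come from the deployment.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest match scanning left to right."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:   # fall back to one character
                tokens.append(piece)
                i += size
                break
    return tokens


def reverse_max_match(text, lexicon, max_len=4):
    """Greedy longest match scanning right to left."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def count_phrases_and_singles(tokens):
    singles = sum(1 for t in tokens if len(t) == 1)
    return len(tokens) - singles, singles


def segment(text, lexicon, max_len=4):
    forward = forward_max_match(text, lexicon, max_len)
    reverse = reverse_max_match(text, lexicon, max_len)
    first, second = count_phrases_and_singles(forward)   # first/second number
    third, fourth = count_phrases_and_singles(reverse)   # third/fourth number
    if first < third or (first == third and second <= fourth):
        return forward
    return reverse
```

With the toy lexicon `{"ab", "bcd"}` and the text `"abcd"`, the forward pass yields `["ab", "c", "d"]` and the reverse pass yields `["a", "bcd"]`; both contain one phrase, and the reverse result has fewer single characters, so it is selected.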
The weighting module 120 is configured to calculate a first hash value of each segmented word and each keyword, perform a weighting operation based on the first hash value and the weight of each segmented word to obtain a weight vector of each segmented word, and perform a weighting operation based on the first hash value and the weight of each keyword to obtain a weight vector of each keyword.
In this embodiment, a hash function may be used to calculate the first hash value of each segmented word and of each keyword. The first hash value is an n-bit signature composed of the binary digits "0" and "1"; for example, the hash value Hash(CSDN) of "CSDN" is "100101", and the hash value Hash(博客) of "博客" ("blog") is "101011". Then, a weighting operation is performed according to the hash value of each segmented word and its corresponding weight to obtain the weight vector of that segmented word, and a weighting operation is performed according to the hash value of each keyword and its corresponding weight to obtain the weight vector of that keyword.
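The text does not name a particular hash function, and the 6-bit values above are illustrative. As one possible sketch, an n-bit signature could be derived by truncating a standard digest; the choice of MD5, the function name `n_bit_signature` and the default of 64 bits are all assumptions.

```python
import hashlib


def n_bit_signature(token: str, n_bits: int = 64) -> str:
    """Return an n-bit string of "0"/"1" derived from the token.

    MD5 is used here only as a convenient, deterministic 128-bit hash;
    n_bits must therefore not exceed 128.
    """
    digest = hashlib.md5(token.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big") >> (128 - n_bits)  # keep top bits
    return format(value, "0{}b".format(n_bits))
```

The same token always maps to the same signature, which is what allows identical words in different texts to contribute identically to the simhash.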
Specifically, on the basis of the first hash value, each segmented word and each keyword is weighted, that is, W = Hash × weight: for each bit that is 1 the weight is taken with a positive sign, and for each bit that is 0 the weight is taken with a negative sign. For example, weighting the hash value "100101" of "CSDN" with its weight 4 gives the weight vector W(CSDN) = 100101 × 4 = (4, -4, -4, 4, -4, 4), and weighting the hash value "101011" of "博客" with its weight 5 gives the weight vector W(博客) = 101011 × 5 = (5, -5, 5, -5, 5, 5). The remaining segmented words and keywords are handled in the same way.
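The weighting step can be sketched in a few lines; the function name `weight_vector` is hypothetical, and the inputs are the example hash values and weights given above.

```python
def weight_vector(hash_bits: str, weight: int) -> list:
    """Map each "1" bit to +weight and each "0" bit to -weight."""
    return [weight if bit == "1" else -weight for bit in hash_bits]


w_csdn = weight_vector("100101", 4)   # [4, -4, -4, 4, -4, 4]
w_blog = weight_vector("101011", 5)   # [5, -5, 5, -5, 5, 5]
```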
The dimensionality reduction module 130 is configured to accumulate the weight vectors of the segmented words to obtain a first weight vector of the first text, accumulate the weight vectors of the keywords to obtain a second weight vector of the first text, and perform a dimensionality reduction operation on the first weight vector and the second weight vector to obtain, respectively, a first simhash value and a second simhash value of the first text.
In this embodiment, the weight vectors of the segmented words are accumulated to obtain the first weight vector of the first text, and the weight vectors of the keywords are accumulated to obtain the second weight vector of the first text. Accumulating the weight vectors position by position yields a single sequence that serves as the first or second weight vector of the first text. For example, accumulating "4 -4 -4 4 -4 4" for "CSDN" and "5 -5 5 -5 5 5" for "博客" gives "4+5 -4+-5 -4+5 4+-5 -4+5 4+5", that is, "9 -9 1 -1 1 9".
A dimensionality reduction operation is then performed on the first weight vector and the second weight vector; mapping a high-dimensional feature vector to a low-dimensional one speeds up processing. This yields the first simhash value and the second simhash value of the first text, where the first simhash value is the simhash value corresponding to the segmented words of the first text and the second simhash value is the simhash value corresponding to the keywords of the first text. Specifically, each position of a weight vector of the first text is set to 1 if its value is greater than 0 and to 0 otherwise. For example, applying the dimensionality reduction operation to the "9 -9 1 -1 1 9" calculated above gives the simhash value "1 0 1 0 1 1".
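The accumulation and dimensionality reduction steps can be sketched as follows. The function names `accumulate` and `reduce_to_bits` are hypothetical; the example values are the ones used in the text.

```python
def accumulate(vectors: list[list[int]]) -> list[int]:
    """Sum weight vectors position by position."""
    if not vectors:
        raise ValueError("at least one weight vector is required")
    length = len(vectors[0])
    if any(len(vector) != length for vector in vectors):
        raise ValueError("all weight vectors must have the same length")
    return [sum(column) for column in zip(*vectors)]


def reduce_to_bits(total: list[int]) -> str:
    """Set a position to 1 when its accumulated value is greater than 0, otherwise to 0."""
    return "".join("1" if value > 0 else "0" for value in total)


csdn = [4, -4, -4, 4, -4, 4]   # weight vector of "CSDN"
blog = [5, -5, 5, -5, 5, 5]    # weight vector of "博客"
total = accumulate([csdn, blog])
fingerprint = reduce_to_bits(total)
print(total, fingerprint)
```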
The screening-out module 140 is configured to calculate a first distance value between the first simhash value and a third simhash value of a target text in a preset storage space; when the first distance value is greater than a first preset value, calculate a second distance value between the second simhash value and the third simhash value; and when the second distance value is less than or equal to a second preset value, screen out the first text.
In this embodiment, an actual text deduplication operation may crawl a text that is highly similar to an existing text, such as an abstract of that text or a summary of it. For example, for a summary announcement about a certain stock, there may be another summary announcement that is more detailed. If similarity were judged only from the first simhash value obtained from the segmented words, the two texts would be judged not to be duplicates. The second simhash value obtained from the keywords of the text is therefore also needed to further determine whether the two texts are similar.
Specifically, the first distance value between the first simhash value and the third simhash value of the target text is calculated. It can be understood that the third simhash value may be the simhash value obtained from the segmented words of the target text, and the first distance value may be a Hamming distance. When the first distance value is greater than the first preset value (for example, 3), the first simhash value obtained from the segmented words indicates that the two texts are neither identical nor similar. In that case, the second distance value between the second simhash value and the third simhash value of the target text can be further calculated. When the second distance value is less than or equal to the second preset value, the simhash value obtained from the keywords indicates that the two texts are similar, and the first text can be screened out. The second preset value can be set according to actual conditions. It can be understood that the target text in the preset storage space is the text against which the first text is compared for similarity or identity; it may be a text crawled before the first text, or any text in a text set in a database.
In one embodiment, when the first distance value is less than or equal to the first preset value, the first text is screened out. A first distance value that does not exceed the first preset value indicates that the two texts are highly similar, so the first text can be screened out.
Further, when the second distance value is greater than the second preset value, the first text is stored in the text set to which the preset target text belongs. A second distance value greater than the second preset value means that, judged by the simhash value obtained from the keywords, the two texts are not similar, so the first text can be retained.
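The two-stage decision described above can be sketched in Python. This is a minimal illustration, not the patented implementation: the function names are hypothetical, simhash values are assumed to be held as integers (a bit string such as "101011" can be converted with `int("101011", 2)`), and the default of 3 for the second preset value is an assumption, since the text leaves it to be set as needed.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def should_screen_out(
    first_simhash: int,    # from the segmented words of the first text
    second_simhash: int,   # from the keywords of the first text
    third_simhash: int,    # from the segmented words of the target text
    first_preset: int = 3,
    second_preset: int = 3,  # assumed default, not given in the text
) -> bool:
    """Return True when the first text should be screened out as a duplicate."""
    # Stage 1: segmented-word fingerprints are close enough.
    if hamming_distance(first_simhash, third_simhash) <= first_preset:
        return True
    # Stage 2: only reached when stage 1 found the texts dissimilar.
    return hamming_distance(second_simhash, third_simhash) <= second_preset
```

A text for which the function returns `False` would be retained and stored in the text set of the target text.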
In one embodiment, when the first distance value is greater than the first preset value, the screening-out module is further configured to: calculate a third distance value between the first simhash value and a fourth simhash value of the target text, and screen out the first text when the third distance value is less than or equal to a third preset value.
The fourth simhash value is the simhash value corresponding to the keywords of the target text. Comparing the distance between the keyword simhash values of the two texts makes it possible to screen out further similar texts.
In practice, combining segmented words with keywords makes it possible to deduplicate abstracts and summaries. Two simhash values are kept for each text, one for its segmented words and one for its keywords. The segmented-word value is checked first and the keyword value second, which can markedly improve the practical effect of simhash in text deduplication and screening.
In addition, the present invention provides a text screening method. FIG. 3 is a schematic flowchart of an embodiment of the text screening method of the present invention. When the processor 12 of the electronic device 1 executes the text screening program 10 stored in the memory 11, the following steps of the text screening method are implemented:
Step S10: perform a word segmentation operation on the first text to be screened to obtain a plurality of segmented words, extract keywords of a preset part of speech from the plurality of segmented words, and assign an associated weight to each segmented word and each keyword.
In this embodiment, the solution is described using the example of a crawler that must deduplicate identical or highly similar texts it has crawled; it should be understood that the application scenarios of the solution are not limited to this. When a text is crawled, it must be determined whether the text is similar or identical to a text already crawled; if so, the text can be screened out. Specifically, when the first text to be deduplicated is obtained, a word segmentation operation is performed on it to obtain a plurality of segmented words, and keywords of a preset part of speech are extracted from them, where the keywords of a preset part of speech may be noun keywords and verb keywords. Each segmented word and each keyword is assigned an associated weight, and the weights may be assigned according to the number of occurrences of each segmented word.
For example, suppose the first text contains the sentence "CSDN博客结构之法算法之道的作者July". After word segmentation it becomes "CSDN 博客 结构 之 法 算法 之 道 的 作者 July". A weight is then assigned to each segmented word: CSDN(4) 博客(5) 结构(3) 之(1) 法(2) 算法(3) 之(1) 道(2) 的(1) 作者(5) July(5), where the number in parentheses represents the importance of the word within the sentence; the larger the number, the more important the word.
In one embodiment, extracting keywords of a preset part of speech from the plurality of segmented words includes:
calculating the word frequency of each segmented word in the first text, calculating the IDF value and TF value of each segmented word based on the word frequency, multiplying the IDF value of each segmented word by its TF value to obtain the TF-IDF value of each segmented word, and determining whether the first text contains more than a preset number of keywords of the preset part of speech; if so, selecting the preset number of keywords of the preset part of speech based on the TF-IDF values of the segmented words, where the keywords of the preset part of speech include noun keywords and verb keywords.
The occurrences of every word in the first text are counted, the IDF (inverse document frequency) value is calculated, and then the TF (term frequency) value of each word in the first text is calculated, where TF = (number of occurrences of the word in the text) / (total number of occurrences of all words in the text). Multiplying the IDF value by the TF value gives the word's TF-IDF value, which measures how important the word is to the text: the larger the TF-IDF value, the more important the word and the higher its priority as a keyword. The words with the highest TF-IDF values can therefore be used as the keywords of the first text. It is then determined whether the first text contains more than a preset number (for example, 20) of keywords of the preset part of speech; if so, the 20 noun and verb keywords with the highest TF-IDF values are selected as the keywords of the preset part of speech of the first text.
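The keyword selection can be sketched as below. Several details are assumptions made for illustration and are not taken from the text: the input is a list of `(word, part_of_speech)` pairs with tags `"n"` and `"v"` for nouns and verbs, document frequencies come from a hypothetical `doc_freq` mapping, and IDF uses the common smoothed form log(N / (1 + df)), since the text does not state an IDF formula. A text with fewer than `top_k` candidate keywords returns `None`, following the later statement that such texts are deleted.

```python
import math
from collections import Counter


def select_keywords(tagged_words, doc_freq, num_docs, top_k=20, allowed_pos=("n", "v")):
    """Return the top_k noun/verb keywords by TF-IDF, or None if there are too few."""
    counts = Counter(word for word, _ in tagged_words)
    total = sum(counts.values())
    pos_of = {word: pos for word, pos in tagged_words}

    scores = {}
    for word, count in counts.items():
        if pos_of[word] not in allowed_pos:
            continue  # keep only the preset parts of speech
        tf = count / total
        idf = math.log(num_docs / (1 + doc_freq.get(word, 0)))  # assumed smoothing
        scores[word] = tf * idf

    if len(scores) < top_k:
        return None  # too few keywords: the caller screens the text out
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda word: (-scores[word], word))[:top_k]
```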
Further, when it is determined that the first text does not contain more than the preset number of keywords of the preset part of speech, the first text is screened out, and a text is randomly obtained from the preset storage space as the first text to be screened, on which the word segmentation operation is performed again.
When the first text contains fewer than 20 keywords of the preset part of speech, the first text is deleted, and a text is randomly obtained from the preset storage space as the first text to be screened, on which the word segmentation operation is performed again. Deleting texts with too few keywords avoids performing further operations such as hashing and dimensionality reduction on texts whose features are not distinctive, which speeds up deduplication of massive amounts of text.
In one embodiment, performing a word segmentation operation on the first text to be screened to obtain a plurality of segmented words includes:
matching the read text against the lexicon using the forward maximum matching method to obtain a first matching result, the first matching result containing a first number of first phrases and a second number of single characters;
matching the read text against the lexicon using the reverse maximum matching method to obtain a second matching result, the second matching result containing a third number of second phrases and a fourth number of single characters;
if the first number is equal to the third number and the second number is less than or equal to the fourth number, or if the first number is less than the third number, using the first matching result as the word segmentation result of the first text; if the first number is equal to the third number and the second number is greater than the fourth number, or if the first number is greater than the third number, using the second matching result as the word segmentation result of the first text.
Performing word segmentation matching in both the forward and reverse directions and selecting the matching result with fewer single characters and more phrases as the segmentation result of the sentence can improve the accuracy of word segmentation.
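Forward and reverse maximum matching, together with the phrase and single-character counts that the selection rule compares, can be sketched as follows. The function names, the small lexicon, and the sample sentence are illustrative assumptions; unmatched characters are emitted as single-character tokens.

```python
def forward_max_match(text: str, lexicon: set[str], max_len: int) -> list[str]:
    """Scan left to right, always taking the longest lexicon entry that fits."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def reverse_max_match(text: str, lexicon: set[str], max_len: int) -> list[str]:
    """Scan right to left, always taking the longest lexicon entry that fits."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def count_phrases_and_singles(tokens: list[str]) -> tuple[int, int]:
    """Return (number of multi-character phrases, number of single characters)."""
    phrases = sum(1 for token in tokens if len(token) > 1)
    return phrases, len(tokens) - phrases


lexicon = {"研究", "研究生", "生命", "起源"}            # hypothetical lexicon
max_len = max([1] + [len(word) for word in lexicon])
forward = forward_max_match("研究生命的起源", lexicon, max_len)
backward = reverse_max_match("研究生命的起源", lexicon, max_len)
print(forward, count_phrases_and_singles(forward))
print(backward, count_phrases_and_singles(backward))
```

The two count pairs correspond to the first and second numbers (forward result) and the third and fourth numbers (reverse result) referred to in the text.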
Step S20: calculate the hash value of each segmented word and each keyword using a hash function, perform a weighting operation on the hash value of each segmented word and its corresponding weight to obtain the weight vector of that segmented word, and perform a weighting operation on the hash value of each keyword and its corresponding weight to obtain the weight vector of that keyword.
In this embodiment, a hash function may be used to calculate the first hash value of each segmented word and of each keyword. The first hash value is an n-bit signature made up of the binary digits "0" and "1". For example, the hash value Hash(CSDN) of "CSDN" is "100101", and the hash value Hash(博客) of "博客" (blog) is "101011". A weighting operation is then performed on the hash value of each segmented word and its corresponding weight to obtain the weight vector of that segmented word, and on the hash value of each keyword and its corresponding weight to obtain the weight vector of that keyword.
Specifically, each segmented word and keyword is weighted on the basis of its first hash value, that is, W = Hash × weight: where a bit of the hash value is 1, the weight is taken with a positive sign, and where a bit is 0, the weight is taken with a negative sign. For example, weighting the hash value "100101" of "CSDN" with weight 4 gives the weight vector W(CSDN) = 100101 × 4 = "4 -4 -4 4 -4 4", and weighting the hash value "101011" of "博客" with weight 5 gives the weight vector W(博客) = 101011 × 5 = "5 -5 5 -5 5 5". The remaining segmented words and keywords are handled in the same way.
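The weighting rule of step S20 can also be written compactly. The symbols used here ($b_i$ for the $i$-th bit of the hash value, $w$ for the weight, $W_i$ for the $i$-th component of the weight vector) are notation introduced for illustration and do not appear in the original text.

```latex
\[
  W_i = w\,(2b_i - 1) =
  \begin{cases}
    +w, & b_i = 1,\\
    -w, & b_i = 0,
  \end{cases}
  \qquad i = 1, \dots, n
\]
\[
  W(\text{CSDN}) = 4 \cdot (1, -1, -1, 1, -1, 1) = (4, -4, -4, 4, -4, 4)
\]
```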
Step S30: accumulate the weight vectors of the segmented words to obtain a first weight vector of the first text, accumulate the weight vectors of the keywords to obtain a second weight vector of the first text, and perform a dimensionality reduction operation on the first weight vector and the second weight vector to obtain a first simhash value and a second simhash value of the first text.
In this embodiment, the weight vectors of the segmented words are accumulated to obtain the first weight vector of the first text, and the weight vectors of the keywords are accumulated to obtain the second weight vector of the first text. Accumulating the weight vectors position by position yields a single sequence that serves as the first or second weight vector of the first text. For example, accumulating "4 -4 -4 4 -4 4" for "CSDN" and "5 -5 5 -5 5 5" for "博客" gives "4+5 -4+-5 -4+5 4+-5 -4+5 4+5", that is, "9 -9 1 -1 1 9".
A dimensionality reduction operation is then performed on the first weight vector and the second weight vector; mapping a high-dimensional feature vector to a low-dimensional one speeds up processing. This yields the first simhash value and the second simhash value of the first text, where the first simhash value is the simhash value corresponding to the segmented words of the first text and the second simhash value is the simhash value corresponding to the keywords of the first text. Specifically, each position of a weight vector of the first text is set to 1 if its value is greater than 0 and to 0 otherwise, which gives the simhash value of the first text. For example, applying the dimensionality reduction operation to the "9 -9 1 -1 1 9" calculated above gives the simhash value "1 0 1 0 1 1".
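Steps S20 and S30 can be combined into a single routine. The sketch below is one possible arrangement, not the method prescribed by the text: it assumes a 64-bit signature and uses BLAKE2b from Python's `hashlib` as the hash function, whereas the text leaves both the hash function and the signature length n open. The function names are hypothetical.

```python
import hashlib

BITS = 64  # assumed signature length; the text only says "n-bit"


def token_hash(token: str) -> int:
    """64-bit hash of a token (BLAKE2b is an assumed choice of hash function)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=BITS // 8).digest()
    return int.from_bytes(digest, "big")


def simhash(weighted_tokens: list[tuple[str, int]]) -> int:
    """Weight each token's hash bits, accumulate per position, then reduce to bits."""
    totals = [0] * BITS
    for token, weight in weighted_tokens:
        value = token_hash(token)
        for i in range(BITS):
            bit = (value >> (BITS - 1 - i)) & 1      # most significant bit first
            totals[i] += weight if bit else -weight  # bit 1 adds, bit 0 subtracts
    fingerprint = 0
    for total in totals:
        fingerprint = (fingerprint << 1) | (1 if total > 0 else 0)
    return fingerprint


# Segmented words and weights from the example sentence in step S10.
words = [("CSDN", 4), ("博客", 5), ("结构", 3), ("之", 1), ("法", 2), ("算法", 3),
         ("之", 1), ("道", 2), ("的", 1), ("作者", 5), ("July", 5)]
first_simhash = simhash(words)
print(f"{first_simhash:064b}")
```

Calling `simhash` once with the weighted segmented words and once with the weighted keywords would give the first and second simhash values respectively.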
Step S40: calculate a first distance value between the first simhash value and a third simhash value of a preset target text; when the first distance value is greater than a first preset value, calculate a second distance value between the second simhash value and the third simhash value; and when the second distance value is less than or equal to a second preset value, screen out the first text.
In this embodiment, an actual text deduplication operation may crawl a text that is highly similar to an existing text, such as an abstract of that text or a summary of it. For example, for a summary announcement about a certain stock, there may be another summary announcement that is more detailed. If similarity were judged only from the first simhash value obtained from the segmented words, the two texts would be judged not to be duplicates. The second simhash value obtained from the keywords of the text is therefore also needed to further determine whether the two texts are similar.
Specifically, the first distance value between the first simhash value and the third simhash value of the target text is calculated. It can be understood that the third simhash value may be the simhash value obtained from the segmented words of the target text, and the first distance value may be a Hamming distance. When the first distance value is greater than the first preset value (for example, 3), the first simhash value obtained from the segmented words indicates that the two texts are neither identical nor similar. In that case, the second distance value between the second simhash value and the third simhash value of the target text can be further calculated. When the second distance value is less than or equal to the second preset value, the simhash value obtained from the keywords indicates that the two texts are similar, and the first text can be screened out. The second preset value can be set according to actual conditions. It can be understood that the target text in the preset storage space is the text against which the first text is compared for similarity or identity; it may be a text crawled before the first text, or any text in a text set in a database.
In one embodiment, when the first distance value is less than or equal to the first preset value, the first text is screened out. A first distance value that does not exceed the first preset value indicates that the two texts are highly similar, so the first text can be screened out.
Further, when the second distance value is greater than the second preset value, the first text is stored in the text set to which the preset target text belongs. A second distance value greater than the second preset value means that, judged by the simhash value obtained from the keywords, the two texts are not similar, so the first text can be retained.
In one embodiment, when the first distance value is greater than the first preset value, the method further includes: calculating a third distance value between the first simhash value and a fourth simhash value of the target text, and screening out the first text when the third distance value is less than or equal to a third preset value.
The fourth simhash value is the simhash value corresponding to the keywords of the target text. Comparing the distance between the keyword simhash values of the two texts makes it possible to screen out further similar texts.
In practice, combining segmented words with keywords makes it possible to deduplicate abstracts and summaries. Two simhash values are kept for each text, one for its segmented words and one for its keywords. The segmented-word value is checked first and the keyword value second, which can markedly improve the practical effect of simhash in text deduplication and screening.
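Keeping two simhash values per text suggests a small store against which each new text is checked. The sketch below is a hypothetical arrangement: the class name `FingerprintStore`, the method `offer`, the linear scan over all stored texts, and the default thresholds of 3 are assumptions. The three distance checks follow the wording of step S40 and the embodiments above (first against third, second against third, first against fourth).

```python
from dataclasses import dataclass, field


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


@dataclass
class FingerprintStore:
    first_preset: int = 3
    second_preset: int = 3   # assumed default
    third_preset: int = 3    # assumed default
    # One (segmented-word simhash, keyword simhash) pair per retained text.
    entries: list[tuple[int, int]] = field(default_factory=list)

    def offer(self, first_simhash: int, second_simhash: int) -> bool:
        """Store the text's fingerprints and return True if kept, False if screened out."""
        for third_simhash, fourth_simhash in self.entries:
            if hamming(first_simhash, third_simhash) <= self.first_preset:
                return False  # segmented-word fingerprints are close
            if hamming(second_simhash, third_simhash) <= self.second_preset:
                return False  # keyword fingerprint is close to the target's
            if hamming(first_simhash, fourth_simhash) <= self.third_preset:
                return False  # close to the target's keyword fingerprint
        self.entries.append((first_simhash, second_simhash))
        return True
```

A linear scan is the simplest choice for a sketch; a large text set would likely need an index over the fingerprints, which the text does not describe.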
In addition, an embodiment of the present invention provides a computer-readable storage medium, which may be any one or any combination of a hard disk, a multimedia card, an SD card, a flash memory card, a Smart Media Card (SMC), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a compact disc read-only memory (CD-ROM), a USB memory, and the like. The computer-readable storage medium includes a data storage area and a program storage area. The data storage area stores data created through the use of blockchain nodes, and the program storage area stores a text screening program 10 which, when executed by a processor, implements the following operations:
performing a word segmentation operation on the first text to be screened to obtain a plurality of segmented words, extracting keywords of a preset part of speech from the plurality of segmented words, and assigning an associated weight to each segmented word and each keyword;
calculating a first hash value for each segmented word and each keyword, performing a weighting operation on the first hash value and weight of each segmented word to obtain the weight vector of each segmented word, and performing a weighting operation on the first hash value and weight of each keyword to obtain the weight vector of each keyword;
accumulating the weight vectors of the segmented words to obtain a first weight vector of the first text, accumulating the weight vectors of the keywords to obtain a second weight vector of the first text, and performing a dimensionality reduction operation on the first weight vector and the second weight vector to obtain, respectively, a first simhash value and a second simhash value of the first text;
calculating a first distance value between the first simhash value and a third simhash value of a target text in a preset storage space; when the first distance value is greater than a first preset value, calculating a second distance value between the second simhash value and the third simhash value; and when the second distance value is less than or equal to a second preset value, screening out the first text.
In another embodiment of the text screening method provided by the present invention, all of the data mentioned above may also be stored in a node of a blockchain to further ensure its privacy and security. For example, the hash values of texts and the texts to be retained can all be stored in blockchain nodes.
It should be noted that the blockchain referred to in the present invention is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each block contains information about a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
The specific implementation of the computer-readable storage medium of the present invention is substantially the same as that of the text screening method described above and is not repeated here.
It should be noted that the serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments. The terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article, or method. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, device, article, or method that includes the element.
From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, magnetic disk, or optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, an electronic device, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.
| Application Number | Publication | Priority Date | Filing Date | Title |
|---|---|---|---|---|
| CN202011302193.8A | CN112364625A (en) | 2020-11-19 | 2020-11-19 | Text screening method, device, equipment and storage medium |
| PCT/CN2021/123907 | WO2022105497A1 (en) | 2020-11-19 | 2021-10-14 | Text screening method and apparatus, device, and storage medium |
| Publication Number | Publication Date |
|---|---|
| CN112364625A (en) | 2021-02-12 |
| Application Number | Status | Publication | Priority Date | Filing Date | Title |
|---|---|---|---|---|---|
| CN202011302193.8A | Pending | CN112364625A (en) | 2020-11-19 | 2020-11-19 | Text screening method, device, equipment and storage medium |
| Country | Link |
|---|---|
| CN (1) | CN112364625A (en) |
| WO (1) | WO2022105497A1 (en) |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113254658A (en)* | 2021-07-07 | 2021-08-13 | 明品云(北京)数据科技有限公司 | Text information processing method, system, medium, and apparatus |
| CN113449073A (en)* | 2021-06-21 | 2021-09-28 | 福州米鱼信息科技有限公司 | Keyword selection method and system |
| CN114003275A (en)* | 2021-11-15 | 2022-02-01 | 北京天融信网络安全技术有限公司 | A detection method, device and storage medium for code clone |
| CN114219571A (en)* | 2021-12-16 | 2022-03-22 | 广州华多网络科技有限公司 | E-commerce independent site matching method and its device, equipment, medium and product |
| WO2022105497A1 (en)* | 2020-11-19 | 2022-05-27 | 深圳壹账通智能科技有限公司 | Text screening method and apparatus, device, and storage medium |
| CN114742042A (en)* | 2022-03-22 | 2022-07-12 | 杭州未名信科科技有限公司 | A text deduplication method, device, electronic device and storage medium |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114936797A (en)* | 2022-06-16 | 2022-08-23 | 身边云(北京)信息服务有限公司 | A freelancer evaluation method, electronic device and storage medium |
| CN115358340A (en)* | 2022-08-30 | 2022-11-18 | 联洋国融(上海)科技有限公司 | Credit credit collection short message distinguishing method, system, equipment and storage medium |
| CN118964527B (en)* | 2024-10-15 | 2025-01-17 | 纽川技术有限公司 | Electronic case content analysis method and device for intelligent court |
| CN120200830B (en)* | 2025-04-25 | 2025-09-19 | 北京美创美利科技有限公司 | An industrial Internet encryption method and system based on blockchain evidence storage |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108132929A (en)* | 2017-12-25 | 2018-06-08 | 上海大学 | A kind of similarity calculation method of magnanimity non-structured text |
| CN109948125A (en)* | 2019-03-25 | 2019-06-28 | 成都信息工程大学 | Method and system of improved Simhash algorithm in text deduplication |
| CN110297879A (en)* | 2019-05-15 | 2019-10-01 | 平安科技(深圳)有限公司 | A kind of method, apparatus and storage medium of the data deduplication based on big data |
| CN111625647A (en)* | 2020-05-25 | 2020-09-04 | 红船科技(广州)有限公司 | Unsupervised news automatic classification method |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111381751A (en)* | 2016-10-18 | 2020-07-07 | 北京字节跳动网络技术有限公司 | Text processing method and device |
| CN107066623A (en)* | 2017-05-12 | 2017-08-18 | 湖南中周至尚信息技术有限公司 | A kind of article merging method and device |
| CN108776654A (en)* | 2018-05-30 | 2018-11-09 | 昆明理工大学 | One kind being based on improved simhash transcription comparison methods |
| CN110737748B (en)* | 2019-09-27 | 2023-08-08 | 成都数联铭品科技有限公司 | Text deduplication method and system |
| CN111339166A (en)* | 2020-02-29 | 2020-06-26 | 深圳壹账通智能科技有限公司 | Thesaurus-based matching recommendation method, electronic device and storage medium |
| CN112364625A (en)* | 2020-11-19 | 2021-02-12 | 深圳壹账通智能科技有限公司 | Text screening method, device, equipment and storage medium |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2022105497A1 (en)* | 2020-11-19 | 2022-05-27 | 深圳壹账通智能科技有限公司 | Text screening method and apparatus, device, and storage medium |
| CN113449073A (en)* | 2021-06-21 | 2021-09-28 | 福州米鱼信息科技有限公司 | Keyword selection method and system |
| CN113254658A (en)* | 2021-07-07 | 2021-08-13 | 明品云(北京)数据科技有限公司 | Text information processing method, system, medium, and apparatus |
| CN113254658B (en)* | 2021-07-07 | 2021-12-21 | 明品云(北京)数据科技有限公司 | Text information processing method, system, medium, and apparatus |
| CN114003275A (en)* | 2021-11-15 | 2022-02-01 | 北京天融信网络安全技术有限公司 | A detection method, device and storage medium for code clone |
| CN114219571A (en)* | 2021-12-16 | 2022-03-22 | 广州华多网络科技有限公司 | E-commerce independent site matching method and its device, equipment, medium and product |
| CN114742042A (en)* | 2022-03-22 | 2022-07-12 | 杭州未名信科科技有限公司 | A text deduplication method, device, electronic device and storage medium |
| Publication number | Publication date |
|---|---|
| WO2022105497A1 (en) | 2022-05-27 |
| Publication | Title |
|---|---|
| CN112364625A (en) | Text screening method, device, equipment and storage medium |
| CN108304378B (en) | Text similarity computing method, apparatus, computer equipment and storage medium |
| Urvoy et al. | Tracking web spam with html style similarities |
| CN111797214A (en) | Question screening method, device, computer equipment and medium based on FAQ database |
| CN112541338A (en) | Similar text matching method and device, electronic equipment and computer storage medium |
| US9996588B2 (en) | Managing a search |
| WO2019085335A1 (en) | Method for discovering investment objects with new words, device and storage medium |
| WO2020228182A1 (en) | Big data-based data deduplication method and apparatus, device, and storage medium |
| CN111930962A (en) | Document data value evaluation method and device, electronic equipment and storage medium |
| CN109299235B (en) | Knowledge base searching method, device and computer readable storage medium |
| WO2021051934A1 (en) | Method and apparatus for extracting key contract term on basis of artificial intelligence, and storage medium |
| CN111737997A (en) | A text similarity determination method, device and storage medium |
| US20210256115A1 (en) | Method and electronic device for generating semantic representation of document to determine data security risk |
| WO2020000717A1 (en) | Web page classification method and device, and computer-readable storage medium |
| CN110134777B (en) | Question duplication eliminating method and device, electronic equipment and computer readable storage medium |
| WO2021052148A1 (en) | Contract sensitive word checking method and apparatus based on artificial intelligence, computer device, and storage medium |
| US11914626B2 (en) | Machine learning approach to cross-language translation and search |
| CN111339166A (en) | Thesaurus-based matching recommendation method, electronic device and storage medium |
| WO2019085332A1 (en) | Financial data analysis method, application server, and computer readable storage medium |
| WO2021012958A1 (en) | Original text screening method, apparatus, device and computer-readable storage medium |
| WO2020258481A1 (en) | Method and apparatus for intelligently recommending personalized text, and computer-readable storage medium |
| CN107924398B (en) | System and method for providing a review-centric news reader |
| US8549309B1 (en) | Asymmetric content fingerprinting with adaptive window sizing |
| WO2021068681A1 (en) | Tag analysis method and device, and computer readable storage medium |
| CN111369148A (en) | Object index monitoring method, electronic device and storage medium |
| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |