CN116015777A - A document detection method, device, equipment and storage medium - Google Patents

A document detection method, device, equipment and storage medium

Publication number
CN116015777A
Authority
CN
China
Prior art keywords
document
detected
data
preset
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202211598627.2A
Other languages
Chinese (zh)
Inventor
吕杰
吴栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DBAPPSecurity Co Ltd
Original Assignee
DBAPPSecurity Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DBAPPSecurity Co Ltd
Priority to CN202211598627.2A
Publication of CN116015777A
Legal status: Withdrawn


Abstract

The application discloses a document detection method, device, equipment and storage medium in the field of computer technology. The method comprises: launching a program to obtain a document to be detected, and determining the data to be processed in the document based on preset document information extraction rules; obtaining a target indicator of compromise (IOC) to be detected from the data to be processed using preset IOC extraction rules; and matching the target IOC against a preset threat intelligence library and a preset phishing website feature recognition library, so as to determine from the detection result whether the document is a phishing document. Because the target IOC derived from the document is matched against both libraries, phishing threats in network documents can be identified, which effectively reduces the likelihood of being attacked by phishing documents in an Internet environment.

Description

Translated from Chinese
A document detection method, device, equipment and storage medium

Technical field

The present invention relates to the field of computer technology, and in particular to a document detection method, device, equipment and storage medium.

Background

Phishing has long been a primary vector of network attacks. The 2009 Survey Report on the Network Information Security of Chinese Netizens, issued by the China Internet Network Information Center together with the National Internet Emergency Response Center, shows that as early as 2009 more than 90% of netizens had encountered phishing. Among those who had, 45 million suffered financial losses, accounting for 11.9% of all netizens, and the losses caused by phishing reached 7.6 billion yuan. A 2016 PhishMe report found that 91% of cyber attacks began with a phishing email.

Common phishing techniques at present include phishing emails, phishing documents, counterfeit websites, trojan-based theft and mobile text messages. A phishing attack often combines several of these at once: for example, a phishing email carries a phishing document as an attachment, and the document embeds a forged website built to steal user account information. Phishing documents themselves use several attack methods, including embedded link redirection, QR code link redirection, macro code execution and vulnerability exploitation. In short, phishing documents take ever-changing forms, and how to effectively identify phishing threats in documents obtained in different scenarios, such as phishing email delivery, transfer through social software and downloads from trojanized URLs, remains to be solved.

Summary of the invention

In view of this, the purpose of the present invention is to provide a document detection method, device, equipment and storage medium that can identify phishing threats in network documents and effectively reduce the likelihood of being attacked by phishing documents in an Internet environment. The specific scheme is as follows:

In a first aspect, the present application provides a document detection method, comprising:

launching a program to obtain a document to be detected, and determining the data to be processed in the document to be detected based on preset document information extraction rules; the data to be processed includes text data to be processed, picture data to be processed and behavior data to be processed;

determining the indicators of compromise (IOCs) of the document to be detected from the data to be processed using preset IOC extraction rules, to obtain a target IOC to be detected;

performing matching detection on the target IOC to be detected using a preset threat intelligence library and a preset phishing website feature recognition library, so as to determine, based on the detection result, whether the document to be detected is a phishing document; the preset threat intelligence library includes threat IOCs, threat categories and threat alerts, and the preset phishing website feature recognition library is a matching rule library built from the text content of phishing features found in phishing webpages.

Optionally, determining the data to be processed in the document to be detected based on preset document information extraction rules comprises:

parsing the document to be detected with a preset format parsing tool corresponding to its format to obtain target static analysis data, and extracting the static text data and static picture data from the target static analysis data to obtain the corresponding text data to be processed and first picture data to be processed;

placing the document to be detected in a sandbox for dynamic analysis to obtain target dynamic analysis data, and extracting the dynamic behavior data from the target dynamic analysis data to obtain the behavior data to be processed; the dynamic behavior data includes process behavior data, file behavior data, network behavior data and runtime memory data;

determining a preview picture corresponding to the document to be detected according to a preset preview picture generation rule, to obtain the corresponding second picture data to be processed.

Optionally, before parsing the document to be detected with the preset format parsing tool corresponding to its format to obtain the target static analysis data, the method further comprises:

when the format of the document to be detected is a compressed format, decompressing it with a preset decompression tool to obtain a decompressed document to be detected, so that the data to be processed is determined based on the decompressed document.

Optionally, extracting the static text data from the target static analysis data to obtain the corresponding text data to be processed comprises:

extracting the macro code information in the document to be detected with a preset macro code extraction tool, to obtain the corresponding text data to be processed.

Optionally, determining the preview picture corresponding to the document to be detected according to the preset preview picture generation rule, to obtain the corresponding second picture data to be processed, comprises:

converting the document to be detected into a preset target format with a preset format conversion tool, to obtain a converted document to be detected;

obtaining, from the converted document and with a preset picture conversion tool, a preview picture corresponding to the document to be detected, to obtain the corresponding picture data to be processed.

Optionally, performing matching detection on the target indicator of compromise (IOC) to be detected using the preset threat intelligence library comprises:

obtaining the threat IOC field in the preset threat intelligence library;

performing exact matching or fuzzy string matching on the target IOC to be detected based on the threat IOC field.
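The exact-or-fuzzy lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout (`ioc`, `category`, `alert`), the sample entries, the `match_ioc` helper and the 0.9 similarity threshold are all hypothetical, and `difflib.SequenceMatcher` is swapped in as one possible fuzzy string measure because the source does not name one.

```python
from difflib import SequenceMatcher

# Hypothetical threat intelligence records: IOC value, threat category, alert level.
THREAT_INTEL = [
    {"ioc": "login.example.test", "category": "phishing", "alert": "high"},
    {"ioc": "198.51.100.7", "category": "c2", "alert": "medium"},
]


def match_ioc(candidate, intel, fuzzy_threshold=0.9):
    """Return the matching record (exact first, then fuzzy), or None."""
    needle = candidate.strip().lower()

    # Exact match: use the IOC field as an index.
    index = {record["ioc"]: record for record in intel}
    if needle in index:
        return {**index[needle], "match": "exact"}

    # Fuzzy match: keep the most similar record above the assumed threshold.
    best, best_score = None, 0.0
    for record in intel:
        score = SequenceMatcher(None, needle, record["ioc"]).ratio()
        if score >= fuzzy_threshold and score > best_score:
            best, best_score = record, score
    if best is None:
        return None
    return {**best, "match": "fuzzy", "score": round(best_score, 3)}
```

A look-alike such as `1ogin.example.test` differs from the stored value by one character, so it falls through the exact lookup and is caught by the fuzzy pass; an unrelated domain returns `None`.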

Optionally, performing matching detection on the target indicator of compromise (IOC) to be detected using the preset phishing website feature recognition library comprises:

crawling in real time based on the webpage data to be detected in the target IOC to be detected, to obtain crawled data; the webpage data to be detected includes domain name data, IP data and URL data;

identifying the webpage text content in the crawled data with the preset phishing website feature recognition library, based on a preset identification method.

Optionally, determining the indicators of compromise (IOCs) of the document to be detected from the data to be processed using preset IOC extraction rules, to obtain the target IOC to be detected, comprises:

recognizing and extracting, with a preset OCR tool, the text data contained in the picture data to be processed, and extracting by regular-expression matching a first target IOC to be detected contained in the text data to be processed;

recognizing the QR code information in the picture data to be processed with a preset picture QR code recognition tool, and extracting a second target IOC to be detected contained in the recognized QR code;

analyzing the behavior data to be processed and extracting a third target IOC to be detected contained in it.

In a second aspect, the present application provides a document detection device, comprising:

a data determination module, configured to launch a program to obtain a document to be detected, and to determine the data to be processed in the document to be detected based on preset document information extraction rules; the data to be processed includes text data to be processed, picture data to be processed and behavior data to be processed;

an indicator of compromise (IOC) determination module, configured to determine the IOCs of the document to be detected from the data to be processed using preset IOC extraction rules, to obtain a target IOC to be detected;

a matching detection module, configured to perform matching detection on the target IOC to be detected using a preset threat intelligence library and a preset phishing website feature recognition library, so as to determine, based on the detection result, whether the document to be detected is a phishing document; the preset threat intelligence library includes threat IOCs, threat categories and threat alerts, and the preset phishing website feature recognition library is a matching rule library built from the text content of phishing features found in phishing webpages.

In a third aspect, the present application provides an electronic device, comprising:

a memory for storing a computer program;

a processor for executing the computer program to implement the steps of the aforementioned document detection method.

In a fourth aspect, the present application provides a computer-readable storage medium for storing a computer program which, when executed by a processor, implements the steps of the aforementioned document detection method.

It can be seen that, in the present application, a program is first launched to obtain the document to be detected, and the data to be processed in the document is determined based on preset document information extraction rules; the data to be processed includes text data, picture data and behavior data. The indicators of compromise (IOCs) of the document are then determined from the data to be processed using preset IOC extraction rules, to obtain the target IOC to be detected. Finally, the target IOC is matched against a preset threat intelligence library and a preset phishing website feature recognition library, so as to determine from the detection result whether the document is a phishing document; the preset threat intelligence library includes threat IOCs, threat categories and threat alerts, and the preset phishing website feature recognition library is a matching rule library built from the text content of phishing features found in phishing webpages. By matching the target IOCs extracted from the document against both libraries, the application identifies phishing documents, which makes it possible to recognize phishing threats in network documents and effectively reduces the likelihood of being attacked by phishing documents in an Internet environment.

Description of drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below are merely embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of a document detection method provided by the present application;

Fig. 2 is a schematic diagram of a specific indicator of compromise extraction process provided by the present application;

Fig. 3 is a schematic diagram of a specific indicator of compromise detection process provided by the present application;

Fig. 4 is a flow chart of a specific document detection method provided by the present application;

Fig. 5 is a schematic structural diagram of a document detection device provided by the present application;

Fig. 6 is a structural diagram of an electronic device provided by the present application.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

Common phishing techniques at present include phishing emails, phishing documents, counterfeit websites, trojan-based theft and mobile text messages. A phishing attack often combines several of these at once: for example, a phishing email carries a phishing document as an attachment, and the document embeds a forged website built to steal user account information. Phishing documents themselves use several attack methods, including embedded link redirection, QR code link redirection, macro code execution and vulnerability exploitation. In short, phishing documents take ever-changing forms, and how to effectively identify phishing threats in documents obtained in different scenarios, such as phishing email delivery, transfer through social software and downloads from trojanized URLs, remains to be solved. To this end, the present application provides a document detection scheme that can identify phishing threats in network documents and thereby reduce the likelihood of being attacked by phishing documents in an Internet environment.

Referring to Fig. 1, an embodiment of the present invention discloses a document detection method, comprising:

Step S11: launch a program to obtain the document to be detected, and determine the data to be processed in the document to be detected based on preset document information extraction rules; the data to be processed includes text data to be processed, picture data to be processed and behavior data to be processed.

In this embodiment, a document obtained in any of several scenarios, such as phishing email delivery, transfer through social software or download from a trojanized URL, becomes the document to be detected. The text data, picture data and behavior data to be processed are then determined from the document based on preset document information extraction rules, yielding the data to be processed. As shown in Fig. 2, once the document to be detected has been obtained, the data to be processed is extracted along several dimensions. First, the document can be parsed statically: special formats such as doc (document, a file extension) and pdf (portable document format) are parsed with the corresponding preset format parsing tools. Compressed formats such as docx and xlsx must first be decompressed with a preset decompression tool to obtain documents in special formats such as doc and pdf. The text data and picture data produced by static parsing are then extracted to obtain the corresponding text data and picture data to be processed; for a document that contains macro code, the macro code information can be extracted with a preset macro code extraction tool. In parallel, the document to be detected can be rendered as a preview to obtain further picture data to be processed, and it can be placed in a sandbox for dynamic analysis to obtain the corresponding behavior data to be processed.
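As a rough illustration of the decompress-then-extract path for a compressed format, the sketch below treats a `.docx`-style file as the ZIP container it is and pulls out body text, embedded pictures and a macro flag. It is not the patent's tooling: the `extract_static_data` helper, the returned dictionary layout and the crude tag stripping are assumptions, and the member paths (`word/document.xml`, `word/media/`, `vbaProject.bin`) follow the usual Office Open XML layout.

```python
import io
import re
import zipfile


def extract_static_data(container_bytes):
    """Unpack an OOXML-style ZIP container and collect text, pictures and a macro flag."""
    result = {"text": "", "images": {}, "has_macro": False}
    with zipfile.ZipFile(io.BytesIO(container_bytes)) as archive:
        for name in archive.namelist():
            if name == "word/document.xml":
                xml = archive.read(name).decode("utf-8", errors="replace")
                # Crude tag stripping for illustration; a real parser would walk the XML tree.
                result["text"] = re.sub(r"<[^>]+>", " ", xml).strip()
            elif name.startswith("word/media/"):
                # Embedded pictures become picture data to be processed.
                result["images"][name] = archive.read(name)
            elif name.endswith("vbaProject.bin"):
                # Presence of a VBA project means a macro extraction tool should run.
                result["has_macro"] = True
    return result
```

The text and pictures returned here would feed the regular-expression, OCR and QR code steps that follow, while `has_macro` only signals that a dedicated macro extraction tool is needed; this sketch does not decode the macro code itself.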

Note that rendering the document to be detected as a preview takes two steps. First, a preset format conversion tool converts the document into a preset target format, giving a converted document to be detected. Then a preset picture conversion tool produces, from the converted document, the preview picture corresponding to the document to be detected, which becomes the picture data to be processed. Specifically, the preset format conversion tool includes but is not limited to OpenOffice (a cross-platform office software suite that runs on operating systems such as Windows, Linux, MacOS X (X11) and Solaris and is compatible with the major office suites). The preset picture conversion tool includes but is not limited to pdftoppm (a Linux command-line tool that converts PDF pages into pictures in formats such as PNG, Portable Network Graphics).
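The two-step preview pipeline can be sketched as a pair of command lines. The sketch assumes PDF as the preset target format and an office suite whose `soffice` binary accepts LibreOffice-style `--headless --convert-to` options; that convention, the `build_preview_commands` helper, the 100 dpi resolution and the `page` output prefix are assumptions, not details from the source.

```python
from pathlib import Path


def build_preview_commands(document, workdir):
    """Return the two argv lists: document -> PDF, then PDF -> PNG pages."""
    pdf_path = workdir / (document.stem + ".pdf")
    # Step 1: headless office conversion to the assumed target format (PDF).
    convert = [
        "soffice", "--headless", "--convert-to", "pdf",
        "--outdir", str(workdir), str(document),
    ]
    # Step 2: render each PDF page to PNG at an assumed 100 dpi.
    render = ["pdftoppm", "-png", "-r", "100", str(pdf_path), str(workdir / "page")]
    return [convert, render]
```

Each list can be handed to `subprocess.run(cmd, check=True)` in order; building the commands separately from running them keeps the sketch usable on machines where neither tool is installed.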

Step S12: determine the indicators of compromise (IOCs) of the document to be detected from the data to be processed using preset IOC extraction rules, to obtain the target IOC to be detected.

In this embodiment, after the data to be processed has been obtained from the document along several dimensions, the IOCs of the document are determined from that data using preset IOC extraction rules, yielding the target IOC to be detected.

With reference to Fig. 2, determining the indicators of compromise of the document to be detected from the data to be processed comprises: recognizing and extracting, with a preset OCR tool, the text data contained in the picture data to be processed, and extracting by regular-expression matching the first target indicator of compromise to be detected contained in the text data to be processed; recognizing the QR code information in the picture data to be processed with a preset picture QR code recognition tool, and extracting the second target indicator of compromise to be detected contained in the recognized QR code; and analyzing the behavior data to be processed and extracting the third target indicator of compromise to be detected contained in it. OCR (optical character recognition) is the process by which an electronic device (such as a scanner or digital camera) examines characters printed on paper and translates their shapes into computer text using character recognition methods; here it can be understood as extracting the text contained in a picture. An indicator of compromise (IOC) comprises data found in system log entries or files that is used to identify potentially malicious activity on a system or network. It may include domain name data, IP (Internet Protocol) data, URL (uniform resource locator) data, email addresses, paths, behaviors, mutexes and so on.
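The regular-expression step can be sketched as below for text that has already been extracted from the document or recognized by OCR. The patterns and the `extract_iocs` helper are illustrative assumptions, deliberately simple and not the patent's preset rules; they cover only URLs, IPv4 addresses, email addresses and domain names.

```python
import re

URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b", re.IGNORECASE
)


def extract_iocs(text):
    """Collect candidate IOCs from plain text, grouped by type."""
    # Trailing sentence punctuation is not part of the URL.
    urls = {u.rstrip(".,;:)") for u in URL_RE.findall(text)}
    return {
        "url": sorted(urls),
        "ip": sorted(set(IPV4_RE.findall(text))),
        "email": sorted(set(EMAIL_RE.findall(text))),
        "domain": sorted({d.lower() for d in DOMAIN_RE.findall(text)}),
    }
```

The same function could be applied to the payload decoded from a QR code, since that payload is usually a URL string; the QR decoding itself is outside this sketch.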

Note that a sandbox is a security mechanism that provides an isolated environment for running programs; it is typically used to experiment with programs whose source is untrusted, that may be destructive, or whose intent cannot be determined. Likewise, the preset picture QR code recognition tool recognizes the QR code information, and indicators of compromise such as domain name data, IP data and URL data are extracted from the recognized QR code. The common two-dimensional code is the QR Code, where QR stands for Quick Response; it is an encoding that stores more information and represents more data types than a traditional bar code. Further, when the third target indicator of compromise to be detected is extracted from the behavior data to be processed, indicators such as domain name data, IP data and URL data are analyzed and extracted from the process behavior data, file behavior data and network behavior data, while the strings contained in the runtime memory data are searched for indicators of compromise by regular-expression matching.
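For the runtime memory data, one plausible approach is to pull printable strings out of the raw bytes before applying regular expressions, much as the Unix `strings` utility does. The sketch below assumes a memory dump supplied as `bytes`; the six-character minimum, the ASCII-only scan and the helper names are assumptions.

```python
import re

# Runs of at least six printable ASCII bytes, similar to the `strings` utility.
PRINTABLE_RE = re.compile(rb"[\x20-\x7e]{6,}")
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def printable_strings(dump):
    """Return the printable ASCII runs found in a raw memory dump."""
    return [chunk.decode("ascii") for chunk in PRINTABLE_RE.findall(dump)]


def urls_in_memory(dump):
    """Apply a URL pattern to every printable string in the dump."""
    found = []
    for text in printable_strings(dump):
        found.extend(URL_RE.findall(text))
    return found
```

Strings stored as UTF-16, which is common in Windows process memory, would need a second pattern; this sketch does not cover them.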

Step S13: perform matching detection on the target indicator of compromise (IOC) to be detected using a preset threat intelligence library and a preset phishing website feature recognition library, so as to determine, based on the detection result, whether the document to be detected is a phishing document; the preset threat intelligence library includes threat IOCs, threat categories and threat alerts, and the preset phishing website feature recognition library is a matching rule library built from the text content of phishing features found in phishing webpages.

In this embodiment, once the target IOC corresponding to the document to be detected has been obtained, it is matched against the preset threat intelligence library and the preset phishing website feature recognition library, and the detection result determines whether the document is a phishing document.

Specifically, matching the target indicators of compromise (IOCs) to be detected against the preset threat intelligence library includes: obtaining the threat IOC field from the preset threat intelligence library; and performing exact matching or fuzzy string matching of the target IOCs against that field. Matching the target IOCs against the preset phishing website feature recognition library includes: crawling, in real time, the web page data to be detected contained in the target IOCs to obtain crawled data, where the web page data to be detected includes domain name data, IP data and URL data; and recognizing the web page text content in the crawled data with the preset phishing website feature recognition library according to a preset recognition method. It can be understood that when the target IOCs are checked against the preset threat intelligence library, the library is searched by exact match or fuzzy string match, with the threat IOC field as the index. The preset recognition method covers direct matching against the relevant rules and checking that word frequencies satisfy the rule requirements; the preset phishing website feature recognition library includes, but is not limited to, regular-expression rules for feature strings and word-frequency analysis rules for preset feature strings. Finally, the matching results from the preset threat intelligence library and the preset phishing website feature recognition library are combined to output the detection result for the document to be detected.
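The final step, combining the two matching results into one detection result, can be sketched as follows. The source does not say how the results are weighed, so this sketch assumes the simplest policy (a hit from either library marks the document as phishing); `DetectionResult` and `combine_results` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DetectionResult:
    """Combined outcome for one document (field names are illustrative)."""
    is_phishing: bool
    intel_hits: list = field(default_factory=list)    # IOCs found in the threat intelligence library
    feature_hits: list = field(default_factory=list)  # rule names that fired on crawled page text


def combine_results(intel_hits, feature_hits):
    """Assumed policy: a hit from either library marks the document as phishing."""
    return DetectionResult(
        is_phishing=bool(intel_hits) or bool(feature_hits),
        intel_hits=list(intel_hits),
        feature_hits=list(feature_hits),
    )
```

A stricter policy (for example, requiring both libraries to agree) would only change the boolean expression.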

As can be seen, in this embodiment of the present application, the program is first started to acquire the document to be detected, and the data to be processed in that document is determined based on preset document information extraction rules; the data to be processed includes text data, picture data and behavior data to be processed. The indicators of compromise (IOCs) of the document are then determined from the data to be processed using preset IOC extraction rules, giving the target IOCs to be detected. The target IOCs are matched against a preset threat intelligence library and a preset phishing website feature recognition library, and the detection result determines whether the document is a phishing document. The preset threat intelligence library includes threat IOCs, threat categories and threat alerts; the preset phishing website feature recognition library is a library of matching rules formed from the text content of phishing features found in phishing web pages. By matching the target IOCs extracted from the document against these two libraries, the present application identifies phishing documents, which enables phishing threat identification for network documents and effectively reduces the likelihood of being attacked by phishing documents in an Internet environment.

As the previous embodiment shows, the present application determines the target indicators of compromise (IOCs) to be detected for a document from the data to be processed in that document. This embodiment therefore describes in detail how the data to be processed is extracted from the document to be detected. Referring to FIG. 4, an embodiment of the present invention discloses a document detection method, including:

Step S21: Start the program and acquire the document to be detected.

Step S22: Parse the document to be detected with a preset format parsing tool corresponding to the format of the document to obtain target static analysis data, and extract the static text data and static picture data from the target static analysis data to obtain the corresponding text data to be processed and first picture data to be processed.

In this embodiment, before the document to be detected is parsed with the preset format parsing tool corresponding to its format to obtain the target static analysis data, the method further includes: when the format of the document to be detected is a compressed format, decompressing it with a preset decompression tool to obtain a decompressed document to be detected, so that the data to be processed is determined from the decompressed document. The decompressed document is then parsed with the preset format parsing tool corresponding to the format of the decompressed document to determine the data to be processed.

Step S23: Place the document to be detected in a sandbox for dynamic analysis to obtain target dynamic analysis data, and extract the dynamic behavior data from the target dynamic analysis data to obtain the behavior data to be processed; the dynamic behavior data includes process behavior data, file behavior data, network behavior data and running memory data.

Step S24: Determine the preview picture corresponding to the document to be detected according to a preset preview picture generation rule, to obtain the corresponding second picture data to be processed.

Step S25: Based on the text data to be processed, the first picture data to be processed, the dynamic behavior data and the second picture data to be processed, determine the indicators of compromise (IOCs) of the document to be detected using preset IOC extraction rules, to obtain the target IOCs to be detected.

Step S26: Match the target IOCs to be detected against a preset threat intelligence library and a preset phishing website feature recognition library, and determine from the detection result whether the document to be detected is a phishing document; the preset threat intelligence library includes threat IOCs, threat categories and threat alerts, and the preset phishing website feature recognition library is a library of matching rules formed from the text content of phishing features found in phishing web pages.

For the specific process of steps S21 to S26 above, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.

As can be seen, in this embodiment of the present application, the program is first started to acquire the document to be detected. The document is then parsed with a preset format parsing tool corresponding to its format to obtain target static analysis data, and the static text data and static picture data in the target static analysis data are extracted to obtain the corresponding text data to be processed and first picture data to be processed. The document is also placed in a sandbox for dynamic analysis to obtain target dynamic analysis data, and the dynamic behavior data in the target dynamic analysis data is extracted to obtain the behavior data to be processed; the dynamic behavior data includes process behavior data, file behavior data, network behavior data and running memory data. In addition, the preview picture corresponding to the document is determined according to a preset preview picture generation rule, to obtain the corresponding second picture data to be processed. Next, based on the text data to be processed, the first picture data to be processed, the dynamic behavior data and the second picture data to be processed, the indicators of compromise (IOCs) of the document are determined using preset IOC extraction rules, giving the target IOCs to be detected. Finally, the target IOCs are matched against a preset threat intelligence library and a preset phishing website feature recognition library, and the detection result determines whether the document is a phishing document; the preset threat intelligence library includes threat IOCs, threat categories and threat alerts, and the preset phishing website feature recognition library is a library of matching rules formed from the text content of phishing features found in phishing web pages. By combining static and dynamic analysis, this embodiment extracts the data to be processed from the document along multiple dimensions, which enables comprehensive threat identification for phishing documents and reduces the risk of phishing attacks such as embedded-link phishing, QR code phishing and macro code.
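The static-analysis step (S22) can be illustrated for a single format. The source only speaks of a "preset format parsing tool" per format, so the sketch below is an assumption: it treats a `.docx` file as the ZIP package it is and uses only the Python standard library to pull out paragraph text and embedded image bytes. `parse_docx_static` is a hypothetical name.

```python
import zipfile
from xml.etree import ElementTree

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def parse_docx_static(path_or_file):
    """Return (text, images): paragraph text and raw bytes of embedded images."""
    with zipfile.ZipFile(path_or_file) as package:
        root = ElementTree.fromstring(package.read("word/document.xml"))
        paragraphs = []
        for para in root.iter(W_NS + "p"):
            # A paragraph's visible text is split across <w:t> run elements.
            runs = [node.text or "" for node in para.iter(W_NS + "t")]
            paragraphs.append("".join(runs))
        images = {
            name: package.read(name)
            for name in package.namelist()
            if name.startswith("word/media/")
        }
    return "\n".join(paragraphs), images
```

The returned text corresponds to the text data to be processed and the image bytes to the first picture data to be processed. Hyperlink targets are stored in a separate relationships part of the package and are not read by this sketch.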

Referring to FIG. 5, an embodiment of the present application correspondingly discloses a document detection device, including:

A data determination module 11, configured to start the program to acquire the document to be detected, and to determine the data to be processed in the document based on preset document information extraction rules; the data to be processed includes text data, picture data and behavior data to be processed;

An indicator of compromise (IOC) determination module 12, configured to determine the IOCs of the document to be detected from the data to be processed using preset IOC extraction rules, to obtain the target IOCs to be detected;

A matching detection module 13, configured to match the target IOCs to be detected against a preset threat intelligence library and a preset phishing website feature recognition library, and to determine from the detection result whether the document to be detected is a phishing document; the preset threat intelligence library includes threat IOCs, threat categories and threat alerts, and the preset phishing website feature recognition library is a library of matching rules formed from the text content of phishing features found in phishing web pages.

For the more specific working process of each of the above modules, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.

As can be seen, in the present application, the program is first started to acquire the document to be detected, and the data to be processed in that document is determined based on preset document information extraction rules; the data to be processed includes text data, picture data and behavior data to be processed. The indicators of compromise (IOCs) of the document are then determined from the data to be processed using preset IOC extraction rules, giving the target IOCs to be detected. The target IOCs are matched against a preset threat intelligence library and a preset phishing website feature recognition library, and the detection result determines whether the document is a phishing document. The preset threat intelligence library includes threat IOCs, threat categories and threat alerts; the preset phishing website feature recognition library is a library of matching rules formed from the text content of phishing features found in phishing web pages. By matching the target IOCs extracted from the document against these two libraries, the present application identifies phishing documents, which enables phishing threat identification for network documents and effectively reduces the likelihood of being attacked by phishing documents in an Internet environment.

In some specific embodiments, the data determination module 11 may specifically include:

A static analysis unit, configured to parse the document to be detected with a preset format parsing tool corresponding to the format of the document to obtain target static analysis data, and to extract the static text data and static picture data from the target static analysis data to obtain the corresponding text data to be processed and first picture data to be processed;

A dynamic analysis unit, configured to place the document to be detected in a sandbox for dynamic analysis to obtain target dynamic analysis data, and to extract the dynamic behavior data from the target dynamic analysis data to obtain the behavior data to be processed; the dynamic behavior data includes process behavior data, file behavior data, network behavior data and running memory data;

A to-be-processed picture data acquisition unit, configured to determine the preview picture corresponding to the document to be detected according to a preset preview picture generation rule, to obtain the corresponding second picture data to be processed.

In some specific embodiments, the document detection device may further include:

A document decompression unit, configured to, when the format of the document to be detected is a compressed format, decompress it with a preset decompression tool to obtain a decompressed document to be detected, so that the data to be processed is determined from the decompressed document.
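A decompression unit of this kind handles untrusted archives, so it is worth sketching defensively. The source does not name the decompression tool; the example below assumes ZIP input and Python's standard `zipfile` module, and the size cap and the `extract_archive` name are illustrative choices rather than anything stated in the application.

```python
import zipfile
from pathlib import Path

MAX_TOTAL_BYTES = 100 * 1024 * 1024  # assumed cap to blunt decompression bombs


def extract_archive(archive, dest_dir):
    """Extract a ZIP archive into dest_dir and return the extracted file paths."""
    dest = Path(dest_dir).resolve()
    extracted = []
    with zipfile.ZipFile(archive) as zf:
        members = [m for m in zf.infolist() if not m.is_dir()]
        if sum(m.file_size for m in members) > MAX_TOTAL_BYTES:
            raise ValueError("archive expands beyond the configured size cap")
        for member in members:
            target = (dest / member.filename).resolve()
            # Refuse entries that would land outside dest_dir (path traversal).
            if dest not in target.parents:
                raise ValueError(f"unsafe member path: {member.filename}")
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(zf.read(member))
            extracted.append(target)
    return extracted
```

Each returned path would then be handed to the format-specific parser. Note that OOXML documents such as `.docx` are themselves ZIP packages, so a real implementation has to decide whether a ZIP file is an archive to unpack or a document to parse.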

In some specific embodiments, the document detection device may specifically include:

A macro code information extraction unit, configured to extract the macro code information in the document to be detected with a preset macro code extraction tool, to obtain the corresponding text data to be processed.
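The source does not identify the macro code extraction tool. As a partial illustration only, the sketch below covers the first step for OOXML files: locating the `vbaProject.bin` part in which macro-enabled Office documents store their VBA project. Decoding the VBA source from that binary part requires a dedicated parser and is outside this sketch; `find_vba_parts` is a hypothetical name.

```python
import zipfile


def find_vba_parts(path_or_file):
    """Return the names of vbaProject.bin parts inside an OOXML package, if any."""
    if not zipfile.is_zipfile(path_or_file):
        # Legacy binary formats (.doc, .xls) are not ZIP packages and need
        # a different parser; this sketch simply reports nothing for them.
        return []
    with zipfile.ZipFile(path_or_file) as package:
        return [
            name for name in package.namelist()
            if name.lower().endswith("vbaproject.bin")
        ]
```

A non-empty result indicates that the document carries macros whose source should be extracted and passed on as text data to be processed.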

In some specific embodiments, the document detection device may specifically include:

A document format conversion unit, configured to convert the format of the document to be detected into a preset target format with a preset format conversion tool, to obtain a converted document to be detected;

A preview picture conversion unit, configured to obtain, from the converted document to be detected and with a preset picture conversion tool, the preview picture corresponding to the document to be detected, to obtain the corresponding picture data to be processed.
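The two units above describe a convert-then-render pipeline without naming the tools. One possible realization, offered purely as an assumption, uses PDF as the preset target format, LibreOffice (`soffice`) as the format conversion tool and Poppler's `pdftoppm` as the picture conversion tool. Both programs must be installed separately; the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def pdf_conversion_command(document, out_dir):
    """Command line asking LibreOffice to convert a document to PDF."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(document)]


def preview_command(pdf_path, out_prefix, dpi=96):
    """Command line asking pdftoppm to render page 1 of the PDF as PNG."""
    return ["pdftoppm", "-png", "-f", "1", "-l", "1",
            "-r", str(dpi), str(pdf_path), str(out_prefix)]


def make_preview(document, work_dir):
    """Run both tools; return the prefix under which the PNG is written."""
    work_dir = Path(work_dir)
    subprocess.run(pdf_conversion_command(document, work_dir),
                   check=True, timeout=120)
    pdf_path = work_dir / (Path(document).stem + ".pdf")
    prefix = work_dir / "preview"
    subprocess.run(preview_command(pdf_path, prefix), check=True, timeout=60)
    return prefix
```

Because the document is untrusted, the conversion itself would normally run inside an isolated environment. The rendered preview is what the OCR and QR code recognition steps later consume.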

In some specific embodiments, the document detection device may specifically include:

A threat indicator of compromise (IOC) field acquisition unit, configured to obtain the threat IOC field from the preset threat intelligence library;

A threat IOC field matching unit, configured to perform exact matching or fuzzy string matching of the target IOCs to be detected against the threat IOC field.
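Exact and fuzzy matching against the threat IOC field can be sketched as below. The source does not specify the fuzzy algorithm or the record layout, so both are assumptions: records are plain dictionaries with an `indicator` key, and similarity is measured with the standard library's `difflib.SequenceMatcher` against an illustrative threshold of 0.9.

```python
from difflib import SequenceMatcher


def match_indicator(candidate, intel_records, fuzzy_threshold=0.9):
    """Return (record, kind) for the best match, or (None, None) if none qualifies."""
    needle = candidate.strip().lower()
    # Index the library by its IOC field so exact lookups are a dict access.
    index = {rec["indicator"].lower(): rec for rec in intel_records}
    if needle in index:
        return index[needle], "exact"
    best, best_score = None, 0.0
    for known, rec in index.items():
        score = SequenceMatcher(None, needle, known).ratio()
        if score > best_score:
            best, best_score = rec, score
    if best is not None and best_score >= fuzzy_threshold:
        return best, "fuzzy"
    return None, None
```

Fuzzy matching catches look-alike indicators such as a domain with one substituted character, at the cost of a linear scan; a production library would likely need a more scalable similarity index.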

In some specific embodiments, the document detection device may specifically include:

A real-time crawling unit, configured to crawl, in real time, the web page data to be detected contained in the target indicators of compromise (IOCs) to be detected, to obtain crawled data; the web page data to be detected includes domain name data, IP data and URL data;

A data recognition unit, configured to recognize the web page text content in the crawled data with the preset phishing website feature recognition library according to a preset recognition method.
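The recognition step applied to crawled page text combines regular-expression rules with word-frequency rules. The sketch below shows one way such a rule library could be evaluated; the specific patterns, words and thresholds are invented for illustration and are not taken from the source.

```python
import re
from collections import Counter

# Hypothetical feature-string rules: each fires on a direct regex match.
REGEX_RULES = {
    "password_prompt": re.compile(r"(enter|confirm|verify)\s+your\s+password", re.I),
    "account_suspended": re.compile(r"account\s+(has\s+been\s+)?(suspended|locked)", re.I),
}
# Hypothetical frequency rules: word -> minimum number of occurrences.
FREQUENCY_RULES = {"verify": 3, "login": 2}


def recognise_phishing_text(page_text):
    """Return the sorted names of the rules that fire on the page text."""
    hits = {name for name, pattern in REGEX_RULES.items()
            if pattern.search(page_text)}
    counts = Counter(re.findall(r"[a-z]+", page_text.lower()))
    for word, minimum in FREQUENCY_RULES.items():
        if counts[word] >= minimum:
            hits.add(f"freq:{word}")
    return sorted(hits)
```

An empty list means no rule fired; any non-empty list can be reported as a feature hit for the crawled page.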

In some specific embodiments, the indicator of compromise (IOC) determination module 12 may specifically include:

A first target IOC extraction unit, configured to use a preset OCR tool to recognize and extract the text data to be processed contained in the picture data to be processed, and to extract, by regular-expression matching, the first target IOCs to be detected contained in the text data to be processed;

A second target IOC extraction unit, configured to recognize the QR code information in the picture data to be processed with a preset picture QR code recognition tool, and to extract the second target IOCs to be detected contained in the recognized QR codes;

A third target IOC extraction unit, configured to analyze the behavior data to be processed and extract the third target IOCs to be detected contained in it.
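The regular-expression extraction performed by the first unit can be sketched for the three indicator types the source mentions elsewhere (URLs, IP addresses and domain names). The patterns below are simplified assumptions, not the application's actual rules, and `extract_iocs` is a hypothetical name. The same function could be applied to OCR output, decoded QR code payloads or text fields in the behavior data.

```python
import ipaddress
import re

URL_RE = re.compile(r"https?://[^\s\"'<>)\]]+", re.I)
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b", re.I)


def extract_iocs(text):
    """Return URLs, IPv4 addresses and domain names found in the text."""
    urls = sorted(set(URL_RE.findall(text)))
    ips = set()
    for candidate in IPV4_RE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)  # rejects octets above 255
        except ValueError:
            continue
        ips.add(candidate)
    domains = {d.lower() for d in DOMAIN_RE.findall(text)}
    return {"urls": urls, "ips": sorted(ips), "domains": sorted(domains)}
```

Validating IP candidates with `ipaddress` avoids reporting strings such as `999.1.1.1` that only look like addresses.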

Further, an embodiment of the present application also discloses an electronic device. FIG. 6 is a structural diagram of an electronic device 20 according to an exemplary embodiment; nothing in the figure should be regarded as limiting the scope of use of the present application.

FIG. 6 is a schematic structural diagram of an electronic device 20 provided by an embodiment of the present application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25 and a communication bus 26. The memory 22 stores a computer program, which is loaded and executed by the processor 21 to implement the relevant steps of the document detection method disclosed in any of the foregoing embodiments. The electronic device 20 in this embodiment may specifically be an electronic computer.

In this embodiment, the power supply 23 provides the working voltage for the hardware devices on the electronic device 20. The communication interface 24 creates a data transmission channel between the electronic device 20 and external devices; it may follow any communication protocol applicable to the technical solution of the present application, which is not specifically limited here. The input/output interface 25 obtains input data from, or outputs data to, the outside; its specific interface type may be selected according to the needs of the application and is not specifically limited here.

In addition, the memory 22, as the carrier for resource storage, may be a read-only memory, a random access memory, a magnetic disk, an optical disc or the like. The resources stored on it may include an operating system 221, a computer program 222 and so on, and the storage may be temporary or permanent.

The operating system 221 manages and controls the hardware devices on the electronic device 20 and the computer program 222, and may be Windows Server, Netware, Unix, Linux or the like. Besides the computer program that carries out the document detection method performed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs for other specific tasks.

Further, the present application also discloses a computer-readable storage medium for storing a computer program, where the computer program, when executed by a processor, implements the document detection method disclosed above. For the specific steps of the method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.

The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and the parts that are the same or similar across embodiments may be cross-referenced. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may implement the described functions in different ways for each particular application, but such implementations should not be regarded as going beyond the scope of the present application.

The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM or any other form of storage medium known in the art.

Finally, it should be noted that relational terms such as "first" and "second" are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise" and "include", and any variants of them, are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises the element.

The technical solution provided by the present application has been described in detail above. Specific examples have been used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (11)

CN202211598627.2A | 2022-12-09 | 2022-12-09 | A document detection method, device, equipment and storage medium | Withdrawn | CN116015777A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202211598627.2A (CN116015777A (en)) | 2022-12-09 | 2022-12-09 | A document detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202211598627.2A (CN116015777A (en)) | 2022-12-09 | 2022-12-09 | A document detection method, device, equipment and storage medium

Publications (1)

Publication Number | Publication Date
CN116015777A (en) | 2023-04-25

Family

ID=86023966

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202211598627.2A (Withdrawn, CN116015777A (en)) | A document detection method, device, equipment and storage medium | 2022-12-09 | 2022-12-09

Country Status (1)

Country | Link
CN (1) | CN116015777A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117235716A (en) * | 2023-11-14 | 2023-12-15 | 之江实验室 | An unknown threat defense method and device for OOXML document template injection attack

Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110955893A (en) * | 2019-11-22 | 2020-04-03 | 杭州安恒信息技术股份有限公司 | Malicious file threat analysis platform and malicious file threat analysis method
CN111600788A (en) * | 2020-04-30 | 2020-08-28 | 深信服科技股份有限公司 | Method and device for detecting harpoon mails, electronic equipment and storage medium
CN111737696A (en) * | 2020-06-28 | 2020-10-02 | 杭州安恒信息技术股份有限公司 | Method, system and equipment for detecting malicious file and readable storage medium
CN115396184A (en) * | 2022-08-23 | 2022-11-25 | 北京时代亿信科技股份有限公司 | Mail detection method and device and nonvolatile storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117235716A (en) * | 2023-11-14 | 2023-12-15 | 之江实验室 | An unknown threat defense method and device for OOXML document template injection attack
CN117235716B (en) * | 2023-11-14 | 2024-02-13 | 之江实验室 | An unknown threat defense method and device for OOXML document template injection attack

Similar Documents

Publication | Title
US11373065B2 (en) | Dictionary based deduplication of training set samples for machine learning based computer threat analysis
CN111835777B (en) | Abnormal flow detection method, device, equipment and medium
CN109858248B (en) | Malicious Word Document Detection Method and Device
CN114465780B (en) | Feature extraction-based phishing mail detection method and system
US10389687B2 (en) | Secure document transmission
GB2427048A (en) | Detection of unwanted code or data in electronic mail
CN104361097A (en) | Real-time detection method for electric power sensitive mail based on multimode matching
CN104168293A (en) | Method and system for recognizing suspicious phishing web page in combination with local content rule base
CN103761478A (en) | Judging method and device of malicious files
CN113472686A (en) | Information identification method, device, equipment and storage medium
CN105653949A (en) | Malicious program detection method and device
CN116738369A (en) | Traffic data classification method, device, equipment and storage medium
CN114330280A (en) | Sensitive data identification method and device
CN112989337B (en) | A method and device for detecting malicious script code
CN116896455A (en) | Network attack detection method and device, electronic equipment and storage medium
CN116015777A (en) | A document detection method, device, equipment and storage medium
CN114143074B (en) | webshell attack recognition device and method
CN114626061B (en) | Webpage Trojan horse detection method and device, electronic equipment and medium
CN113810375A (en) | Webshell detection method, apparatus, device and readable storage medium
CN113536300A (en) | PDF file trust filtering and analyzing method, device, equipment and medium
CN114039776B (en) | Method and device for generating flow detection rule, electronic equipment and storage medium
CN113722642B (en) | Webpage conversion method and device, electronic equipment and storage medium
CN116910751A (en) | Information security detection methods, devices, electronic equipment and storage media
CN116502192B (en) | Data confusion method and device and electronic equipment
CN116170243B (en) | POC (point-of-care) -based rule file generation method and device, electronic equipment and medium

Legal Events

Code | Title | Description
PB01 | Publication | Publication
SE01 | Entry into force of request for substantive examination | Entry into force of request for substantive examination
WW01 | Invention patent application withdrawn after publication | Application publication date: 2023-04-25

