CN107862050A - A kind of web site contents safety detecting system and method - Google Patents

A kind of web site contents safety detecting system and method

Info

Publication number
CN107862050A
Authority
CN
China
Prior art keywords
module
url
feature extraction
classifier
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711090519.3A
Other languages
Chinese (zh)
Inventor
王电钢
龚艳
母继元
毛启均
常健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
INFORMATION & TELECOMMUNICATION COMPANY SICHUAN ELECTRIC POWER Corp
Original Assignee
INFORMATION & TELECOMMUNICATION COMPANY SICHUAN ELECTRIC POWER Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by INFORMATION & TELECOMMUNICATION COMPANY SICHUAN ELECTRIC POWER Corp
Priority to CN201711090519.3A
Publication of CN107862050A
Legal status: Pending

Abstract

The invention discloses a website content security detection system and method. The system comprises: a front-end request module, which accepts the URL to be checked and submits a request to the crawler module; a crawler module, which crawls the images at the target URL; a feature extraction module, which extracts feature vectors from both the images obtained by the crawler module and the images of the sample image module; a model trainer, which generates a classifier from the feature vectors of the sample images by supervised learning; an FPGA hardware accelerator, which provides hardware acceleration for the feature extraction module; and a security arbitration module, which calculates the safety factor of the target URL from the classifier's classification results for the image features. On this basis, the invention uses sample image features as the input of the model trainer to obtain a classifier, and uses the FPGA hardware accelerator to accelerate the feature extraction algorithm and so improve system response speed, with the aim of fast, efficient and accurate website content security detection.

Description

System and method for website content security detection

Technical field

The invention relates to the technical field of network security, and in particular to a website content security detection system and method.

Background

With the development of Internet technology, web applications have brought great convenience to people's lives and greatly enriched the ways in which information spreads. However, some bad actors profit by building phishing, gambling and pornography websites, which poses serious risks to safe and healthy Internet use. The detection of malicious websites has therefore become a serious network security problem.

At present, malicious web pages are detected mainly by two methods: static feature detection and dynamic feature detection. Static feature detection analyzes a page's DNS information, WHOIS information, URL lexical features, HTML content, JavaScript code and so on; dynamic feature detection analyzes link redirection relationships, browser behavior, registry changes and so on. Classifying web pages with machine learning supplements both approaches. In addition, detecting malicious web pages with honeypot technology is a relatively mature practice.

In "Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs", Justin et al. use machine learning to identify malicious URLs based on DNS information, WHOIS information and URL lexical features. This approach has the following disadvantages: (1) some malicious URLs show no obvious malicious traits in their lexical features or WHOIS registration information and closely resemble normal URLs, so the false positive rate is high; (2) it does not analyze the page's JavaScript and HTML content, and judging the security of a URL only from DNS, WHOIS and URL information is one-sided.

In "Prophiler: A Fast Filter for the Large-Scale Detection of Malicious Web Pages", Davide builds on Justin's work by adding analysis of the page's JavaScript and HTML features, improving the accuracy of malicious website identification by examining page content. In the thesis "Design and Implementation of a Trojan Detection System Based on Data Mining and Machine Learning", Shi Yu extracts web page features and classifies pages using machine learning and a BP neural network in order to identify malicious websites. Both methods improve greatly on Justin's work, but both overlook several important issues: (1) when classifying web content, and images in particular, SVM models and BP neural networks perform poorly on complex images and are prone to large errors; (2) classifying web content with machine learning or deep learning imposes considerable overhead on the system, and neither work applies hardware acceleration, a now-popular way to improve system response speed.

Summary of the invention

The technical problem to be solved by the invention is to improve the response speed of existing website content security detection, to analyze web page content, and to reduce the false positive rate. The aim is to provide a website content security detection system and method that uses sample image features as the input of a model trainer to obtain a classifier and uses an FPGA hardware accelerator to accelerate the feature extraction algorithm, improving system response speed and achieving fast, efficient and accurate website content security detection.

The invention is realized through the following technical solution:

A website content security detection system, comprising:

Front-end request module: accepts the URL to be checked and submits a request to the crawler module;

Crawler module: crawls the images at the target URL;

Feature extraction module: extracts feature vectors from both the images obtained by the crawler module and the images of the sample image module;

Model trainer: generates a classifier from the feature vectors of the sample images by supervised learning;

FPGA hardware accelerator: provides hardware acceleration for the feature extraction module;

Security arbitration module: calculates the safety factor of the target URL from the classifier's classification results for the image features;

Data storage module: stores the images crawled by the crawler module and the detection results for the target URL;

Responder: returns the safety factor of the target URL to the front-end request module.

This solution uses machine learning to check website content for security. The feature extraction module extracts image features, the model trainer trains a classifier from the extracted sample image features, and the classifier classifies images according to their features. Because the judgment rests on image classification, a malicious URL that shows no obvious malicious traits in its lexical features or WHOIS registration information is not mistaken for a normal URL. The method has small error and a low false positive rate, and it uses an FPGA hardware accelerator to accelerate the feature extraction algorithm to improve system response speed, achieving fast, efficient and accurate website content security detection.

Preferably, the FPGA hardware accelerator is implemented using the Xilinx Reconfigurable Acceleration Stack, combined with the Caffe machine learning framework and the Xilinx deep neural network (DNN) library.

Preferably, the Caffe machine learning framework is an integrated framework for CNN (convolutional neural network) deep learning. When the prior art classifies complex images with an SVM model or a BP neural network, large errors arise easily. In this solution, CNN deep learning is applied to the crawled text and image content to extract image feature vectors, and the sample image features are used as the input of the model trainer to obtain the classifier. When analyzing complex images this is less prone to error than SVM or BP neural network classification algorithms, so website screening results are more accurate. The feature extraction module uses the Xilinx Reconfigurable Acceleration Stack FPGA hardware accelerator to accelerate the core algorithm, which greatly improves the response speed of the system.

Preferably, the security arbitration module calculates the safety factor of the target website according to whether the number of images marked as unsafe exceeds a set threshold.

Preferably, the sample image module includes normal images and abnormal images, where abnormal images are images with gambling or pornographic characteristics.

A website content security detection method, comprising the following steps:

S1: the feature extraction module extracts the image information of the sample image module into feature vectors;

S2: taking the sample feature vectors obtained in S1 as input, the model trainer generates a classifier by supervised learning;

S3: the URL to be checked is entered in the front-end request module, its validity is checked, and it is submitted to the crawler module;

S4: the crawler module receives the URL sent by the front-end request module, crawls the images at the target URL, and stores the crawled content in the data storage module;

S5: the feature extraction module extracts the feature vectors of the images crawled in S4;

S6: taking the image feature vectors extracted in S5 as input, the classifier classifies the crawled images;

S7: the security arbitration module calculates the safety factor of the target URL from the classification results of S6, and stores the target URL, the local path of the saved images of the target website, the detection time and the safety factor;

S8: the response module sends the detection result for the target URL to the front-end request module.
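The online part of the method (S3 to S8) can be sketched as a chain of stages. The following Python outline is only an illustration: the names `detect` and `DetectionResult` and the idea of passing each stage in as a callable are assumptions made here, not something the patent specifies, and the training steps S1 and S2 are assumed to have produced the `classify` stage beforehand.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class DetectionResult:
    """Hypothetical record holding the fields named in step S7."""
    url: str
    image_paths: list
    detected_at: str
    safety_factor: float


def detect(url, is_valid_url, crawl_images, extract_features, classify, arbitrate):
    """Run steps S3-S8 for one URL using caller-supplied stage functions."""
    if not is_valid_url(url):                               # S3: validity check
        raise ValueError(f"rejected URL: {url!r}")
    image_paths = list(crawl_images(url))                   # S4: crawl and store images
    features = [extract_features(p) for p in image_paths]   # S5: feature vectors
    unsafe_flags = [classify(f) for f in features]          # S6: True means unsafe
    factor = arbitrate(unsafe_flags)                        # S7: safety factor
    detected_at = datetime.now(timezone.utc).isoformat()
    return DetectionResult(url, image_paths, detected_at, factor)  # S8: returned to caller
```

Keeping the stages as parameters lets each module (crawler, feature extractor, classifier, arbitration) be replaced or accelerated independently, which matches the modular layout of the system.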

Preferably, the feature extraction module uses the FPGA accelerator to accelerate the image feature extraction algorithm.

Preferably, the FPGA hardware accelerator is implemented using the Xilinx Reconfigurable Acceleration Stack, combined with the Caffe machine learning framework and the Xilinx deep neural network (DNN) library; the Caffe machine learning framework is an integrated framework for CNN (convolutional neural network) deep learning.

Compared with the prior art, the invention has the following advantages and beneficial effects:

1. The invention uses sample image features as the input of the model trainer to obtain a classifier, checks website content for security by machine learning, and uses an FPGA accelerator to accelerate the image feature extraction algorithm, achieving fast, efficient and accurate website content security detection.

2. The invention applies CNN deep learning to the crawled text and image content to extract image features. When analyzing complex images it is less prone to large errors than SVM or BP neural network classification algorithms, and the extraction quality is better.

3. The feature extraction module of the invention uses the Xilinx Reconfigurable Acceleration Stack FPGA hardware accelerator to accelerate the core algorithm, which greatly improves the response speed of the system.

Description of drawings

The drawings described here provide a further understanding of the embodiments of the invention and form part of this application; they do not limit the embodiments of the invention. In the drawings:

Fig. 1 is a schematic structural diagram of the invention;

Fig. 2 is a schematic diagram of the Xilinx Reconfigurable Acceleration Stack.

Detailed description

To make the purpose, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the embodiments and drawings. The illustrative embodiments and their descriptions serve only to explain the invention and do not limit it.

Embodiment 1:

As shown in Figs. 1-2, the invention includes a website content security detection system, comprising:

Front-end request module: accepts the URL to be checked and submits a request to the crawler module;

Crawler module: crawls the images at the target URL;

Feature extraction module: extracts feature vectors from both the images obtained by the crawler module and the images of the sample image module;

Model trainer: generates a classifier from the feature vectors of the sample images by supervised learning;

FPGA hardware accelerator: provides hardware acceleration for the feature extraction module;

Security arbitration module: calculates the safety factor of the target URL from the classifier's classification results for the image features;

Data storage module: stores the images crawled by the crawler module and the detection results for the target URL;

Responder: returns the safety factor of the target URL to the front-end request module.

Existing malicious website detection systems have a high false positive rate for pages whose URLs show no obvious malicious traits in lexical features or WHOIS registration information and closely resemble normal URLs. They also lack analysis of the page's JavaScript and HTML content; judging the security of a URL only from DNS, WHOIS and URL information is very one-sided. Their classification of web content, especially of complex images, is prone to large errors that affect the final judgment. And when machine learning or deep learning is used to classify web content, the system responds slowly, which hurts efficiency.

This solution uses machine learning to check website content for security. The feature extraction module extracts image features, the model trainer trains a classifier from the extracted sample image features, and the classifier classifies images according to their features. Because the judgment rests on image classification, a malicious URL that shows no obvious malicious traits in its lexical features or WHOIS registration information is not mistaken for a normal URL. The method has small error and a low false positive rate, and it uses an FPGA hardware accelerator to accelerate the feature extraction algorithm to improve system response speed, achieving fast, efficient and accurate website content security detection.

Embodiment 2:

On the basis of Embodiment 1, this embodiment is preferably as follows: the FPGA hardware accelerator is implemented using the Xilinx Reconfigurable Acceleration Stack, combined with the Caffe machine learning framework and the Xilinx deep neural network (DNN) library.

The Caffe machine learning framework is an integrated framework for CNN (convolutional neural network) deep learning. When the prior art classifies complex images with an SVM model or a BP neural network, large errors arise easily. In this solution, CNN deep learning is applied to the crawled text and image content to extract image feature vectors, and the sample image features are used as the input of the model trainer to obtain the classifier. When analyzing complex images this is less prone to error than SVM or BP neural network classification algorithms, so website screening results are more accurate. The feature extraction module uses the Xilinx Reconfigurable Acceleration Stack FPGA hardware accelerator to accelerate the core algorithm, which greatly improves the response speed of the system.

The security arbitration module calculates the safety factor of the target website according to whether the number of images marked as unsafe exceeds a set threshold.
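The text states only that the safety factor depends on whether the count of unsafe images exceeds a threshold; it gives no formula. The sketch below is therefore one possible reading, with assumptions labeled: the function name `arbitrate`, the `max_unsafe` parameter, and the choice to report the fraction of safe images as the safety factor are all illustrative.

```python
def arbitrate(unsafe_flags, max_unsafe=0):
    """Summarize per-image flags (True = unsafe) into a site-level verdict.

    Assumption: the site counts as safe while the number of unsafe images
    does not exceed max_unsafe, and the safety factor is the share of
    images that were not flagged. The patent does not fix either choice.
    """
    flags = list(unsafe_flags)
    unsafe_count = sum(1 for flag in flags if flag)
    if not flags:
        # No images were crawled, so nothing was flagged.
        return {"safe": True, "unsafe_count": 0, "safety_factor": 1.0}
    return {
        "safe": unsafe_count <= max_unsafe,
        "unsafe_count": unsafe_count,
        "safety_factor": 1.0 - unsafe_count / len(flags),
    }
```

For example, one flagged image out of four with `max_unsafe=1` yields a safety factor of 0.75 and a site that still counts as safe under this reading.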

The sample image module includes normal images and abnormal images, where abnormal images are images with characteristics such as gambling or pornography. The classifier generated from the sample image module judges with high accuracy whether the images at a URL are abnormal.

Embodiment 3:

A website content security detection method, comprising the following steps:

S1: the feature extraction module extracts the image information of the sample image module into feature vectors;

S2: taking the sample feature vectors obtained in S1 as input, the model trainer generates a classifier by supervised learning;

S3: the URL to be checked is entered in the front-end request module, its validity is checked, and it is submitted to the crawler module;

S4: the crawler module receives the URL sent by the front-end request module, crawls the images at the target URL, and stores the crawled content in the data storage module;

S5: the feature extraction module extracts the feature vectors of the images crawled in S4;

S6: taking the image feature vectors extracted in S5 as input, the classifier classifies the crawled images;

S7: the security arbitration module calculates the safety factor of the target URL from the classification results of S6, and stores the target URL, the local path of the saved images of the target website, the detection time and the safety factor;

S8: the response module sends the detection result for the target URL to the front-end request module.

The feature extraction module uses the FPGA accelerator to accelerate the image feature extraction algorithm.

The FPGA hardware accelerator is implemented using the Xilinx Reconfigurable Acceleration Stack, combined with the Caffe machine learning framework and the Xilinx deep neural network (DNN) library; the Caffe machine learning framework is an integrated framework for CNN (convolutional neural network) deep learning.

In the first step, the convert_imageset tool of the Caffe framework converts the training-set sample images into a .leveldb database that Caffe can process. The -resize_width and -resize_height options are passed when calling it so that all training-set sample images have the same size; this method resizes the images to a resolution of 256*256. All training-set sample images are labeled in advance.
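As a rough illustration of this step, the Python helper below only assembles the argument list for Caffe's `convert_imageset` tool and the label list file it reads; it does not run anything. The directory and file names are hypothetical, and `--backend=leveldb` is added here on the assumption that the tool's default backend (LMDB) would otherwise be used.

```python
def list_file_lines(samples):
    """Format (relative_path, integer_label) pairs the way convert_imageset
    expects them in its list file: one "path label" pair per line."""
    return [f"{path} {label}" for path, label in samples]


def convert_imageset_command(image_root, list_file, db_path, size=256,
                             tool="convert_imageset"):
    """Build the command that resizes every image to size x size and
    writes a LevelDB database at db_path."""
    return [
        tool,
        f"--resize_width={size}",
        f"--resize_height={size}",
        "--backend=leveldb",             # assumption: override the LMDB default
        image_root.rstrip("/") + "/",    # root folder, with trailing slash
        list_file,
        db_path,
    ]
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where the Caffe tools are installed and on the `PATH`.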

In the second step, the extract_features tool of the Caffe framework extracts the sample image features, in the form of feature vectors, from the .leveldb database generated above, and the deep neural network (DNN) library of the Xilinx Reconfigurable Acceleration Stack is called to hardware-accelerate this process and so speed up the module.

In the third step, the model trainer is started. After the name.prototxt and name_solver.prototxt files are defined, the train command of the Caffe framework and its --solver parameter are used to train a model by supervised learning on the feature vectors obtained in step two. The process uses fine-tuning to correct the model repeatedly, and finally produces a classifier that has as many output classes as there are labels and can classify sensitive (gambling, pornographic, etc.) images.
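A minimal sketch of what the two training inputs might look like follows. The solver fields used (`net`, `base_lr`, `lr_policy`, `max_iter`, `snapshot`, `snapshot_prefix`, `solver_mode`) are standard Caffe solver settings, but every value here is a placeholder assumption, as are the helper names; the optional `--weights` argument reflects the usual way Caffe fine-tunes from an existing model, which the patent mentions without giving details.

```python
def render_solver(net="name.prototxt", base_lr=0.001, max_iter=10000,
                  snapshot_prefix="snapshots/name"):
    """Return the text of a very small name_solver.prototxt.
    All numeric values are illustrative, not taken from the patent."""
    lines = [
        f'net: "{net}"',
        f"base_lr: {base_lr}",
        'lr_policy: "fixed"',
        f"max_iter: {max_iter}",
        f"snapshot: {max_iter}",
        f'snapshot_prefix: "{snapshot_prefix}"',
        "solver_mode: CPU",
    ]
    return "\n".join(lines) + "\n"


def train_command(solver="name_solver.prototxt", weights=None, tool="caffe"):
    """Build the `caffe train` invocation; pass weights to fine-tune
    from a pretrained .caffemodel file."""
    cmd = [tool, "train", f"--solver={solver}"]
    if weights:
        cmd.append(f"--weights={weights}")
    return cmd
```

A real solver file would normally also tune momentum, weight decay and a learning-rate schedule; they are left out to keep the sketch short.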

In the fourth step, the front-end interface is written in HTML, CSS and JavaScript. The target URL to be checked is entered in the front-end input box and its validity is checked, for example whether the input could trigger security vulnerabilities such as XSS or SQL injection. If the entered URL is valid, it is sent to the crawler module with the jQuery library's ajax post() method.
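The patent performs this check in the browser and does not list the rules. As a hedged illustration, the same kind of check can be repeated on the server side in Python; the rule set below (http/https only, a host must be present, no quotes, angle brackets, backslashes or whitespace, a length cap) is an assumption chosen for the example, not the patent's definition of a valid URL.

```python
from urllib.parse import urlsplit

# Characters commonly involved in script or SQL injection attempts;
# the exact set is an illustrative assumption.
FORBIDDEN_CHARS = set("<>\"'`\\ \t\r\n")


def is_valid_url(url, max_length=2048):
    """Return True only for plain http(s) URLs with a host and no
    characters from FORBIDDEN_CHARS."""
    if not url or len(url) > max_length:
        return False
    if any(ch in FORBIDDEN_CHARS for ch in url):
        return False
    try:
        parts = urlsplit(url)
    except ValueError:          # e.g. malformed IPv6 literal
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)
```

Rejecting input outright, rather than trying to sanitize it, keeps the rule simple; legitimate URLs containing these characters would need to be percent-encoded first.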

In the fifth step, the crawler module receives the URL detection request from the front-end request module, crawls the images at the target URL with the Python Scrapy framework, and saves the crawled images to local file storage.
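The core of this step is finding the image links on a fetched page. To stay self-contained, the sketch below swaps Scrapy for Python's standard-library `HTMLParser`; it handles only `<img src=...>` tags and does not download anything. The class and function names are illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collect absolute URLs of <img> tags, in page order, without duplicates."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            absolute = urljoin(self.base_url, src)  # resolve relative paths
            if absolute not in self.image_urls:
                self.image_urls.append(absolute)


def extract_image_urls(html, base_url):
    """Return the image URLs referenced by an HTML document."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    return parser.image_urls
```

A Scrapy spider would perform the same extraction with selectors and could hand the URLs to an image download pipeline; images loaded by CSS or JavaScript are outside the scope of this sketch.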

The sixth step is similar to step one: the images crawled in step five are resized and converted into a .leveldb database that Caffe can process. With the images crawled in step five as the test set, the feature extraction module extracts the feature vectors of the crawled images, and the classifier generated in step three classifies the crawled images according to these feature vectors, marking sensitive images as unsafe.
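How classifier output becomes an unsafe mark is not spelled out. One plausible reading, sketched below, takes per-class scores for each image, picks the highest-scoring label, and flags the image if that label belongs to a sensitive set. The function name, the score layout and the label names are assumptions for illustration.

```python
def mark_unsafe(scores_by_image, labels, sensitive_labels):
    """Map each image path to True (unsafe) or False (safe).

    scores_by_image: {path: [score per label]}, in the same order as labels.
    An image is unsafe when its top-scoring label is in sensitive_labels.
    """
    flagged = {}
    for path, scores in scores_by_image.items():
        if len(scores) != len(labels):
            raise ValueError(f"{path}: expected {len(labels)} scores")
        best_index = max(range(len(scores)), key=scores.__getitem__)
        flagged[path] = labels[best_index] in sensitive_labels
    return flagged
```

The values of the returned dictionary are exactly the per-image flags that a threshold-based arbitration step would count.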

In the seventh step, the security arbitration module calculates the safety factor of the target website according to whether the number of images marked as unsafe exceeds a set threshold, and stores a record in the data storage module with fields including the target URL, the local path of the saved images of the target website, the detection time and the safety factor.
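The four stored fields map naturally onto one database row. The patent does not name a storage engine, so the sketch below uses SQLite from the Python standard library as an assumption; the table and column names are likewise hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS detection_result (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    target_url    TEXT NOT NULL,
    image_dir     TEXT NOT NULL,
    detected_at   TEXT NOT NULL,
    safety_factor REAL NOT NULL
)
"""


def save_result(conn, target_url, image_dir, safety_factor, detected_at=None):
    """Insert one detection record and return its row id."""
    if detected_at is None:
        detected_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(SCHEMA)
        cursor = conn.execute(
            "INSERT INTO detection_result "
            "(target_url, image_dir, detected_at, safety_factor) "
            "VALUES (?, ?, ?, ?)",
            (target_url, image_dir, detected_at, safety_factor),
        )
    return cursor.lastrowid
```

Using parameter placeholders (`?`) rather than string formatting keeps the stored URL from being interpreted as SQL, which fits the injection concern raised for the front-end input.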

In the eighth step, the responder sends the security detection data for the target URL to the front-end request module.

The method first fetches the images of the website to be checked, classifies them with the classifier, calculates the safety factor of the website, and returns it to the front-end request module for display. This solution uses machine learning to check website content for security: the feature extraction module extracts image features, the model trainer trains a classifier from the extracted sample image features, and the classifier classifies images according to their features. The resulting judgment has small error and a low false positive rate, and the FPGA hardware accelerator accelerates the feature extraction algorithm to improve system response speed, achieving fast, efficient and accurate website content security detection.

The specific embodiments described above explain the purpose, technical solutions and beneficial effects of the invention in further detail. It should be understood that they are only specific embodiments of the invention and do not limit its scope of protection; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (8)

Translated fromChinese
1.一种网站内容安全检测系统,其特征在于,包括1. A website content security detection system, characterized in that, comprising前端请求模块:输入待检测的URL网址,提交请求到爬虫模块;Front-end request module: input the URL address to be detected, and submit the request to the crawler module;爬虫模块:爬取目标URL网址的图片信息;Crawler module: Crawl the image information of the target URL;特征提取模块:将爬虫模块的图片信息和样本图片模块的图片信息均提取为特征向量;Feature extraction module: extract the image information of the crawler module and the image information of the sample image module as feature vectors;模型训练器:将样本图片的特征向量通过监督学习的方式生成分类器;Model trainer: the feature vector of the sample picture is generated into a classifier through supervised learning;FPGA硬件加速器:对特征提取模块提供硬件加速功能;FPGA hardware accelerator: provide hardware acceleration function for the feature extraction module;安全仲裁模块:根据分类器对图片特征的分类结果,计算目标URL网址的安全系数;Safety arbitration module: calculate the safety factor of the target URL according to the classification results of the image features by the classifier;数据存储模块:存储爬虫模块爬取的图片信息,存储对目标URL的检测结果信息;Data storage module: store the image information crawled by the crawler module, and store the detection result information of the target URL;响应器:向前端请求模块返回目标URL的安全系数。Responder: Returns the safety factor of the target URL to the front-end request module.2.根据权利要求1所述的一种网站内容安全检测系统,其特征在于,FPGA硬件加速器使用Xilinx可重配置加速堆栈,结合Caffe机器学习框架和Xilinx深度神经网络DNN库予以实现。2. A kind of website content security detection system according to claim 1, is characterized in that, FPGA hardware accelerator uses Xilinx reconfigurable acceleration stack, realizes in conjunction with Caffe machine learning framework and Xilinx deep neural network DNN library.3.根据权利要求2所述的一种网站内容安全检测系统,其特征在于,Caffe机器学习框架为一个CNN卷积神经网络深度学习的集成框架。3. A kind of website content security detection system according to claim 2, is characterized in that, Caffe machine learning framework is the integration framework of a CNN convolutional neural network deep learning.4.根据权利要求1所述的一种网站内容安全检测系统,其特征在于,安全仲裁模块通过被标记非安全的图片数目是否超过设定阈值,来计算得到目标网站安全系数。4. 
A website content safety detection system according to claim 1, wherein the safety arbitration module calculates the safety factor of the target website according to whether the number of pictures marked as unsafe exceeds a set threshold.

5. The website content security detection system according to claim 1, wherein the sample picture module includes normal pictures and abnormal pictures, the abnormal pictures being pictures with gambling and pornographic characteristics.

6. A website content security detection method, characterized by comprising the following steps:
S1: the feature extraction module extracts the picture information of the sample picture module into feature vectors;
S2: taking the sample feature vectors obtained in S1 as input, the model trainer generates a classifier by supervised learning;
S3: the URL to be detected is entered in the front-end request module, which checks the validity of the URL and submits it to the crawler module;
S4: the crawler module receives the URL sent by the front-end request module, crawls the picture information of the target URL, and stores the crawled content in the data storage module;
S5: the feature extraction module extracts the feature vectors of the pictures crawled in S4;
S6: taking the image feature vectors extracted in S5 as input, the classifier classifies the crawled images;
S7: the safety arbitration module calculates the safety factor of the target website according to the classification result of S6, and stores the target URL, the local path of the saved pictures of the target website, the detection time and the safety factor;
S8: the response module sends the detection result of the target website to the front-end request module.

7. The website content security detection method according to claim 6, wherein the feature extraction module uses an FPGA hardware accelerator to accelerate the picture feature extraction algorithm.

8. The website content security detection method according to claim 7, wherein the FPGA hardware accelerator is implemented using the Xilinx Reconfigurable Acceleration Stack in combination with the Caffe machine learning framework and the Xilinx deep neural network (DNN) library, the Caffe machine learning framework being an integrated framework for deep learning with convolutional neural networks (CNN).

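Steps S3 to S8 of the claimed method can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the applicant's implementation: the names `is_valid_url`, `arbitrate`, `detect` and `DetectionRecord` are hypothetical, the crawler, feature extractor and classifier are passed in as plain callables (no FPGA or Caffe code is shown), and the safety factor formula (the share of pictures not marked unsafe) is an assumption, since the claims only state that the factor depends on comparing the unsafe-picture count with a set threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Sequence, Tuple
from urllib.parse import urlparse

UNSAFE = 1  # assumed label the classifier assigns to abnormal pictures


def is_valid_url(url: str) -> bool:
    """S3: assumed validity rule, an absolute http(s) URL with a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


@dataclass
class DetectionRecord:
    """S7: the stored items (URL, picture paths, time, factor) plus a verdict."""
    url: str
    image_paths: List[str]
    detected_at: datetime
    safety_factor: float
    is_safe: bool


def arbitrate(labels: Sequence[int], max_unsafe: int) -> Tuple[float, bool]:
    """S7: compare the number of unsafe pictures with a set threshold.

    The factor formula below is an assumption made for illustration.
    """
    unsafe = sum(1 for label in labels if label == UNSAFE)
    factor = 1.0 if not labels else 1.0 - unsafe / len(labels)
    return factor, unsafe <= max_unsafe


def detect(
    url: str,
    crawl: Callable[[str], List[str]],
    extract: Callable[[str], List[float]],
    classify: Callable[[List[float]], int],
    max_unsafe: int = 0,
) -> DetectionRecord:
    """Run S3 to S8 for one URL and return the record sent back to the front end."""
    if not is_valid_url(url):                      # S3: reject malformed input
        raise ValueError(f"invalid URL: {url!r}")
    paths = crawl(url)                             # S4: locally saved picture paths
    vectors = [extract(path) for path in paths]    # S5: one feature vector per picture
    labels = [classify(vec) for vec in vectors]    # S6: per-picture classification
    factor, is_safe = arbitrate(labels, max_unsafe)  # S7: safety factor and verdict
    return DetectionRecord(url, list(paths), datetime.now(timezone.utc), factor, is_safe)
```

Passing the crawler, extractor and classifier as callables keeps the arbitration logic independent of how features are computed, so an accelerated extractor could be swapped in without changing the rest of the flow.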
CN201711090519.3A | 2017-11-08 | 2017-11-08 | A kind of web site contents safety detecting system and method | Pending | CN107862050A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201711090519.3A, CN107862050A (en) | 2017-11-08 | 2017-11-08 | A kind of web site contents safety detecting system and method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201711090519.3A, CN107862050A (en) | 2017-11-08 | 2017-11-08 | A kind of web site contents safety detecting system and method

Publications (1)

Publication Number | Publication Date
CN107862050A (en) | 2018-03-30

Family

ID=61701187

Family Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201711090519.3A, Pending, CN107862050A (en) | 2017-11-08 | 2017-11-08 | A kind of web site contents safety detecting system and method

Country Status (1)

Country | Link
CN (1) | CN107862050A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110275958A (en) * | 2019-06-26 | 2019-09-24 | 北京市博汇科技股份有限公司 | Site information recognition methods, device and electronic equipment
CN110633226A (en) * | 2018-06-22 | 2019-12-31 | 武汉海康存储技术有限公司 | Fusion memory, storage system and deep learning calculation method
CN111091019A (en) * | 2019-12-23 | 2020-05-01 | 支付宝(杭州)信息技术有限公司 | A kind of information prompting method, device and equipment
CN111401115A (en) * | 2019-08-01 | 2020-07-10 | 江苏农林职业技术学院 | Strawberry disease and pest hyperspectral data processing method and device based on FPGA
CN111475699A (en) * | 2020-03-07 | 2020-07-31 | 咪咕文化科技有限公司 | Website data crawling method and device, electronic equipment and readable storage medium
CN111626309A (en) * | 2020-05-26 | 2020-09-04 | 北京墨云科技有限公司 | Website fingerprint identification method based on deep learning
CN111651658A (en) * | 2020-06-05 | 2020-09-11 | 杭州安恒信息技术股份有限公司 | Method and computer equipment for automatically identifying website based on deep learning
CN112731305A (en) * | 2020-12-17 | 2021-04-30 | 国网四川省电力公司信息通信公司 | Direct wave suppression method and system based on adaptive Doppler domain beam cancellation
CN113177409A (en) * | 2021-05-06 | 2021-07-27 | 上海慧洲信息技术有限公司 | Intelligent sensitive word recognition system
CN113657453A (en) * | 2021-07-22 | 2021-11-16 | 珠海高凌信息科技股份有限公司 | Harmful website detection method based on generation of countermeasure network and deep learning
CN113728334A (en) * | 2019-04-24 | 2021-11-30 | 瑞典爱立信有限公司 | Method for protecting pattern classification nodes from malicious requests, and related network and nodes
CN114118398A (en) * | 2020-08-31 | 2022-03-01 | 中移(苏州)软件技术有限公司 | Detection method, system, electronic device and storage medium for target type website
CN115186263A (en) * | 2022-07-15 | 2022-10-14 | 深圳安巽科技有限公司 | Method, system and storage medium for preventing illegal induced activities
US11609989B2 (en) | 2019-03-26 | 2023-03-21 | Proofpoint, Inc. | Uniform resource locator classifier and visual comparison platform for malicious site detection
CN118646569A (en) * | 2024-06-04 | 2024-09-13 | 湖北华博网智信息技术有限公司 | A network security early warning method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101968813A (en) * | 2010-10-25 | 2011-02-09 | 华北电力大学 | Method for detecting counterfeit webpage
US20140270350A1 (en) * | 2013-03-14 | 2014-09-18 | Xerox Corporation | Data driven localization using task-dependent representations
CN106776946A (en) * | 2016-12-02 | 2017-05-31 | 重庆大学 | A kind of detection method of fraudulent website

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110633226A (en) * | 2018-06-22 | 2019-12-31 | 武汉海康存储技术有限公司 | Fusion memory, storage system and deep learning calculation method
US11609989B2 (en) | 2019-03-26 | 2023-03-21 | Proofpoint, Inc. | Uniform resource locator classifier and visual comparison platform for malicious site detection
US12166796B2 (en) | 2019-03-26 | 2024-12-10 | Proofpoint, Inc. | Uniform resource locator classifier and visual comparison platform for malicious site detection
EP3716575B1 (en) * | 2019-03-26 | 2024-07-17 | Proofpoint, Inc. | Visual comparison platform for malicious site detection
US11924246B2 (en) | 2019-03-26 | 2024-03-05 | Proofpoint, Inc. | Uniform resource locator classifier and visual comparison platform for malicious site detection preliminary
US11799905B2 (en) | 2019-03-26 | 2023-10-24 | Proofpoint, Inc. | Uniform resource locator classifier and visual comparison platform for malicious site detection
US12355815B2 (en) | 2019-03-26 | 2025-07-08 | Proofpoint, Inc. | Uniform resource locator classifier and visual comparison platform for malicious site detection
CN113728334A (en) * | 2019-04-24 | 2021-11-30 | 瑞典爱立信有限公司 | Method for protecting pattern classification nodes from malicious requests, and related network and nodes
CN113728334B (en) * | 2019-04-24 | 2025-07-25 | 瑞典爱立信有限公司 | Method for protecting pattern classification nodes from malicious requests, and related network and nodes
CN110275958B (en) * | 2019-06-26 | 2021-07-27 | 北京市博汇科技股份有限公司 | Website information identification method and device and electronic equipment
CN110275958A (en) * | 2019-06-26 | 2019-09-24 | 北京市博汇科技股份有限公司 | Site information recognition methods, device and electronic equipment
CN111401115B (en) * | 2019-08-01 | 2024-02-27 | 江苏农林职业技术学院 | Method and device for processing strawberry disease and pest hyperspectral data based on FPGA
CN111401115A (en) * | 2019-08-01 | 2020-07-10 | 江苏农林职业技术学院 | Strawberry disease and pest hyperspectral data processing method and device based on FPGA
CN111091019A (en) * | 2019-12-23 | 2020-05-01 | 支付宝(杭州)信息技术有限公司 | A kind of information prompting method, device and equipment
CN111475699A (en) * | 2020-03-07 | 2020-07-31 | 咪咕文化科技有限公司 | Website data crawling method and device, electronic equipment and readable storage medium
CN111475699B (en) * | 2020-03-07 | 2023-09-08 | 咪咕文化科技有限公司 | Website data crawling method and device, electronic equipment and readable storage medium
CN111626309A (en) * | 2020-05-26 | 2020-09-04 | 北京墨云科技有限公司 | Website fingerprint identification method based on deep learning
CN111651658A (en) * | 2020-06-05 | 2020-09-11 | 杭州安恒信息技术股份有限公司 | Method and computer equipment for automatically identifying website based on deep learning
CN114118398A (en) * | 2020-08-31 | 2022-03-01 | 中移(苏州)软件技术有限公司 | Detection method, system, electronic device and storage medium for target type website
CN112731305A (en) * | 2020-12-17 | 2021-04-30 | 国网四川省电力公司信息通信公司 | Direct wave suppression method and system based on adaptive Doppler domain beam cancellation
CN112731305B (en) * | 2020-12-17 | 2024-05-03 | 国网四川省电力公司信息通信公司 | Direct wave inhibition method and system based on self-adaptive Doppler domain beam cancellation
CN113177409B (en) * | 2021-05-06 | 2024-05-31 | 上海慧洲信息技术有限公司 | Intelligent sensitive word recognition system
CN113177409A (en) * | 2021-05-06 | 2021-07-27 | 上海慧洲信息技术有限公司 | Intelligent sensitive word recognition system
CN113657453B (en) * | 2021-07-22 | 2023-08-01 | 珠海高凌信息科技股份有限公司 | Detection method based on harmful website generating countermeasure network and deep learning
CN113657453A (en) * | 2021-07-22 | 2021-11-16 | 珠海高凌信息科技股份有限公司 | Harmful website detection method based on generation of countermeasure network and deep learning
CN115186263A (en) * | 2022-07-15 | 2022-10-14 | 深圳安巽科技有限公司 | Method, system and storage medium for preventing illegal induced activities
CN118646569A (en) * | 2024-06-04 | 2024-09-13 | 湖北华博网智信息技术有限公司 | A network security early warning method and system
CN118646569B (en) * | 2024-06-04 | 2024-11-26 | 湖北华博网智信息技术有限公司 | A network security early warning method and system

Similar Documents

Publication | Title
CN107862050A (en) | A kind of web site contents safety detecting system and method
US9935967B2 (en) | Method and device for detecting malicious URL
CN108737423B (en) | Phishing website discovery method and system based on webpage key content similarity analysis
CN103544436B (en) | System and method for distinguishing phishing websites
EP2877956B1 (en) | System and method to provide automatic classification of phishing sites
CN103530367B (en) | A kind of fishing website identification system and method
CN104168293B (en) | The method and system of suspicious fishing webpage are recognized with reference to local content rule base
CN102624713B (en) | The method of website tamper Detection and device
CN102004764A (en) | Internet bad information detection method and system
CN102591965B (en) | Method and device for black chain detection
CN105184159A (en) | Web page falsification identification method and apparatus
CN111835777B (en) | Abnormal flow detection method, device, equipment and medium
CN103226688B (en) | The authentication method of the anti-tamper and anti-counterfeiting of a kind of Quick Response Code
CN101901221A (en) | Method and device for detecting cross site scripting
CN103593615B (en) | The detection method of a kind of webpage tamper and device
CN105975523A (en) | Hidden hyperlink detection method based on stack
CN104133870B (en) | A kind of webpage similarity calculating method and device
CN107911360A (en) | One kind is hacked website detection method and system
CN112199569A (en) | A method, system, computer equipment and storage medium for identifying prohibited website
CN104036189A (en) | Page distortion detecting method and black link database generating method
CN104036190A (en) | Method and device for detecting page tampering
CN106060038B (en) | Detection method for phishing site based on client-side program behavioural analysis
CN106446123A (en) | Webpage verification code element identification method
CN120200832A (en) | Anti-phishing web page detection system and method
CN104077353B (en) | A kind of method and device of detecting black chain

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication

Application publication date: 2018-03-30

