CN110708296A

Info

Publication number: CN110708296A
Application number: CN201910884661.8A
Authority: CN
Inventors: 周海清; 何丹; 孙成胜; 张焱; 王伟; 康英来
Original assignee: China Electronic Technology Cyber Security Co Ltd
Current assignee: China Electronic Technology Cyber Security Co Ltd
Priority date: 2019-09-19
Filing date: 2019-09-19
Publication date: 2020-01-17
Anticipated expiration: 2039-09-19
Also published as: CN110708296B

Abstract

The invention discloses an intelligent detection model for VPN account compromise based on long-term behavior analysis, comprising the following detection process: Step 1, data reading stage: read the aggregated VPN account login data from a big data distributed storage system; Step 2, data preprocessing stage: perform data cleaning operations on the data read; Step 3, feature engineering stage: use the preprocessed data to generate the multi-dimensional features required to build the model for inferring VPN account compromise; Step 4, model training stage: train the scoring model and the common-list model; Step 5, model prediction stage: use the trained models and the VPN account data read to predict compromised VPN accounts of differing risk levels. The invention does not rely on labeled positive and negative samples in the security data, which saves substantial labor and time, and it is closely tied to the business scenario of VPN account compromise, effectively improving recall and accuracy.

Description

An intelligent detection model for VPN account compromise based on long-term behavior analysis

Technical Field

The invention relates to an intelligent detection model for VPN account compromise based on long-term behavior analysis.

Background

Since the Cybersecurity Law of the People's Republic of China took effect on June 1, 2017, enterprises' attention to their own security has extended beyond network security, which focuses on threats such as vulnerabilities, APT incidents, and compromised hosts, to IT office security, which covers employee information leakage, policy-violating operations, VPN account compromise, and the like, and one layer deeper still to business security. For an enterprise's business security, IT office security is the last barrier. A compromised account means that this barrier has been breached, which directly threatens the enterprise's business security. Early warning and alerting of enterprise account compromise is therefore particularly important.

However, an enterprise's IT office environment is filled with large amounts of data such as employee information and VPN account information, and as the enterprise grows, this data increases by orders of magnitude. Detecting already compromised VPN accounts in this mass of data, and predicting which VPN accounts are at high risk of compromise, has become one of the security problems enterprises most urgently need to solve.

Traditionally, security detection has relied on known rules. The thresholds in such rules are set manually, so recall is often low and accuracy leaves room for improvement. Security vendors have therefore begun to apply machine learning algorithms to massive data, moving from a focus on the data content itself to features such as content context and behavior analysis, and from single-point, single-record detection to multi-dimensional big data analysis, in order to detect compromised VPN accounts and predict VPN accounts at high risk of compromise. The aim is to cope with enterprises' ever-growing data volumes and to improve the recall and accuracy of compromised VPN account detection.

In real enterprise security data, however, operations are normal most of the time; abnormal operations or attacks occur only in a few time periods. Negative samples are therefore relatively scarce, and the positive and negative samples are extremely imbalanced. Moreover, having operations staff find, confirm, and label the data for abnormal operations or attacks that occurred within short time periods in massive data is very costly in labor and time, and few enterprises are willing to spend manpower and resources on it. As a result, without sufficient samples of online business data, training a model with supervised learning algorithms such as classification and regression and using it to detect compromised VPN accounts yields unsatisfactory recall and accuracy. Unsupervised learning algorithms that do not rely on labeled samples, such as clustering and time series prediction algorithms, are not very accurate on their own and usually need to be combined with classification algorithms to achieve satisfactory results.

Summary of the Invention

To overcome the above shortcomings of the prior art, the invention proposes an intelligent detection model for VPN account compromise based on long-term behavior analysis. The model extracts multi-dimensional features from the actual process of VPN account compromise, builds multiple scoring functions on them, and uses dynamic or static thresholds to assess the compromise risk of each VPN account. This approach does not rely on labeled positive and negative samples in the security data, which saves substantial labor and time, and it is closely tied to the business scenario of VPN account compromise, effectively improving recall and accuracy. In addition, over a long time window of several months, the model uses a recurrent neural network (RNN) algorithm or reinforcement learning to continuously learn the login habits of each VPN account and form a common list for each account, enabling a more accurate assessment of VPN account compromise.

The technical solution adopted by the invention to solve its technical problem is an intelligent detection model for VPN account compromise based on long-term behavior analysis, comprising the following detection process:

Step 1, data reading stage: read the aggregated VPN account login data from a big data distributed storage system;

Step 2, data preprocessing stage: perform data cleaning operations on the data read;

Step 3, feature engineering stage: use the preprocessed data to generate the multi-dimensional features required to build the model for inferring VPN account compromise;

Step 4, model training stage: train the scoring model and the common-list model;

Step 5, model prediction stage: use the trained models and the VPN account data read to predict compromised VPN accounts of differing risk levels.

Compared with the prior art, the positive effects of the invention are:

1. Minute-level alerting model based on big data

The invention provides a model based on big data analysis that achieves minute-level alerting, measured from the input of massive data aggregated from the various servers to the output of an alert. It offers a practical and effective method for aggregating behavior data such as VPN account logins from the different servers used by an enterprise's business and functional departments, comprehensively assessing the compromise risk of each VPN account, and raising alerts quickly.

2. No manual labeling of training and detection samples

The invention combines the business scenario of VPN account compromise with features such as login time, login geographic location, and statistical analysis of VPN account login behavior (for example counts, frequency, and login failure rate) to build multiple scoring functions, and uses dynamic or static thresholds to assess the compromise risk value of each VPN account. The invention does not rely on labeled positive and negative samples in the security data and omits manual labeling of training and detection samples, saving substantial labor and time.

3. Intelligent learning of the behavior habits of each VPN account over a long time window of several months

The invention supports using a recurrent neural network (RNN) algorithm or reinforcement learning, over a long time window of several months and with massive data, to continuously learn the login habits of each VPN account and form a fine-grained common list for each account, which is used to refine each account's compromise risk value and improve alert accuracy.

Brief Description of the Drawings

The invention will be described by way of example and with reference to the accompanying drawings, in which:

Figure 1 is a schematic diagram of the network attack flow for VPN account compromise;

Figure 2 shows an intelligent detection model for VPN account compromise based on long-term behavior analysis;

Figure 3 is a schematic flow for deploying the VPN account compromise detection model within an enterprise.

Detailed Description

A typical VPN account compromise attack flow is shown in Figure 1. It generally comprises a preparation period, an intrusion period, and a payoff period, subdivided into the following five stages:

(1) The two stages of the preparation period:

Stage 1, reconnaissance and scanning: The attacker researches, sniffs, and scans the target enterprise. Typical methods include using web crawlers to collect information such as meeting minutes, email addresses, and social relationships, or collecting information by special methods and means.

Stage 2, weaponization: Using automated tools, the attacker repackages a remote-access Trojan that carries an exploit and embeds it in a specific carrier, such as a data file in a format commonly used on clients, for example PDF or Office documents.

(2) The two stages of the intrusion period:

Stage 3, delivery: The prepared carrier containing the attack weapon is transmitted to the target environment. According to a report by Lockheed Martin's Computer Incident Response Team (LM-CIRT), the three most common delivery vectors used by APT attackers are email attachments, web pages, and USB drives.

Stage 4, exploitation: Once the carrier reaches the victim's host or server, the malicious code is triggered actively or passively. In most cases the attacker completes this step by exploiting a vulnerability in an application or the operating system; common techniques include SQL injection attacks and logic vulnerability attacks. Users may also execute the code themselves without realizing it.

(3) The single stage of the payoff period:

Stage 5, VPN account compromise: Following the planned course, the attacker successfully steals one or more VPN accounts of the target enterprise and uses them as a foothold to log in to the enterprise's internal network under a legitimate identity. The attacker then steals the victim enterprise's business data and, after aggregating, compressing, and encrypting it, transmits it out of the enterprise's environment, or directly damages the integrity of the enterprise's business data and the availability of its services.

As the attack flow above shows, each link in the flow is a sufficient condition for the next. If the defender detects and blocks any one step in time, the attacker must either give up or find another suitable route to continue the attack. Across the whole attack flow, only by discovering the attacker's traces early and containing or blocking them can the defender take the initiative in the contest between attack and defense. Working from the reality of an enterprise's massive data, the model selects the data records related to VPN account behavior, extracts multi-dimensional features, builds multiple scoring mechanisms, and uses the large volume of data in a long time window to learn the behavior habits of every employee's VPN account. This enables minute-level detection of compromised accounts among the VPN accounts active in the enterprise on the current day.

An intelligent detection model for VPN account compromise based on long-term behavior analysis is shown in Figure 2. The core of the model is behavior analysis. It is characterized by large data volumes, long time windows, multi-dimensional features, multiple scoring mechanisms, and intelligent detection. It spans four layers (data collection, data preprocessing and feature engineering, model training, and model prediction) and provides a VPN account compromise detection service that aggregates massive data across departments and across branches and subsidiaries over long time windows and raises alerts at minute-level latency.

(1) Data collection: aggregates data from the different servers across departments and across branches and subsidiaries of the enterprise, over a long time window spanning several months. The data mainly includes VPN account login data from the databases of the departments and subsidiaries (such as MySQL or MongoDB) and alert data reported by security devices. It is aggregated into the HDFS or ElasticSearch system of the big data analysis cluster.

(2) Data preprocessing and feature engineering: the aggregated data is cleaned by selecting by field name, selecting by time range, and removing empty and duplicate records. The data is then grouped by different fields (such as source IP or user name) for grouped statistics, and data records are aggregated on different fields (such as the record's unique identifier). For features with a time dimension, a time window size must be set; within each time window, different fields are chosen for grouping and aggregation according to the specific need (such as computing login frequency or login failure frequency). The result is a set of multi-dimensional features including login time, login location, login count, login frequency, and login failure rate. The Spark framework is used to support high-performance batch data processing.
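The cleaning and windowed aggregation described in (2) can be sketched as follows. This is a minimal single-machine illustration in plain Python, not the Spark job the text refers to. The record layout (`id`, `user`, `src_ip`, `time`, `success`) and the function names `clean` and `window_features` are assumptions made for the example; the patent does not specify field names.

```python
from collections import defaultdict
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def clean(records, fields, start, end):
    """Select fields, keep rows in [start, end), drop empty and duplicate rows."""
    seen_ids = set()
    cleaned = []
    for rec in records:
        row = {f: rec.get(f) for f in fields}          # select by field name
        if any(v is None or v == "" for v in row.values()):
            continue                                     # remove empty records
        ts = datetime.strptime(row["time"], TIME_FORMAT)
        if not (start <= ts < end):
            continue                                     # select by time range
        if row["id"] in seen_ids:
            continue                                     # remove duplicates by record id
        seen_ids.add(row["id"])
        row["time"] = ts
        cleaned.append(row)
    return cleaned


def window_features(rows, window):
    """Group by (user, window start) and aggregate count, failures, failure rate."""
    epoch = datetime(1970, 1, 1)
    buckets = defaultdict(lambda: {"logins": 0, "failures": 0})
    for row in rows:
        index = (row["time"] - epoch) // window          # which window the row falls in
        key = (row["user"], epoch + index * window)
        buckets[key]["logins"] += 1
        if not row["success"]:
            buckets[key]["failures"] += 1
    return {
        key: {**agg, "failure_rate": agg["failures"] / agg["logins"]}
        for key, agg in buckets.items()
    }


SAMPLE = [
    {"id": "r1", "user": "alice", "src_ip": "10.0.0.5", "time": "2019-09-19 08:01:00", "success": True},
    {"id": "r2", "user": "alice", "src_ip": "10.0.0.5", "time": "2019-09-19 08:10:00", "success": False},
    {"id": "r2", "user": "alice", "src_ip": "10.0.0.5", "time": "2019-09-19 08:10:00", "success": False},
    {"id": "r3", "user": "alice", "src_ip": "10.0.0.9", "time": "2019-09-19 09:05:00", "success": False},
    {"id": "r4", "user": "bob", "src_ip": "", "time": "2019-09-19 08:30:00", "success": True},
    {"id": "r5", "user": "alice", "src_ip": "10.0.0.5", "time": "2019-09-20 08:00:00", "success": True},
]

if __name__ == "__main__":
    day = datetime(2019, 9, 19)
    rows = clean(SAMPLE, ("id", "user", "src_ip", "time", "success"), day, day + timedelta(days=1))
    for key, agg in sorted(window_features(rows, timedelta(hours=1)).items()):
        print(key, agg)
```

In a Spark deployment the same steps would typically be expressed as column selection, filtering, de-duplication, and a grouped aggregation over a time window; the window size (one hour here) is a tunable parameter.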

(3) Model training: this stage trains two kinds of model, a scoring model and a common-list model. The scoring model uses the multi-dimensional features produced by data preprocessing and feature engineering. Based on the specific process of the VPN account compromise scenario and the value characteristics of the features involved, several features are selected and each is assigned its own base score (its weight). Under each feature, the feature's value range is divided into several intervals, and each interval is assigned its own coefficient. When the value of a feature under test falls within a particular interval, the product of that feature's base score and the interval's coefficient is the score for that feature dimension. Adding the scores of all feature dimensions gives the final score for the scenario being evaluated. For the VPN account compromise scenario, the selected dimensions can include the time of login failures, the location of login failures, the number of login failures, the frequency of login failures, and whether a successful login followed multiple failed logins. Following the calculation above, each dimension is divided into intervals with assigned coefficients, and the per-dimension scores are summed into the final score for the scenario. This filters the data in several passes and progressively narrows the set of VPN accounts at high risk of compromise. Based on the final score, VPN accounts can be classified into different compromise risk levels by score range, and, using a dynamic or static threshold, the accounts whose scores exceed the threshold are returned as high-risk compromised VPN accounts.
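The scoring rule described above (per-feature base score multiplied by the coefficient of the interval the value falls in, then summed over features) can be sketched as below. All feature names, base scores, interval boundaries, coefficients, and risk-level cut-offs are hypothetical values chosen for illustration; the patent does not disclose concrete numbers. The `dynamic_threshold` function shows one possible reading of a "dynamic threshold" (mean plus k standard deviations of the day's scores), which is an assumption and not taken from the source.

```python
from bisect import bisect_right
from statistics import mean, pstdev

# Hypothetical configuration: base score (weight), interval bounds, coefficients.
# N bounds define N + 1 half-open intervals, so each feature has N + 1 coefficients.
SCORING = {
    "failure_count":          {"base": 30, "bounds": [3, 10],    "coeffs": [0.0, 0.5, 1.0]},
    "failure_rate":           {"base": 20, "bounds": [0.2, 0.6], "coeffs": [0.0, 0.5, 1.0]},
    "off_hours_failures":     {"base": 20, "bounds": [1, 5],     "coeffs": [0.0, 0.6, 1.0]},
    "success_after_failures": {"base": 30, "bounds": [1],        "coeffs": [0.0, 1.0]},
}

# Hypothetical static cut-offs for risk levels, checked from highest to lowest.
RISK_LEVELS = [(80, "high"), (50, "medium"), (0, "low")]


def feature_score(name, value):
    """Base score times the coefficient of the interval that contains value."""
    cfg = SCORING[name]
    interval = bisect_right(cfg["bounds"], value)  # a value equal to a bound goes to the upper interval
    return cfg["base"] * cfg["coeffs"][interval]


def total_score(features):
    """Sum of per-feature scores over all configured feature dimensions."""
    return sum(feature_score(name, value) for name, value in features.items() if name in SCORING)


def risk_level(score):
    """Map a final score to a risk level using static cut-offs."""
    for cutoff, label in RISK_LEVELS:
        if score >= cutoff:
            return label
    return "low"


def dynamic_threshold(scores, k=2.0):
    """One possible dynamic threshold: mean plus k population standard deviations."""
    return mean(scores) + k * pstdev(scores)


if __name__ == "__main__":
    account = {"failure_count": 12, "failure_rate": 0.5,
               "off_hours_failures": 0, "success_after_failures": 1}
    score = total_score(account)
    print(score, risk_level(score))
```

Using half-open intervals keeps every value in exactly one interval, so each feature contributes exactly one coefficient to the sum.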

The other kind of model builds a common list for each employee. It can use long short-term memory (LSTM), a type of recurrent neural network (RNN), to build a neural network with memory, or use reinforcement learning, so that the model corrects itself step by step. By learning from data covering a long time range of several months, it builds common lists for every employee's VPN account in the enterprise (such as a list of common login times, a list of common login locations, and a list of common login devices). During learning, based on intermediate detection results, it automatically excludes data records judged to be abnormal behavior, so that the continuously learned behavior habits do not drift seriously. Finally, the learned feature data for each dimension of each VPN account's behavior habits is output to a designated database.
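The shape of the common lists can be illustrated with a deliberately simple stand-in. The sketch below is not the LSTM or reinforcement-learning approach named in the text; it replaces it with plain frequency counting so that the inputs and outputs are easy to see. The field names (`user`, `hour`, `location`, `device`, `flagged`), the `min_share` cut-off, and the `is_anomalous` callback are all assumptions for the example.

```python
from collections import Counter, defaultdict

DIMENSIONS = ("hour", "location", "device")  # assumed habit dimensions


def build_common_lists(history, min_share=0.2, is_anomalous=lambda rec: False):
    """Build per-account common lists from months of login history.

    A value enters an account's common list for a dimension when it appears in
    at least min_share of that account's non-anomalous logins. Records that
    is_anomalous() flags are skipped so they cannot distort the learned habits.
    """
    counts = defaultdict(lambda: defaultdict(Counter))
    totals = Counter()
    for rec in history:
        if is_anomalous(rec):
            continue
        user = rec["user"]
        totals[user] += 1
        for dim in DIMENSIONS:
            counts[user][dim][rec[dim]] += 1
    return {
        user: {
            dim: {value for value, n in counter.items() if n / totals[user] >= min_share}
            for dim, counter in dims.items()
        }
        for user, dims in counts.items()
    }


if __name__ == "__main__":
    history = (
        [{"user": "alice", "hour": 9, "location": "Chengdu", "device": "laptop"}] * 8
        + [{"user": "alice", "hour": 9, "location": "Beijing", "device": "laptop"}]
        + [{"user": "alice", "hour": 3, "location": "Unknown", "device": "unknown", "flagged": True}]
    )
    lists = build_common_lists(history, is_anomalous=lambda r: r.get("flagged", False))
    print(lists["alice"])
```

A learned sequence model would replace the counting step, but the output written to the database would plausibly have the same form: per account, one set of usual values per dimension.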

(4) Model prediction: working on the VPN account behavior data the enterprise aggregated on the current day, the model first reads the per-account behavior habit data that the common-list model wrote to the database. From each VPN account's multi-dimensional features produced by feature engineering, it filters out the values that appear in that account's common lists, which further refines the feature values. It then scores every VPN account with the multiple scoring function model and applies a whitelist strategy, which enables real-time interaction between the model and the front-end interface and also effectively improves the model's accuracy. After this filtering, the VPN accounts finally output by the multiple scoring function model are judged to be high-risk compromised VPN accounts, and alerts are raised.
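The prediction flow in (4) can be sketched as a small pipeline: drop values that match the account's common lists, score what remains, skip whitelisted accounts, and report accounts above the threshold. The function names, the data layout, and the toy scoring function are hypothetical. Treating the whitelist as a set of accounts exempted from alerting (for example, maintained by analysts through the front end) is an assumption about how the whitelist strategy works.

```python
def uncommon_values(observed, common):
    """Keep only the observed values that are absent from the account's common lists."""
    return {
        dim: [value for value in values if value not in common.get(dim, set())]
        for dim, values in observed.items()
    }


def predict(accounts, common_lists, score_fn, threshold, whitelist=frozenset()):
    """Return (account, score) pairs above threshold, highest score first."""
    alerts = []
    for user, observed in accounts.items():
        if user in whitelist:
            continue                                   # whitelisted accounts never alert
        unusual = uncommon_values(observed, common_lists.get(user, {}))
        score = score_fn(unusual)
        if score > threshold:
            alerts.append((user, score))
    return sorted(alerts, key=lambda alert: alert[1], reverse=True)


def toy_score(unusual):
    """Hypothetical scoring: 40 points per dimension that has any unusual value."""
    return 40 * sum(1 for values in unusual.values() if values)


if __name__ == "__main__":
    today = {
        "alice": {"location": ["Chengdu"], "device": ["laptop"]},
        "bob":   {"location": ["Lagos"],   "device": ["unknown-phone"]},
        "carol": {"location": ["Paris"],   "device": ["tablet"]},
    }
    habits = {
        "alice": {"location": {"Chengdu"}, "device": {"laptop"}},
        "bob":   {"location": {"Chengdu"}, "device": {"laptop"}},
    }
    print(predict(today, habits, toy_score, threshold=50, whitelist={"carol"}))
```

In practice `score_fn` would be the multiple scoring function model from the training stage, and the threshold could be static or recomputed from the day's score distribution.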

To put the prediction function of the VPN account compromise model into service, the trained model must first be deployed on the cluster. The deployment flow is shown in Figure 3. It consists of six steps: setting up a big data storage system (such as HDFS or ElasticSearch) to aggregate the data; setting up the Spark cluster; configuring the cluster environment, database environment, and third-party libraries the model requires; installing the model program; configuring the model parameters; and executing the model. The model supports scheduled tasks for automated detection as well as manual execution for on-demand detection of a specific task.

Claims

1. An intelligent detection model for VPN account compromise based on long-term behavior analysis, characterized in that it comprises the following detection process:

Step 1, data reading stage: reading the aggregated VPN account login data from a big data distributed storage system;

Step 2, data preprocessing stage: performing data cleaning operations on the data read;

Step 3, feature engineering stage: using the preprocessed data to generate the multi-dimensional features required to build the model for inferring VPN account compromise;

Step 4, model training stage: training a scoring model and a common-list model;

Step 5, model prediction stage: using the trained models and the VPN account data read to predict compromised VPN accounts of differing risk levels.

2. The intelligent detection model for VPN account compromise based on long-term behavior analysis according to claim 1, characterized in that the data cleaning operations in Step 2 comprise selection by field name, selection by time range, and removal of empty and duplicate records.

3. The intelligent detection model for VPN account compromise based on long-term behavior analysis according to claim 1, characterized in that the method of generating the multi-dimensional features in Step 3 is: performing grouped statistics by different fields, and performing aggregate statistics on the data records by different fields; and, for features with a time dimension, setting a time window size and, within each time window, selecting different fields for grouping and aggregate statistical calculation according to the specific need, thereby generating the multi-dimensional features.

4. The intelligent detection model for VPN account compromise based on long-term behavior analysis according to claim 3, characterized in that the multi-dimensional features include: login time, login location, login count, login frequency, login failure rate, and the like.

5. The intelligent detection model for VPN account compromise based on long-term behavior analysis according to claim 1, characterized in that, when training the scoring model, the multi-dimensional features generated in the feature engineering stage are used; based on the specific process of the VPN account compromise scenario and the value characteristics of the features involved, a plurality of features are selected and assigned different base scores as weights; under each feature, the feature's values are divided into a plurality of intervals and different intervals are assigned different coefficients; when the value of a feature under test falls within a particular interval, the product of that feature's base score and the coefficient is the score for that feature dimension; and the scores of the plurality of feature dimensions are added to obtain the final score for evaluating different scenarios.

6. The intelligent detection model for VPN account compromise based on long-term behavior analysis according to claim 5, characterized in that, for the VPN account compromise scenario, the selected multi-dimensional features include: the time of login failures, the location of login failures, the number of login failures, the frequency of login failures, whether there is a data record of a successful login after multiple failed logins, and the like.

7. The intelligent detection model for VPN account compromise based on long-term behavior analysis according to claim 1, characterized in that, when training the common-list model, the login habits of each VPN account are continuously learned by means of a recurrent neural network (RNN) algorithm or reinforcement learning, and a common list is formed for each VPN account, so that the risk of VPN account compromise can be judged accurately.

8. The intelligent detection model for VPN account compromise based on long-term behavior analysis according to claim 7, characterized in that the common list includes: a list of common login times, a list of common login locations, a list of common login devices, and the like.

9. The intelligent detection model for VPN account compromise based on long-term behavior analysis according to claim 1, characterized in that the method of predicting compromised VPN accounts using the trained models is: in the VPN account behavior data aggregated by the enterprise on the current day, reading the behavior habit data of each VPN account that the common-list model output to a database; further screening feature values using the multi-dimensional features of each VPN account generated in the feature engineering stage; scoring each VPN account with the scoring model; adopting a whitelist strategy to realize real-time interaction between the model and a front-end interface; and finally outputting the VPN accounts, which are the detected high-risk compromised VPN accounts.