CN118585904A


Info

Publication number: CN118585904A
Application number: CN202410642032.5A
Authority: CN
Inventors: 黄自力; 杨阳; 陈舟; 熊璐; 秦璐; 张叶
Original assignee: China Unionpay Co Ltd
Current assignee: China Unionpay Co Ltd
Priority date: 2024-05-22
Filing date: 2024-05-22
Publication date: 2024-09-03

Abstract

Translated from Chinese

Embodiments of the present application provide an abnormal access sequence detection method, apparatus, device and storage medium, relating to the field of artificial intelligence technology. The method includes: obtaining an access sequence to be detected and a target sequence length of the access sequence to be detected; inputting the access sequence to be detected into an anomaly detection model associated with the target sequence length to obtain a target observation probability corresponding to the access sequence to be detected, wherein the anomaly detection model is a time series probability model obtained by unsupervised or semi-supervised learning; obtaining an abnormal probability of the access sequence to be detected according to the target observation probability and a penalty coefficient matrix corresponding to the target sequence length; and, if the abnormal probability of the access sequence to be detected is greater than a preset threshold corresponding to the target sequence length, determining that the access sequence to be detected is an abnormal access sequence. Because the anomaly detection model is obtained by semi-supervised or unsupervised learning, the time series characteristics of the access sequence are fully considered, and the accuracy of abnormal access sequence detection is effectively improved.

Description


Abnormal access sequence detection method, apparatus, device and storage medium

Technical Field

The embodiments of the present application relate to the field of artificial intelligence technology, and in particular to an abnormal access sequence detection method, apparatus, device and storage medium.

Background Art

In everyday website access, in addition to automated network sniffing by crawlers and scanners, some traffic comes from manual penetration and targeted testing by attackers. For example, attackers repeatedly probe and penetrate system vulnerabilities in web pages and logic vulnerabilities in business flows. Based on the website's request and response feedback, attackers capture packets to repeatedly test how the website server responds to different requests, and thereby analyze and identify vulnerabilities in the backend server. Because manual penetration has a very low access frequency and differs little in its characteristics from ordinary user access in normal business, it cannot be captured by rules designed by administrators or by predefined attack methods and paths, and commonly used security devices and situational awareness systems have difficulty identifying such attacks. Once a system vulnerability is discovered and exploited, operation and maintenance personnel cannot detect the path and means of the attacker's manual penetration in time, which compromises system security. Therefore, there is an urgent need to analyze and detect attackers' manual penetration paths from large volumes of network traffic and access logs, so that they can be discovered and warned of in advance.

The prior art detects abnormal access sequences produced by manual penetration by means of supervised learning. Specifically, traffic or log data of known abnormal access sequences is first collected and the features of those abnormal access sequences are extracted; the features of the current access sequence are then matched against the features of the abnormal access sequences to find the abnormal access sequences produced by manual penetration.

However, the various kinds of manual penetration usually have no common features and may instead reflect the habits and characteristics of the individual attacker, so it is difficult to compile a sample set of attacker features to serve as a general template. Therefore, a technical solution that detects abnormal access sequences by matching the features of the current access sequence against the features of known abnormal access sequences has low accuracy.

Summary of the Invention

The embodiments of the present application provide an abnormal access sequence detection method, apparatus, device and storage medium, which fully consider the time series characteristics of the access sequence and effectively improve the accuracy of abnormal access sequence detection.

In a first aspect, an embodiment of the present application provides an abnormal access sequence detection method, comprising:

obtaining an access sequence to be detected and a target sequence length of the access sequence to be detected;

inputting the access sequence to be detected into an anomaly detection model associated with the target sequence length to obtain a target observation probability corresponding to the access sequence to be detected, wherein the anomaly detection model is a time series probability model obtained by unsupervised or semi-supervised learning;

obtaining an abnormal probability of the access sequence to be detected according to the target observation probability and a penalty coefficient matrix corresponding to the target sequence length;

if the abnormal probability of the access sequence to be detected is greater than a preset threshold corresponding to the target sequence length, determining that the access sequence to be detected is an abnormal access sequence.
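The four steps of the first aspect can be sketched end to end as follows. This is only a minimal illustration under stated assumptions, not the applicant's implementation: the names `detect`, `models`, `penalties` and `thresholds` are hypothetical, each model is represented as a plain callable returning an observation probability, and the "penalty coefficient matrix" is simplified to one scalar per sequence length because the text does not give its shape.

```python
from typing import Callable, Dict, Sequence

# Hypothetical type: a model maps an encoded access sequence to its observation probability.
Model = Callable[[Sequence[int]], float]


def detect(
    sequence: Sequence[int],
    models: Dict[int, Model],
    penalties: Dict[int, float],
    thresholds: Dict[int, float],
) -> bool:
    """Return True if the sequence is flagged as an abnormal access sequence."""
    # Step 1: the target sequence length is the number of encoded requests.
    length = len(sequence)
    # Step 2: select the model trained for this length and score the sequence.
    observation_prob = models[length](sequence)
    # Step 3: apply the length-specific penalty (assumed scalar in this sketch).
    abnormal_prob = observation_prob * penalties[length]
    # Step 4: compare against the length-specific preset threshold.
    return abnormal_prob > thresholds[length]
```

Keying the models, penalties and thresholds by sequence length mirrors the text, where each of the three is said to be associated with, or to correspond to, the target sequence length.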

In the embodiments of the present application, a time series probability model obtained by unsupervised or semi-supervised learning is first used as the anomaly detection model to obtain the target observation probability corresponding to the access sequence to be detected. The abnormal probability of the access sequence to be detected is then obtained according to the target observation probability and the penalty coefficient matrix corresponding to the target sequence length. Finally, an access sequence to be detected whose abnormal probability is greater than the preset threshold corresponding to the target sequence length is detected as an abnormal access sequence. This not only fully considers the time series characteristics of the access sequence, but also effectively improves the accuracy of abnormal access sequence detection.

In an optional implementation, a log record is obtained, the log record including a plurality of access requests and corresponding access times;

data preprocessing is performed on each access request to obtain a corresponding encoded request;

the obtained plurality of encoded requests are sorted according to their respective access times to obtain the access sequence to be detected, wherein the number of encoded requests is the target sequence length of the access sequence to be detected.

In an optional implementation, before the access sequence to be detected is input into the anomaly detection model associated with the target sequence length to obtain the target observation probability corresponding to the access sequence to be detected, the anomaly detection model associated with the target sequence length is selected from trained anomaly detection models associated with a plurality of sequence lengths.

In an optional implementation, the anomaly detection models associated with the plurality of sequence lengths are obtained by training in the following manner:

classifying a plurality of sample access sequences in an access sequence sample set according to sequence length to obtain a training subset for each of the plurality of sequence lengths;

using the training subset of each sequence length to train the anomaly detection model associated with that sequence length.

In an optional implementation, the training process of the anomaly detection model associated with each of the plurality of sequence lengths includes the following operations:

randomly generating a plurality of initial state matrices;

for each of the plurality of initial state matrices, performing the following operations: generating a set of candidate probability matrices based on the initial state matrix and the training subset corresponding to the sequence length, the set of candidate probability matrices including a state transition probability matrix and an observation state probability matrix; taking a sample access sequence as an observation sequence and obtaining a corresponding candidate observation probability based on the set of candidate probability matrices; and determining a prediction error of the set of candidate probability matrices based on the candidate observation probability and the initial state matrix;

selecting, from the obtained sets of candidate probability matrices, the set with the smallest prediction error as the model parameters of the anomaly detection model associated with the sequence length.
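The training procedure above amounts to random restarts followed by picking the candidate with the smallest prediction error. The sketch below shows only that outer loop. The callables `fit` and `prediction_error` are hypothetical placeholders, since the text does not specify how the candidate matrices are estimated or exactly how the error is computed, and the initial state matrix is simplified here to a single probability vector.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple


def random_distribution(n: int, rng: random.Random) -> List[float]:
    """Draw a random probability vector of length n (non-negative, sums to 1)."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def select_best_candidate(
    training_subset: Sequence[Sequence[int]],
    n_states: int,
    n_restarts: int,
    fit: Callable[[List[float], Sequence[Sequence[int]]], Any],
    prediction_error: Callable[[Any, List[float], Sequence[Sequence[int]]], float],
    seed: int = 0,
) -> Tuple[float, List[float], Any]:
    """Try several random initial states and keep the lowest-error candidate."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        initial = random_distribution(n_states, rng)
        # Hypothetical: estimate transition/emission matrices from this start.
        candidate = fit(initial, training_subset)
        # Hypothetical: score the candidate on the training subset.
        error = prediction_error(candidate, initial, training_subset)
        if best is None or error < best[0]:
            best = (error, initial, candidate)
    return best
```

Trying several starting points is a common way to reduce sensitivity to initialization when parameters are estimated iteratively, which appears to be the motivation for generating multiple initial state matrices.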

In an optional implementation, the prediction error is any one of the following:

the mean square error, the mean absolute error, or the sum of the minimum mean square error and the mean absolute error.
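For reference, the standard definitions of the two error measures named above are given below. The symbols are assumptions introduced for illustration: $y_i$ denotes a reference value and $\hat{y}_i$ the corresponding predicted value for the $i$-th of $n$ samples. The text does not state which quantities play these roles.

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
```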

In an optional implementation, for the training subset corresponding to the target sequence length, the plurality of sample access sequences in the training subset are sorted in descending order of candidate observation probability to obtain a sorting result;

the candidate observation probability of the sample access sequence ranked N-th in the sorting result is used as the preset threshold corresponding to the target sequence length, where N is a preset positive integer.
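The threshold rule above can be illustrated in a few lines. The function name `threshold_from_ranking` is hypothetical, and the input is assumed to be the list of candidate observation probabilities already computed for one training subset.

```python
from typing import Sequence


def threshold_from_ranking(observation_probs: Sequence[float], n: int) -> float:
    """Return the N-th largest candidate observation probability (N is 1-based)."""
    if not 1 <= n <= len(observation_probs):
        raise ValueError("n must be between 1 and the number of sample sequences")
    # Sort from largest to smallest, as described in the text.
    ranked = sorted(observation_probs, reverse=True)
    return ranked[n - 1]
```

For example, with probabilities 0.1, 0.4, 0.3 and 0.2 and N = 2, the descending order is 0.4, 0.3, 0.2, 0.1, so the threshold is 0.3.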

In an optional implementation, the anomaly detection models associated with the plurality of sequence lengths are tested in the following manner:

obtaining an access sequence to be tested;

extracting test observation sequences of a plurality of sequence lengths from the access sequence to be tested;

for each of the test observation sequences of the plurality of sequence lengths, performing the following operations: inputting the test observation sequence into the anomaly detection model associated with the corresponding sequence length to obtain a test observation probability corresponding to the test observation sequence;

obtaining an abnormal probability of the test observation sequence according to the test observation probability and the penalty coefficient matrix corresponding to the sequence length;

determining, based on the abnormal probability of the test observation sequence, a test result of the anomaly detection model associated with the sequence length.
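The extraction step in the test procedure could be sketched as follows. The text does not say how the test observation sequences are extracted, so this sketch assumes contiguous sliding windows. The function name `extract_windows` is hypothetical, and the default bounds of 3 and 8 are taken from the minimum and maximum sequence lengths given in the detailed description.

```python
from typing import Dict, List, Sequence, Tuple


def extract_windows(
    sequence: Sequence[int], min_len: int = 3, max_len: int = 8
) -> Dict[int, List[Tuple[int, ...]]]:
    """Map each sequence length to all contiguous windows of that length."""
    windows: Dict[int, List[Tuple[int, ...]]] = {}
    for length in range(min_len, max_len + 1):
        if length > len(sequence):
            break  # no longer windows exist
        windows[length] = [
            tuple(sequence[start:start + length])
            for start in range(len(sequence) - length + 1)
        ]
    return windows
```

Each window would then be scored by the model associated with its own length, as the remaining test steps describe.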

In an optional implementation, the target observation probability is multiplied by the penalty coefficient matrix corresponding to the target sequence length to obtain the abnormal probability of the access sequence to be detected.

In a second aspect, an embodiment of the present application provides an abnormal access sequence detection apparatus, including:

an acquisition unit, configured to obtain an access sequence to be detected and a target sequence length of the access sequence to be detected;

an observation probability acquisition unit, configured to input the access sequence to be detected into an anomaly detection model associated with the target sequence length to obtain a target observation probability corresponding to the access sequence to be detected, wherein the anomaly detection model is a time series probability model obtained by unsupervised or semi-supervised learning;

an abnormal probability acquisition unit, configured to obtain an abnormal probability of the access sequence to be detected according to the target observation probability and a penalty coefficient matrix corresponding to the target sequence length;

an anomaly detection unit, configured to determine that the access sequence to be detected is an abnormal access sequence if the abnormal probability of the access sequence to be detected is greater than a preset threshold corresponding to the target sequence length.

In an optional implementation, the acquisition unit is further configured to:

obtain a log record, the log record including a plurality of access requests and corresponding access times;

In an optional implementation, the apparatus further includes a model screening unit;

the model screening unit is specifically configured to:

before the access sequence to be detected is input into the anomaly detection model associated with the target sequence length to obtain the target observation probability corresponding to the access sequence to be detected, select the anomaly detection model associated with the target sequence length from trained anomaly detection models associated with a plurality of sequence lengths.

In an optional implementation, the apparatus further includes a model training unit;

the model training unit is specifically configured to:

In an optional implementation, the model training unit is further configured to:

for the training subset corresponding to the target sequence length, sort the plurality of sample access sequences in the training subset in descending order of candidate observation probability to obtain a sorting result;

In an optional implementation, the apparatus further includes a model testing unit;

the model testing unit is specifically configured to:

obtain an access sequence to be tested;

In an optional implementation, the anomaly detection unit is further configured to:

multiply the target observation probability by the penalty coefficient matrix corresponding to the target sequence length to obtain the abnormal probability of the access sequence to be detected.

In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above abnormal access sequence detection method when executing the program.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program executable by a computer device, wherein, when the program runs on the computer device, the computer device is caused to execute the steps of the above abnormal access sequence detection method.

In a fifth aspect, an embodiment of the present application provides a computer program product, the computer program product including a computer program stored on a computer-readable storage medium, the computer program including program instructions which, when executed by a computer device, cause the computer device to execute the steps of the above abnormal access sequence detection method.

Brief Description of the Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic structural diagram of a system architecture provided in an embodiment of the present application;

FIG. 2 is a schematic flow chart of an abnormal access sequence detection method provided in an embodiment of the present application;

FIG. 3 is a schematic flow chart of a method for training an anomaly detection model provided in an embodiment of the present application;

FIG. 4 is a schematic flow chart of a method for testing an anomaly detection model provided in an embodiment of the present application;

FIG. 5 is a schematic flow chart of an abnormal access sequence detection apparatus provided in an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a computer device provided in an embodiment of the present application.

Detailed Description

In order to make the purpose, technical solutions and beneficial effects of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present invention and not to limit it.

For ease of understanding, the terms used in the embodiments of the present invention are explained below.

Base64 encoding: Base64 encoding converts binary data into characters and can be used to transmit longer identification information in an HTTP environment. Base64-encoded data is not human-readable and must be decoded before it can be read.

URL encoding: the process of converting specific characters into a format suitable for use as part of a URL.
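The two encodings defined above can be tried with the Python standard library. This is only an illustration of the encodings themselves. The helper names `encode_base64` and `encode_url` and the sample request string are assumptions, not part of the application.

```python
import base64
from urllib.parse import quote, unquote


def encode_base64(request: str) -> str:
    """Base64-encode a request string (UTF-8 bytes to ASCII characters)."""
    return base64.b64encode(request.encode("utf-8")).decode("ascii")


def decode_base64(encoded: str) -> str:
    """Recover the original request string from its Base64 form."""
    return base64.b64decode(encoded).decode("utf-8")


def encode_url(request: str) -> str:
    """Percent-encode characters that are not safe inside a URL path."""
    # quote() leaves "/" unescaped by default and escapes "?", "=" and spaces.
    return quote(request)


sample = "index/login?name=a b"
print(encode_base64(sample))
print(encode_url(sample))  # index/login%3Fname%3Da%20b
print(unquote(encode_url(sample)) == sample)  # True
```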

Key-Value method: Key-Value is a commonly used data storage structure that organizes and accesses data using key-value pairs.

AR model: the autoregressive model is a statistical method for processing time series. It uses the values of the same variable in previous periods to predict its value in the current period, and assumes that the relationship between them is linear.

MA model: the moving average model is commonly used in time series analysis and spectral estimation. Its core idea is to regard the value of the time series as a linear combination of white noise at different time points, and this combination helps to predict future trends.
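In their usual textbook form, the two models defined above can be written as follows. The notation is an assumption introduced here for illustration: $X_t$ is the series value at time $t$, $\varepsilon_t$ is white noise, $p$ and $q$ are the model orders, $c$ and $\mu$ are constants, and $\varphi_i$ and $\theta_i$ are the model coefficients.

```latex
\text{AR}(p):\quad X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t
\qquad
\text{MA}(q):\quad X_t = \mu + \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}
```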

HMM model: the Hidden Markov Model is a statistical model used to describe a Markov process with hidden, unknown parameters. In an HMM there is a hidden Markov chain that randomly generates an unobservable state sequence, and each state then generates an observation, producing a random sequence of observations.
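For an HMM, the probability of an observation sequence given the model parameters is conventionally computed with the forward algorithm. The sketch below is a plain-Python version offered as an illustration, not as the applicant's implementation. The parameter names `initial`, `transition` and `emission` are assumptions standing for the initial state distribution, the state transition probability matrix and the observation state probability matrix.

```python
from typing import List, Sequence


def observation_probability(
    observations: Sequence[int],
    initial: List[float],
    transition: List[List[float]],
    emission: List[List[float]],
) -> float:
    """Forward algorithm: P(observations | initial, transition, emission)."""
    if not observations:
        raise ValueError("observation sequence must not be empty")
    n_states = len(initial)
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[prev] * transition[prev][s] for prev in range(n_states))
            * emission[s][obs]
            for s in range(n_states)
        ]
    # Summing over the final hidden state gives the sequence probability.
    return sum(alpha)
```

Here each observation is an integer index into the columns of `emission`, which fits access requests that have been encoded as integers.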

Refer to FIG. 1, which is a system architecture diagram applicable to an embodiment of the present application. The system architecture includes at least a terminal device 101 and a detection system 102. There may be one or more terminal devices 101 and one or more detection systems 102; the present application does not specifically limit the number of terminal devices 101 or detection systems 102.

An application is pre-installed on the terminal device 101, where the application may be a client application, a web application, a mini program application, or the like. The terminal device 101 may be, but is not limited to, a smartphone, a tablet computer, a laptop computer, a desktop computer, a smart home appliance, a smart voice interaction device, a smart in-vehicle device, or the like.

The detection system 102 is the backend server of the application. The detection system 102 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. The terminal device 101 and the detection system 102 may be connected directly or indirectly through wired or wireless communication, which is not limited in the present application.

The abnormal access sequence detection method in the embodiments of the present application may be executed by the terminal device 101, by the detection system 102, or by the terminal device 101 and the detection system 102 interacting with each other.

Based on the system architecture diagram shown in FIG. 1, an embodiment of the present application provides a schematic flow chart of an abnormal access sequence detection method, as shown in FIG. 2. The flow of the method is executed by a computer device, which may be the terminal device 101 and/or the detection system 102 shown in FIG. 1, and includes the following steps:

Step S201: obtain an access sequence to be detected and a target sequence length of the access sequence to be detected.

In an optional implementation, a log record is obtained, the log record including a plurality of access requests and corresponding access times;

the obtained plurality of encoded requests are sorted according to their respective access times to obtain the access sequence to be detected, wherein the number of encoded requests is the target sequence length of the access sequence to be detected.

Specifically, access log records are obtained. A log record includes the access time, access address, access request, response code and so on, for example: 2024-3-1, 1.2.3.4, index/login, 200. Each access request is converted by data preprocessing into a corresponding encoded request, which is then stored. The data preprocessing may use Base64 encoding, URL encoding, the key-value method, or the like. Taking the key-value method as an example, the mapping may be {index: 0, index/login: 1}, where the keys can be obtained by enumerating all possible access request URLs according to the front-end development documentation, and the parameter part of an access request URL can be uniformly simplified or deleted.

The plurality of encoded requests are sorted in chronological order of their respective access times to obtain the access sequence to be detected, and the number of encoded requests is the target sequence length of the access sequence to be detected. For example, if the obtained access requests are index, index/login, index/user, index/user/passreset, index/user/passreset/success and index/login, then after these access requests are converted into encoded requests by data preprocessing and sorted in chronological order, the access sequence to be detected is (0, 1, 2, 3, 4, 1) and its target sequence length is 6.
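The preprocessing and ordering described above, using the key-value method, might look like the following sketch. The code values 0 to 4 follow the worked example in the text. The names `CODEBOOK`, `encode_request` and `build_access_sequence` are hypothetical, log records are simplified to `(access_time, request)` pairs with sortable times, and dropping everything after `?` is one assumed way to delete the URL parameter part.

```python
from typing import Dict, Iterable, Tuple

# Key-value mapping from request path to integer code, as in the worked example.
CODEBOOK: Dict[str, int] = {
    "index": 0,
    "index/login": 1,
    "index/user": 2,
    "index/user/passreset": 3,
    "index/user/passreset/success": 4,
}


def encode_request(request: str, codebook: Dict[str, int] = CODEBOOK) -> int:
    """Strip the URL parameter part and look up the code of the remaining path."""
    path = request.split("?", 1)[0]
    return codebook[path]


def build_access_sequence(
    records: Iterable[Tuple[int, str]], codebook: Dict[str, int] = CODEBOOK
) -> Tuple[Tuple[int, ...], int]:
    """Sort records by access time and return (encoded sequence, sequence length)."""
    ordered = sorted(records, key=lambda record: record[0])
    sequence = tuple(encode_request(request, codebook) for _, request in ordered)
    return sequence, len(sequence)
```

A request path that is missing from the mapping raises `KeyError` in this sketch. How unknown paths should be handled is not covered by the text.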

In the above implementation, a plurality of access requests and their corresponding access times are obtained by parsing the log records, data preprocessing is then performed on the access requests to obtain encoded requests, and the access sequence to be detected and the target sequence length are obtained from the chronological order of the encoded requests. This simplifies the features of the access sequence, reduces the computational complexity of the anomaly detection model, greatly reduces the training time of the anomaly detection model, and helps to improve its detection accuracy.

Step S202: input the access sequence to be detected into the anomaly detection model associated with the target sequence length to obtain the target observation probability corresponding to the access sequence to be detected, wherein the anomaly detection model is a time series probability model obtained by unsupervised or semi-supervised learning.

In an optional implementation, before the access sequence to be detected is input into the anomaly detection model associated with the target sequence length to obtain the target observation probability corresponding to the access sequence to be detected, the anomaly detection model associated with the target sequence length is selected from trained anomaly detection models associated with a plurality of sequence lengths.

Specifically, different anomaly detection models are used for access sequences to be detected of different sequence lengths. After the access sequence to be detected is obtained and the target sequence length is determined, the anomaly detection model associated with the target sequence length is selected from the trained anomaly detection models associated with the plurality of sequence lengths, and the access sequence to be detected is input into that model.

In addition, the anomaly detection model is a time series probability model obtained by unsupervised or semi-supervised learning, such as an AR model (autoregressive model), an MA model (moving average model), or an HMM (Hidden Markov Model).

In the above implementation, a time series probability model obtained by unsupervised or semi-supervised learning is used as the anomaly detection model. Abnormal access traffic and abnormal logs do not need to be collected in advance, and the anomaly detection model associated with the sequence length can be used directly for detection. Moreover, the anomaly detection model can subsequently be trained and updated asynchronously based on a small amount of feedback from the administrator.

In an optional implementation, the anomaly detection models associated with the multiple sequence lengths are trained as follows:

classifying the sample access sequences in the access sequence sample set by sequence length to obtain a training subset for each of the multiple sequence lengths;

using the training subset of each sequence length to train the anomaly detection model associated with that sequence length.

Specifically, an access sequence sample set is obtained, and its sample access sequences are classified by sequence length to obtain a training subset for each sequence length. For example, the sample access sequence (1, 2, 3) has sequence length 3, (1, 2, 3, 4) has sequence length 4, and (1, 2, 3, 4, 1) has sequence length 5. The minimum sequence length is set to 3 and the maximum sequence length to 8.

For the training subset of each sequence length, longer sample access sequences are split and added to the shorter training subsets, and shorter sample access sequences are padded with null values and added to the longer training subsets. For example, the sample access sequence (2, 4, 6, 8, 10) of sequence length 5 is split into (2, 4, 6) and (6, 8, 10), which are added to the training subset of sequence length 3; the sample access sequence (1, 2, 3) of sequence length 3 is padded to (1, 2, 3, null) and added to the training subset of sequence length 4.
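The grouping and expansion described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `build_training_subsets` is hypothetical, splitting is assumed to take the leading and trailing window of each shorter length (which reproduces the (2, 4, 6) / (6, 8, 10) example), and padding is assumed to add a single null value to reach the next length only.

```python
from collections import defaultdict

PAD = None  # stands in for the "null" padding value (assumption)


def build_training_subsets(sequences, min_len=3, max_len=8):
    """Group sequences by length, then expand the subsets by splitting and padding."""
    originals = [tuple(s) for s in sequences if min_len <= len(s) <= max_len]
    subsets = defaultdict(list)
    for seq in originals:
        subsets[len(seq)].append(seq)

    for seq in originals:
        n = len(seq)
        # Split: add the leading and trailing window of every shorter length.
        for k in range(min_len, n):
            subsets[k].append(seq[:k])
            subsets[k].append(seq[-k:])
        # Pad: add one null value and contribute to the next longer subset.
        if n < max_len:
            subsets[n + 1].append(seq + (PAD,))
    return dict(subsets)


subsets = build_training_subsets([(1, 2, 3), (1, 2, 3, 4), (2, 4, 6, 8, 10)])
```

Other splitting rules (for example, every sliding window) or padding up to the maximum length would fit the description equally well; the choice affects how strongly long sequences influence the short-length models.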

The training subset of each sequence length is then used to train the anomaly detection model associated with that sequence length.

The training process and testing process of the anomaly detection model associated with each sequence length are described below.

In an optional implementation, the training process of the anomaly detection model associated with each of the multiple sequence lengths includes the following operations:

for each of multiple initial state matrices: generating a set of candidate probability matrices based on the initial state matrix and the training subset corresponding to the sequence length, the set of candidate probability matrices including a state transition probability matrix and an observation state probability matrix; taking the sample access sequences as observation sequences and obtaining the corresponding candidate observation probabilities based on the set of candidate probability matrices; and determining the prediction error of the set of candidate probability matrices based on the candidate observation probabilities and the initial state matrix;

selecting, from the multiple sets of candidate probability matrices obtained, the set with the smallest prediction error as the model parameters of the anomaly detection model associated with the sequence length.

Specifically, the number of hidden states is determined from the number of function categories of the system, such as "login", "register", "search", "payment", "transfer", and "scan code" (for example, 30), and the hidden state set is determined accordingly.

Take the training of the anomaly detection model associated with sequence length 3 as an example. Suppose the training subset is (1, 2, 3), (2, 5, 7), (1, 3, 9), .... Ten randomly distributed initial state matrices I₀, I₁, ..., I₁₀ are generated in succession. For each initial state matrix, the HMM algorithm is applied to the training subset, the hidden state set, and the initial state matrix, yielding 10 sets of candidate probability matrices, each consisting of a state transition probability matrix T and an observation state probability matrix O. Taking the sample access sequences as observation sequences, the candidate observation probabilities corresponding to each of the 10 sets of candidate probability matrices are then computed from T and O, so that every initial state matrix has its own candidate observation probabilities. The prediction error of each set of candidate probability matrices is calculated from the initial state matrix and its corresponding candidate observation probabilities.

From the multiple sets of candidate probability matrices obtained, the set with the smallest prediction error is selected as the model parameters of the anomaly detection model associated with the sequence length. The smallest prediction error may mean the smallest mean square error, the smallest mean absolute error, the smallest sum of the mean square error and the mean absolute error, and so on; this application does not limit this. The trained anomaly detection models associated with sequence lengths 3 to 8 are obtained in turn in this way.
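For an HMM, the observation probability of a sequence given an initial state distribution, a transition matrix T, and an observation matrix O is conventionally computed with the forward algorithm. The sketch below shows that standard computation in plain Python; the function and parameter names are illustrative, the observations are assumed to be integer-encoded, and the parameter estimation step and the prediction-error formula are not shown because the text does not specify them.

```python
def observation_probability(obs, init, trans, emit):
    """Forward algorithm: P(obs | init, trans, emit) for a discrete HMM.

    obs   -- sequence of observation indices
    init  -- init[s] is the probability of starting in hidden state s
    trans -- trans[p][s] is the probability of moving from state p to state s
    emit  -- emit[s][o] is the probability of observing o in state s
    """
    n_states = len(init)
    # alpha[s]: probability of the observations so far, ending in state s.
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


# Toy two-state model with two observation symbols (made-up numbers).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.5], [0.1, 0.9]]
p = observation_probability([0, 1], init, trans, emit)
```

Given several candidate parameter sets, the one to keep would then be chosen with something like `min(candidates, key=error_fn)`, where `error_fn` is whichever prediction-error measure is adopted.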

In an optional implementation, for the training subset corresponding to the target sequence length, the sample access sequences in the training subset are sorted in descending order of candidate observation probability to obtain a sorting result;

the candidate observation probability of the sample access sequence ranked Nth in the sorting result is used as the preset threshold corresponding to the target sequence length, where N is a preset positive integer.

Specifically, for the training subset corresponding to the target sequence length, the sample access sequences are sorted in descending order of candidate observation probability, and the candidate observation probability of the sample access sequence ranked Nth is used as the preset threshold corresponding to the target sequence length. For example, if more than 95% of the 100 sample access sequences in the training subset for sequence length 3 have a candidate observation probability above 1%, the preset threshold is 1%.
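The ranking rule above amounts to reading off the Nth largest candidate observation probability. A minimal sketch, with a hypothetical function name and made-up probabilities:

```python
def threshold_from_rank(probabilities, n):
    """Return the n-th largest candidate observation probability (n is 1-based)."""
    ranked = sorted(probabilities, reverse=True)
    if not 1 <= n <= len(ranked):
        raise ValueError("n must be between 1 and the number of samples")
    return ranked[n - 1]


# Made-up candidate observation probabilities for one training subset.
threshold = threshold_from_rank([0.05, 0.2, 0.01, 0.1], n=3)
```

With 100 samples, choosing N = 95 corresponds to the "95% of samples lie above the threshold" example in the text.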

It should be noted that the preset thresholds of the sequence lengths other than the target sequence length can be obtained in the same way, and the parameter N may be the same or different for different sequence lengths; this application does not specifically limit this.

In an optional implementation, the testing process of the anomaly detection models associated with the multiple sequence lengths includes the following steps:

obtaining an access sequence to be tested;

extracting test observation sequences of multiple sequence lengths from the access sequence to be tested;

for each of the test observation sequences of the multiple sequence lengths: inputting the test observation sequence into the anomaly detection model associated with its sequence length to obtain the test observation probability corresponding to the test observation sequence;

obtaining the anomaly probability of the test observation sequence according to the test observation probability and the penalty coefficient matrix corresponding to the sequence length;

determining the test result of the anomaly detection model associated with the sequence length based on the anomaly probability of the test observation sequence.

Specifically, log records are obtained, the access sequence to be tested is obtained from the log records, and test observation sequences of multiple sequence lengths are extracted from it. For example, if the access sequence to be tested is (3, 3, 3, 3, 3), it is split into three test observation sequences of sequence lengths 3, 4, and 5. Each test observation sequence is input into the anomaly detection model associated with its sequence length to obtain its test observation probability, and the anomaly probability of the test observation sequence is obtained from the test observation probability and the penalty coefficient matrix corresponding to the sequence length. For example, the anomaly probability for sequence length 3 is 0.5%, for sequence length 4 is 0.1%, and for sequence length 5 is 0.01%. These values are compared with the preset threshold of the anomaly detection model associated with each sequence length to determine whether each test observation sequence is abnormal.

If the test result shows that the test observation sequences of sequence lengths 3 and 4 are normal and the test observation sequence of sequence length 5 is abnormal, the test result is sent to the administrator, who makes an overall judgment on whether the access sequence to be tested is abnormal. If the anomaly probability for sequence length 5 (0.01%) is far below the 2% preset threshold of the anomaly detection model associated with sequence length 5, the test observation sequence of sequence length 5 is reported to the administrator as an alert. If the administrator determines after analysis that it reflects normal business activity, the relevant training data can be added to the training set dynamically, and the state transition probability matrix and the observation state probability matrix are updated asynchronously.
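Extracting one test observation sequence per sequence length can be sketched as below. The text does not say which part of the access sequence each length takes; this sketch assumes the most recent k requests, and the function name and default bounds are illustrative.

```python
def extract_test_sequences(access_sequence, min_len=3, max_len=8):
    """Return {length: trailing window of that length} for every usable length."""
    seq = tuple(access_sequence)
    upper = min(max_len, len(seq))
    # Assumption: each test observation sequence is the last k encoded requests.
    return {k: seq[-k:] for k in range(min_len, upper + 1)}


windows = extract_test_sequences((1, 2, 3, 4, 5))
```

For a five-request sequence this yields windows of lengths 3, 4, and 5, matching the count in the (3, 3, 3, 3, 3) example; sequences shorter than the minimum length yield no windows.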

Step S203: obtain the anomaly probability of the access sequence to be detected according to the target observation probability and the penalty coefficient matrix corresponding to the target sequence length.

In an optional implementation, the target observation probability is multiplied by the penalty coefficient matrix corresponding to the target sequence length to obtain the anomaly probability of the access sequence to be detected.

Specifically, the standard URL page-jump logic is compiled from the front-end development documentation, and the penalty coefficient matrix for the access sequences of each sequence length is generated automatically. The penalty coefficient matrix is usually obtained with a graph centrality measure. The distance function of the deep-walk algorithm is always symmetric, its computational cost is high, and it has many hyperparameters, whereas the distance function of a simple random-walk algorithm is asymmetric and considerably cheaper to compute. This application therefore uses the random-walk algorithm as an example to describe how the penalty coefficient matrix is obtained. Each URL page in the access sequences is treated as a node, and an edge is created between two pages if an access jump exists between them. Although a "return to previous page" jump may exist, the business likelihood of the forward and backward jumps differs, so a directed graph is built. For example, if clicking login on page A (up.com/index.htm) jumps to page B (up.com/login.htm), a directed edge from node A to node B is created.

Further, to calculate the distance from node A to node B, a random walk is started from node A, and the average number of steps needed to reach node B for the first time (the hitting time) is taken as the distance from node A to node B. Because this distance is directional, the distance from node A to node B is not necessarily equal to the distance from node B to node A.
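The hitting time of a random walk satisfies h(target) = 0 and h(i) = 1 + the average of h over the successors of i, which can be solved by repeated substitution. The sketch below assumes a uniform choice among outgoing links and a small hypothetical page graph; the function name and the fixed iteration count are illustrative choices, not part of the described method.

```python
def hitting_times(graph, target, iterations=1000):
    """Expected number of steps for a uniform random walk to first reach `target`.

    graph maps every node to the list of nodes it links to; every node must
    appear as a key. Nodes that cannot reach the target do not converge to a
    finite value (dead ends are reported as infinity).
    """
    h = {node: 0.0 for node in graph}
    for _ in range(iterations):
        updated = {}
        for node, successors in graph.items():
            if node == target:
                updated[node] = 0.0
            elif not successors:
                updated[node] = float("inf")
            else:
                updated[node] = 1.0 + sum(h[s] for s in successors) / len(successors)
        h = updated
    return h


# Hypothetical jump graph: A -> B, B -> A, B -> C, C -> A.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
a_to_b = hitting_times(graph, "B")["A"]  # distance from A to B
b_to_a = hitting_times(graph, "A")["B"]  # distance from B to A
```

In this toy graph the distance from A to B is 1 step while the distance from B to A is 1.5 steps, which illustrates the asymmetry noted above. For large graphs, solving the linear system directly or estimating by simulation may be preferable.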

With the random-walk algorithm above, the degree of association (measured by distance) between every pair of nodes in the directed graph can be calculated. The larger the association value between two nodes, the closer their connection, the more likely a jump between the two pages, and the larger the corresponding penalty coefficient in the penalty coefficient matrix. For example, the penalty coefficient from /index/robot to /index/passreset is set to 0.01, indicating that this jump is extremely unlikely in normal business.

In particular, when page A (node A) accesses itself, the penalty coefficient can be set to a random very small value; for example, the penalty coefficient of index to itself is set to 0.01. When the graph is disconnected, the penalty coefficients between nodes of different disconnected subgraphs should also be set to a random very small value. For example, if node index points to node logi in disconnected subgraph 1 and node test1 points to node test2 in disconnected subgraph 2, the penalty coefficient from node index to node test1 is set to 0.02.

The penalty coefficient matrix corresponding to each target sequence length is obtained in this way, and the target observation probability is multiplied by the penalty coefficient matrix corresponding to the target sequence length to obtain the anomaly probability of the access sequence to be detected.
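The text does not spell out how a scalar observation probability is combined with a matrix. One possible reading, sketched below purely as an assumption, is that the probability is scaled by the penalty coefficient of each consecutive jump in the sequence. The function name, the dictionary representation of the matrix, the default coefficient, and the example values are all hypothetical.

```python
def penalized_score(sequence, obs_prob, penalty, default=0.01):
    """Scale an observation probability by the coefficient of each jump.

    penalty maps (source page, destination page) to a coefficient. Jumps that
    are missing from the mapping get a small default value (assumption).
    """
    score = obs_prob
    for src, dst in zip(sequence, sequence[1:]):
        score *= penalty.get((src, dst), default)
    return score


# Made-up coefficients for two plausible jumps.
penalty = {("index", "login"): 0.9, ("login", "pay"): 0.5}
score = penalized_score(("index", "login", "pay"), 0.2, penalty)
```

Under this reading, a sequence containing a jump that the business logic considers unlikely is scaled by a very small coefficient at that step.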

Step S204: if the anomaly probability of the access sequence to be detected is greater than the preset threshold corresponding to the target sequence length, the access sequence to be detected is an abnormal access sequence.

To explain the embodiments of the present application more clearly, FIG. 3 shows a schematic flowchart of a training method for the anomaly detection model, which includes the following steps:

Step 301: log parsing.

Specifically, log records are obtained, parsed, and cleaned to produce formatted data; multiple access requests and their access times are obtained; and each access request is preprocessed to obtain the corresponding encoded request.

Step 302: sequence extraction.

Specifically, a maximum time window is set, sequences are extracted from the logs based on the encoded requests to obtain the access sequence sample set, and the sample access sequences in the set are classified by sequence length to obtain a training subset for each sequence length.
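Sequence extraction with a maximum time window can be sketched as below. The text does not define how the window delimits a sequence; this sketch assumes that a new sequence starts once a request arrives more than `max_window` time units after the first request of the current sequence. The function name and the `(timestamp, code)` event format are illustrative.

```python
def extract_sequences(events, max_window):
    """Cut time-ordered (timestamp, code) events into access sequences.

    Assumption: a sequence is closed when the next request falls outside the
    window that started at the first request of that sequence.
    """
    sequences, current, start = [], [], None
    for timestamp, code in sorted(events):
        if current and timestamp - start > max_window:
            sequences.append(tuple(current))
            current = []
        if not current:
            start = timestamp
        current.append(code)
    if current:
        sequences.append(tuple(current))
    return sequences


# Made-up events: three requests close together, then two more much later.
events = [(0, 1), (5, 2), (9, 3), (30, 4), (31, 1)]
sequences = extract_sequences(events, max_window=10)
```

In practice the events would typically be grouped per user or session before this step; that grouping is omitted here.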

Step 303: training subset expansion.

Specifically, longer sample access sequences are split and added to the shorter training subsets, and shorter sample access sequences are padded with null values and added to the longer training subsets.

Step 304: setting the number of hidden states.

Specifically, the number of hidden states is set according to the system functions and business logic, and the hidden state set is obtained.

Step 305: model training and calculation.

Specifically, initial state matrices are randomly generated multiple times. Based on an initial state matrix, the hidden state set, and the training subset corresponding to the sequence length, the model is trained to compute a set of candidate probability matrices: the state transition probability matrix and the observation state probability matrix. The training subset is then used as the set of observation sequences, and the corresponding candidate observation probabilities are obtained from the computed state transition probability matrix and observation state probability matrix. The prediction error of the set of candidate probability matrices is determined from the initial state matrix and the corresponding candidate observation probabilities, and the set with the smallest prediction error is selected. The anomaly detection models associated with all sequence lengths are obtained in the same way.

Step 306: setting the anomaly probability threshold.

Specifically, the anomaly probability threshold of the anomaly detection model associated with each sequence length is set according to the distribution of candidate observation probabilities in the corresponding training subset.

Step 307: setting the penalty coefficient matrix.

Specifically, the penalty coefficient matrix corresponding to each sequence length is set according to the access logic defined in system development, covering web pages, mini programs, and so on.

In this implementation, expanding the access sequence sample set and iterating over random initial probabilities makes the anomaly detection model more robust, and adding a penalty coefficient matrix based on the business logic improves its accuracy.

In addition, to explain the embodiments of the present application more clearly, FIG. 4 shows a schematic flowchart of an abnormal access sequence testing method, which includes:

Step 401: log parsing.

Specifically, log records are obtained and parsed to obtain multiple access requests and their access times. Each access request is preprocessed to obtain the corresponding encoded request, and the encoded requests are sorted by access time to obtain the access sequence to be tested. The access sequence to be tested is split into observed behavior sequences of multiple sequence lengths.

Step 402: input the observed behavior sequence of sequence length N into the anomaly detection model associated with sequence length N to obtain the test observation probability.

Anomaly detection starts from N = Nmin.

Step 403: determine whether the sequence length N is less than the maximum sequence length Nmax. If N < Nmax, go to step 404; otherwise go to step 405.

Step 404: set N = N + 1 and return to step 402.

Step 405: multiply by the penalty coefficient matrix.

Each test observation probability is multiplied by the penalty coefficient matrix corresponding to its sequence length to obtain the anomaly probability of each observed behavior sequence.

Step 406: determine whether the anomaly probability is greater than the preset threshold. If so, go to step 407; otherwise go to step 408.

Step 407: anomaly alert.

Step 408: normal access sequence.

The anomaly probability of each observed behavior sequence is compared with the corresponding preset threshold. If the anomaly probability is greater than the corresponding preset threshold, the observed behavior sequence is an abnormal access sequence and an anomaly alert is raised; otherwise, the observed behavior sequence is a normal access sequence.
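Steps 402 to 408 can be sketched as a loop over sequence lengths followed by a threshold comparison. The sketch below is illustrative only: the trained models and the penalty step are passed in as plain callables, all names are hypothetical, and the comparison direction follows step 406 as written.

```python
def classify_sequences(observed, models, penalize, thresholds, n_min=3, n_max=8):
    """Label the observed behavior sequence of each length as abnormal or normal.

    observed   -- {length: sequence}
    models     -- {length: callable(sequence) -> test observation probability}
    penalize   -- callable(length, sequence, probability) -> anomaly probability
    thresholds -- {length: preset threshold}
    """
    probabilities = {}
    n = n_min
    while n <= n_max:  # steps 402-404: one model per sequence length
        if n in observed:
            probabilities[n] = models[n](observed[n])
        n += 1

    results = {}
    for length, prob in probabilities.items():  # steps 405-408
        score = penalize(length, observed[length], prob)
        results[length] = "abnormal" if score > thresholds[length] else "normal"
    return results


# Stub models and penalty step with made-up values, for illustration only.
observed = {3: (1, 2, 3), 4: (1, 2, 3, 4)}
models = {3: lambda seq: 0.5, 4: lambda seq: 0.1}
results = classify_sequences(
    observed, models, lambda n, seq, p: p * 0.1, {3: 0.01, 4: 0.02}
)
```

Keeping the models and the penalty step as injected callables lets the same loop be reused when a model is retrained or a penalty matrix is regenerated.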

In this implementation, the access requests are preprocessed, and a time-series probability model obtained through unsupervised or semi-supervised learning is used as the anomaly detection model for abnormal access sequence detection. The behavior path of abnormal user access can thus be detected and identified from the chronological order of the access sequence, and the detection accuracy for abnormal access sequences is improved.

Based on the same technical concept, and referring to FIG. 5, an embodiment of the present application provides an abnormal access sequence detection apparatus, including:

an acquisition unit 501, configured to obtain an access sequence to be detected and the target sequence length of the access sequence to be detected;

an observation probability acquisition unit 502, configured to input the access sequence to be detected into the anomaly detection model associated with the target sequence length to obtain the target observation probability corresponding to the access sequence to be detected, wherein the anomaly detection model is a time-series probability model obtained through unsupervised or semi-supervised learning;

an anomaly probability acquisition unit 503, configured to obtain the anomaly probability of the access sequence to be detected according to the target observation probability and the penalty coefficient matrix corresponding to the target sequence length;

an anomaly detection unit 504, configured to determine that the access sequence to be detected is an abnormal access sequence if its anomaly probability is greater than the preset threshold corresponding to the target sequence length.

In an optional implementation, the acquisition unit 501 is further configured to:

obtain log records, the log records including multiple access requests and their access times;

In an optional implementation, the apparatus further includes a model screening unit 505;

the model screening unit 505 is specifically configured to:

before the access sequence to be detected is input into the anomaly detection model associated with the target sequence length to obtain the corresponding target observation probability, select the anomaly detection model associated with the target sequence length from the trained anomaly detection models associated with the multiple sequence lengths.

In an optional implementation, the apparatus further includes a model training unit 506;

the model training unit 506 is specifically configured to:

In an optional implementation, the model training unit 506 is further configured to:

for each of multiple initial state matrices: generate a set of candidate probability matrices based on the initial state matrix and the training subset corresponding to the sequence length, the set of candidate probability matrices including a state transition probability matrix and an observation state probability matrix; take the sample access sequences as observation sequences and obtain the corresponding candidate observation probabilities based on the set of candidate probability matrices; and determine the prediction error of the set of candidate probability matrices based on the candidate observation probabilities and the initial state matrix;

for the training subset corresponding to the target sequence length, sort the sample access sequences in the training subset in descending order of candidate observation probability to obtain a sorting result;

In an optional implementation, the apparatus further includes a model testing unit 507;

the model testing unit 507 is specifically configured to:

obtain an access sequence to be tested;

In an optional implementation, the anomaly detection unit 504 is further configured to:

multiply the target observation probability by the penalty coefficient matrix corresponding to the target sequence length to obtain the anomaly probability of the access sequence to be detected.

Based on the same technical concept, an embodiment of the present application provides a computer device, which may be the terminal device and/or the detection system shown in FIG. 1. As shown in FIG. 6, the computer device includes at least one processor 601 and a memory 602 connected to the at least one processor. The embodiments of the present application do not limit the specific connection medium between the processor 601 and the memory 602; in FIG. 6, a bus connection between the processor 601 and the memory 602 is taken as an example. The bus can be divided into an address bus, a data bus, a control bus, and so on.

In the embodiments of the present application, the memory 602 stores instructions executable by the at least one processor 601, and the at least one processor 601 can perform the steps of the above abnormal access sequence detection method by executing the instructions stored in the memory 602.

The processor 601 is the control center of the computer device. It can connect the parts of the computer device through various interfaces and lines, and it implements abnormal access sequence detection by running or executing the instructions stored in the memory 602 and calling the data stored in the memory 602. Optionally, the processor 601 may include one or more processing units and may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, and application programs, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 601. In some embodiments, the processor 601 and the memory 602 may be implemented on the same chip; in other embodiments, they may be implemented on separate chips.

The processor 601 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of the present application may be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor.

The memory 602, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 602 may include at least one type of storage medium, for example flash memory, a hard disk, a multimedia card, card-type memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, a magnetic disk, an optical disc, and so on. The memory 602 may also be any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer device, but is not limited thereto. The memory 602 in the embodiments of the present application may also be a circuit or any other apparatus capable of implementing a storage function, used to store program instructions and/or data.

Based on the same inventive concept, an embodiment of the present application provides a computer-readable storage medium storing a computer program executable by a computer device. When the program runs on the computer device, it causes the computer device to perform the steps of the abnormal access sequence detection method described above.

Based on the same inventive concept, an embodiment of the present application provides a computer program product comprising a computer program stored on a computer-readable storage medium. The computer program comprises program instructions that, when executed by a computer device, cause the computer device to perform the steps of the abnormal access sequence detection method described above.

Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer device or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer device or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer device or other programmable data processing device, so that a series of operational steps is performed on the computer device or other programmable device to produce a computer-implemented process. The instructions executed on the computer device or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present invention have been described, those skilled in the art may make further changes and modifications to these embodiments once they grasp the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.

Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if such changes and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.