Disclosure of Invention
In view of the above, the present application is directed to a data processing method and system applied to data center station construction.
According to a first aspect of the present application, there is provided a data processing method applied to data center construction, applied to a data center service system, the method comprising:
Acquiring a first template network intrusion data sequence and a second template network intrusion data sequence which are applied to data center station construction, wherein the first template network intrusion data sequence and the second template network intrusion data sequence are different in network intrusion type, and the template configuration scale of the first template network intrusion data sequence is larger than that of the second template network intrusion data sequence;
Dividing the first template network intrusion data sequence into a plurality of template network intrusion data subsequences, and determining the extraction scale of each template network intrusion data subsequence based on the scale ratio of the template configuration scale of the first template network intrusion data sequence to the template configuration scale of the second template network intrusion data sequence;
Performing data extraction on each template network intrusion data subsequence based on the extraction scale to generate a third template network intrusion data sequence, and generating a sample learning data sequence based on the third template network intrusion data sequence and the second template network intrusion data sequence, wherein the scale ratio of the template configuration scale of the third template network intrusion data sequence to the template configuration scale of the second template network intrusion data sequence is smaller than a set ratio;
Performing model parameter learning on a network intrusion prediction model based on the sample learning data sequence, and generating the network intrusion prediction model with the model parameter learning completed;
Classifying target network intrusion data based on the network intrusion prediction model with the model parameter learning completed, generating a classification prediction result of the target network intrusion data, and performing security protection reinforcement configuration on a data center station to be constructed based on the classification prediction result of the target network intrusion data, wherein the classification prediction result characterizes the network intrusion type of the target network intrusion data.
In a possible implementation manner of the first aspect, the acquiring the first template network intrusion data sequence and the second template network intrusion data sequence applied to the data center station construction includes:
obtaining a reference intrusion log carrying an attack type tag from an initial network intrusion log, wherein the attack type tag characterizes the network intrusion type of the reference intrusion log;
encoding the reference intrusion log to generate an intrusion path encoding vector of the reference intrusion log;
determining the first template network intrusion data sequence and the second template network intrusion data sequence based on the intrusion path encoding vector of the reference intrusion log.
In a possible implementation manner of the first aspect, the obtaining, from an initial network intrusion log, a reference intrusion log carrying an attack type tag includes:
Acquiring an initial intrusion knowledge point of the reference intrusion log;
determining intrusion behavior evaluation information based on the initial intrusion knowledge points of the reference intrusion log;
Removing noise intrusion logs from the reference intrusion log based on the intrusion behavior evaluation information and an intrusion feature template, and generating the updated reference intrusion log.
In a possible implementation manner of the first aspect, the encoding the reference intrusion log to generate an intrusion path encoding vector of the reference intrusion log includes:
constructing an initial intrusion knowledge vector of the reference intrusion log according to the prior attack behavior record of the reference intrusion log, wherein the initial intrusion knowledge vector at least comprises an attack type, an attack source and an attack target;
Constructing an attack strategy penetration characteristic of the reference intrusion log according to the attack mode of the reference intrusion log;
Serializing the initial intrusion knowledge vector and the attack strategy penetration feature of the reference intrusion log based on a set time window, so as to fuse the initial intrusion knowledge vectors and the attack strategy penetration features of different time windows;
Encoding the attack strategy penetration feature together with at least one of the pre-fusion initial intrusion knowledge vector and the post-fusion initial intrusion knowledge vector to obtain a first encoding vector of the reference intrusion log;
processing a classification vector in the first encoding vector of the reference intrusion log to generate a second encoding vector of the reference intrusion log;
Splicing the first encoding vector and the second encoding vector of the reference intrusion log to generate the intrusion path encoding vector of the reference intrusion log, wherein the intrusion path encoding vector of the reference intrusion log is a numerical intrusion path encoding vector.
In a possible implementation manner of the first aspect, the encoding the attack strategy penetration feature together with at least one of the pre-fusion initial intrusion knowledge vector and the post-fusion initial intrusion knowledge vector to obtain a first encoding vector of the reference intrusion log includes:
Performing regularization conversion on the structured vectors in the attack strategy penetration feature and in at least one of the pre-fusion initial intrusion knowledge vector and the post-fusion initial intrusion knowledge vector, and
Performing discretization conversion on the unstructured vectors in the attack strategy penetration feature and in at least one of the pre-fusion initial intrusion knowledge vector and the post-fusion initial intrusion knowledge vector.
In a possible implementation manner of the first aspect, the dividing the first template network intrusion data sequence into a plurality of template network intrusion data subsequences, determining a scale of extraction of each of the template network intrusion data subsequences based on a ratio of a template configuration scale of the first template network intrusion data sequence to a template configuration scale of the second template network intrusion data sequence, includes:
When the scale ratio of the template configuration scale of the first template network intrusion data sequence to the template configuration scale of the second template network intrusion data sequence is larger than the set ratio, determining the extraction scale based on the set ratio, the template configuration scale of the template network intrusion data subsequence, and the template configuration scale of the second template network intrusion data sequence.
In a possible implementation manner of the first aspect, the performing model parameter learning on the network intrusion prediction model based on the sample learning data sequence, generating the network intrusion prediction model that completes model parameter learning, includes:
randomly extracting a plurality of characteristics of each sample learning data in the sample learning data sequence, and inputting an initial learning algorithm to generate a risk weighted prediction tree;
Iteratively executing the steps of arbitrarily extracting a plurality of features of each sample learning data in the sequence of sample learning data and inputting an initial learning algorithm to generate a risk weighted prediction tree until X risk weighted prediction trees are generated, wherein X is a positive integer greater than 1, and
Performing ensemble learning on the X risk weighted prediction trees to generate the network intrusion prediction model.
In a possible implementation manner of the first aspect, the performing model parameter learning on the network intrusion prediction model based on the sample learning data sequence, generating the network intrusion prediction model that completes model parameter learning, includes:
dividing the sample learning data sequence into a first subsequence and a second subsequence;
performing model parameter learning on the network intrusion prediction model according to initialization weight information and the first subsequence, wherein the initialization weight information comprises the scale of the template network intrusion data subsequence, the value of X, and the features arbitrarily extracted from each sample learning data in the sample learning data sequence;
learning network intrusion prediction behaviors of the network intrusion prediction model;
continuing to perform model parameter learning on the network intrusion prediction model according to the model parameter information of the learned network intrusion prediction behavior, generating the network intrusion prediction model with the model parameter learning completed, and testing the network intrusion prediction model with the model parameter learning completed according to the second subsequence.
In a possible implementation manner of the first aspect, the classifying target network intrusion data based on the network intrusion prediction model with the model parameter learning completed, and generating a classification prediction result of the target network intrusion data, includes:
encoding the acquired target network intrusion data set to generate an intrusion path encoding vector of the target network intrusion data;
inputting an intrusion path coding vector of the target network intrusion data into each risk weighted prediction tree in the network intrusion prediction model with model parameter learning completed, so that each risk weighted prediction tree outputs network intrusion classification data;
and generating the classification prediction result based on the network intrusion classification data output by each risk weighted prediction tree.
In a possible implementation manner of the first aspect, the network intrusion classification data includes a confidence that the target network intrusion data belongs to different network intrusion categories, and the generating the classification prediction result based on the network intrusion classification data output by each risk weighted prediction tree includes:
For each network intrusion category, counting the number of pieces of network intrusion classification data in which the confidence of that category is larger than a preset confidence;
Determining the network intrusion category whose count is greater than half of X as the classification prediction result.
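As an illustration only, the voting rule of this implementation manner can be sketched in Python as follows. The function name `vote`, the representation of each tree's output as a dictionary from category to confidence, and the choice to return `None` when no category exceeds half of X are assumptions made for this sketch and are not specified by the application.

```python
from collections import Counter


def vote(tree_outputs, preset_confidence):
    """Pick the category that more than half of the X trees support.

    tree_outputs: one dict per risk weighted prediction tree, mapping a
    network intrusion category to a confidence (assumed layout).
    """
    x = len(tree_outputs)
    counts = Counter()
    for output in tree_outputs:
        for category, confidence in output.items():
            # Count only outputs whose confidence exceeds the preset value.
            if confidence > preset_confidence:
                counts[category] += 1
    # Keep the categories whose count is greater than half of X.
    winners = [c for c, n in counts.items() if n > x / 2]
    if not winners:
        return None  # assumption: no prediction when no category qualifies
    return max(winners, key=lambda c: counts[c])


outputs = [
    {"apt": 0.8, "port_scan": 0.2},
    {"apt": 0.7, "port_scan": 0.3},
    {"apt": 0.4, "port_scan": 0.6},
]
result = vote(outputs, preset_confidence=0.5)
```

With three trees, "apt" exceeds the preset confidence in two of them, which is more than half of X, so it is returned as the classification prediction result in this sketch.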
According to a second aspect of the present application, there is provided a data center service system including a processor and a readable storage medium storing a program which, when executed by the processor, implements the aforementioned data processing method applied to data center construction.
According to a third aspect of the present application, there is provided a computer-readable storage medium having stored therein computer-executable instructions which, when executed, implement the aforementioned data processing method applied to data center construction.
According to any one of the aspects, the embodiment of the application can more comprehensively and accurately cope with various network intrusion situations by acquiring the template network intrusion data sequences of two different network intrusion categories and performing intelligent processing according to the template configuration scale of the template network intrusion data sequences, thereby improving the safety protection capability of a data center service system. By dividing the first template network intrusion data sequence and determining the extraction scale based on the scale ratio, the data is efficiently utilized, the data processing flow is optimized, the calculation complexity is reduced, and the processing speed is improved. The third template network intrusion data sequence is generated and combined with the second template network intrusion data sequence to generate the sample learning data sequence, so that the comprehensiveness and accuracy of model learning are ensured, and a high-quality data basis is provided for subsequent model parameter learning. Model parameter learning is carried out on the network intrusion prediction model by utilizing the sample learning data sequence, so that the model can more accurately predict and identify various network intrusion behaviors, and the safety protection capability of the data center platform service system is enhanced. By classifying the target network intrusion data and carrying out safety protection reinforcement configuration on the data center to be constructed based on the classification prediction result, personalized protection of the data center to be constructed is realized, and safety and stability of the data center are effectively improved. 
Therefore, the security protection capability of the data center station service system can be significantly improved, the data processing flow is optimized, the processing speed is increased, and safe and stable operation of the data center station is ensured.
Detailed Description
In order that those skilled in the art may better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without creative effort are intended to be within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Fig. 1 shows a flow chart of a data processing method applied to data center station construction according to an embodiment of the present application. It should be understood that, in other embodiments, the order of some of the steps in the data processing method applied to data center station construction may be interchanged according to actual needs, or some of the steps may be omitted or deleted. The detailed steps of the data processing method applied to data center station construction are described as follows.
Step S110, a first template network intrusion data sequence and a second template network intrusion data sequence applied to data center station construction are obtained, the network intrusion categories of the first template network intrusion data sequence and the second template network intrusion data sequence are different, and the template configuration scale of the first template network intrusion data sequence is larger than that of the second template network intrusion data sequence.
In a practical scenario, data center station construction requires a large amount of network intrusion data to train and optimize the model. The server may obtain this data from a number of sources, such as a security monitoring system within the enterprise, security authorities on the internet, historical data stores, etc.
Assume that a server obtains the following two network intrusion data sequences from a security monitoring system inside an enterprise:
The first template network intrusion data sequence comprises a series of complex network attack behaviors, such as Advanced Persistent Threat (APT) attacks and cross-site scripting (XSS) attacks. These attacks are often technically sophisticated and highly damaging.
The second template network intrusion data sequence comprises some common network intrusion behaviors, such as port scanning, malicious software infection and the like. These attacks are relatively simple, but also need to be discovered and protected in time.
The server stores the two data sequences in a database for use in subsequent steps.
Step S120, dividing the first template network intrusion data sequence into a plurality of template network intrusion data subsequences, and determining an extraction scale of each of the template network intrusion data subsequences based on a scale ratio of a template configuration scale of the first template network intrusion data sequence to a template configuration scale of the second template network intrusion data sequence.
In this step, the server needs to divide the first template network intrusion data sequence into a plurality of sub-sequences and determine the extraction scale of each sub-sequence. The extraction scale is determined based on a scale ratio of the first template network intrusion data sequence to the second template network intrusion data sequence.
Assume that the total size of the first template network intrusion data sequence is 10000 data points and the size of the second template network intrusion data sequence is 2000 data points. Then the scale ratio is 5.
The server may convert the ratio 5 to the extraction scale for each template network intrusion data subsequence according to the set rules. For example, the server may set the extraction scale for each sub-sequence to 2000 data points.
The specific conversion rule can be adjusted according to actual conditions, for example, the specific conversion rule can be determined according to factors such as data distribution conditions, training requirements of a model and the like.
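One possible conversion rule can be sketched in Python as follows. It is an assumption made for illustration, not the rule of the application: the extraction scale of each subsequence is made proportional to the size of that subsequence, and the total is kept strictly below the set ratio multiplied by the size of the second template network intrusion data sequence. The function name `extraction_scales` and its parameters are hypothetical.

```python
import math


def extraction_scales(subsequence_sizes, second_size, set_ratio):
    """Return how many data points to extract from each subsequence."""
    first_size = sum(subsequence_sizes)
    # If the first sequence is already below the set ratio, keep everything.
    if first_size / second_size < set_ratio:
        return list(subsequence_sizes)
    # Largest total that keeps (extracted total / second_size) < set_ratio.
    target_total = math.ceil(set_ratio * second_size) - 1
    # Proportional share per subsequence, rounded down.
    return [target_total * size // first_size for size in subsequence_sizes]


# 10000 data points split into five subsequences, 2000 in the second sequence.
scales = extraction_scales([2000] * 5, second_size=2000, set_ratio=1.0)
```

Under this assumed rule and a set ratio of 1, each of the five subsequences contributes 399 data points, so the extracted total of 1995 stays below the 2000 data points of the second sequence.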
Step S130, performing data extraction on each template network intrusion data subsequence based on the extraction scale, so as to generate a third template network intrusion data sequence, and generating a sample learning data sequence based on the third template network intrusion data sequence and the second template network intrusion data sequence, where a scale ratio of a template configuration scale of the third template network intrusion data sequence to a template configuration scale of the second template network intrusion data sequence is smaller than a set ratio.
In this step, the server performs data extraction on each template network intrusion data subsequence according to the determined extraction scale. For example, the server randomly extracts 2000 data points from each sub-sequence, generating a third template network intrusion data sequence.
The server then combines the third template network intrusion data sequence and the second template network intrusion data sequence into a sample learning data sequence. The sample learning data sequence will be used for subsequent model training.
When generating the sample learning data sequence, the server needs to ensure that the scale ratio of the template configuration scale of the third template network intrusion data sequence to the template configuration scale of the second template network intrusion data sequence is smaller than the set ratio. This set ratio can be adjusted according to the actual situation and can be set to 1, for example.
By controlling the scale ratio, the server can ensure that the sample learning data sequence contains enough network intrusion data of different types so as to improve the generalization capability and accuracy of the model.
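The extraction and merging of step S130 can be sketched in Python as follows. The function name `build_sample_learning_sequence`, the tuple layout of the data points, and the fixed random seed are assumptions made for this sketch; the extraction scales are taken as given inputs.

```python
import random


def build_sample_learning_sequence(subsequences, scales, second_sequence,
                                   set_ratio, seed=0):
    """Randomly extract from each subsequence and merge with the second sequence."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    third_sequence = []
    for subsequence, scale in zip(subsequences, scales):
        # Random extraction without replacement from one subsequence.
        third_sequence.extend(rng.sample(subsequence, scale))
    # Enforce the constraint on the scale ratio stated in step S130.
    if len(third_sequence) / len(second_sequence) >= set_ratio:
        raise ValueError("third sequence is too large for the set ratio")
    samples = third_sequence + list(second_sequence)
    rng.shuffle(samples)
    return third_sequence, samples


subsequences = [[("apt", i) for i in range(k * 10, k * 10 + 10)] for k in range(3)]
second = [("port_scan", i) for i in range(10)]
third, samples = build_sample_learning_sequence(subsequences, [3, 3, 3], second, 1.0)
```

Raising an error when the ratio constraint is violated is a design choice of this sketch; an implementation could instead reduce the extraction scales and retry.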
Step S140, carrying out model parameter learning on the network intrusion prediction model based on the sample learning data sequence, and generating the network intrusion prediction model with the model parameter learning completed.
In this step, the server trains the network intrusion prediction model using the sample learning data sequence. The network intrusion prediction model may be a model based on a machine learning algorithm, such as a decision tree, a neural network, etc.
The server inputs the sample learning data sequence into the model and trains the model using an appropriate training algorithm. In the training process, the model learns the characteristics and modes of different network intrusion behaviors, so that the prediction capability of network intrusion is improved.
A specific training process may include the steps of:
1. Data preprocessing, namely preprocessing the sample learning data sequence, such as data cleaning, feature engineering and the like, so as to improve the quality and usability of the data.
2. Model selection, namely selecting a model suitable for network intrusion prediction, such as decision trees, neural networks and the like.
3. Training the model, namely training the model by using the preprocessed data, and adjusting parameters of the model to improve the performance of the model.
4. Model evaluation, namely evaluating the trained model by using a verification set or a test set, such as calculating indexes of accuracy, recall rate and the like, so as to evaluate the performance of the model.
5. Model adjustment, namely adjusting and optimizing the model according to the result of model evaluation, such as adjusting parameters of the model, adding training data and the like.
6. Repeating the training until the performance of the model meets the expected requirement.
Through continuous training and adjustment, the server can generate a network intrusion prediction model for completing model parameter learning.
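The training and evaluation steps above can be sketched in Python as follows. As a stand-in for the network intrusion prediction model, the sketch trains an ensemble of one-level decision trees, each on a randomly extracted subset of features, and measures accuracy on held-out data. The one-level trees, the function names, and the toy feature values are assumptions made for illustration; the application does not prescribe this learner.

```python
import random
from collections import Counter


def majority(labels, default):
    """Most frequent label, or the default when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def train_stump(samples, feature_ids):
    """Find the single-feature threshold split with the fewest training errors."""
    default = majority([y for _, y in samples], None)
    best = None
    for f in feature_ids:
        for threshold in sorted({x[f] for x, _ in samples}):
            left = [y for x, y in samples if x[f] <= threshold]
            right = [y for x, y in samples if x[f] > threshold]
            l_lab, r_lab = majority(left, default), majority(right, default)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, l_lab, r_lab)
    return best[1:]


def train_forest(samples, n_trees, n_features, seed=0):
    """Train n_trees stumps, each on a random subset of n_features features."""
    rng = random.Random(seed)
    dims = len(samples[0][0])
    return [train_stump(samples, rng.sample(range(dims), n_features))
            for _ in range(n_trees)]


def predict(forest, x):
    """Majority vote over the predictions of all trees."""
    votes = [l_lab if x[f] <= t else r_lab for f, t, l_lab, r_lab in forest]
    return Counter(votes).most_common(1)[0][0]


def accuracy(forest, samples):
    return sum(predict(forest, x) == y for x, y in samples) / len(samples)


train = [([0.9, 0.8, 0.7], "apt"), ([0.8, 0.9, 0.95], "apt"),
         ([0.1, 0.2, 0.3], "port_scan"), ([0.2, 0.1, 0.05], "port_scan")]
test = [([0.85, 0.85, 0.8], "apt"), ([0.15, 0.15, 0.1], "port_scan")]
forest = train_forest(train, n_trees=5, n_features=2)
test_accuracy = accuracy(forest, test)
```

In practice the model adjustment step would vary values such as `n_trees` and `n_features` and retrain until the evaluation result meets the expected requirement.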
Step S150, classifying the target network intrusion data based on the network intrusion prediction model with model parameter learning completed, generating a classification prediction result of the target network intrusion data, and performing security protection reinforcement configuration on the data center station to be constructed based on the classification prediction result of the target network intrusion data, wherein the classification prediction result represents the network intrusion type of the target network intrusion data.
In practical application, the server can monitor network traffic in real time and input the monitored target network intrusion data into the trained network intrusion prediction model. The model classifies the target network intrusion data and generates a classification prediction result.
For example, the model predicts that the target network intrusion data belongs to the APT attack category. The server then performs security protection reinforcement configuration on the data center station to be constructed according to the classification prediction result. For example, the server can add firewall rules, limit access to specific ports, strengthen the monitoring of the intrusion detection system so that APT attacks are discovered and blocked in time, and encrypt data to protect data security.
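The mapping from a classification prediction result to reinforcement measures can be sketched in Python as a lookup table. The action identifiers, the entry for port scanning, and the default entry are hypothetical placeholders chosen for this sketch; only the measures listed for the APT category come from the example above.

```python
# Hypothetical action identifiers; a real system would bind these to its own tooling.
HARDENING_ACTIONS = {
    "apt": [
        "add_firewall_rules",
        "restrict_port_access",
        "raise_ids_monitoring_level",
        "encrypt_data",
    ],
    # Assumed mapping, not taken from the application.
    "port_scan": ["restrict_port_access"],
}

# Assumed fallback for categories without a dedicated entry.
DEFAULT_ACTIONS = ["raise_ids_monitoring_level"]


def plan_hardening(predicted_category):
    """Return the reinforcement actions for a predicted intrusion category."""
    return list(HARDENING_ACTIONS.get(predicted_category, DEFAULT_ACTIONS))


apt_plan = plan_hardening("apt")
```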
Through the steps, the server can generate a sample learning data sequence by utilizing the first template network intrusion data sequence and the second template network intrusion data sequence, and train the network intrusion prediction model by using the sequence. Then, the server can classify the target network intrusion data by using the trained model, generate a classification prediction result, and perform safety protection strengthening configuration on the data center to be constructed according to the result, so that the safety of the data center is improved.
It should be noted that the above scenario is merely for helping to understand the specific operation of each step, and the situation in practical application may be more complex and diversified. In the practical implementation process, the adjustment and optimization are required according to specific requirements and conditions. Meanwhile, in order to ensure the security and privacy of data, the server needs to take corresponding security measures, such as data encryption, access control, etc., when processing network intrusion data.
In one possible implementation, step S110 includes:
Step S111, a reference intrusion log carrying an attack type tag is obtained from the initial network intrusion log, and the attack type tag characterizes the network intrusion type of the reference intrusion log.
Step S112, the reference intrusion log is encoded, and an intrusion path encoding vector of the reference intrusion log is generated.
Step S113, determining the first template network intrusion data sequence and the second template network intrusion data sequence based on the intrusion path encoding vector of the reference intrusion log.
In one possible implementation, step S111 includes:
Step S1111, obtaining the initial intrusion knowledge points of the reference intrusion log.
Step S1112, determining intrusion behavior evaluation information based on the initial intrusion knowledge points of the reference intrusion log.
Step S1113, based on the intrusion behavior evaluation information and the intrusion feature template, removing a noise intrusion log from the reference intrusion log, and generating an updated reference intrusion log.
In this embodiment, the server obtains a reference intrusion log carrying an attack type tag from the initial network intrusion log. These initial network intrusion logs may come from a number of sources, such as security monitoring systems within the enterprise, security agencies on the internet, historical data stores, etc. The server finds out the reference intrusion log carrying the attack type tag therein by screening and analyzing the logs.
For example, a server may obtain a large number of network intrusion logs from a security monitoring system within an enterprise. These logs record various intrusion events occurring in the enterprise network, including information on attack type, attack time, attack source, etc. The server finds the reference intrusion logs carrying the attack type tag by analyzing these logs. These reference intrusion logs may record some common network intrusion behaviors such as port scanning and malware infection, or some complex network attack behaviors such as Advanced Persistent Threat (APT) attacks and cross-site scripting (XSS) attacks.
Next, the acquired reference intrusion log is encoded, and an intrusion path encoding vector of the reference intrusion log is generated. This process can help the server better understand and analyze the information in the reference intrusion log, thereby improving the efficiency and accuracy of the construction of the data center station.
For example, the server may encode the reference intrusion log using a deep learning based encoding algorithm. The algorithm can convert the information in the reference intrusion log into a numeric intrusion path encoding vector. This intrusion path encoding vector may contain information on the attack type, the attack source, the attack target, etc. in the reference intrusion log, as well as the relationships and patterns among these pieces of information.
Based on the intrusion path encoding vector of the reference intrusion log, a first template network intrusion data sequence and a second template network intrusion data sequence are determined.
For example, the server may perform cluster analysis on the intrusion path encoding vectors of the reference intrusion log using a clustering algorithm-based method. The algorithm may divide the intrusion path encoding vectors of the reference intrusion log into different categories, each category representing a different network intrusion behavior. The server may determine a first template network intrusion data sequence and a second template network intrusion data sequence based on the results of the cluster analysis. The first template network intrusion data sequence may contain some complex network attack behaviors such as Advanced Persistent Threat (APT) attacks and cross-site scripting (XSS) attacks, while the second template network intrusion data sequence may contain some common network intrusion behaviors such as port scanning and malware infection.
The server needs to obtain a reference intrusion log carrying the attack type tag from the initial network intrusion log. The following is a specific example of a scenario:
Assume that a server obtains a large number of network intrusion logs from a security monitoring system within an enterprise. These logs record various intrusion events occurring in the enterprise network, including information on attack type, attack time, attack source, etc. The server first filters and analyzes the logs to find out the reference intrusion log carrying the attack type tag therein.
The server may use a rule-based screening algorithm to screen the initial network intrusion log. The algorithm can screen and analyze the initial network intrusion log according to preset rules to find the reference intrusion logs carrying the attack type tag. For example, the server may set a rule under which logs whose attack type is port scanning, malware infection, etc. are taken as reference intrusion logs.
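A minimal sketch of such rule-based screening in Python is given below. The dictionary layout of a log entry, the field name `attack_type`, and the function name `screen_reference_logs` are assumptions made for illustration.

```python
def screen_reference_logs(initial_logs, accepted_attack_types):
    """Keep only logs whose attack type tag matches a preset rule."""
    return [
        log for log in initial_logs
        # Logs without a tag, or with a tag outside the rule set, are dropped.
        if log.get("attack_type") in accepted_attack_types
    ]


initial_logs = [
    {"attack_type": "port_scan", "attack_source": "10.0.0.5"},
    {"attack_source": "10.0.0.7"},  # untagged entry
    {"attack_type": "malware_infection", "attack_source": "10.0.0.9"},
    {"attack_type": "heartbeat", "attack_source": "10.0.0.1"},
]
reference_logs = screen_reference_logs(
    initial_logs, {"port_scan", "malware_infection"}
)
```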
The server may also use a machine learning based screening algorithm to screen the initial network intrusion log. The algorithm can find the reference intrusion logs carrying the attack type tag by learning from and analyzing a large number of initial network intrusion logs. For example, the server may learn from and analyze the initial network intrusion log using a decision tree-based machine learning algorithm to find the reference intrusion logs carrying attack type tags.
Then, the acquired reference intrusion log needs to be encoded to generate an intrusion path encoding vector of the reference intrusion log. The following is a specific example of a scenario:
It is assumed that the server acquires some reference intrusion logs, and these reference intrusion logs record various intrusion events occurring in the enterprise network, including information such as attack type, attack time, attack source, and the like. The server first preprocesses the reference intrusion logs and converts them into a format suitable for encoding.
The server may encode the reference intrusion log using a deep learning based encoding algorithm. The algorithm can convert the information in the reference intrusion log into a numeric intrusion path encoding vector. For example, the server may encode the reference intrusion log using a Convolutional Neural Network (CNN) based encoding algorithm. The algorithm may convert the information in the reference intrusion log into a two-dimensional numeric intrusion path encoding vector, where each element represents an information element in the reference intrusion log.
The server may also encode the reference intrusion log using a clustering algorithm-based encoding algorithm. The algorithm can convert the information in the reference intrusion log into a numeric intrusion path encoding vector. For example, the server may encode the reference intrusion log using an encoding algorithm based on a K-Means clustering algorithm. The algorithm may convert information in the reference intrusion log into a numeric intrusion path encoding vector, where each element represents an information element in the reference intrusion log.
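The idea of turning log fields into a numeric vector can be sketched in Python with a much simpler encoder than the CNN-based or K-Means-based algorithms mentioned above. In this sketch, which is an assumption made for illustration, categorical fields are mapped to integer indices and numeric fields are scaled to the range 0 to 1. The field names and the function names are hypothetical.

```python
def build_vocab(logs, field):
    """Map each distinct value of a categorical field to an integer index."""
    return {value: index
            for index, value in enumerate(sorted({log[field] for log in logs}))}


def encode_logs(logs, categorical_fields, numeric_fields):
    """Encode each log as a list of floats, one element per field."""
    vocabs = {f: build_vocab(logs, f) for f in categorical_fields}
    ranges = {f: (min(log[f] for log in logs), max(log[f] for log in logs))
              for f in numeric_fields}
    vectors = []
    for log in logs:
        # Discrete index for each categorical field.
        vector = [float(vocabs[f][log[f]]) for f in categorical_fields]
        # Min-max scaling for each numeric field.
        for f in numeric_fields:
            low, high = ranges[f]
            vector.append((log[f] - low) / (high - low) if high > low else 0.0)
        vectors.append(vector)
    return vectors


logs = [
    {"attack_type": "port_scan", "attack_source": "10.0.0.5", "attack_time": 0},
    {"attack_type": "apt", "attack_source": "10.0.0.9", "attack_time": 50},
    {"attack_type": "apt", "attack_source": "10.0.0.5", "attack_time": 100},
]
vectors = encode_logs(logs, ["attack_type", "attack_source"], ["attack_time"])
```

Each resulting vector has one element per information element of the log, which matches the description above, although integer indices do not capture the relationships between fields that a learned encoder could.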
Next, a first template network intrusion data sequence and a second template network intrusion data sequence are determined based on the intrusion path encoding vectors of the reference intrusion log. The following is a specific example of a scenario:
Assume that the server has acquired intrusion path encoding vectors for some of the reference intrusion logs. The server first performs a cluster analysis on the intrusion path encoding vectors, classifying them into different categories.
The server may perform cluster analysis on the intrusion path encoding vectors using a method based on the K-Means clustering algorithm. Such an algorithm may divide the intrusion path encoding vectors into different categories, each category representing a different network intrusion behavior. For example, the server may divide the intrusion path encoding vectors into categories such as port scanning, malware infection, Advanced Persistent Threat (APT) attacks, and cross-site scripting (XSS) attacks.
The server may also perform cluster analysis on the intrusion path encoding vectors using a method based on a hierarchical clustering algorithm. Such an algorithm may divide the intrusion path encoding vectors into different levels, each level representing a different kind of network intrusion behavior. For example, the server may divide the intrusion path encoding vectors into a first layer for common network intrusion behaviors such as port scanning and malware infection, and a second layer for complex network attack behaviors such as Advanced Persistent Threat (APT) attacks and cross-site scripting (XSS) attacks.
The server then determines the first template network intrusion data sequence and the second template network intrusion data sequence according to the result of the cluster analysis. The second template network intrusion data sequence may contain common network intrusion behaviors such as port scanning and malware infection, while the first template network intrusion data sequence may contain complex network attack behaviors such as Advanced Persistent Threat (APT) attacks and cross-site scripting (XSS) attacks.
For example, the server may determine, as the second template network intrusion data sequence, the intrusion path encoding vectors that belong to common network intrusion behaviors such as port scanning and malware infection in the cluster analysis result, and determine, as the first template network intrusion data sequence, the intrusion path encoding vectors that belong to complex network intrusion behaviors such as Advanced Persistent Threat (APT) attacks and cross-site scripting (XSS) attacks.
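The clustering and assignment described above can be illustrated with a minimal sketch. It is not the implementation of this application: the function names, the toy two-dimensional vectors, the deterministic choice of initial centroids, and the rule that the larger cluster becomes the first template network intrusion data sequence (chosen only because the first sequence is stated to have the larger template configuration scale) are all assumptions made for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain K-Means on a list of numeric vectors; returns one label per point."""
    # Assumption: the first k points serve as the initial centroids (deterministic).
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Move every non-empty centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def split_into_template_sequences(vectors, labels):
    """Assumption: larger cluster -> first sequence, smaller -> second sequence.
    Assumes that at least two clusters are non-empty."""
    clusters = {}
    for vector, label in zip(vectors, labels):
        clusters.setdefault(label, []).append(vector)
    ordered = sorted(clusters.values(), key=len, reverse=True)
    return ordered[0], ordered[1]


# Hypothetical intrusion path encoding vectors (toy values).
vectors = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
labels = kmeans(vectors, k=2)
first_sequence, second_sequence = split_into_template_sequences(vectors, labels)
```

In practice the mapping from clusters to the two template sequences would follow the attack categories described above rather than cluster size alone.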
Through the steps, the server can acquire the first template network intrusion data sequence and the second template network intrusion data sequence which are applied to data center station construction. These sequences will be used for subsequent data center station construction and safety protection reinforcement configuration, improving the safety and reliability of the data center station.
According to the embodiment of the application, the template network intrusion data sequences of two different network intrusion categories are obtained, and intelligent processing is performed according to the template configuration scale of the template network intrusion data sequences, so that the method can more comprehensively and accurately cope with various network intrusion conditions, and the safety protection capability of a data center service system is improved. By dividing the first template network intrusion data sequence and determining the extraction scale based on the scale ratio, the data is efficiently utilized, the data processing flow is optimized, the calculation complexity is reduced, and the processing speed is improved. The third template network intrusion data sequence is generated and combined with the second template network intrusion data sequence to generate the sample learning data sequence, so that the comprehensiveness and accuracy of model learning are ensured, and a high-quality data basis is provided for subsequent model parameter learning. Model parameter learning is carried out on the network intrusion prediction model by utilizing the sample learning data sequence, so that the model can more accurately predict and identify various network intrusion behaviors, and the safety protection capability of the data center platform service system is enhanced. By classifying the target network intrusion data and carrying out safety protection reinforcement configuration on the data center to be constructed based on the classification prediction result, personalized protection of the data center to be constructed is realized, and safety and stability of the data center are effectively improved. 
Therefore, the safety protection capability of the data center station service system can be obviously improved, the data processing flow is optimized, the processing speed is improved, and the safety and stability operation of the data center station is ensured.
In one possible implementation, step S112 includes:
Step S1121, constructing an initial intrusion knowledge vector of the reference intrusion log according to the prior attack behavior record of the reference intrusion log, wherein the initial intrusion knowledge vector at least comprises an attack type, an attack source and an attack target.
Step S1122, constructing an attack policy penetration feature of the reference intrusion log according to the attack mode of the reference intrusion log.
Step S1123, performing serialization processing on the initial intrusion knowledge vector and the attack policy penetration feature of the reference intrusion log based on a set time window, and fusing the initial intrusion knowledge vectors and the attack policy penetration features of different time windows to obtain a fused initial intrusion knowledge vector and a fused attack policy penetration feature.
Step S1124, encoding the attack policy penetration feature together with at least one of the pre-fusion initial intrusion knowledge vector and the post-fusion initial intrusion knowledge vector to obtain a first encoding vector of the reference intrusion log.
Step S1125, processing the classification vector in the first encoding vector of the reference intrusion log to generate a second encoding vector of the reference intrusion log.
Step S1126, splicing the first encoding vector and the second encoding vector of the reference intrusion log to generate an intrusion path encoding vector of the reference intrusion log, where the intrusion path encoding vector of the reference intrusion log is a numeric intrusion path encoding vector.
In this embodiment, a priori attack behavior record of the reference intrusion log may be obtained from the historical database. These a priori attack records contain detailed information about various intrusion events that occurred in the past, such as attack type, attack source, attack targets, etc.
The server constructs an initial intrusion knowledge vector of the reference intrusion log according to the prior attack behavior record. The vector at least comprises key information such as the attack type, the attack source, and the attack target. For example, for a port scanning attack, the initial intrusion knowledge vector may be expressed as [attack type: port scanning, attack source: IP address 192.168.1.10, attack target: server A].
The server analyzes the attack mode of the reference intrusion log and extracts the penetration characteristics of the attack strategy. These features may reflect the behavioral patterns of the attacker and the intent of the attack.
For example, for a malware infection attack, the attack policy penetration characteristics may include the way the malware is propagated, the type of file infected, the operations performed, etc. The server constructs the features into a feature vector as attack strategy penetration features of the reference intrusion log.
The server sets a time window and divides the reference intrusion log into a plurality of subsequences in chronological order. For the reference intrusion log in each time window, the server performs serialization processing on the initial intrusion knowledge vector and the attack policy penetration feature, for example by concatenating them into one long vector. The server then fuses the serialized initial intrusion knowledge vectors and attack policy penetration features of the different time windows, for example by concatenating them into one longer vector. In this way, the server obtains a fused initial intrusion knowledge vector and a fused attack policy penetration feature that contain the information of different time windows.
The server selects at least one of the pre-fusion initial intrusion knowledge vector and the post-fusion initial intrusion knowledge vector, and encodes it together with the attack policy penetration feature. The encoding may take a variety of forms, such as converting a categorical variable into a numeric vector using one-hot encoding, or converting a continuous variable into a numeric vector using numeric encoding. The server combines the encoded results into one vector as the first encoding vector of the reference intrusion log.
The server extracts the classification vector, i.e. the part representing the attack type, from the first encoding vector. The classification vector is further processed, for example, by mapping it to a smaller range of values using a hash function, or by converting it to a low-dimensional vector using a dimension-reduction algorithm. The processed classification vector is taken as the second encoding vector of the reference intrusion log.
The server splices the first encoding vector and the second encoding vector to obtain a longer vector, which is used as the intrusion path encoding vector of the reference intrusion log. The intrusion path encoding vector is a numeric vector that contains the information of the reference intrusion log, such as the attack type, attack source, attack target, and attack policy penetration feature. The server stores the intrusion path encoding vectors in a database for subsequent analysis and processing.
Through the steps, the reference intrusion log is encoded, and an intrusion path encoding vector is generated. The intrusion path coding vector can be used for training and predicting a subsequent network intrusion prediction model, so that a server can better understand and analyze network intrusion behaviors, and the safety and reliability of a data center station are improved.
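As an illustration of steps S1124 to S1126, the following minimal sketch one-hot encodes the attack type, appends numeric fields to form a first encoding vector, hashes the classification part into a small range to form a second encoding vector, and splices the two. The vocabulary `ATTACK_TYPES`, the function names, the bucket count, and the use of SHA-256 as the hash function are assumptions for illustration only; they are not specified by this application.

```python
import hashlib

# Hypothetical attack-type vocabulary (assumption).
ATTACK_TYPES = ["port_scan", "malware", "apt", "xss"]


def one_hot(value, vocabulary):
    """One-hot encode a categorical value against a fixed vocabulary."""
    return [1.0 if value == entry else 0.0 for entry in vocabulary]


def hash_bucket(values, buckets=8):
    """Map a vector to a single value in [0, buckets) with a stable hash."""
    digest = hashlib.sha256(repr(values).encode("utf-8")).digest()
    return [float(digest[0] % buckets)]


def encode_log(attack_type, source_id, target_id, policy_features):
    # Classification vector: the part of the encoding that represents the attack type.
    classification = one_hot(attack_type, ATTACK_TYPES)
    # First encoding vector: classification part plus numeric fields.
    first = (classification
             + [float(source_id), float(target_id)]
             + [float(x) for x in policy_features])
    # Second encoding vector: the classification part mapped to a smaller range.
    second = hash_bucket(classification)
    # Intrusion path encoding vector: the two vectors spliced together.
    return first + second


vector = encode_log("port_scan", source_id=10, target_id=1,
                    policy_features=[0.5, 2])
```

A dimension-reduction algorithm could replace `hash_bucket` without changing the overall structure of the sketch.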
In one possible implementation, step S1124 includes:
Performing regularized conversion on the structured vectors in the attack policy penetration feature and in at least one of the pre-fusion initial intrusion knowledge vector and the post-fusion initial intrusion knowledge vector; and
Performing discretized conversion on the unstructured vectors in the attack policy penetration feature and in at least one of the pre-fusion initial intrusion knowledge vector and the post-fusion initial intrusion knowledge vector.
In this embodiment, the server acquires the pre-fusion initial intrusion knowledge vector, the post-fusion initial intrusion knowledge vector, and the attack policy penetration feature. The structured vectors therein, such as the attack type, attack source, and attack target, are then identified and converted according to predefined rules. For example, the attack type is converted to a corresponding numeric code, and the attack source and attack target are converted to their respective identifiers or indexes. After the regularized conversion, the structured vectors are in a form that is easier to process and analyze.
Next, processing of the pre-fusion initial intrusion knowledge vector and the post-fusion initial intrusion knowledge vector, as well as unstructured vectors in the attack strategy penetration feature, continues. Unstructured vectors may include specific behaviors of the attack, time series of attacks, etc. The server converts these unstructured vectors into discrete numerical representations using a discretization method. For example, the attack is divided into different categories and each category is assigned a discrete value. The discretized transformations may help the server convert unstructured information into a quantifiable and comparable form for subsequent analysis and processing.
Through the steps, the server encodes at least one of the initial intrusion knowledge vector before fusion and the initial intrusion knowledge vector after fusion and the attack strategy penetration feature. The regularized and discretized transformations enable such information to be represented in a more unified and processable manner, facilitating subsequent training and application of network intrusion prediction models.
The following is a specific example of a scenario to illustrate these steps:
It is assumed that the server receives a series of reference intrusion logs, which contain information such as the attack type, attack source, attack target, and specific behavior of the attack. The server first constructs an initial intrusion knowledge vector according to the prior attack behavior record, and constructs an attack policy penetration feature based on the attack mode.
For the pre-fusion initial intrusion knowledge vector, the server identifies the structured vectors in it, for example, the attack type "port scanning", the attack source "IP address 192.168.1.10", and the attack target "server A". The server converts the attack type into the numeric code "1" according to the rule, and converts the attack source and the attack target into corresponding identifiers or indexes.
For unstructured vectors in the attack policy penetration feature, such as specific behavior of the attack, the server classifies them into different categories, such as "fast scan", "deep scan", etc., and assigns a discrete value to each category.
Through such an encoding process, at least one of the pre-fusion initial intrusion knowledge vector and the post-fusion initial intrusion knowledge vector and the attack policy penetration feature are converted into a numerical-type encoded vector. The coded vectors can be used for training a subsequent network intrusion prediction model, and the model can learn the relations among different attack types, sources and targets and the modes of attack strategy penetration characteristics, so that the prediction capability of network intrusion is improved.
This encoding approach enables the server to process and analyze a large amount of intrusion log data more effectively, to extract valuable information, and to support data center station construction and security protection reinforcement configuration. Meanwhile, through the discretized conversion of unstructured vectors, the server can also better capture and represent the diversity and complexity of attack behaviors.
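The regularized and discretized conversions of the scenario above can be sketched as simple lookups. The code tables, field names, and the policy of assigning the next free index to a previously unseen attack source or target are assumptions for illustration; only the code "1" for port scanning and the "fast scan" and "deep scan" categories come from the example.

```python
# Rule table for the regularized conversion of the attack type.
# "port scanning" -> 1 follows the example; the other entry is hypothetical.
ATTACK_TYPE_CODES = {"port scanning": 1, "malware infection": 2}

# Discrete values for the unstructured attack behavior (hypothetical values).
BEHAVIOR_CODES = {"fast scan": 0, "deep scan": 1}


def index_of(value, registry):
    """Return the index of value, assigning the next free index on first sight."""
    return registry.setdefault(value, len(registry))


def convert(record, sources, targets):
    """Convert one log record into a numeric vector."""
    return [
        ATTACK_TYPE_CODES[record["attack_type"]],   # regularized conversion
        index_of(record["source"], sources),        # regularized conversion
        index_of(record["target"], targets),        # regularized conversion
        BEHAVIOR_CODES[record["behavior"]],         # discretized conversion
    ]


sources, targets = {}, {}
first_record = {"attack_type": "port scanning", "source": "192.168.1.10",
                "target": "server A", "behavior": "fast scan"}
second_record = {"attack_type": "port scanning", "source": "192.168.1.11",
                 "target": "server A", "behavior": "deep scan"}
encoded_first = convert(first_record, sources, targets)
encoded_second = convert(second_record, sources, targets)
```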
In a possible implementation, step S120 includes: when the scale ratio of the template configuration scale of the first template network intrusion data sequence to the template configuration scale of the second template network intrusion data sequence is greater than the set ratio, determining the extraction scale based on the set ratio, the template configuration scale of the template network intrusion data subsequence, and the template configuration scale of the second template network intrusion data sequence.
In this embodiment, the first template network intrusion data sequence is divided into a plurality of subsequences according to a certain rule or algorithm. These sub-sequences may be of equal length or may be partitioned according to a certain characteristic or pattern. For example, the server may divide the first template network intrusion data sequence into a plurality of sub-sequences in a time sequence, each sub-sequence representing intrusion data over a period of time.
Then, the template configuration scale of the first template network intrusion data sequence and the template configuration scale of the second template network intrusion data sequence are obtained, and the ratio of the two scales is calculated. For example, if the template configuration scale of the first template network intrusion data sequence is 1000 and the template configuration scale of the second template network intrusion data sequence is 200, the scale ratio is 5.
Then, the set ratio is obtained and the calculated scale ratio is compared with it. If the scale ratio is greater than the set ratio, the server executes the next step; otherwise, the server may adopt other processing modes.
The extraction scale of each template network intrusion data subsequence is then determined based on the set ratio, the template configuration scale of the template network intrusion data subsequence, and the template configuration scale of the second template network intrusion data sequence. The specific determination mode can be designed according to actual conditions. One possible way is to calculate the number of extractions required for each subsequence based on the set ratio and the template configuration scale of the second template network intrusion data sequence, and then to determine the extraction ratio of each subsequence according to the template configuration scale of the template network intrusion data subsequence. For example, the set ratio is 3, the template configuration scale of the second template network intrusion data sequence is 200, and the template configuration scale of the template network intrusion data subsequence is 100. Then, the number of extractions required for each subsequence is 200/3 ≈ 67, and the extraction ratio is 67/100 = 0.67.
Through the steps, the extraction scale of each template network intrusion data subsequence can be determined according to the scale ratio of the first template network intrusion data sequence to the second template network intrusion data sequence. Therefore, reasonable extraction can be performed according to the scale ratio while the data diversity is ensured, and a proper data sample is provided for subsequent processing and analysis.
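The arithmetic of the example with a set ratio of 3 can be reproduced with the following minimal sketch. The function names and the rounding to the nearest integer are assumptions for illustration; the sketch shows only the one possible determination mode described above, not a required formula.

```python
def exceeds_set_ratio(first_scale, second_scale, set_ratio):
    """True when the scale ratio of the first to the second sequence
    is greater than the set ratio."""
    return first_scale / second_scale > set_ratio


def extraction_scale(set_ratio, second_scale, subsequence_scale):
    """Return (number of extractions per subsequence, extraction ratio)."""
    # Number of extractions from the set ratio and the second sequence's scale.
    count = round(second_scale / set_ratio)
    # Extraction ratio relative to the scale of the subsequence.
    return count, count / subsequence_scale


# Values taken from the example: scales 1000 and 200, set ratio 3,
# subsequence scale 100.
needs_extraction = exceeds_set_ratio(1000, 200, 3)
count, ratio = extraction_scale(set_ratio=3, second_scale=200,
                                subsequence_scale=100)
```

With these inputs the scale ratio of 5 exceeds the set ratio of 3, so extraction is performed with 67 items per subsequence, i.e. an extraction ratio of 0.67.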
The following is a specific example of a scenario to illustrate these steps:
Assume that the server receives a first template network intrusion data sequence whose template configuration scale is 10000. Meanwhile, the server also acquires a second template network intrusion data sequence whose template configuration scale is 2000. The set ratio is 5.
The server first divides the first template network intrusion data sequence into a plurality of template network intrusion data subsequences. For example, the server may divide it into 100 sub-sequences, each with a template configuration size of 100.
The server then calculates the scale ratio, i.e. 10000/2000=5. Since the scale ratio is greater than the set ratio of 5, the server proceeds to determine the scale of the extraction.
And according to the set ratio and the template configuration scale of the second template network intrusion data sequence, the server calculates the number of required extraction of each subsequence to be 2000/5=400. Then, the server determines that the extraction ratio of each sub-sequence is 400/100=4 according to the template configuration scale of the template network intrusion data sub-sequence.
By the determination mode, the server can extract 400 data points from each template network intrusion data subsequence, so that the extracted data is ensured to have certain representativeness and diversity.
The extraction scale determination mode can be adjusted and optimized according to actual conditions so as to adapt to different data characteristics and analysis requirements. Meanwhile, the server can further improve the determination method of the extraction scale by considering other factors such as data distribution, importance of features and the like, and improve the quality of the data and the accuracy of analysis.
In one possible implementation, step S140 includes:
Step A110, arbitrarily extracting a plurality of features of each sample learning data in the sample learning data sequence, and inputting the features into an initial learning algorithm to generate a risk weighted prediction tree.
Step A120, iteratively executing the step of arbitrarily extracting a plurality of features of each sample learning data in the sample learning data sequence and inputting the features into the initial learning algorithm to generate a risk weighted prediction tree, until X risk weighted prediction trees are generated, wherein X is a positive integer greater than 1.
Step A130, performing ensemble learning on the X risk weighted prediction trees to generate the network intrusion prediction model.
In this embodiment, it is assumed that the server receives a sample learning data sequence, which contains a large amount of network intrusion data. Such data may come from a number of sources, such as security monitoring systems within the enterprise, security agencies on the internet, historical data stores, and the like.
The server first preprocesses the sample learning data sequence, including operations such as data cleaning and feature engineering. Data cleaning removes noise and outliers, and feature engineering extracts the features related to network intrusion.
Next, the server starts to perform the step of model parameter learning. It arbitrarily extracts a plurality of features of each sample learning data in the sample learning data sequence and inputs the features into the initial learning algorithm.
The initial learning algorithm calculates weights for each feature from the input features and generates a risk weighted prediction tree based on these weights. This process is similar to the construction of decision trees, but takes into account the weights of the features when computing node splits.
The server iteratively performs this step, continually generating new risk weighted prediction trees. At each iteration, it randomly selects a portion of the data from the sample learning data sequence and extracts features of the data, which are then input into the initial learning algorithm.
In this way, the server gradually generates a plurality of risk weighted prediction trees. For example, it may generate 10 risk weighted prediction trees.
After the multiple risk weighted prediction trees are generated, the server starts to perform ensemble learning. The 10 risk weighted prediction trees are combined into a whole, and a comprehensive judgment is made according to the prediction results of the trees.
There are various methods of ensemble learning, such as the averaging method and the voting method. In this example, the server may use the voting method, i.e. each risk weighted prediction tree votes with its prediction result, and the category with the most votes is selected as the final prediction result.
For example, for a new piece of network intrusion data, each of the 10 risk weighted prediction trees gives a prediction result. Of these, 6 trees predict category A, 3 trees predict category B, and 1 tree predicts category C. The server then classifies the network intrusion data into category A based on the voting result.
Through the ensemble learning process, the server generates a network intrusion prediction model with model parameter learning completed. This model may be used for classification prediction of new network intrusion data.
In practical application, the server can deploy the model to a data center station, monitor network traffic in real time, and conduct rapid classification prediction on new network intrusion data. According to the prediction result, the server can take corresponding security protection measures, such as strengthening firewall rules, limiting access to specific ports, raising alarms and the like, so as to improve the security of the data center station.
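Steps A110 to A130 can be sketched as follows. Because the internals of the risk weighted prediction tree and of the initial learning algorithm are not specified here, the sketch substitutes a one-level decision stump for each tree and does not model the feature weights; this substitution, the function names, and the toy data are assumptions for illustration only. The random feature extraction per tree and the majority vote follow the description above.

```python
import random
from collections import Counter


def train_stump(samples, labels, feature_ids):
    """Stand-in for one tree: best single (feature, threshold) split."""
    best = None
    for f in feature_ids:
        for threshold in sorted({s[f] for s in samples}):
            left = [y for s, y in zip(samples, labels) if s[f] <= threshold]
            right = [y for s, y in zip(samples, labels) if s[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = (Counter(right).most_common(1)[0][0]
                           if right else left_label)
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]


def predict_stump(stump, sample):
    f, threshold, left_label, right_label = stump
    return left_label if sample[f] <= threshold else right_label


def train_forest(samples, labels, x_trees, features_per_tree, seed=0):
    """Generate X trees, each from an arbitrarily extracted feature subset."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    return [train_stump(samples, labels,
                        rng.sample(range(n_features), features_per_tree))
            for _ in range(x_trees)]


def predict_forest(forest, sample):
    """Voting method: the category with the most votes wins."""
    votes = Counter(predict_stump(tree, sample) for tree in forest)
    return votes.most_common(1)[0][0]


# Toy sample learning data (hypothetical values).
samples = [(0, 0, 1), (1, 0, 0), (8, 9, 7), (9, 8, 9)]
labels = ["A", "A", "B", "B"]
forest = train_forest(samples, labels, x_trees=10, features_per_tree=2)
```

Calling `predict_forest(forest, (0, 0, 0))` then returns the category chosen by the majority of the ten stand-in trees.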
In one possible implementation, step S140 further includes:
Step B110, dividing the sample learning data sequence into a first subsequence and a second subsequence.
Step B120, performing model parameter learning on the network intrusion prediction model according to initialization weight information and the first subsequence, wherein the initialization weight information comprises the scale of the template network intrusion data subsequence, the value of X, and the scale of the features arbitrarily extracted from each sample learning data in the sample learning data sequence.
Step B130, learning the network intrusion prediction behaviors of the network intrusion prediction model.
Step B140, continuing model parameter learning on the network intrusion prediction model according to the model parameter information obtained by learning the network intrusion prediction behaviors, generating the network intrusion prediction model with model parameter learning completed, and testing that model according to the second subsequence.
In this embodiment, the sample learning data sequence is divided into a first sub-sequence and a second sub-sequence. The first subsequence is used for carrying out model parameter learning on the network intrusion prediction model, and the second subsequence is used for testing the network intrusion prediction model with model parameter learning completed.
The server performs model parameter learning on the network intrusion prediction model according to the initialization weight information and the first subsequence. The initialization weight information includes the scale of the template network intrusion data subsequence, the value of X, and the scale of the features arbitrarily extracted from each sample learning data in the sample learning data sequence. By using the initialization weight information, the server can better control the learning process of the model and improve the accuracy and generalization capability of the model.
In the model parameter learning process, the server learns network intrusion prediction behaviors of the network intrusion prediction model. This means that the server uses the data in the sample learning data sequence to train the model so that it can learn the characteristics and patterns of different network intrusion behaviour. By continuously adjusting the parameters of the model, the server can improve the prediction capability of the model on network intrusion.
After model parameter learning is completed, the server generates a network intrusion prediction model for completing model parameter learning. This model has learned knowledge in the sample learning data sequence and can be used to predict new network intrusion data.
Finally, the server tests the network intrusion prediction model with model parameter learning completed according to the second subsequence. By applying the model to the data in the second subsequence, the server can evaluate the performance and accuracy of the model. If the model performs well in the test, the server may apply it to actual network intrusion prediction. If the model performs poorly in the test, the server may further adjust the parameters of the model or retrain the model to improve its performance.
The following is a specific example of a scenario illustrating these steps:
It is assumed that the server receives a sample learning data sequence that contains a large amount of network intrusion data. The server divides the sample learning data sequence into a first subsequence and a second subsequence.
The server performs model parameter learning on the network intrusion prediction model according to the initialization weight information and the first subsequence. In this example, the initialization weight information specifies that the scale of the template network intrusion data subsequence is 1000, that X is 10, and that the scale of the features arbitrarily extracted from each sample learning data in the sample learning data sequence is 50.
In the model parameter learning process, the server learns network intrusion prediction behaviors of the network intrusion prediction model. The server uses the data in the sample learning data sequence to train the model so that it can learn the characteristics and patterns of different network intrusion behaviors. For example, the server may find that certain features are associated with certain types of network intrusion behavior, thereby adjusting parameters of the model to better predict these behaviors.
After model parameter learning is completed, the server generates a network intrusion prediction model for completing model parameter learning. This model has learned knowledge in the sample learning data sequence and can be used to predict new network intrusion data.
Finally, the server tests the network intrusion prediction model with model parameter learning completed according to the second subsequence. The server applies the model to the data in the second subsequence and evaluates the performance and accuracy of the model. For example, the server may calculate metrics such as the accuracy, recall, and F1 value of the model. If the model performs well in the test, the server may apply it to actual network intrusion prediction. If the model performs poorly in the test, the server may further adjust the parameters of the model or retrain the model to improve its performance.
Through the above steps, the server can use the sample learning data sequence to perform model parameter learning on the network intrusion prediction model and generate the network intrusion prediction model with model parameter learning completed. The model can be used to predict new network intrusion data and to improve the safety and reliability of the data center station.
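The division into a first and a second subsequence and the test metrics mentioned above can be sketched as follows. The 80/20 split fraction, the function names, and the toy labels are assumptions for illustration; the metric definitions are the standard ones for accuracy, precision, recall, and F1 with respect to one positive category.

```python
def split_sequence(sequence, train_fraction=0.8):
    """Divide a sequence into a first (training) and a second (test) subsequence."""
    cut = int(len(sequence) * train_fraction)
    return sequence[:cut], sequence[cut:]


def evaluate(y_true, y_pred, positive):
    """Accuracy, recall and F1 value with respect to one positive category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall, "f1": f1}


# Hypothetical sample learning data sequence of ten items.
first_subsequence, second_subsequence = split_sequence(list(range(10)))

# Hypothetical true and predicted categories on the second subsequence.
metrics = evaluate(y_true=[1, 1, 0, 0], y_pred=[1, 0, 0, 0], positive=1)
```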
In one possible implementation, step S150 includes:
Step S151, encoding the acquired target network intrusion data set to generate an intrusion path encoding vector of the target network intrusion data.
Step S152, inputting the intrusion path encoding vector of the target network intrusion data into each of the risk weighted prediction trees in the network intrusion prediction model with model parameter learning completed, so that each of the risk weighted prediction trees outputs network intrusion classification data.
Step S153, generating the classification prediction result based on the network intrusion classification data output by each risk weighted prediction tree.
In a possible implementation manner, the network intrusion classification data includes the confidence that the target network intrusion data belongs to each of the different network intrusion categories, and step S153 includes:
Step S1531, for each network intrusion category, calculating the scale of the network intrusion classification data in which the confidence of that network intrusion category is greater than a preset confidence.
Step S1532, determining the network intrusion category whose scale is greater than half of X as the classification prediction result.
In practical applications, the server needs to classify the target network intrusion data to generate a classification prediction result. The following is a specific example of a scenario illustrating these steps:
it is assumed that the server receives a target network intrusion data set, which contains a large amount of network intrusion data. The server first encodes the data set to generate an intrusion path encoding vector for the target network intrusion data. This process may use the aforementioned method to convert the target network intrusion data into a numeric intrusion path encoded vector for subsequent processing and analysis.
Next, the server inputs the intrusion path coding vector of the target network intrusion data into each risk weighted prediction tree in the network intrusion prediction model that completes model parameter learning. Each risk weighted prediction tree outputs a network intrusion classification data according to the input intrusion path coding vector. This classification data represents the confidence that the target network intrusion data belongs to different network intrusion categories.
For example, assume that the network intrusion prediction model contains three risk weighted prediction trees: tree 1, tree 2 and tree 3. The server inputs the intrusion path coding vector of the target network intrusion data into the three trees. The network intrusion classification data output by tree 1 is [0.8, 0.1, 0.1], indicating that the confidence that the target network intrusion data belongs to category 1 is 0.8, the confidence that it belongs to category 2 is 0.1, and the confidence that it belongs to category 3 is 0.1. The network intrusion classification data output by tree 2 is [0.7, 0.2, 0.1], indicating that the confidence for category 1 is 0.7, the confidence for category 2 is 0.2, and the confidence for category 3 is 0.1. The network intrusion classification data output by tree 3 is [0.6, 0.3, 0.1], indicating that the confidence for category 1 is 0.6, the confidence for category 2 is 0.3, and the confidence for category 3 is 0.1.
Then, the server generates a classification prediction result based on the network intrusion classification data output by each risk weighted prediction tree. In one possible implementation, the server calculates, for each network intrusion category, the scale of the network intrusion classification data in which the confidence of that category is greater than a preset confidence, that is, the number of trees whose output confidence for that category exceeds the preset confidence. For example, assume a preset confidence of 0.5. In this example, the scale for category 1 is 3, because the confidences 0.8, 0.7 and 0.6 are all greater than 0.5. The scale for category 2 is 0, because none of the confidences 0.1, 0.2 and 0.3 is greater than 0.5. The scale for category 3 is also 0, because every tree outputs a confidence of 0.1 for that category.
Finally, the server determines the network intrusion category whose scale is greater than half of X as the classification prediction result. In this example, X is 3, so half of X is 1.5. Since the scale of the network intrusion classification data in which the confidence of category 1 is greater than 0.5 is 3, which is greater than 1.5, the server determines category 1 as the classification prediction result.
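The threshold-and-majority rule of steps S1531 and S1532 can be sketched as follows. This is a minimal illustrative sketch under stated assumptions, not the implementation of the present application: the function names `count_votes` and `classify`, the 0-based category indexing, and the tie-breaking rule (the lowest index wins) are hypothetical choices introduced here, and X is simply passed in as a parameter.

```python
from typing import List, Optional, Sequence


def count_votes(tree_outputs: Sequence[Sequence[float]],
                preset_confidence: float) -> List[int]:
    """Step S1531 (sketch): for each category, count the trees whose
    confidence for that category is greater than the preset confidence."""
    num_categories = len(tree_outputs[0])
    return [
        sum(1 for output in tree_outputs if output[c] > preset_confidence)
        for c in range(num_categories)
    ]


def classify(tree_outputs: Sequence[Sequence[float]],
             preset_confidence: float,
             x: int) -> Optional[int]:
    """Step S1532 (sketch): return the 0-based index of the category whose
    count is greater than half of x, or None if no category qualifies.

    Assumption: if several categories qualify (possible with a low preset
    confidence), the one with the highest count is chosen, and ties go to
    the lowest index.
    """
    counts = count_votes(tree_outputs, preset_confidence)
    best = max(range(len(counts)), key=lambda c: counts[c])
    return best if counts[best] > x / 2 else None


# Confidence vectors of tree 1, tree 2 and tree 3 from the example.
example_outputs = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
]

if __name__ == "__main__":
    print(count_votes(example_outputs, 0.5))   # per-category counts
    print(classify(example_outputs, 0.5, 3))   # 0-based category index
```

With the confidence values of the three trees, a preset confidence of 0.5 and X equal to 3, the sketch counts 3, 0 and 0 trees for categories 1, 2 and 3 respectively and returns index 0, which corresponds to category 1.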
Through the above steps, the target network intrusion data can be classified based on the network intrusion prediction model with model parameter learning completed, and a classification prediction result is generated. The classification prediction result can be used for subsequent safety protection strengthening configuration, thereby improving the security and reliability of the data center station.
Further, FIG. 2 shows a schematic hardware structure of a data center service system 100 for implementing the method provided by the embodiments of the present application. As shown in FIG. 2, the data center service system 100 may include at least one processor 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, a transmission device 106 for communication functions, and a controller 108. Those of ordinary skill in the art will appreciate that the configuration shown in FIG. 2 is merely illustrative and does not limit the configuration of the data center service system 100. For example, the data center service system 100 may include more or fewer components than shown in FIG. 2, or have a configuration different from that shown in FIG. 2.
The memory 104 may be used to store software programs and modules of application software, such as the program instructions corresponding to the above method embodiments of the present application. The processor 102 executes the software programs and modules stored in the memory 104 to perform various functional applications and data processing, that is, to implement the data processing method applied to data center station construction described above. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and such remote memory may be connected to the data center service system 100 via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the data center service system 100. In one example, the transmission device 106 includes a network adapter that can connect to other network equipment through a base station so as to communicate with the Internet. In one example, the transmission device 106 may be a radio frequency module for communicating with the Internet wirelessly.
It should be noted that the order of the embodiments of the present application is for description only and does not represent the relative merits of the embodiments. The foregoing description has been directed to specific embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in an order different from the order in the embodiments and still achieve desirable results. Additionally, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
All the embodiments of the present application are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment mainly describes its differences from the other embodiments. In particular, because the other embodiments above are substantially similar to the method embodiments, they are described relatively briefly; for the relevant parts, refer to the description of the method embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.