Disclosure of Invention
To solve the above technical problem, the invention provides a security threat collaborative modeling method and system.
The invention is realized as follows. A security threat collaborative modeling method is provided, comprising the following steps:
1) data sharing: network entity data is partially shared among the data sources provided by a plurality of participants, and network entity data that requires desensitization is desensitized before sharing;
2) data fusion: attribute fusion, relationship fusion, behavior fusion and label fusion are performed on the data shared in step 1) to form attribute fusion data, relationship fusion data, behavior fusion data and label fusion data;
3) data feature extraction: data features are extracted separately from the attribute fusion data, the relationship fusion data and the behavior fusion data obtained in step 2);
4) modeling: a type of modeling method is selected as needed, data features and label fusion data are selectively loaded, the specific machine learning training process is executed to generate a trained model, and the trained model is output;
5) auditing: the audit runs on a blockchain or collaboration platform shared by the participants, keeps accounts of the data flows in steps 1), 2), 3) and 4), calculates contribution scores for the participants according to set rules, and rewards the participants according to their contribution scores.
Preferably, in step 1), the shared data includes the following types:
entity attribute data, association data among entities, entity behavior record data and entity assessment label data.
Further preferably, in step 1), different network entities are identified by different identifiers, network entity data that requires desensitization is identified by desensitized identifiers, and a rainbow table is established that maps the identifiers of such network entity data to their desensitized identifiers.
Further preferably, in step 1), the triggering modes of data sharing include active sharing and request-response sharing; the modes of data release include community release and point-to-point release; and the data sharing scenarios include voluntary sharing and legally mandated sharing.
Further preferably, in step 2):
attribute fusion means fusing, according to a predefined fusion strategy, the attributes of the same network entity obtained from different participants; specifically, different attribute information recorded for the same network entity at different data sources is merged so that the sources complement one another, and identical attribute information is deduplicated and corrected;
relationship fusion means fusing the association relationships between pairs of network entities to form a graph-structured network entity relationship base; specifically, different relationship information recorded for the same pair of network entities at different data sources is merged so that the sources complement one another, and identical relationship information is deduplicated and corrected;
behavior fusion means fusing the behavior information recorded for the same network entity at different data sources, so that scattered multi-source behavior information is integrated into a more comprehensive and complete observation record for each network entity; specifically, the different behavior records of the same network entity are arranged in chronological order, and identical behavior records from different participants are deduplicated and corrected;
label fusion applies when different participants each provide assessment labels for the same entity: after receiving the assessment labels sent by other participants, each reader executes its local trust-adoption strategy and integrates the information from all participants, assigning higher trust to a label given by multiple parties and keeping the differing labels given by different parties as supplements, thereby obtaining a confidence for each label.
Further preferably, in step 3):
the data features of the attribute fusion data include: IP location, domain name registration time and file modification time;
the data features of the relationship fusion data include: the in-degree and out-degree of graph nodes, the number of IP addresses associated with a domain name, the number of NS servers associated with a domain name, and the in-degree and out-degree of a domain name node restricted to its IP-type neighbor nodes;
the data features of the behavior fusion data include count features and sensitive behavior features, wherein the count features include the number of lateral communications, the number of outbound connections and the number of file accesses, and the sensitive behavior features include modification of startup items, connections to overseas hosts and access to the registry.
Further preferably, in step 4), the modeling methods include local training, federated learning, intelligence aggregation and ensemble learning, specifically:
local training: any participant executes a machine learning training task as needed, using the data features and label fusion data it holds, to obtain an AI model;
federated learning: a plurality of participants agree to jointly train an AI model without fully sharing their data;
intelligence aggregation: a participant applies a customized trust-adoption strategy to the network entities labeled as malicious and takes the desensitized network entities carrying a malicious label as customized IOC indicators; the IOC indicators, together with the confidence threshold customized by each participant, form a class of decision model that judges whether input network entity data matches a known IOC indicator whose confidence is greater than the confidence threshold, thereby obtaining an IOC model;
ensemble learning: local training, federated learning and intelligence aggregation are run under different parameter settings to generate a plurality of threat assessment models, and the preliminary assessment results of these models are combined by voting to produce a final assessment result with higher credibility; using local training, federated learning or intelligence aggregation alone is a special case of ensemble learning.
More preferably, in step 5), the rules for calculating contribution scores are set as follows:
for attribute data that is shared and read by other participants, the basic contribution score for a single attribute is denoted Sa;
for relationship data that is shared and read by other participants, the basic contribution score for a single relationship is denoted Sr;
for behavior data that is shared and read by other participants, the basic contribution score for a single behavior event is denoted Sb;
for label data that is shared and read by other participants, the basic contribution score for a single label is denoted Sl;
for transparent data that is shared and used for machine learning, the contribution score for a single item of transparent data is denoted Sf;
for network entity data shared directionally at the request of another participant, the contribution score for a single item of network entity data is denoted St;
after a reader reads data, the corresponding contribution score is deducted from the reader; the reader may optionally rate the data it has read; if the difference between the reader's rating and the average rating of the data is within a certain range, the reader obtains a certain contribution score; when the average rating of the data is higher than a certain value, the participant that shared the data obtains a score reward, and when the average rating is lower than a certain value, a certain score is deducted from the participant that shared the data;
each participant holds a certain initial score by default.
The invention also provides a security threat collaborative modeling system, which comprises the following modules:
the data sharing unit is used for partially sharing network entity data among a plurality of participants and for desensitizing, before sharing, the network entity data that requires desensitization;
the data fusion unit is used for performing attribute fusion, relationship fusion, behavior fusion and label fusion on the shared data to form attribute fusion data, relationship fusion data, behavior fusion data and label fusion data;
the data feature extraction unit is used for respectively extracting data features of the attribute fusion data, the relationship fusion data and the behavior fusion data;
the modeling unit comprises a data loading unit, a model training unit and a model output unit and selects a type of modeling method as needed; the data loading unit is used for selectively loading data features and label fusion data, the model training unit is used for executing the specific machine learning training process to generate a trained model, and the model output unit is used for outputting the trained model;
and the auditing unit runs on a blockchain or collaboration platform shared by all participants and is used for keeping accounts of the data flows in the data sharing unit, the data fusion unit, the data feature extraction unit and the modeling unit, calculating contribution scores for the participants according to set rules, and rewarding the participants according to their contribution scores.
Compared with the prior art, the invention has the advantages that:
1. a unified data circulation framework is provided for both industry sharing and regulatory sharing of threat intelligence; data sharing behavior is quantified and settled in the form of contribution scores, which encourages data trading and data sharing; and consensus on each party's contribution is reached through blockchain or collaboration platform technology;
2. entity IDs are exchanged in desensitized form, so joint modeling and threat IOC matching are still supported while data security is ensured;
3. the method supports both ordinary local machine learning and federated learning; each participant can choose its degree of data openness as needed, obtain external data resources within its own budget, and complete AI modeling independently or cooperatively to strengthen security intelligence.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
To address the problems in the prior art, the invention mainly concerns the fusion of AI feature data and AI label data and, on the basis of this fusion mechanism, realizes a collaborative machine learning capability. The invention advocates desensitizing the data to a certain degree and restoring desensitized data only in the few cases where it is necessary. Specifically:
Referring to Fig. 1, the invention provides a security threat collaborative modeling method comprising the following steps:
1) Data sharing: network entity data is partially shared among the data sources provided by a plurality of participants, and network entity data that requires desensitization is desensitized before sharing.
In step 1), the participants may be different organizations, departments, persons or devices that do not fully share data with one another; each data item is regarded as being provided by a sharer to a reader.
In step 1), the shared data includes the following types:
entity attribute data, association data among entities, entity behavior record data and entity assessment label data. Specifically:
Entity attribute data is a description of a network entity and is typically used to construct AI feature data. For example, the attributes of an IP entity include its geographic location, operator, owning organization, and the like. The attributes of a domain name entity include the domain name text itself, the registration date, the first observation date, the last observation date, and the like. The attributes of a file sample include file format, size on disk, modification time, and the like.
Association data among entities includes, for example, the resolution relationship between a domain name and an IP address, the execution relationship between a malicious sample and a domain name, and the subordinate relationship between a domain name and a URL. The relationships between a domain name and its registrant and registration mailbox also fall into this category.
Entity behavior record data includes, for example, firewall logs, IDS logs and DNS logs.
Entity assessment label data represents each participant's assessment of whether a network entity is malicious (black) or suspicious (gray).
In step 1), so that entities are labeled uniquely, different network entities are identified by different identifiers, network entity data that requires desensitization is identified by desensitized identifiers, and a rainbow table is established that maps the identifiers of such network entity data to their desensitized identifiers. Specifically:
Identifier: an identifier is the unique identification of a network entity in its conventional representation. For example, the identifier of an IPv4 address is its dotted-decimal notation, the identifier of an IPv6 address is its colon-separated groups of 4 hexadecimal digits, the identifier of a port service is the combination of a specific IP address and a port number, and the identifiers of a domain name and a URL are their text. A file sample is identified by a combination of its hash values, e.g. (MD5, SHA256, SHA512). In short, an identifier is a textual expression of a network entity.
Desensitized identifier: a file sample does not need desensitization, so its desensitized identifier is defined as its identifier itself. The desensitized identifier of any other network entity is a combination of hash values of its identifier, e.g. (MD5, SHA256, SHA512). Each participant maintains the mapping between identifiers and desensitized identifiers, namely the rainbow table. Because of the nature of hash values, after data is shared a reader generally cannot obtain the original identifier directly, but it can still attempt a collision using its own data; only for a desensitized identifier that collides successfully can the reader recover the original identifier from its rainbow table. If the collision fails and the reader still wants the original identifier, it can send a point-to-point request to the data provider and offer a reward. The sharer can approve the request and send the answer to the reader point-to-point, obtaining contribution scores in return; this is called transparentization of the entity identifier.
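The desensitization and collision procedure described above can be illustrated with a minimal Python sketch. It assumes the (MD5, SHA256, SHA512) combination given as an example in the text; the class name RainbowTable, the method names and the sample identifiers are hypothetical and are not part of the invention.

```python
import hashlib


def desensitize(identifier):
    """Return the (MD5, SHA256, SHA512) hex digests of a textual identifier."""
    raw = identifier.encode("utf-8")
    return (
        hashlib.md5(raw).hexdigest(),
        hashlib.sha256(raw).hexdigest(),
        hashlib.sha512(raw).hexdigest(),
    )


class RainbowTable:
    """Hypothetical per-participant map from desensitized to original identifiers."""

    def __init__(self):
        self._by_digest = {}

    def add(self, identifier):
        """Record an identifier this participant has observed locally."""
        digest = desensitize(identifier)
        self._by_digest[digest] = identifier
        return digest

    def collide(self, digest):
        """Return the original identifier if it is known locally, else None."""
        return self._by_digest.get(tuple(digest))


# Participant A shares only the digests of a domain name it has observed.
shared = desensitize("bad.example.com")

# Participant B recovers the identifier only because it already holds it.
table_b = RainbowTable()
table_b.add("bad.example.com")
table_b.add("192.0.2.10")
recovered = table_b.collide(shared)
unknown = table_b.collide(desensitize("other.example.net"))
```

In this sketch a failed collision simply returns None; the point-to-point request for the original identifier would be a separate exchange with the sharer.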
In addition to selective desensitization of network entity identifiers, the specific values of network entity attributes, relationships, behaviors and labels may also be desensitized. A participant can simply announce that it holds such information without disclosing the specific values. When other participants receive the announcement, they treat the missing value as a special value and record its original source, namely the ID of the participant that actually holds the value. Such a value is called opaque data. A reader can apply to the data sharer for the real value of opaque data. If the sharer approves the application, the data is said to be made transparent. In addition, opaque data can be used for federated learning modeling.
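One possible in-memory representation of opaque data is sketched below in Python. The record layout, the field names and the functions announce and make_transparent are assumptions made for illustration only; the text does not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SharedValue:
    """One value as seen by a reader; all field names are hypothetical."""
    entity: str                   # desensitized identifier of the network entity
    key: str                      # name of the attribute, relation, behavior or label
    holder_id: str                # participant that actually holds the value
    value: Optional[Any] = None   # None stands for the special "missing" value
    opaque: bool = True           # True until the holder approves disclosure


def announce(entity, key, holder_id):
    """The holder announces that it has a value without revealing it."""
    return SharedValue(entity, key, holder_id)


def make_transparent(item, real_value, approved):
    """Return a transparent copy if the holder approves, else the item unchanged."""
    if not approved:
        return item
    return SharedValue(item.entity, item.key, item.holder_id, real_value, opaque=False)


notice = announce("digest-of-entity", "registration_date", "participant-A")
denied = make_transparent(notice, "2020-01-01", approved=False)
granted = make_transparent(notice, "2020-01-01", approved=True)
```

Keeping holder_id on every record preserves the source of the value, so a reader always knows which participant to ask for disclosure.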
In step 1), the triggering modes of data sharing include active sharing and request-response sharing; the modes of data release include community release and point-to-point release; and the data sharing scenarios include voluntary sharing and legally mandated sharing.
In the active sharing mode, the participants establish subscription relationships and take the roles of publisher and subscriber. Each publisher discloses, in batches and on its own initiative, part of the data it holds; each subscriber chooses as needed whether to accept a batch, and only subscribers that accept a batch have the right to read it. In request-response sharing, a subscriber that is interested in a network entity but lacks knowledge of it publishes the entity's desensitized identifier or original identifier so that other participants can provide information. The other participants provide relevant data according to the information they hold, and the subscriber that initiated the request decides whether to accept the data.
In a specific regulatory environment, the method supports a legally mandated sharing mechanism. A regulatory body can take part in data sharing as a special participant. Information that regulations require the regulatory body to disclose must be provided by the regulatory body to all other participants through active sharing; for information that regulations require to be reported proactively to the regulatory body, the other participants must accept the corresponding regulatory body as a subscriber in active sharing; and for information that must be provided in response to a regulatory inquiry, the information publisher must respond properly to the regulatory body's request in accordance with the relevant regulations. Where no legal obligation exists, data sharing among the participants is voluntary industry sharing.
2) Data fusion: attribute fusion, relationship fusion, behavior fusion and label fusion are performed on the data shared in step 1) to form attribute fusion data, relationship fusion data, behavior fusion data and label fusion data.
In step 2), each participant voluntarily shares, in batches, the network entity attribute data it has observed.
In step 2):
attribute fusion means fusing, according to a predefined fusion strategy, the attributes of the same network entity obtained from different participants; specifically, different attribute information recorded for the same network entity at different data sources is merged so that the sources complement one another, and identical attribute information is deduplicated and corrected;
relationship fusion means fusing the association relationships between pairs of network entities to form a graph-structured network entity relationship base; specifically, different relationship information recorded for the same pair of network entities at different data sources is merged so that the sources complement one another, and identical relationship information is deduplicated and corrected;
behavior fusion means fusing the behavior information recorded for the same network entity at different data sources, so that scattered multi-source behavior information is integrated into a more comprehensive and complete observation record for each network entity; specifically, the different behavior records of the same network entity are arranged in chronological order, and identical behavior records from different participants are deduplicated and corrected;
label fusion applies when different participants each provide assessment labels for the same entity: after receiving the assessment labels sent by other participants, each reader executes its local trust-adoption strategy and integrates the information from all participants, assigning higher trust to a label given by multiple parties and keeping the differing labels given by different parties as supplements, thereby obtaining a confidence for each label.
During data fusion, the source of each data item must be recorded so that data can be audited and traced, which supports troubleshooting and value settlement.
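Label fusion with source tracking could be sketched in Python as follows. The text leaves the local strategy to each reader, so the combination rule used here (one minus the product of each reporting party's doubt) is only an assumed example, as are the function name fuse_labels and the trust weights.

```python
from collections import defaultdict


def fuse_labels(reports, trust):
    """Fuse assessment labels reported by several participants.

    reports: iterable of (participant_id, entity, label) tuples.
    trust:   participant_id -> local trust weight in [0, 1] (assumed input).
    Returns {entity: {label: {"confidence": float, "sources": [ids]}}}.
    """
    sources = defaultdict(lambda: defaultdict(set))
    for participant, entity, label in reports:
        # A set removes repeated reports of the same label by the same party.
        sources[entity][label].add(participant)

    fused = {}
    for entity, labels in sources.items():
        fused[entity] = {}
        for label, parties in labels.items():
            doubt = 1.0
            for party in parties:
                doubt *= 1.0 - trust.get(party, 0.0)
            fused[entity][label] = {
                "confidence": 1.0 - doubt,      # grows as more parties agree
                "sources": sorted(parties),     # kept for audit and tracing
            }
    return fused


reports = [
    ("A", "e1", "malicious"),
    ("B", "e1", "malicious"),
    ("B", "e1", "malicious"),      # duplicate report, counted once
    ("C", "e1", "advertisement"),  # differing label, kept as a supplement
]
trust = {"A": 0.5, "B": 0.5, "C": 0.4}
fused = fuse_labels(reports, trust)
```

With these weights the label reported by two parties ends up with a higher confidence than either party alone would give it, while the label reported only by C is retained with C's own weight.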
3) Data feature extraction: data features are extracted separately from the attribute fusion data, the relationship fusion data and the behavior fusion data obtained in step 2).
Step 3) builds on data fusion: feature extraction and assessment are performed on groups of related resources as a unit, rather than by analyzing a single network entity in isolation.
In step 3):
the data features of the attribute fusion data include: IP location, domain name registration time and file modification time;
the data features of the relationship fusion data include: the in-degree and out-degree of graph nodes, the number of IP addresses associated with a domain name, the number of NS servers associated with a domain name, and the in-degree and out-degree of a domain name node restricted to its IP-type neighbor nodes. A topological structure is constructed from the association relationships among the entities, and feature information is then extracted from that structure.
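As an illustration of extracting features from the relationship topology, the following Python sketch computes node degrees and per-domain counts of associated IP addresses and NS servers. The edge format, the relation names and the node type strings are hypothetical.

```python
from collections import defaultdict


def relation_features(edges, node_type):
    """Compute simple graph features from directed relationship triples.

    edges:     iterable of (source, relation, target) triples.
    node_type: node -> type string such as "domain", "ip" or "ns" (assumed).
    """
    feats = defaultdict(lambda: {"in_degree": 0, "out_degree": 0,
                                 "ip_count": 0, "ns_count": 0})
    seen = set()
    for edge in edges:
        if edge in seen:          # the same relation reported twice counts once
            continue
        seen.add(edge)
        src, _relation, dst = edge
        feats[src]["out_degree"] += 1
        feats[dst]["in_degree"] += 1
        if node_type.get(src) == "domain":
            if node_type.get(dst) == "ip":
                feats[src]["ip_count"] += 1
            elif node_type.get(dst) == "ns":
                feats[src]["ns_count"] += 1
    return dict(feats)


node_type = {"evil.example": "domain", "192.0.2.1": "ip", "192.0.2.2": "ip",
             "ns1.example": "ns", "sample-1": "file"}
edges = [
    ("evil.example", "resolves_to", "192.0.2.1"),
    ("evil.example", "resolves_to", "192.0.2.2"),
    ("evil.example", "resolves_to", "192.0.2.2"),   # duplicate from a second source
    ("evil.example", "served_by", "ns1.example"),
    ("sample-1", "contacts", "evil.example"),
]
features = relation_features(edges, node_type)
```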
The data features of the behavior fusion data include count features and sensitive behavior features, wherein the count features include the number of lateral communications, the number of outbound connections and the number of file accesses, and the sensitive behavior features include modification of startup items, connections to overseas hosts and access to the registry.
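The count features and sensitive behavior flags can be derived from a fused, time-ordered behavior record, as in the Python sketch below. The event schema and the event type names are hypothetical, and treating two events with the same time and type as the same event is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical event type names for the features listed above.
COUNTED = ("lateral_comm", "outbound_conn", "file_access")
SENSITIVE = ("startup_item_modified", "overseas_connection", "registry_access")


def behavior_features(events):
    """Build a deduplicated timeline for one entity and derive its features.

    events: list of dicts with keys "time", "source" and "type".
    """
    # Sorting a set of (time, type) pairs orders the events chronologically
    # and drops the same event reported by more than one participant.
    timeline = sorted({(e["time"], e["type"]) for e in events})
    counts = Counter(kind for _time, kind in timeline)
    feats = {name + "_count": counts.get(name, 0) for name in COUNTED}
    feats.update({"has_" + name: counts.get(name, 0) > 0 for name in SENSITIVE})
    return timeline, feats


events = [
    {"time": 3, "source": "B", "type": "outbound_conn"},
    {"time": 1, "source": "A", "type": "lateral_comm"},
    {"time": 3, "source": "A", "type": "outbound_conn"},   # same event seen by A
    {"time": 5, "source": "A", "type": "registry_access"},
    {"time": 7, "source": "B", "type": "lateral_comm"},
]
timeline, feats = behavior_features(events)
```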
In step 3), label data shared by certain external participants may also be used as feature data rather than as training labels for the final model. Participants differ in credibility, and assessment labels provided by participants with low credibility cannot be trusted directly and fed to the machine learning model as labels; moreover, learning targets are not necessarily the same. For example, an external party may provide a URL link labeled "advertisement" while other parties are more concerned with network intrusion behavior, and this inconsistency in business goals may mean that some external labels can be used only as candidate features.
A machine learning model requires feature data in specific forms such as continuous real numbers, discrete ordinal numbers and Boolean variables. Some attribute data does not meet this requirement but can be converted into valuable feature information by suitable preprocessing. This process generally requires extracting information on demand in light of business semantics.
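A minimal Python sketch of such preprocessing is given below. It turns raw domain name attributes into numeric and Boolean features; the particular features chosen and the function name encode_domain are illustrative assumptions, since the appropriate encoding depends on business semantics.

```python
from datetime import date


def encode_domain(domain, registered, observed, overseas):
    """Convert raw domain attributes into model-ready feature values."""
    digits = sum(ch.isdigit() for ch in domain)
    return {
        "age_days": (observed - registered).days,   # discrete ordinal
        "label_count": len(domain.split(".")),      # discrete ordinal
        "digit_ratio": digits / len(domain),        # continuous real number
        "is_overseas": bool(overseas),              # Boolean variable
    }


features = encode_domain(
    "a1b2.example.com",
    registered=date(2020, 1, 1),
    observed=date(2020, 1, 31),
    overseas=True,
)
```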
Each participant decides its own feature extraction process. Participants should follow the PDCA (plan-do-check-act) methodology, that is, continuously adjust their feature engineering according to the business direction and model performance.
4) Modeling: referring to Fig. 2, a type of modeling method is selected as needed, data features and label fusion data are selectively loaded, the specific machine learning training process is executed to generate a trained model, and the trained model is output.
In step 4), the modeling methods include local training, federated learning, intelligence aggregation and ensemble learning, specifically:
local training: any participant executes a machine learning training task as needed, using the data features and label fusion data it holds, to obtain an AI model;
federated learning: a plurality of participants agree to jointly train an AI model without fully sharing their data;
intelligence aggregation: a participant applies a customized trust-adoption strategy to the network entities labeled as malicious and takes the desensitized network entities carrying a malicious label as customized IOC indicators; the IOC indicators, together with the confidence threshold customized by each participant, form a class of decision model that judges whether input network entity data matches a known IOC indicator whose confidence is greater than the confidence threshold, thereby obtaining an IOC model;
ensemble learning: local training, federated learning and intelligence aggregation are run under different parameter settings to generate a plurality of threat assessment models, and the preliminary assessment results of these models are combined by voting to produce a final assessment result with higher credibility; using local training, federated learning or intelligence aggregation alone is a special case of ensemble learning.
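The voting step of ensemble learning can be sketched in Python as a simple majority vote. The three stand-in models below are trivial hypothetical functions rather than trained models, and plain majority voting is only one possible voting scheme.

```python
from collections import Counter


def ensemble_verdict(models, entity):
    """Combine the preliminary verdicts of several models by majority vote.

    Returns the winning verdict and the share of models that voted for it.
    """
    verdicts = [model(entity) for model in models]
    winner, votes = Counter(verdicts).most_common(1)[0]
    return winner, votes / len(verdicts)


# Stand-ins for a locally trained model, a federated model and an IOC model.
def local_model(entity):
    return "malicious" if entity.endswith(".bad") else "benign"


def federated_model(entity):
    return "malicious" if "evil" in entity else "benign"


def ioc_model(entity):
    return "malicious" if entity in {"evil.bad"} else "benign"


models = [local_model, federated_model, ioc_model]
unanimous = ensemble_verdict(models, "evil.bad")
split = ensemble_verdict(models, "evil.example")
single = ensemble_verdict([ioc_model], "evil.bad")   # one model: the special case
```

Passing a single model reproduces that model's verdict unchanged, which mirrors the statement that each method used alone is a special case of ensemble learning.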
When data features and label fusion data are selectively loaded, the loading depends on the type strategy. If the type strategy is local training or federated learning, a loading strategy is set according to the modeling requirements and a feature selection function is executed. If the type strategy is intelligence aggregation, the desensitized identifiers that meet the requirements are loaded according to the trust-adoption strategy, and customized IOC indicators are generated.
When the specific machine learning training process is executed to generate a trained model, the specific process of local supervised training or federated learning training is carried out. If the type strategy is intelligence aggregation, the model training process converts the customized IOC indicators into an IOC matching model, which provides IOC matching capability and judges whether an input desensitized identifier hits an IOC indicator.
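For the intelligence aggregation case, converting IOC indicators into a matching model might look like the Python sketch below. The input layout (a confidence per label for each desensitized identifier), the function name build_ioc_model and the label string "malicious" are assumptions of this sketch.

```python
def build_ioc_model(fused_labels, threshold, malicious_label="malicious"):
    """Build an IOC matching function from fused label confidences.

    fused_labels: {desensitized_id: {label: confidence}} (assumed layout).
    threshold:    the participant's own confidence threshold.
    """
    # Keep only entities whose malicious confidence exceeds the threshold.
    iocs = {
        ident: labels[malicious_label]
        for ident, labels in fused_labels.items()
        if labels.get(malicious_label, 0.0) > threshold
    }

    def match(ident):
        """Return True if the desensitized identifier hits a known IOC."""
        return ident in iocs

    return match, iocs


fused_labels = {
    "id-1": {"malicious": 0.9},
    "id-2": {"malicious": 0.6, "advertisement": 0.4},
    "id-3": {"advertisement": 0.8},
}
strict_match, strict_iocs = build_ioc_model(fused_labels, threshold=0.7)
loose_match, loose_iocs = build_ioc_model(fused_labels, threshold=0.5)
```

Because matching operates on desensitized identifiers, two participants with different thresholds can build different IOC models from the same shared data without revealing the original identifiers.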
Outputting the trained model means exposing the model's capability externally, including the prediction capability of an AI model and the matching capability of an IOC model.
5) Auditing: the audit runs on a blockchain or collaboration platform shared by the participants, keeps accounts of the data flows in steps 1), 2), 3) and 4), calculates contribution scores for the participants according to set rules, and rewards the participants according to their contribution scores.
In step 5), the audit runs on a blockchain or collaboration platform shared by the participants. On a blockchain, data exchange records are stored in blocks and new blocks are continuously appended; alternatively, the audit can run on a public collaboration platform, where data exchange records are stored in the platform database and the audit system determines each party's contribution by reading the database.
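The idea of storing data exchange records in continuously appended blocks can be illustrated with a single-process, hash-chained log in Python. This is a deliberately simplified sketch: it has no consensus protocol or networking and is not a real blockchain, and the class name Ledger and the record fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64


def _digest(prev, record):
    """Hash a record together with the hash of the previous block."""
    body = json.dumps({"prev": prev, "record": record}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class Ledger:
    """Append-only log in which each block commits to the one before it."""

    def __init__(self):
        self.blocks = []

    def append(self, record):
        prev = self.blocks[-1]["hash"] if self.blocks else GENESIS
        self.blocks.append({"prev": prev, "record": record,
                            "hash": _digest(prev, record)})

    def verify(self):
        """Recompute every hash; any edited record breaks the chain."""
        prev = GENESIS
        for block in self.blocks:
            if block["prev"] != prev or block["hash"] != _digest(prev, block["record"]):
                return False
            prev = block["hash"]
        return True


ledger = Ledger()
ledger.append({"sharer": "A", "reader": "B", "kind": "relation", "items": 3})
ledger.append({"sharer": "C", "reader": "A", "kind": "label", "items": 1})
intact = ledger.verify()

ledger.blocks[0]["record"]["items"] = 30   # tamper with an old exchange record
tampered_ok = ledger.verify()
```

An audit process can replay the records in such a log to compute each participant's contribution score, and the verification step shows why later edits to an exchange record are detectable.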
In step 5), the rules for calculating contribution scores are set as follows:
for attribute data that is shared and read by other participants, the basic contribution score for a single attribute is denoted Sa;
for relationship data that is shared and read by other participants, the basic contribution score for a single relationship is denoted Sr;
for behavior data that is shared and read by other participants, the basic contribution score for a single behavior event is denoted Sb;
for label data that is shared and read by other participants, the basic contribution score for a single label is denoted Sl;
for transparent data that is shared and used for machine learning, the contribution score for a single item of transparent data is denoted Sf;
for network entity data shared directionally at the request of another participant, the contribution score for a single item of network entity data is denoted St.
A participant obtains contribution scores only after its shared data has been read by other participants. The contribution score obtained is the product of the basic contribution score and the number of readers. For example, if an item of association data is read by 3 participants, the sharer obtains a contribution score of 3 × Sr.
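The basic reward rule can be written as a one-line calculation, shown here as a Python sketch. The numeric values assigned to Sa, Sr, Sb and Sl are arbitrary placeholders, since the text leaves them to be set by rule.

```python
# Placeholder values for the basic contribution scores Sa, Sr, Sb and Sl.
BASE_SCORES = {"attribute": 1.0, "relation": 2.0, "behavior": 1.5, "label": 3.0}


def sharer_reward(kind, reader_count, items=1):
    """Contribution score earned by a sharer: base score x items x readers."""
    return BASE_SCORES[kind] * items * reader_count


read_by_three = sharer_reward("relation", reader_count=3)   # 3 x Sr
never_read = sharer_reward("relation", reader_count=0)      # no readers, no score
```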
After a reader reads data, the corresponding contribution score is deducted from the reader; for example, a reader that reads 3 items of association data is charged 3 × Sr. The reader may optionally rate the data it has read, for example on a scale of 0 to 10. If the reader's rating differs from the average rating of the data by no more than a certain range, for example 2 points, the reader obtains a certain contribution score, denoted Ss. Let r denote the average rating of the data and Sw the basic reward score. When r is higher than a certain value, for example 8 points, the participant that shared the data is rewarded with a contribution score of Sw × (r - 8). When r is lower than a certain value, for example 3 points, a score of Sw × (3 - r) is deducted from the participant that shared the data.
Each participant holds a certain initial score by default.
The total of the contribution scores held by the participants in the system changes continuously.
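The reading charge and the rating rules can be combined into a small settlement routine, sketched below in Python. The example thresholds (2, 8 and 3 points) come from the text; the values of Ss and Sw, the inclusive treatment of the 2-point range, and the choice to average over all submitted ratings are assumptions of this sketch.

```python
def read_cost(base_score, items):
    """Score deducted from a reader, e.g. 3 x Sr for 3 relation items."""
    return base_score * items


def settle_ratings(ratings, s_s=1.0, s_w=5.0, tolerance=2.0, high=8.0, low=3.0):
    """Settle the rewards that follow from readers' ratings of one data item.

    ratings: {reader_id: score between 0 and 10}.
    Returns (average rating r, per-reader rewards, change to the sharer's score).
    """
    r = sum(ratings.values()) / len(ratings)
    reader_rewards = {
        reader: (s_s if abs(score - r) <= tolerance else 0.0)
        for reader, score in ratings.items()
    }
    if r > high:
        sharer_delta = s_w * (r - high)      # reward Sw x (r - 8)
    elif r < low:
        sharer_delta = -s_w * (low - r)      # deduction Sw x (3 - r)
    else:
        sharer_delta = 0.0
    return r, reader_rewards, sharer_delta


cost = read_cost(base_score=2.0, items=3)
good = settle_ratings({"B": 9, "C": 10, "D": 8})
poor = settle_ratings({"B": 1, "C": 3})
mixed = settle_ratings({"B": 1, "C": 2, "D": 9})
```

Because rewards and deductions are not symmetric, the total score held by all participants is not conserved, which is consistent with the statement that the total changes continuously.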
Referring to Fig. 3, the invention further provides a security threat collaborative modeling system comprising the following modules:
the data sharing unit is used for partially sharing network entity data among a plurality of participants and for desensitizing, before sharing, the network entity data that requires desensitization. If the original identifiers of network entities are not sensitive in the current environment, the desensitization function can be turned off, that is, the original identifier is used directly as the desensitized identifier. For example, when the participants belong to the same department or fully trust one another, no desensitization is needed during information exchange.
The data fusion unit is used for performing attribute fusion, relationship fusion, behavior fusion and label fusion on the shared data to form attribute fusion data, relationship fusion data, behavior fusion data and label fusion data. The data fusion unit provides interface support for this process and can provide a visual interface that presents the fusion of entities and relationships. The association relationships between network entities can be presented visually as a graph, with a convenient interface for the user to edit and modify them. To distinguish data sources, points and lines are marked with icons, with different icons representing different participants. The color and shape of points and lines can reflect the type of entity or relationship.
The data feature extraction unit is used for respectively extracting data features of the attribute fusion data, the relationship fusion data and the behavior fusion data;
the modeling unit comprises a data loading unit, a model training unit and a model output unit and selects a type of modeling method as needed; the data loading unit is used for selectively loading data features and label fusion data, the model training unit is used for executing the specific machine learning training process to generate a trained model, and the model output unit is used for outputting the trained model;
and the auditing unit runs on a blockchain or collaboration platform shared by all participants and is used for keeping accounts of the data flows in the data sharing unit, the data fusion unit, the data feature extraction unit and the modeling unit, calculating contribution scores for the participants according to set rules, and rewarding the participants according to their contribution scores.