CN105491023B

Info

Publication number: CN105491023B
Application number: CN201510824673.3A
Authority: CN
Inventors: 周诚; 张涛; 马媛媛; 李伟伟; 汪晨; 邵志鹏; 费稼轩; 何高峰; 楚杰; 黄秀丽
Original assignee: State Grid Smart Grid Research Institute of SGCC; State Grid Corp of China SGCC
Current assignee: State Grid Smart Grid Research Institute of SGCC; State Grid Corp of China SGCC
Priority date: 2015-11-24
Filing date: 2015-11-24
Publication date: 2020-10-27
Anticipated expiration: 2035-11-24
Also published as: CN105491023A

Abstract

The present invention provides a data isolation exchange and security filtering method for the power Internet of Things. The method includes: (1) constructing an isolation architecture based on a front-end proxy and a proprietary protocol; (2) extracting feature vectors at the front-end proxy; (3) encapsulating labels at the front-end proxy and parsing them on the isolation service side; (4) performing label filtering on the isolation service side; and (5) performing content filtering on the isolation service side by combining the feature vectors with label filtering. By introducing the front-end proxy and a private application-layer protocol, the invention achieves thorough protocol isolation and greatly improves isolation strength.

Description

A data isolation exchange and security filtering method for the power Internet of Things

Technical field

The invention relates to a security isolation exchange method, and in particular to a data isolation exchange and security filtering method for the power Internet of Things.

Background

The basic principle of security isolation and information exchange technology is to use dedicated hardware, an isolation device, so that two or more networks can securely transfer data and share resources without being directly connected. Typically, the isolation device cuts off the TCP/IP connection between the networks, disassembles or reassembles the TCP/IP packets, performs a security review, and then establishes a valid connection with the host on the other side and sends the data.

With the development of the smart grid, the power Internet of Things contains massive numbers of smart terminals, smart disconnect switches, field-operation terminals, and various power business systems, all of which need to exchange data frequently with the power information network. Because the power information network is a classified network, while these terminals and systems mostly connect through mobile APN networks or the Internet, this interaction poses clear security risks to the power information network, so isolation and protection measures are required. However, existing security isolation and information exchange technology has clear shortcomings and struggles to meet the isolation and exchange requirements of the power Internet of Things, specifically:

1. Insufficient isolation strength

Isolation and exchange are inherently in tension. Traditional security isolation and information exchange technology is based on disassembling and reassembling TCP/IP packets, which falls short in resolving this tension. Most data exchanges must be carried over a specific application protocol, and once application-layer protocols are taken into account in the isolation and exchange model, it becomes clear that TCP/IP packet reassembly cannot fully eliminate the security risks that application-layer protocols introduce. Suppose a non-classified network exchanges data with an Oracle database in a classified network through a security isolation and information exchange system. Both networks must then communicate using the TNS protocol, and after passing through the hardware switching matrix the TCP/IP packets must still be restored to TNS message format. As a result, an attacker on the non-classified network can attack the database in the classified network using ordinary, publicly documented TNS messages. The traditional defense against this kind of attack is to strengthen application-layer filtering as much as possible, but this effectively degrades the system to the level of an application-layer firewall.

2. Conflict between filtering depth and efficiency

Security filtering is an important part of traditional security isolation and information exchange systems, and many related products claim content-level filtering capability. However, content filtering always has a performance cost: the deeper the filtering, the greater the added latency and the loss of throughput. The traditional architecture attempts to perform deep filtering on the isolation device itself, which is likely infeasible in industrial environments. The power Internet of Things, for example, involves hundreds of millions of terminals, enormous data exchange traffic, and strict latency requirements. Traditional systems can cope only by disabling filtering or reducing its depth, and therefore fall short in resolving the conflict between filtering depth and efficiency.

Summary of the invention

To overcome the above deficiencies of the prior art, the present invention provides a data isolation exchange and security filtering method for the power Internet of Things. By introducing a front-end proxy and a private application-layer protocol, the invention achieves thorough protocol isolation and greatly improves isolation strength.

To achieve the above object, the present invention adopts the following technical solution:

A data isolation exchange and security filtering method for the power Internet of Things, the method comprising the following steps:

(1) constructing an isolation architecture based on a front-end proxy and a proprietary protocol;

(2) performing feature vector extraction at the front-end proxy;

(3) performing label encapsulation at the front-end proxy and label parsing on the isolation service side;

(4) performing label filtering on the isolation service side;

(5) performing content filtering on the isolation service side by combining the feature vectors with label filtering.

Preferably, in step (1), the isolation architecture includes a proprietary application-layer exchange protocol based on TCP/IP, a dedicated security isolation device, and a front-end proxy; the isolation architecture maps the user interaction process into messages of the application-layer exchange protocol to accomplish data exchange.

Preferably, step (2) includes the following steps:

Step 2-1: preprocess the message content T_i;

Step 2-2: perform feature extraction on the message content T_i;

Step 2-3: generate the message feature vector V' and the feature vector V of the sensitive library on the isolation service side;

Step 2-4: save the extracted feature vector in the label field of the message.

Preferably, in step 2-1, the preprocessing segments the text into words through the ICTCLAS word segmentation interface. After segmentation, the message content T_i is expressed in the following form:

T_i = ((a_i1, l_i1, p_i1), (a_i2, l_i2, p_i2), ..., (a_in, l_in, p_in))

where T_i denotes message i, a_in denotes a segmented phrase, l_in denotes the length of that phrase, and p_in denotes its part of speech.

Preferably, step 2-2 includes the following steps:

Step 2-2-1: perform part-of-speech selection on the message content T_i by extracting the noun phrases from the segmented text and discarding all other parts of speech. The result, Ta_i, is the text after noun extraction; each of its elements pairs a noun with the length of that noun phrase;

Step 2-2-2: count how often each keyword occurs, forming a word-segmentation triple that contains the phrase, the frequency with which the phrase occurs in this text, and its part of speech. A word-frequency term is added to Ta_i to give Tb_i, the text after word-frequency counting; each element of Tb_i holds a phrase, the length of that phrase, and the word frequency of that phrase;

Step 2-2-3: compute the length of each keyword and delete keywords that consist of a single character. The result, Tc_i, is the text after single-character keywords have been deleted; each of its elements pairs a phrase longer than one character with the word frequency of that phrase;

Step 2-2-4: remove phrases that occur only once. The final result, Td_i, is the text after this removal; each of its elements pairs a remaining phrase with its word frequency, and every remaining word frequency is greater than one.
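
Steps 2-2-1 to 2-2-4 can be sketched as a short filter chain. The sketch below is illustrative only: it assumes the segmenter has already produced (phrase, part-of-speech) pairs, and the function name `extract_features` and the convention that noun tags start with "n" are assumptions of this example rather than part of the claimed method or of any real segmenter's interface.

```python
from collections import Counter


def extract_features(tokens):
    """Return {phrase: frequency} after the four selection steps.

    `tokens` is a list of (phrase, part_of_speech) pairs. The "n" prefix
    for noun tags is an assumption of this sketch.
    """
    # Step 2-2-1: keep noun phrases only.
    nouns = [phrase for phrase, pos in tokens if pos.startswith("n")]
    # Step 2-2-2: count how often each remaining phrase occurs in this text.
    freq = Counter(nouns)
    # Step 2-2-3: drop single-character phrases.
    # Step 2-2-4: drop phrases that occur only once.
    return {p: f for p, f in freq.items() if len(p) > 1 and f > 1}


# Hypothetical segmenter output for one message.
tokens = [
    ("电网", "n"), ("的", "u"), ("调度", "n"), ("电网", "n"),
    ("运行", "v"), ("调度", "n"), ("图", "n"), ("图", "n"), ("密钥", "n"),
]
features = extract_features(tokens)
print(features)
```

In this example the verb and the particle are removed in the first step, the single-character noun is removed by the length check, and the noun that occurs once is removed by the frequency check.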

Preferably, step 2-3 includes the following steps:

Step 2-3-1: compute the weight of each phrase using the TF-IDF formula:

d_ij = t_ij * log(N/n_j)

where d_ij is the weight of the phrase a_ij; t_ij is the number of times the phrase a_ij occurs in the text T_i, equal to the word frequency recorded for that phrase in Td_i; N is the total number of documents; and n_j is the number of documents in the document library that contain the phrase a_ij;
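
The weight calculation of step 2-3-1 can be sketched as follows. This is a minimal illustration: the function name `tfidf_weights`, the use of the natural logarithm (the source does not state the base), and the choice to skip phrases that never occur in the document library are all assumptions of the example.

```python
import math


def tfidf_weights(term_freq, doc_freq, total_docs):
    """Compute d_ij = t_ij * log(N / n_j) for every phrase in term_freq.

    term_freq:  {phrase: t_ij}, occurrences of the phrase in this text
    doc_freq:   {phrase: n_j}, library documents containing the phrase
    total_docs: N, total number of documents in the library
    """
    weights = {}
    for phrase, t in term_freq.items():
        n = doc_freq.get(phrase, 0)
        if n == 0:
            # log(N / 0) is undefined; skipping the phrase is an
            # assumption of this sketch, not something the source states.
            continue
        weights[phrase] = t * math.log(total_docs / n)
    return weights


weights = tfidf_weights(
    {"电网": 2, "调度": 3},
    {"电网": 100, "调度": 10},
    total_docs=1000,
)
print(weights)
```

A phrase that is rare in the library (here the second one) receives a larger weight than a phrase that is common, even at similar in-text frequency.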

Step 2-3-2: the feature vector composed of the sensitive library data is expressed as:

V = ((a_11, d_11), (a_12, d_12), ..., (a_1m, d_1m), ..., (a_n1, d_n1), (a_n2, d_n2), ..., (a_nm, d_nm))

abbreviated as:

V = (d_11, d_12, ..., d_1m, ..., d_n1, d_n2, ..., d_nm)

Step 2-3-3: in the same way as in step 2-3-2, the feature vector of the message is obtained and abbreviated as:

V' = (d'_11, d'_12, ..., d'_1m, ..., d'_n1, d'_n2, ..., d'_nm).

Preferably, in step (3), the label includes the user information U(k,v), the data attribute information Ad(k,v), the feature vector V', the generation time T, and the encryption flag Fe, expressed as:

Label = (U(k,v), Ad(k,v), V', T, Fe)

The label encapsulation performed at the front-end proxy includes the following steps:

Step 3-1-1: arrange the user information U, the data attribute information Ad(k,v), the feature vector V', and the generation time T in order, and split them into N blocks;

Step 3-1-2: randomly select N1 of the N blocks, set their encryption flag, and encrypt the data to obtain EN1;

Step 3-1-3: record the random selection process R, treat R as a block, set its encryption flag, and encrypt R to obtain ER;

Step 3-1-4: leave the encryption flag unset for the remaining N2 (N - N1) blocks;

Step 3-1-5: compute the length of EN1 and the length of ER, and concatenate the EN1 length, EN1, the ER length, ER, and N2 to obtain the label-encapsulated private protocol data E;

The label parsing performed on the isolation service side includes the following steps:

Step 3-2-1: obtain the private protocol data E;

Step 3-2-2: read the EN1 length, use it to extract EN1, and decrypt EN1 to obtain N1;

Step 3-2-3: read the ER length, use it to extract ER, and decrypt ER to obtain R;

Step 3-2-4: extract the remaining data N2;

Step 3-2-5: use the random selection process R to restore N1 and N2 to U(k,v), Ad(k,v), V', and T.
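
The encapsulation and parsing steps form a round trip, which the following sketch illustrates. Several details are assumptions of this example because the source does not specify them: the fixed block size, the 4-byte big-endian length fields, the JSON encoding of the selection record R, and above all `toy_cipher`, a deliberately insecure XOR placeholder standing in for whatever cipher the private protocol actually uses.

```python
import json
import random
import struct


def toy_cipher(data: bytes, key: bytes) -> bytes:
    """Placeholder XOR cipher (its own inverse). NOT secure; it only
    stands in for the unspecified encryption algorithm."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def encapsulate(payload: bytes, key: bytes, block_size: int = 8, rng=random) -> bytes:
    # Step 3-1-1: split the ordered label fields into N blocks.
    blocks = [payload[i:i + block_size] for i in range(0, len(payload), block_size)]
    # Step 3-1-2: randomly pick N1 blocks and encrypt them as EN1.
    n1_count = rng.randint(1, len(blocks)) if blocks else 0
    chosen = sorted(rng.sample(range(len(blocks)), n1_count))
    en1 = toy_cipher(b"".join(blocks[i] for i in chosen), key)
    # Step 3-1-3: record the selection R and encrypt it as ER.
    record = json.dumps({"size": block_size, "chosen": chosen}).encode()
    er = toy_cipher(record, key)
    # Step 3-1-4: the remaining N2 blocks stay unencrypted.
    picked = set(chosen)
    n2 = b"".join(b for i, b in enumerate(blocks) if i not in picked)
    # Step 3-1-5: E = len(EN1) | EN1 | len(ER) | ER | N2.
    return struct.pack(">I", len(en1)) + en1 + struct.pack(">I", len(er)) + er + n2


def parse(e: bytes, key: bytes) -> bytes:
    # Step 3-2-2: read the EN1 length, extract EN1, decrypt to N1.
    (en1_len,) = struct.unpack_from(">I", e, 0)
    n1 = toy_cipher(e[4:4 + en1_len], key)
    # Step 3-2-3: read the ER length, extract ER, decrypt to R.
    pos = 4 + en1_len
    (er_len,) = struct.unpack_from(">I", e, pos)
    record = json.loads(toy_cipher(e[pos + 4:pos + 4 + er_len], key))
    # Step 3-2-4: everything after ER is the plaintext N2.
    n2 = e[pos + 4 + er_len:]
    # Step 3-2-5: use R to interleave N1 and N2 back into the original order.
    size, chosen = record["size"], set(record["chosen"])
    total = len(n1) + len(n2)
    out, i1, i2 = [], 0, 0
    for index in range((total + size - 1) // size):
        length = min(size, total - index * size)  # last block may be short
        if index in chosen:
            out.append(n1[i1:i1 + length])
            i1 += length
        else:
            out.append(n2[i2:i2 + length])
            i2 += length
    return b"".join(out)


key = b"demo-key"
packet = encapsulate(b"user=alice;type=report;T=2015-11-24", key)
print(parse(packet, key))
```

Because the positions of the encrypted blocks are themselves carried only inside the encrypted record ER, a receiver without the key cannot tell which parts of N2 belong where in the original label.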

Preferably, in step (4), the label filtering filters data by applying policy rules to the data attributes provided by the client. A policy rule consists of a left bracket [, the keyword begin, an expression exp, the keyword end, and a right bracket ]. An expression is built from basic items and composite items: the basic items are variables (var), numeric values, and strings, and the composite items are complex expressions in which variables, numeric values, and strings are joined by unary and binary operators.

Preferably, the label filtering includes the following steps:

Step 4-1: extract the user information U(k,v) and the data attribute information Ad(k,v) and assign them to a new attribute set Ad'(k,v);

Step 4-2: fetch a policy rule from the policy library;

Step 4-3: traverse the policy rule expression exp and extract the variables var in the expression;

Step 4-4: using var as the key, look up the corresponding value v in Ad'(k,v);

Step 4-5: replace each var in the policy rule with its value v and evaluate the expression;

Step 4-6: decide from the result whether the data is filtered, and write a log entry.
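
Steps 4-1 to 4-5 amount to evaluating a small expression language against a key-value attribute set. The sketch below shows one possible way to do this. The source defines only the `[ begin exp end ]` envelope and the presence of unary and binary operators; the concrete operator spellings (borrowed here from Python syntax), the function name `evaluate_rule`, and the sample attribute names are assumptions of this example.

```python
import ast
import operator

# Operators this sketch accepts; the actual operator set is not given
# in the source.
BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
          ast.Mult: operator.mul, ast.Div: operator.truediv}
COMPARE = {ast.Eq: operator.eq, ast.NotEq: operator.ne,
           ast.Lt: operator.lt, ast.LtE: operator.le,
           ast.Gt: operator.gt, ast.GtE: operator.ge}
UNARY = {ast.Not: operator.not_, ast.USub: operator.neg}


def evaluate_rule(rule: str, attributes: dict):
    """Evaluate a rule of the form '[begin exp end]' against Ad'(k, v)."""
    text = rule.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("rule must be enclosed in [ ]")
    inner = text[1:-1].strip()
    if not (inner.startswith("begin ") and inner.endswith(" end")):
        raise ValueError("rule must use the begin ... end keywords")
    exp = inner[len("begin"):-len("end")].strip()
    return _eval(ast.parse(exp, mode="eval").body, attributes)


def _eval(node, attrs):
    # Basic items: numeric values and strings.
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float, str)):
        return node.value
    # Basic item: a variable, replaced by its value from Ad'(k, v)
    # (steps 4-3 to 4-5).
    if isinstance(node, ast.Name):
        return attrs[node.id]
    # Composite items: unary and binary operators.
    if isinstance(node, ast.UnaryOp) and type(node.op) in UNARY:
        return UNARY[type(node.op)](_eval(node.operand, attrs))
    if isinstance(node, ast.BinOp) and type(node.op) in BINARY:
        return BINARY[type(node.op)](_eval(node.left, attrs), _eval(node.right, attrs))
    if isinstance(node, ast.BoolOp):
        values = [_eval(v, attrs) for v in node.values]
        return all(values) if isinstance(node.op, ast.And) else any(values)
    if (isinstance(node, ast.Compare) and len(node.ops) == 1
            and type(node.ops[0]) in COMPARE):
        return COMPARE[type(node.ops[0])](
            _eval(node.left, attrs), _eval(node.comparators[0], attrs))
    raise ValueError("unsupported expression element")


# Step 4-1: merge user information U and data attributes Ad into Ad'.
user_info = {"role": "operator"}
data_attrs = {"size": 2048, "type": "report"}
ad_prime = {**user_info, **data_attrs}

result = evaluate_rule('[begin size > 1024 and type == "report" end]', ad_prime)
print(result)
```

Walking a restricted syntax tree rather than calling `eval` keeps rule evaluation limited to the listed operators. Whether a true result means "block" or "pass" is a policy decision that the source leaves to the rule author.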

Preferably, step (5) includes the following steps:

Step 5-1: compute the cosine similarity between the feature vector V' and the feature vector V of the sensitive library on the isolation service side, using the following formula:

cos(V', V) = (V'·V) / (||V'|| ||V||)

where V' and V are the two feature vectors; V'·V is the standard vector dot product, defined as V'·V = Σ_{k=1..t} v'_k v_k, with v'_k and v_k the k-th components of V' and V; t is the dimension of the vectors; the norm ||V'|| in the denominator is defined as ||V'|| = sqrt(Σ_{k=1..t} v'_k^2); and the norm ||V|| in the denominator is defined as ||V|| = sqrt(Σ_{k=1..t} v_k^2);

Step 5-2: compare the cosine similarity value with a predefined similarity threshold to determine whether the message carries classified information, and filter out messages that do.
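
Steps 5-1 and 5-2 reduce to a dot product, two norms, and one comparison. The sketch below illustrates this under stated assumptions: the threshold value 0.8 is purely illustrative (the source only says the threshold is predefined), and returning 0.0 for an all-zero vector is a convention chosen for this example.

```python
import math


def cosine_similarity(v1, v2):
    """cos(V', V) = (V' . V) / (||V'|| * ||V||) for equal-length vectors."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same dimension t")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        # The cosine is undefined for a zero vector; treating it as
        # "no similarity" is an assumption of this sketch.
        return 0.0
    return dot / (norm1 * norm2)


def is_sensitive(message_vec, library_vec, threshold=0.8):
    """Step 5-2: flag the message when similarity reaches the threshold.
    The default threshold is illustrative only."""
    return cosine_similarity(message_vec, library_vec) >= threshold


score = cosine_similarity([3.0, 4.0], [4.0, 3.0])
print(score, is_sensitive([3.0, 4.0], [4.0, 3.0]))
```

Because only this comparison runs on the isolation service side, the per-message cost there is linear in the vector dimension t, regardless of how long the original message was.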

Compared with the prior art, the present invention has the following beneficial effects:

The invention only needs to provide a private JDBC driver. TNS protocol communication does not need to be opened on the non-classified network and is opened only on the classified network. Messages on the two sides of the isolation boundary are therefore fully semantically translated and have no simple mapping to each other, so an attacker on the non-classified network cannot exploit TNS protocol vulnerabilities in the internal network. This achieves thorough protocol isolation and greatly improves isolation strength.

Based on its isolation exchange architecture, the invention moves deep content parsing and content feature extraction forward to the front-end proxy and performs only feature matching on the isolation device, which greatly reduces the computation required at the isolation boundary. In the power Internet of Things, this approach can use hundreds of millions of smart terminal devices to perform content filtering computation in a distributed manner, giving efficient, low-latency distributed content filtering and better resolving the conflict between filtering depth and exchange efficiency.

By introducing the front-end proxy, the invention moves the isolation exchange boundary forward to the terminal side. Many smart terminals in the power Internet of Things are built on trusted computing principles, so the front-end proxy software running on a smart terminal can be combined with the terminal's trusted computing system. Hardened by the private application-layer protocol, the whole isolation exchange system is brought into the trusted exchange system, thereby achieving trusted isolation exchange.

Description of drawings

FIG. 1 is a flowchart of the data isolation exchange and security filtering method for the power Internet of Things provided by the present invention.

FIG. 2 is a flowchart of feature vector extraction at the front-end proxy provided by the present invention.

FIG. 3 is a flowchart of label and private-protocol encapsulation, in which the label content is converted into the private format, provided by the present invention.

FIG. 4 is a flowchart of policy filtering provided by the present invention.

Detailed description

The present invention is described in further detail below with reference to the accompanying drawings.

As shown in FIG. 1, the present invention provides a data isolation exchange and security filtering method for the power Internet of Things, which adopts the following technical solution:

Step 1. Construction of the isolation architecture

An isolation architecture based on a front-end proxy and a proprietary protocol is constructed. It comprises: a proprietary application-layer interaction protocol based on TCP/IP; a dedicated security isolation device, which on the one hand provides hardware-level TCP/IP disassembly, reassembly, and exchange, and on the other hand supports only the proprietary application-layer protocol and rejects all third-party public application-layer protocols; and a front-end proxy, which may take the form of a driver, an SDK, or a hardware plug-in. The main role of the front-end proxy in this architecture is to map the user interaction process into messages of the proprietary application-layer protocol to accomplish data exchange. In a practical implementation, the front-end proxy can also provide terminal hardening and trusted authentication.

Step 2. Feature vector extraction at the front-end proxy, as shown in FIG. 2

The message content T_i is first preprocessed. Feature extraction is then performed to generate the message feature vector V' and the sensitive library feature vector V, and the extracted feature vector is saved in the label field of the message.

(1) Preprocessing

The text is segmented into words through the ICTCLAS word segmentation interface. After segmentation, the message content T_i is expressed in the following form:

T_i = ((a_i1, l_i1, p_i1), (a_i2, l_i2, p_i2), ..., (a_in, l_in, p_in))

where T_i denotes message i, a_in denotes a segmented phrase, l_in denotes the length of that phrase, and p_in denotes its part of speech.

(2) Feature extraction

1) Part-of-speech selection

In Chinese text, the keywords that most strongly express the content of a document are selected by part of speech and used for the subsequent feature extraction, which helps eliminate redundancy and simplifies the calculation. The noun phrases in the segmented text are therefore extracted and all other parts of speech are discarded. After part-of-speech selection, the text T_i becomes Ta_i, the text after noun extraction; each of its elements pairs a noun with the length of that noun phrase.

2) Word frequency statistics

The frequency of each keyword is counted, forming a word-segmentation triple that contains the phrase, the frequency with which the phrase occurs in this text, and its part of speech. A word-frequency term is added to Ta_i to give Tb_i, the text after word-frequency counting; each element of Tb_i holds a phrase, the length of that phrase, and the word frequency of that phrase.

3) Word length selection

In Chinese text, multi-character words are more expressive than single characters, so the length of each keyword is computed and single-character keywords are deleted. The result, Tc_i, is the text after single-character keywords have been deleted; each of its elements pairs a phrase longer than one character with the word frequency of that phrase.

4) Word frequency selection

In Chinese text, a word that occurs only once is incidental and not representative, so phrases that occur only once are removed from the counted word-segmentation triples. The result is the final feature two-tuple form Td_i, the text after this removal; each of its elements pairs a remaining phrase with its word frequency, and every remaining word frequency is greater than one.

(3) Feature vector generation

Computing a weight for each word is an effective way to measure feature values. The statistical TF-IDF formula is widely used for this purpose and has proven workable and effective in extensive practical use. Its core idea is that the less often a word occurs in other texts, the more information it carries and the better it represents the type of the document; conversely, a word that also occurs frequently in other documents is not representative.

The commonly used TF-IDF formula is:

d_ij = t_ij * log(N/n_j)

where t_ij is the number of times the phrase a_ij occurs in the text T_i, equal to f_im in Td_i; N is the total number of documents; and n_j is the number of documents in the document library that contain the phrase a_ij.
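
As an illustrative calculation of this formula (the counts below and the use of the natural logarithm are assumptions made for the example; the source does not state the base of the logarithm), consider a phrase that occurs t_ij = 3 times in a message, with a library of N = 1000 documents of which n_j = 10 contain the phrase. For comparison, the second expression shows the same phrase when it occurs in every library document:

```latex
d_{ij} = t_{ij}\,\log\frac{N}{n_j} = 3 \cdot \ln\frac{1000}{10} = 3 \ln 100 \approx 13.82,
\qquad
d_{ij}\Big|_{n_j = N} = 3 \cdot \ln\frac{1000}{1000} = 3 \ln 1 = 0 .
```

The rare phrase receives a large weight, while a phrase present in every document receives weight zero regardless of how often it occurs in the message, which matches the core idea described for TF-IDF.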

The feature vector composed of the sensitive library data is expressed as:

V = ((a_11, d_11), (a_12, d_12), ..., (a_1m, d_1m), ..., (a_n1, d_n1), (a_n2, d_n2), ..., (a_nm, d_nm))

abbreviated as:

V = (d_11, d_12, ..., d_1m, ..., d_n1, d_n2, ..., d_nm)

The feature vector of the message is obtained in the same way and abbreviated as:

V' = (d'_11, d'_12, ..., d'_1m, ..., d'_n1, d'_n2, ..., d'_nm)

Step 3. Label encapsulation at the front-end proxy and label parsing on the isolation service side

Label encapsulation and parsing cover the label itself, label and private-protocol encapsulation, and label and private-protocol parsing. At the sending end, the user information of the accessing user, the attribute information of the data being sent, and the feature vector of the data are recorded in a label; the data is then encrypted in randomly chosen blocks under the private protocol and sent to the server. The server first recovers the data by parsing, and the recovered data then serves label filtering and feature-vector filtering.

The label includes the user information U, the data attribute information Ad, the feature vector V', the generation time T, and the encryption flag Fe:

Label = (U(k,v), Ad(k,v), V', T, Fe)

where:

1) the user information includes the user's identity information and the operation the user requests, and is stored as key-value pairs;

2) the data attribute information includes the data type, data size, data creator, data modification time, and so on, and is also stored as key-value pairs;

3) the feature vector is used for feature-vector-based content filtering on the server;

4) the generation time is the time at which the label was generated;

5) the encryption flag indicates, after the label has been split into blocks, whether a block's data is encrypted; the encryption flag is used when the server does not parse it.
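
The five label fields can be pictured as a simple record. The sketch below is one hypothetical in-memory representation: the class name `Label`, the field names, the per-block list used for Fe, and the JSON serialization in `ordered_payload` are all assumptions of this example, since the source specifies only the logical content of the label.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Label:
    user_info: dict        # U(k, v): identity and requested operation
    data_attrs: dict       # Ad(k, v): type, size, creator, modified time, ...
    feature_vector: list   # V': weights produced by feature extraction
    created_at: str        # T: time at which the label was generated
    # Fe: one flag per block, filled in when the label is split and
    # partially encrypted (representation assumed for this sketch).
    encrypted_flags: list = field(default_factory=list)

    def ordered_payload(self) -> bytes:
        """Serialize U, Ad, V', T in that order, ready to be split into blocks."""
        fields = [self.user_info, self.data_attrs, self.feature_vector, self.created_at]
        return json.dumps(fields, ensure_ascii=False).encode("utf-8")


label = Label(
    user_info={"user": "alice", "operation": "upload"},
    data_attrs={"type": "report", "size": 2048},
    feature_vector=[4.605, 13.816],
    created_at="2015-11-24T08:00:00",
)
payload = label.ordered_payload()
print(payload)
```

Keeping U and Ad as key-value dictionaries matches their later use in label filtering, where policy variables are looked up by key.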

As shown in FIG. 3, label and private-protocol encapsulation converts the label content into the private format in the following steps:

Step a: arrange the user information U, the data attribute information Ad(k,v), the feature vector V', and the generation time T in order, and split them into N blocks;

Step b: randomly select N1 of the N blocks, set their encryption flag, and encrypt the data to obtain EN1;

Step c: record the random selection process R, treat R as a block, set its encryption flag, and encrypt R to obtain ER;

Step d: leave the encryption flag unset for the remaining N2 (N - N1) blocks;

Step e: compute the length of EN1 and the length of ER, then concatenate the EN1 length, EN1, the ER length, ER, and N2 to obtain E. This process is shown in FIG. 3.

After label and private-protocol encapsulation, the result is sent to the server as a message. The server first performs label and private-protocol parsing on the message to recover the label values, in the following steps:

Step a: obtain the private protocol data E;

Step b: read the EN1 length, use it to extract EN1, and decrypt EN1 to obtain N1;

Step c: read the ER length, use it to extract ER, and decrypt ER to obtain R;

Step d: extract the remaining data N2;

Step e: use the random selection process R to restore N1 and N2 to U(k,v), Ad(k,v), V', and T.

Step 4. Label filtering on the isolation service side

Label filtering filters data by applying flexible policy rules to the data attributes provided by the client.

Policy rules are the specification for policy filtering. They provide a uniform policy description that can process attribute information in order to filter data. For ease of evaluation and extension, a policy rule is designed as a custom expression made up of variables, values, and operators. The value of each variable is looked up in the data attribute information by the variable's name, while the operators and variables are set by the specific policy. During filtering, the attribute values are substituted for the variables, the policy expression is evaluated, and the result is output. Because policy rules are expressions, they are very flexible.

The policy rules are formalized as follows.

A policy rule consists of a left bracket [, the keyword begin, an expression (exp), the keyword end, and a right bracket ]. An expression is built from two kinds of items:

Basic items: variables (var), numbers (float and integer), and strings (string);

Compound items: complex expressions formed by connecting variables, numbers, and strings with unary (opu) and binary (opb) operators.

Describing policy rules with custom expressions makes policy-based filtering convenient, easy to operate, and easy to extend. The policy filtering process is shown in Figure 4:
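The rule structure described above can be illustrated with a short sketch that checks the `[begin exp end]` wrapper and separates the basic items of an expression. The concrete operator syntax is not given in the text, so this sketch assumes, purely for illustration, that expressions follow Python's expression syntax; `extract_expression` and `classify_items` are hypothetical helper names.

```python
import ast
import re

# Assumed textual form of a rule: "[begin <expression> end]".
RULE_RE = re.compile(r"^\[\s*begin\s+(?P<exp>.+?)\s+end\s*\]$", re.DOTALL)


def extract_expression(rule: str) -> str:
    """Strip the bracket and begin/end keywords, returning the expression text."""
    match = RULE_RE.match(rule.strip())
    if match is None:
        raise ValueError("rule must have the form [begin exp end]")
    return match.group("exp")


def classify_items(exp: str):
    """Return (variables, literals) found among the basic items of an expression."""
    tree = ast.parse(exp, mode="eval")
    variables, literals = set(), []
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):          # basic item: variable (var)
            variables.add(node.id)
        elif isinstance(node, ast.Constant):    # basic item: number or string
            literals.append(node.value)
    return sorted(variables), literals
```

For example, `classify_items('level <= 2 and dept == "ops"')` reports the variables `dept` and `level` together with the literals `2` and `"ops"`; the comparison and `and` operators are what join them into a compound item.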

Step a: extract the user information U(k,v) and the data attribute information Ad(k,v) from the parsing step and assign them to new attribute information Ad'(k,v);

Step b: extract a policy rule exp from the policy library;

Step c: traverse the policy rule expression exp and extract each variable var in it;

Step d: using var as the key, extract the corresponding value v (an integer, a floating-point number, or a string) from Ad'(k,v);

Step e: replace var in the policy rule with v and evaluate the expression;

Step f: decide from the result whether the data is filtered, and record a log entry.
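Steps a to f can be sketched as a small evaluator. Several points here are assumptions and not taken from the text: rule expressions are written in Python expression syntax, only the operators listed in the tables below are supported, and a rule that evaluates to true means the data is filtered out (the text does not state the polarity). `tag_filter` and `evaluate` are hypothetical names.

```python
import ast
import operator

UNARY = {ast.Not: operator.not_, ast.USub: operator.neg}
BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
          ast.Mult: operator.mul, ast.Div: operator.truediv}
COMPARE = {ast.Eq: operator.eq, ast.NotEq: operator.ne, ast.Lt: operator.lt,
           ast.LtE: operator.le, ast.Gt: operator.gt, ast.GtE: operator.ge}


def evaluate(node, attrs):
    """Evaluate an expression tree, replacing each variable by its attribute value."""
    if isinstance(node, ast.Expression):
        return evaluate(node.body, attrs)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float, str)):
        return node.value
    if isinstance(node, ast.Name):
        return attrs[node.id]  # steps c to e: look up var and substitute v
    if isinstance(node, ast.UnaryOp) and type(node.op) in UNARY:
        return UNARY[type(node.op)](evaluate(node.operand, attrs))
    if isinstance(node, ast.BinOp) and type(node.op) in BINARY:
        return BINARY[type(node.op)](evaluate(node.left, attrs),
                                     evaluate(node.right, attrs))
    if isinstance(node, ast.BoolOp):
        values = [evaluate(v, attrs) for v in node.values]
        return all(values) if isinstance(node.op, ast.And) else any(values)
    if isinstance(node, ast.Compare):
        left = evaluate(node.left, attrs)
        for op, comparator in zip(node.ops, node.comparators):
            if type(op) not in COMPARE:
                raise ValueError("unsupported comparison operator")
            right = evaluate(comparator, attrs)
            if not COMPARE[type(op)](left, right):
                return False
            left = right
        return True
    raise ValueError("unsupported syntax in policy rule")


def tag_filter(user_info, data_attrs, policy_library, log):
    """Return True if the data may pass, False if some rule filters it."""
    attrs = {**user_info, **data_attrs}            # step a: merged attribute set
    for rule in policy_library:                    # step b: next rule
        text = rule.strip()
        if not (text.startswith("[") and text.endswith("]")):
            raise ValueError("rule must be enclosed in brackets")
        inner = text[1:-1].strip()
        if not (inner.startswith("begin") and inner.endswith("end")):
            raise ValueError("rule must use the begin/end keywords")
        exp = inner[len("begin"):-len("end")].strip()
        blocked = bool(evaluate(ast.parse(exp, mode="eval"), attrs))
        log.append((rule, blocked))                # step f: record the decision
        if blocked:
            return False
    return True
```

Walking the syntax tree with a fixed operator table, instead of calling `eval`, keeps a rule from executing arbitrary code. A variable that is missing from the attribute set raises `KeyError` in this sketch; a real deployment would need an explicit policy for that case.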

Step 5. Implement content filtering on the isolation service side by combining feature vectors and label filtering

The feature vector V' is extracted from the parsed label, and its similarity to the feature vector V of the sensitive library on the isolation device is computed as a cosine. The cosine similarity is calculated as follows:

$$\cos(V', V) = \frac{V' \cdot V}{\|V'\|\,\|V\|}$$

where V' and V are the two feature vectors and V'·V is the standard vector dot product, defined as

$$V' \cdot V = \sum_{i=1}^{t} v'_i \, v_i$$

Here t is the dimension of the vectors, and v'_i and v_i denote the i-th components of V' and V. A vector can generally be written as a sequence of numbers, each of which is called a component, and the dimension is the number of components; for example, the dimension of (a1, a2, a3) is 3. The norm ||V'|| in the denominator is defined as

$$\|V'\| = \sqrt{V' \cdot V'}$$

The cosine similarity value is compared with a predefined similarity threshold to determine whether the message carries classified information. Classified documents are then filtered out, which achieves content filtering.
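The similarity check can be sketched as follows. The threshold value 0.8, the choice of "greater than or equal" as the comparison, and the function names are assumptions for illustration; the text only states that a predefined threshold is used.

```python
import math


def cosine_similarity(v_msg, v_lib):
    """Cosine of the angle between two feature vectors of equal dimension t."""
    if len(v_msg) != len(v_lib):
        raise ValueError("feature vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(v_msg, v_lib))
    norm_msg = math.sqrt(sum(a * a for a in v_msg))
    norm_lib = math.sqrt(sum(b * b for b in v_lib))
    if norm_msg == 0 or norm_lib == 0:
        return 0.0  # assumed convention: an all-zero vector matches nothing
    return dot / (norm_msg * norm_lib)


def carries_classified_content(v_msg, v_lib, threshold=0.8):
    """True if the message vector is close enough to the sensitive-library vector."""
    return cosine_similarity(v_msg, v_lib) >= threshold
```

Because the cosine depends only on direction, a message whose weights are a scaled copy of the sensitive-library weights scores 1 regardless of document length, while a message sharing no weighted phrases with the library scores 0.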

Finally, it should be noted that the above embodiments only illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific embodiments of the present invention may still be modified or replaced with equivalents, and that any modification or equivalent replacement that does not depart from the spirit and scope of the present invention shall fall within the scope of the claims of the present invention.

Claims

1. A data isolation exchange and security filtering method for the power Internet of things, characterized by comprising the following steps:

step (1), constructing an isolation architecture based on a pre-agent and a proprietary protocol;

step (2), performing feature vector extraction in the pre-agent part;

step (3), performing label encapsulation in the pre-agent part and label parsing on the isolation service side;

step (4), implementing label filtering on the isolation service side;

step (5), performing content filtering on the isolation service side by combining the feature vector and label filtering;

the step (2) comprises the following steps:

step 2-1, preprocessing the message content T_i;

step 2-2, performing feature extraction on the message content T_i;

step 2-3, generating the message feature vector V' and the feature vector V of the sensitive library on the isolation service side;

step 2-4, storing the extracted feature vector in the label field of the message;

the preprocessing performs word segmentation on the text file through the ICTCLAS word segmentation interface; after word segmentation, the message content T_i is expressed as:

T_i = ((a_i1, l_i1, p_i1), (a_i2, l_i2, p_i2), ......, (a_in, l_in, p_in))

where T_i denotes message i, a_in denotes a segmented phrase, l_in denotes the length of the phrase, and p_in denotes the part of speech of the segmented phrase;

the step 2-2 comprises the following steps:

step 2-2-1, performing part-of-speech selection on the message content T_i: the noun phrases among the segmented text phrases are extracted and phrases of all other parts of speech are deleted; the result is the text Ta_i, each element of which consists of a noun phrase and the length of that noun phrase;

step 2-2-2, counting how often each keyword occurs, so as to form segmentation triples comprising the phrase, its occurrence frequency, and its part of speech in the text, and adding a word-frequency item to Ta_i; the result is the text Tb_i, each element of which consists of a phrase, the length of the phrase, and the word frequency of the phrase;

step 2-2-3, calculating the length of each keyword and deleting single-character keywords; the result is the text Tc_i, each element of which consists of a phrase longer than one character and the word frequency of that phrase;

step 2-2-4, removing phrases whose keyword occurs only once; the final result is the text Td_i, each element of which consists of a remaining phrase and its word frequency, where the word frequency is greater than one;

the step 2-3 comprises the following steps:

step 2-3-1, calculating the weight of each phrase with the TF-IDF formula:

d_ij = t_ij * log(N / n_j)

where d_ij is the weight of phrase a_ij in text T_i, t_ij is the word frequency of a_ij in Td_i, N is the total number of documents, and n_j is the number of documents in the document library that contain phrase a_ij;

step 2-3-2, the feature vector formed from the sensitive library data is expressed as:

V = ((a_11, d_11), (a_12, d_12), ......, (a_1m, d_1m), ......, (a_n1, d_n1), (a_n2, d_n2), ......, (a_nm, d_nm))

which is abbreviated as:

V = (d_11, d_12, ......, d_1m, ......, d_n1, d_n2, ......, d_nm)

step 2-3-3, following step 2-3-2, the feature vector of the message is abbreviated as:

V' = (d'_11, d'_12, ......, d'_1m, ......, d'_n1, d'_n2, ......, d'_nm).

2. The method according to claim 1, wherein in the step (1) the isolation architecture comprises a proprietary application-layer exchange protocol built on the TCP/IP protocol, a dedicated security isolation device, and a pre-agent, and is configured to map the user interaction process onto application-layer exchange protocol messages so as to implement data exchange.

3. The method according to claim 1, wherein in the step (3) the label comprises user information U(k,v), data attribute information Ad(k,v), the feature vector V', the generation time T, and the encryption flag Fe, and is expressed as:

Label = (U(k,v), Ad(k,v), V', T, Fe)

the label encapsulation in the pre-agent part comprises the following steps:

step 3-1-1, arranging the user information U(k,v), the data attribute information Ad(k,v), the feature vector V', and the generation time T in sequence, and dividing the sequence into N blocks;

step 3-1-2, randomly selecting N1 blocks from the N blocks, setting the encryption flag, and encrypting the data to obtain EN1;

step 3-1-3, recording the random selection process R, treating R as a block, setting the encryption flag, and encrypting R to obtain ER;

step 3-1-4, leaving the encryption flag unset for the remaining N-N1 blocks;

step 3-1-5, calculating the EN1 length and the ER length, and concatenating the EN1 length, EN1, the ER length, ER, and the remaining N-N1 blocks to obtain the label-encapsulated private protocol data E;

the label parsing on the isolation service side comprises the following steps:

step 3-2-1, obtaining the private protocol data E;

step 3-2-2, extracting the EN1 length, using it to extract EN1, and decrypting EN1 to obtain N1;

step 3-2-3, extracting the ER length, using it to extract ER, and decrypting ER to obtain R;

step 3-2-4, extracting the subsequent data N-N1;

step 3-2-5, restoring N1 and N-N1 to U(k,v), Ad(k,v), V' and T using the random selection process R.

4. The method according to claim 3, wherein in the step (4) the label filtering filters data through policy rules according to the data attributes provided by the client; a policy rule consists of a left bracket [, the keyword begin, an expression exp, the keyword end, and a right bracket ]; the expression consists of basic items and compound items, wherein the basic items comprise variables var, numerical values, and character strings, and the compound items are complex expressions in which variables, numerical values, and character strings are connected by unary and binary operators.

5. The method according to claim 4, wherein the label filtering comprises the following steps:

step 4-1, extracting the user information U(k,v) and the data attribute information Ad(k,v) and assigning them to new attribute information Ad'(k,v);

step 4-2, extracting a policy rule from the policy library;

step 4-3, traversing the policy rule expression exp and extracting each variable var in the expression;

step 4-4, using var as the key, extracting the value v corresponding to var from Ad'(k,v);

step 4-5, replacing var in the policy rule with v and evaluating the expression;

step 4-6, deciding from the result whether the data is filtered, and recording a log entry.

6. The method according to claim 5, wherein the step (5) comprises the following steps:

step 5-1, computing the cosine of the feature vector V' and the feature vector V of the sensitive library on the isolation service side to obtain a cosine similarity value, the cosine similarity being calculated as:

$$\cos(V', V) = \frac{V' \cdot V}{\|V'\|\,\|V\|}$$

where V' and V are the two feature vectors, and V'·V is the standard vector dot product, defined as

$$V' \cdot V = \sum_{i=1}^{t} v'_i \, v_i$$

t is the dimension of the vectors, v'_i and v_i denote the i-th components of V' and V, and the norm ||V'|| in the denominator is defined as

$$\|V'\| = \sqrt{V' \cdot V'}$$

and the norm ||V|| in the denominator is defined as

$$\|V\| = \sqrt{V \cdot V}$$

step 5-2, comparing the cosine similarity value with a predefined similarity threshold to determine whether the message carries classified information, and filtering classified documents.
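The feature-extraction and weighting steps recited in claim 1 (steps 2-2-1 to 2-3-1) can be illustrated with the following sketch. It assumes that segmentation has already produced (phrase, part-of-speech) pairs, that nouns carry the tag "n", and that the logarithm is the natural logarithm; none of these details is fixed by the text, and the function names are hypothetical.

```python
import math
from collections import Counter


def extract_keywords(tokens):
    """Map each retained phrase to its word frequency.

    tokens: iterable of (phrase, part_of_speech) pairs from a word segmenter.
    """
    nouns = [phrase for phrase, pos in tokens if pos == "n"]  # step 2-2-1
    freq = Counter(nouns)                                     # step 2-2-2
    return {phrase: count for phrase, count in freq.items()
            if len(phrase) > 1       # step 2-2-3: drop single-character keywords
            and count > 1}           # step 2-2-4: drop keywords seen only once


def tfidf_weights(keywords, doc_freq, total_docs):
    """Step 2-3-1: d = t * log(N / n) for every retained phrase."""
    return {phrase: count * math.log(total_docs / doc_freq[phrase])
            for phrase, count in keywords.items()}
```

The resulting weights, listed in a fixed phrase order, form the components of the feature vector V' that is compared against the sensitive-library vector V.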