CN107967208B

Info

Publication number: CN107967208B
Application number: CN201610915633.4A
Authority: CN
Inventors: 陈林; 潘陶; 陈芝菲; 李言辉; 徐宝文
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2016-10-20
Filing date: 2016-10-20
Publication date: 2020-01-17
Anticipated expiration: 2036-10-20
Also published as: CN107967208A

Abstract

Translated from Chinese

The present invention is a deep-neural-network-based method for detecting defective resource-sensitive code in Python, comprising the following steps: 1) obtain the source code of the historical versions and of the version under test of the same software; 2) use type inference to extract the resource-sensitive code patterns of each version; 3) extract the relevant features of the resource-sensitive code patterns; 4) compute the per-feature similarities between defective code patterns and safe code patterns, and between defective code patterns and code patterns under test, generate feature vectors, and obtain a training set and a test set; 5) train a deep neural network model on the training set to merge the features, then use the model to compute the relevance of each pattern pair in the test set and rank the pairs; 6) during program development and maintenance, use the ranking to warn about resource object operations that may go wrong, thereby assisting development and maintenance. The invention addresses the current lack of automated methods for identifying resource-sensitive code and detecting defective code in Python, thereby reducing software risk, improving software quality, and improving the efficiency with which developers and maintainers develop and maintain software.

Description

A Python resource-sensitive defective code detection method based on a deep neural network

Technical Field

The invention belongs to the field of computer technology, in particular the field of software technology, and more specifically relates to a deep-neural-network-based method for detecting defective resource-sensitive code in Python.

Background

As software application technology continues to develop, users demand ever higher software quality, and software developers use a variety of techniques to meet those demands. Resource-sensitive code is a code block or statement that handles a resource object. During software development and maintenance, much resource-sensitive code carries latent exception risks that are often discovered only during maintenance. As agile development has grown in popularity, versions change frequently, so resource-sensitive code that suddenly raises an exception is a common occurrence. The most traditional way to handle exceptions in resource-sensitive code is to catch and handle them with the try-except keywords. During development, however, developers often neglect exception handling, which leads to unexpected exceptions and application crashes. Identifying and detecting dangerous operations on resource objects is therefore an indispensable step in program development and maintenance: it effectively improves program quality and helps developers and maintainers find problems in time and devise more effective solutions.

Python has become a programming language that developers strongly favor. Python applications keep emerging in the major open-source communities, forming a large ecosystem. Python is an object-oriented, interpreted programming language that is concise, elegant, and practical. As a dynamic language, it is widely used for Internet applications, graphical user interfaces, and embedded scripting, and therefore deals with many kinds of resources. Because of Python's dynamic nature, developers often change the type of a variable at run time, which leads to many unsafe operations. In addition, operations on resource objects in Python frequently raise exceptions for reasons such as resource configuration, and the problems caused by such resource-sensitive operations are not easy to find. At present, developers control these code defects with condition checks, exception handling, and similar means.

Current methods for identifying and detecting resource objects fall roughly into two categories. The first is based on program analysis data and locates dangerous operations on resource objects through logical and semantic analysis. The second, by contrast, uses information retrieval and relies on machine learning to identify resource objects and detect defective code. The first kind of method, being based on semantic analysis, produces results quickly but suffers from low accuracy and from semantic rules that are hard to define. The second kind extracts features from context and other sources and then learns and predicts with machine learning; it produces results more slowly but is more accurate and more practical. The present invention uses machine learning for detection.

During maintenance, a single commit by a developer may fix several instances of the same defect at once, so the defective code within one version is strongly correlated. The present invention uses historical fix information to distinguish defective code from safe code and exploits the correlation among defective code to infer that code similar to historically defective code is also likely to be defective. On this basis it provides a deep-neural-network-based method for detecting defective resource-sensitive code in Python.

Summary of the Invention

The invention provides a deep-neural-network-based method for detecting defective resource-sensitive code in Python. By mining and comparing the defective code that was fixed in historical versions, the method finds similar code in the code under test and alerts developers and maintainers that the same problem may be present, so that it can be fixed as early as possible. The method collects the historical versions and the version under test of the same Python software from a software version control system. For the historical versions, it identifies resource-sensitive code patterns through type inference and extracts the corresponding pattern features; using historical fix information, it combines the defective code patterns and safe code patterns into related pattern pairs and unrelated pattern pairs, then computes feature similarities to generate feature vectors, which yields the training set. For the version under test, it extracts patterns and features in the same way, pairs the defective code patterns of the historical versions with the patterns of the version under test, and computes feature similarities to generate feature vectors, which yields the test set.

Next, the deep neural network model is trained on the training set, and the trained model merges the features of the test set to produce the relevance between the code under test and the defective code. Finally, the pairs are ranked by relevance to identify potentially dangerous code in the code under test that closely resembles resource-sensitive code fixed in historical versions, so that suggestions can be made to developers and maintainers and exceptions can be prevented. The invention aims to address the current lack of automated methods for identifying resource-sensitive code and detecting defective code in Python, thereby reducing software risk, improving software quality, and improving developers' efficiency.

To achieve the above objective, the invention proposes a deep-neural-network-based method for detecting defective resource-sensitive code in Python, comprising the following steps:

1) Obtain the source code of the historical versions and of the version under test of the same software;

2) Use type inference to extract the resource-sensitive code patterns of each version;

3) Extract the relevant features of the resource-sensitive code patterns;

4) Compute the per-feature similarities between defective code patterns and safe code patterns, and between defective code patterns and code patterns under test, generate feature vectors, and obtain a training set and a test set;

5) Train a deep neural network model on the training set to merge the features, then use the model to compute the relevance of each pattern pair in the test set and rank the pairs;

6) During program development and maintenance, use the ranking to warn about resource object operations that may go wrong, assisting development and maintenance.

Further, step 1) comprises the following specific steps:

Step 1)-1: Initial state;

Step 1)-2: Using file names and version information, obtain from an open-source version control system the fixed source programs in the historical versions of the same software and the source programs of the version under test;

Step 1)-3: Collection of the source programs of the different software versions is complete.

Further, step 2) comprises the following specific steps:

Step 2)-1: Initial state;

Step 2)-2: Perform lexical and syntactic analysis on the source programs of each version, using the ast module of the Python standard library to generate the abstract syntax tree for each version;

Step 2)-3: Following the abstract grammar defined in the Python standard library, wrap each Python type; each wrapped type has a mapping table, named table, that contains the names of that type's internal attributes or API interfaces.

Step 2)-4: Traverse the abstract syntax tree and, using the wrapped types and modules, infer the possible types of each variable. Extract the variables whose type is a resource object type.

Step 2)-5: For a variable whose type was not recognized: if the variable is an interface name and one of its parameters has a resource object type, identify it as a resource object type; otherwise, if the variable is a member of another variable and the variable it is accessed through has a resource object type, also identify it as a resource object type.

Step 2)-6: Take the code fragments that use variables of a resource object type as resource-sensitive code patterns.

Step 2)-8: Collection of resource-sensitive code pattern information is complete.
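Steps 2)-2 to 2)-6 can be illustrated with a minimal sketch built on the standard-library ast module. It is not the patented type inference: the RESOURCE_FACTORIES table and all function names are hypothetical, and a variable is treated as a resource object only when it is bound directly to the result of one of the listed calls.

```python
import ast

# Hypothetical table of calls assumed to return resource objects.
RESOURCE_FACTORIES = {"open", "socket", "connect"}


def call_name(node):
    """Return the called name for f(...) or obj.f(...), otherwise None."""
    if isinstance(node, ast.Call):
        if isinstance(node.func, ast.Name):
            return node.func.id
        if isinstance(node.func, ast.Attribute):
            return node.func.attr
    return None


def find_resource_patterns(source):
    """Return (resource variable names, sorted [(line, variable, method)])."""
    tree = ast.parse(source)
    resource_vars = set()
    # Pass 1: collect variables bound to the result of a resource factory call.
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and call_name(node.value) in RESOURCE_FACTORIES:
            for target in node.targets:
                if isinstance(target, ast.Name):
                    resource_vars.add(target.id)
        elif isinstance(node, ast.withitem) and call_name(node.context_expr) in RESOURCE_FACTORIES:
            if isinstance(node.optional_vars, ast.Name):
                resource_vars.add(node.optional_vars.id)
    # Pass 2: every method call on such a variable is a candidate pattern.
    patterns = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id in resource_vars):
            patterns.append((node.lineno, node.func.value.id, node.func.attr))
    return resource_vars, sorted(patterns)


SAMPLE = 'f = open("data.txt")\ntext = f.read()\nf.close()\n'
resource_vars, patterns = find_resource_patterns(SAMPLE)
```

A fuller implementation along the lines of step 2)-5 would also propagate the resource type through function arguments and attribute access, which this sketch leaves out.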

Further, step 3) comprises the following specific steps:

Step 3)-1: Initial state;

Step 3)-2: Using the resource code pattern information, locate where the resource object is operated on, and extract the API (parameter types, parameter order), the resource name, the call structure, and the internal structure of the function as features.

Step 3)-3: Normalize the naming of the API (parameter types, parameter count), resource name, call structure, and function structure.

Step 3)-4: Extraction of the feature information of the resource code patterns is complete.

Further, step 4) comprises the following specific steps:

Step 4)-1: Initial state;

Step 4)-2: Divide the identified resource-sensitive code patterns into three classes: defective code patterns, safe code patterns, and code patterns under test;

Step 4)-3: For the historical versions, use the historical fix information to pair similar defective code patterns with one another, forming related pattern pairs; pair each defective code pattern with the safe code patterns similar to it, forming unrelated pattern pairs;

Step 4)-4: For the version under test, pair the defective code patterns with the code patterns under test, forming pattern pairs under test;

Step 4)-5: Compute the per-feature similarities of each pattern pair and generate a feature vector;

Step 4)-6: The set of feature vectors from the code pattern pairs of the historical versions forms the training set, and the set of feature vectors from the code pattern pairs of the version under test forms the test set;

Step 4)-7: Collection of the training set and test set information is complete.
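The pairing in steps 4)-3 and 4)-4 might be sketched as follows. The sketch assumes, purely for illustration, that defective patterns fixed in the same commit count as "similar" and that every safe pattern is paired with every defective pattern; the text does not specify either rule, and build_pairs and its arguments are hypothetical names.

```python
from itertools import combinations, product


def build_pairs(defects_by_commit, safe_patterns, candidate_patterns):
    """Return (labelled training pairs, unlabelled test pairs).

    defects_by_commit maps a commit id to the defective patterns it fixed.
    Training pairs are (pattern_a, pattern_b, label), label 1 = related.
    """
    related, all_defects = [], []
    for commit in sorted(defects_by_commit):
        defects = defects_by_commit[commit]
        all_defects.extend(defects)
        # Assumption: defects fixed together in one commit are related.
        related.extend((a, b, 1) for a, b in combinations(defects, 2))
    # Defective pattern paired with a safe pattern: unrelated, label 0.
    unrelated = [(d, s, 0) for d, s in product(all_defects, safe_patterns)]
    # Historical defect paired with a pattern under test: to be scored later.
    test_pairs = list(product(all_defects, candidate_patterns))
    return related + unrelated, test_pairs


train_pairs, test_pairs = build_pairs(
    {"c1": ["d1", "d2"], "c2": ["d3"]}, ["s1"], ["t1", "t2"]
)
```

Each pair would then be turned into a feature vector of per-feature similarities, as step 4)-5 describes.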

Further, step 5) comprises the following specific steps:

Step 5)-1: Initial state;

Step 5)-2: Train the deep neural network on the training-set similarity data generated in step 4) to obtain the values of the model parameters;

Step 5)-3: Feed the test set generated in step 4) into the trained deep neural network model to obtain the relevance values;

Step 5)-4: Sort all code pairs by the computed relevance values in descending order, take the top k test pattern pairs as the resource-sensitive code detection result, and mark the code of the version under test in those pairs as possibly defective resource-sensitive code.

Step 5)-5: Marking of possibly defective resource-sensitive code is complete.

Further, step 6) comprises the following specific steps:

Step 6)-1: Initial state;

Step 6)-2: For code marked as resource-sensitive, show developers and maintainers where the related code appears in the historical versions, recommend that they modify it, and offer a fix.

Step 6)-3: During program development and maintenance, the system automatically checks submitted code and issues a warning for operations on potentially dangerous resources.

Step 6)-4: Use the newly submitted program version as historical version data for the next comparison, making the detection results more precise.

Step 6)-5: Reporting of defective resource-sensitive code in the code under test is complete.

The invention merges features with a deep neural network and uses a single standard metric to measure the relevance between the code under test and the defective code in historical versions, so that defective resource-sensitive code blocks can be located down to the level of basic statements. After resource-sensitive code has been identified through type inference, it is fixed automatically on the basis of the solutions applied to similar code in historical versions, and developers and maintainers are notified. In this way resource-sensitive code and its dangerous operations are identified, which improves the efficiency of software development and helps produce high-quality software products.

Description of the Drawings

FIG. 1 is the overall architecture diagram of a Python resource-sensitive defective code detection method based on a deep neural network according to an embodiment of the invention.

FIG. 2 is the flowchart of a Python resource-sensitive defective code detection method based on a deep neural network according to an embodiment of the invention.

FIG. 3 is a schematic diagram of a possible abstract syntax tree for a loop control structure.

Detailed Description

The method first uses a software version control system such as CVS to collect the fixed source code of all historical versions of the same Python software. It then performs lexical and syntactic analysis on the source code of the historical versions and the version under test, performs type inference on the resulting abstract syntax trees, marks the variables involved in resource object operations, and identifies the resource code patterns. Using historical fix information, it selects defective code patterns and safe code patterns from the resource-sensitive code patterns of the historical versions and combines them into related and unrelated pattern pairs. Next, the resource-sensitive code patterns of the version under test are paired with the historical defective code patterns to form test pattern pairs. From the extracted pattern features, the similarity of each pattern pair on each feature is computed and a feature vector is generated, yielding the training set and the test set. The deep neural network model is then trained on the training set, and the trained model merges the features of the test set to give the relevance between each code pattern under test and the historical defective code patterns.

Finally, the pairs are ranked by relevance and the top k pattern pairs are selected as the result; the code under test in those pairs is marked as potentially defective resource-sensitive code, which assists developers and maintainers during development and maintenance and helps prevent exceptions.

To better explain the technical content of the invention, it is described below with reference to the accompanying drawings.

The overall architecture of the invention is shown in FIG. 1 and the flowchart in FIG. 2. The deep-neural-network-based method for detecting defective resource-sensitive code in Python proposed by the invention comprises the following six steps:

Step 1: Obtain the fixed source code of the historical versions of the software and the source code of the version under test. A software version control system such as CVS stores every version of the program, labelled with a version number. The source code of the historical versions and of the version under test of the same Python software can be retrieved by the specified version numbers.

Step 2: Use type inference to extract the resource code patterns from the source code of each version. First, perform lexical and syntactic analysis on the source code obtained in step 1, using the corresponding functions of the ast module in the Python standard library to generate abstract syntax trees. In an abstract syntax tree, every node and subtree corresponds to a source code entity. To support type inference, we wrap the types defined by Python in a number of abstract types, Types. Each of these Types has a table attribute that holds the names in the abstract syntax tree associated with the attributes or calls of that type, such as append. For each node in the abstract syntax tree we set a type and a value, together with a unique identifier id. For a node x, t(x) denotes the node's type, for example an assignment statement; v(x) denotes the node's value, which is its textual representation, for example the content of that assignment statement; and Id(x) denotes the node's unique identifier, used to tell nodes apart.

For example, an assignment statement is a simple statement and corresponds to a leaf node of the abstract syntax tree whose type is "assign statement" and whose value is the content of the assignment statement. A while loop statement corresponds to a subtree whose root node has type "while statement" and whose value is the loop condition; its child nodes are the statements inside the while body and the statements that exit the loop. FIG. 3 shows a possible abstract syntax tree for a loop statement structure.
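The node labelling described above can be sketched with the ast module as follows. This is an illustration only: label_nodes is a hypothetical helper, the identifier is simply the position of the node in ast.walk order, and ast.unparse requires Python 3.9 or later. Note that the standard ast module represents an assignment as an Assign node with children of its own, whereas the text treats it as a leaf carrying its text as value.

```python
import ast


def label_nodes(source):
    """Return [(id, type, value)] for assignment and while nodes.

    Mirrors Id(x), t(x) and v(x): a unique identifier, a type label and
    the textual representation of the node.
    """
    tree = ast.parse(source)
    labelled = []
    for node_id, node in enumerate(ast.walk(tree)):
        if isinstance(node, ast.While):
            # The value of a while node is its loop condition.
            labelled.append((node_id, "while statement", ast.unparse(node.test)))
        elif isinstance(node, ast.Assign):
            # The value of an assignment node is the statement text.
            labelled.append((node_id, "assign statement", ast.unparse(node)))
    return labelled


LOOP = "i = 0\nwhile i < 3:\n    i = i + 1\n"
labels = label_nodes(LOOP)
```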

Finally, traverse the entire abstract syntax tree in post-order and infer the type of each variable from the type information in the tree and from the per-type attribute information mapped by the table in each node; mark the code fragments that use variables inferred to be resource objects as resource-sensitive code patterns. A resource-sensitive code pattern is a code fragment that operates on a resource object (a file object, a graphical user interface object, and so on).

For example, in a code fragment where self is a resource object and the switch_backends function is called to operate on it, the fragment is a resource-sensitive code pattern.

Step 3: Step 2 extracted the resource code patterns from the source code. The features of resource-sensitive code patterns extracted by the invention are: the API (parameter types, parameter order), the resource name, the call structure, and the function structure.

The names of the extracted features are then normalized. For the API feature, the feature similarity is computed from the parameter types and parameter order; for the resource name feature, it is computed from the word sequence in the resource name; for the call structure feature, the call structure similarity is used as the feature similarity; for the function structure feature, the function structure similarity is used as the feature similarity.

Step 4: First, for the historical versions, use the historical fix information to pair similar defective code patterns with one another, forming related pattern pairs, and pair each defective code pattern with the safe code patterns similar to it, forming unrelated pattern pairs. For the version under test, pair the defective code patterns with the code patterns under test, forming pattern pairs under test. With the pattern feature information extracted in step 3, the per-feature similarities of each pattern pair can be computed.

The API feature similarity uses the rVSM algorithm. For parameter types, the weights are computed with the TF-IDF algorithm, where TF is the number of times the type occurs in the API, Total_api is the total number of APIs, and Contain_type is the number of APIs that contain the type. The invention uses these values as the weights of the feature vector built from the API. The type order is measured with 2-grams, a measure that is robust to changes in the type order. The type-order measure and the parameter-type measure are combined into one feature vector. For the feature vectors generated from the two versions, the similarity is computed with the rVSM algorithm, in which it is expressed as the cosine between the historical-version feature vector a and the feature vector b of the version under test:

$$\cos(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \, \lVert \vec{b} \rVert}$$

where $\vec{a}$ and $\vec{b}$ denote the historical-version feature vector a and the feature vector b of the version under test, respectively, and $\vec{a} \cdot \vec{b}$ denotes the inner product of the two feature vectors.
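The API feature similarity might be sketched as below. The weight formula TF × log(Total_api / Contain_type) is the textbook TF-IDF form and is an assumption here, as is the way the 2-gram counts are appended to the weight vector. The sketch uses a plain cosine and does not reproduce any further adjustments that rVSM may apply. All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(api_types, corpus):
    """Weight each parameter type of one API (assumed standard TF-IDF form)."""
    total_api = len(corpus)
    weights = {}
    for type_name, tf in Counter(api_types).items():
        contain_type = sum(1 for api in corpus if type_name in api)
        weights[type_name] = tf * math.log(total_api / max(contain_type, 1))
    return weights


def feature_vector(api_types, corpus):
    """Combine TF-IDF type weights with 2-gram counts of the type order."""
    vector = dict(tfidf_weights(api_types, corpus))
    for gram, count in Counter(zip(api_types, api_types[1:])).items():
        vector[gram] = float(count)  # tuple keys cannot clash with type names
    return vector


def cosine(a, b):
    """Cosine of two sparse vectors stored as dicts; 0.0 if either is zero."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Each API is represented by its ordered list of parameter types.
CORPUS = [["str", "int"], ["str", "file"], ["file", "int", "int"]]
vec_a = feature_vector(CORPUS[0], CORPUS)
vec_b = feature_vector(CORPUS[1], CORPUS)
similarity = cosine(vec_a, vec_b)
```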

The resource name feature similarity uses a text similarity algorithm. First, each resource name is parsed into a sequence of words. The similarity is then computed for the resource name R₁ in the historical version and the resource name R₂ in the version under test, where lcs(R₁, R₂) denotes the number of sub-words of R₁ that occur in R₂. This yields a quantized value for the resource name, from which the corresponding vector is generated. Example pairs are "length" and "getLength", and "getLength" and "getlength".
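One possible reading of the resource name comparison is sketched below. The tokenization rule (camelCase and underscore splitting), the whole-word matching, and the normalization by the longer word list are all assumptions made for illustration; they are not taken from the text, so the values this sketch produces should not be read as the values of the examples above.

```python
import re


def split_words(name):
    """Split an identifier such as 'getLength' or 'max_size' into lower-case words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [part.lower() for part in parts]


def shared_word_count(r1, r2):
    """Count the words of r1 that also occur in r2 (a stand-in for lcs(R1, R2))."""
    words_r2 = set(split_words(r2))
    return sum(1 for word in split_words(r1) if word in words_r2)


def name_similarity(r1, r2):
    """Hypothetical normalization: shared words divided by the longer word list."""
    longest = max(len(split_words(r1)), len(split_words(r2)))
    return shared_word_count(r1, r2) / longest if longest else 0.0
```

Under this reading the result depends strongly on how a name is split: an all-lower-case name such as "getlength" stays a single word, so a substring-based count would behave differently.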

For the function structure feature similarity and the call structure feature similarity, the abstract syntax trees obtained in step 2 are traversed, and the similarity is obtained by computing a probability from the number of identical tree nodes. Finally, the set of feature vectors from the code pattern pairs of the historical versions forms the training set, and the set of feature vectors from the code pattern pairs of the version under test forms the test set.
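A rough sketch of a node-count-based structure similarity follows. The text does not say how identical nodes are matched or how the probability is formed, so the sketch assumes that nodes match when they have the same AST node type and that the similarity is the shared count divided by the size of the larger tree. Both choices and the function names are hypothetical.

```python
import ast
from collections import Counter


def node_type_counts(source):
    """Count the AST node types that occur in a code fragment."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def structure_similarity(source_a, source_b):
    """Shared node-type count divided by the node count of the larger tree."""
    counts_a = node_type_counts(source_a)
    counts_b = node_type_counts(source_b)
    shared = sum((counts_a & counts_b).values())  # multiset intersection
    total = max(sum(counts_a.values()), sum(counts_b.values()))
    return shared / total if total else 0.0
```

Comparing node types only ignores the tree shape; a variant that compares parent-child type pairs would be stricter.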

步骤5：通过步骤4，我们可以得到由特征向量组成的训练集和测试集。由于其无法整体表示是否与某个危险资源对象操作相关，因而，这里我们采用深度神经网络的算法来实现特征合并，并计算相关度。Step 5: Through step 4, we can get the training set and test set consisting of feature vectors. Since it cannot represent whether it is related to the operation of a certain dangerous resource object as a whole, here we use the algorithm of deep neural network to realize feature merging and calculate the correlation.

首先，使用生成的训练集合训练深度神经网络。本发明设计的神经网络分为三层，分别为输入层，隐藏层-1，隐藏层-2和输出层。其中隐藏层-1是输入层节点个数的两倍，隐藏层-2节点是输入层节点个数的一半。隐藏层-1每个节点H1_i的计算公式如下：First, train a deep neural network using the generated training set. The neural network designed by the present invention is divided into three layers, which are input layer, hidden layer-1, hidden layer-2 and output layer respectively. The hidden layer-1 is twice the number of nodes in the input layer, and the hidden layer-2 node is half the number of nodes in the input layer. The calculation formula of each node H1_i of hidden layer-1 is as follows:

其中w_1i，b为需要训练的参数，Input_i为输入节点值。同理隐藏层-2通过该公式由隐藏层-1推导得出。对于w和b的训练，本发明采用批量梯度下降的方法，步骤如下：Where w_1i , b are the parameters to be trained, and Input_i is the input node value. Similarly, hidden layer-2 is derived from hidden layer-1 through this formula. For the training of w and b, the present invention adopts the method of batch gradient descent, and the steps are as follows:

1)初始化：Δw^(l)＝0，Δb^(l)＝0，w和b则随机初始化为比较小的数值；1) Initialization: Δw^(l) = 0, Δb^(l) = 0, w and b are randomly initialized to relatively small values;

2) Let the number of iterations be m. For i from 1 to m, compute the gradients with the backpropagation (BP) algorithm and accumulate them into Δw^(l) and Δb^(l);

3) Update the parameters, where λ is an optional parameter that is set to 2 in the present invention.

The deep neural network model is trained with the above procedure.
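The three training steps can be sketched as below. The accumulation and update formulas are not reproduced in the text, so this sketch assumes the common weight-decay form w ← w − α[(1/m)Δw + λw] and b ← b − α(1/m)Δb, where the learning rate α (`alpha`) is an assumption that the text does not mention. To stay short, a single sigmoid unit with a directly computed gradient stands in for backpropagation through the full network; all function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def example_gradient(w, b, x, y):
    """Gradient of 0.5 * (predict - y)^2 for one example (stand-in for BP)."""
    out = predict(w, b, x)
    delta = (out - y) * out * (1.0 - out)
    return [delta * xi for xi in x], delta


def batch_step(w, b, examples, alpha, lam):
    # 1) initialise the accumulators to zero
    dw = [0.0] * len(w)
    db = 0.0
    # 2) accumulate the per-example gradients
    for x, y in examples:
        gw, gb = example_gradient(w, b, x, y)
        dw = [acc + g for acc, g in zip(dw, gw)]
        db += gb
    # 3) update; lam scales the assumed weight-decay term (bias is not decayed)
    m = len(examples)
    new_w = [wi - alpha * (dwi / m + lam * wi) for wi, dwi in zip(w, dw)]
    new_b = b - alpha * (db / m)
    return new_w, new_b


def mean_loss(w, b, examples):
    return sum(0.5 * (predict(w, b, x) - y) ** 2 for x, y in examples) / len(examples)


rng = random.Random(0)
examples = [([0.9, 0.8], 1.0), ([0.1, 0.2], 0.0), ([0.8, 0.7], 1.0), ([0.2, 0.1], 0.0)]
w = [rng.uniform(-0.01, 0.01) for _ in range(2)]  # small random initial values
b = rng.uniform(-0.01, 0.01)
loss_before = mean_loss(w, b, examples)
for _ in range(200):
    # lam=0.0 only keeps this toy demo simple; the text states lambda = 2
    w, b = batch_step(w, b, examples, alpha=0.5, lam=0.0)
loss_after = mean_loss(w, b, examples)
```

On this toy data the mean loss after the updates is lower than before them; in a real setting `example_gradient` would be replaced by backpropagation through both hidden layers.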

In the detection phase, the feature vector of each pattern pair in the test set is fed to the network as input and propagated through the node formula above. The final output is a single relevance value that indicates how strongly the two patterns in the pair are related. The non-linear neural network approach used here performs markedly better than linear information-retrieval methods and reflects the level of relevance more faithfully.
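A minimal sketch of this detection-time computation is given below. The layer sizes follow the description (hidden layer-1 has twice the input size, hidden layer-2 half of it, and there is a single output), but the node formula itself is not reproduced in the text, so a weighted sum followed by a sigmoid is assumed. The names `build_network` and `relevance` are hypothetical, and the weights here are random rather than trained.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """One fully connected layer with small random weights and biases."""
    weights = [[rng.uniform(-0.01, 0.01) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-0.01, 0.01) for _ in range(n_out)]
    return weights, biases


def build_network(n_inputs, rng):
    # input -> hidden layer-1 (2n) -> hidden layer-2 (n/2) -> single output
    sizes = [n_inputs, 2 * n_inputs, max(1, n_inputs // 2), 1]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def layer_forward(weights, biases, inputs):
    # assumed node formula: sigmoid of the weighted input sum plus bias
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def relevance(network, features):
    """Propagate one feature vector through all layers; return the output value."""
    activations = features
    for weights, biases in network:
        activations = layer_forward(weights, biases, activations)
    return activations[0]


network = build_network(4, random.Random(0))
# four feature similarities of one pattern pair (illustrative values)
value = relevance(network, [0.8, 0.5, 0.9, 0.7])
```

With the assumed sigmoid output the relevance value always lies between 0 and 1, which makes values from different pattern pairs directly comparable for ranking.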

In the deep neural network, the weights of the links between the input layer and the intermediate layers, like the other weights in the network, are learned from historical-version data. Extensive training adjusts the links and weights between neurons so as to optimize the output.

The resulting relevance values are sorted in descending order, and the top k pattern pairs are selected as the output.
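The ranking step is small enough to show directly. In this sketch the pair identifiers and scores are made-up illustrative values, and `top_k_pairs` is a hypothetical helper name.

```python
import heapq


def top_k_pairs(scored_pairs, k):
    """Return the k pairs with the highest relevance, in descending order."""
    return heapq.nlargest(k, scored_pairs, key=lambda item: item[1])


# (pattern-pair identifier, relevance value) tuples; values are illustrative
scored = [("pair_a", 0.12), ("pair_b", 0.91), ("pair_c", 0.47)]
best = top_k_pairs(scored, 2)
```

`heapq.nlargest` avoids sorting the whole list when k is much smaller than the number of pattern pairs.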

Step 6: For the sensitive code under test that has a high relevance value, developers and maintainers are told where the code occurs and which historical resource operations it is related to, are shown how exceptions on this resource were handled previously, and receive a warning. Python source code that has already been checked is used as historical-version data in the next detection run, which improves detection accuracy. Newly committed Python source code is checked automatically, and an alert is sent to developers and maintainers according to the result.

For example, in a historical version, an operation on a resource object appears as follows:

In this historical version, the self variable is a resource object, and the statement reads from it. To guard against exceptions, the developer wrapped the statement in try/except exception handling.

The source code of the version under test contains the following statement:

    def read_bytes(self, num_bytes, callback=None, streaming_callback=None,
                   partial=False):
        self._try_inline_read()

This code also reads from a resource object through the same API, but without exception handling. When the two fragments are combined into a code pair, the method of the present invention determines whether they are related, and hence whether the code under test is resource-sensitive code. It then prompts developers and maintainers to handle the code and supplies the related historical-version code information.
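The difference between the two fragments can be made concrete with a small AST check. This is not the patented detection method (which ranks pattern pairs with a neural network); it only illustrates how a guarded and an unguarded call to the same API differ structurally. The historical snippet below is hypothetical, since the original one is not reproduced in the text, and `unguarded_calls` is an illustrative helper name.

```python
import ast

# Hypothetical historical fragment: the read is wrapped in try/except.
HISTORICAL_SNIPPET = '''
def read_until(self, delimiter):
    try:
        self._try_inline_read()
    except Exception:
        self.close()
        raise
'''

# Fragment from the version under test, as quoted in the text.
TESTED_SNIPPET = '''
def read_bytes(self, num_bytes, callback=None, streaming_callback=None,
               partial=False):
    self._try_inline_read()
'''


def unguarded_calls(source, method_name):
    """Line numbers of calls to `method_name` that are not inside a try body."""
    tree = ast.parse(source)
    guarded = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Try):
            for stmt in node.body:
                guarded.update(id(inner) for inner in ast.walk(stmt))
    hits = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == method_name
                and id(node) not in guarded):
            hits.append(node.lineno)
    return sorted(hits)


historical_hits = unguarded_calls(HISTORICAL_SNIPPET, "_try_inline_read")
tested_hits = unguarded_calls(TESTED_SNIPPET, "_try_inline_read")
```

The historical fragment yields no unguarded call, while the fragment under test yields one, which is the kind of location a warning would point to.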

In summary, the present invention provides a Python resource-sensitive defect code detection method based on a deep neural network. It addresses the current lack of automated methods for detecting resource-sensitive code and identifying dangerous operations in Python, improves the quality of software applications, and keeps the software evolution process under control.

Claims

1. A Python resource-sensitive defect code detection method based on a deep neural network, characterized in that: historical versions and a version under test of the same Python software are collected from a software version control system; for the historical versions, resource-sensitive code patterns are identified by type inference and the corresponding pattern features are extracted; according to historical repair information, the defective code patterns and safe code patterns are combined into related pattern pairs and unrelated pattern pairs, and feature similarities are computed to generate feature vectors, yielding a training set; for the version under test, the patterns and their features are extracted in the same way, the historical-version defective code patterns and the patterns of the version under test are combined into pattern pairs, and feature similarities are computed to generate feature vectors, yielding a test set; the training set is then used to train a deep neural network model, and the trained model merges the features of the test set to obtain the relevance between the code under test and the defective code; finally, the pairs are sorted by relevance, the top k related code pairs are selected as the result, the code under test in each selected pair is marked as potentially defective resource-sensitive code, dangerous resource-object operations are thereby detected, and auxiliary information is provided; the method comprises the following steps:

1) Obtain the source code of the historical versions and of the version under test of the same software; the software version control system stores every committed version of the software with standardized version numbers, so the source code of the historical versions and of the version under test of the same Python software can be retrieved by the specified version numbers;

2) Extract the resource-sensitive code patterns of each version by type inference; perform lexical and syntactic analysis on the source code of the historical versions and the version under test collected in step 1, generate the corresponding abstract syntax trees with the ast module of the Python standard library, abstract the Python types, set a type and a value for each node, and then extract the resource-sensitive code patterns by global type inference;

A resource-sensitive code pattern is a code fragment that operates on a resource object;

Definition 1: The Python standard library is distributed with the Python language and contains built-in modules that provide various system-level functions;

Definition 2: Type inference is a method of inferring the types of variables in a dynamic language through static analysis of the source code;

Definition 3: type identifies the node type information in the abstract syntax tree; its possible values come from the abstract grammar defined by Python;

Definition 4: value is the textual representation of the content of a node in the abstract syntax tree;
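Definitions 3 and 4 can be illustrated with the ast module named in step 2. The sketch below is only one possible reading: it takes the node class name as the type and the source text covered by the node as the value. The function name `typed_nodes` and the sample source are hypothetical, and the global type inference step itself is not shown.

```python
import ast


def typed_nodes(source):
    """Return (type, value) for every node of the abstract syntax tree."""
    rows = []
    for node in ast.walk(ast.parse(source)):
        node_type = type(node).__name__  # name from Python's abstract grammar
        # Source text covered by the node; empty for nodes without a position.
        value = ast.get_source_segment(source, node) or ""
        rows.append((node_type, value))
    return rows


sample = "f = open(path)\ndata = f.read()\n"
rows = typed_nodes(sample)
```

For the sample above the result contains, among others, a `Call` node with value `open(path)` and an `Attribute` node with value `f.read`, which is the kind of node a resource-pattern extractor would inspect.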

Definition 1: For the API feature, the feature similarity is computed from the parameter types and the parameter order;

Definition 2: For the resource-name feature, the feature similarity is computed from the word sequence in the resource name;

Definition 3: For the call-structure feature, the call-structure similarity is used as the feature similarity;

Definition 4: For the function-structure feature, the function-structure similarity is used as the feature similarity;

4) Compute the feature similarities between defective code patterns and safe code patterns and between defective code patterns and code patterns under test, generate feature vectors, and obtain the training set and the test set; for the historical versions, similar defective code patterns are paired with each other according to the historical repair information to form related pattern pairs, and each defective code pattern is paired with the safe code patterns similar to it to form unrelated pattern pairs; for the version under test, defective code patterns are paired with code patterns under test to form pattern pairs under test; then, from the feature information extracted in step 3, the feature similarities of each pattern pair are computed and a feature vector is generated; finally, the feature vectors of the historical-version code pattern pairs form the training set, and the feature vectors of the code pattern pairs of the version under test form the test set;

Definition 1: A defective code pattern is a resource-sensitive code pattern that the historical repair information shows was subsequently fixed;

Definition 2: A safe code pattern is a resource-sensitive code pattern that is similar to a defective code pattern but in which no defect has been found;

Definition 3: The API feature similarity is computed with the VSM algorithm; for the parameter types, the weights are computed with the TF-IDF algorithm, using the following formula:

where TF is the frequency with which the type occurs in the API, Total_api is the total number of APIs, and Contain_type is the number of APIs that contain the type; the method uses these values as the weights of the feature vector built from the API; in addition, the type order is measured with 2-grams, which is robust to changes in the type order, and the type-order measure and the parameter-type measure together form one feature vector; for the feature vectors generated from the two versions, the similarity is computed with the VSM algorithm; in this method, the similarity is expressed by the cosine distance between the historical-version feature vector a and the feature vector b of the version under test, using the following formula:

where a and b denote the historical-version feature vector a and the feature vector b of the version under test, respectively, and a · b denotes the inner product of the two feature vectors;
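Definition 3 can be sketched as follows. The formulas themselves are not reproduced in the text, so the sketch assumes the standard forms: a TF-IDF weight of TF × log(Total_api / Contain_type) and the cosine a · b / (‖a‖ ‖b‖). How the 2-gram counts are combined with the TF-IDF weights is also an assumption (they are simply added as further vector components), and all function names and the sample API signatures are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(api_types, all_apis):
    """Assumed weight per parameter type: TF * log(Total_api / Contain_type)."""
    total_api = len(all_apis)
    weights = {}
    for type_name, tf in Counter(api_types).items():
        # assumes api_types is itself one of all_apis, so contain_type >= 1
        contain_type = sum(1 for api in all_apis if type_name in api)
        weights[type_name] = tf * math.log(total_api / contain_type)
    return weights


def feature_vector(api_types, all_apis):
    """TF-IDF weights of the types plus 2-gram counts of the type order."""
    vector = dict(tfidf_weights(api_types, all_apis))
    for pair, count in Counter(zip(api_types, api_types[1:])).items():
        vector[pair] = float(count)  # tuple keys cannot clash with type names
    return vector


def cosine(a, b):
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Each API is represented by the ordered list of its parameter types.
apis = [["file", "int"], ["file", "int", "bool"], ["socket", "bytes"], ["int", "file"]]
vectors = [feature_vector(api, apis) for api in apis]
close = cosine(vectors[0], vectors[1])
far = cosine(vectors[0], vectors[2])
```

In this toy data the first two signatures share types and a 2-gram and therefore score higher than the first and third, which share nothing and score 0.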

Definition 4: The resource-name feature similarity is computed with a text similarity algorithm; first, each resource name is parsed into a sequence of words; then, for a resource name in the historical version and a resource name in the version under test, the similarity is computed with the following formula:

where lcs(R₁, R₂) denotes the number of subwords of R₁ that occur in R₂; this gives a quantified value for the resource name, from which the corresponding vector is generated;
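The resource-name measure can be sketched as below. The word-splitting rule (underscores and camel case) and the normalisation by the longer word sequence are assumptions, since the formula is not reproduced in the text; only the count of words of R₁ that occur in R₂ comes from the definition. The function names are hypothetical.

```python
import re


def split_words(resource_name):
    """Split snake_case or camelCase names into lower-case words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", resource_name)
    return [part.lower() for part in parts]


def shared_word_count(r1, r2):
    """Number of words of r1 that also occur in r2 (the lcs term in the text)."""
    words2 = set(split_words(r2))
    return sum(1 for word in split_words(r1) if word in words2)


def name_similarity(r1, r2):
    # Assumed normalisation: divide by the length of the longer word sequence.
    longest = max(len(split_words(r1)), len(split_words(r2)))
    return shared_word_count(r1, r2) / longest if longest else 0.0


similar = name_similarity("log_file_handle", "logFile")
different = name_similarity("socket_fd", "log_file")
```

Here `log_file_handle` and `logFile` share two of three words, while `socket_fd` and `log_file` share none.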

Definition 5: The VSM algorithm is the vector space model, an algorithm for computing similarity;

5) Train the deep neural network model on the training set to merge the features, then use the model to compute and rank the relevance of the pattern pairs in the test set; the deep neural network model is trained on the training set generated in step 2), the trained model is then applied to the test set generated in step 2) to merge the features and compute the relevance, and finally the relevance values between defective code patterns and code patterns under test are sorted in descending order and the top k code pairs are selected as the output;

6) During program development and maintenance, use the relevance ranking to flag resource-object operations that may fail, in support of development and maintenance; for the resource-sensitive code under test that has a high relevance value, developers and maintainers are told where the code occurs and which historical resource operations it is related to, are shown how exceptions on this resource were handled previously, and receive a warning; Python source code that has already been checked is used as historical-version data in the next detection run, which improves detection accuracy; newly committed Python source code is checked automatically, and an alert is sent to developers and maintainers according to the result.