技术领域technical field
本发明属于计算机编程技术领域,具体涉及一种基于代码替换和正则表达式的静态分析工具改进方法。The invention belongs to the technical field of computer programming, and in particular relates to a static analysis tool improvement method based on code replacement and regular expressions.
背景技术Background technique
当前的静态分析工具的处理流程可以分为四个部分:预处理模块、语法分析模块、缺陷模式匹配模块、缺陷模式处理模块。待检测的工程文件或文件夹中的源代码首先经过预处理器的处理得到便于检测的中间代码。中间代码经过语法分析模块的处理,生成抽象语法树。然后缺陷模式匹配模块将已知的缺陷模式和抽象语法树进行匹配,检测其中的缺陷。检测到的缺陷通过缺陷定位和输出模块形成格式化的输出,根据用户的要求生成运行结果。The processing flow of the current static analysis tool can be divided into four parts: a preprocessing module, a syntax analysis module, a defect pattern matching module, and a defect pattern processing module. The source code in the project file or folder to be detected is first processed by a preprocessor to obtain an intermediate code that is easy to detect. The intermediate code is processed by the syntax analysis module to generate an abstract syntax tree. Then the defect pattern matching module matches the known defect patterns with the abstract syntax tree to detect the defects. The detected defects form a formatted output through the defect location and output module, and generate operation results according to user requirements.
其中各个模块的工作过程具体如下:The working process of each module is as follows:
预处理模块对源代码的处理主要包括三个部分:遍历各级目录获取待检测的包含源代码的文件列表,组合头文件和源代码文件,对宏定义进行预处理。其中第一个步骤是可选的,只有在输入是文件夹时才会启用。The processing of the source code by the preprocessing module mainly includes three parts: traversing directories at all levels to obtain a list of files containing source code to be detected, combining header files and source code files, and preprocessing macro definitions. The first of these steps is optional and only enabled if the input is a folder.
语法分析模块的主要功能是对经过预处理器处理过之后的代码进行语法分析,语法分析模块的主要工作包括词法分析、语法分析和语义分析三大部分。经过处理后的中间代码首先经过词法分析切割成一个个被称为“token”的最小化有意义的单位,然后通过语法分析建立起不同“token”之间的关系,建立起一个双向的“token”链表,最终针对“token”的双向链表进行语义分析,构造抽象语法树。The main function of the grammatical analysis module is to perform grammatical analysis on the code processed by the preprocessor. The main work of the grammatical analysis module includes three parts: lexical analysis, grammatical analysis and semantic analysis. The processed intermediate code is first cut into minimal meaningful units called "tokens" through lexical analysis, and then the relationship between different "tokens" is established through grammatical analysis, and a two-way "token" is established. "linked list, and finally perform semantic analysis on the doubly linked list of "token" to construct an abstract syntax tree.
缺陷模式匹配模块的主要功能是对经过语法分析模块处理之后生成的“token”双向链表进行分析,通过将在工业生产和软件开发中总结出来的多种缺陷模式与“token”双向列表进行对比,找到其中匹配的部分,并且将结果交给缺陷处理模块进行处理。The main function of the defect pattern matching module is to analyze the "token" doubly linked list generated after processing by the syntax analysis module. By comparing various defect patterns summarized in industrial production and software development with the "token" doubly linked list, Find the matching part, and send the result to the defect processing module for processing.
缺陷处理模块的主要功能,是针对缺陷模式匹配模块匹配到的缺陷模式以及“token”信息进行处理,生成用户能够直观看出或者用户自定义格式的缺陷报告。该模块的主要工作包括两点,其中一点是根据缺陷模式匹配模块匹配到的缺陷模式以及“token”信息提取相关的缺陷类型、缺陷位置和缺陷相关变量等。另一点则是需要能够对上一步骤提取的缺陷类型、缺陷位置和缺陷相关变量进行处理,按照直观或者用户指定的格式来进行缺陷报告的生成,供用户对缺陷模式检测的结果进行分析。The main function of the defect processing module is to process the defect pattern and "token" information matched by the defect pattern matching module, and generate a defect report that the user can see intuitively or in a user-defined format. The main work of this module includes two points, one of which is to extract the relevant defect type, defect location and defect-related variables according to the defect pattern matched by the defect pattern matching module and the "token" information. Another point is to be able to process the defect type, defect location and defect-related variables extracted in the previous step, and generate defect reports in an intuitive or user-specified format for users to analyze the results of defect pattern detection.
现有的静态分析工具在检测涉及自增自减操作的整形溢出问题上存在不足,漏报率较高,导致一些严重的缺陷不能及时发现,影响系统的稳定性。除此之外,当前的静态分析工具没有考虑到可能错误的键入这类问题,这类问题在日常编程中非常常见且无法被编译器检测出来,也不容易人工检查,所以由静态分析工具来实现这类问题的检查是非常有必要的。Existing static analysis tools have deficiencies in detecting plastic overflow problems involving auto-increment and auto-decrement operations, and the false positive rate is high, resulting in some serious defects not being discovered in time, which affects the stability of the system. In addition, the current static analysis tools do not take into account such problems as possible wrong typing. This kind of problem is very common in daily programming and cannot be detected by the compiler, and it is not easy to check manually, so it is up to the static analysis tool. It is very necessary to realize the inspection of such problems.
发明内容Contents of the invention
有鉴于此,本发明提供了一种基于代码替换和正则表达式的静态分析工具改进方法,本发明的目的是利用表达式替换来提高整形溢出问题的准确度,降低漏报率,同时使用正则表达式来匹配可能错误的键入的代码,降低因为疏忽写错代码带来的影响,减少人工检查的成本。In view of this, the present invention provides a method for improving static analysis tools based on code replacement and regular expressions. Expressions to match possible incorrectly typed codes, reducing the impact of inadvertently writing wrong codes, and reducing the cost of manual inspections.
为了达到上述目的,本发明的技术方案为:静态分析工具包括预处理模块、语法分析以及缺陷模式匹配模块,预处理模块用于对源代码进行预处理,生成中间代码;语法分析模块用于对中间代码进行语法分析最终获得双向token链表;缺陷模式匹配模块用于将双向token链表与缺陷模式进行对比,找到其中匹配的部分,处理获得静态分析结果;其特征在于,在预处理模块中对源代码进行预处理时,对源代码中出现的自增操作i++、++i与自减操作i--、--i进行如下替换:将i++替换为(i=i+1)-1,将++i替换为i=i+1,将i--换为(i=i-1)+1,将--i替换为i=i-1;其中i为自增操作或者自减操作的变量。In order to achieve the above object, the technical solution of the present invention is: the static analysis tool includes a preprocessing module, a grammatical analysis and a defect pattern matching module, the preprocessing module is used to preprocess the source code, and generates intermediate code; the grammatical analysis module is used for The intermediate code is grammatically analyzed to finally obtain a two-way token linked list; the defect pattern matching module is used to compare the two-way token linked list with the defect pattern, find the matching part, process and obtain the static analysis result; it is characterized in that, in the preprocessing module, the source When the code is preprocessed, the self-increment operations i++, ++i and self-decrement operations i--, --i appearing in the source code are replaced as follows: replace i++ with (i=i+1)-1, replace ++i is replaced by i=i+1, i-- is replaced by (i=i-1)+1, and --i is replaced by i=i-1; where i is the self-increment or self-decrement operation variable.
在缺陷模式匹配模块中增加如下正则表达式:if(%var%=%num%);if(%any%&%any%);scanf(%str%,%var%);Add the following regular expressions in the defect pattern matching module: if (% var % = % num %); if (% any % & % any %); scanf (% str %, % var %);
其中%var%、%num%、%any%、%any%、%str%以及%var%均为变量;if为源代码中的if语句,scanf为源代码中的scanf语句。Among them, %var%, %num%, %any%, %any%, %str% and %var% are variables; if is an if statement in the source code, and scanf is a scanf statement in the source code.
进一步地,预处理模块接收外部输入的包含源代码的文件,首先将文件中的头文件和源代码进行组合,然后对源代码中出现的自增操作i++、++i与自减操作i--、--i进行替换之后,再对宏定义进行预处理,最终获得中间代码。Further, the preprocessing module receives an externally input file containing source code, first combines the header file and source code in the file, and then performs the self-increment operation i++, ++i and self-decrement operation i- After -, --i are replaced, the macro definition is preprocessed, and finally the intermediate code is obtained.
进一步地,语法分析模块包括如下步骤:对中间代码首先进行词法分析,是将由字符串序列组成的源代码划分创建一个用于存储token的双向token链表,然后逐字节地读入源代码中的每个字符,并将其划分为一个个完整的具有明确含义的最小化的单位token,然后将按照token在源代码中出现的顺序将所有的token添加到双向token链表中。Further, the grammatical analysis module includes the following steps: lexical analysis is first performed on the intermediate code, which is to divide the source code composed of character string sequences to create a two-way token linked list for storing tokens, and then read them into the source code byte by byte Each character is divided into a complete minimum unit token with clear meaning, and then all tokens will be added to the two-way token linked list in the order in which the tokens appear in the source code.
进一步地,缺陷模式包括在工业生产和软件开发中总结出来的多种缺陷模式。Further, the defect mode includes various defect modes summarized in industrial production and software development.
有益效果:Beneficial effect:
本发明的目的是利用表达式替换来提高整形溢出问题的准确度,降低漏报率,同时使用正则表达式来匹配可能错误的键入的代码,降低因为疏忽写错代码带来的影响,减少人工检查的成本。The purpose of the present invention is to use expression replacement to improve the accuracy of the plastic overflow problem, reduce the rate of false positives, and use regular expressions to match possible wrongly typed codes, reduce the impact of negligently written wrong codes, and reduce manual labor. The cost of the inspection.
具体实施方式Detailed ways
下面举实施例,对本发明进行详细描述。Examples are given below to describe the present invention in detail.
步骤(1)、预处理Step (1), preprocessing
预处理模块的主要功能是针对用户指定的工程文件或工程文件夹中的源代码进行预处理,生成便于进行语法分析的中间代码。The main function of the preprocessing module is to preprocess the source code in the project file or project folder specified by the user, and generate intermediate code for syntax analysis.
预处理模块对源代码的处理主要包括四个部分:遍历各级目录获取待检测的包含源代码的文件列表,组合头文件和源代码文件,对自增自减操作进行替换,对宏定义进行预处理。其中第一个步骤是可选的,只有在输入是文件夹时才会启用。第三个步骤的替换方案如下:The processing of the source code by the preprocessing module mainly includes four parts: traversing directories at all levels to obtain a list of files containing source code to be detected, combining header files and source code files, replacing auto-increment and auto-decrement operations, and performing macro definition preprocessing. The first of these steps is optional and only enabled if the input is a folder. An alternative to the third step is as follows:
i++=>((i=i+1)-1)i++=>((i=i+1)-1)
++i=>(i=i+1)++i=>(i=i+1)
i--=>((i=i-1)+1)i--=>((i=i-1)+1)
--i=>(i=i-1)--i=>(i=i-1)
以i++替换为((i=i+1)-1)为例说明,假设i=3,i++表示先使用i,再让i加一,即如果有赋值语句a=i++,则执行该语句后a=3,i=4,再看a=((i=i+1)-1)这个式子,先处理(i=i+1),此时i的值为4,之后将该值减一赋值给a,即a=3,此时并没有给i赋值,所以i的值不变为4。所以两个式子的效果是一样的。Take i++ replaced by ((i=i+1)-1) as an example, assuming i=3, i++ means to use i first, and then add one to i, that is, if there is an assignment statement a=i++, after executing the statement a=3, i=4, then look at the formula a=((i=i+1)-1), deal with (i=i+1) first, at this time the value of i is 4, then subtract the value One assigns a value to a, that is, a=3, and i is not assigned a value at this time, so the value of i does not change to 4. So the effect of the two formulas is the same.
而原有方法在记录某个变量的变化的时候,对形如i++之类的自加自减并没有记录相应的值的变化,而对于形如i+1的正常加减能够记录,因此可以检测。When the original method records the change of a certain variable, it does not record the corresponding value change for self-increment and self-subtraction such as i++, but it can record the normal addition and subtraction such as i+1, so it can be detection.
步骤(2)、语法分析Step (2), syntax analysis
对于程序源码进行分析,第一步就是词法分析。词法分析的功能是将由字符串序列组成的源代码划分成一个个完整的具有明确含义的最小化的单位,也就是“token”。For the analysis of program source code, the first step is lexical analysis. The function of lexical analysis is to divide the source code composed of string sequences into complete minimum units with clear meaning, that is, "token".
词法分析部分首先会创建一个用于存储“token”的双向链表,然后逐字节地读入源代码中的每个字符,并对将其划分为一个个的“token”,然后将按照“token”在源代码中出现的顺序将他们添加到双向链表中以供语法分析。The lexical analysis part will first create a doubly-linked list for storing "token", and then read each character in the source code byte by byte, and divide it into "token" one by one, and then will follow the "token " in the order in which they appear in the source code to add them to the doubly linked list for parsing.
语法分析过程具体为:首先判断能否打开文件,然后开始执行循环。循环判断当前字符是否为文件结束符,如果是文件结束符,那么程序运行结束。如果不是文件结束符,读取一段数据缓冲区并开始新的循环判断字符,当字符为“\n”(换行符)时结束循环。循环中首先判断字符是否为字母,如果是,那么转关键字和标识符处理,不是继续判断字符是否为数字,如果是,转到数字处理部分。如果不是,继续判断字符是否为运算符,如果是,转到数字处理部分。如果不是,继续循环,读入下一个字符。当读取到“\n”(换行符)时,跳转到判断文件结束部分继续循环。The grammatical analysis process is specifically as follows: firstly, it is judged whether the file can be opened, and then the execution cycle is started. Loop to judge whether the current character is the end of the file, if it is the end of the file, then the program ends. If it is not the end of the file, read a section of data buffer and start a new cycle to judge the character, and end the cycle when the character is "\n" (newline character). In the loop, first judge whether the character is a letter, if yes, then go to the keyword and identifier processing, instead of continuing to judge whether the character is a number, if it is, go to the number processing part. If not, continue to determine whether the character is an operator, and if so, go to the number processing part. If not, continue to loop and read the next character. When "\n" (newline character) is read, jump to the end of the judgment file and continue looping.
之后进行语法分析,语法分析的主要功能是将有序的“token”序列组合成一个可接受的表达式。在这个过程中,通常参考一个递归地定义了表达式组成部分的上下文无关的语法,将一系列具有固定顺序的符号表达式最终组合成一个表达式。在进行语法分析的同时,程序直接对“token”的双向链表进行分析,在这里,程序构造出一棵抽象语法树,这棵抽象语法树在逻辑上等价于原有的代码。构建出抽象语法树之后,程序根据构成这棵抽象语法树的每个节点的“token”之间的关系为“token”添加相应的属性,便于之后的分析。This is followed by parsing. The main function of parsing is to combine ordered sequences of "tokens" into an acceptable expression. In this process, a series of symbolic expressions with a fixed order are finally combined into an expression by referring to a context-free grammar that recursively defines the components of the expression. While performing grammatical analysis, the program directly analyzes the doubly linked list of "token". Here, the program constructs an abstract syntax tree, which is logically equivalent to the original code. After the abstract syntax tree is constructed, the program adds corresponding attributes to the "token" according to the relationship between the "tokens" of each node that constitutes the abstract syntax tree, to facilitate subsequent analysis.
步骤(3)、缺陷模式匹配Step (3), defect pattern matching
首先,构造一个基础的检查类,对于所有的检查类,都必须继承自这个检查类,并且实现其中定义的作为接口的虚函数。First, construct a basic inspection class. All inspection classes must inherit from this inspection class and implement the virtual functions defined in it as interfaces.
之后,将所有的检查类实例化之后产生的实例置于一个用于检查的列表中,然后对语法分析产生的“token”双向链表逐一进行缺陷模式的匹配。其中用于匹配可能错误的键入缺陷的正则表达式如下:After that, put all the instances generated after instantiation of the inspection class into a list for inspection, and then perform defect pattern matching on the "token" doubly linked list generated by the syntax analysis one by one. where the regular expression used to match potentially incorrectly typed defects is as follows:
if(%var%=%num%)if(%var%=%num%)
if(%any%&%any%)if(%any%&%any%)
scanf(%str%,%var%)scanf(%str%,%var%)
最后,对于所有匹配到的缺陷模式,将缺陷模式以及检查到的“token”等相关信息传给接下来的缺陷处理模块,由缺陷处理模块进行缺陷的格式化输出。Finally, for all the matched defect patterns, the defect pattern and the checked "token" and other related information are passed to the next defect processing module, and the defect processing module performs formatted output of the defect.
综上,以上仅为本发明的较佳实施例而已,并非用于限定本发明的保护范围。凡在本发明的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本发明的保护范围之内。To sum up, the above are only preferred embodiments of the present invention, and are not intended to limit the protection scope of the present invention. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201510707442.4ACN105389195B (en) | 2015-10-27 | 2015-10-27 | A kind of static analysis tools improved method replaced based on code with regular expression |
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201510707442.4ACN105389195B (en) | 2015-10-27 | 2015-10-27 | A kind of static analysis tools improved method replaced based on code with regular expression |
| Publication Number | Publication Date |
|---|---|
| CN105389195A CN105389195A (en) | 2016-03-09 |
| CN105389195Btrue CN105389195B (en) | 2018-08-10 |
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201510707442.4AActiveCN105389195B (en) | 2015-10-27 | 2015-10-27 | A kind of static analysis tools improved method replaced based on code with regular expression |
| Country | Link |
|---|---|
| CN (1) | CN105389195B (en) |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108062474B (en)* | 2016-11-08 | 2022-01-11 | 阿里巴巴集团控股有限公司 | File detection method and device |
| WO2018232767A1 (en)* | 2017-06-24 | 2018-12-27 | 拜椰特(上海)软件技术有限公司 | Lexical analysis tool |
| CN107908405A (en)* | 2017-11-17 | 2018-04-13 | 苏州蜗牛数字科技股份有限公司 | The static examination & verification device and method of code |
| CN109542555A (en)* | 2018-10-26 | 2019-03-29 | 深圳点猫科技有限公司 | A kind of international programming implementation method of realization educational applications and device |
| CN109582567A (en)* | 2018-11-07 | 2019-04-05 | 深圳竹云科技有限公司 | A kind of software defect mode research method based on static analysis |
| CN112733153B (en)* | 2021-01-27 | 2024-09-24 | 腾讯科技(深圳)有限公司 | Source code scanning method, source code scanning device, electronic equipment and storage medium |
| CN113778852B (en)* | 2021-06-04 | 2023-07-28 | 南方科技大学 | A Code Analysis Method Based on Regular Expression |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102955914A (en)* | 2011-08-19 | 2013-03-06 | 百度在线网络技术(北京)有限公司 | Method and device for detecting security flaws of source files |
| CN102968367A (en)* | 2012-08-28 | 2013-03-13 | 华南理工大学 | Static detection method on basis of embedded software and system thereof |
| CN104298594A (en)* | 2014-09-25 | 2015-01-21 | 南京航空航天大学 | Automatic detection and positioning method for source code mid-value miscalculation |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102955914A (en)* | 2011-08-19 | 2013-03-06 | 百度在线网络技术(北京)有限公司 | Method and device for detecting security flaws of source files |
| CN102968367A (en)* | 2012-08-28 | 2013-03-13 | 华南理工大学 | Static detection method on basis of embedded software and system thereof |
| CN104298594A (en)* | 2014-09-25 | 2015-01-21 | 南京航空航天大学 | Automatic detection and positioning method for source code mid-value miscalculation |
| Publication number | Publication date |
|---|---|
| CN105389195A (en) | 2016-03-09 |
| Publication | Publication Date | Title |
|---|---|---|
| CN105389195B (en) | A kind of static analysis tools improved method replaced based on code with regular expression | |
| CN105426711B (en) | A kind of computer software source code similarity detection method | |
| CN110245496A (en) | A source code vulnerability detection method and detector, its training method and system | |
| CN108446540A (en) | Program code based on source code multi-tag figure neural network plagiarizes type detection method and system | |
| CN110147235B (en) | A method and device for semantic comparison between source code and binary code | |
| CN108170468B (en) | A method and system for automatically detecting comments and code consistency | |
| CN106156623B (en) | SQLIA defence methods based on intention | |
| CN113127341B (en) | Incremental code defect detection method and system based on graph network model | |
| WO2007084780A2 (en) | Type inference system and method | |
| CN101908006B (en) | GCC abstract syntax tree-based buffer overflow vulnerability detection method | |
| CN106843840A (en) | A kind of version evolving annotation multiplexing method of source code based on similarity analysis | |
| CN104407872B (en) | How to detect code clones | |
| CN106339313B (en) | A kind of abnormal inconsistent automatic testing method of description with document of Java api routines | |
| Min et al. | Survey on software clone detection research | |
| CN108345468A (en) | Programming language code duplicate checking method based on tree and sequence similarity | |
| Fry et al. | Clustering static analysis defect reports to reduce maintenance costs | |
| Sudhamani et al. | Code similarity detection through control statement and program features | |
| CN108021390A (en) | A kind of document defect self-repairing method of Java Application Programming Interface | |
| Chen et al. | Efficient vulnerability detection based on an optimized rule-checking static analysis technique | |
| Peng et al. | PTLVD: Program slicing and transformer-based line-level vulnerability detection system | |
| US7779049B1 (en) | Source level optimization of regular expressions | |
| CN120029630A (en) | A smart contract decompilation code optimization method and system | |
| CN103049504A (en) | Semi-automatic instrumentation method based on source code inquiring | |
| Khatoon et al. | An evaluation of source code mining techniques | |
| Bernardi et al. | Model checking to improve precision of design pattern instances identification in OO systems |
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |