A kind of stain analytical approach based on stain invariant setTechnical field
The stain analytical approach based on stain invariant set that the present invention proposes is analyzed the problem of high rate of false alarm for solving traditional stain, belong to security of computer software field tests.
Background technology
Code audit, software verification and software test are all the gordian techniquies that ensures software reliability and security, and software test is the most general technology of verifying software quality.High-quality software product demand, impels software test in software development cycle, to occupy more and more consequence.Investigation demonstration in the recent period, software test accounts for the ratio of software development cost and has brought up to 80% from 50%.Stain analytical technology was suggested in 1998, after 2005, became study hotspot, was because it has multiple advantage and application prospect widely, comprising: software protection, software defect are found and software defect analysis etc.
Stain analytical technology is excavated software vulnerability by stain source, location (taint source), tracking tainting (taint propagation) and Hole Detection rule.Basic thought is to be stain by the data markers of possibility victim direct control, and the data source that is regarded as insincere (stain) generally includes: disk file, network service, keyboard and mouse operation etc.Follow the tracks of the propagation condition of stain data, in the time that stain data dissemination arrives key operation, leak just detected.This detection side's ratio juris is: stain data can only be used as non-control purposes, and controls data, for example: code pointer, script, order etc. should be all non-stain data.In essence, stain analysis is a kind of fine-grained information flow tracking technique, can follow the tracks of incredible information flow direction.
External scholar has launched deep research and has obtained some achievements for stain analytical technology.Lam and Chiueh have proposed to carry out monitor code based on the method for stain mark and tracking, and the method has two shortcomings: one is that applicability is low, because need to recompilate code; Another is that stain analysis is not comprehensive, is mainly due to the data processing of not supporting to control stream.Newsome and Song have proposed to detect by dynamic dataflow analytical technology the method for buffer-overflow vulnerability, and the method does not support to control the data processing of stream yet.The development of stain analytical technology is mainly divided into two stages: within 2005-2008, mainly concentrate on the research of stain technology itself, occurred improving the TaintCheck of stain analysis theories, the instruments such as LIFT and Dytan this period; After 2008, focus concentrates on the fusion of stain analytical approach and other technologies, and analysis tool comparatively famous in this one-phase has buzzfuzz etc.
Tainting technology can be divided into two classes at present: one is static analysis (static analysis), and another is performance analysis (dynamic analysis).Static stain analysis is in tainting Procedure embedding type, not need executive routine, and Livshits etc. have realized the static treatment of contamination data under Java; Shankar etc. have proposed a kind of detection method for format string leak in c program, the method inefficiency and there will be a large amount of wrong reports; The extended version Oink of the Cqual that Berkeley University develops at the Cqual of the fragility mainly for detection of C language of exploitation in 2002 and in 2006 is that stream is insensitive in most situation; ARCHER and IPSSA adopt path-sensitive analytical approach to reduce the accuracy of detection of model.Dynamically in the time that program is carried out, whether the instant data that detect are from outside untrusted input in stain analysis, and the method exists and faces path blast, and " stain pattern (taint model) " security mechanism realized in perl script language; Newsome etc. have used shadow EMS memory (shadow memory) technology.2009, the Vijay Ganesh of MIT etc. proposed the white box fuzzing test that stain tracking technique and fuzzing measuring technology are combined first, still needed to rely on the dependence of source code.The propositions such as Tielei Wang in 2010 utilize the method for stain information generating test use case under black box, and the method has improved code coverage.
Dynamically stain analysis is one of gordian technique of software analysis, is often used in and excavates software vulnerability, analysis software behavior, generates leak feature, privacy information protection etc.Owing to can carrying out fine-grained information flow tracking, dynamically stain analysis can be excavated dissimilar software vulnerability, for example: internal memory destruction, format string, SQL inject and order is injected, cross-site scripting attack.Wherein, tainting has directly affected the effect that whole stain is analyzed.For example: if in communication process mark too much without the data of mark, can affect the speed that stain is analyzed, also can bring too high rate of false alarm simultaneously; Contrary, if missed part and should be labeled as the data of stain, can bring and fail to report, its direct result is cannot find software defect, cannot find malicious attack etc.
The TaintCheck that the people such as Newsome developed in 2005, except detecting attack, can also produce attack signature code.The Dytan system that the people such as Clause realize, except following the tracks of the tainting being caused by data stream, can also be followed the tracks of the tainting being caused by control stream.Yin and Song have proposed TEMU system, and it is based on the virtual machine QEMU that increases income.TEMU is the dynamic stain analytic system of a total system (whole-system), its the most applicable tainting to low level (low-level) is followed the tracks of, for example disk file, kernel code, network service, but this system-wide stain analysis meeting causes more serious resource overhead.
The DTA++ that the people such as Kang propose develops based on TEMU, flows owing to only having considered that part is controlled, and controls so reduced the rate of false alarm that stream stain is analyzed.The STILL that the people such as Wang propose is based on static stain analytical technology, and this is the stain analytical technology of the tested program of a kind of unactual execution.Dynamically stain analysis is also applied to the protection to web application.Halfond proposes forward (positive) stain analytical technology and protects web application.So-called forward stain refers to believable input, and relative, negative sense (negative) stain refers to suspicious input.Other all stain analyses are all the negative sense stain analytical technologies adopting.Python program also provides stain analytical model.
Some dynamic stain analytic systems directly build on hardware foundation, for example: LIFT, the dynamic stain analytical technology of this hardware level can greatly improve system running speed.But the dynamic stain analysis of hardware level can only detect low-level software vulnerability conventionally, for example buffer overflow, and the technological expansion of hardware level is also very poor.In addition, stain analysis may operate in source code level or middle expression-form level.For example: the people such as Xu have proposed a source code level stain analysis software based on CIL; Chin and Wagner have realized the stain analysis software based on JVM bytecode pitching pile for java applet.
The analysis of tradition stain refers to the analysis of data stream stain, and stain propagates into destination operand from source operand, for example: y=x+5, before this statement is carried out, if x is stain data, this statement is carried out rear y and is also marked as stain data.Existing research is found: traditional stain analysis can produce and fail to report in some cases, and this causes by controlling stream substantially, control stream and also can propagate stain, but data-flow analysis cannot capture control stream tainting.
By analyzing existing documents and materials, we find, existing technology is mostly the stain analysis based on controlling stream, not the stain analysis of supported data stream.At present also do not have research institution to carry out systematic research to the wrong report problem of stain analysis, but in fact, wrong report also can bring serious problem.In the time that rate of false alarm is high, researchist or analyst will expend a large amount of energy to go to distinguish which stain data be real, and which stain data is false.In the time that the scale of analyzed software strengthens, the wrong report problem that stain is analyzed will be serious all the more.Even can say so, the too high stain analytical technology of rate of false alarm is difficult to apply in practice.For the problem of failing to report of traditional stain analysis, current more existing researched and proposed some solutions.The present invention does not study failing to report of traditional stain analysis, but the wrong report of analyzing for stain launches research.
Wrong report problem with regard to traditional stain analysis conducts a research, and proposes new stain marking convention, should mark stain invariant set.Its core methed is found stain invariant set exactly, analyzes this result and filters out stain invariant set from traditional stain, reaches the effect that reduces rate of false alarm.
Summary of the invention
This programme is studied mainly for the high rate of false alarm problem of traditional stain analytical technology, intends proposing a kind of stain analytical approach based on stain invariant set, reduces the high rate of false alarm of stain analytical technology.The concrete technical problems relating to comprises following two aspects:
(1) redefine stain marking convention, redefine which variable (internal memory or register) and should be marked as stain, and which variable should not be marked as stain.According to traditional definition, as long as source operand is stain, destination operand just should be marked as stain so, and this is a kind of typical data-flow analysis method.For example: y=x+5, when x is stain data, it is also stain data that this statement executes rear y.This has just brought wrong report problems a large amount of in reality, such as: y=x-x, y=x/x, y=x (XOR) x etc., the value of y is not relevant to x, therefore, if y is labeled as to stain, for the follow-up various software analysis technologies based on stain analysis, all can bring disadvantageous result.Therefore, the determine problem definition of stain marking convention of quasi-solution of the present invention, proposes stain tag definitions more accurately.
(2) find stain invariant set.Stain invariant set is the set of stain invariant, and stain invariant refers to and calculated by stain data, the impact of a data but its value is not got dirty.Example as above: y=x-x, y=x/x, y=x(XOR) x etc., x is stain data, y is exactly stain invariant., find stain invariant and can in compiling, automatically be processed by compiler, because compiler can be y=0, y=1, y=0 etc. by these example Automatic Optimal simply in example at these.But this static mode of Compiler Optimization can only be for simple calculation expression, if y=f (x), the f is here a complicated arithmetic expression, and static method just cannot be determined the value of f (x), naturally also cannot find stain invariant set.The present invention intends taking a kind of dynamic method to find stain invariant set, reduces the rate of false alarm that stain is analyzed.
The present invention is by the following technical solutions to achieve these goals:
A stain analytical approach based on stain invariant set, comprises the steps:
Step 1: by stain analytical approach, obtain original stain data original_set;
Step 2: obtain stain invariant set invariable_set;
Step 2.1: carrying out input is the program p of input, obtains all variablees of program p and the set var_set of value thereof, wherein input refers to the assignment set of the input that outside is accepted;
Step 2.2: loop initialization number of times is 0;
Step 2.3: in the time that cycle index is less than threshold, enter step 2.4; Otherwise enter step 3;
Step 2.4: change input and obtain new input new_input;
Step 2.5: carrying out input is the program p of new-_input, obtains all variablees of program p and the set res_set of value thereof;
Step 2.6: relatively gather the item that var_set is different from set res_set, productive set diff_set;
Step 2.7: filter out set diff_set var_set from set, be retained in variable under different input conditions and do not have the set of vicissitudinous variable and value thereof, enter step 2.3;
Step 3: obtain final stain data taint_set, final stain data taint_set=original stain data original_set-stain invariant set invariable_set;
Step 4: formulate stain data structure;
Step 5: follow the tracks of tainting;
Step 6: follow the tracks of stain pointer, in order to realize the tracking to stain pointer, in the time that the address of indirect addressing or register are stain data, destination operand is also marked as stain;
Step 7: formulate Hole Detection rule, directly internally store function carries out pitching pile, and before interior store function is called, detection function parameter, whether by stain mark, if so, has detected interior store function leak.
Step 5 has been used a kind of tainting tracking of lightweight, specifically comprises the steps:
Step 1: initialization has the ephemeral data structure of memory address and width information, register, instruction operation code, instruction type;
Step 2: search all registers that are read and memory address in ephemeral data structure;
Step 3: all registers that are read and memory address in inquiry stain data structure;
Step 4: judge whether the data of storing in all registers that are read and memory address are stain data, enter step 5, otherwise enter step 6 if not stain data;
Step 5: delete those and be recorded in the register that is modified in ephemeral data structure and the stain information of memory address;
Step 6: those registers of being modified and the memory address of label record in ephemeral data structure is stain.
Because the present invention has adopted above technical scheme, so possess following beneficial effect:
The present invention reduces the rate of false alarm problem that traditional stain is analyzed, and is the improvement of carrying out on existing symbol carries into execution a plan basis.The present invention proposes new stain marking convention, should mark stain invariant set, and its core methed is found stain invariant set exactly, analyzes this result and filters out stain invariant set from traditional stain, reaches the effect that reduces rate of false alarm.Compared with other optimisation technique, the present invention has very significantly effect.One, by identification stain invariant set, reduces stain and analyzes rate of false alarm; And in existing stain analysis and research, also do not have research institution to propose the effective ways for stain analysis wrong report.On the other hand, high-level efficiency, the inventive method identification stain invariant set, adopts variation input, and the actual mode of carrying out, to low in resources consumption.Therefore, this method can find stain invariant set at short notice.
Brief description of the drawings
Fig. 1 is the process flow diagram of the stain analytical approach based on stain invariant set of the present invention;
Fig. 2 is the process flow diagram of the method for obtaining stain invariant set of the present invention;
Fig. 3 is the process flow diagram of the tainting tracking of lightweight of the present invention.
Embodiment
Utilize new stain tag definitions to find stain invariant, thereby obtain stain invariant set.So-called stain invariant, is calculated and is got by stain data, but should not be labeled as the variable of stain according to new definition.Such as the variable y of y=x/x.The core of this method filters out stain invariant set exactly from traditional stain data, thereby reaches the effect that reduces rate of false alarm.Stain analytical algorithm based on stain invariant set is as follows:
begin:
1 original_set = getTaintset(p, input);
2 invariable_set = getNonTaintset(p, input);
3 taint_set = original_set-invariable_set;
end:
Algorithm Analysis: statement 1, by traditional stain analytical approach, obtains original stain data original_set; Statement 2, obtains stain invariant set invariable_set; Statement 3, obtains final stain data taint_set.
Obtain original stain data, it is innovative point of the present invention that the present invention adopts prior art to obtain stain invariant set.The algorithm that obtains stain invariant set is as follows:
begin:
1 var_set = execute(p, input);
2 num = 0;
3 while(num< threshold);
{
4 new_input = mutate(input);
5 res_set = execute(p, new_input);
6 diff_set = getDiff(var_set, res_set);
7 var_set-= diff_set;
8 num ++;
}
end
Algorithm Analysis: statement 1, executive routine p, input is input, obtains the set var_set of all variablees and value thereof; Statement 2, loop initialization number of times is 0; Statement 3, in the time that cycle index is less than threshold, enters loop body; Statement 4, changes input and obtains new input new_input; Statement 5, executive routine p, input is new-_input, obtains the set res_set of all variablees and value thereof; Statement 6, more then obtains var_set, the item that res_set is different, productive set diff_set; Statement 7 filters out diff_set from var_set, retains those values and does not have vicissitudinous variable; Statement 8, increases progressively cycle index.
Embodiment 1
Step 1: by traditional stain analytical approach, obtain original stain data original_set.Wherein traditional stain analytical approach is to carry out mark for common stain source, taking disk file as example.The API providing by Pin carries out pitching pile to operating system function, as: open, read, mmap etc., because the operation of disk file is generally passed through to these functions.It should be noted that, different operating system, needs the function difference of pitching pile, and taking Windows as example, the function that needs pitching pile is OpenFile, ReadFile etc.
Step 2: obtain stain invariant set invariable_set.
Step 2.1: carrying out input is the program p of input, obtains the set var_set of all variablees and value thereof, wherein input refers to the assignment set of the input that outside is accepted;
Step 2.2: loop initialization number of times is 0;
Step 2.3: in the time that cycle index is less than threshold, enter step 2.4; Otherwise enter step 3;
Step 2.4: change input and obtain new input new_input;
Step 2.5: carrying out input is the program p of new-_input, obtains the set res_set of all variablees and value thereof;
Step 2.6: relatively gather the item that var_set is different from set res_set, productive set diff_set;
Step 2.7: filter out set diff_set var_set from set, be retained in variable under different input conditions and do not have the set of vicissitudinous variable and value thereof, enter step 2.3;
Step 3: obtain final stain data taint_set:taint_set=original_set-invariable_set.
Step 4: formulate stain data structure.Stain data read and amendment is operation the most frequently during stain is analyzed, stain data read and refer to the stain information that reads an address or register, and stain data modification is to show an address or register is composed with new stain information.For example, if need to store many stains mark (multi taint mark) that more stain information can adopt Dytan to propose: which stain source is data belong to.If do not need so many stain information, can adopt the modal analysis granularity of stain analytical technology---the analysis granularity of byte level of binary level, only, by 1 stain information of storing each byte, 0 represents it is not stain, 1 expression is stain.
Step 5: follow the tracks of tainting.Stain information can be carried out and propagate along with instruction, simply: stain information can propagate into destination operand from source operand.For example: instruction moveax, ecx, before execution, register ecx is stain data, carries out late register eax and also should be marked as stain data.Can be expression-form in the middle of easy to handle more by x86 instruction transformation, in the middle of these, in fact expression-forms be similar to reduced instruction set computer, only have a small amount of statement type.Also can adopt the tainting tracking of lightweight, and expression-form in the middle of not relying on, directly to x86, instruction is analyzed.
Step 6: follow the tracks of stain pointer.Step 5 can be processed the tainting of explicit (explicit), in addition, the tainting of the implicit expression being caused by stain pointer (implicit) is followed the tracks of.For example: instruction moveax, [ecx+0x20], in the time that register ecx is stain data, register eax also should be marked as stain after execution.
Follow the tracks of the tainting being caused by stain pointer extremely important, because actual software is often used stain pointer to carry out table lookup operation.If do not follow the tracks of stain pointer, probably cause wrong report.In order to realize the tracking to stain pointer, the indirect addressing mode of all instructions is carried out to special processing.Particularly, in the time that the address of indirect addressing or register are stain data, destination operand is also marked as stain.
Step 7: formulate Hole Detection rule.If for low-level software vulnerability, typical way is to detect whether to have jump instruction by stain mark, if so, leak detected.The principle of this mode be think leakyly all can cause the transfer of being flowed by attacker control, but in fact exist many attacks can't reprogramming control stream, for example, for the attack of function parameter.If for high level software vulnerability, directly internally store function carries out pitching pile, and before interior store function is called, detection function parameter, whether by stain mark, if so, has detected interior store function leak.The interior store function of monitoring comprises memory allocation function, as: malloc, alloc, realloc, internal memory operation function, as: memcpy, memmove, string operation function, as: strncpy, strcat.