Disclosure of Invention
In view of this, the present application provides a method and apparatus for identifying a file, which are used to improve the accuracy of a virus detection result without occupying more memory when detecting a virus for a file passing through a network device.
Specifically, the application is implemented through the following technical solutions:
According to a first aspect of the present application, there is provided a file identification method, comprising:
carrying out file identification on the received file to be identified;
when the file to be identified is a file of a set type, extracting a file header from the file to be identified;
carrying out hash calculation on the target information in the file header to obtain a local hash value;
Matching the local hash value with a virus hash feature library, wherein the virus hash feature library comprises feature hash values of viruses;
and when the matching is successful, confirming that the file to be identified is a virus file.
Optionally, the file to be identified comprises an executable file under a Windows operating system, wherein the file header comprises an image file header and an optional image header;
performing hash calculation on the target information in the file header to obtain a local hash value, including:
performing hash calculation on the image file header to obtain an intermediate hash value;
extracting, from the optional image header and according to a machine type code, target optional image header information matching the machine type code;
and performing hash calculation on the intermediate hash value and the target optional image header information to obtain the local hash value.
Optionally, the file identifying method provided in this embodiment further includes:
And when the matching is unsuccessful, carrying out pattern string feature matching on the file to be identified so as to identify whether the file to be identified is a virus file or not.
Optionally, the file identifying method provided in this embodiment further includes:
when the local hash value is not successfully matched with the virus hash feature library or the file to be identified is not identified as a virus file when pattern string feature matching is carried out on the file to be identified, carrying out full-text hash calculation on the file to be identified to obtain a hash result, and identifying whether the file to be identified is a virus file according to the hash result.
Optionally, extracting the header from the file to be identified includes:
And disassembling the file to be identified by using the file analysis plug-in corresponding to the set type so as to extract the file header in the file to be identified.
According to a second aspect of the present application, there is provided a file identification apparatus comprising:
the identification module is used for carrying out file identification on the received file to be identified;
The extraction module is used for extracting a file header from the file to be identified when the file to be identified is a file of a set type;
The hash calculation module is used for carrying out hash calculation on the target information in the file header to obtain a local hash value;
The first matching module is used for matching the local hash value with a virus hash characteristic library, wherein the virus hash characteristic library comprises characteristic hash values of viruses;
And the confirming module is used for confirming that the file to be identified is a virus file when the matching result of the first matching module is that the matching is successful.
Optionally, the file to be identified comprises an executable file under a Windows operating system, wherein the file header comprises an image file header and an optional image header;
The hash calculation module is specifically configured to perform hash calculation on the image file header to obtain an intermediate hash value, extract target optional image header information matching a machine type code from the optional image header according to the machine type code, and perform hash calculation on the intermediate hash value and the target optional image header information to obtain the local hash value.
Optionally, the file identifying apparatus provided in this embodiment further includes:
And the second matching module is used for carrying out pattern string feature matching on the file to be identified when the matching result of the first matching module is that the matching is unsuccessful, so as to identify whether the file to be identified is a virus file.
Optionally, the file identifying apparatus provided in this embodiment further includes:
And the third matching module is used for carrying out full-text hash calculation on the file to be identified to obtain a hash result when the matching result of the first matching module is that the matching is unsuccessful, or when the matching result of the second matching module is that the file to be identified is not identified as a virus file, and identifying whether the file to be identified is a virus file according to the hash result when the file to be identified is a complete file.
Optionally, the extracting module is specifically configured to disassemble the file to be identified by using a file parsing plug-in corresponding to the set type, so as to extract the file header from the file to be identified.
According to a third aspect of the present application there is provided an electronic device comprising a processor and a machine-readable storage medium storing a computer program executable by the processor, the processor being caused by the computer program to perform the method provided by the first aspect of the embodiment of the present application.
According to a fourth aspect of the present application there is provided a machine-readable storage medium storing a computer program which, when invoked and executed by a processor, causes the processor to carry out the method provided by the first aspect of the embodiments of the present application.
The embodiment of the application has the beneficial effects that:
According to the file identification method and apparatus provided by the embodiments of the application, file identification is performed on a received file to be identified; when the file is identified as a file of a set type, a file header is extracted from it; hash calculation is performed on target information in the file header to obtain a local hash value; the local hash value is matched against a virus hash feature library; and when the matching is successful, the file to be identified is confirmed to be a virus file. Because only a local hash of the file header needs to be calculated and matched against the virus hash feature library to identify whether the file is a virus file, virus detection of files passing through a network device does not need to occupy more memory, while the accuracy of the virus detection result is improved. In addition, because hash calculation does not need to be performed on the entire file to be identified, the speed of file identification is improved.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the corresponding listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the application. Depending on the context, the term "if" as used herein may be interpreted as "when", "upon", or "in response to a determination".
The file identification method provided by the application is described in detail below.
Referring to fig. 1, fig. 1 is a flowchart of a file identification method provided by the present application. The method may be applied to a network security device, which may be, but is not limited to, a firewall or the like. When the network security device implements the method, the method may include the following steps:
S101, carrying out file identification on the received file to be identified.
In this step, after the traffic enters the network security device, the file transmitted in the traffic is identified, and for convenience of description, the transmitted file may be referred to as the file to be identified.
Optionally, before the file to be identified is identified, application identification may be performed on the stream, and when the application of the set protocol is identified, step S101 is performed, that is, file identification is performed on the file to be identified of the application conforming to the set protocol.
Specifically, in some scenarios only application control is required: once the application corresponding to a data stream has been identified, the data stream is regarded as safe to a certain extent, so deep packet inspection is not performed, which improves message processing performance and saves message processing time. In such scenarios, to further improve the security of data streams entering the network, this embodiment proposes executing the flow shown in fig. 1 after the application is identified.
In addition, other scenarios may also require application identification. To meet the actual needs of those scenarios while changing their existing implementation flow as little as possible, application identification is performed first, and the flow shown in fig. 1 is executed afterwards.
It should be noted that the set protocol may be, but is not limited to, HTTP, FTP (File Transfer Protocol), and the like.
S102, when the file to be identified is a file of a set type, extracting a file header from the file to be identified.
In this step, the file type of the file to be identified is identified first. Executable files of certain types account for a large proportion of traffic, and viruses generally intrude into such files. To ensure the security of files and of the network, the file types that require virus identification are therefore set in advance. When the file type of the file to be identified is a set type, virus detection needs to be performed on the file, and the file header is then extracted from it.
S103, carrying out hash calculation on the target information in the file header to obtain a local hash value.
In this step, the target information is the feature information used for virus identification, that is, information that may be altered by a virus. On this basis, the target information is extracted from the file header, and hash calculation is then performed on it to obtain the local hash value.
S104, matching the local hash value with a virus hash characteristic library.
Wherein the virus hash feature library comprises feature hash values of viruses.
In this step, in order to identify whether a virus exists in a file, a virus feature library is preconfigured, and hash calculation is performed on the virus features of currently known viruses to obtain the feature hash values of the viruses, from which the virus hash feature library is generated. On this basis, when the local hash value is matched against the virus hash feature library: if the library includes the local hash value, the matching is confirmed to be successful and step S105 is executed, that is, the file to be identified is confirmed to be a virus file; if the library does not include the local hash value, the matching is confirmed to be unsuccessful.
The virus hash feature library can be dynamically updated, and as viruses increase, the virus features of the newly added viruses are subjected to hash calculation to obtain the feature hash values of the newly added viruses, and then the feature hash values are updated into the virus hash feature library.
It should be noted that the virus hash feature library may further include virus identifiers, that is, the correspondence between each virus identifier and its virus feature is recorded in the library. In this way, matching the local hash value against the library identifies not only whether the file to be identified contains a virus, but also which virus it contains. Specifically, when the local hash value is confirmed to be in the virus hash feature library, the file to be identified is confirmed to contain a virus, and the identifier of that virus can be determined from the correspondence between virus identifiers and virus features. This improves the accuracy of the virus identification result and, once the result is displayed to the user, makes it easier for the user to take effective countermeasures against the identified virus.
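The library described above can be pictured as a mapping from feature hash values to virus identifiers that is updated as new viruses appear. The following is a minimal sketch under stated assumptions: the class name `VirusHashLibrary`, its methods, and the identifier strings are hypothetical, and MD5 is used only because it is mentioned later in this description as one possible hash algorithm.

```python
import hashlib
from typing import Dict, Optional


class VirusHashLibrary:
    """Hypothetical in-memory virus hash feature library."""

    def __init__(self) -> None:
        # feature hash value (hex string) -> virus identifier
        self._entries: Dict[str, str] = {}

    def add(self, virus_feature: bytes, virus_id: str) -> str:
        """Hash a virus feature and record it (dynamic update of the library)."""
        feature_hash = hashlib.md5(virus_feature).hexdigest()
        self._entries[feature_hash] = virus_id
        return feature_hash

    def match(self, local_hash: str) -> Optional[str]:
        """Return the virus identifier on a successful match, otherwise None."""
        return self._entries.get(local_hash)

    def __len__(self) -> int:
        return len(self._entries)
```

A lookup that returns an identifier corresponds to a successful match in step S104; a `None` result corresponds to an unsuccessful match.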
And S105, when the matching is successful, confirming that the file to be identified is a virus file.
According to the file identification method provided by this embodiment, file identification is performed on a received file to be identified; when the file is identified as a file of a set type, a file header is extracted from it; hash calculation is performed on target information in the file header to obtain a local hash value; the local hash value is matched against a virus hash feature library; and when the matching is successful, the file to be identified is confirmed to be a virus file. Because only a local hash of the file header needs to be calculated and matched against the virus hash feature library to identify whether the file is a virus file, virus detection of files passing through a network device does not need to occupy more memory, while the accuracy of the virus detection result is improved. In addition, because hash calculation does not need to be performed on the entire file to be identified, the speed of file identification is improved.
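Steps S101 to S105 can be sketched end to end as follows. This is an illustrative sketch rather than the implementation of the application: the function names are hypothetical, file type identification is reduced to a check of leading magic bytes, the whole image file header plus optional image header stands in for the "target information", and a single MD5 pass is used instead of the two-stage hash described for PE files.

```python
import hashlib
import struct
from typing import Optional, Set


def identify_file_type(data: bytes) -> str:
    """S101 (simplified): identify the file type from its leading magic bytes."""
    if data[:2] == b"MZ":
        return "pe"
    if data[:4] == b"PK\x03\x04":
        return "zip"
    return "unknown"


def extract_pe_header(data: bytes) -> Optional[bytes]:
    """S102: return image file header + optional image header, or None."""
    if len(data) < 0x40:
        return None
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)  # offset of the PE signature
    coff = e_lfanew + 4                                  # image file header start
    if data[e_lfanew:coff] != b"PE\x00\x00" or len(data) < coff + 20:
        return None
    (opt_size,) = struct.unpack_from("<H", data, coff + 16)  # SizeOfOptionalHeader
    end = coff + 20 + opt_size
    return data[coff:end] if len(data) >= end else None


def is_virus_file(data: bytes, virus_hashes: Set[str], set_type: str = "pe") -> bool:
    if identify_file_type(data) != set_type:
        return False                      # not a set type: no header extraction
    header = extract_pe_header(data)
    if header is None:
        return False
    local_hash = hashlib.md5(header).hexdigest()  # S103: local hash of header only
    return local_hash in virus_hashes             # S104/S105: match and confirm
```

Because only the header bytes are hashed, changing the body of the file does not change the local hash value in this sketch.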
Optionally, the files to be identified may be, but are not limited to, executable files, office files, compressed files, and the like.
Optionally, the set type may be, but is not limited to, a PE executable file under a Windows operating system. In this case, the file header includes an image file header and an optional image header, and step S103 may be executed as follows: hash calculation is performed on the image file header to obtain an intermediate hash value; target optional image header information matching a machine type code is extracted from the optional image header according to the machine type code; and hash calculation is performed on the intermediate hash value and the target optional image header information to obtain the local hash value.
Optionally, step S102 may be executed as follows: the file to be identified is disassembled by using the file parsing plug-in corresponding to the set type, so as to extract the file header of the file to be identified.
Specifically, the file to be identified may be disassembled by using the file parsing plug-in corresponding to its file type. When the identified file type is a PE executable file, the image file header and the optional image header can be parsed out when the file parsing plug-in for PE executable files parses the file.
On this basis, because traffic enters the network security device continuously, the file to be identified is parsed as it arrives: the image file header is parsed first, following the order of the content in the file header, and the optional image header is then parsed as the remaining content of the file header is received.
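The incremental parsing described above can be sketched as a small buffer that is fed packet payloads in order and reports each header as soon as enough bytes have arrived. The class name `StreamingHeaderReader` and its methods are hypothetical; the offsets used (`e_lfanew` at 0x3C, a 20-byte image file header, `SizeOfOptionalHeader` at offset 16 of that header) follow the public PE format. A real device would presumably also cap the buffer size, which this sketch omits.

```python
import struct
from typing import Optional

PE_SIGNATURE = b"PE\x00\x00"
COFF_SIZE = 20  # size of the image file header in bytes


class StreamingHeaderReader:
    """Hypothetical reader that accumulates payload bytes until headers are complete."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, payload: bytes) -> None:
        """Append the next in-order chunk of file content."""
        self._buf += payload

    def _coff_offset(self) -> Optional[int]:
        if len(self._buf) < 0x40:
            return None  # DOS header not complete yet
        (e_lfanew,) = struct.unpack_from("<I", self._buf, 0x3C)
        if len(self._buf) < e_lfanew + 4:
            return None  # PE signature not received yet
        if bytes(self._buf[e_lfanew:e_lfanew + 4]) != PE_SIGNATURE:
            return None  # not a PE file
        return e_lfanew + 4

    def image_file_header(self) -> Optional[bytes]:
        """Available first: the 20-byte image file header."""
        off = self._coff_offset()
        if off is None or len(self._buf) < off + COFF_SIZE:
            return None
        return bytes(self._buf[off:off + COFF_SIZE])

    def optional_header(self) -> Optional[bytes]:
        """Available later: the optional image header, once fully received."""
        header = self.image_file_header()
        if header is None:
            return None
        (size,) = struct.unpack_from("<H", header, 16)  # SizeOfOptionalHeader
        start = self._coff_offset() + COFF_SIZE
        if len(self._buf) < start + size:
            return None
        return bytes(self._buf[start:start + size])
```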
It should be noted that, because the bit widths of the machine type codes supported by viruses differ, the target optional image header information used for the second hash is selected according to the bit width of the machine type code. That is, after the image file header is parsed, the machine type code is parsed from it; after the optional image header is extracted, the target optional image header information consistent with the bit width of the parsed machine type code is extracted from the optional image header; and hash calculation is then performed based on the intermediate hash value and the target optional image header information to obtain the local hash value. The local hash value is thereby adapted to the machine type codes supported by viruses, so that whether the file to be identified is a virus file can be identified accurately based on the local hash value.
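The two-stage calculation can be sketched as follows. Several points are assumptions made for illustration only: the description does not say which optional image header fields form the target information, so `TARGET_FIELDS` picks AddressOfEntryPoint, ImageBase and SizeOfImage as a hypothetical choice; the second hash is taken over the concatenation of the intermediate digest and the target bytes; and MD5 is used as one possible hash algorithm. The field offsets follow the public PE32 and PE32+ layouts.

```python
import hashlib
import struct

# Machine type codes from the PE/COFF format.
MACHINE_I386 = 0x014C   # 32-bit x86
MACHINE_AMD64 = 0x8664  # 64-bit x64
BIT_WIDTH = {MACHINE_I386: 32, MACHINE_AMD64: 64}

# Hypothetical target fields as (offset, size) pairs inside the optional header:
# AddressOfEntryPoint, ImageBase, SizeOfImage. ImageBase moves and widens in PE32+.
TARGET_FIELDS = {
    32: [(16, 4), (28, 4), (56, 4)],
    64: [(16, 4), (24, 8), (56, 4)],
}


def target_optional_info(machine: int, optional_header: bytes) -> bytes:
    """Select the optional-header bytes that match the machine type's bit width."""
    width = BIT_WIDTH.get(machine)
    if width is None:
        raise ValueError(f"unsupported machine type code: {machine:#06x}")
    return b"".join(optional_header[off:off + size]
                    for off, size in TARGET_FIELDS[width])


def local_hash(image_file_header: bytes, optional_header: bytes) -> str:
    # Stage 1: hash the image file header to get the intermediate hash value.
    intermediate = hashlib.md5(image_file_header).digest()
    # The machine type code is the first 16-bit field of the image file header.
    (machine,) = struct.unpack_from("<H", image_file_header, 0)
    target = target_optional_info(machine, optional_header)
    # Stage 2: hash the intermediate value together with the target information.
    return hashlib.md5(intermediate + target).hexdigest()
```

In this sketch, bytes of the optional image header outside the selected fields do not influence the local hash value, while a change to any selected field does.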
Optionally, based on any one of the above embodiments, in this embodiment, when the matching is unsuccessful, pattern string feature matching is performed on the file to be identified, so as to identify whether the file to be identified is a virus file.
Specifically, when matching based on the local hash value is unsuccessful, it can be confirmed only to a certain extent that the file to be identified is not a virus file. In order to detect viruses in the file to be identified more reliably, this embodiment proposes performing pattern string feature matching on the file to further confirm whether it is a virus file, thereby further improving the accuracy of the virus identification result.
It should be noted that pattern string feature matching may be implemented with reference to currently available methods, which is not limited in this embodiment.
Further, when the local hash value is not successfully matched with the virus hash feature library or the file to be identified is not identified as a virus file when pattern string feature matching is performed on the file to be identified, performing full-text hash calculation on the file to be identified to obtain a hash result, and identifying whether the file to be identified is a virus file according to the hash result.
Specifically, when the local hash value is not successfully matched against the virus hash feature library, in order to identify viruses in the file to be identified more accurately, this embodiment proposes confirming whether the current file to be identified is complete. If it is a complete file, hash calculation is performed on the file to obtain a hash result, and whether the file is a virus file is then confirmed according to the hash result, thereby further improving the accuracy of the virus detection result.
Similarly, when pattern string feature matching is also unsuccessful, that is, neither local hash matching nor pattern string feature matching has identified the file as a virus file, this embodiment proposes confirming whether the current file to be identified is complete. If it is a complete file, hash calculation is performed on the file to obtain a hash result, and whether the file is a virus file is then confirmed according to the hash result, thereby further improving the accuracy of the virus detection result.
It should be noted that virus identification based on the complete file to be identified may be implemented with reference to currently available methods, which is not limited in this embodiment.
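The fallback order described above (local hash first, then pattern string feature matching, then full-text hash on a complete file) can be sketched as a simple cascade. The function name, parameters and returned stage labels are hypothetical; plain substring search stands in for a real pattern string matching engine, and MD5 stands in for the full-text hash.

```python
import hashlib
from typing import Iterable, Optional, Set


def identify(data: bytes,
             local_hash: str,
             is_complete: bool,
             local_lib: Set[str],
             patterns: Iterable[bytes],
             full_lib: Set[str]) -> Optional[str]:
    """Return the name of the stage that identified a virus, or None."""
    # Stage 1: local hash of the file header against the virus hash feature library.
    if local_hash in local_lib:
        return "local-hash"
    # Stage 2: pattern string feature matching (naive substring search here).
    if any(pattern in data for pattern in patterns):
        return "pattern-string"
    # Stage 3: full-text hash, only when the complete file is available.
    if is_complete and hashlib.md5(data).hexdigest() in full_lib:
        return "full-text-hash"
    return None
```

The later, more expensive stages run only when the cheaper ones have not identified the file, which matches the ordering given in the description.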
Optionally, the hash calculation may use, but is not limited to, the MD5 message-digest algorithm and the like.
It should be noted that the parsed image file header may include, but is not limited to, the machine type code (Machine), the number of sections in the file to be identified (NumberOfSections), the size of the optional image header (SizeOfOptionalHeader), and the like.
When the file to be identified is a PE executable file, the extracted file header may be denoted as the PE File Header. On this basis, when the file parsing plug-in corresponding to PE executable files is used to disassemble the PE executable file, information such as the DOS header, the DOS stub, and the PE signature can be parsed out in addition to the image file header and the optional image header.
First, the contents of the image file header are described as follows:
Machine: the machine type code. Each CPU type has a unique machine code, which may cover, but is not limited to, 32-bit and 64-bit CPUs; for example, the machine code for a 32-bit Intel x86 compatible chip is 0x14C.
NumberOfSections: the number of sections in the file to be identified, that is, the number of sections in the PE file, such as the .data and .text sections described in the PE section table.
SizeOfOptionalHeader: the overall size of the optional image header structure (PE Optional Header).
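The three fields listed above sit in a fixed 20-byte structure, so they can be read with a single unpack. The following is a minimal sketch; the type name `ImageFileHeader` and the function name are hypothetical, while the field order and sizes follow the public COFF file header layout (the fields not discussed above are included only so that the offsets line up).

```python
import struct
from typing import NamedTuple


class ImageFileHeader(NamedTuple):
    machine: int                  # Machine
    number_of_sections: int       # NumberOfSections
    time_date_stamp: int
    pointer_to_symbol_table: int
    number_of_symbols: int
    size_of_optional_header: int  # SizeOfOptionalHeader
    characteristics: int


def parse_image_file_header(raw: bytes) -> ImageFileHeader:
    """Parse the 20-byte image file header (little-endian)."""
    return ImageFileHeader(*struct.unpack_from("<HHIIIHH", raw, 0))
```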
The optional image header is denoted as the PE Optional Header, and its structure is IMAGE_OPTIONAL_HEADER32. The optional image header has 9 main members, specifically:
Magic: the magic number, which indicates whether the PE file to be identified is 32-bit or 64-bit.
AddressOfEntryPoint: the program entry address, which indicates the start address of the code that the program executes first.
ImageBase: the image base address, that is, the preferred load address at which the PE file is mapped into the memory space (the virtual memory range of a 32-bit process is 0 to 0x7FFFFFFF).
SectionAlignment: the memory alignment granularity, that is, the alignment granularity when the PE file is mapped into memory. FileAlignment: the disk alignment granularity, that is, the alignment granularity of the PE file when stored on disk. The former specifies the minimum unit of a section in memory, and the latter specifies the minimum unit of a section in the disk file.
SizeOfImage: the total size of the PE file image in memory, that is, the size of the space occupied by the PE image in virtual memory.
SizeOfHeaders: the size of the entire PE header, that is, the total size of the DOS header, PE signature, standard PE header, optional PE header, and section table.
Subsystem: the subsystem used by the user interface, which distinguishes system driver files from ordinary executable files.
NumberOfRvaAndSizes: the number of data directory entries, that is, the number of elements in the DataDirectory array.
DataDirectory: the data directory table, an array of IMAGE_DATA_DIRECTORY structures.
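The scalar members listed above can be read from the raw optional image header as sketched below. The function name and the returned dictionary are hypothetical. The offsets are taken from the public PE32 and PE32+ layouts rather than from this description: in the 64-bit layout ImageBase widens to 8 bytes and moves to offset 24, and NumberOfRvaAndSizes moves from offset 92 to 108. The DataDirectory array that follows NumberOfRvaAndSizes is not parsed here.

```python
import struct
from typing import Dict

PE32_MAGIC = 0x10B       # 32-bit optional header
PE32_PLUS_MAGIC = 0x20B  # 64-bit optional header


def parse_optional_header(raw: bytes) -> Dict[str, int]:
    """Read the main scalar members of the optional image header."""
    (magic,) = struct.unpack_from("<H", raw, 0)
    if magic == PE32_MAGIC:
        base_fmt, base_off, rva_off = "<I", 28, 92
    elif magic == PE32_PLUS_MAGIC:
        base_fmt, base_off, rva_off = "<Q", 24, 108
    else:
        raise ValueError(f"unknown optional header magic: {magic:#06x}")

    def u32(offset: int) -> int:
        return struct.unpack_from("<I", raw, offset)[0]

    return {
        "Magic": magic,
        "AddressOfEntryPoint": u32(16),
        "ImageBase": struct.unpack_from(base_fmt, raw, base_off)[0],
        "SectionAlignment": u32(32),
        "FileAlignment": u32(36),
        "SizeOfImage": u32(56),
        "SizeOfHeaders": u32(60),
        "Subsystem": struct.unpack_from("<H", raw, 68)[0],
        "NumberOfRvaAndSizes": u32(rva_off),
    }
```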
It should be noted that a virus file may be determined according to information such as the program entry address, section structure addresses, the timestamp, and the size of the space occupied by the PE image in virtual memory.
With the file identification method provided by any embodiment of the application, identification is insensitive to local modifications of a file: even when a virus file is changed (for example, its code section or appended data is changed), the file header information remains fixed, so similar virus variants can be detected with the same rule, which improves the practicability and generality of the method. In addition, to reduce the false alarm rate, when a large number of samples are analyzed to extract local hash rules, rules that are prone to false alarms, such as those for packed, bundled, or infected PE files, are removed. The removed cases can be covered by supplementary detection through full-text hashing and pattern string feature matching, which further improves the accuracy of file identification.
With the file identification method provided by any embodiment of the application, local hash calculation has a certain merging rate, that is, multiple virus samples can map to the same local hash value. Compared with a full-text hash algorithm, the local hash algorithm can therefore cover a large number of virus samples while reducing memory usage.
In addition, for the local hash matching method, the PE file information of interest is located at the head of the file, so only the first several bytes of the PE file need to be calculated and processed; there is no need to perform hash calculation and matching, or AC (Aho-Corasick) matching, on the entire file. This greatly reduces CPU utilization and greatly improves file identification performance.
Furthermore, because in most scenarios the local hash can be calculated from the first packet, the file identification method provided by the application can improve the recognition rate in scenarios where out-of-order messages, including IP-layer, TCP, and application-layer disorder, are not reassembled.
Based on the same inventive concept, the application also provides a file identification apparatus corresponding to the file identification method. For the implementation of the file identification apparatus, reference may be made to the description of the file identification method above, which is not repeated here.
Referring to fig. 2, fig. 2 is a schematic diagram of a file identifying apparatus according to an exemplary embodiment of the present application, which is disposed in a network security device, and includes:
The identifying module 201 is configured to identify a received file to be identified;
The extracting module 202 is configured to extract a header from the file to be identified when the file to be identified is a file of a set type;
the hash calculation module 203 is configured to perform hash calculation on the target information in the file header to obtain a local hash value;
A first matching module 204, configured to match the local hash value with a virus hash feature library, where the virus hash feature library includes feature hash values of viruses;
And the confirming module 205 is configured to confirm that the file to be identified is a virus file when the matching result of the first matching module is that the matching is successful.
Optionally, based on the above embodiment, the file to be identified in the present embodiment includes an executable file under a Windows operating system, where the header includes an image header and an optional image header;
The hash calculation module 203 is specifically configured to perform hash calculation on the image file header to obtain an intermediate hash value, extract target optional image header information matching the machine type code from the optional image header according to the machine type code, and perform hash calculation according to the intermediate hash value and the target optional image header information to obtain the local hash value.
Optionally, based on any one of the foregoing embodiments, the file identifying apparatus provided in this embodiment further includes:
and the second matching module (not shown in the figure) is used for carrying out pattern string feature matching on the file to be identified to identify whether the file to be identified is a virus file or not when the matching result of the first matching module is that the matching is unsuccessful.
Further, the file identification apparatus provided in this embodiment further includes:
And a third matching module (not shown in the figure) configured to perform full-text hash calculation on the file to be identified to obtain a hash result when the matching result of the first matching module 204 is unsuccessful, or when the matching result of the second matching module (not shown in the figure) is that the file to be identified is not identified as a virus file, and identify whether the file to be identified is a virus file according to the hash result.
Optionally, based on any one of the foregoing embodiments, the extracting module 202 is specifically configured to disassemble the file to be identified by using a file parsing plug-in corresponding to the set type, so as to extract the file header of the file to be identified.
With the file identification apparatus provided by any embodiment of the application, file identification is performed on a received file to be identified; when the file is identified as a file of a set type, a file header is extracted from it; hash calculation is performed on target information in the file header to obtain a local hash value; the local hash value is matched against a virus hash feature library; and when the matching is successful, the file to be identified is confirmed to be a virus file. Because only a local hash of the file header needs to be calculated and matched against the virus hash feature library to identify whether the file is a virus file, virus detection of files passing through a network device does not need to occupy more memory, while the accuracy of the virus detection result is improved. In addition, because hash calculation does not need to be performed on the entire file to be identified, the speed of file identification is improved.
Based on the same inventive concept, the embodiments of the present application provide an electronic device, which may be, but is not limited to, the network security device described above. As shown in fig. 3, the electronic device includes a processor 301 and a machine-readable storage medium 302, the machine-readable storage medium 302 storing a computer program executable by the processor 301, the processor 301 being caused by the computer program to perform a file identification method provided by any of the embodiments of the present application. The electronic device further comprises a communication interface 303 and a communication bus 304, wherein the processor 301, the communication interface 303 and the machine readable storage medium 302 perform communication with each other via the communication bus 304.
The communication bus mentioned above for the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, the bus is represented in the figure by only one bold line, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The machine-readable storage medium 302 may be a memory, which may include random access memory (RAM), double data rate synchronous dynamic random access memory (DDR SDRAM), or non-volatile memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc., or may be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The electronic device and machine-readable storage medium embodiments are described relatively briefly because the method content they involve is substantially similar to that of the method embodiments described above; for relevant points, reference should be made to the description of the method embodiments.
It is noted that relational terms such as first and second, and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The implementation of the functions and roles of each unit/module in the above device is detailed in the implementation of the corresponding steps in the above method, and will not be repeated here.
Since the device embodiments essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: the units/modules illustrated as separate components may or may not be physically separate, and the components shown as units/modules may or may not be physical units/modules, i.e., they may be located in one place or may be distributed over a plurality of network units/modules. Some or all of the units/modules may be selected according to actual needs to achieve the purposes of the present solution. Those of ordinary skill in the art can understand and implement the present application without creative effort.
The foregoing describes only the preferred embodiments of the application and is not intended to limit the application. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the application shall fall within the scope of protection of the application.