Disclosure of Invention
In view of the above, the present invention provides a file compliance detection tool, a method of using the tool, a device, a medium, and a product, so as to solve the problem that file compliance is difficult to check in an intranet environment.
The invention provides a file compliance detection tool which comprises a scanning engine, an analysis engine, a security policy library and a service database, wherein the scanning engine is used for acquiring a file to be detected in an OOXML format, invoking a software development kit configured based on the OOXML protocol to scan the file to be detected to obtain text contents to be detected in the file to be detected, the security policy library is used for storing a sensitive word dictionary and the security detection rule library, the analysis engine is used for obtaining a sensitive word to be detected from the text contents to be detected according to the sensitive word recorded in the sensitive word dictionary in a matching mode, and conducting risk analysis on the sensitive word to be detected by utilizing the security detection rule library to generate a file compliance detection report, and the service database is used for storing the file compliance detection report and the text contents to be detected.
In some optional embodiments, the software development kit comprises a decompression interface, a type identification interface, a file positioning interface, a file summarizing interface and a text extraction interface. The decompression interface is used for decompressing the file to be detected into a preset directory. The type identification interface is used for determining the file type of each decompressed file. The file positioning interface is used for locating at least one target fragment file according to the file types of the decompressed files, the target fragment file being a fragment that contains part of the text content to be detected. The file summarizing interface is used for collecting all fragment files associated with the target fragment file to obtain a file set. The text extraction interface is used for extracting all element information in the file set according to the XML Schema Definition (XSD) and extracting the text parts of the element information.
In some optional embodiments, the software development kit is developed in the C language or the C++ language according to the OOXML protocol.
In some optional embodiments, the security detection rule base includes a sensitive word validation rule, a security level rule and a processing rule, the sensitive word validation rule is used for determining a triggering condition of triggering security risks by a sensitive word to be detected, the security level rule is used for evaluating the security level to which the sensitive word triggering the risks belongs, and the processing rule is used for generating security measures for coping with the risks according to the triggered security risks.
The invention also provides a method of using the file compliance detection tool. The method comprises: obtaining a file to be detected in the OOXML format; calling a software development kit configured based on the OOXML protocol to scan the file to be detected, to obtain the text content to be detected in the file; matching the sensitive words recorded in a sensitive word dictionary against the text content to be detected, to obtain sensitive words to be detected; and performing risk analysis on the sensitive words to be detected by using a security detection rule base, to generate a file compliance detection report.
In some optional embodiments, calling the software development kit configured based on the OOXML protocol to scan the file to be detected, to obtain the text content to be detected, comprises: decompressing the file to be detected into a preset directory; identifying the file type of each decompressed file; locating at least one target fragment file according to the file types of the decompressed files, the target fragment file being a fragment that contains part of the text content to be detected; collecting all fragment files associated with the target fragment file to obtain a file set; and extracting all element information from the file set according to the XML Schema Definition (XSD) and extracting the text parts of the element information to obtain the text content to be detected.
In some optional embodiments, performing risk analysis on the sensitive words to be detected by using the security detection rule base, to generate the file compliance detection report, comprises: judging, according to a sensitive word validation rule, whether each sensitive word to be detected triggers a security risk; marking, according to a security level rule, the security level of each sensitive word that triggers a risk; generating, according to a processing rule, security measures for coping with the triggered security risks; and writing the sensitive words that trigger security risks, the triggered security risks, the security levels, and the security measures into a blank text document to obtain the file compliance detection report.
In a third aspect, the invention provides a computer device comprising a memory and a processor, the memory and the processor being communicatively coupled to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the method of the first aspect or any of its corresponding embodiments.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of the first aspect or any of its corresponding embodiments.
In a fifth aspect, the invention provides a computer program product comprising computer instructions for causing a computer to perform the method of the first aspect or any of its corresponding embodiments.
The technical scheme provided by the invention has the following advantages:
The software development kit configured based on the OOXML protocol provides the ability to identify the text content of text data, and the sensitive word dictionary and the security detection rule base are integrated to perform compliance detection on the identified text content. The software development kit is small in size: text is recognized on the basis of the XML protocol, so no large library of picture material is required, and lightweight compliance rule detection can be performed on document-format files other than pictures. For units with high data security requirements, the scheme implements compliance detection with extremely lightweight software that may be delivered on nothing more than a USB flash drive or an optical disc. The main computing resources can therefore be devoted to service functions such as security identification and protection, the security and reliability of the scheme are ensured, and the problem that file compliance detection is difficult for units that cannot connect to an external network is solved.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In accordance with an embodiment of the present invention, a file compliance detection tool embodiment is provided, it being noted that the steps shown in the flowchart of the figures may be performed in a computer system, such as a set of computer executable instructions, and, although a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in an order other than that shown or described herein.
The file compliance detection tool comprises a scanning engine, an analysis engine, a security policy library and a service database, wherein the scanning engine is used for acquiring a file to be detected in an OOXML format, invoking a software development suite configured based on an OOXML protocol to scan the file to be detected to obtain text contents to be detected in the file to be detected, the security policy library is used for storing a sensitive word dictionary and the security detection rule library, the analysis engine is used for obtaining a sensitive word to be detected from the text contents to be detected according to the sensitive word recorded in the sensitive word dictionary in a matching mode, performing risk analysis on the sensitive word to be detected by using the security detection rule library to generate a file compliance detection report, and the service database is used for storing the file compliance detection report and the text contents to be detected.
Specifically, unstructured data is data whose presentation follows no specific format or rule, such as pictures, office documents, text, audio, video, and reports. Because such data has no regular structure, it is inconvenient to retrieve and access. Apart from pictures, however, most document data can be stored in the OOXML (Office Open XML) format. OOXML is an XML-based office document format covering Word documents, Excel spreadsheets, PowerPoint presentations, and charts, diagrams, shapes, and the like. The specification has been adopted by ISO and IEC as the ISO/IEC 29500 standard. Based on this standard, the embodiment of the invention develops a base library from the bottom layer of the protocol and provides an API that is largely compatible with the calling style of the Open XML SDK (Software Development Kit). Upper-layer applications use this SDK to extract the text of files in office document formats, thereby enabling security detection of office document data in special industries.
As shown in FIG. 1, the file compliance detection tool provided by the invention comprises an execution component and a database, and the database comprises a security policy library and a service database. The security policy library is the core and is divided into two modules. One is the sensitive word dictionary: a built-in dictionary table of sensitive words, defined according to the current security rules and the customer's business or software, that is used for search and matching. The other is the security detection rule base, which is preset with a number of determination rules for security detection. These rules are the control items of the detection behavior: they determine which sensitive words take effect during security compliance analysis and what security risk each one carries.
In this embodiment, the service core of the execution component consists of a scanning engine and an analysis engine, and the technical core of the scanning engine is the OOXML-based SDK. The scanning engine first retrieves all files of a matching file format in the target directory, then extracts the text content of those files by calling the SDK, and then stores the extracted text content in the service database (the information database shown in FIG. 1). In this embodiment, files of a matching file format are XML-type files, which can be divided into different types according to the content they present, for example, html, ppt, doc, xlx, pptx, docx.
The analysis engine is separate from the scanning engine, and each performs its own function independently. The analysis engine loads the service database, the sensitive word dictionary table, and the security detection rule base at the same time. First, using the sensitive word dictionary and the text in the information database as the items to compare, it performs sensitive word matching on the extracted text content. It then records, processes, and summarizes the matched sensitive words, using the security detection rule base as the analysis standard. Finally, it uploads the analysis result to the service database as the current file compliance detection report, which security practitioners can view in the software or download as a document. The security policy library and the service database are components that exist in the form of SQL-type databases.
According to the technical scheme provided by the embodiment of the invention, the software development kit configured based on the OOXML protocol provides the ability to identify the text content of text data, and the sensitive word dictionary and the security detection rule base are integrated to perform compliance detection on the identified text content, so that a complete software tool is formed. The software development kit is small in size: text is recognized on the basis of the XML protocol, so no large library of picture material is required, and lightweight compliance rule detection can be performed on document-format files other than pictures. For units with high data security requirements, the scheme implements compliance detection with extremely lightweight software, and the file compliance detection tool can be stored on a USB flash drive or an optical disc. The main computing resources can therefore be devoted to service functions such as security identification and protection, the security and reliability of the tool are ensured, and the problem that file compliance detection is difficult for units that cannot connect to an external network is solved.
In some optional embodiments, the software development kit comprises a decompression interface, a type identification interface, a file positioning interface, a file summarizing interface and a text extraction interface. The decompression interface is used for decompressing the file to be detected into a preset directory. The type identification interface is used for determining the file type of each decompressed file. The file positioning interface is used for locating at least one target fragment file according to the file types of the decompressed files, the target fragment file being a fragment that contains part of the text content to be detected. The file summarizing interface is used for collecting all fragment files associated with the target fragment file to obtain a file set. The text extraction interface is used for extracting all element information in the file set according to the XML Schema Definition (XSD) and extracting the text parts of the element information.
Specifically, the main business of the file compliance detection tool provided by the invention is a data security detection service; the tool is not intended to perform every general operation on a text document. The main functions of the software development kit configured based on the OOXML protocol are therefore to identify the text of Office documents and to read that text fragment by fragment. The kit supports every XML office document type but keeps only the reading function for each type, which ensures that the components are light enough while still meeting the requirements.
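The scanning behaviour described above can be illustrated with a minimal sketch. It is written in Python purely for illustration (the embodiment's SDK is developed in C or C++), and the suffix set, the table name `extracted_text`, and the `extract_text` callback standing in for the SDK call are all assumptions, not part of the embodiment.

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical set of suffixes the scan engine treats as matching formats.
MATCHING_SUFFIXES = {".docx", ".pptx", ".xlsx"}


def find_candidate_files(target_dir):
    """Recursively collect files whose suffix is in MATCHING_SUFFIXES."""
    root = Path(target_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in MATCHING_SUFFIXES
    )


def scan_directory(target_dir, extract_text, conn):
    """Extract text from every candidate file and store it in the database.

    extract_text stands in for the SDK call: it takes a path and returns a string.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extracted_text (path TEXT PRIMARY KEY, content TEXT)"
    )
    files = find_candidate_files(target_dir)
    for path in files:
        conn.execute(
            "INSERT OR REPLACE INTO extracted_text (path, content) VALUES (?, ?)",
            (str(path), extract_text(path)),
        )
    conn.commit()
    return len(files)


def demo():
    """Scan a temporary directory with a stub extractor; return count and stored names."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "sub").mkdir()
        (Path(tmp) / "a.docx").write_bytes(b"")
        (Path(tmp) / "sub" / "b.PPTX").write_bytes(b"")
        (Path(tmp) / "notes.txt").write_bytes(b"")
        conn = sqlite3.connect(":memory:")
        count = scan_directory(tmp, lambda p: "stub text for " + p.name, conn)
        rows = conn.execute("SELECT path FROM extracted_text").fetchall()
        conn.close()
        return count, sorted(Path(r[0]).name for r in rows)
```

Passing the extractor in as a callback keeps the directory walk and the database write independent of how the text is actually obtained, which mirrors the separation between the scanning engine and the SDK.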
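The sensitive word matching performed by the analysis engine could look roughly like the following sketch. The embodiment does not specify a matching algorithm, so the case-insensitive substring search used here is an assumption, as are the function name and the sample dictionary in the usage note.

```python
from collections import Counter


def match_sensitive_words(text, dictionary):
    """Return a Counter mapping each dictionary word found in text to its hit count.

    Matching is a plain case-insensitive substring search; overlapping hits
    are counted. Words that never occur are absent from the result.
    """
    lowered = text.lower()
    hits = Counter()
    for word in dictionary:
        needle = word.lower()
        if not needle:
            continue  # skip empty dictionary entries
        start = lowered.find(needle)
        while start != -1:
            hits[word] += 1
            start = lowered.find(needle, start + 1)
    return hits
```

For example, calling `match_sensitive_words("Project Falcon budget. The falcon launch code is secret.", ["falcon", "launch code", "password"])` counts two hits for "falcon" and one for "launch code", and leaves "password" out of the result.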
According to the OOXML standard, an office document is actually a zip compression package composed of a group of XML files and files of other types (mostly picture types). When the package is decompressed according to the zip format, the following structure is obtained.
1. The _rels directory, which stores files with the rels suffix and records the dependency relationships between the parts (which can be understood as file fragments).
2. The docProps directory, which stores the file attributes and the settings of the application program; it has little relation to the actual content of each file.
3. A content directory, named according to the document type (for example word or ppt), which stores the fragments recorded as files with the xml suffix; the stored fragments are the text content actually recorded in the document.
4. The [Content_Types].xml file, which records the file type of each file contained in the compression package. It should therefore be noted that the true type of each file in the zip compression package should not be judged from the file suffix but from the [Content_Types].xml file. The type corresponding to a target file fragment is obtained through its markup language (ML); for the meaning of each ML, see the following table.
TABLE 1 Markup Language (ML) Markup for OOXML parsing
Based on the storage characteristics of the OOXML file, as shown in FIG. 2, the basic flow of the OOXML file analysis based on the OOXML protocol is as follows:
1. Decompress the Office file, which is a zip compression package.
2. Read the [Content_Types].xml file to obtain the types of all files, that is, to identify whether the zip package is actually in Word, PPT, Excel, or another format.
3. Taking a docx file as the target file to be read, first read the relationship file under the _rels directory, and obtain from it the location of the document.xml file, namely word/document.xml.
4. Read the word/document.xml file and its associated relationship file word/_rels/document.xml.rels to obtain the storage locations of all fragment files of the Word document.
5. Then, according to the XML Schema Definition (XSD) of WML (WordprocessingML, which describes the data in Word), learn all element information in the files, so as to extract the text parts of the element information.
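The five steps above can be sketched for a docx package as follows. This is an illustrative Python sketch using only the standard library, not the embodiment's C or C++ SDK; the function names and the tiny in-memory sample package are assumptions made so that the example is self-contained. The namespace and content-type strings are those defined for OOXML packages.

```python
import io
import posixpath
import xml.etree.ElementTree as ET
import zipfile

CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
OFFICE_DOC_REL = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
)
WORD_MAIN_CT = (
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
)


def build_sample_docx():
    """Build a minimal in-memory package so the sketch runs without external files."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr(
            "[Content_Types].xml",
            '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
            '<Default Extension="rels" ContentType='
            '"application/vnd.openxmlformats-package.relationships+xml"/>'
            '<Override PartName="/word/document.xml" ContentType="' + WORD_MAIN_CT + '"/>'
            "</Types>",
        )
        z.writestr(
            "_rels/.rels",
            '<Relationships xmlns='
            '"http://schemas.openxmlformats.org/package/2006/relationships">'
            '<Relationship Id="rId1" Type="' + OFFICE_DOC_REL + '" '
            'Target="word/document.xml"/></Relationships>',
        )
        z.writestr(
            "word/document.xml",
            '<w:document xmlns:w='
            '"http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body>'
            "<w:p><w:r><w:t>Quarterly </w:t></w:r><w:r><w:t>report</w:t></w:r></w:p>"
            "<w:p><w:r><w:t>Internal use only</w:t></w:r></w:p>"
            "</w:body></w:document>",
        )
    return buf.getvalue()


def extract_docx_text(data):
    """Follow steps 1-5 for a Word package; return (paragraph texts, related targets)."""
    with zipfile.ZipFile(io.BytesIO(data)) as z:               # step 1: open the zip
        types = ET.fromstring(z.read("[Content_Types].xml"))   # step 2: read part types
        overrides = {
            o.get("PartName"): o.get("ContentType")
            for o in types.iter(CT_NS + "Override")
        }
        rels = ET.fromstring(z.read("_rels/.rels"))            # step 3: find main part
        main = next(
            r.get("Target") for r in rels.iter(REL_NS + "Relationship")
            if r.get("Type") == OFFICE_DOC_REL
        ).lstrip("/")
        if overrides.get("/" + main) != WORD_MAIN_CT:
            raise ValueError("package is not a Word document")
        # step 4: the part-level relationship file lists associated fragments, if any
        rel_path = posixpath.join(
            posixpath.dirname(main), "_rels", posixpath.basename(main) + ".rels"
        )
        related = []
        if rel_path in z.namelist():
            part_rels = ET.fromstring(z.read(rel_path))
            related = [r.get("Target") for r in part_rels.iter(REL_NS + "Relationship")]
        root = ET.fromstring(z.read(main))                     # step 5: keep text only
        paragraphs = [
            "".join(t.text or "" for t in p.iter(W_NS + "t"))
            for p in root.iter(W_NS + "p")
        ]
    return paragraphs, related
```

Note that the type check in step 2 relies on the content type recorded in `[Content_Types].xml` rather than on the file suffix, as the description requires. The sample package has no part-level relationship file, so the list of related fragments is empty in that case.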
For step 5, the present embodiment states that the content of each file fragment is obtained through the XSD definition. This is because each ML is defined in XML format, so the corresponding attribute descriptions all conform to the format rules of XML, that is, to the XSD definition. The role of the XSD definition mainly includes: defining the elements that can appear in a document; defining the attributes that can appear in a document; defining which elements are sub-elements; defining the order of sub-elements; defining the number of sub-elements; defining whether an element is empty or can contain text; defining the data types of elements and attributes; and defining default values and fixed values for elements and attributes. XML elements are the basic units that form an XML document. Each element comprises a start tag, an end tag, and the content between the two tags, which can be text, sub-elements, comments, and the like. The concepts of sub-elements, order, attributes, comments, and so on belong to the prior art and are not strongly related to the text content to be extracted, so they are not described further here.
Therefore, based on the above file parsing flow, the software development kit mainly comprises a decompression interface, a type identification interface, a file positioning interface, a file summarizing interface and a text extraction interface, each of which is called by the scanning engine. The decompression interface decompresses the Office file as a zip compression package. The type identification interface reads the [Content_Types].xml file to obtain the file type of each decompressed file. The file positioning interface reads the relationship file under the _rels directory to locate the target fragment file of the Word, PPT, Excel, or other document type. The file summarizing interface obtains the storage locations of all fragment files of the Word, PPT, or Excel document to obtain the file set. The text extraction interface learns all element information in the files from the XSD definition and extracts the text parts of the element information.
The software development kit configured by the embodiment of the invention can accurately read the text of office documents in the OOXML format and supports every XML office document type, while keeping only the reading function for each document type. This ensures that the components are light enough while still meeting the requirements, and significantly alleviates the problems of difficult deployment and low detection efficiency of file compliance detection systems in an intranet environment.
In some optional embodiments, the software development kit configured by the present invention is developed in the C language or the C++ language according to the OOXML protocol.
Specifically, some OOXML-based components currently on the market often fail to meet autonomy requirements and are not suitable for security compliance detection software. One example is Apache POI, which is implemented in Java: the component requires a JRE to run Java projects, so to meet the cross-platform requirement of the security compliance detection software, JREs for both Linux and Windows would have to be prepared to support Apache POI. Another example is the Open XML SDK for Office provided by Microsoft: although it can read, write, and modify documents in the OOXML format, it is developed in C# and is tied to the .NET development framework. Each of these existing components is suited only to its own family of system products and cannot be deployed across platforms, so it cannot be combined with a file detection tool. In the file compliance detection tool provided by the embodiment of the invention, the software development kit is based on the OOXML standard: a base library is developed from the bottom layer of the protocol in general-purpose C or C++, and an API compatible with the calling style of the Open XML SDK is provided. Upper-layer applications use this SDK to extract the text of files in office document formats, thereby enabling security detection of office document data in special industries, achieving cross-platform deployment, and broadening the application scenarios of the file compliance detection tool.
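The relationship between elements, attributes, and text can be seen on a small WordprocessingML fragment. The sketch below is illustrative only: it uses Python's standard XML parser rather than the embodiment's SDK, it does not validate against the XSD, and the fragment and function names are assumptions. It shows that formatting elements such as `w:jc` and `w:b` carry attributes or nothing at all, while the text to be extracted sits between the start and end tags of `w:t`.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical paragraph: centered, with one bold run and one plain run.
FRAGMENT = (
    '<w:p xmlns:w="' + W + '">'
    '<w:pPr><w:jc w:val="center"/></w:pPr>'
    "<w:r><w:rPr><w:b/></w:rPr><w:t>Confidential</w:t></w:r>"
    '<w:r><w:t xml:space="preserve"> draft</w:t></w:r>'
    "</w:p>"
)


def describe_elements(xml_text):
    """Return (local tag, attributes keyed by local name, text) in document order."""
    def local(name):
        # Strip the "{namespace}" prefix that ElementTree puts on qualified names.
        return name.split("}", 1)[-1]

    rows = []
    for elem in ET.fromstring(xml_text).iter():
        attrs = {local(k): v for k, v in elem.attrib.items()}
        rows.append((local(elem.tag), attrs, elem.text))
    return rows


def text_only(xml_text):
    """Concatenate the content of w:t elements, the only elements here that hold text."""
    root = ET.fromstring(xml_text)
    return "".join(t.text or "" for t in root.iter("{" + W + "}t"))
```

Walking the fragment with `describe_elements` lists nine elements, of which only the two `t` elements have text, and `text_only` joins them into the string "Confidential draft".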
In some optional embodiments, the security detection rule base provided by the invention includes a sensitive word validation rule, a security level rule and a processing rule, the sensitive word validation rule is used for determining a triggering condition of triggering security risk by a sensitive word to be detected, the security level rule is used for evaluating the security level to which the sensitive word triggering the risk belongs, and the processing rule is used for generating security measures for coping with the risk according to the triggered security risk.
Specifically, the security detection rule base of the embodiment of the invention performs security monitoring on the extracted document data. It determines which sensitive words take effect in which specific scenarios and therefore which security risks are triggered, defines different security levels for different sensitive words, and generates security measures for coping with the triggered security risks. The file compliance detection report generated subsequently can thus include a detailed description of where each risk is located and how to respond to it, which improves the efficiency of risk response, reduces the difficulty of manual risk analysis, and improves the file detection efficiency of the file compliance detection tool in special units that have only an intranet.
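One way to picture the three kinds of rules is as fields of a single rule record. The following Python sketch is an assumption about how such a rule base might be represented: the embodiment does not define a rule format, so the `Rule` and `Finding` types, the occurrence-count trigger condition, and the sample rules are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    word: str       # sensitive word the rule applies to
    min_count: int  # validation rule: triggered when occurrences >= min_count
    level: str      # security level rule: level assigned once triggered
    measure: str    # processing rule: measure recommended for the risk


@dataclass(frozen=True)
class Finding:
    word: str
    count: int
    level: str
    measure: str


# Hypothetical rule base used only for illustration.
EXAMPLE_RULES = [
    Rule("launch code", 1, "high", "quarantine file"),
    Rule("internal", 3, "low", "notify owner"),
]


def evaluate(hits, rules):
    """Apply rules to a {word: count} mapping; return findings for triggered rules."""
    findings = []
    for rule in rules:
        count = hits.get(rule.word, 0)
        if count >= rule.min_count:
            findings.append(Finding(rule.word, count, rule.level, rule.measure))
    return findings
```

With these sample rules, a document containing "launch code" once and "internal" twice yields a single high-level finding, because the second rule only takes effect from three occurrences.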
The application method of the file compliance detection tool provided in this embodiment, as shown in fig. 3, specifically includes the following steps:
Step S301, obtaining a file to be detected in the OOXML format;
Step S302, calling a software development kit configured based on the OOXML protocol to scan the file to be detected, to obtain the text content to be detected in the file to be detected;
Step S303, matching the sensitive words recorded in the sensitive word dictionary against the text content to be detected, to obtain sensitive words to be detected;
Step S304, performing risk analysis on the sensitive words to be detected by using the security detection rule base, to generate a file compliance detection report.
In particular, the explanation of the principle of the present method embodiment may refer to the related description of the foregoing tool embodiment, and will not be repeated here.
In some alternative embodiments, step S302 includes:
Step a1, decompressing the file to be detected into a preset directory;
Step a2, identifying the file type of each decompressed file;
Step a3, locating at least one target fragment file according to the file type of each decompressed file, wherein the target fragment file is a fragment containing part of the text content to be detected;
Step a4, collecting all fragment files associated with the target fragment file to obtain a file set;
Step a5, extracting all element information from the file set according to the XML Schema Definition (XSD), and extracting the text parts of the element information to obtain the text content to be detected.
In particular, the explanation of the principle of the present method embodiment may refer to the related description of the foregoing tool embodiment, and will not be repeated here.
In some alternative embodiments, step S304 includes:
Step b1, judging, according to the sensitive word validation rule, whether each sensitive word to be detected triggers a security risk;
Step b2, marking, according to the security level rule, the security level of each sensitive word that triggers a risk;
Step b3, generating, according to the processing rule, security measures for coping with the triggered security risks;
Step b4, writing the sensitive words that trigger security risks, the triggered security risks, the security levels, and the security measures into a blank text document to obtain the file compliance detection report.
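Steps b1 to b4 can be sketched as follows. This is a minimal Python illustration under stated assumptions: the rule table `RULES`, its occurrence-count trigger, the report line layout, and the function names are hypothetical and are not prescribed by the embodiment.

```python
from pathlib import Path

# Hypothetical rule base: word -> (minimum occurrences, security level, measure).
RULES = {
    "launch code": (1, "high", "quarantine the file and notify the security officer"),
    "internal": (3, "low", "remind the author of the labeling policy"),
}


def build_report(hits, rules):
    """Return report lines for every sensitive word whose rule is triggered.

    hits maps each matched sensitive word to its number of occurrences.
    """
    lines = ["File compliance detection report"]
    for word, count in sorted(hits.items()):
        if word not in rules:
            continue
        min_count, level, measure = rules[word]
        if count < min_count:  # step b1: does this word trigger a risk?
            continue
        risk = f"'{word}' occurs {count} time(s)"
        # steps b2 and b3: attach the security level and the security measure
        lines.append(f"word={word}; risk={risk}; level={level}; measure={measure}")
    return lines


def write_report(lines, path):
    """Step b4: write the findings into a new text document at path."""
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Keeping the report as a list of plain text lines matches the blank text document mentioned in step b4 and keeps the output readable without any additional software on an intranet machine.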
In particular, the explanation of the principle of the present method embodiment may refer to the related description of the foregoing tool embodiment, and will not be repeated here.
The embodiment of the invention also provides a computer device for running the file compliance detection tool shown in the figure 1.
Referring to FIG. 4, which is a schematic structural diagram of a computer device according to an optional embodiment of the present invention, the computer device includes one or more processors 10, a memory 20, and interfaces for connecting the components, including a high-speed interface and a low-speed interface. The components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the computer device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In some optional embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple computer devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is illustrated in FIG. 4.
The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip, among others. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, a general-purpose array logic, or any combination thereof.
Wherein the memory 20 stores instructions executable by the at least one processor 10 to cause the at least one processor 10 to perform a method for implementing the embodiments described above.
The memory 20 may include a storage program area that may store an operating system, application programs required for at least one function, and a storage data area that may store data created according to the use of the computer device, etc. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, memory 20 may optionally include memory located remotely from processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The memory 20 may comprise volatile memory, such as random access memory, or nonvolatile memory, such as flash memory, hard disk or solid state disk, or the memory 20 may comprise a combination of the above types of memory.
The computer device also includes a communication interface 30 for the computer device to communicate with other devices or communication networks.
The embodiments of the present invention also provide a computer readable storage medium. The method according to the embodiments of the present invention described above may be implemented in hardware or firmware, or as computer code that can be recorded on a storage medium, or as computer code that is originally stored in a remote storage medium or a non-transitory machine readable storage medium, downloaded through a network, and stored in a local storage medium, so that the method described herein can be processed by software stored on a storage medium using a general purpose computer, a special purpose processor, or programmable or special purpose hardware. The storage medium may be a magnetic disk, an optical disk, a read-only memory, a random-access memory, a flash memory, a hard disk, a solid state disk, or the like, and may further include a combination of the above types of memories. It will be appreciated that a computer, processor, microprocessor controller or programmable hardware includes a storage element that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the methods illustrated by the above embodiments.
Portions of the present invention may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or aspects in accordance with the present invention by way of operation of the computer. Those skilled in the art will appreciate that the existence of computer program instructions in a computer-readable medium includes, but is not limited to, source files, executable files, installation package files, and the like, and accordingly, the manner in which computer program instructions are executed by a computer includes, but is not limited to, the computer directly executing the instructions, or the computer compiling the instructions and then executing the corresponding compiled programs, or the computer reading and executing the instructions, or the computer reading and installing the instructions and then executing the corresponding installed programs. Herein, a computer-readable medium may be any available computer-readable storage medium or communication medium that can be accessed by a computer.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.