Method for generating word document in template mode
Technical Field
The invention relates to document processing technology, and in particular to a method for generating a Word document from a template.
Background
Word documents are widely used in daily work. In some document-processing scenarios, large numbers of documents share the same format, and business personnel must edit and review them by hand, which is time-consuming and error-prone. It is therefore desirable to provide a templating method for Word: the dynamic content in a format document is written as a variable expression (similar to ${variable name} in other template engines), and each variable expression is replaced with the actual dynamic content when the document is generated, which addresses both the time cost and the error rate of manual editing.
Microsoft Office Word is a widely used word processing program and an essential productivity tool in daily work.
Apache POI is an open-source library of the Apache Software Foundation that provides Java programs with an API for reading and writing Microsoft Office file formats, so that a business system can generate and modify Word documents.
The ZIP file format is a file format for data compression and archiving. Microsoft has provided built-in support for ZIP since the Windows ME operating system, so ZIP archives can be opened and created even when no decompression software is installed on the user's computer; OS X and popular Linux operating systems provide similar support. ZIP is therefore a common choice when files are distributed over a network.
XML is a markup language used to mark up electronic documents so that they have a structure.
There are two existing ways to generate a large number of documents that share a format but differ in content:
The first is for business personnel to write the documents by hand; as the number of documents grows, this consumes a large amount of time and easily introduces errors.
The second is for a developer to obtain the business data and generate the documents through the API provided by Apache POI. This solves the problem of generation efficiency, but as the number of business documents grows, the developer must write business code for each set of business data, business personnel must spend a large amount of time on the document format, and further time is needed for testing, version release and so on.
At present, a business system generates Word documents according to business logic: developers must write the corresponding business code and query the business data in order to generate each document, and new business code must be written as the number of business scenarios grows. Because the Word format is complex, setting Word styles in code is far less convenient than editing them visually in Word, so generating documents this way is time-consuming. Each new business scenario must also be developed, tested and deployed before it can go online, which requires additional manpower, time and maintenance cost.
Disclosure of Invention
The invention provides a method for generating a Word document from a template. Template expressions marked as ${variable name} are written in a docx document; all expressions in the docx document are obtained by ZIP and XML analysis of word/document.xml, and their values are substituted to generate a new docx document, which completes document generation. The format document can be written according to actual business requirements, is simple to write and easy to integrate, and can be applied to business scenarios.
The object of the invention is achieved as follows. A method for generating Word documents from a template comprises the following steps:
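The ZIP and XML analysis mentioned above can be illustrated with a minimal Python sketch. It is not the implementation of the invention; the helper names load_document_xml and make_sample_docx are hypothetical, and the sample archive contains only word/document.xml, so it is a stand-in rather than a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, which the w:p, w:r and w:t elements belong to.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def load_document_xml(docx_bytes: bytes) -> ET.Element:
    """Open a .docx as a ZIP archive and parse word/document.xml (hypothetical helper)."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return ET.fromstring(archive.read("word/document.xml"))


def make_sample_docx() -> bytes:
    """Build a tiny stand-in archive for the demo; a real docx has more entries."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Dear ${name}</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", body)
    return buffer.getvalue()


root = load_document_xml(make_sample_docx())
paragraphs = root.findall(f".//{{{W_NS}}}p")  # every w:p paragraph node
print(len(paragraphs), root.tag)
```

Only the Python standard library is used here, which matches the intent of keeping software dependencies small.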
1) decompressing the docx document, extracting the word/document.xml file, and parsing document.xml to obtain an XML object;
2) traversing from the root node object of document.xml to obtain the w:p paragraph nodes, all text content of the Word document being located in w:p nodes;
3) traversing all w:r/w:t child nodes of each w:p paragraph node, obtaining the text content of each w:t node, and splicing the texts to obtain the paragraph content;
4) judging by a regular expression whether the paragraph content contains an expression, and if not, continuing with the analysis of the next paragraph;
5) traversing the w:t text nodes and beginning to analyze the text content of each w:t node for expressions;
6) judging whether the expression start marker ${ is present, and if so, recording the position of that w:t node within the w:p paragraph; if not, repeating step 6) on the next w:t node to search for a start marker;
7) continuing by judging whether the expression end marker } is present, and if so, recording the position of that w:t node within the w:p paragraph; if not, repeating step 7) on the next w:t node to search for an end marker;
8) recording the start position and the end position, collecting the text node information from the starting w:t text node to the ending w:t text node, and obtaining the text content from the starting node to the ending node; if some w:t nodes have not yet been analyzed, returning to step 5) to continue searching for expressions;
9) traversing the information collected for the paragraph in which the expression ${} is located, obtaining all w:t nodes of the expression, and obtaining the starting w:t node;
10) traversing all w:t nodes of the expression and splicing the text content of all of these w:t nodes;
11) clearing the text content of the w:t nodes after the starting w:t node; because the ending w:t node may also contain the start marker of the next expression, only the text before the end marker is cleared in that node; then writing all of the content spliced in step 10) into the starting w:t node;
12) creating a new empty string, locating the start marker of the expression and appending the text that precedes it to the new string, then locating the end marker of the expression and extracting the variable name; if the whole string has been searched, returning to step 9) to analyze the next expression;
13) obtaining the value of the corresponding variable from the parameter map according to the variable name extracted in step 12), and appending the value to the string content;
14) judging whether a start marker ${ exists in the text content after the expression handled in step 12), and if so, returning to step 12) and continuing to splice and analyze expressions until all expressions have been analyzed;
15) after all w:p nodes have been traversed and analyzed, generating a new document.xml file and overwriting the word/document.xml file in the template document with it, which completes the expression replacement of the document.
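Steps 2) to 8) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the claimed procedure itself: the regular expression is an assumed pattern because the specification does not reproduce its own, the function name find_expression_spans is hypothetical, and the sketch locates the starting and ending w:t nodes through character offsets in the spliced paragraph text instead of the node-by-node marker scan described in steps 5) to 8).

```python
import re
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"
# Assumed pattern: the specification does not reproduce its regular expression.
EXPRESSION = re.compile(r"\$\{[^{}]*\}")


def find_expression_spans(paragraph):
    """Return (expression, first w:t index, last w:t index) for one w:p node."""
    nodes = paragraph.findall(f"{W}r/{W}t")        # step 3: w:r/w:t child nodes
    texts = [node.text or "" for node in nodes]
    owner = []                                      # owner[i]: w:t index holding character i
    for index, text in enumerate(texts):
        owner.extend([index] * len(text))
    content = "".join(texts)                        # spliced paragraph content
    # Step 4 onwards: an empty result means the paragraph has no expression.
    return [
        (match.group(0), owner[match.start()], owner[match.end() - 1])
        for match in EXPRESSION.finditer(content)
    ]


# Word often splits one expression across several runs, as in this sample.
SAMPLE = (
    f'<w:p xmlns:w="{W_NS}">'
    "<w:r><w:t>Dear $</w:t></w:r>"
    "<w:r><w:t>{na</w:t></w:r>"
    "<w:r><w:t>me}, total: ${amount}</w:t></w:r>"
    "</w:p>"
)
spans = find_expression_spans(ET.fromstring(SAMPLE))
print(spans)
```

Working on offsets also covers the case where the $ and { characters of a start marker fall into different w:t nodes, which a search for the two-character marker inside a single node would miss.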
The advantages of the method are as follows. Only a template document needs to be prepared, and a program subsequently replaces the expressions with data, which makes the method suitable for efficiently generating Word documents in batches. Because a docx document is based on XML and ZIP, the method parses the XML structure, extracts the expressions and performs text replacement directly, which avoids disturbing the document structure and styles, improves the efficiency of generating format documents, and yields documents that conform to the Word document standard.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a view showing the frame structure of a docx file in the present invention.
Detailed Description
In existing business systems, producing a Word document generally depends on an office suite. Because many business personnel write documents, the format of a generated document may differ from the expected Word format, and when a large number of documents are needed this leads to low efficiency, errors and rework. To solve these problems, business personnel write a format document containing variable expressions and store it in the business system; the business system obtains the dynamic data required by the template document and calls the invention to generate the document. The specific scheme is described in detail below.
First, a template manager writes the format document, marking each piece of variable content with the replacement syntax ${variable name}, which completes the template document.
The invention is further described below with reference to the figures and an implementation:
1) decompressing the docx document, extracting the word/document.xml file, and parsing document.xml to obtain an XML object;
2) traversing from the root node object of document.xml to obtain the w:p paragraph nodes in preparation for obtaining the variable expressions in each paragraph, all text content of the Word document being located in w:p nodes (the position of ${name} is obtained by the analysis that follows);
3) traversing all w:r/w:t child nodes of each w:p paragraph node, obtaining the text content of each w:t node, and splicing the texts to obtain the paragraph content;
4) judging by a regular expression whether the paragraph content contains an expression, and if not, continuing with the analysis of the next paragraph;
5) traversing the w:t text nodes and beginning to analyze the text content of each w:t node for expressions;
6) judging whether the expression start marker ${ is present, and if so, recording the position of that w:t node within the w:p paragraph; if not, repeating step 6) on the next w:t node to search for a start marker;
7) continuing by judging whether the expression end marker } is present, and if so, recording the position of that w:t node within the w:p paragraph; if not, repeating step 7) on the next w:t node to search for an end marker;
8) recording the start position and the end position, and collecting the text node information from the starting w:t text node to the ending w:t text node, so that the text content from the starting node to the ending node can be obtained; if some w:t nodes have not yet been analyzed, returning to step 5) to continue searching for expressions;
9) traversing the expression information collected for the paragraph, obtaining all w:t nodes of the expression, and obtaining the starting w:t node;
10) traversing all w:t nodes of the expression and splicing the text content of all of these w:t nodes;
11) clearing the text content of every w:t node of the expression except the starting w:t node (the ending w:t node may also contain the start marker of the next expression; when the two coincide, only the text before the end marker is cleared in that node), and writing all of the content spliced in step 10) into the starting w:t node;
12) creating a new empty string, locating the start marker of the expression and appending the text that precedes it to the new string, then locating the end marker of the expression and extracting the variable name; if the whole string has been searched, returning to step 9) to analyze the next expression;
13) obtaining the value corresponding to the extracted variable name from the parameter map, and appending the value to the string content;
14) judging whether an expression start marker exists in the text content after the expression, and if so, returning to step 12) and continuing to splice and analyze expressions until all expressions have been analyzed;
15) after all w:p nodes have been traversed and analyzed, generating a new document.xml file and overwriting the word/document.xml file in the compressed template document with it, which completes the expression replacement and the generation of the document.
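Steps 9) to 15) can be sketched in Python as follows. This is a simplified variation under stated assumptions, not the exact procedure above: the pattern and the names fill_paragraph, render_docx and make_template are hypothetical, the replacement value is written directly into the starting w:t node while the later nodes of the expression are cleared, and expressions are processed from right to left so that earlier character offsets stay valid. The sample archive contains only word/document.xml and is a stand-in for a real template.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"
EXPRESSION = re.compile(r"\$\{([^{}]*)\}")  # assumed pattern, not from the source
ET.register_namespace("w", W_NS)            # keep the w: prefix when serialising


def fill_paragraph(paragraph, params):
    """Replace expressions in one w:p node, including ones split across w:t nodes."""
    nodes = paragraph.findall(f"{W}r/{W}t")
    texts = [node.text or "" for node in nodes]
    starts, owner, offset = [], [], 0
    for index, text in enumerate(texts):
        starts.append(offset)                # offset of each node in the spliced text
        owner.extend([index] * len(text))    # owner[i]: node holding character i
        offset += len(text)
    for match in reversed(list(EXPRESSION.finditer("".join(texts)))):
        name = match.group(1).strip()
        if name not in params:
            continue                         # assumption: unknown names are left as written
        first, last = owner[match.start()], owner[match.end() - 1]
        head = texts[first][: match.start() - starts[first]]
        tail = texts[last][match.end() - starts[last]:]
        if first == last:
            texts[first] = head + str(params[name]) + tail
        else:
            texts[first] = head + str(params[name])  # value goes into the starting node
            for middle in range(first + 1, last):
                texts[middle] = ""                   # clear nodes inside the expression
            texts[last] = tail                       # keep text after the end marker
    for node, text in zip(nodes, texts):
        node.text = text


def render_docx(template: bytes, params: dict) -> bytes:
    """Rewrite word/document.xml and copy every other archive entry unchanged."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template)) as source, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for item in source.infolist():
            data = source.read(item.filename)
            if item.filename == "word/document.xml":
                root = ET.fromstring(data)
                for paragraph in root.iter(f"{W}p"):
                    fill_paragraph(paragraph, params)
                data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
            target.writestr(item.filename, data)
    return output.getvalue()


def make_template() -> bytes:
    """Stand-in template whose first expression is split across three runs."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
        "<w:r><w:t>Dear $</w:t></w:r><w:r><w:t>{na</w:t></w:r>"
        "<w:r><w:t>me}, total: ${amount}</w:t></w:r>"
        "</w:p></w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", body)
    return buffer.getvalue()


result = render_docx(make_template(), {"name": "Alice", "amount": 42})
with zipfile.ZipFile(io.BytesIO(result)) as archive:
    new_root = ET.fromstring(archive.read("word/document.xml"))
print("".join(t.text or "" for t in new_root.iter(f"{W}t")))
```

A document.xml produced by Word declares many more namespace prefixes than w, and some attribute values refer to those prefixes by name, so a production implementation would need to preserve all of them when serialising; the sketch registers only the w prefix.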
Because a docx file is a compressed archive, generating the Word document does not depend on the operating system environment or on third-party middleware, so the method is independent of the operating system and can be deployed to generate Word documents in any environment.
The open-source software POI (http://poi.apache.org/) also supports parsing and editing Word documents and could in theory achieve the same function; the scheme of the invention was adopted in consideration of execution efficiency, software dependencies and the clear structure of docx documents.