CN102646179A


Info

Publication number: CN102646179A
Application number: CN2012100457361A
Authority: CN
Inventors: 刘红梅; 李雷
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2012-02-27
Filing date: 2012-02-27
Publication date: 2012-08-22

Abstract

The invention belongs to the field of multimedia signal processing and relates in particular to a method for embedding and extracting PDF (Portable Document Format) file information based on the PDF file body. The method uses the new file body added by an incremental (append) update of the PDF file as the carrier for hidden information. The hidden information can be written invisibly when the file is first created, has no effect on how the document is displayed, travels over the Internet together with the document content, offers a large embedding capacity, is not destroyed by transmission or common document editing operations, and is concealed, so attackers cannot easily find or destroy it. As a method for authenticating PDF documents, it can invisibly embed authentication information such as the author, provenance, and copyright of a document in the PDF file, which makes it practical for copyright authentication and authenticity verification of PDF files.

Description

Translated from Chinese

A PDF file information embedding and extraction method based on the PDF file body

Technical field

The invention belongs to the field of multimedia signal processing and relates in particular to a method for embedding and extracting PDF file information based on the PDF file body.

Background art

In recent years, with the rapid development of network technology, people increasingly transmit and obtain information over the Internet. At the same time, new ways of working such as e-commerce and e-government are being widely adopted, and more and more administrative and commercial documents, such as letters of authorization, registration forms, contracts, and invoices, are circulated and transmitted as electronic documents. In the open environment of the Internet, however, malicious acts such as copying and tampering constantly threaten the copyright ownership of electronic documents, and problems such as copyright theft, illegal distribution, and information forgery keep emerging. Against this background, data hiding for electronic documents is increasingly becoming a principal means of copyright authentication, authenticity verification, and dispute resolution.

PDF (Portable Document Format) is an electronic file format developed by Adobe. The format is platform independent and can be used on Windows, Unix, Mac, and other operating systems. A PDF file can encapsulate text, fonts, formatting, colors, and device- and resolution-independent graphics and images in a single file. It can also contain electronic content such as hypertext links, sound, and video, supports very long documents, and offers a high degree of integration, security, and reliability. In addition, PDF files use industry-standard compression algorithms, which makes them easy to transmit and store. These properties make PDF an ideal format for distributing electronic documents and disseminating digital information on the Internet. Research on information hiding in PDF documents is therefore of great practical significance in the current application environment. The structure of a PDF file in the prior art is briefly analyzed below as background for understanding the present invention.

Figure 1 shows the file structure of an original PDF, which consists of four parts: the header, the body, the cross-reference table, and the trailer. The header identifies the PDF version. The body consists of a series of indirect objects and holds essentially all of the content of the PDF file. The cross-reference table contains the address information of the indirect objects and initially has only one section. The trailer records the root object of the PDF file, the starting address of the cross-reference table, and related information.
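As an illustration of the four-part layout, the following minimal Python sketch reads the version from the header and the cross-reference offset recorded at the end of the trailer. The function name `describe_pdf` and the returned keys are hypothetical, and the sketch assumes a classic, uncompressed cross-reference table rather than a cross-reference stream.

```python
import re


def describe_pdf(data: bytes) -> dict:
    """Report the header version and the offset stored after the last 'startxref'.

    Illustrative only: assumes a classic cross-reference table and a
    'startxref <offset> %%EOF' footer for every file section.
    """
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    if header is None:
        raise ValueError("missing %PDF- header")
    # Each original file or incremental update ends with its own
    # 'startxref <offset> %%EOF' footer; the last one is the active one.
    footers = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data)
    if not footers:
        raise ValueError("missing startxref / %%EOF footer")
    return {
        "version": header.group(1).decode("ascii"),
        "xref_offset": int(footers[-1]),
        "sections": len(footers),
    }
```

Calling `describe_pdf` on a file that has been updated once would report two sections, since each update appends its own footer.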

Figure 2 shows the structure of a PDF file after an incremental (append) update. In an incremental update, every new or modified object is appended after the trailer of the original PDF file to form a new file body, and the new cross-reference section and new trailer that correspond to the new body are then appended at the end.

Figure 3 shows an example of a PDF cross-reference table. Each cross-reference table contains entries for a contiguous range of object numbers. Each table begins with a line containing the keyword xref, followed by a line with two numbers separated by a space: the first is the object number of the first object in the file body, and the second is the total number of objects in the file body. This is followed by one entry per line for each object in the PDF file. The entry structure is:

nnnnnnnnnn ggggg x y

Here nnnnnnnnnn is a 10-byte (10-digit) offset giving the number of bytes from the beginning of the PDF file to the beginning of the object; offsets shorter than 10 digits are padded with leading zeros. ggggg is a 5-byte (5-digit) generation number: except for object 0, the initial generation number of every object in the cross-reference table is 0, and each time an entry is reused it is assigned a new generation number, up to a maximum of 65535. x is the object status keyword: n marks an entry that is in use and f marks an entry that has been freed. eol is the end-of-line terminator. The example in Figure 3 gives the information for six objects, numbered 0 to 5.
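The entry layout can be made concrete with a short Python sketch. The names `XrefEntry`, `parse_xref_entry`, and `format_xref_entry` are hypothetical; the formatter writes a space followed by a line feed as the two-byte end-of-line marker, which gives the fixed 20-byte entry length used by classic cross-reference tables.

```python
from typing import NamedTuple


class XrefEntry(NamedTuple):
    offset: int       # byte offset of the object from the start of the file
    generation: int   # generation number, at most 65535
    in_use: bool      # True for 'n' entries, False for 'f' entries


def parse_xref_entry(line: str) -> XrefEntry:
    """Parse one 'nnnnnnnnnn ggggg x' entry (end-of-line already stripped or not)."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError("expected offset, generation and status keyword")
    offset, generation, status = fields
    if len(offset) != 10 or len(generation) != 5 or status not in ("n", "f"):
        raise ValueError("malformed cross-reference entry: %r" % line)
    return XrefEntry(int(offset), int(generation), status == "n")


def format_xref_entry(entry: XrefEntry) -> str:
    """Format an entry with zero padding; the result is exactly 20 characters."""
    status = "n" if entry.in_use else "f"
    return "%010d %05d %s \n" % (entry.offset, entry.generation, status)
```

For example, `parse_xref_entry("0000000017 00000 n")` yields an in-use entry at offset 17 with generation number 0.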

Summary of the invention

The technical problem solved by the present invention is to overcome the deficiencies of the prior art by providing a PDF file information embedding and extraction method based on the PDF file body, which embeds information in a newly created file body of the PDF file and can extract that information from the PDF file in order to authenticate it. Embedding information in a PDF with the present invention effectively addresses PDF copyright authentication and authenticity verification, and the invention is highly robust to editing of the PDF document.

To solve the above technical problem, the technical solution of the present invention is as follows:

A method for embedding and extracting PDF file information based on the PDF file body, comprising the following steps:

(1) Embed the hidden information, specifically:

Read in the original PDF file stream;

Read in the hidden information and divide it into segments, scramble each hidden information segment, and record the scrambling parameters;

Search the original PDF file stream and determine the largest object number;

Take the largest object number plus 1 as the number of the first new object inserted in the new file body, encode each hidden information segment and write it to the original PDF file in turn as a new object of the new file body, and generate a new-object position flag;

After the hidden information has been embedded, write the new cross-reference table and new trailer corresponding to the new file body to complete one incremental update;

Output the PDF file carrying the hidden information, and output the scrambling parameters and the new-object position flag as the key;

(2) Extract the hidden information, specifically:

Read the PDF file stream carrying the hidden information, together with the key;

Using the new-object position flag in the key, search the data stream of the PDF file and identify the new objects written by the incremental update;

Extract the data stream inside each identified new object and decode it;

Using the scrambling parameters in the key, descramble the decoded data streams of the new objects;

Combine the descrambled data streams in order and output them to obtain the hidden information.

In the above solution, the step of scrambling each hidden information segment and recording the scrambling parameters specifically consists of scrambling each hidden information segment with a chaotic map and recording the map parameters as the scrambling parameters.
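The text does not name a particular chaotic map. As a minimal sketch, assuming a logistic map x ← μ·x·(1 − x) is used to derive a byte permutation, the scrambling and its inverse could look like the following. The function names and the parameters `x0`, `mu`, and `burn_in` are illustrative choices, not part of the described method; the pair (`x0`, `mu`) plays the role of the recorded scrambling parameters.

```python
def logistic_permutation(length: int, x0: float, mu: float = 3.99,
                         burn_in: int = 100) -> list:
    """Derive a permutation of range(length) from a logistic-map orbit."""
    x = x0
    for _ in range(burn_in):          # discard the transient part of the orbit
        x = mu * x * (1.0 - x)
    values = []
    for _ in range(length):
        x = mu * x * (1.0 - x)
        values.append(x)
    # Ranking the chaotic values gives the order in which bytes are read.
    return sorted(range(length), key=values.__getitem__)


def scramble(segment: bytes, x0: float, mu: float = 3.99) -> bytes:
    """Reorder the bytes of one segment: output[pos] = segment[perm[pos]]."""
    perm = logistic_permutation(len(segment), x0, mu)
    return bytes(segment[i] for i in perm)


def unscramble(scrambled: bytes, x0: float, mu: float = 3.99) -> bytes:
    """Invert scramble() by regenerating the same permutation from the key."""
    perm = logistic_permutation(len(scrambled), x0, mu)
    restored = bytearray(len(scrambled))
    for pos, src in enumerate(perm):
        restored[src] = scrambled[pos]
    return bytes(restored)
```

Because the permutation depends only on the segment length and the map parameters, a receiver holding the same parameters can regenerate it and restore the original byte order.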

In the above solution, the new-object position flag is the set of new object numbers corresponding to all hidden information segments.

In the above solution, when the hidden information is read in and divided into segments, the number of hidden information segments is also recorded;

The new-object position flag is then the number of the first new object in the inserted new file body together with the number of hidden information segments.
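The two forms of the position flag carry the same information when the new objects are numbered consecutively, as they are when numbering starts at the largest existing object number plus 1. A small sketch, with the hypothetical helper name `expand_position_flag`:

```python
from typing import List


def expand_position_flag(first_object: int, segment_count: int) -> List[int]:
    """Expand the compact flag (first new object number, segment count)
    into the explicit list of new object numbers.

    Assumes the hidden segments occupy consecutive object numbers.
    """
    if segment_count < 0:
        raise ValueError("segment_count must be non-negative")
    return list(range(first_object, first_object + segment_count))
```

The compact form keeps the key short regardless of how many segments are embedded.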

Compared with the prior art, the beneficial effects of the technical solution of the present invention are:

The present invention uses the new file body added by an incremental update of the PDF file as the carrier for hidden information. The hidden information is written invisibly when the file is first created, has no effect on how the file is displayed, can travel over the Internet together with the document content, and has a sufficiently large embedding capacity; it is not destroyed by transmission or by common document editing operations. It is concealed from attackers and is difficult to find and destroy. As a method for authenticating PDF documents, the invention can invisibly embed authentication information such as the author, provenance, and copyright of a document in the PDF file, which makes it practical for copyright authentication and authenticity verification of PDF files.

Description of drawings

Figure 1 is a schematic diagram of the structure of an original PDF file;

Figure 2 is a structure diagram of a PDF file after an incremental update;

Figure 3 shows a concrete example of a PDF cross-reference table;

Figure 4 is a flowchart of hidden information embedding in the present invention;

Figure 5 is a flowchart of hidden information extraction in the present invention;

Figure 6 shows how the original PDF file is displayed in a specific embodiment of the present invention;

Figure 7 shows how the PDF file with embedded hidden information is displayed in a specific embodiment of the present invention;

Figure 8 shows the result of various annotation and markup operations on the PDF file with embedded hidden information in a specific embodiment of the present invention;

Figure 9 shows how a form-type PDF file with embedded hidden information is displayed in a specific embodiment of the present invention;

Figure 10 shows how the form-type PDF file with embedded hidden information is displayed after editing in a specific embodiment of the present invention.

Detailed description

The technical solution of the present invention is further described below with reference to the drawings and an embodiment.

Figures 4 and 5 are flowcharts of the PDF file information embedding and extraction method based on the PDF file body according to the present invention. The specific steps of the method are as follows:

(S1) As shown in Figure 4, embed the hidden information in the original PDF file. The specific steps are:

(S11) Read in the original PDF file stream;

(S12) Divide the hidden information that has been read in into segments of fixed length, then scramble each hidden information segment using a chaotic map; record the map parameters as the scrambling parameters, and record the number of hidden information segments;

(S13) Search the original PDF file stream and determine the largest object number in it, in order to determine the object numbers of the objects added by the incremental update;

(S14) Take the largest object number plus 1 as the number of the first new object inserted in the new file body, encode each hidden information segment and write it to the original PDF file in turn as a new object of the new file body, and generate the new-object position flag; the new-object position flag is either the new object numbers corresponding to all hidden information segments, or the number of the first new object in the inserted new file body together with the number of hidden information segments;

(S15) After the hidden information has been embedded, write the new cross-reference table and new trailer corresponding to the new file body to complete one incremental update; at this point the PDF file carrying the hidden information is complete, with the hidden information embedded in the new file body;

(S16) Output the PDF file carrying the hidden information, and output the scrambling parameters and the new-object position flag as the key.
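Steps S12 to S15 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the segments are assumed to be scrambled already, the file is assumed to use a classic cross-reference table, object numbers are found by a plain regular-expression scan (which can be fooled by binary stream data or by objects stored in object streams), and all function names are hypothetical.

```python
import re
import zlib
from typing import List, Tuple


def split_segments(message: bytes, size: int) -> List[bytes]:
    """S12 (segmentation only): cut the message into fixed-length segments."""
    return [message[i:i + size] for i in range(0, len(message), size)]


def max_object_number(pdf: bytes) -> int:
    """S13: largest N among 'N G obj' headers found by a naive scan."""
    numbers = [int(m.group(1)) for m in re.finditer(rb"(\d+)\s+\d+\s+obj\b", pdf)]
    if not numbers:
        raise ValueError("no indirect objects found")
    return max(numbers)


def append_hidden_objects(pdf: bytes, segments: List[bytes]) -> Tuple[bytes, int]:
    """S14 and S15: append one stream object per segment, then a new
    cross-reference section and trailer. Returns (new file, first new number)."""
    first_new = max_object_number(pdf) + 1
    prev_xref = int(re.findall(rb"startxref\s+(\d+)", pdf)[-1])
    root = re.findall(rb"/Root\s+(\d+\s+\d+\s+R)", pdf)[-1]

    out = bytearray(pdf)
    if not out.endswith(b"\n"):
        out += b"\n"
    offsets = []
    for i, segment in enumerate(segments):
        payload = zlib.compress(segment)          # FlateDecode-compatible data
        offsets.append(len(out))                  # byte offset of this object
        out += b"%d 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % (
            first_new + i, len(payload))
        out += payload + b"\nendstream\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n%d %d\n" % (first_new, len(segments))
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset       # 20-byte in-use entries
    out += b"trailer\n<< /Size %d /Root %s /Prev %d >>\n" % (
        first_new + len(segments), root, prev_xref)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out), first_new
```

The new objects are not referenced from the page tree, which is why a viewer that follows the document structure from the root object has no reason to display them. The returned first object number, together with `len(segments)`, corresponds to the compact form of the new-object position flag.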

(S2) As shown in Figure 5, extract the hidden information from the PDF file carrying the hidden information. The specific steps are:

(S21) Read the PDF file stream carrying the hidden information, together with the key;

(S22) Using the new-object position flag in the key, search the data stream of the PDF file and identify the new objects written by the incremental update;

(S23) Extract the data stream inside each identified new object and decode it. In the prior art, if the content of a PDF object is encoded or compressed, the object carries an identifier naming the filter used for the encoding; most PDF objects use the FlateDecode compression filter. The present invention likewise uses this existing algorithm for encoding by default, so the hidden information can be decoded with the corresponding existing decoding algorithm when it is extracted. For the encrypted hidden information, a corresponding key, namely the scrambling parameters, is needed for extraction and recovery; this key is recorded when the hidden information is embedded and is used to extract it.

(S24) Using the scrambling parameters in the key, descramble the decoded data streams of the new objects;

(S25) Combine the descrambled data streams in order and output them to obtain the hidden information.
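Steps S22 to S25 can be sketched in Python as follows. The sketch assumes the compact form of the position flag (first new object number and segment count), a direct integer /Length entry, FlateDecode-compressed streams with generation number 0, and a caller-supplied `unscramble` function standing in for the inverse chaotic scrambling. All names are hypothetical.

```python
import re
import zlib
from typing import Callable, List


def extract_segments(pdf: bytes, first_object: int, count: int) -> List[bytes]:
    """S22 and S23: locate each hidden object by number and inflate its stream."""
    segments = []
    for number in range(first_object, first_object + count):
        # Use the last matching header so the most recent update wins.
        headers = list(re.finditer(rb"(?<!\d)%d\s+0\s+obj\b" % number, pdf))
        if not headers:
            raise ValueError("object %d not found" % number)
        start = headers[-1].end()
        keyword = pdf.index(b"stream", start)
        length_match = re.search(rb"/Length\s+(\d+)", pdf[start:keyword])
        if length_match is None:
            raise ValueError("object %d has no direct /Length" % number)
        length = int(length_match.group(1))
        data_start = keyword + len(b"stream")
        # Skip the end-of-line that follows the 'stream' keyword.
        if pdf[data_start:data_start + 2] == b"\r\n":
            data_start += 2
        elif pdf[data_start:data_start + 1] == b"\n":
            data_start += 1
        segments.append(zlib.decompress(pdf[data_start:data_start + length]))
    return segments


def recover_message(pdf: bytes, first_object: int, count: int,
                    unscramble: Callable[[bytes], bytes] = lambda seg: seg) -> bytes:
    """S24 and S25: descramble every segment and join them in order."""
    decoded = extract_segments(pdf, first_object, count)
    return b"".join(unscramble(segment) for segment in decoded)
```

Because the objects are addressed by number rather than by byte offset, the lookup still works when later incremental updates, such as annotations, append further sections after the hidden objects.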

Figure 6 shows how the original PDF file is displayed, and Figure 7 shows how the PDF file with hidden information embedded by the present invention is displayed. As the figures show, embedding the hidden information has no effect on the display of the PDF file; the hidden information embedded by the present invention is visually well concealed.

Figure 8 shows the result of various annotation and markup operations on the PDF file with embedded hidden information. The annotations and markup were made with Adobe Acrobat 9 Professional. In experiments, the hidden information was extracted from the edited PDF file using the present invention with 100% accuracy, which shows that the invention is robust to ordinary editing operations.

Figures 9 and 10 show a form-type PDF file with embedded hidden information before and after editing. Figure 9 shows the form-type PDF file, which carries embedded hidden information but still permits any editing operation; Figure 10 shows the file obtained after the content of the file in Figure 9 was edited and saved. In experiments, the hidden information was extracted from the edited form file using the present invention with 100% accuracy, so the invention is robust to this kind of editing as well. The present invention is therefore well suited to copyright authentication and authenticity verification of PDF files.