C# PDF Parsing (asp.net)

Published: 2019/10/22 16:21:19  Author: Admin  Views: 511

1. Introduction

This project reads and parses PDF files and displays their internal structure. The PDF file specification is available from Adobe. This project is based on “PDF Reference, Sixth Edition, Adobe Portable Document Format Version 1.7, November 2006”, an intimidating 1,310-page document. This article provides a concise overview of the specification. The associated project defines C# classes for reading and parsing a PDF file. To test these classes, the attached test program PdfFileAnalyzer reads a PDF file, analyzes it, and displays and saves the result. The program breaks the PDF file into individual page descriptions, fonts, images and other objects.

Version 2.0 supports encrypted files. The software is divided into a PDF reader library and a test/demo program.

2. Overview

The PDF file is structured to allow Adobe Acrobat to display and print each page on a variety of screens and printers. If you open the file with a binary editor you will see that most of the file is unreadable. The small sections that are readable look like:

1 0 obj
<</Lang(en-CA)/MarkInfo<</Marked true>>/Pages 2 0 R
/StructTreeRoot 10 0 R/Type/Catalog>>
endobj
2 0 obj
<</Count 1/Kids[4 0 R]/Type/Pages>>
endobj 
4 0 obj
<</Contents 5 0 R/Group <</CS/DeviceRGB /S/Transparency /Type/Group>>
/MediaBox[0 0 612 792] /Parent 2 0 R
/Resources <</Font <</F1 6 0 R /F2 8 0 R>>
/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]>>
/StructParents 0/Tabs/S/Type/Page>>
endobj
5 0 obj
<</Filter/FlateDecode/Length 2319>>
stream
. . .
endstream
endobj

The file is made of objects nested between “n 0 obj” and “endobj” keywords. The PDF term for these is indirect objects. The two numbers before “obj” are the object number and the generation number. In this example, as in most files, the generation number is zero. Items enclosed within double angle brackets <<>> are dictionaries. Items enclosed within square brackets [] are arrays. Items starting with a slash / are parameter names (e.g. /Pages).

In the example above, the first item “1 0 obj” is the document catalog, or root object. The catalog dictionary has an item “/Pages 2 0 R”. This is a reference to the object that defines the tree of pages. In this case, object number 2 has a reference to one page, “/Kids[4 0 R]”, so this is a one-page document. Object number 4 is the only page definition. The page size is 612 by 792 points, in other words 8.5” by 11” (1” is 72 points). The page uses two fonts, F1 and F2, defined in objects 6 and 8. The page contents are described in object number 5.

Object number 5 has a stream that describes the painting of the page. In the example, “. . .” is a placeholder for this description. If you look at the PDF file with a binary editor, the stream looks like a long block of unreadable random bytes, because you are looking at compressed data. The stream is compressed with the ZLib deflate method, as specified in the dictionary by “/Filter/FlateDecode”. The compressed stream is 2319 bytes long. If you decompress the stream, the first few items will look something like this:

q
37.08 56.424 537.84 679.18 re
W* n
/P <</MCID 0>> BDC 0.753 g
36.6 465.43 537.96 24.84 re
f*
EMC  /P <</MCID 1/Lang (x-none)>> BDC BT
/F1 18 Tf
1 0 0 1 39.6 718.8 Tm
0 g
0 G
[(GRA)29(NOTECH LI)-3(MIT)-4(ED)] TJ
ET

This is a small sample of page description language. In this example “re” stands for rectangle. The four numbers before it are position and size “X Y Width Height”.

This simplified example demonstrates the general idea behind PDF files. You start with a root object that points to a hierarchy of pages. Each page defines resources such as fonts, images and contents streams. Contents streams are made of the operators and arguments required to paint the pages. The PdfFileAnalyzer produces an object summary file that contains all the objects without their streams. Each stream is decoded and saved as a separate file. Page descriptions are saved as text files. Image streams are saved as .jpg or .bmp files. Font streams are saved as .ttf files. Other binary streams are saved as .bin files. Text streams are saved as .txt files. Page descriptions go through another parsing process that translates the cryptic one- or two-letter operator codes into pseudo C# source code.
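
As a rough illustration of the translation from operator codes to pseudo C#, the sketch below maps a few of the operators seen in the sample stream to descriptive call names. The class name, the method names in the table and the output layout are all hypothetical; the analyzer's real output may look different. The sketch also assumes one operator per input line, which real content streams do not guarantee.

```csharp
using System;
using System.Collections.Generic;

public static class OperatorNamesSketch
{
    // Hypothetical descriptive names for a handful of PDF operators.
    private static readonly Dictionary<string, string> Names = new Dictionary<string, string>
    {
        ["q"] = "SaveGraphicsState",
        ["re"] = "Rectangle",
        ["f*"] = "FillEvenOdd",
        ["g"] = "GrayLevelNonStroking",
        ["G"] = "GrayLevelStroking",
        ["BT"] = "BeginText",
        ["ET"] = "EndText",
        ["Tf"] = "SelectFontAndSize",
        ["Tm"] = "TextMatrix",
    };

    // Input: operands followed by exactly one operator, separated by spaces.
    public static string ToPseudoCall(string line)
    {
        string[] tokens = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (tokens.Length == 0) return string.Empty;

        // The operator is the last token; everything before it is an operand.
        string op = tokens[tokens.Length - 1];
        string name = Names.TryGetValue(op, out string found) ? found : "Unknown_" + op;
        string args = string.Join(", ", tokens, 0, tokens.Length - 1);
        return name + "(" + args + "); // " + op;
    }
}
```

For instance, passing the first rectangle line of the sample stream to ToPseudoCall yields a Rectangle(...) call with the four operands as arguments and the original operator kept as a trailing comment.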

The remainder of this article covers the PDF file structure and the parsing process in more detail. The following sections cover object definitions, file structure, file parsing, file reading, and using the PdfFileAnalyzer program.

3. Object Definitions

A PDF file is made of objects. Each PDF object has a corresponding class in the PdfFileAnalyzer project. All of these object classes are derived from the PdfBase class. The source code for the object class definitions is in BasicObjects.cs. The exact PDF object definitions are given in chapter 3 of Adobe's PDF specification.

3.1. Basic Objects

  • The Boolean object is implemented by the PdfBoolean class. The PDF definition of Boolean is the same as in C#.
  • The integer object is implemented by the PdfInt class. The PDF definition is the same as Int32 in C#.
  • The real number object is implemented by the PdfReal class. The PDF definition is the same as Single in C#.
  • The string object is implemented by the PdfStr class. The PDF definition is different from C#. A PDF string is made of bytes, not characters, and is enclosed in parentheses (). The PdfFileAnalyzer saves the PDF string in a C# string, including the parentheses. A PDF string is useful for ASCII encoding.
  • The hexadecimal string object is implemented by the PdfHex class. It is a string defined by two hex digits per byte and enclosed within angle brackets <>. The PdfFileAnalyzer saves the PDF hex string in a C# string, including the angle brackets. For PDF readers, the string and hex string objects serve the same purpose. The string (AB) is equivalent to <4142>. A PDF hex string is useful for any encoding.
  • The name object is implemented by the PdfName class. A name object is made of a forward slash followed by a sequence of characters, for example /Width. Name objects are used as parameter names. The PdfFileAnalyzer saves the name object in a C# string, including the leading /.
  • The null object is implemented by the PdfNull class. The PDF definition of null is basically the same as in C#.

3.2. Compound Objects

  • The array object is implemented by the PdfArray class. A PDF array is a collection of objects enclosed within square brackets []. The objects of one array can be a mix of any type except stream. The PdfFileAnalyzer saves the objects in a C# array of PdfBase. Since all objects are derived from PdfBase, there is no problem saving a mix of object types within this array. When an array object is converted to a string (ToString() method), the program adds leading and trailing square brackets. An array can be empty. Example of an array with six objects: [120 9.56 true null (string) <414243>].
  • The dictionary object is implemented by the PdfDict class. A PDF dictionary is a collection of key and value pairs enclosed within double angle brackets <<>>. The key is a name object and the value is any object except stream. The PdfFileAnalyzer saves one key value pair in the PdfPair class. The key is a C# string and the value is a PdfBase. The PdfDict class has an array of PdfPair objects. A dictionary is accessed by key, so pair ordering is not important. PdfFileAnalyzer sorts the pairs by key. Example of a dictionary with three pairs: <</CropBox [0 0 612 792] /Rotate 0 /Type /Page>>.
  • The stream object is implemented by PdfStream. Streams are used to hold page description language, images and fonts. A PDF stream is made of two parts: a dictionary and a stream of bytes. The dictionary defines the stream parameters. One of the stream dictionary entries is /Filter. The PDF specification defines 10 types of filters. PdfFileAnalyzer supports 4 of them, the only ones I found to be in general use. The compression filter FlateDecode is the filter most used by current PDF writers. FlateDecode is ZLib deflate decompression. The LZWDecode compression filter was used some years ago, and this program supports it in order to read older PDF files. The ASCII85Decode filter converts printable ASCII to binary. DCTDecode is JPEG image compression. The PdfFileAnalyzer implements decompression for the first three. The DCTDecode stream is saved as is with the file extension .jpg. It is an image file that can be viewed.
  • The object stream was introduced in PDF 1.5. It is a stream that contains multiple indirect objects (described below). The stream objects described above are compressed one stream at a time. An object stream compresses all of its contained objects together in one compressed section.
  • The cross-reference stream was introduced in PDF 1.5. It is a stream that contains the cross-reference table described later in the article.
  • The inline image object is implemented by PdfInlineImage. It is a stream within a stream. An inline image is part of the page description language. It is made of three operators: BI (begin image), ID (image data) and EI (end image). The area between BI and ID is an image dictionary and the area between ID and EI is the image data.
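
To make the FlateDecode step concrete, here is a minimal sketch of inflating a ZLib-wrapped stream in C#. It is not the analyzer's own decoder. It assumes .NET 6 or later, where System.IO.Compression.ZLibStream is available, and it ignores any /DecodeParms predictor that a real stream dictionary may specify. The class name is hypothetical.

```csharp
using System.IO;
using System.IO.Compression;

public static class FlateSketch
{
    // Inflates the raw bytes of a stream whose /Filter is /FlateDecode.
    // Assumes the data carries the usual ZLib header and checksum.
    public static byte[] Inflate(byte[] zlibData)
    {
        using var input = new MemoryStream(zlibData);
        using var zlib = new ZLibStream(input, CompressionMode.Decompress);
        using var output = new MemoryStream();
        zlib.CopyTo(output);
        return output.ToArray();
    }
}
```

On older frameworks a common workaround is to skip the two-byte ZLib header and feed the rest to DeflateStream; that variant is not shown here.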

3.3. Indirect Objects

  • The indirect object is implemented by PdfIndirectObject. It is the main building block of a PDF document. An indirect object is any object enclosed between “n 0 obj” and “endobj”. Other objects can refer to an indirect object by specifying “n 0 R”. The “n” is the object number. The “0” is the generation number. This program does not support generation numbers other than 0, although the PDF specification allows them. The idea behind multiple generations is to allow PDF modifications by keeping the original file and appending changes.
  • An object reference is a way of referring to an indirect object. For example, /Pages 2 0 R is a dictionary entry in the catalog object. It is a pointer to the /Pages object, which is indirect object number 2.
3.4. Operators and Keywords

  • Operators and keywords are not considered PDF objects. However, the PdfFileAnalyzer program has PdfOp and PdfKeyword classes that are derived from PdfBase. During the parsing process the parser creates a PdfOp or a PdfKeyword for each valid sequence of characters. Appendix A, Operator Summary, of Adobe's PDF file specification lists all 73 operators. Here are some examples of operators: BT (begin text object), G (set gray level for stroking operations), m (move to), re (rectangle) and Tc (set character spacing). Examples of keywords: stream, obj, endobj, xref.
4. File Structure

A PDF file is made of four parts: header, body, cross-reference and trailer signature.

  • Header: The header is the file signature. It must be %PDF-1.x where x is 0 to 7.
  • Body: The body area contains all the indirect objects.
  • Cross-reference: The cross-reference is a table of file position pointers to all indirect objects. There are two types of cross-reference tables. The original style is made of ASCII characters. The new style is a stream within an indirect object, with the information encoded as binary numbers. At the end of the cross-reference table there is a trailer dictionary. A file can have more than one cross-reference area.
  • Trailer signature: The trailer signature is made of the keyword “startxref”, the byte offset of the last cross-reference table, and the end signature %%EOF. Please note that the trailer dictionary is part of the cross-reference area.
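
The trailer signature can be located by scanning the end of the file. The following is a minimal sketch under stated assumptions: the whole file is already in memory, and “startxref” and %%EOF both fall within the last 1024 bytes (the window size is an arbitrary choice for this example, not a value taken from the project). The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class StartXrefSketch
{
    // Returns the byte offset written between "startxref" and "%%EOF", or -1.
    public static long FindStartXref(byte[] file)
    {
        // ASCII decoding maps every byte to exactly one char, so indexes line up.
        int tailLength = Math.Min(file.Length, 1024);
        string tail = Encoding.ASCII.GetString(file, file.Length - tailLength, tailLength);

        int eof = tail.LastIndexOf("%%EOF", StringComparison.Ordinal);
        if (eof < 0) return -1;

        // Search backward from the end signature for the keyword.
        int keyword = tail.LastIndexOf("startxref", eof, StringComparison.Ordinal);
        if (keyword < 0) return -1;

        int numberStart = keyword + "startxref".Length;
        string number = tail.Substring(numberStart, eof - numberStart).Trim();
        return long.TryParse(number, out long offset) ? offset : -1;
    }
}
```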

5. File Parsing

The PDF file is a sequence of bytes. Some of the bytes have special meaning.

White space is defined as: null, tab, line feed, form feed, carriage return and space.

Delimiters are defined as: (, ), <, >, [, ], {, }, /, %, and white space characters.

File parsing is done by the PdfParser class. To start the parsing process, the program sets the file position to the area to be parsed. ParseNextItem() is the method that extracts the next object.

The parser skips white space and comments. If the next byte is “(” the object is a string. If the next byte is “[” the object is an array. If the next two bytes are “<<” the object is a dictionary. If the next byte is “<” the object is a hex string. If the next byte is “/” the object is a name. If the next byte is none of the above, the parser accumulates the following bytes until a delimiter is found. The delimiter is not part of the current token. The token can be an integer, a real number, an operator or a keyword. In the case of an integer, the program searches further for an object reference “n 0 R” or an indirect object “n 0 obj”, where n is the integer. The value returned from ParseNextItem() is the appropriate object as per Section 3. Object Definitions. The object is returned as the PdfBase class.

In the case of an array or a dictionary, the program calls ParseNextItem() recursively to parse the internal objects of the array or dictionary.
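
The dispatch on the first byte can be sketched as follows. This is an illustration of the rules above, not the PdfParser source: it only reports what kind of item starts at a position and does not build any objects, follow “n 0 R” references, or recurse into arrays and dictionaries. All names are hypothetical.

```csharp
using System.Globalization;
using System.Text;

public static class TokenKindSketch
{
    // White space: null, tab, line feed, form feed, carriage return, space.
    public static bool IsWhiteSpace(byte b) =>
        b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;

    public static bool IsDelimiter(byte b) =>
        IsWhiteSpace(b) || "()<>[]{}/%".IndexOf((char)b) >= 0;

    // Reports the kind of item that starts at or after pos.
    public static string Classify(byte[] data, int pos)
    {
        // Skip white space and comments (a comment runs from % to end of line).
        while (pos < data.Length)
        {
            if (IsWhiteSpace(data[pos])) { pos++; continue; }
            if (data[pos] == '%')
            {
                while (pos < data.Length && data[pos] != 10 && data[pos] != 13) pos++;
                continue;
            }
            break;
        }
        if (pos >= data.Length) return "end";

        switch ((char)data[pos])
        {
            case '(': return "string";
            case '[': return "array";
            case '/': return "name";
            case '<':
                return pos + 1 < data.Length && data[pos + 1] == '<' ? "dictionary" : "hex string";
        }

        // Accumulate bytes up to, but not including, the next delimiter.
        int end = pos;
        while (end < data.Length && !IsDelimiter(data[end])) end++;
        if (end == pos) return "delimiter";

        string token = Encoding.ASCII.GetString(data, pos, end - pos);
        if (int.TryParse(token, NumberStyles.AllowLeadingSign, CultureInfo.InvariantCulture, out _))
            return "integer";
        if (double.TryParse(token, NumberStyles.AllowLeadingSign | NumberStyles.AllowDecimalPoint,
                CultureInfo.InvariantCulture, out _))
            return "real";
        return "operator or keyword";
    }
}
```

Parsing numbers with the invariant culture avoids trouble on systems where the decimal separator is a comma.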

6. File Reading

The PdfReader class is the main class of PDF file analysis. The entry method is OpenPdfFile(String FileName, string Password = null). The program opens the PDF file for binary reading (one byte at a time).

File analysis starts with checking the header signature %PDF-1.x, where x is 0 to 7, and the trailer end signature %%EOF. One would think that all PDF writers would put the header at position zero of the file and the trailer at the very end of the file. Unfortunately, this is not the case. The program has to search for these two signatures at the two ends of the file. If the header signature is not at position zero, all indirect object file position pointers have to be adjusted.
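
A minimal sketch of the header search and the resulting pointer adjustment is shown below. The 1024-byte search window and all names are assumptions made for this example; the project may search differently.

```csharp
using System;
using System.Text;

public static class HeaderSearchSketch
{
    // Returns the index of "%PDF-1." within the first 1024 bytes, or -1 if absent.
    public static int FindHeader(byte[] file)
    {
        byte[] signature = Encoding.ASCII.GetBytes("%PDF-1.");
        int limit = Math.Min(file.Length, 1024) - signature.Length;
        for (int start = 0; start <= limit; start++)
        {
            int i = 0;
            while (i < signature.Length && file[start + i] == signature[i]) i++;
            if (i == signature.Length) return start;
        }
        return -1;
    }

    // File positions stored in the cross-reference are relative to the header,
    // so they are shifted by the header index when it is not zero.
    public static long AdjustPosition(long storedPosition, int headerIndex) =>
        storedPosition + headerIndex;
}
```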

Just before the trailer signature there is a pointer to the start of the last cross-reference table.

The parser sets the file position to the cross-reference table. If the next object is the “xref” keyword, we have the original style cross-reference. Otherwise, it is the new stream-based cross-reference. The file can have more than one cross-reference table, and it can have both the new and old styles of tables. Each table is a list of object numbers and file position pointers to the starting points of indirect objects. For each active object the program creates a PdfIndirectObject and saves it in ObjectArray. The object is empty except for object number and position. For the original cross-reference table, the position is relative to the file. For the stream-type cross-reference, the position of an object stored inside an object stream is relative to that parent object stream.
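
For the original ASCII style, each subsection starts with a line giving the first object number and the entry count, followed by one entry per object made of a 10-digit position, a 5-digit generation number and the letter n (in use) or f (free). The sketch below reads such a table from text. It is an illustration only, with hypothetical names, and it mirrors the restriction described in this article by rejecting in-use entries whose generation number is not zero.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class XrefTableSketch
{
    // Maps object number to file position for every in-use entry.
    public static Dictionary<int, long> ParseXref(string text)
    {
        var positions = new Dictionary<int, long>();
        string[] lines = text.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        int nextObject = 0;

        foreach (string line in lines)
        {
            string[] parts = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length == 0 || parts[0] == "xref") continue;
            if (parts[0] == "trailer") break;

            if (parts.Length == 2)
            {
                // Subsection header: first object number and entry count.
                nextObject = int.Parse(parts[0], CultureInfo.InvariantCulture);
            }
            else if (parts.Length == 3)
            {
                if (parts[2] == "n")
                {
                    if (int.Parse(parts[1], CultureInfo.InvariantCulture) != 0)
                        throw new NotSupportedException("Generation numbers other than 0 are not supported.");
                    positions[nextObject] = long.Parse(parts[0], CultureInfo.InvariantCulture);
                }
                nextObject++;
            }
        }
        return positions;
    }
}
```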

During this process, if an indirect object has a generation number other than zero, program execution is aborted. PdfFileAnalyzer does not support multiple generations.

At the end of the cross-reference table we have a trailer dictionary. In order to include this dictionary in the analysis, we create a dummy indirect object with a negative object number and save the dictionary in it.

The program looks for four particular entries in the trailer dictionary. If /Encrypt is found, the file will be decrypted. Next, the program looks for /Root, which gives the object number of the catalog object. If an /XRefStm entry exists, we have both types of cross-reference. Finally, if /Prev exists, we have another cross-reference table to process.

After the cross-reference processing is done, we have an array of all indirect objects. The information available at this stage is the object number and position. Next, the program loops through the array and reads and parses each indirect object. This process sets the object value. If the object is a stream, only the dictionary part is parsed, because the stream length might not be known at this time. In addition to the object, the program sets the object type and subtype members for dictionary and stream objects if these two values are available.

Next, the program loops through all objects and processes the object streams. These are stream objects with object type equal to “/ObjStm”. The program reads the stream associated with each of these objects and breaks it into multiple indirect objects.

Next, the program searches all dictionary objects and stream dictionary objects for object references. The program is looking for key value pairs such as “/name n 0 R”. If such a pair is found, the program checks the type of the referenced object. If the object type was not set during the object parsing phase, the type is set to the /name value.

The next step is to read all streams that were not read before. The program reads each stream from the file, decodes it and saves it to an appropriate file. The PdfFileAnalyzer supports the following filters: /FlateDecode, /LZWDecode, /ASCII85Decode and /DCTDecode. Text files have the extension .txt, binary files .bin, image files .jpg or .bmp, font files .ttf and cross-reference files .xref. /FlateDecode is the ZLib deflate compression method.

The next step is to build the page contents. The program follows the page tree starting from the root. Page objects are not stream objects. In other words, page description commands are not available directly within the page object. Page object dictionaries have a /Contents key value pair. If this pair is missing, the page is blank. The value of the contents entry can be a single reference or an array of references. The program creates a dummy contents stream for the page from the one or more contents streams. The page contents dummy streams are saved in PageObj_xx.txt and in PageSource_xx.txt. The former file is the actual page description contents for the page. The latter file is the same information converted to pseudo C# source code. Section 2. Overview describes these two files.

The page contents stream is made of arguments and operators. For example, a rectangle is four real numbers followed by re. The inline image is the exception to this rule. It is described above in Section 3. Object Definitions.

Finally, the program produces the object summary file ObjectSummary.txt. The file shows the information for all indirect objects without the streams.

7. TestPdfFileAnalyzer Program

The PdfFileAnalyzer application was developed to test the PDF file parsing classes. If you want to test the executable program outside the development environment, create a PdfFileAnalyzer directory, copy the TestPdfFileAnalyzer.exe program and the PdfFileAnalyser.dll class library into this directory, and run the program. If you run the project from the Visual C# development environment, make sure you define a working directory in the Debug tab of the project properties. This program was developed using Microsoft Visual C# 2019.

Start the program. The available options are Open PDF File and Recent Files.

On first program execution you must run Setup and define the project directory. This directory will hold all the sub-directories that are created for each PDF file being analyzed.

The Open button displays a standard file selection dialog. Navigate to the PDF file you want to analyze.

The PdfFileAnalyzer screen will change to the object summary screen.

Each row represents an indirect PDF object. The columns are:

  • Object No. The indirect object number. In the case of a trailer dictionary, the object number is a dummy number. It is negative, but on the screen it shows as TRn.
  • Object. The type of object as per Section 3. Object Definitions.
  • Type. If the object is a dictionary or a stream, the type is the value of the /Type dictionary pair. If the object is not a dictionary, or the dictionary does not contain /Type, the displayed value comes from an indirect reference to this object.
  • Subtype. If the object is a dictionary or a stream and the dictionary contains a /Subtype entry, it is displayed in this column.
  • Parent Object No. If the indirect object is part of an object stream (see Section 3.2. Compound Objects), this column is the object number of the object stream.
  • Parent Index. If the indirect object is part of an object stream, this number is the index number within the parent object stream.
  • Object Position. For indirect objects that are not part of an object stream, this is the object position within the PDF file. For indirect objects that are part of an object stream, this is the position within the parent. The position is given in decimal and hexadecimal for programmers who would like to view the PDF file in a binary editor.
  • Stream Position and Stream Length. The position and length of the stream. The position is relative to the file or the parent in the same way as the object position above.
To view the ObjectSummary.txt file, press the Summary button. Below is an example of the start of this file.

PDF file name: interactiveform_DATA.pdf

Trailer Dictionary
------------------
<</DecodeParms<</Columns 5/Predictor 12>>/Filter/FlateDecode/ID[<f681c578264452c4ab65398fdc7c0daa><b4
25aedbd5c8c544a84d960c3f738458>]/Index[3 1 7 1 18 1 100 5 108 2 116 1 123 1 126 1 128 1 134 1 136 1 173
11]/Info 18 0 R/Length 71/Prev 116/Root 20 0 R/Size 184/Type/XRef/W[1 3 1]>>

Indirect Objects
----------------
Object number: 1
Object Value Type: Stream
File Position: 67126 Hex: 10636
Stream Position: 67201 Hex: 10681
Stream Length: 695 Hex: 2B7
Object Type: /ObjStm
<</Filter/FlateDecode/First 22/Length 695/N 4/Type/ObjStm>>

Object number: 2
Object Value Type: Stream
File Position: 67915 Hex: 1094B
Stream Position: 67990 Hex: 10996
Stream Length: 354 Hex: 162
Object Type: /ObjStm
<</Filter/FlateDecode/First 33/Length 354/N 5/Type/ObjStm>>

Object number: 3
Object Value Type: Stream
File Position: 91134 Hex: 163FE
Stream Position: 91193 Hex: 16439
Stream Length: 21616 Hex: 5470
Object Type: /Metadata
Object Subtype: /XML
<</Length 21616/Subtype/XML/Type/Metadata>>

To view the details of an indirect object, either select a row and press the View button or double-click a row. The object analysis screen will be displayed.

For all non-stream objects, the first three buttons are disabled. The only information available is the object itself. You can view it in text or hexadecimal format.

For stream objects, the name of the first button is the object type. The first two buttons, object type and Stream, allow you to toggle between viewing the object and the stream. The Hex and Text buttons allow you to view in binary or text format. If the stream is an image, the image is displayed rather than text. If the stream is a cross-reference stream, the text format shows four columns: (1) object number, (2) type (0-unused, 1-normal object, 2-stream object), (3) position for type 1 and parent for type 2, and (4) parent index number. If the stream is binary (e.g. a font), it can be viewed in hexadecimal only.
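
The columns of a cross-reference stream come from fixed-width binary fields whose sizes are given by the /W array of the stream dictionary, for example /W[1 3 1] in the trailer dictionary shown earlier. The sketch below splits already decoded data into rows of big-endian fields. It assumes the stream has been inflated and any predictor from /DecodeParms has been undone, and it does not apply the rule that a zero-width type field defaults to type 1. The names are hypothetical.

```csharp
using System;

public static class XrefStreamSketch
{
    // Splits decoded cross-reference stream data into rows of numeric fields.
    public static long[][] DecodeRows(byte[] data, int[] widths)
    {
        int rowLength = 0;
        foreach (int width in widths) rowLength += width;
        if (rowLength == 0) throw new ArgumentException("Field widths must not all be zero.");

        var rows = new long[data.Length / rowLength][];
        for (int r = 0; r < rows.Length; r++)
        {
            int pos = r * rowLength;
            var fields = new long[widths.Length];
            for (int f = 0; f < widths.Length; f++)
            {
                // Each field is an unsigned big-endian integer.
                long value = 0;
                for (int i = 0; i < widths[f]; i++) value = (value << 8) | data[pos++];
                fields[f] = value;
            }
            rows[r] = fields;
        }
        return rows;
    }
}
```

With widths 1, 3 and 1, the five bytes 01 00 35 F2 00 decode to type 1, position 13810 and generation 0, while 02 00 00 01 03 decodes to type 2, parent object 1 and parent index 3.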

A page object is treated as a stream object. The text displayed is the concatenation of all its contents objects. In addition, the Source button allows you to view the page description language as pseudo C# code.

Images (.jpg and .bmp) can be rotated and scaled.

Page indirect object example.

Object number: 22
Object Value Type: Dictionary
File Position: 13810 Hex: 35F2
Object Type: /Page
<</Annots 97 0 R/ArtBox[0 0 612 792]/BleedBox[0 0 612 792]/Contents 81 0 R/CropBox[0 0 612 792]/MediaBox
[0 0 612 792]/Parent 16 0 R/Resources<</ColorSpace<</CS0 137 0 R>>/ExtGState<</GS0 138 0 R>>/Font<</C0_0
143 0 R/T1_0 146 0 R/T1_1 149 0 R/T1_2 151 0 R>>/ProcSet[/PDF/Text]/Properties<</MC0<</Metadata 91 0 R>>>>/Shading
<</Sh0 153 0 R>>>>/Rotate 0/TrimBox[0 0 612 792]/Type/Page>>

8. History

  • 2012/08/25: Version 1.0. Original revision.
  • 2013/04/10: Version 1.1. Support for world regions that define comma as decimal separator.
  • 2014/03/10: Version 1.2. Fix problem related to PDF files with a cross-reference stream.
  • 2015/04/02: Version 1.3. Remove error messages related to unimplemented stream compression filters.
  • 2019/06/14: Version 2.0. The software is divided into two projects, a library and a test program. Encrypted files are supported.
  • 2019/06/19: Version 2.1. Minor changes to the software.
License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).
