Understanding the Portable Document Format (PDF)

Preface:
I wish to acknowledge that this article was written with full reference to http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf. Most of what I have learned about PDFs comes from the above reference. If you are really interested, take the time to read it. Surprisingly, it is easy and interesting to read! I am writing this tutorial out of my interest in knowing the PDF specification. My quest started when I tried hard but failed to extract text from a simple PDF file that contained a single page of text. Please let me know (steve@printmyfolders.com) if you find any errors. I have relied on the PDF specification (link on page top) to create this tutorial. This tutorial covers PDF files conforming to the ISO 32000-1 specification (pages vi to viii in PDF32000_2008.pdf give more information on this).
 
 
PDF files are interesting. If you were to open a PDF file in a text editor like Notepad, it may look like junk and probably not very interesting. But it will make sense once you understand that PDF files follow a pattern, or a set of rules.
 
"At the core of PDF is an advanced imaging model derived from the PostScript® page description language. This PDF Imaging Model enables the description of text and graphics in a device-independent and resolution-independent manner. To improve performance for interactive viewing, PDF defines a more structured format than that used by most PostScript language programs. Unlike Postscript, which is a programming language, PDF is based on a structured binary file format that is optimized for high performance in interactive viewing. PDF also includes objects, such as annotations and hypertext links, that are not part of the page content itself but are useful for interactive viewing and document interchange." (Quoted from Page vii of PDF32000_2008.pdf).
 
The basic building blocks of a PDF file are objects. There are eight types of objects that are commonly used in PDF files. Before we look at them we will briefly look at the character set of PDF. There are 3 types of characters: regular, delimiter and white-space characters.
 
White-space characters: Null, Horizontal Tab, Line Feed, Form feed, Carriage return and space. White-space characters separate names and other objects from each other. Interestingly, PDF treats all white-space characters outside a comment, string or stream the same. Outside a comment, string or stream PDF considers any sequence of consecutive white-space characters as one character. What this means is that you may have 5 spaces but in reality it is considered as one. Note that this does not apply to white-space characters within strings, streams and comments. The Carriage return and Line feed are considered as end-of-line markers (EOL). Carriage return followed immediately by a Line feed is considered as one EOL marker.
 
Delimiter characters: (, ), <, >, [, ], {, }, / and % (4 pairs and 2 unique). These are used in the objects which we will look at later. They delimit (mark the boundary of) entities.
 
Regular character: All characters other than White-space and Delimiter characters including those that are not part of the standard ASCII character set.
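
The three character classes above can be sketched as a small lookup. This is only an illustration: the function name classify is an assumption of this sketch, and the byte values are simply the characters listed above.

```python
# Byte values taken from the character classes listed above.
WHITESPACE = {0x00, 0x09, 0x0A, 0x0C, 0x0D, 0x20}   # NUL, HT, LF, FF, CR, space
DELIMITERS = set(b"()<>[]{}/%")

def classify(byte: int) -> str:
    """Return 'whitespace', 'delimiter' or 'regular' for one byte value (0-255)."""
    if byte in WHITESPACE:
        return "whitespace"
    if byte in DELIMITERS:
        return "delimiter"
    return "regular"       # everything else, including bytes outside ASCII

print([classify(b) for b in b"/A (x)"])
```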

An interesting fact to note is that a PDF can consist entirely of ASCII characters, or of ASCII characters mixed with binary data (bytes outside the ASCII range). Most PDF files that are encrypted or contain images will have binary data (images are represented in binary). PDF files that contain binary data get corrupted when edited, or even opened and saved, in normal text editors like Notepad.
 

You may also wonder why you don't see any text or its equivalent when opening a PDF file in a text editor or even a binary editor. There may be two reasons. The first and most common reason is that the content stream (where the text is stored) is encoded (transformed) to conserve space. This is what happens with most files. The second reason could be that the PDF file is encrypted on purpose to keep the text secure.
 
Objects:
 
1. Boolean values: The keyword true or false

2. Numeric values: There are two types: integer and real. Integers are numbers without a decimal point and can have a + or - sign preceding them. For instance, the number 10 is an integer. Real numbers must have a decimal point, for example 10.5, 0.0, +5.0 and -1.0. Real numbers cannot be expressed in exponential format.

3. Strings: Strings contain characters (can be zero characters as well). They can be literal characters within parentheses or hexadecimal data within angle brackets. Notice that the parentheses and angle brackets are delimiter characters that we learnt about earlier.

(I love Java (and PDF))

There are escape sequences that can be used within literal strings. Refer to the PDF specification for more details. The sequence \ddd, where ddd is an octal character code of one to three digits, can be used to represent characters outside the ASCII character set.

Here is an example of a String represented with hexadecimal numbers. 

<48454C4C4F>

Each pair of digits is taken as one byte value. In the above example, the hexadecimal value 48 is decimal 72, which is the ASCII code for H. Likewise 45 is E, 4C is L and 4F is O. The above string is the same as the string (HELLO). If the final hexadecimal digit is on its own (without another digit to make a pair), a zero is attached to the end.

<4845C> will be considered as <4845C0>
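
The pairing rule, including the padding of an odd final digit, can be sketched in a few lines. The function name decode_hex_string is an assumption made for this illustration.

```python
def decode_hex_string(token: str) -> bytes:
    """Decode a PDF hexadecimal string such as '<48454C4C4F>' into bytes."""
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("hex strings are written between < and >")
    # White space between the digits is ignored.
    digits = "".join(token[1:-1].split())
    if len(digits) % 2:          # odd number of digits: assume a trailing 0
        digits += "0"
    return bytes.fromhex(digits)

print(decode_hex_string("<48454C4C4F>"))
```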

4. Names: Names consist of a sequence of characters (except null). A forward slash / must be used to introduce the name. In case hash (#) is part of the name, use # followed by its hexadecimal code 23. To represent characters using their hexadecimal value use # followed by the hexadecimal value. All characters that are not regular characters have to be represented by the # followed by their hexadecimal value. Please refer to the PDF specification for more details.

/MyName represents MyName

/My#20Name represents My Name

/All#20#23Numbers represents All #Numbers
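
As a rough sketch of the # rule, the following turns a name token back into the name it represents. The function name decode_name is hypothetical and the sketch does no validation beyond checking the leading slash.

```python
import re

def decode_name(token: str) -> str:
    """Turn a name token such as '/My#20Name' into the name it represents."""
    if not token.startswith("/"):
        raise ValueError("a name token starts with /")
    # Replace every #xx (two hexadecimal digits) with the character it encodes.
    return re.sub(r"#([0-9A-Fa-f]{2})",
                  lambda m: chr(int(m.group(1), 16)),
                  token[1:])

for token in ("/MyName", "/My#20Name", "/All#20#23Numbers"):
    print(token, "represents", decode_name(token))
```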

5. Arrays: These are similar to the arrays found in computer languages but these arrays are different in that they can contain different object types (including strings, names and even other arrays). Arrays are represented within square brackets. 
 
[false 170 85.5 (Hello) /My#20Name]
 
Arrays are single dimensional but can include other arrays which can hold arrays themselves.
 
6. Dictionaries:
These are similar to an actual dictionary, where a description follows a word. The description (the value) here can be any object (including another dictionary), but the key is always a Name object of the kind we just discussed. Each key is unique (there cannot be two identical Names in the same dictionary). A dictionary is represented within << >>.

<< /Name (Steve)

/Age 99

/Address << /Number 1234

                /Street (Java Street)

                /Suburb (Java Town)

                /PostCode (4321)

                >>

>>  

7. Streams: Streams represent large data that cannot fit into a String. An image, for instance, can be represented as a stream. As you will see later, the contents of each page in PDF is represented as a 'Contents' stream. It consists of a dictionary followed by the keyword ‘stream’, newline, the stream’s data and the keyword ‘endstream’.
 
 
dictionary

stream

……….

endstream

The dictionary must have a ‘Length’ entry giving the length (number of bytes) of the stream's data. An error occurs if the stream has more bytes of data than the length mentioned in the dictionary. As of PDF 1.2, the stream’s data can be in an external file and the dictionary will hold the details of the external file. In such a case, the data between the keywords ‘stream’ and ’endstream’ is ignored (by the reader). There are other optional entries for the stream dictionary, but discussing these is beyond the scope of this tutorial.
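
To make the role of the Length entry concrete, here is a minimal sketch that wraps some bytes in a stream. The function name make_stream is an assumption of this sketch, and the dictionary holds only the Length entry.

```python
def make_stream(data: bytes) -> bytes:
    """Wrap raw bytes in a stream whose dictionary carries the matching /Length."""
    # Length counts only the data bytes, not the keywords or the line breaks
    # that follow 'stream' and precede 'endstream'.
    header = b"<< /Length %d >>\nstream\n" % len(data)
    return header + data + b"\nendstream"

print(make_stream(b"Hello").decode("ascii"))
```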

8. Null object - Refers to a non-existent object and is denoted by the keyword null.

Having looked at the basic objects, let's go further and look at other components of a PDF file.
 
 
Comments: A comment starts with the percentage sign (%) and runs to the end of the line. A comment is commonly used to state the version of the PDF specification used in creating the PDF file.
%PDF-1.7
 
Indirect Objects
An object (for example, a string) that has been given a unique object identifier, which other objects can use to refer to it, is called an indirect object. This is different from the 'Name' object that we looked at earlier. Looking at any PDF file in its raw form (for instance, by opening it in a text editor) you will notice lots of indirect objects.
 

The identifier has two parts. The first part is the object number, which can be any positive integer. The second part is a generation number. In a newly created PDF file every indirect object has generation number 0. A higher generation number appears only when an object number is reused after the original object was deleted. You declare an indirect object within the keywords ‘obj’ and ’endobj’. Be aware that the combination of the object number and generation number has to be unique.

1 0 obj

(my biography)

endobj

The object number is 1 and the generation number is 0. The indirect object is the string object (my biography).  If another object wanted to refer to this object, it would quote the object number, followed by space, generation number, followed by space and then R.

1 0 R

It is not an error to refer to a non-existent indirect object. Such a reference is treated as a reference to the null object.
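
As a rough way to spot indirect objects and references in raw PDF text, the patterns "N G obj" and "N G R" can be matched with regular expressions. This is a naive sketch: it would also match text that merely looks like these patterns inside strings or streams, and the function names are assumptions.

```python
import re

OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")
REFERENCE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")

def find_definitions(data: bytes):
    """Return (object number, generation number) for every 'N G obj' found."""
    return [(int(n), int(g)) for n, g in OBJ_HEADER.findall(data)]

def find_references(data: bytes):
    """Return (object number, generation number) for every 'N G R' found."""
    return [(int(n), int(g)) for n, g in REFERENCE.findall(data)]

sample = b"1 0 obj\n(my biography)\nendobj\n2 0 obj\n<< /Bio 1 0 R >>\nendobj\n"
print(find_definitions(sample), find_references(sample))
```

Here object 2 and its /Bio key are made up for the example; only object 1 comes from the text above.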

File Structure: A PDF file has the following structure
 

1.       A Header (not more than a line) showing the PDF specification this file follows.

2.       A File Body that has the objects of the file

3.       A Cross-reference table that has info about indirect objects

4.       A Trailer that shows the location of the cross-reference table and special objects within the body of the file.

File Header: This denotes the PDF specification version of the PDF file. %PDF- followed by version number 1.N, where N is a digit between 0 & 7.

%PDF-1.2

As mentioned earlier this is actually a comment that is used to specify the PDF version.

 

Beginning with PDF 1.4, a Version entry in the document's catalog dictionary, if present, can be used in place of the version given in the header. If the file has binary data, the header line is immediately followed by a comment line containing at least four binary characters (bytes with values of 128 or greater). This is to show applications that handle the file that the PDF has binary data. Again, when opening some PDF files in their raw form (as in a text editor) you may notice the four binary characters just after the first comment.
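
A minimal sketch of reading the header: it extracts the version from the first line and checks whether the second line is a comment with at least four bytes of value 128 or more. The function name read_header and the returned tuple are assumptions of this sketch.

```python
import re

def read_header(data: bytes):
    """Return (version, looks_binary) from the first two lines of a PDF file."""
    lines = data.splitlines()
    match = re.match(rb"%PDF-(\d\.\d)", lines[0]) if lines else None
    if match is None:
        raise ValueError("no %PDF- header found")
    version = match.group(1).decode("ascii")
    # A second-line comment with four or more bytes >= 128 marks binary content.
    second = lines[1] if len(lines) > 1 else b""
    looks_binary = second.startswith(b"%") and sum(b >= 128 for b in second) >= 4
    return version, looks_binary

print(read_header(b"%PDF-1.2\n1 0 obj\n"))
```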

File Body: The File body consists of indirect objects (discussed earlier). These objects represent text and other details (like font type etc.) used in displaying PDF.  As of version 1.5, the body can also contain object streams. 

Cross-Reference Table: This table is similar to a directory. It contains the location of each object within the PDF file. By looking at the entries in this table, the PDF reading application (for example, Adobe Reader), can easily locate an object within the file. This saves time as the object is accessed in a random manner (rather than reading every line of the file). The cross-reference table can have one or more sections. Each of these sections can have one or more subsections. 

Each section begins with the word 'xref' on a line of its own. Following this line are two numbers separated by a single space. The first number is the object number of the first of the series of objects listed below it. The second number is the number of entries in that subsection. For a PDF file that has been created for the first time, or a PDF file that has not been incrementally updated, there shall be only one subsection and the object numbering starts with 0. Notice that the object numbers within a subsection have to be consecutive. In the example below, it is safe to assume that entries for objects 1, 2, 3 & 4 will follow the entry for object 0.

xref

0 5  

Following this are the entries for each object. Each entry shall be exactly 20 bytes long. The entries are of the format

nnnnnnnnnn ggggg meol

nnnnnnnnnn - This is a ten digit value. This reveals how far the object is from the start of the file. For instance, the value 100 denotes that the object is 100 bytes from the start of the file. 

ggggg - 5-digit generation number

m - can be either 'n' or 'f'. 'n' denotes that the object is still in use and 'f' denotes that the object has been deleted and is free. 

eol - end of line. Consists of 2 characters.

The ten digits, followed by a space, followed by five digits, followed by a space, followed by a single character and the eol make exactly 20 bytes. If the first two numbers are not long enough to be ten and five digits respectively, zeroes are added to the front.
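
The fixed-width layout can be sketched with a format string. The function name xref_entry is an assumption, and a space followed by a line feed is used here as the two-character end of line (the specification also allows other two-character combinations).

```python
def xref_entry(offset: int, generation: int, in_use: bool = True) -> bytes:
    """Format one cross-reference entry; the result is always 20 bytes long."""
    marker = b"n" if in_use else b"f"
    # Zero-padded 10-digit and 5-digit fields, then the marker and a 2-byte eol.
    entry = b"%010d %05d %s \n" % (offset, generation, marker)
    assert len(entry) == 20
    return entry

print(xref_entry(100, 0))
print(xref_entry(0, 65535, in_use=False))
```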

Let's come back to the 0 5 that we saw in the example earlier. The 0 denotes the object number of the first object in this subsection. The value 5 denotes that there are 5 entries (including the one for object 0) and that the remaining four entries are for objects with object numbers 1, 2, 3 and 4. The first entry in the cross-reference table is for object 0. Object 0 will have 0000000000 as its first ten digits (if there are no other free objects) and will always have 65535 as its 5-digit generation number. It shall also have 'f' as the character.

xref

0 2
0000000000 65535 f
...........
If there are objects that have been deleted and are free, then the ten-digit number of a free entry holds the object number of the next free object. To make it easy to understand let's look at an example.
xref
0 4
0000000003 65535 f
0000000015 00000 n
0000000075 00000 n
0000000000 00005 f
 
 
The 0 4 denotes that there are four entries: the entry for object 0 followed by entries for objects 1, 2 & 3. The first ten digits (0000000003) of the entry for object 0 point to the first free object, which is object 3. If there had been another free object, the entry for object 3 would hold the object number of that next free object. In this case, as there are no other free objects, it points back to object 0. Objects 1 & 2 are 15 & 75 bytes (respectively) away from the start of the file. This lets the PDF application know that object 3 is free and that its object number can therefore be reused for other data.
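
The chain of free objects described above can be followed with a short loop. The tuple layout (first field, generation, marker) and the function name free_objects are assumptions of this sketch; the values are those of the example table.

```python
def free_objects(entries):
    """Follow the free list from object 0 and return the free object numbers.

    `entries` is a list of (first_field, generation, marker) tuples indexed
    by object number, a hypothetical in-memory form of the table above.
    """
    found = []
    current = entries[0][0]          # object 0 points at the first free object
    while current != 0 and current not in found:
        found.append(current)
        current = entries[current][0]   # a free entry names the next free object
    return found

table = [(3, 65535, "f"), (15, 0, "n"), (75, 0, "n"), (0, 5, "f")]
print(free_objects(table))
```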
 
Let's look at another example
  
xref
0 4
0000000003 65535 f
0000000015 00000 n
0000000075 00000 n
0000000000 00005 f
9 2
0000000099 00000 n
0000000150 00000 n
 
In the above case, in addition to objects 0, 1, 2, 3 there are two other objects 9 & 10.
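
Reading a section with several subsections can be sketched as follows. The function name parse_xref and the returned dictionary layout are assumptions; the sketch takes the section as text and does not check the 20-byte entry length.

```python
def parse_xref(text: str):
    """Map object number -> (first_field, generation, marker) for one section."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or lines[0] != "xref":
        raise ValueError("section must start with the keyword xref")
    table, i = {}, 1
    while i < len(lines):
        first, count = map(int, lines[i].split())      # subsection header
        for offset in range(count):
            field, gen, marker = lines[i + 1 + offset].split()
            table[first + offset] = (int(field), int(gen), marker)
        i += 1 + count                                  # jump to next subsection
    return table

section = """xref
0 4
0000000003 65535 f
0000000015 00000 n
0000000075 00000 n
0000000000 00005 f
9 2
0000000099 00000 n
0000000150 00000 n
"""
print(sorted(parse_xref(section)))
```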
 
An object cannot be entered in more than one subsection within a section.
 
When an indirect object is deleted, its entry is marked as free (by changing the n to f) and added to the linked list of free objects. Its generation number is increased by 1. The object's generation number gets updated each time the object gets deleted and can go up to a maximum of 65,535. For instance, an indirect object that was referenced as 1 0 will become 1 1 when reused.
 
Trailer: The end of a PDF file is read first by the PDF reading application. The trailer holds information about the location and details of the cross-reference table. The trailer has three parts. The first part has the keyword trailer followed by a dictionary that holds values for certain fields.
The second part has the keyword startxref and, on the next line, a number. The number denotes how far (in bytes) the keyword xref (of the last section of the cross-reference table) is from the start of the file. The third part, on the very next line, is the marker %%EOF, which denotes the end of the file.
A random PDF taken from my computer has this trailer. Looking at this trailer I can assume that the xref of the last section of the cross-reference table is found 361441 bytes from the beginning of the file.
 
trailer
<< /Size 62 /Root 1 0 R /Info 2 0 R
/ID [(X$X@>66...)(X$X@>66...)]
>>
startxref
361441
%%EOF
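
Since a reader starts from the end of the file, finding the number after startxref is the first step. A minimal sketch, with the function name find_startxref as an assumption:

```python
def find_startxref(data: bytes) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    position = data.rfind(b"startxref")
    if position < 0:
        raise ValueError("no startxref keyword found")
    # The first whitespace-separated token after the keyword is the offset.
    return int(data[position + len(b"startxref"):].split()[0])

tail = b"trailer\n<< /Size 62 /Root 1 0 R /Info 2 0 R >>\nstartxref\n361441\n%%EOF\n"
print(find_startxref(tail))
```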
 
 
The following keys are mandatory for the trailer dictionary.
 
Size - Total number of entries in the cross reference table (combination of original & update sections). The value has to be an integer. In the example above there are 62 entries including an entry for object 0.
 
Root - An indirect reference to the PDF's catalog (which we will learn about later). In the example above, I can assume that indirect object 1 is the catalog.
 
Some keys are mandatory when certain capabilities are used. We may look at the Info & ID keys later. 
 
Incremental Updates: The team that developed the PDF specification was smart enough to include a special feature. When a PDF gets updated, the changes are added to the end of the file rather than updating the original content. This feature saves time (as the whole file need not be modified). However, this also makes me wonder what happens when many changes take place. Does the size of the PDF file increase massively?
 
When a PDF gets an incremental update, in addition to the data being added, a new cross-reference section is created. This new section contains entries for all the objects that were deleted, replaced or changed. As we learnt earlier, deleted objects have the letter 'f' at the end of the cross-reference entry. This means that if, say, object 5 existed before and was deleted during the update, the new cross-reference section will have an entry for object 5 with 'f' as its last character.
 
When the PDF file gets updated, along with a new cross-reference section a new trailer is added. This contains all the entries from the previous trailer but will have a different value for the Prev entry in the dictionary. The Prev entry will have the location of the previous cross-reference section.
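
One way to picture how a reader combines the sections is to look an object up in the newest section first and fall back to older ones. The dictionaries below are made-up data, and the function name locate is an assumption of this sketch.

```python
def locate(object_number, sections):
    """Look an object up across sections ordered from newest to oldest.

    Each section is a dict of object number -> (first_field, generation, marker),
    a hypothetical in-memory form of one cross-reference section.
    """
    for section in sections:
        if object_number in section:
            field, generation, marker = section[object_number]
            # The newest entry wins; a free entry means the object is deleted.
            return (field, generation) if marker == "n" else None
    return None

original = {3: (100, 0, "n"), 4: (300, 0, "n"), 5: (420, 0, "n")}
update = {4: (9000, 0, "n"), 5: (0, 1, "f")}      # 4 replaced, 5 deleted
for number in (3, 4, 5):
    print(number, locate(number, [update, original]))
```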
 
%%EOF will continue to be the last line for the new trailer as well. Hopefully we will discuss this in detail later.
 
PDF Document Structure:
The structure of a PDF file is like the different levels of hierarchy found in a typical company. Similar to the CEO, the Document Catalog dictionary sits at the top of the hierarchy.
 
As we saw earlier a PDF reading application will look at the trailer of the PDF first. The trailer will have a Root entry that has the location of the catalog. This is similar to a person (PDF reading application) approaching the CEO (Document Catalog) after finding her contact details via the contact section (Trailer) on the company's website (PDF file).
 
Document Catalog: The Document Catalog is a dictionary that refers to other objects that define the PDF file. Basically, the Document Catalog is the centre from which all information about the PDF file can be found. Being a dictionary, it consists of various keys. For the time being we will only look at the mandatory keys.
 
 
Type - will always be Catalog (type Name)
Pages - An indirect reference to the object that is the root of the page tree (will look at this later)
 
A PDF file that I created using a free PDF creating software has this Catalog Dictionary
 
 
1 0 obj
<</Type /Catalog /Pages 3 0 R
>>
endobj
 
You will notice that each of these dictionaries starts with a '/Type' entry that describes what type of dictionary it is. In this case, it is a 'Catalog' dictionary.
 
An application that reads the above Catalog dictionary will know that it needs to read the 'Pages' dictionary (indirect object 3) to get information about the pages in this PDF file.
 
 
Page Tree:  Page Tree is the name of the structure used to describe the pages in a PDF file. It has two types of nodes: page tree nodes and page objects. Each page in a PDF file is represented as a Page object. Each of these objects is called a 'leaf' node in the Page Tree.
 
Page Tree Nodes: The mandatory keys are
 
 
Type - will always be Pages for a Page Tree node
Parent - the page tree node which is this node's parent. Not allowed in root node.
Kids - an array referring to the children of this node. The children can only be page tree nodes or page objects
Count - the number of page objects that are descendants of this node
 
The PDF that I had created earlier has this page tree (remember that the Catalog Dictionary was pointing to indirect object 3).
 
 
3 0 obj
<< /Type /Pages /Kids [
4 0 R
] /Count 1
/Rotate 0>>
endobj
 
This Page tree node has only one kid which is object 4. The Parent key is missing and therefore this is the root node.
 
As the /Count is 1, we can safely assume that there is only 1 page under this Page tree (which, based on the /Kids array, is indirect object 4).
 
As mentioned earlier, you will notice that this dictionary too has a '/Type' entry that reveals what type of dictionary it is.
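
The walk from the Catalog down to the pages can be sketched with plain Python dictionaries standing in for the PDF objects. The representation (references as plain integers) and the function name page_numbers are simplifications chosen for this sketch; object 4 is assumed to be a Page.

```python
# Objects keyed by object number; references are plain integers here,
# a simplification of "N 0 R" chosen for this sketch.
objects = {
    1: {"Type": "Catalog", "Pages": 3},
    3: {"Type": "Pages", "Kids": [4], "Count": 1, "Rotate": 0},
    4: {"Type": "Page", "Parent": 3},
}

def page_numbers(objects, node):
    """Return the object numbers of all Page leaves under `node`, in order."""
    entry = objects[node]
    if entry["Type"] == "Page":
        return [node]
    pages = []
    for kid in entry["Kids"]:          # kids are page tree nodes or pages
        pages.extend(page_numbers(objects, kid))
    return pages

root = objects[1]["Pages"]
print(page_numbers(objects, root))
```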
 
Page Objects: This is a dictionary that describes the characteristics of the page itself. Some of the keys are
 
 
Note: Most of the keys are new to me. I have purposefully left out keys that make no sense to me at this moment. As I learn more about the PDF specification I will hopefully cover them in detail.
 
 
Type - Will always be Page
Parent - An indirect reference to the parent of this page
LastModified - Date and time when this page was last modified
Resources - The resources required by this page. This usually refers to the font used on this page and other info.
MediaBox - A rectangle that defines the boundary inside which the page has to be displayed.
Contents - A content stream that describes the contents of this page.
Rotate - In multiples of 90. Rotates the page by the number of degrees before displaying.
Thumb - A stream object that gives the thumbnail image for this page.
Dur - the number of seconds the page will be displayed in presentations before automatically moving on to the next page.
Trans - A dictionary advising what transition to use when displaying the page during presentation.
Annots - This is an array of dictionaries containing references to all the annotations for this page
AA - This is the short form for additional-actions. This dictionary defines the actions that need to be taken when the page is opened or closed.
Metadata - A stream that contains metadata for this page
 
 
Here is a grab from a sample PDF that I created using a free PDF creating software.
 
 
4 0 obj
<</Type/Page/MediaBox [0 0 595 842]
/Rotate 0/Parent 3 0 R
/Resources<</ProcSet[/PDF /Text]
/ExtGState 10 0 R
/Font 11 0 R
>>
/Contents 5 0 R
>>
endobj
 
3 0 obj
<< /Type /Pages /Kids [
4 0 R
] /Count 1
/Rotate 0>>
endobj
 
1 0 obj
<</Type /Catalog /Pages 3 0 R
>>
endobj
 
As you can see, Object 1 is the catalog that directs the PDF reading application to the root of the page tree (Object 3). Object 3, the root node, has only one kid (Object 4) and obviously cannot have a parent. Object 4 is 'displayed' within a rectangle (0 0 595 842), is not rotated (Rotate 0) and has Object 3 as its parent. Its 'resources' as well as its contents (Object 5) are included. Here is Object 5 from my file.
 
As we had discussed earlier, the stream in this object starts with a dictionary that shows the length of the stream (which is stored in Object 6). We will discuss more about Content streams further down.
 
 
5 0 obj
<</Length 6 0 R/Filter /FlateDecode>>
stream
xMK.1.鶵u*j.czi2& 7KnSK..Z?]."6.3w>^&s@MQ...K.d\>}q...    ).|.ѣ'o1lA.ۥ
S-lE,.C.W.&#xf01d2;YGKM\vjEG.'|F[j.:..2f2Ź^.uujNWnjY::si/.L9,ČGPY1k/.%'f!endstream
endobj
 
 
Page attributes are inherited: Here is an interesting fact. Certain attributes of a page can be inherited from its parent or any of its ancestors in the page tree. This eliminates the need to keep repeating similar attributes for every child, grandchild etc. If an ancestor defines a value for an attribute, that value can be overridden by the child.
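
Inheritance can be pictured as climbing the Parent links until the attribute is found. As before, the dictionaries are a simplified stand-in for PDF objects; object 7 and the function name inherited are made up for this sketch.

```python
def inherited(objects, node, key):
    """Return `key` from the node itself or from its nearest ancestor."""
    while node is not None:
        entry = objects[node]
        if key in entry:
            return entry[key]
        node = entry.get("Parent")   # the root node has no Parent
    return None

objects = {
    3: {"Type": "Pages", "Kids": [4, 7], "Count": 2, "Rotate": 0},
    4: {"Type": "Page", "Parent": 3},                 # inherits Rotate 0
    7: {"Type": "Page", "Parent": 3, "Rotate": 90},   # overrides it
}
print(inherited(objects, 4, "Rotate"), inherited(objects, 7, "Rotate"))
```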
 
Name Dictionary: Rather than referring to objects by their references, some objects can be referred to by their names. The link between the names and their references is stored in the PDF file's name dictionary. One of the optional keys in the Catalog, Names, is used to specify the Name Dictionary. Please refer to the PDF specification for more details.
 
Content Streams: This is a stream (an object in PDF, if you remember) that has instructions on how to display text & graphics on the corresponding page. In Object 5, of my PDF file, mentioned earlier and repeated below we can see a stream. This stream gives the PDF reader (for instance Adobe Reader) the instructions on how to display.

5 0 obj
<</Length 6 0 R/Filter /FlateDecode>>
stream
xMK.1.鶵u*j.czi2& 7KnSK..Z?]."6.3w>^&s@MQ...K.d\>}q...    ).|.ѣ'o1lA.ۥ
S-lE,.C.W.&#xf01d2;YGKM\vjEG.'|F[j.:..2f2Ź^.uujNWnjY::si/.L9,ČGPY1k/.%'f!endstream
endobj
 
The data in the stream makes no sense because it has been encoded (changed in form) and must first be decoded using the filter named in the dictionary, here FlateDecode. There are various filters that can be used. Filters are used to compress data and save space. In the following sections we will look in more detail at the structure of this data and understand how it forms instructions for the PDF reading application to display the page.
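
FlateDecode data can be handled with Python's standard zlib module. The sketch below compresses some made-up page instructions and restores them again; the instruction text is only a placeholder and is not taken from my file.

```python
import zlib

# Hypothetical page instructions, used only to have something to compress.
plain = b"BT /F1 24 Tf 72 720 Td (Hello) Tj ET"

encoded = zlib.compress(plain)        # what a PDF writer would store
decoded = zlib.decompress(encoded)    # what a reader does for /FlateDecode

# zlib data normally begins with the byte 0x78, the letter 'x'.
print(encoded[:1] == b"x", decoded == plain)
```

The leading 'x' is consistent with the first character of the stream data shown in Object 5.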
 
Note that unlike other objects in a PDF file, the instructions in the content stream are read and followed sequentially (one after the other).
 
Before proceeding further we will try to create a simple PDF file from what we have learnt so far.
 
Sample PDF file: Here is a sample PDF file that I created with help from the specification. You can copy this file from here and save it in a text editor like Notepad. Save it with any filename but with the file extension "pdf". In Notepad, you will have to save it as "filename.pdf" (quotes included). You can then view it with a PDF reader (for instance, Acrobat Reader).
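
As a rough companion to writing a file by hand, the sketch below generates a one-page PDF and works out the cross-reference offsets itself. It is not the sample file referred to above: the object numbering, the Helvetica font, the text-showing instructions and the function name build_pdf are all assumptions made for illustration, and the text must not contain parentheses or backslashes.

```python
def build_pdf(text: str) -> bytes:
    """Assemble a one-page PDF showing `text`, with a matching xref table."""
    content = b"BT /F1 24 Tf 72 720 Td (%s) Tj ET" % text.encode("ascii")
    bodies = [
        b"<< /Type /Catalog /Pages 2 0 R >>",                      # object 1
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",              # object 2
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] "
        b"/Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",  # object 5
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(bodies, start=1):
        offsets.append(len(out))                 # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_at = len(out)                           # where the keyword xref starts
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"              # entry for object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_at
    return bytes(out)

pdf = build_pdf("Hello")
print(len(pdf), "bytes")
```

Writing the result to disk in binary mode, for example with open("hello.pdf", "wb"), gives a file that can be opened in a PDF reader.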

 
 
Feedback and suggestions welcome at steve@printmyfolders.com
 

 
