
1.1.2 Syntax

Felix Schütt edited this page Jul 6, 2017 · 11 revisions

As with any file format, the data inside a PDF follows a specific syntax: a set of well-defined rules for how the data is written into the file. Before we look closer at the contents of a PDF file, we have to learn these rules.

Everything in the PDF body is structured in so-called "objects". In the "Hello World" PDF, the document body starts after the %PDF-1.4 line and ends before the xref line (neither of these lines is part of the body). An "object" in the context of a PDF is simply a piece of information of some kind. It has nothing to do with objects in programming languages.

There are several types of objects. Each type holds a different kind of information and is serialized differently. However, there is one rule that all objects have to adhere to: two objects have to be separated by one or more whitespace characters, unless the start of the next object is obvious from the context.

A general warning: Everything in PDF is case-sensitive. You always have to match the exact capitalization.

Whitespace

In PDF, whitespace is defined as the space character (ASCII 0x20), the tab character (ASCII 0x09) and the line break. A line break can be written as the UNIX "\n" (ASCII 0x0A), the classic Apple "\r" (ASCII 0x0D, used up to Mac OS 9) or both together (Windows "\r\n").

Theoretically you can also use the NUL character (ASCII 0x00) and the form feed (ASCII 0x0C). These characters are, however, rare in everyday PDFs.

All objects should be separated by whitespace. The type and amount of whitespace do not matter.
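To make the rule concrete, here is a minimal Python sketch that splits raw bytes on runs of the whitespace characters listed above. The helper name `split_on_whitespace` is made up for this example, and the function is deliberately naive: it knows nothing about strings or other objects that may legitimately contain spaces.

```python
from typing import List

# The whitespace bytes named above: NUL, tab, line feed, form feed,
# carriage return and space.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def split_on_whitespace(data: bytes) -> List[bytes]:
    """Split raw bytes on runs of PDF whitespace (illustrative helper only)."""
    tokens = []
    current = bytearray()
    for byte in data:
        if byte in PDF_WHITESPACE:
            # Any amount and mix of whitespace ends the current token.
            if current:
                tokens.append(bytes(current))
                current = bytearray()
        else:
            current.append(byte)
    if current:
        tokens.append(bytes(current))
    return tokens
```

Because the type and amount of whitespace are irrelevant, `b"1 2 3"` and `b"1 \t2\r\n\r\n3"` produce the same three tokens.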

Numbers

Numbers can be written as integers (referred to as "Integer") or as floating-point numbers (referred to as "Real"). Integers are simply written with the digits 0 - 9, negative numbers with a leading ASCII "-". Other characters are not allowed, in particular no whitespace, commas or other punctuation.

Correct:

1234 -1234

Wrong:

1'234 1234-

Floating-point numbers use a decimal point (ASCII ".") to separate the decimal places. A comma (",") or exponential notation is not allowed.

Correct:

123.4 0.1234

Wrong:

123,4 1.234e2
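The number rules can be checked with two small regular expressions. This is only a sketch of the rules as described in this section. The function name `classify_number` is an assumption of this example, and the "Real" pattern also accepts forms such as `.5` and `5.`, which the text above does not discuss.

```python
import re
from typing import Optional

# An optional minus sign followed by digits only.
INTEGER = re.compile(r"-?[0-9]+")
# Digits with exactly one decimal point; no comma, no exponent.
REAL = re.compile(r"-?(?:[0-9]+\.[0-9]*|\.[0-9]+)")


def classify_number(token: str) -> Optional[str]:
    """Return "Integer", "Real" or None for a single token."""
    if INTEGER.fullmatch(token):
        return "Integer"
    if REAL.fullmatch(token):
        return "Real"
    return None
```

With this sketch, the "Correct" examples above classify as "Integer" or "Real", and each of the "Wrong" examples yields `None`.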

Names

Names are short strings that are used as document-internal keys or values. They consist of an ASCII "/" followed by text made up of printable characters (ASCII 0x20 - 0x7E), except for whitespace and the characters #, (, ), <, >, [, ], {, }, / and %.

Example:

/Type /MediaBox

Sometimes you may need names that contain spaces or other problematic characters. These characters can be encoded with a pound sign ("#") followed by the two-digit hex code of the byte. The encoding should be ASCII or UTF-8 (programs that display names are usually prepared to handle both).

Louis Grand	/Louis#20Grand
Page#34	        /Page#2334
Höflinger	/H#C3#B6flinger

As a further restriction: the code for the NUL character (#00) is illegal, and the codes for whitespace (#09, #0A, #0C, #0D, #20) may not immediately follow the leading forward slash of a name (no /#20Name).
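The escaping rules for names can be sketched as a small encoder. The function name `encode_name` is hypothetical, and the sketch assumes UTF-8 as the text encoding. It escapes every byte that is not a plain printable character and rejects the two illegal cases mentioned above.

```python
# Characters that may not appear unescaped inside a name (see the list above).
NAME_DELIMITERS = b"#()<>[]{}/%"
# Whitespace bytes that may not directly follow the leading slash.
NAME_WHITESPACE = b"\t\n\x0c\r "


def encode_name(text: str) -> str:
    """Encode text as a PDF name, escaping problematic bytes as #XX."""
    raw = text.encode("utf-8")
    if raw[:1] and raw[0] in NAME_WHITESPACE:
        raise ValueError("a name may not start with whitespace")
    parts = ["/"]
    for byte in raw:
        if byte == 0:
            raise ValueError("the NUL character is illegal in names")
        if 0x21 <= byte <= 0x7E and byte not in NAME_DELIMITERS:
            parts.append(chr(byte))      # plain printable character
        else:
            parts.append("#%02X" % byte)  # two-digit hex escape
    return "".join(parts)
```

Applied to the three examples in the table above, this sketch returns `/Louis#20Grand`, `/Page#2334` and `/H#C3#B6flinger`.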

Strings

Strings are texts that are meant to be displayed in some way. They are written in parentheses (round braces). The text may itself contain opening and closing parentheses, as long as they are balanced. Otherwise they have to be escaped with a backslash. A backslash itself has to be written as \\ to display a \.

Examples:

(Hello World)                        % Hello World
(Text (with balanced) braces)        % Text (with balanced) braces
(Text (with \(unbalanced) braces)    % Text (with (unbalanced) braces
(Text \\with \\backslashes)          % Text \with \backslashes
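A writer does not have to check whether parentheses are balanced. Escaping every parenthesis is always valid, so a minimal encoder can be sketched as follows. The name `encode_literal_string` is made up for this example, and text encoding issues are ignored here.

```python
def encode_literal_string(text: str) -> str:
    """Wrap text in parentheses, escaping backslashes and parentheses."""
    escaped = (
        text.replace("\\", "\\\\")  # backslashes first, so escapes added below stay intact
        .replace("(", "\\(")
        .replace(")", "\\)")
    )
    return "(" + escaped + ")"
```

The output is slightly more verbose than necessary for balanced parentheses, but a reader decodes it to the same text.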

Text encoding is a difficult topic. Strings for metadata have to be encoded in PDFDoc. This encoding is a variant of ISO Latin-1 in which some control characters were replaced by printable characters. One problematic character is the non-breaking space (0xA0): in ISO Latin-1 it looks like a space, but in PDFDoc it denotes the Euro sign ("€").

Strings which are created to be displayed on paper are encoded in the encoding of the font they are written in. This can lead to pain and misery - however, there's a trick: in our "Hello World" PDF, we locked the font encoding to WinAnsi with /Encoding /WinAnsiEncoding. This is Adobe's name for "Windows 1252 - Western Europe".

"Windows 1252" is (like PDFDoc) a variant of ISO Latin-1 in which some control characters are converted to printable characters. Unlike PDFDoc, however, it keeps the existing printable characters unchanged. So if we write all page text into the PDF using "Windows 1252" and restrict our metadata to printable ISO Latin-1 characters (except for 0xA0), we can consider the whole file as "encoded as Windows 1252".

Later on, we will go into greater detail on how we can support Unicode and multi-byte characters with different code space ranges independent of fonts.

Hex strings

Sometimes we have to encode strings which can't be encoded via ASCII, ISO Latin-1, PDFDoc or Windows-1252. For these cases you can use hex strings. These are written in angle brackets (< and >) and contain the two-digit hex codes of the bytes. Whitespace is ignored.

Examples:

<48 61 6C 6C 6F 20 57 65 6C 74>
<48616C6C6F2057656C74>
<48616C6C6F20
57656C74>
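Decoding a hex string is straightforward, as the following sketch shows. The function name `decode_hex_string` is an assumption of this example. It only handles complete two-digit pairs and returns raw bytes, leaving the question of text encoding to the caller.

```python
def decode_hex_string(token: str) -> bytes:
    """Decode a hex string such as <48 61 6C> into raw bytes."""
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("hex strings are written between < and >")
    # Whitespace between the hex digits is ignored.
    digits = "".join(token[1:-1].split())
    return bytes.fromhex(digits)
```

All three examples above decode to the same ten bytes, the ASCII text "Hallo Welt".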

Boolean and Null

null is used sparingly. true and false encode, well, whether a value is true or false. All three are simply written as the literals null, true and false.

Arrays

Arrays are collections of objects that belong together. The contained objects are written between square brackets - as usual, separated by whitespace. Since the array itself is an object, nested arrays are possible - as long as the brackets are balanced.

Examples:

[1 2 3]
[/One /Two /Three]
[/One /array [/with /a /nested] /array]

Dictionaries

Dictionaries contain collections of objects, too, but here every object has a name. The order of the objects is not important (in contrast to arrays). Dictionaries are written between two opening angle brackets and two closing ones (<< and >>). Within a dictionary, names alternate with PDF objects, separated by whitespace, forming key - value pairs.

The keys must be PDF name objects. The value can be of any type except null. If a value is null, the entry is treated as if it had never been set.

Examples:

<< /Title (Hello World) /Author (John Doe) >>

<<
/Title (Hello World)
/Author (John Doe)
>>

Dictionaries follow the same rules as arrays: Dictionaries are themselves objects and can contain other dictionaries or arrays or vice versa.
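Since arrays and dictionaries nest freely, a writer is naturally recursive. The following sketch maps a few Python types to the syntax described so far. The `Name` class and the `serialize` function are inventions of this example, and for brevity it assumes that names and strings need no escaping.

```python
class Name(str):
    """Marks a Python string as a PDF name instead of a literal string."""


def serialize(value) -> str:
    """Serialize nested Python values into PDF object syntax."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, Name):  # check Name first: Name is a subclass of str
        return "/" + value
    if isinstance(value, str):
        return "(" + value + ")"
    if isinstance(value, list):
        return "[" + " ".join(serialize(item) for item in value) + "]"
    if isinstance(value, dict):
        parts = []
        for key, item in value.items():
            if item is None:
                continue  # a null value counts as "entry not set"
            parts.append("/" + key + " " + serialize(item))
        return "<< " + " ".join(parts) + " >>"
    raise TypeError("unsupported type: %r" % type(value))
```

For example, `serialize({"Title": "Hello World", "Author": "John Doe"})` reproduces the one-line dictionary shown above.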

Streams

Streams are used to encode larger data blocks into PDF objects, usually compressed. They are typically used for page content descriptions, images and embedded fonts or color profiles.

A stream begins with a dictionary. This dictionary must at least have the /Length entry set, containing the written length of the stream in the PDF (measured in bytes). A line break and the word stream must follow the dictionary, followed by a line break, the actual data, another line break and the keyword endstream.

The second line break is needed because on older Apple systems it wouldn't be clear where the whitespace ends and the data begins.

Examples:

<<
/Length 62
>>
stream
This is a short example stream. Your image data would go here.
endstream
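Because /Length has to match the data exactly, it is safest to compute it rather than type it by hand. A minimal sketch follows. The function name `build_stream` is hypothetical, and the sketch writes uncompressed data with "\n" line breaks.

```python
def build_stream(data: bytes) -> bytes:
    """Build a stream object body whose /Length matches the data."""
    # The length counts only the data bytes, not the line breaks
    # after "stream" and before "endstream".
    header = b"<<\n/Length %d\n>>\nstream\n" % len(data)
    return header + data + b"\nendstream"
```

The function works on bytes rather than text on purpose: the length is measured in bytes, which differs from the number of characters as soon as the data contains multi-byte characters or binary content.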

Indirect objects

An indirect object is an object that has a unique ID. Any object can be made indirect, but this is usually done only for dictionaries and streams. Streams are only allowed as indirect objects.

An indirect object is prefixed by the object ID, the generation ID and the keyword obj. The keyword endobj has to follow immediately after the object. The object ID is a globally incrementing number. The generation ID is usually 0; it only changes when a file has been edited, for example when an object has been deleted and its ID is reused later.

Example:

1 0 obj
<<
/Title (Hello World)
/Author (John Doe)
>>
endobj

2 0 obj
(Other objects can be indirect objects, too)
endobj
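Wrapping an already serialized object is a matter of adding the prefix and the closing keyword. In this sketch the function name `make_indirect` is made up, and the caller is responsible for handing out unique object IDs.

```python
def make_indirect(object_id: int, body: str, generation: int = 0) -> str:
    """Wrap a serialized object as an indirect object."""
    return "%d %d obj\n%s\nendobj" % (object_id, generation, body)
```

Calling `make_indirect(2, "(Other objects can be indirect objects, too)")` yields the second example above.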

References

References are not objects, but placeholders (like pointers). They are used to refer to an indirect object. Multiple references to one object are allowed, minimizing file size.

References consist of the object ID and the generation ID of the indirect object, followed by an uppercase "R".

Examples:

1 0 obj
(indirect text)
endobj

2 0 obj
<<
/Direct (direct text)
/Indirect 1 0 R
>>
endobj

References can be stored in arrays - sometimes a bit confusing, since they look like three separate objects at first glance. You'll have to look closely to see the difference:

These are 9 numbers: [1 0 8 2 0 8 3 0 8]

These, however, are 3 references: [1 0 R 2 0 R 3 0 R]
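The difference can also be spotted mechanically. The following sketch looks for the pattern "number, number, R" with a regular expression. The name `find_references` is an assumption of this example, and a real parser would work on tokens rather than on text that might contain strings.

```python
import re
from typing import List, Tuple

# Object ID, generation ID, then an uppercase R as a separate word.
REFERENCE = re.compile(r"(\d+)\s+(\d+)\s+R\b")


def find_references(text: str) -> List[Tuple[int, int]]:
    """Return (object ID, generation ID) for every reference in text."""
    return [(int(obj), int(gen)) for obj, gen in REFERENCE.findall(text)]
```

For the first array above the sketch finds nothing. For the second it returns the pairs (1, 0), (2, 0) and (3, 0).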

Instructions

In page content streams, you'll usually find instructions. These are specialized keywords which execute certain actions, for example the placement of text.

As in PostScript, these instructions follow the postfix notation - meaning that the instruction keyword comes after the parameters.

Example:

72 746 Td (Hello World) Tj

In this example, the first instruction is to call Td with the parameters 72 and 746. The instruction is used to place text at certain coordinates. Afterwards we call the instruction Tj, which takes the literal string "Hello World" and draws it on the virtual paper.
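Postfix notation makes such streams easy to read mechanically: collect parameters until a keyword appears, then hand them to that keyword. The sketch below illustrates the idea under strong simplifications. The name `parse_instructions` is made up, and it only understands numbers, names and literal strings without nested parentheses or escapes (no arrays, dictionaries or hex strings).

```python
import re
from typing import List, Tuple

# Either a simple literal string or a run of non-whitespace characters.
TOKEN = re.compile(r"\([^()\\]*\)|[^\s()]+")
NUMBER = re.compile(r"-?(?:[0-9]+\.?[0-9]*|\.[0-9]+)")


def parse_instructions(content: str) -> List[Tuple[str, List[str]]]:
    """Group parameters with the instruction keyword that follows them."""
    instructions = []
    operands = []
    for token in TOKEN.findall(content):
        is_operand = (
            token.startswith("(")      # literal string
            or token.startswith("/")   # name
            or NUMBER.fullmatch(token) is not None
        )
        if is_operand:
            operands.append(token)
        else:
            # Anything else is treated as an instruction keyword.
            instructions.append((token, operands))
            operands = []
    return instructions
```

For the example line, this sketch produces two entries: Td with the parameters 72 and 746, followed by Tj with the parameter (Hello World).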

Instructions are very case sensitive - differently cased instructions can mean different things and they can be easily confused, for example S and s.

Next up: File structure