
1.1.2 Syntax

Felix Schütt edited this page Jul 5, 2017 · 11 revisions

Like any file format, the data inside a PDF has a special syntax: well-defined rules for how the data is written into the file. Before we look closer at the contents of a PDF file, we have to learn these rules.

Everything in the PDF body is structured in so-called "objects". In the "Hello World" PDF, the document body starts after the %PDF-1.4 line and ends before the xref line (not including these lines!). An "object" in the context of a PDF is just any kind of information. It does not have anything to do with objects used in programming languages.

There are several types of these objects, which can contain various types of information, and each one is serialized differently. However, there is one rule that all objects have to adhere to: Two objects have to be separated by one or more whitespace characters, unless the start of the next object is obvious from the context the object is used in.

A general warning: Everything in PDF is case-sensitive. You always have to match the exact capitalization.

Whitespace

In PDF, whitespace is defined as the space (ASCII 0x20), the tab (ASCII 0x09) and the line break. The latter can be written as the UNIX "\n" (ASCII 0x0A), the Apple "\r" (ASCII 0x0D, used until Mac OS 9) or both (Windows "\r\n").

Theoretically, you can also use the NUL character (ASCII 0x00) and the form feed (ASCII 0x0C). However, these characters are rare in everyday PDFs.

All objects should be separated by whitespace. The type and amount of whitespace are not relevant.
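To illustrate that the type and amount of whitespace do not matter, here is a minimal Python sketch that splits raw bytes at any run of the six whitespace bytes listed above. The function name `split_on_pdf_whitespace` is made up for this example, and the splitter is deliberately naive: it knows nothing about strings, arrays or other delimiters.

```python
# The six whitespace bytes named above: NUL, tab, LF, form feed, CR, space.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def split_on_pdf_whitespace(data: bytes) -> list:
    """Split raw bytes into tokens at any run of PDF whitespace.

    Hypothetical helper for illustration only; a real tokenizer also has
    to handle delimiters such as ( ) < > [ ] / and %.
    """
    tokens = []
    current = bytearray()
    for byte in data:
        if byte in PDF_WHITESPACE:
            # A run of whitespace ends the current token (if there is one).
            if current:
                tokens.append(bytes(current))
                current.clear()
        else:
            current.append(byte)
    if current:
        tokens.append(bytes(current))
    return tokens


# Different whitespace, same tokens.
print(split_on_pdf_whitespace(b"1 2 3"))
print(split_on_pdf_whitespace(b"1\r\n2\t\t  3\n"))
```

Both calls produce the same three tokens, because only the token boundaries matter and not how they are spelled.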

Numbers

Numbers can be written as integers (referred to as "Integer") or as floating-point numbers (referred to as "Real"). Integers are simply written with the digits 0 - 9, negative numbers with a leading ASCII "-" (a leading "+" is also permitted). Other characters are not allowed, in particular no whitespace, commas or other punctuation.

Correct:

1234 -1234

Wrong:

1'234 1234-

Floating-point numbers use a decimal point (ASCII ".") to denote the decimal places. A comma (",") or exponential notation is not allowed.

Correct:

123.4 0.1234

Wrong:

123,4 1.234e2
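The number rules can be checked with two regular expressions. This is a sketch, not a full parser: `classify_number` is a name invented for this example, and the patterns also accept a leading "+" and reals such as `.5` or `4.`, which the PDF specification permits even though they go slightly beyond the rules listed above.

```python
import re

# Optional sign followed by one or more digits.
INTEGER = re.compile(rb"[+-]?[0-9]+")
# Optional sign, digits with exactly one decimal point, no exponent.
REAL = re.compile(rb"[+-]?([0-9]+\.[0-9]*|\.[0-9]+)")


def classify_number(token: bytes) -> str:
    """Return "Integer", "Real" or "invalid" for a single token.

    Hypothetical helper for illustration only.
    """
    if INTEGER.fullmatch(token):
        return "Integer"
    if REAL.fullmatch(token):
        return "Real"
    return "invalid"


for token in (b"1234", b"-1234", b"1'234", b"1234-",
              b"123.4", b"0.1234", b"123,4", b"1.234e2"):
    print(token.decode("ascii"), "->", classify_number(token))
```

The eight tokens are the "Correct" and "Wrong" examples from this section, so the output can be compared directly against them.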

Names

Names are short strings that are used as document-internal keys or values. They consist of an ASCII "/" followed by text in the range ASCII 0x20 - 0x7E (printable characters), except for whitespace and the characters #, (, ), <, >, [, ], {, }, / and %.

Example:

/Type /MediaBox

Sometimes you may need names that contain spaces or other problematic characters. These characters can be encoded with a pound sign ("#") followed by the two-digit hex code of the character. The encoding should be ASCII or UTF-8 (programs that display names are usually prepared to show both encodings).

Louis Grand	/Louis#20Grand
Page#34	        /Page#2334
Höflinger	/H#C3#B6flinger

As a further restriction: The code for the NUL character (#00) is illegal, and the codes for whitespace (#09, #0A, #0C, #0D, #20) may not immediately follow the leading forward slash of a name (no /#20Name).
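The escaping rules for names can be sketched in a few lines of Python. The function `encode_name` is hypothetical and assumes UTF-8 as the underlying encoding. It escapes every byte that is not a plain printable character and rejects the two cases described as illegal above.

```python
def encode_name(text: str) -> str:
    """Serialize text as a PDF name, escaping problematic bytes with #xx.

    Hypothetical helper for illustration only; UTF-8 is an assumption.
    """
    data = text.encode("utf-8")
    if 0x00 in data:
        raise ValueError("NUL (#00) is not allowed in a name")
    if data and data[0] in b"\t\n\x0c\r ":
        raise ValueError("a name may not start with whitespace")

    out = ["/"]
    for byte in data:
        ch = chr(byte)
        # Printable ASCII without space and without the reserved characters.
        if 0x21 <= byte <= 0x7E and ch not in "#()<>[]{}/%":
            out.append(ch)
        else:
            out.append("#%02X" % byte)
    return "".join(out)


print(encode_name("Louis Grand"))  # /Louis#20Grand
print(encode_name("Page#34"))      # /Page#2334
print(encode_name("Höflinger"))    # /H#C3#B6flinger
```

The three calls reproduce the table above: the space becomes #20, the pound sign itself becomes #23, and the "ö" becomes its two UTF-8 bytes C3 B6.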

Strings

Strings are texts that are meant to be displayed in some way. They are written in parentheses. The text may itself contain opening and closing parentheses, as long as they are balanced. Otherwise, parentheses have to be escaped with a backslash. Backslashes themselves have to be written as \\ to display a \.

Examples:

(Hallo Welt)                         % Hello World
(Text (with balanced) braces)        % Text (with balanced) braces
(Text (with \(unbalanced) braces)    % Text (with (unbalanced) braces
(Text \\with \\backslashes)          % Text \with \backslashes
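When writing strings from a program, the simplest safe approach is to escape every parenthesis, balanced or not, since an escaped parenthesis is always valid. The following Python sketch does that; `encode_literal_string` is a name invented for this example, and it does not deal with text encoding at all.

```python
def encode_literal_string(text: str) -> str:
    """Wrap text in parentheses, escaping backslashes and parentheses.

    Hypothetical helper for illustration only. Backslashes are escaped
    first so that the backslashes added for parentheses are not doubled.
    """
    escaped = (
        text.replace("\\", "\\\\")
            .replace("(", "\\(")
            .replace(")", "\\)")
    )
    return "(" + escaped + ")"


print(encode_literal_string("Hallo Welt"))
print(encode_literal_string("Text (with (unbalanced) braces"))
print(encode_literal_string("Text \\with \\backslashes"))
```

Escaping only the unbalanced parentheses, as in the examples above, produces shorter output but requires counting nesting depth first.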

Text encoding is a difficult topic. Strings for metadata have to be encoded in PDFDoc. This encoding is a variant of ISO Latin-1 in which some control characters were replaced by printable characters. One problematic byte is 0xA0: in ISO Latin-1 it is the non-breaking space, which looks like a space, but in PDFDoc it denotes the Euro sign ("€").

Strings that are meant to be displayed on paper are encoded in the encoding of the font they are written in. This can lead to pain and misery - however, there's a trick: In our "Hello World" PDF, we fixed the font encoding to WinAnsi with /Encoding /WinAnsiEncoding. This is Adobe's name for "Windows-1252 Western European".

"Windows 1252" is (like PDFDoc) a variant of ISO Latin-1 in which some control characters are replaced by printable characters. Unlike PDFDoc, however, it keeps the existing printable characters unchanged. So if we write every character into the PDF using "Windows 1252" and restrict our metadata to printable ISO Latin-1 characters (except for 0xA0), we can consider the whole file as "encoded as Windows 1252".

Later on, we will go into greater detail on how we can support Unicode and multi-byte characters with different code space ranges independent of fonts.
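The relationship between ISO Latin-1 and Windows-1252 can be checked with Python's built-in codecs. This sketch only compares those two encodings (Python ships no PDFDoc codec), and `differing_bytes` is a name invented for this example.

```python
def differing_bytes() -> list:
    """List every byte value that Latin-1 and Windows-1252 decode differently.

    Hypothetical helper for illustration only. Bytes that Python's cp1252
    codec leaves undefined are counted as differing.
    """
    diffs = []
    for value in range(0x100):
        raw = bytes([value])
        latin1 = raw.decode("latin-1")
        try:
            cp1252 = raw.decode("cp1252")
        except UnicodeDecodeError:
            cp1252 = None
        if cp1252 != latin1:
            diffs.append(value)
    return diffs


diffs = differing_bytes()
print([hex(v) for v in diffs])
print(b"\x80".decode("cp1252"))  # the Euro sign in Windows-1252
```

All differences fall into the range 0x80 - 0x9F, which holds control characters in ISO Latin-1. The printable characters from 0xA0 upwards decode identically in both encodings.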

Hex strings

Sometimes we have to encode strings that can't be encoded via ASCII, ISO Latin-1, PDFDoc or Windows-1252. For these cases you can use hex strings. These are written in angle brackets (< and >) and contain the two-digit hex codes of the bytes. Whitespace is ignored.

Examples:

<48 61 6C 6C 6F 20 57 65 6C 74>
<48616C6C6F2057656C74>
<48616C6C6F20
57656C74>
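Decoding a hex string takes only a few lines of Python. The function `decode_hex_string` is hypothetical. Padding an odd number of digits with a trailing 0 follows the PDF specification and is not covered in the text above.

```python
def decode_hex_string(token: str) -> bytes:
    """Decode a PDF hex string such as <48 61 6C> into raw bytes.

    Hypothetical helper for illustration only.
    """
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("a hex string must be enclosed in < and >")
    # Drop all PDF whitespace between the brackets.
    digits = "".join(ch for ch in token[1:-1] if ch not in "\x00\t\n\x0c\r ")
    # Per the PDF specification, a missing final digit is assumed to be 0.
    if len(digits) % 2:
        digits += "0"
    return bytes.fromhex(digits)


print(decode_hex_string("<48 61 6C 6C 6F 20 57 65 6C 74>"))
print(decode_hex_string("<48616C6C6F2057656C74>"))
print(decode_hex_string("<48616C6C6F20\n57656C74>"))
```

All three calls return the same bytes, the ASCII text "Hallo Welt", which shows that spacing and line breaks inside the brackets have no effect.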

Boolean and Null

null is used sparingly. true and false encode ... well, whether a value is true or false. They are simply written as the literals null, true and false.

Arrays

Arrays are collections of objects that belong together. The contained objects are written between square brackets - as usual, separated by whitespace. Since the array itself is an object, nested arrays are possible - as long as the brackets are balanced.

Examples:

[1 2 3]
[/One /Two /Three]
[/One /array [/with /a /nested] /array]
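Because an array is itself an object, serializing one is naturally recursive. The following Python sketch maps a few Python values to PDF syntax. The function `serialize` and the convention that a Python string starting with "/" stands for a name are assumptions made for this example; reals and strings are left out to keep it short.

```python
def serialize(obj) -> str:
    """Serialize None, bool, int, name strings and nested lists to PDF syntax.

    Hypothetical helper for illustration only.
    """
    if obj is None:
        return "null"
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(obj, bool):
        return "true" if obj else "false"
    if isinstance(obj, int):
        return str(obj)
    if isinstance(obj, str) and obj.startswith("/"):
        return obj
    if isinstance(obj, list):
        # Elements are separated by a single space; nesting recurses.
        return "[" + " ".join(serialize(item) for item in obj) + "]"
    raise TypeError("unsupported object: %r" % (obj,))


print(serialize([1, 2, 3]))
print(serialize(["/One", "/array", ["/with", "/a", "/nested"], "/array"]))
print(serialize([True, False, None]))
```

The first two calls reproduce the array examples above; the third shows the boolean and null literals inside an array.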