Getting Started with PDFExplorer
Requirements: Secure PDF
Introduction
The PDFExplorer component in Secure PDF provides low-level access to the internal PDF document structure, enabling developers to not only inspect documents but also edit them on an individual PDF object basis.
This guide covers how to use this component to navigate the PDF object tree as well as add, modify, and remove the various types of objects. Before continuing, it is recommended to download the latest version of Secure PDF to follow along with this guide.
Object Types and Document Structure
PDFExplorer works with the following eight PDF object types (the PDF specification also defines a null object, which is not covered here):
- Name
- String
- Real
- Integer
- Boolean
- Array
- Dictionary
- Stream
In PDFExplorer, name, string, real, integer, and boolean objects are categorized as "primitive" objects, and array, dictionary, and stream objects are categorized as "container" objects.
Before accessing individual objects with the component, it is important to understand how they are structured in the document. PDFExplorer aims to distinguish between the logical and physical representations of objects.
Logically, a PDF document is a tree of objects that can be traversed to extract data. For example, every document contains a document catalog that references a next-level /Pages object, which in turn references individual pages via a /Kids array. So, to get a page, you would first look for the /Root object in the document trailer, then proceed to its /Pages element, and then work with the /Kids array.
Physically, a document is the collection of all the objects that constitute it. Every object is recorded as one of the following:
- A direct (in-place) object (e.g., /Numbers [1 2 3 888]),
- An indirect (numbered) object, or
- A reference to an indirect object (e.g., /Numbers 8 0 R).
The way objects are physically stored is generally independent of their logical structure. If you are looking for a page, it matters little whether each object that you need to traverse to reach it is stored in-place, in one of the indirect objects, or in a compressed object stream.
Note that most larger objects (such as streams and dictionaries) are recorded in PDF files as indirect objects, with other objects referencing them. An indirect object is a global object that is uniquely identified by its object number followed by its generation number (e.g., 1 0 obj).
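To make the numbering scheme concrete, here is a minimal C# sketch that splits an indirect object header (1 0 obj) or a reference (8 0 R) into its object number and generation number. The helper name TryParseIndirect is hypothetical and is not part of PDFExplorer; it only illustrates the notation described above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of PDFExplorer): parses "N G obj" or "N G R"
// into the object number N and generation number G.
bool TryParseIndirect(string text, out int number, out int generation, out bool isReference)
{
    number = 0;
    generation = 0;
    isReference = false;

    // Two unsigned integers followed by either "obj" (a header) or "R" (a reference).
    Match m = Regex.Match(text.Trim(), @"^([0-9]+)\s+([0-9]+)\s+(R|obj)$");
    if (!m.Success) return false;

    isReference = m.Groups[3].Value == "R";
    return int.TryParse(m.Groups[1].Value, out number)
        && int.TryParse(m.Groups[2].Value, out generation);
}

if (TryParseIndirect("8 0 R", out int objNum, out int gen, out bool isRef))
{
    // Prints: object 8, generation 0, reference: True
    Console.WriteLine($"object {objNum}, generation {gen}, reference: {isRef}");
}
```

A direct object such as /Numbers [1 2 3 888] carries no object or generation number, so the helper returns false for it.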
Navigating the Document
To navigate the object tree, first provide the input document as a file (InputFile), byte array (InputData), or stream (SetInputStream) and call the Open method. This method will populate the RootObjects collection with the existing objects in the document trailer, as the trailer is considered to be the root of the logical object tree. The keys in the document trailer will typically be /Size, /Info, /Root, and /ID, plus /Encrypt for encrypted documents.
These objects can then be used as a starting point for navigating the document tree, which is done using the Select method. This method and others use the following syntax for specifying objects in the document:
- Slashes separate levels of hierarchy, like in file paths.
- The "root" slash (/) points to the document trailer dictionary.
- A path that does not start with a slash specifies an indirect object in the list of global numbered objects.
- The asterisk character (*) specifies all objects at the provided path.
Examples:
Consider the following PDF document:
%PDF-1.4
%cmmt
1 0 obj
<< /Type /Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<< /Type /Pages
/Kids [ 3 0 R ]
/Count 1
>>
endobj
3 0 obj
<< /Type /Page
/Parent 2 0 R
/MediaBox [ 0 0 612 792 ]
/Resources << /ProcSet 4 0 R >>
>>
endobj
4 0 obj
[ /PDF ]
endobj
xref
0 5
0000000000 65535 f
0000000015 00000 n
0000000065 00000 n
0000000125 00000 n
0000000234 00000 n
trailer
<< /Size 5
/Root 1 0 R
>>
startxref
259
%%EOF
Select would return the following results for the respective paths:
- / - a dictionary object that corresponds to the trailer dictionary.
- /Root - a dictionary object that corresponds to the dictionary at 1 0 obj, with its Disposition field set to Reference (as this is a reference to an indirect object).
- /Size - an integer object whose Value is 5 and Disposition is Direct (as this is a direct, in-place object).
- /Root/Type - a name object whose Value is Catalog (Disposition = Direct).
- /Root/Pages - a dictionary object that corresponds to the dictionary at 2 0 obj (Disposition = Reference).
- /Root/Pages/Kids - an array object (Disposition = Direct).
- /Root/Pages/Kids[0] - a dictionary object that corresponds to the dictionary at 3 0 obj (Disposition = Reference).
- /Root/Pages/Kids[0]/MediaBox - an array object with four integer elements (Disposition = Direct).
- /Root/Pages/Kids[0]/MediaBox[2] - an integer object whose Value is 612 (Disposition = Direct).
- 3 0 obj - a dictionary object that corresponds to the dictionary at 3 0 obj (Disposition = Indirect).
- 3 0 obj/Type - a name object whose Value is Page (Disposition = Direct).
- 3 0 obj/Parent - a dictionary object that corresponds to the dictionary at 2 0 obj (Disposition = Reference).
Once Select returns, the selected object(s) will be available in the SelectedObjects collection.
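The path resolution described above can be sketched without the component. The following self-contained C# example models the sample document in memory and walks a path through it, following references transparently. It is an illustrative assumption about how such paths resolve, not PDFExplorer's implementation: the Resolve and Deref helpers are hypothetical, names are stored as plain strings, and the asterisk wildcard is not handled.

```csharp
using System;
using System.Collections.Generic;

// In-memory model of the sample document. Dictionaries are Dictionary<string, object>,
// arrays are List<object>, and a reference "N G R" is a ValueTuple<int, int>.
var objects = new Dictionary<int, object>
{
    [1] = new Dictionary<string, object> { ["Type"] = "Catalog", ["Pages"] = (2, 0) },
    [2] = new Dictionary<string, object>
    {
        ["Type"] = "Pages",
        ["Kids"] = new List<object> { (3, 0) },
        ["Count"] = 1
    },
    [3] = new Dictionary<string, object>
    {
        ["Type"] = "Page",
        ["Parent"] = (2, 0),
        ["MediaBox"] = new List<object> { 0, 0, 612, 792 },
        ["Resources"] = new Dictionary<string, object> { ["ProcSet"] = (4, 0) }
    },
    [4] = new List<object> { "PDF" }
};
var trailer = new Dictionary<string, object> { ["Size"] = 5, ["Root"] = (1, 0) };

// Replaces a reference with the indirect object it points to; other values pass through.
object Deref(object value) =>
    value is ValueTuple<int, int> r ? objects[r.Item1] : value;

// Hypothetical resolver: "/..." starts at the trailer, "N G obj/..." at a numbered object.
object Resolve(string path)
{
    string[] segments = path.Split('/', StringSplitOptions.RemoveEmptyEntries);
    object node = trailer;
    int start = 0;
    if (!path.StartsWith("/", StringComparison.Ordinal))
    {
        node = objects[int.Parse(segments[0].Split(' ')[0])];
        start = 1;
    }
    for (int i = start; i < segments.Length; i++)
    {
        // Split "Kids[0]" into the key "Kids" and the array index 0.
        string segment = segments[i];
        int bracket = segment.IndexOf('[');
        string key = bracket < 0 ? segment : segment.Substring(0, bracket);
        node = Deref(((Dictionary<string, object>)node)[key]);
        if (bracket >= 0)
        {
            int index = int.Parse(segment.Substring(bracket + 1, segment.Length - bracket - 2));
            node = Deref(((List<object>)node)[index]);
        }
    }
    return node;
}

Console.WriteLine(Resolve("/Root/Pages/Kids[0]/MediaBox[2]")); // 612
Console.WriteLine(Resolve("3 0 obj/Parent/Count"));            // 1
```

Because Deref hides whether a value was stored in-place or behind a reference, the caller sees only the logical tree, which is the distinction between logical and physical structure drawn earlier in this guide.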
Adding and Modifying Objects
The sections below contain instructions for adding and modifying each type of object. Note that each of the Add* methods shown returns the path of the newly added object in the document, making it easy to access the corresponding PDFObject later using the Select method. These objects' values can then be adjusted to ensure the PDF document meets your requirements.
Primitive Objects
A primitive object is a non-container object that represents a name, string, real (double), integer, or boolean value. Primitive objects are typically stored in-place and referenced directly. Use the AddPrimitive method to add a direct primitive object and the AddObject method (with the Indirect parameter set to true) to add an indirect primitive object:
// Adding a direct string object to the /Info dictionary
string stringPath = pdfexplorer.AddPrimitive("/Info", "Creator", "Microsoft Word");
// Adding an indirect boolean object to the root
string booleanPath = pdfexplorer.AddObject("", 5, "", "true", true);
5 0 obj
<<
...
/Creator (Microsoft Word)
>>
endobj
...
6 0 obj
true
endobj
The value of a primitive object can then be modified if desired:
pdfexplorer.Select(stringPath, true);
pdfexplorer.SelectedObjects[0].Value = "nsoftware.SecurePDF";
pdfexplorer.Select(booleanPath, true);
pdfexplorer.SelectedObjects[0].Value = "false";
5 0 obj
<<
...
/Creator (nsoftware.SecurePDF)
>>
endobj
...
6 0 obj
false
endobj
Array and Dictionary Objects
Unlike primitives, arrays and dictionaries are objects that contain other objects. Elements within array objects are arranged sequentially and have implicit zero-based indices, whereas dictionary objects contain named key-value pairs that are unordered. Use the AddContainer method to add a direct or indirect array or dictionary object:
// Adding a direct array object to the first page's /Page dictionary
string arrayPath = pdfexplorer.AddContainer("/Root/Pages/Kids[0]", "CropBox", false, false);
// Adding an indirect dictionary object to the root
string dictPath = pdfexplorer.AddContainer("", "", true, true);
3 0 obj
<< /Type /Page
...
/CropBox [
]>>
endobj
...
7 0 obj
<<
>>
endobj
An array or dictionary object can then be modified by adding elements to it. The example below populates the /CropBox array with four integer objects and adds a /Type key to the newly created dictionary.
string cropBox0Path = pdfexplorer.AddPrimitive(arrayPath, "", "0");
string cropBox1Path = pdfexplorer.AddPrimitive(arrayPath, "", "0");
string cropBox2Path = pdfexplorer.AddPrimitive(arrayPath, "", "612");
string cropBox3Path = pdfexplorer.AddPrimitive(arrayPath, "", "792");
string typePath = pdfexplorer.AddPrimitive(dictPath, "Type", "/SampleType");
3 0 obj
<< /Type /Page
...
/CropBox [
0
0
612
792
]>>
endobj
...
7 0 obj
<<
/Type /SampleType
>>
endobj
Stream Objects
A stream object is a compound object consisting of a dictionary and a sequence of bytes. Stream objects are always indirect and are used to store data such as images, fonts, and other resources. Use the AddStream method to add a stream object:
// Adding a stream object to the root
byte[] image1Data = File.ReadAllBytes("image1.png");
string streamPath = pdfexplorer.AddStream("", "", image1Data);
8 0 obj
<<
/Length 6317
>>stream
... % binary data for image1.png
endstream
endobj
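For readers curious how the listing above relates to the bytes on disk, the following C# sketch assembles a stream object by hand: a dictionary whose /Length entry equals the number of data bytes, followed by the data between the stream and endstream keywords. BuildStreamObject is a hypothetical helper written for illustration only; it is not how AddStream is implemented, and it fixes the generation number at 0 as an assumption.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper (not part of PDFExplorer): serializes an indirect stream object.
// /Length counts only the data bytes, not the line breaks around them.
byte[] BuildStreamObject(int objectNumber, byte[] data)
{
    byte[] header = Encoding.ASCII.GetBytes(
        $"{objectNumber} 0 obj\n<< /Length {data.Length} >>\nstream\n");
    byte[] footer = Encoding.ASCII.GetBytes("\nendstream\nendobj\n");

    using var output = new MemoryStream();
    output.Write(header, 0, header.Length);
    output.Write(data, 0, data.Length);
    output.Write(footer, 0, footer.Length);
    return output.ToArray();
}

byte[] serialized = BuildStreamObject(8, Encoding.ASCII.GetBytes("ABC"));
Console.Write(Encoding.ASCII.GetString(serialized));
```

With real image data the payload is binary, so the result should be kept as a byte array rather than converted to a string.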
To modify a stream object, use the SetObjectData or SetObjectStream method:
byte[] image2Data = File.ReadAllBytes("image2.png");
pdfexplorer.SetObjectData(streamPath, image2Data);
// or pdfexplorer.SetObjectStream(streamPath, new MemoryStream(image2Data));
8 0 obj
<<
/Length 197
>>stream
... % binary data for image2.png
endstream
endobj
Object References
An (indirect) object reference is a reference to an indirect object from another object. Its syntax consists of the destination object's object number, its generation number, and R (e.g., 1 0 R). Use the AddReference method to add a reference to an existing object:
// Creating a reference to the stream at 8 0 obj and adding it to the dictionary at 7 0 obj
string path = pdfexplorer.AddReference("7 0 obj", "Image", "8 0 obj");
7 0 obj
<<
/Image 8 0 R
/Type /SampleType
>>
endobj
The contents of the destination object can be modified using the path returned by AddReference in the same way as any other indirect object. The reference will remain intact because the object and generation numbers of the destination object will not be affected.
Removing Objects
The RemoveObject method can be used to remove an object from the document. This method invalidates the former path of the removed object. If the removed object was an indirect object, however, any references to it elsewhere in the document will not be removed.
pdfexplorer.RemoveObject("7 0 obj/Image");
7 0 obj
<<
/Type /SampleType
>>
endobj
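Since references to a removed indirect object are left in place, it can be useful to check for references that no longer point anywhere. The C# sketch below scans PDF text for N G R references whose target N G obj is missing. It is a deliberately crude, regex-based illustration under the assumption of uncompressed, text-only content (it would misfire on binary stream data or matching text inside strings), and FindDanglingReferences is a hypothetical name, not a PDFExplorer method.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of PDFExplorer): returns the object numbers that are
// referenced ("N G R") but never defined ("N G obj") in the given PDF text.
List<int> FindDanglingReferences(string pdfText)
{
    var defined = new HashSet<int>(
        Regex.Matches(pdfText, @"\b([0-9]+)\s+[0-9]+\s+obj\b")
             .Cast<Match>()
             .Select(m => int.Parse(m.Groups[1].Value)));

    return Regex.Matches(pdfText, @"\b([0-9]+)\s+[0-9]+\s+R\b")
                .Cast<Match>()
                .Select(m => int.Parse(m.Groups[1].Value))
                .Where(n => !defined.Contains(n))
                .Distinct()
                .OrderBy(n => n)
                .ToList();
}

// Object 8 has been removed, but the dictionary at 7 0 obj still refers to it.
string sample = "7 0 obj\n<< /Image 8 0 R /Type /SampleType >>\nendobj\n";
Console.WriteLine(string.Join(", ", FindDanglingReferences(sample))); // 8
```

Removing the leftover reference itself, as in the RemoveObject("7 0 obj/Image") call above, is what clears such an entry.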
When finished adding, modifying, or removing objects, call the Close method to close the document and save the changes to OutputFile, OutputData, or the stream set with SetOutputStream.
We appreciate your feedback. If you have any questions, comments, or suggestions about this article, please contact our support team at support@nsoftware.com.