Hyland Connect

Patrick_Sweeney · 09-05-2016

Have you ever tried outputting a PDF as an Image file using OnBase? If you have, you may have noticed that the file size increased, sometimes dramatically. The degree of increase may not even have been consistent between two different PDFs converted to Image! To add insult to injury, the contents of the output Image file and the original PDF often look identical. It is reasonable to want an explanation for this kind of size increase.

At the root of this issue is the fact that PDF files and Image files are fundamentally different. Much of what is seen when viewing a PDF is not actually stored in the PDF document but is instead rendered by the PDF viewer based on instructions from the document. Image files, on the other hand, have no concept of rendering instructions and are instead a collection of bytes that has to define the entire image. As an analogy, if an Image file were a recording of a piece of music, then the PDF would be the sheet music for that piece.

In this post we hope to explain why the fundamental differences between the two document types result in an increase in file size when exporting a PDF file to an Image file.

PDF vs Image

PDF

PDF (Portable Document Format) files come in several varieties. For the sake of simplicity let’s only consider the two most common and, for our purposes, relevant cases: normal PDFs and Image Only PDFs.

Normal PDFs work by defining content and embedding it in a document along with instructions for how it should be rendered. Viewer programs like Adobe Reader execute these instructions in order to display the embedded content with the correct settings and in the right location. In effect the document that you see when viewing a PDF is compiled and assembled by the viewer, where the viewer actually supplies much of the data (e.g., font styles).

This has several advantages from a storage standpoint. PDFs don’t have to contain information for whitespace: the size of the document is defined and it is up to the viewer to fill the empty space. Text doesn’t need to be stored as pixel information. It can be stored and parsed as text and rendered in set fonts, sizes, and locations. A side effect of this is that PDF text can be magnified indefinitely without losing clarity. Characters will always appear continuous and sharp. We’ll see in a little bit that this does not hold true for Images. PDFs can also embed images to be displayed in the page. These embedded images suffer many of the same penalties as an Image file. An important distinction, though, is that an embedded image is a discrete element. It won’t affect the rendering of other items on the page, a benefit that an Image page does not receive.

Image Only PDFs are different in that their entire content is image data. In practice these files don’t enjoy many of the above advantages of a normal PDF and are much closer in size and concept to their Image counterparts. As a result, PDF to Image conversion of these files doesn’t often produce a drastic size increase.

Image

Image files define a collection of pixels, each pixel set to a particular color. Put enough pixels together and they start to form shapes that we can recognize. In an image even the text you see is actually a collection of colored pixels. Zoom in far enough and the individual pixels become readily visible.

This means both that Image files only need to contain color information for each pixel and that Image viewers only need to be able to translate that color information to render the specified color for each pixel. There is no concept of additional instructions and no separate processing steps for different parts of an image – all the pixels on a page are processed the same way.

The color information for a pixel is essentially a number that the Image viewer uses to select a color to display. For black and white images the color information can be represented by two numbers, 0 and 1, which fit in a single bit. The total collection of available colors in an image is called the color space for that image. The bit-depth of an image is the number of bits needed to uniquely identify any color in that color space. If a color space has 16 available colors then you’d need 16 numbers to uniquely identify them, or 4 bits. If an image page has a bit-depth of 8, every pixel on the page is represented by an 8-bit number.
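To make the relationship between color count and bit-depth concrete, here is a minimal sketch. The function name `bit_depth` is our own for illustration and is not part of OnBase or any imaging library.

```python
def bit_depth(num_colors: int) -> int:
    """Smallest number of bits that can uniquely label num_colors colors."""
    if num_colors < 1:
        raise ValueError("a color space needs at least one color")
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 1, with no float rounding.
    # A one-color space still occupies at least 1 bit per pixel here.
    return max(1, (num_colors - 1).bit_length())


black_and_white = bit_depth(2)   # 1 bit per pixel
sixteen_colors = bit_depth(16)   # 4 bits per pixel
grayscale_256 = bit_depth(256)   # 8 bits per pixel
```

Note that the bit-depth only grows when the color count crosses a power of two: 17 colors already need 5 bits, the same as 32 colors.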

It may be obvious at this point why this way of storing pixel data is inconvenient from a storage perspective. In an image file the whitespace that doesn’t exist in a PDF must now be stored with the same amount of data as any other colored space. Furthermore the number of bits required to represent any pixel, including whitespace pixels, is dictated by the bit-depth of the page. To understand why that is such an important point, consider a black and white document. Each pixel is represented by a single bit. If color is added to the image then the bit-depth of the page has to be promoted to support the newly added color. If just one 8-bit icon is added then every pixel has to be represented using 8 bits, increasing the uncompressed size of the page eightfold!
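The arithmetic behind this can be sketched in a few lines. The page dimensions below are an assumption (a Letter page at 300 DPI, so 2550 x 3300 pixels), and the sizes are raw pixel data only: real image formats add headers and usually apply compression, so actual files will differ.

```python
def raw_page_bytes(width_px: int, height_px: int, bit_depth: int) -> int:
    """Uncompressed pixel data for one page, rounded up to whole bytes."""
    total_bits = width_px * height_px * bit_depth
    return (total_bits + 7) // 8


# Assumed page: 8.5 x 11 inches at 300 DPI.
WIDTH_PX, HEIGHT_PX = 2550, 3300

sizes = {depth: raw_page_bytes(WIDTH_PX, HEIGHT_PX, depth)
         for depth in (1, 8, 24, 32)}

for depth, size in sizes.items():
    print(f"{depth:>2}-bit page: {size:>11,} bytes")
```

Under these assumptions the 1-bit page holds about 1 MB of pixel data and the 8-bit page about 8.4 MB, even though most of those pixels are plain white.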

PDF to Image Conversions

Based on the fundamental differences between PDFs and Images we can draw a few conclusions to help explain the increase in size often observed when converting PDF files to images.

The whitespace of a document has significantly different impacts on file size for PDFs and Images. Since PDFs don’t have to store whitespace information, whitespace imparts no size penalty to the resulting file. By contrast, images have to represent whitespace like any other pixel, and the amount of data needed for each whitespace pixel is determined by the bit-depth of the page. In a conversion from PDF to Image, storing whitespace as data can by itself cause a significant increase in size.

Likewise the color information present in a PDF file can cause size increases when converting the document to an image. As mentioned above, elements of a PDF file are independent of each other: they can have different color spaces, fonts, etc. In contrast, every pixel of an Image page is processed in the same way. When converting from a PDF to an Image the page must be taken as a whole, and the entirety of it promoted to the greatest bit-depth of any individual element. As another example, imagine a PDF that is 95% composed of a black and white image of bit-depth 1. The remaining 5% of the document is a small 32-bit color icon. A conversion to Image would need to preserve the 32-bit color space, so every pixel would have to be represented using 32 bits. This would of course result in a large file size increase.
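The 95% / 5% example can be worked through as a rough model. This sketch only counts bits per unit of page area. It ignores compression and file overhead, and the helper name `page_bit_depth` is hypothetical.

```python
def page_bit_depth(element_bit_depths):
    """A flattened page must use the deepest bit-depth of any element on it."""
    return max(element_bit_depths)


# (share of page area in percent, bit-depth of that element)
elements = [(95, 1),   # black and white image
            (5, 32)]   # small 32-bit color icon

flattened_depth = page_bit_depth(depth for _, depth in elements)

# Bits needed if each element keeps its own bit-depth (the PDF-like case).
separate_bits = sum(area * depth for area, depth in elements)

# Bits needed once every pixel is promoted to the deepest bit-depth.
flattened_bits = sum(area for area, _ in elements) * flattened_depth

growth = flattened_bits / separate_bits
print(f"page bit-depth: {flattened_depth}, growth: {growth:.1f}x")
```

In this simplified model the icon covers only 5% of the page, yet flattening makes the pixel data roughly 12.5 times larger than storing each element at its own bit-depth.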

Image vs Multi-page Image

At this point PDF to Image conversions may be looking a little grim, with output potentially magnifying the file size by an order of magnitude or more. If you’ve watched PDFs convert to larger images, though, you may have noticed that the amount of growth varies much more than that. A single page PDF output as an image may grow by 400% but a 100 page PDF output as an image might only grow by 4%. Various factors can contribute to these discrepancies, such as compression quality settings and color spaces. But a very relevant element of this example is the page count.

As we’ve already noted PDF to image conversion can run into several size penalties, most often associated with having to render every pixel and upscaling those pixels to higher bit depths. There is an important nuance when converting a multi-page PDF to a multi-page image or a series of image files. That nuance is that the conversion is carried out on a page by page basis.

Remember the example of the document with the small color icon? On its own the page containing that color information will need to be upscaled to accommodate the color space of the icon and could grow to an alarming degree when converted to an image. However if the icon only exists on the first page of a 100 page document it will only affect the first page. If the other 99 pages are black and white text then each individual page may only grow in size by a very small amount during the image conversion – each pixel only needs a single bit for representation. Overall, despite the increase in size of the first page, the output image file size may not be that much larger than the source PDF file size. Depending on the document the impact of a single page may be negligible when the document is taken as a whole.
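A back-of-the-envelope comparison shows how much the page by page behavior matters. As before, the numbers are raw pixel data for an assumed Letter page at 300 DPI, and `document_bytes` is a name we made up for this sketch. Compressed output will differ, but the proportions illustrate the point.

```python
PIXELS_PER_PAGE = 2550 * 3300  # assumed: 8.5 x 11 inch page at 300 DPI


def document_bytes(page_bit_depths, pixels_per_page=PIXELS_PER_PAGE):
    """Raw pixel data for a multi-page image, each page at its own bit-depth."""
    total_bits = sum(depth * pixels_per_page for depth in page_bit_depths)
    return total_bits // 8


all_black_and_white = [1] * 100        # 100 pages of 1-bit text
icon_on_first_page = [32] + [1] * 99   # only page 1 is promoted to 32 bits
every_page_promoted = [32] * 100       # if promotion applied to the whole file

baseline = document_bytes(all_black_and_white)
mixed = document_bytes(icon_on_first_page)
worst = document_bytes(every_page_promoted)

print(f"icon on page 1 only: {mixed / baseline:.2f}x the black and white size")
print(f"every page promoted: {worst / baseline:.0f}x the black and white size")
```

With these assumptions the single color page adds about 31% to the whole document, whereas promoting all 100 pages would make it 32 times larger.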

Conclusion

The ways PDF files are created, stored, and rendered often lead to very small file sizes. When converting those PDFs to images, especially in cases where colors or images with large bit-depths are involved, the resulting image file may need to expand in size, sometimes dramatically. This is due largely to the fundamental differences in how PDFs and images are stored and rendered. Color PDFs in particular are vulnerable to converting to very large image files. The silver lining is that the PDF to image conversion happens on a page by page basis, and so larger documents are more resistant (but not immune) to this observed size increase. We hope this helped clarify a confusing and hard-to-pin-down issue. Join us next week when we discuss locking practices in the Unity API.