Capturing TIFF metadata
Posted by Adam Reilly on April 16, 2010
Building from the same basic structure as the file system metadata gatherer (http://logik.com/whats_new/entry/capturing_file_system_metadata/), we can incorporate functionality to pull information from within the file. Once documents have been reviewed and produced, it is very common for them to be converted from their native or ‘dynamic’ form into a more static, page-oriented form such as a TIF image. When the number of pages in a production approaches the millions, it becomes impractical to manually check every file for small details like compression, page orientation and resolution. Using the ‘for’ loop from the previous example and incorporating a third-party library makes it possible to quickly generate a useful summary of all TIF images in a folder.
PIL
The Python Imaging Library (PIL - http://www.pythonware.com/products/pil/index.htm) is a general purpose image manipulation library for Python. It has classes and methods to parse, load and manipulate images in several different formats. We will demonstrate a very small part of the overall functionality, and it is well worth glancing through the documentation to figure out what else is possible.
All of the required files are bundled into a Windows installer which can be obtained from the Pythonware website (http://www.pythonware.com/products/pil/index.htm). Be sure to download the version that is appropriate for your particular installation of Python. Once the installer has completed, we’re able to import the Image module with
from PIL import Image
We’ll modify the glob call so that it only grabs files with a TIF extension and then iterate over that list in the for loop. Note the slight difference between this example and the prior one (you’ll probably find yourself repeating or reusing patterns from time to time).
fileList = glob.glob(base_path + "*.tif")
Then, we can use the Image module’s open function, which takes a path and creates an in-memory representation of the image stored at that path.
im = Image.open(file)
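Before moving on to the image properties, one caveat about the glob pattern used to build fileList: on case-sensitive file systems, "*.tif" will not match files named .TIF or .tiff. The following is a minimal sketch, assuming a production may contain those variants. The helper name find_tiffs and the TIFF_EXTENSIONS tuple are hypothetical and not part of the original script, and os.path.join is swapped in for string concatenation so the path separator is handled for us.

```python
import glob
import os

# Extensions treated as TIFF images (an assumption; adjust as needed)
TIFF_EXTENSIONS = (".tif", ".tiff")


def find_tiffs(folder):
    """Return a sorted list of files in folder whose extension looks like TIFF."""
    matches = []
    # Glob everything, then filter on the lowercased extension so that
    # .tif, .TIF and .tiff are all picked up regardless of platform
    for path in glob.glob(os.path.join(folder, "*")):
        ext = os.path.splitext(path)[1].lower()
        if os.path.isfile(path) and ext in TIFF_EXTENSIONS:
            matches.append(path)
    return sorted(matches)
```

The returned list can be used in place of fileList, so each path is still handed to Image.open inside the for loop.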
If nothing’s gone wrong, the only thing left to do is access properties and methods of the im variable to get a summary of properties for every TIF image in the folder. In this case, we’ll be interested in the compression, resolution, orientation and the number of pages. PIL has built-in attributes that cover most of these pieces of information.
| Field | Information/Format |
| --- | --- |
| im.format | Returns a string containing the format of the current image, e.g. “TIFF” |
| im.size | Returns the dimensions of the image as an ordered pair of pixel counts: (width, height) |
| im.info | Returns a dictionary with different fields depending on the image format |
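As a rough sketch of how these three attributes can be combined into one QC record, the function below reads them from any object that exposes format, size and info. The name summarize is hypothetical, and deriving orientation by comparing width to height is an assumption of this sketch, not something PIL reports directly. The 'compression' and 'dpi' keys are the ones that appear in the sample output later in this post; other files may not carry them, so .get() is used.

```python
def summarize(im):
    """Collect the QC fields discussed above from an image-like object.

    Works with anything exposing format, size and info attributes,
    such as the object returned by PIL's Image.open().
    """
    width, height = im.size
    # Orientation is inferred from the pixel dimensions (an assumption)
    if width > height:
        orientation = "landscape"
    elif width < height:
        orientation = "portrait"
    else:
        orientation = "square"
    return {
        "format": im.format,
        "size": (width, height),
        "orientation": orientation,
        # .get() returns None when a key is missing instead of raising
        "compression": im.info.get("compression"),
        "dpi": im.info.get("dpi"),
    }
```

In the script, this would be called as summarize(im) right after im = Image.open(file).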
Page Count
PIL does not have a built-in method or property for counting the number of pages in a file, so we’ll have to define our own. First, we’ll take a brief detour into a general programming topic called “Exception Handling.” Image objects in PIL have a method called seek() which accepts an integer as an argument and attempts to move to that page, counting from zero. Trying to seek to page 35 in a one-page document will cause the script to enter a special state known as an exception.
Look before you leap
Exceptions occur when programs do something that is unexpected or undefined. For instance, many languages raise a “divide by zero” exception when code attempts to divide a number by zero. Exceptions are different from program crashes in that code which is likely to raise an exception can be wrapped inside special blocks of code which try to perform the operation, detect an exception if it occurs and then execute cleanup code so the program can keep running without crashing. In Python, this special code is known as a try/except block.
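To make the idea concrete, here is a small sketch of a try/except block built around the divide-by-zero case mentioned above. The function name safe_divide and the choice to return None on failure are illustrative assumptions.

```python
def safe_divide(numerator, denominator):
    """Divide two numbers, returning None instead of crashing on zero."""
    try:
        # The operation that might raise an exception
        return numerator / denominator
    except ZeroDivisionError:
        # Fallback path: runs only if the division raised
        return None


print(safe_divide(10.0, 4.0))  # 2.5
print(safe_divide(10.0, 0.0))  # None
```

The second call would normally stop the script with a traceback; because the division is wrapped in try/except, execution simply continues.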
Since we have no way of determining where a particular document ends, we can take advantage of the fact that seek throws an exception. Essentially, we’ll just keep trying to move to the next page until an exception is raised. The following function keeps track of the number of pages successfully accessed with a counter variable.
def tifPageCount(tif):
    pageCount = 1
    try:
        while True:
            tif.seek(pageCount)
            pageCount += 1
    except EOFError:
        pass
    return pageCount
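One side effect of the function above is that the image is left positioned on its last page. The variant below is a sketch under a few assumptions: it restores the starting page using tell(), and it first checks for an n_frames attribute, which Pillow (the maintained fork of PIL) exposes on multi-page images but which the original PIL release described here does not. The names count_pages and FakePagedImage are hypothetical; the fake class only exists so the logic can be tried without an image file.

```python
def count_pages(im):
    """Count pages in an image-like object without losing its position."""
    # Newer Pillow versions expose the frame count directly (assumption:
    # the attribute is absent on the older PIL used in this post)
    frames = getattr(im, "n_frames", None)
    if frames is not None:
        return frames
    start = im.tell()  # remember the current page
    count = 0
    try:
        while True:
            im.seek(count)  # raises EOFError past the last page
            count += 1
    except EOFError:
        pass
    finally:
        im.seek(start)  # put the image back where we found it
    return count


class FakePagedImage(object):
    """Hypothetical stand-in that mimics seek()/tell() on a paged image."""

    def __init__(self, pages):
        self.pages = pages
        self.position = 0

    def tell(self):
        return self.position

    def seek(self, page):
        if not 0 <= page < self.pages:
            raise EOFError("no more pages")
        self.position = page
```

Calling count_pages(FakePagedImage(3)) walks pages 0, 1 and 2, hits EOFError on page 3 and reports three pages.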
Putting it all together
Here’s the full working code:
# imports functionality to enumerate files
import glob
# imports functionality to read command line arguments
import sys
# imports the PIL Image module used to open each file
from PIL import Image
# Using a provided Image object, continually seek to the next page until
# an EOFError is raised. Keep track of the successfully encountered
# pages with a counter variable
def tifPageCount(tif):
    pageCount = 1
    # Code in this block will execute until the end of the image file
    # is reached
    try:
        while True:
            tif.seek(pageCount)
            pageCount += 1
    except EOFError:
        pass
    # try/except has completed, return the count
    return pageCount
if __name__ == "__main__":
    base_path = sys.argv[1] + "\\"
    # The glob module will find all files on a certain path
    # which match the pattern provided (in this case we will use
    # *.tif to match only tif images)
    fileList = glob.glob(base_path + "*.tif")
    # Store the delimiter in this variable for convenience
    d = "|"
    # iterate over the list of tiff files
    for file in fileList:
        # create an image object
        im = Image.open(file)
        # pull relevant information out of the image object
        imgFmt = str(im.format)
        imgSize = str(im.size)
        imgInfo = str(im.info)
        # Call the page counting function
        numPages = str(tifPageCount(im))
        # print one delimited record per file
        print(file + d + imgFmt + d + imgSize + d + imgInfo + d + numPages)
Running this code in a folder of single-page tifs yields the following results:
ABC0131816.tif|TIFF|(2550, 3300)|{'compression': 'group4', 'dpi': (300, 300)}|1
ABC0131817.tif|TIFF|(2550, 3300)|{'compression': 'group4', 'dpi': (300, 300)}|1
ABC0131818.tif|TIFF|(2550, 3300)|{'compression': 'group4', 'dpi': (300, 300)}|1
ABC0131819.tif|TIFF|(2550, 3300)|{'compression': 'group4', 'dpi': (300, 300)}|1
ABC0131820.tif|TIFF|(2550, 3300)|{'compression': 'group4', 'dpi': (300, 300)}|1
ABC0131821.tif|TIFF|(2550, 3300)|{'compression': 'group4', 'dpi': (300, 300)}|1
This information can be used to quickly identify any abnormalities with compression, resolution or page orientation. Additionally, it is useful in determining page counts within a folder. The pipe-delimited output could easily be ingested into a database or a spreadsheet program like Excel as an effective and thorough QC technique.
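For a quick check without a database or spreadsheet, the output lines can also be parsed back in Python. The sketch below assumes the five-field, pipe-delimited layout shown above, that no field contains the delimiter, and that the size and info fields are plain Python literals (true for the sample output, but not guaranteed for every file). The names parse_record and find_abnormal are hypothetical, and the expected values of 'group4' and (300, 300) are simply taken from the sample results.

```python
import ast


def parse_record(line, delimiter="|"):
    """Split one output line into typed fields."""
    path, fmt, size, info, pages = line.rstrip("\n").split(delimiter)
    return {
        "path": path,
        "format": fmt,
        # literal_eval safely turns "(2550, 3300)" back into a tuple
        "size": ast.literal_eval(size),
        "info": ast.literal_eval(info),
        "pages": int(pages),
    }


def find_abnormal(lines, compression="group4", dpi=(300, 300)):
    """Return paths whose compression or resolution differs from the expected values."""
    flagged = []
    for line in lines:
        record = parse_record(line)
        info = record["info"]
        if info.get("compression") != compression or info.get("dpi") != dpi:
            flagged.append(record["path"])
    return flagged
```

Passing the six sample lines above to find_abnormal returns an empty list, since every file uses group4 compression at 300 dpi.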