Capturing TIFF metadata
Posted By Adam Reilly on April 16, 2010


Image Courtesy of: http://www.cksinfo.com/clipart/construction/tools/magnifyingglasses/magnifying-glass-black-handle.png

Building on the same basic structure as the file system metadata gatherer (http://logik.com/whats_new/entry/capturing_file_system_metadata/), we can add functionality to pull information from within the file itself.  Once documents have been reviewed and produced, it is very common for them to be converted from their native or ‘dynamic’ form into a more static, page-oriented form such as a TIF image.  When the number of pages in a production approaches the millions, it becomes impractical to check every file by hand for small details like compression, page orientation and resolution.  Using the ‘for’ loop from the previous example and incorporating a third-party library makes it possible to quickly generate a useful summary of all TIF images in a folder.

PIL

The Python Imaging Library (PIL - http://www.pythonware.com/products/pil/index.htm) is a general purpose image manipulation library for Python.  It has classes and methods to parse, load and manipulate images in several different formats.  We will demonstrate a very small part of the overall functionality, and it is well worth glancing through the documentation to figure out what else is possible. 

All of the required files are bundled into a Windows installer which can be obtained from the Pythonware website (http://www.pythonware.com/products/pil/index.htm).  Be sure to download the version that matches your particular installation of Python.  Once the installer has completed, we’re able to import the Image module with

from PIL import Image

We’ll modify the glob call so that it only grabs files with a .tif extension and then iterate over the resulting list in the for loop.  Note the slight difference between this example and the prior one (you’ll probably find yourself repeating or reusing patterns from time to time).

fileList = glob.glob(base_path + "*.tif")
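As a rough sketch of the file-matching step, the snippet below wraps the glob call in a small helper.  The helper name list_tifs is hypothetical (it is not part of the original script), and it assumes os.path.join is acceptable in place of string concatenation so that the folder path does not need a trailing separator.

```python
import glob
import os


def list_tifs(base_path):
    """Return a sorted list of paths in base_path that end in .tif."""
    # os.path.join inserts the correct separator for the current platform
    pattern = os.path.join(base_path, "*.tif")
    # glob makes no promise about ordering, so sort for repeatable output
    return sorted(glob.glob(pattern))


# Example usage: list the TIF files in the current working directory
for path in list_tifs("."):
    print(path)
```

Whether the match is case-sensitive depends on the operating system, so files named with an upper-case .TIF extension may need their own pattern on some platforms.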

Then, we can use the Image module’s open function, which takes a path and returns an object representing the image stored at that path.

im = Image.open(file)


If nothing’s gone wrong, the only thing left to do is read attributes of the im variable to get a summary for every TIF image in the folder.  In this case, we’re interested in the compression, resolution, orientation and number of pages.  PIL exposes most of these pieces of information directly.

Field       Information/Format
im.format   Returns a string containing the format of the current image (“TIFF” in all these examples)
im.size     Returns the dimensions of the image as an ordered pair of pixel values (width, height)
im.info     Returns a dictionary with different fields depending on the image type
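To make the three fields a little more concrete, here is a minimal sketch of how their values might be turned into a per-image summary.  The function name describe_page is hypothetical, and it works on a plain tuple and dictionary shaped like im.size and im.info rather than on a real image object.  The values passed in are copied from the sample output shown later in this post, and the orientation rule (wider than tall means landscape) is an assumption about how you would want to classify pages.

```python
def describe_page(size, info):
    """Summarize a (width, height) tuple and an info dictionary.

    size is shaped like im.size and info is shaped like im.info.
    Missing dictionary keys come back as None instead of raising.
    """
    width, height = size
    dpi = info.get("dpi")                  # e.g. (300, 300), or None
    compression = info.get("compression")  # e.g. "group4", or None

    # Assumption: orientation is inferred from pixel dimensions alone
    orientation = "landscape" if width > height else "portrait"

    # Physical size in inches can only be computed when dpi is known
    inches = None
    if dpi:
        inches = (width / float(dpi[0]), height / float(dpi[1]))

    return {
        "compression": compression,
        "dpi": dpi,
        "orientation": orientation,
        "inches": inches,
    }


# Values copied from the sample output later in this post
summary = describe_page((2550, 3300), {"compression": "group4", "dpi": (300, 300)})
print("%s %s" % (summary["orientation"], summary["inches"]))  # portrait (8.5, 11.0)
```

Using info.get() rather than info["dpi"] matters here because the keys in the dictionary vary by image type, so a key that is present for one file may be absent for another.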

Page Count

PIL does not have a built-in method or property for counting the number of pages in a file, so we’ll have to define our own.  First, we’ll take a brief detour into a general programming topic called “Exception Handling.”  The Image class in PIL has a method called seek() which accepts an integer as an argument and attempts to move to that page (pages are numbered from zero).  Trying to seek to page 35 in a one-page document will cause the script to enter a special state known as an exception.

Look before you leap

Exceptions occur when programs do something that is unexpected or undefined.  For instance, many languages raise a “divide by zero” exception when code attempts to divide a number by zero.  Exceptions are different from program crashes in that code which is likely to raise an exception can be wrapped inside special blocks which try to perform the operation, detect an exception if it occurs and then execute cleanup code so that the program can keep running.  In Python, this construct is known as a try/except block.
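As a small illustration of the idea, the sketch below wraps a division in a try/except block.  The function name safe_divide and the choice to return None on failure are assumptions made for this example only.

```python
def safe_divide(numerator, denominator):
    """Divide two numbers, returning None instead of crashing on zero."""
    try:
        # This line raises ZeroDivisionError when denominator is 0
        return numerator / denominator
    except ZeroDivisionError:
        # Fallback code runs here, and the program keeps going
        return None


print(safe_divide(10.0, 4))  # 2.5
print(safe_divide(10.0, 0))  # None
```

Note that the except clause names the specific exception it expects.  Any other kind of error would still surface normally, which is usually what you want.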

Since we have no way of determining up front where a particular document ends, we can take advantage of the fact that seek raises an exception.  Essentially, we’ll just keep trying to move to the next page until an exception is raised.  The following function keeps track of the number of pages successfully accessed with a counter variable.

def tifPageCount(tif):

    pageCount = 1
    try:
        while True:
            tif.seek(pageCount)
            pageCount += 1
    except EOFError:
        pass

    return pageCount
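To see the counting loop work without a real image on hand, the sketch below runs the same logic against a stand-in object.  FakePagedImage is a hypothetical class invented for this illustration and is not part of PIL.  It only mimics the one behavior the loop relies on, namely that seek() raises EOFError when asked for a page past the end.

```python
class FakePagedImage(object):
    """Hypothetical stand-in for a multi-page image (not part of PIL)."""

    def __init__(self, page_total):
        self.page_total = page_total

    def seek(self, page_index):
        # Pages are zero-indexed, so valid indexes are 0 .. page_total - 1
        if page_index >= self.page_total:
            raise EOFError("no page at index %d" % page_index)


def count_pages(image):
    """Same approach as tifPageCount: seek forward until EOFError."""
    page_count = 1  # page 0 is already available once the image is open
    try:
        while True:
            image.seek(page_count)
            page_count += 1
    except EOFError:
        pass
    return page_count


print(count_pages(FakePagedImage(1)))   # 1
print(count_pages(FakePagedImage(35)))  # 35
```

One thing to keep in mind with a real image: after the loop finishes, the image is left positioned on its last page, so code that goes on to read page-specific data may want to call seek(0) first.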

 

Putting it all together

Here’s the full working code:

# imports functionality to enumerate files
import glob

# imports functionality to read command line arguments
import sys

# imports the PIL module used to open images
from PIL import Image


# Using a provided image object, continually seek to the next page until
# an EOFError is raised.  Keep track of the successfully encountered
# pages with a counter variable
def tifPageCount(tif):

    pageCount = 1

    # Code in this block will execute until the end of the image file
    # is reached
    try:
        while True:
            tif.seek(pageCount)
            pageCount += 1
    except EOFError:
        pass

    # try/except has completed, return the count
    return pageCount


if __name__ == "__main__":

    base_path = sys.argv[1] + "\\"

    # The glob module will find all files on a certain path
    # which match the pattern provided (in this case we will use
    # *.tif to match only tif images)
    fileList = glob.glob(base_path + "*.tif")

    # Store the delimiter in this variable for convenience
    d = "|"

    # iterate over the list of tiff files
    for file in fileList:

        # create an image object
        im = Image.open(file)

        # pull relevant information out of the image object
        imgFmt = str(im.format)
        imgSize = str(im.size)
        imgInfo = str(im.info)

        # call the page counting function
        numPages = str(tifPageCount(im))

        # print one delimited record for this file
        print(file + d + imgFmt + d + imgSize + d + imgInfo + d + numPages)

 

Running this code against a folder of single-page TIFs yields the following results:

ABC0131816.tif|TIFF|(2550, 3300)|{'compression': 'group4', 'dpi': (300, 300)}|1

ABC0131817.tif|TIFF|(2550, 3300)|{'compression': 'group4', 'dpi': (300, 300)}|1

ABC0131818.tif|TIFF|(2550, 3300)|{'compression': 'group4', 'dpi': (300, 300)}|1

ABC0131819.tif|TIFF|(2550, 3300)|{'compression': 'group4', 'dpi': (300, 300)}|1

ABC0131820.tif|TIFF|(2550, 3300)|{'compression': 'group4', 'dpi': (300, 300)}|1

ABC0131821.tif|TIFF|(2550, 3300)|{'compression': 'group4', 'dpi': (300, 300)}|1

 

This information can be used to quickly identify any abnormalities in compression, resolution or page orientation.  It is also useful for determining page counts within a folder.  The output could easily be ingested into a database or a spreadsheet program like Excel as an effective and thorough QC technique.
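As one possible way to automate that check, the sketch below reads lines in the pipe-delimited format shown above and flags any file whose compression or dpi differs from an expected value.  The function names parse_record and find_abnormal are hypothetical, the expected values are simply taken from the sample output, and the last input line is invented purely to give the check something to flag.  It also assumes the info column contains only simple literals, as it does in the sample.

```python
import ast


def parse_record(line, delimiter="|"):
    """Split one output line into typed fields."""
    name, fmt, size, info, pages = line.strip().split(delimiter)
    return {
        "file": name,
        "format": fmt,
        # literal_eval turns "(2550, 3300)" and "{...}" back into a tuple
        # and a dictionary without executing arbitrary code
        "size": ast.literal_eval(size),
        "info": ast.literal_eval(info),
        "pages": int(pages),
    }


def find_abnormal(lines, expected_compression="group4", expected_dpi=(300, 300)):
    """Return the file names whose compression or dpi differ from expected."""
    flagged = []
    for line in lines:
        record = parse_record(line)
        info = record["info"]
        if (info.get("compression") != expected_compression
                or info.get("dpi") != expected_dpi):
            flagged.append(record["file"])
    return flagged


sample = [
    "ABC0131816.tif|TIFF|(2550, 3300)|{'compression': 'group4', 'dpi': (300, 300)}|1",
    "ABC0131817.tif|TIFF|(2550, 3300)|{'compression': 'group4', 'dpi': (300, 300)}|1",
    # Invented line (not from the real output) so there is something to flag
    "EXAMPLE0001.tif|TIFF|(1700, 2200)|{'compression': 'raw', 'dpi': (200, 200)}|1",
]
print(find_abnormal(sample))  # ['EXAMPLE0001.tif']
```

The same parsed records could just as easily be written out with the csv module if a spreadsheet is the final destination.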

 
