LFP…WTF?

Posted By Adam Reilly on March 4, 2010

Image Source: xkcd

In this post, we’ll build on the previous post’s technique of iterating through a file line by line.  LFP files are an extremely common form of data interchange as document sets trade hands in litigation.  Their popularity is probably due in part to their simplicity.  As a review, LFP files are plain text files, where each record is a comma-delimited, newline-terminated collection of five fields.  Find more details on the file format or fields here (http://platinumlit.wik.is/%28LFP%29_iPRO_Load_File).

Since the record structure is fairly simple and predictable, MS Access, Excel or SQL databases are popular choices for manipulating or exploring LFP files.  These tools are certainly appropriate for the job; however, it is possible to exceed the storage capacity of Excel and even Access in certain extreme cases.  At a minimum, each of these approaches requires a certain amount of overhead associated with importing the data.  Python can offer a dramatic speedup for large LFP files or tasks (QC, reporting, etc.) that need to be performed repeatedly.  We will work through a few such cases in the remainder of this post.

Sample Data

For the next several examples, we’ll use a small fictitious dataset comprised of the following records (if only they could all be this simple).  The set consists of ten single-page TIFF images taken from three documents.

IM,ABC00001,D,0,@DEF1022104;DEF10221042\0000;ABC00001.TIF;2

IM,ABC00002, ,0,@DEF1022104;DEF10221042\0000;ABC00002.TIF;2

IM,ABC00003, ,0,@DEF1022104;DEF10221042\0000;ABC00003.TIF;2

IM,ABC00004, ,0,@DEF1022104;DEF10221042\0000;ABC00004.TIF;2

IM,ABC00005, ,0,@DEF1022104;DEF10221042\0000;ABC00005.TIF;2

IM,ABC000006,D,0,@DEF1022104;DEF10221042\0000;ABC000006.TIF;2

IM,ABC000007, ,0,@DEF1022104;DEF10221042\0000;ABC000007.TIF;2

IM,ABC000008, ,0,@DEF1022104;DEF10221042\0000;ABC000008.TIF;2

IM,ABC00009,D,0,@DEF1022104;DEF10221042\0000;ABC00009.TIF;2

IM,ABC00010, ,0,@DEF1022104;DEF10221042\0000;ABC00010.TIF;2

Reporting: Document Statistics

Just as in the previous example, we’ll need to open the file for reading with the following line:

datFile = open("..\\testData\\sample.lfp",'r')

Then, we’ll use a ‘for’ loop to iterate through each line of the file:

for line in datFile:

Finally, we’ll perform some string manipulation to transform each record into its individual fields.  The rstrip method removes the newline from the end of each line, and split breaks a string into substrings, based on the supplied delimiter (a comma in this case).  This is similar to Excel’s “Text to columns” function.

fields = line.rstrip("\r\n").split(",")

If line contains “IM,ABC00001,D,0,@DEF1022104;DEF10221042\0000;ABC00001.TIF;2” the operations will proceed in the following steps, from left to right:

     
  1. rstrip will remove the newline from the end of the line.
  2. split(",") will identify all commas in the string and build a list according to the delimiters.
  3. Finally, fields will be set equal to a list containing the following values (notice that field numbers start at 0):

    fields[0]  fields[1]  fields[2]  fields[3]  fields[4]
    IM         ABC00001   D          0          @DEF1022104;DEF10221042\0000;ABC00001.TIF;2
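The steps above can be tried directly on the first sample record.  This is only a minimal sketch: the variable name record is made up for illustration, and the r"" prefix is used so that the backslash in the fifth field is kept as a literal character.

```python
# One sample record as it would arrive from the file, newline included.
# The raw-string prefix keeps the backslash in the fifth field literal.
record = r"IM,ABC00001,D,0,@DEF1022104;DEF10221042\0000;ABC00001.TIF;2" + "\n"

# Step 1: strip the trailing newline.  Step 2: split on commas.
fields = record.rstrip("\r\n").split(",")

# Show each field next to its index (indexes start at 0).
for index, value in enumerate(fields):
    print("fields[" + str(index) + "] = " + value)
```

Note that the semicolons inside the fifth field are left alone, because split only breaks on the delimiter it is given.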

 


With this basic construct, we can now begin to add code to discover and track features from the data.  Many features can be tracked simultaneously.  For instance, it’s common to want to know how many pages and documents are represented by a particular LFP file.  Page count for this data can be captured by initializing a counter variable outside of the loop and incrementing it with each line.  Similarly, document count can be obtained by incrementing a counter every time a non-empty value in the third field is encountered.

numDocs = 0

numPages = 0

for line in datFile:

  fields = line.rstrip("\r\n").split(",")

  numPages += 1

  if(fields[2] != " "):

    numDocs+=1

When the loop is finished iterating, numDocs and numPages should contain the appropriate values.
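To try the counting logic without a file on disk, the loop can be wrapped in a function and fed a list of lines.  The sketch below is one possible arrangement, not the only one: the function name count_pages_and_docs, the TEMPLATE string and the records list are all made up for illustration, and the document test uses strip() so that an empty field and a single space are both treated as "no document break" (an assumption that goes slightly beyond the check shown above).

```python
def count_pages_and_docs(lines):
    """Return (numPages, numDocs) for an iterable of LFP record lines."""
    numPages = 0
    numDocs = 0
    for line in lines:
        fields = line.rstrip("\r\n").split(",")
        if len(fields) < 3:
            continue  # assumption: skip blank or malformed lines
        numPages += 1
        # A non-blank third field marks the first page of a document.
        if fields[2].strip() != "":
            numDocs += 1
    return numPages, numDocs


# Rebuild the ten sample records from their Bates numbers and document flags.
TEMPLATE = r"IM,{bates},{flag},0,@DEF1022104;DEF10221042\0000;{bates}.TIF;2"
records = [
    ("ABC00001", "D"), ("ABC00002", " "), ("ABC00003", " "),
    ("ABC00004", " "), ("ABC00005", " "), ("ABC000006", "D"),
    ("ABC000007", " "), ("ABC000008", " "), ("ABC00009", "D"),
    ("ABC00010", " "),
]
lines = [TEMPLATE.format(bates=b, flag=f) + "\n" for b, f in records]

numPages, numDocs = count_pages_and_docs(lines)
print("Number of Documents: " + str(numDocs) + ", Number of Pages: " + str(numPages))
```

Because a file object is itself an iterable of lines, the same function should also accept the result of open() directly.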

QC: Finding Abnormalities

If you look at the data, you will notice that there is one document which is seemingly named with a different convention than the others.  Files starting with ABC000006 through ABC000008 are zero-padded to six places instead of five.  This can be easily detected and fixed with Python.

We’ll start out by assuming that all Bates numbers in this production should have uniform prefixes and padding length.  If that’s the case, then every Bates number in the file should be the same length, and adding code to detect otherwise is a simple matter, using Python’s built-in len() function.

if(len(fields[1]) > 8):

  nonConformNum = fields[1]

       

  print(str(numPages) + ": " + nonConformNum)

This code checks for any Bates numbers that are longer than eight characters (a three-character prefix plus five digits of zero-padded number).  If any are encountered, the script will print the current value of numPages (which equals the line number at any step in the loop, since it is incremented once per line) and the non-conforming Bates number.  This is helpful, because it alerts us to the presence of non-conforming values and provides line numbers and values to aid the search.  From this point, it’s only a little extra work to add code which fixes the problem.
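The same check can be packaged so that it returns its findings instead of printing them.  This is a hedged sketch rather than the code used in this post: the function name find_nonconforming and the expected_len parameter are made up for illustration, and it flags any length other than the expected one (too short as well as too long), which is a small variation on the "greater than eight" test above.

```python
def find_nonconforming(lines, expected_len=8):
    """Return (line_number, bates) pairs whose Bates number has an unexpected length."""
    problems = []
    # enumerate(..., start=1) gives the same 1-based line numbers as numPages.
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split(",")
        if len(fields) > 1 and len(fields[1]) != expected_len:
            problems.append((line_number, fields[1]))
    return problems


# Three records taken from the sample data (raw strings keep the backslash literal).
lines = [
    r"IM,ABC00005, ,0,@DEF1022104;DEF10221042\0000;ABC00005.TIF;2",
    r"IM,ABC000006,D,0,@DEF1022104;DEF10221042\0000;ABC000006.TIF;2",
    r"IM,ABC00009,D,0,@DEF1022104;DEF10221042\0000;ABC00009.TIF;2",
]

for line_number, bates in find_nonconforming(lines):
    print(str(line_number) + ": " + bates)
```

The line numbers here are relative to the three-line list, so the padded record is reported on line 2 rather than line 6.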

Writing Files: Outputting the Fix

We’ve already determined a way to find non-conforming lines and established that the errors are ‘cosmetic’ and can be safely fixed without any further investigation.  Since the piece of the string that we want to modify is in the middle, we can’t use simple functions such as the left and right truncation available in programs like Excel and Access.  We’ll need to take advantage of Python’s advanced string subscripting operator, which provides a compact notation for extracting a piece of a string.  We’ve seen in prior examples that one element in a Python list is accessed by placing a number within [] brackets.  Python also allows use of a range to return a sublist, and the same notation works on strings.  For example, we could isolate the prefix (the first three characters in a string of nine characters) by specifying the range 0:3.

batesPrefix = nonConformNum[0:3]  #Will store ‘ABC’

We can use this same principle to capture the numerical portion of the Bates number and apply some extra commands to format it correctly at the same time.

batesNumber = nonConformNum[3:].lstrip("0").zfill(5)

As before, the compound statement to the right of the equal sign starts at the left and works to the right, one method at a time.  Because indexes start at 0, the slice [3:] begins at the fourth character, immediately after the three-character prefix.  The statement does the following three things inline:

Command              Description                                                         Data
nonConformNum[3:]    Select every character from the fourth one (index 3) onward         000006 (from ABC000006)
lstrip("0")          Strip all 0’s from the beginning of the resulting string            6
zfill(5)             Add the correct zero padding (five digits total) to what is left    00006


When all three statements have completed, batesNumber now contains the correctly padded numerical portion of the Bates number.  These commands could be broken into multiple lines, but it is slightly more compact to represent them as a compound statement on one line, and we don’t need to save any of the intermediate results. 
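The prefix and number steps can also be combined into one small helper.  This is a sketch under stated assumptions: the function name normalize_bates and its prefix_len and width parameters are made up for illustration, and the slice starts at prefix_len (index 3 for a three-character prefix) so that no digit of the number is skipped.

```python
def normalize_bates(bates, prefix_len=3, width=5):
    """Re-pad the numeric part of a Bates number to a fixed width."""
    prefix = bates[:prefix_len]              # e.g. "ABC"
    digits = bates[prefix_len:]              # e.g. "000006"
    # Drop every leading zero, then pad back out to the desired width.
    number = digits.lstrip("0").zfill(width)
    return prefix + number


print(normalize_bates("ABC000006"))  # over-padded input
print(normalize_bates("ABC00001"))   # already conforming, returned unchanged
```

One thing to keep in mind is that zfill only pads; a number with more significant digits than width (for example ABC123456) would pass through at its original length.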

All that’s left is to add code to handle the output of our corrected data.  Assuming we’ll want to capture results in a new file, we’ll use a slight variation on the open statement which we’ve been using to open source data.  This will need to be specified before the loop.

correctedFile = open("..\\testData\\sample_corrected.lfp",'w')

This is almost identical to previous uses of the open command, with the exception of the ‘w’ parameter that is passed to the function.  This tells Python that the file should be opened for Writing.  If the file does not already exist at this location, Python will create it and open it as a blank file.  If it does exist, its contents will be deleted and it will be opened as a blank file.  (Note: be *very* careful when opening files for writing in Python, as any pre-existing data will be LOST).  correctedFile will now be available for writing within the loop.
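The overwrite behavior is easy to see in a throwaway directory.  The sketch below is an illustration only: the names workDir, outPath and checkFile are made up, the temporary directory stands in for the testData folder, and the with statement is used here (instead of a bare open call) so that each file is closed automatically.

```python
import os
import tempfile

# Work in a temporary directory so that no real data is at risk.
workDir = tempfile.mkdtemp()
outPath = os.path.join(workDir, "sample_corrected.lfp")

# First write: the file does not exist yet, so 'w' creates it.
with open(outPath, "w") as correctedFile:
    correctedFile.write("first version\n")

# Second write: 'w' empties the existing file before anything is written.
with open(outPath, "w") as correctedFile:
    correctedFile.write("second version\n")

# Read the file back to confirm that only the second write survived.
with open(outPath, "r") as checkFile:
    contents = checkFile.read()

print(repr(contents))
```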

Before presenting the full code, we’ll present the join method, which saves a lot of typing if you’re outputting simple delimited records, like those found in an LFP.  The syntax might look a little strange if you’re new to object-oriented programming, but it’s intuitive as long as you remember what you’re trying to accomplish.

",".join(fields)

Join takes a list as its argument and flattens it by gluing each item together, placing the string in double quotes (here, a comma) between the fields.  It takes data which once resided in compartmentalized and separate cells and flattens it into one string, with a marker to delineate the old boundaries.  This is not unlike saving an Excel file to a CSV.
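A quick round trip shows that join is the mirror image of split.  This is a minimal sketch using the first sample record; the variable name record is made up for illustration.

```python
# The five fields of the first sample record, already split apart.
fields = ["IM", "ABC00001", "D", "0", r"@DEF1022104;DEF10221042\0000;ABC00001.TIF;2"]

# Glue the fields back together with a comma between each pair.
record = ",".join(fields)
print(record)

# Splitting the joined string on the same delimiter restores the list.
print(record.split(",") == fields)
```

join does not add a trailing newline, so one has to be appended by hand before the record is written to a file.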

Putting it all together

Here’s the full working code:

if __name__ == "__main__":

  # Open the LFP file for reading

  lfpFile = open("..\\testData\\sample.lfp",'r')

  # Initialize counters outside of the line-by-line

  # iteration,  These variables will keep track of

  # LFP features as the program steps through each line

  # of the file

  numDocs = 0

  numPages = 0

  # Variables to track QC steps while stepping through

  # the file

  nonConformingBates=0

  correctedFile = open("..\\testData\\sample_corrected.lfp",'w')

  print("Incorrect lines:")

  print("================")

  # Use a for loop to step through each line of the file

  for line in lfpFile:

      # this line applies two methods to the line

      # variable in order to normalize it for the remaining

      # steps.  Chained method calls are applied from left to

      # right, in order.

      #  1) rstrip -> removes the newline character from each line

      #  2) split -> scans the string for supplied delimiter and

      #          breaks it into substrings as it finds them

      fields = line.rstrip("\r\n").split(",")

      # Each line in this file corresponds to one page in the set

      numPages += 1

      # Non-empty field 2 means the start of a new document

      if(fields[2] != " "):

        numDocs+=1

      # QC check to detect Bates numbers that are longer than 8

      # characters

      if(len(fields[1]) > 8):

        # Assign the incorrect Bates number to a string

        nonConformNum = fields[1]

       

        # Print the line number and bad number for reporting

        print(str(numPages) + ": " + nonConformNum)

        # isolate the bates prefix by selecting the first three

        # characters of the sequence

        batesPrefix = nonConformNum[0:3]

        # pull out the numerical portion of the beg Bates number

        # and format it with the correct number of zeros

        batesNumber = nonConformNum[3:].lstrip("0").zfill(5)

        # Overwrite the incorrectly padded number in the field

        # list

        fields[1] = batesPrefix + batesNumber

        # use the join method to merge all fields together with commas

        correctedFile.write(",".join(fields) + "\n")

        #back to the top of the loop

      else:

        # this case will be reached if the beg bates has the correct

        # number of characters, thus no processing is necessary

        # it can simply be copied over to the new file

        correctedFile.write(line)

        #back to the top of the loop

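  # Close both files so the corrected output is flushed to disk

  lfpFile.close()

  correctedFile.close()

  # Display the final values of the variables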

  print()

  print("Summary:")

  print("========")

  print("Number of Documents: " + str(numDocs) + ", Number of Pages: " + str(numPages))

Running this code with the sample input yields the following results:

Incorrect lines:

================

6: ABC000006

7: ABC000007

8: ABC000008

Summary:

========

Number of Documents: 3, Number of Pages: 10

 
