How to check a file for duplicate lines - part 2

Posted by Adam Reilly on February 15, 2010

This is a quick update to the last post.  In the previous version of the duplicate record detector, the input file was specified statically (or “hard coded”) inside the script.  This means that the source code must be modified each time a user wants to analyze a new load file.

Unlike compiled languages such as C++ or Java, Python doesn’t have a lengthy build cycle associated with making changes, so editing the script for each run isn’t too inconvenient.  However, your users might not be comfortable directly modifying source code, and there’s also the potential to introduce bugs by changing the wrong line.  Fortunately, Python provides a way to pass data to a program via the command line.

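The sys Module

Python has a built-in module called sys that provides access to the interpreter and its runtime environment.  The sys module is full of useful functions and variables, but we’ll just be using its argument passing functionality: the command-line arguments are available as the list sys.argv, where sys.argv[0] is the name of the script and sys.argv[1] is the first argument.  Modules are imported into Python scripts via the import statement.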
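As a minimal sketch of the argument passing mechanism on its own: sys.argv is an ordinary list of strings, so it can be split into the script name and the user-supplied arguments.  The helper name describe_args is hypothetical and only used here for illustration.

```python
import sys

def describe_args(argv):
    """Split an argv-style list into (script_name, arguments)."""
    # argv[0] is the script name as it was invoked; everything after it
    # is an argument typed by the user, always delivered as a string.
    return argv[0], argv[1:]

if __name__ == "__main__":
    script, args = describe_args(sys.argv)
    print("Script:", script)
    print("Arguments:", args)
```

Saved under any file name and run with one or more arguments, this prints the script name followed by the list of arguments it received.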

Here’s the full working code:

import sys
import hashlib
import collections

# Defines a function that takes a string as its argument and returns the
# hexadecimal representation of its MD5 checksum
#  In: A string
#  Out: A string of hex characters corresponding to the checksum
def calculate_md5(inStr):
    # Create an instance of the md5 object from Python's hashlib
    md5Obj = hashlib.md5()

    # Convert the string to a series of raw bytes,
    # assuming that it's UTF-8 encoded
    md5Obj.update(bytes(inStr, 'utf8'))

    # Render the object as a hex encoded md5 hash value
    return md5Obj.hexdigest()

if __name__ == "__main__":
    # Default factory method which creates an empty dictionary of lists
    lineDict = collections.defaultdict(list)

    # Keep a counter variable to track which line of the file we're on
    i = 1

    # Create an iterable file object, using the sys module to pull
    # the file name in from the command line
    datFile = open(sys.argv[1], 'r')

    # Cycle through each line of the file
    for line in datFile:
        # Calculate the checksum of the record
        lineHash = calculate_md5(line)

        # Either create a new entry in the dictionary or append
        # to the list of lines with the same checksum
        lineDict[lineHash].append(i)

        # Advance the counter to move to the next line
        i += 1

    # Close the file now that every line has been read
    datFile.close()

    # Finally, some code to print out the results

    # Print a title
    print("Duplicate Lines")

    # Cycle through each slot or 'key' in the dictionary
    for entry in lineDict:
        # If the length of the list is 2 or greater, print it out
        if len(lineDict[entry]) > 1:
            print(lineDict[entry])
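To see what the output looks like without creating a load file, the same grouping logic can be tried on a small in-memory list.  This is only a sketch: the function name find_duplicates and the sample data are made up for illustration and are not part of the script above.

```python
import collections
import hashlib

def find_duplicates(lines):
    """Return lists of 1-based line numbers whose contents are identical."""
    line_dict = collections.defaultdict(list)
    for number, line in enumerate(lines, start=1):
        # Same idea as calculate_md5: hash the UTF-8 bytes of the line
        digest = hashlib.md5(line.encode("utf-8")).hexdigest()
        line_dict[digest].append(number)
    # Keep only the checksums that were seen more than once
    return [numbers for numbers in line_dict.values() if len(numbers) > 1]

# Hypothetical sample data standing in for the lines of a load file
sample = ["alpha\n", "beta\n", "alpha\n", "gamma\n", "beta\n", "alpha\n"]
print(find_duplicates(sample))  # prints [[1, 3, 6], [2, 5]]
```

Each inner list holds the line numbers that share a checksum.  Note that the newline character is part of each line, so a final line with no trailing newline will not match an otherwise identical earlier line.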

A few notes on DOS

This code can be run from the DOS prompt with the following command:

C:\> python findDupLines.py "C:\Path to\theFile.txt"

There are a few important points to keep in mind when running Python scripts from the command prompt.  The most important is that Python must be able to find your script.  The easiest way to ensure this is to change directories to the location where your script resides.  In the example above, the findDupLines.py script would have to be located at the root of the C: drive.  Also notice the double quotes (") surrounding C:\Path to\theFile.txt.  These are necessary because of the space character in the path: without them, the command prompt splits the path at the space and passes it to the script as two separate arguments.  If any folder or file name in your path contains spaces, double quotes are mandatory.
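Since a missing or unquoted path is an easy mistake to make, the script could check the argument count before opening anything.  The following is a minimal sketch of that idea, not part of the original script; the helper name get_input_path and the usage message are assumptions made for illustration.

```python
import sys

def get_input_path(argv):
    """Return the input file path, or None if the argument count is wrong."""
    # Expect exactly one argument after the script name.  An unquoted
    # path containing a space arrives as several separate arguments,
    # and a forgotten path leaves only the script name.
    if len(argv) != 2:
        return None
    return argv[1]

if __name__ == "__main__":
    path = get_input_path(sys.argv)
    if path is None:
        print('Usage: python findDupLines.py "C:\\Path to\\theFile.txt"')
    else:
        print("Input file:", path)
```

With a check like this, forgetting the quotes produces a usage reminder rather than an attempt to open a file named C:\Path.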
