How to check a file for duplicate lines - part 2
Posted By Adam Reilly on February 15, 2010
This will just be a quick update to the last post. In the previous version of the duplicate record detector, the input file was specified statically (or “Hard Coded”) inside the script. This means that the source code must be modified each time users want to run the analysis on a new load file.
Unlike compiled languages such as C++ or Java, Python doesn’t have a lengthy build cycle associated with making changes, so editing the script isn’t too inconvenient. However, your users might not be comfortable directly modifying source code, and there’s also the potential to introduce bugs by changing the wrong line. Fortunately, Python provides a way to pass data to a program via the command line.
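The sys Module
Python has a built-in module called sys for interacting with the interpreter and its runtime environment. The sys module is full of useful functions and attributes, but we’ll just be using its argument passing functionality: sys.argv is a list of strings holding the command line arguments, with the script name at index 0 and the first user-supplied argument at index 1. Modules are imported into Python scripts via the import statement.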
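As a minimal sketch of the argument passing described above, the snippet below pulls the input path out of an argument list and stops with a usage message when none was supplied. The helper name get_input_path and the wording of the usage message are assumptions made for illustration; they are not part of the original script.

```python
def get_input_path(argv):
    # argv[0] is the script name; argv[1] is the first user-supplied argument.
    # get_input_path is a hypothetical helper, not part of the original script.
    if len(argv) < 2:
        # SystemExit stops the program and prints the message to stderr
        raise SystemExit("Usage: python findDupLines.py <input file>")
    return argv[1]

# In the real script the list would be sys.argv (after "import sys");
# a hand-built list is used here so the sketch runs anywhere.
example_argv = ["findDupLines.py", "C:\\Path to\\theFile.txt"]
print(get_input_path(example_argv))
```

Checking the length first gives users a readable message instead of an IndexError when they forget the file name.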
Here’s the full working code:
import sys
import hashlib
import collections

# Defines a function that takes a string as its argument and returns the
# hexadecimal representation of its MD5 checksum
# In: A string
# Out: A string of hex characters corresponding to the checksum
def calculate_md5(inStr):
    #create an instance of the md5 object from
    #python's hashlib
    md5Obj = hashlib.md5()
    #Convert the string to a series of raw bytes
    #assuming that it's UTF-8 encoded
    md5Obj.update(bytes(inStr, 'utf8'))
    #Render the object as a hex encoded md5 hash value
    return md5Obj.hexdigest()

if __name__ == "__main__":
    #Create a dictionary whose missing keys default
    #to an empty list
    lineDict = collections.defaultdict(list)
    #keep a counter variable to track which line
    #of the file we're on
    i = 1
    #Create an iterable file object
    # use the sys module to pull arguments in from the command line
    datFile = open(sys.argv[1], 'r')
    #cycle through each line of the file
    for line in datFile:
        #Calculate the checksum of the record
        lineHash = calculate_md5(line)
        #Either create a new entry in the dictionary
        #or append to the list of lines with the same
        #checksum
        lineDict[lineHash].append(i)
        #Advance the counter to move to the next line
        i += 1
    #Close the file now that every line has been read
    datFile.close()
    # Finally, some code to print out the results
    #Print a title
    print("Duplicate Lines")
    #Cycle through each slot or 'key' in the dictionary
    for entry in lineDict:
        #If the length of the list is 2 or greater
        #print it out
        if len(lineDict[entry]) > 1:
            print(lineDict[entry])
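The grouping logic in the listing can also be sketched as a function that accepts any iterable of lines, which makes it easy to try out without a file on disk. This is only one possible arrangement: the function name find_duplicate_lines and the sample data are assumptions for illustration, and enumerate stands in for the manual counter.

```python
import collections
import hashlib

def find_duplicate_lines(lines):
    # Hypothetical helper: maps each MD5 checksum to the list of
    # 1-based line numbers that produced it
    line_dict = collections.defaultdict(list)
    # enumerate replaces the manual counter; start=1 keeps line numbers 1-based
    for line_number, line in enumerate(lines, start=1):
        digest = hashlib.md5(line.encode("utf8")).hexdigest()
        line_dict[digest].append(line_number)
    # Keep only the checksums seen on two or more lines
    return [numbers for numbers in line_dict.values() if len(numbers) > 1]

# Made-up sample data: lines 1 and 3 match, as do lines 2 and 5
sample = ["alpha\n", "beta\n", "alpha\n", "gamma\n", "beta\n"]
print(find_duplicate_lines(sample))
```

An open file object can be passed in place of the sample list, since iterating over a file yields its lines one at a time.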
A few notes on DOS
This code can be run from the DOS prompt with the following command:
C:\> python findDupLines.py "C:\Path to\theFile.txt"
There are a few important points to keep in mind when running Python scripts from the command prompt. The most important is that Python must be able to find your script. The easiest way to ensure this is to change directories to the location where your script resides. In the example above, the findDupLines.py script would have to be located at the root of my C: drive. Also notice the double quotes (") surrounding C:\Path to\theFile.txt. These are necessary because of the space character in the path: without them, the path is split at the space and reaches the script as two separate arguments. If any folders in your path contain spaces, double quotes are mandatory.
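To illustrate the quoting issue, here is a small sketch of a check the script could run before opening the file. It assumes that an unquoted path containing a space arrives as more than one argument; the function name check_arguments and its messages are hypothetical and not part of the original script.

```python
def check_arguments(argv):
    # Everything after the script name counts as a user-supplied argument.
    # check_arguments is a hypothetical helper for illustration only.
    args = argv[1:]
    if len(args) == 0:
        return "No input file given."
    if len(args) > 1:
        # More than one argument usually means the path was split at a space
        return "Got %d arguments; if the path contains spaces, wrap it in double quotes." % len(args)
    return "OK: " + args[0]

# What the script would see with and without the surrounding double quotes
quoted = ["findDupLines.py", "C:\\Path to\\theFile.txt"]
unquoted = ["findDupLines.py", "C:\\Path", "to\\theFile.txt"]
print(check_arguments(quoted))
print(check_arguments(unquoted))
```

In the real script the list passed in would be sys.argv.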