How to check a file for duplicate lines Posted By Adam Reilly on February 10, 2010
In this edition of “eDiscovery-related Python Tricks,” we’ll cover some fundamental techniques and operations that you’ll likely find yourself using repeatedly. Suppose you’ve been given the task of merging load files from several productions together.
You’re fairly sure that merging several files together has left the load file with duplicative lines, but the file is large and this would be difficult to determine manually. While this example may seem a little contrived, it will provide a simple setup for laying foundation that will likely be re-used when we get to more interesting examples.
Opening files
The first thing we’ll need to do is tell Python where it can find the file it will be reading. This is accomplished with the open function. In Python open accepts a path and a few arguments in order to return a file object which the program can manipulate. Don’t worry too much if this terminology doesn’t make sense, you’ll get a feel for it. Assuming that the file resides at C:\loadFile\bigLoadFile.dat, we’d write the following code to open the file for reading:
datFile = open(‘C:\\loadFile\\bigLoadFile.dat’, ‘r’)
This line searches the specified path for a file with the given name. You’ll notice that the path slashes are doubled (i.e., C:\\ instead of C:\). The “\” character is special in Python, and is used to designate non-printable characters. A double slash “\\” is Python’s way of denoting a folder separator, so that this code will run in a Windows environment. The second argument ‘r’ tells Python that this file will be open *only* for reading.
Iterating line-by-line
Since we’ve determined that we’re interested in finding duplicative lines, we’ll need a way to access each line within the file. Python provides a convenient method for doing this by making its file object “iterable”. This is the computer science-y way of saying that file objects have a next() method that will pull the next line out of a file until the end of the file is reached. This can be wrapped into a construct called a “for loop” which will cause a Python program to execute a block of code until a certain condition is reached. Code to access a file line by line would look like this:
for line in datFile:
“for” is a special word within Python programs which marks the beginning of the loop. “line” is an arbitrarily-chosen variable name which will hold the result of pulling successive lines out of datFile.
Dictionaries and MD5 Calculation
Now that we’ve set up the basic structure for looping through the entire file, it’s time to give some thought to the duplicate detection strategy. Since there is no limit to the number of records a load file can contain, nor any size limit on rows, we’ll want to come up with an efficient way to capture and represent the data. Additionally, use Python’s dictionary data structure to keep track of lines that we’ve seen so far.
lineDict = collections.defaultdict(list)
This line takes advantage of the Python collections library to create a dictionary which stores a list in each of its open slots. We will calculate an MD5 value for each record in the DAT file. Any identical MD5 values will mean that lines are exactly identical, so keeping track of line numbers and updating the dictionary accordingly should give a summary of identical lines. The only missing piece is a way to calculate an MD5 hash value, given a string. The following lines will be used to accomplish this:
def calculate_md5(inStr):
md5Obj = hashlib.md5()
md5Obj.update(bytes(inStr,‘utf8’))
return md5Obj.hexdigest()
Briefly, we’re taking advantage of another Python library to generate the hash values and returning them as a string.
Putting it all together
Here’s the full working code:
import hashlib
import collections
# Defines a function that takes a string as its argument and returns the
# hexadecimal representation of its MD5 checksum
# In: A string
# Out: A string of hex characters corresponding to the checksum
def calculate_md5(inStr):
#create an instance of the md5 object from
#python’s hashlib
md5Obj = hashlib.md5()
#Convert the string to a series of raw bytes
#assuming that it’s UTF-8 encoded
md5Obj.update(bytes(inStr,‘utf8’))
#Render the object as a hex encoded md5 hash value
return md5Obj.hexdigest()
# Main Program starts HERE!!!!!
if __name__ == “__main__”:
#Default factory method which creates an empty
#dictionary of lists
lineDict = collections.defaultdict(list)
#keep a counter variable to track which line
#of the file we’re on
i=1
#Create an iterable file object
datFile = open(‘C\\loadFile\\bigLoadFile.dat’, ‘r’)
#cycle through each line of the file
for line in datFile:
#Calculate the checksum of the record
lineHash = calculate_md5(line)
#Either create a new entry in the dictionary
#or append to the list of lines with the same
#check sum
lineDict[lineHash].append(i)
#Advance the counter to move to the next line
i+=1
# Finally, some code to print out the results
#Print a title
print(“Duplicate Lines”)
#Cycle through each slot or ‘key’ in the dictionary
for entry in lineDict:
#If the length of the list is 2 or greater
#print it out
if len(lineDict[entry]) > 1:
print(lineDict[entry])
Running this code with sample input:
1. fee
2. foo
3. foo
4. fee
5. few
6. few
7. few
8. few
9. fine
10. pine
Yields the following output:
Duplicate Lines
[5, 6, 7, 8]
[2, 3]
[1, 4]
As you can see, the script correctly captures and summarizes duplicate lines within the file. Notice that number ten does not appear in the sample, because it is unique. Despite the fact that the sample input is small, the approach should scale up to very large files. Using the MD5 value instead of the actually string allows the algorithm to store a small, fixed amount of data per record regardless of its size, so it’s unlikely that you’ll hit memory limitations.
Again, even if you don’t ever anticipate having to perform this specific task, it provides the shell for tasks that have to be performed repeatedly. I hope that you’ve found this useful, and stay tuned for more eDiscovery-related Python Tricks.
Post A Comment