Capturing File System Metadata
Posted by Adam Reilly on April 4, 2010
This script will be a little shorter than some of the previous examples. However, it represents a fairly common use case within the field of eDiscovery. As data moves from party to party in the collection/preservation stage of a matter, related files are often lumped into folders according to organizational need. Summaries of the information in these folders are often crucial to everything from formulating a review strategy to determining timelines. In this post, we’ll look at a technique for capturing file system metadata and collecting it for reporting purposes.
Glob
There are many different ways to traverse the file system with Python. One of the simplest uses the glob module (http://docs.python.org/library/glob.html). The somewhat unintuitive name is a throwback to early Unix days, and refers to the process of finding all strings that match a particular pattern. To make glob available to your script, first add the appropriate import statement to the top of the source file.
import glob
Then, it’s possible to call the glob function within the glob module (just go with it) to start building lists of filenames. Once files are stored in lists, we’re free to start capturing the information we care about. To build the list, we’ll use the following syntax:
fileList = glob.glob(base_path + "*")
Notice that we are supplying a “*” wildcard along with a base path. This causes glob to look in the folder we specify and return a list of every name that fits the pattern. This pattern matches everything, but we could just as easily use stricter conditions such as “*.tif” (all TIFF images) or “*\\natives\\*” (only files in a natives folder), depending on our specific task. Note that glob matches against file and folder names only, never file contents, and that a pattern like this one only returns results for the folder it names, not its subfolders.
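To make the pattern behavior concrete, here is a minimal sketch that builds a throwaway folder and runs two patterns against it. The helper name list_matches and the sample filenames are invented for illustration; only glob.glob itself comes from the discussion above.

```python
import glob
import os
import tempfile


def list_matches(base_path, pattern="*"):
    """Return the sorted base names in base_path that match pattern."""
    full_pattern = os.path.join(base_path, pattern)
    return sorted(os.path.basename(p) for p in glob.glob(full_pattern))


# Build a temporary folder so the example does not depend on real data.
with tempfile.TemporaryDirectory() as base_path:
    for name in ("page1.tif", "page2.tif", "notes.txt"):
        open(os.path.join(base_path, name), "w").close()

    # A subfolder with its own TIFF, to show that "*" does not descend into it.
    os.mkdir(os.path.join(base_path, "natives"))
    open(os.path.join(base_path, "natives", "deep.tif"), "w").close()

    everything = list_matches(base_path)           # files and folders alike
    tiffs_only = list_matches(base_path, "*.tif")  # top-level TIFFs only

print(everything)   # ['natives', 'notes.txt', 'page1.tif', 'page2.tif']
print(tiffs_only)   # ['page1.tif', 'page2.tif']
```

The natives folder itself appears in the first list, but the file inside it appears in neither, which is the "current folder only" behavior described above.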
Stat
The stat module (http://docs.python.org/library/stat.html) is named after yet another throwback to Unix: stat is the name of a system call used to retrieve very detailed information about files in a file system. We will use a subset of its capability to capture MAC (modified, accessed, created) times and size for every file in a directory. The stat function itself lives in the os module, which must be imported with:
import os
This line allows calls to the stat function like the following:
fs_metadata = os.stat(path_to_file)
fs_metadata receives the result of the stat function, which consists of several pieces of metadata about the file whose full path is supplied as an argument. For purposes of this demonstration, we will access and save the size in bytes and the various times associated with each file in a folder. Once the assignment has occurred, the individual fields are available through dot notation. For instance, the file’s size is stored in the “st_size” field.
sizeInBytes = fs_metadata.st_size
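This will save a non-negative integer for later, when we print to a summary. Accessing the file’s MAC times is similar, as we can see from the example of accessing modified time (created and accessed will be demonstrated in the full source listing at the end of the article). Be aware that the “created” field, st_ctime, holds the creation time on Windows but the time of the last metadata change on Unix systems.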
modTime = fs_metadata.st_mtime
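As a quick illustration of the fields discussed so far, the sketch below writes a five-byte file to a temporary folder and reads its metadata back. The helper name describe_file and the sample file are hypothetical; the st_size, st_mtime, st_atime and st_ctime fields are the ones os.stat returns.

```python
import os
import tempfile


def describe_file(path_to_file):
    """Return (size in bytes, modified, accessed, st_ctime) for one file."""
    fs_metadata = os.stat(path_to_file)
    return (
        fs_metadata.st_size,
        fs_metadata.st_mtime,
        fs_metadata.st_atime,
        # Creation time on Windows, last metadata change on Unix.
        fs_metadata.st_ctime,
    )


# Create a small file with known contents so the size is predictable.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.txt")
    with open(sample_path, "wb") as handle:
        handle.write(b"hello")

    size_in_bytes, mod_time, acc_time, cre_time = describe_file(sample_path)

print(size_in_bytes)  # 5
print(mod_time)       # a floating-point timestamp; the value varies per run
```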
Working with Timestamps
There is one final loose end to tie up before the report will be satisfactory. Times reported by os.stat are stored internally as timestamps, that is, the number of seconds that have elapsed since a fixed reference date (the epoch, which is January 1, 1970 on most platforms). If we were to print any of the MAC times without modification, they would look something like “1258124917.17”. While this is perfectly suitable for sorting or comparison, it’s not very intuitive for human consumption. Fortunately, it’s fairly easy to implement a function which takes floating point numbers and converts them to a wide variety of date strings. Indeed, Python has date, time and datetime classes which split these entities into accessible fields and provide many methods for manipulating them. For brevity and simplicity, we will convert our MAC times to the ISO 8601 combined date and time format (http://en.wikipedia.org/wiki/ISO_8601#Combined_date_and_time_representations). This captures both the date and time and combines them into a string which will sort correctly in Excel.
After importing the datetime class from the datetime module, we can write a function which performs the necessary math to convert a floating point number into a datetime object and then calls its isoformat() method.
def floatToTime(timestamp):
    return datetime.fromtimestamp(timestamp).isoformat()
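One thing to keep in mind is that fromtimestamp() with a single argument converts to the local time zone of the machine running the script, so the same file can produce different strings on different machines. The sketch below is a variant, not the function used in this article: it assumes you would rather report in UTC, and the name float_to_utc_time is made up for the example.

```python
from datetime import datetime, timezone


def float_to_utc_time(timestamp):
    """Convert a seconds-since-epoch timestamp to an ISO 8601 string in UTC."""
    # Passing tz=timezone.utc makes the result independent of the local zone.
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


one_day_after_epoch = float_to_utc_time(86400)
sample_time = float_to_utc_time(1258124917)

print(one_day_after_epoch)  # 1970-01-02T00:00:00+00:00
print(sample_time)          # 2009-11-13T15:08:37+00:00
```

The trailing “+00:00” marks the string as UTC. Whether local time or UTC is the better choice depends on how the report will be read, so treat this as an option rather than a correction.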
Putting it all together
Here’s the full working code:
# imports functionality to enumerate files
import glob
# imports functionality to harvest filesystem metadata
import os
from stat import S_ISDIR
# imports functionality for converting times
from datetime import datetime
# imports functionality to read command line arguments
import sys

# Function which takes floating-point style timestamps and converts
# to an ISO-style string (YYYY-MM-DDTHH:MM:SS[.ffffff]). These dates will
# sort properly if imported into a column-oriented data store like
# MS Excel
def floatToTime(timestamp):
    # use fromtimestamp method to convert the floating point
    # number into a 'datetime' object, then call that
    # object's isoformat() method to give back a formatted
    # string
    return datetime.fromtimestamp(timestamp).isoformat()

if __name__ == "__main__":
    base_path = sys.argv[1] + "\\"
    # The glob module will find all files on a certain path
    # which match the pattern provided (in this case we'll use
    # * to match everything)
    fileList = glob.glob(base_path + "*")
    # fileList has a list of files and folders that match the pattern
    # we will iterate over each in this for loop
    for file in fileList:
        # stat takes the full path to a file and returns an
        # object that contains many useful pieces of filesystem
        # metadata.
        fs_metadata = os.stat(file)
        # This if statement guards the print statement so that fs_metadata
        # will only be printed if the entry that we're on is NOT a directory
        # In other words, information should only be printed out for files.
        if not S_ISDIR(fs_metadata.st_mode):
            # We'll capture the file size by accessing a field of the
            # fs_metadata object.
            sizeInBytes = fs_metadata.st_size
            # We'll access three fields from the fs_metadata object to
            # capture Modified, Accessed and Created times for the filenames
            # in the list
            # Note: the times are stored as a floating-point timestamp, so
            # we will use the conversion function to make it slightly more
            # human-readable
            modTime = floatToTime(fs_metadata.st_mtime)
            accTime = floatToTime(fs_metadata.st_atime)
            creTime = floatToTime(fs_metadata.st_ctime)
            # Finally, we'll print all the values into a delimited format that
            # programs like Excel should be able to read easily
            print(file + "|" +
                  str(sizeInBytes) + "|" +
                  modTime + "|" +
                  accTime + "|" +
                  creTime)
Running this code in the “C:\Python31\DLLs” folder yields the following results:
C:\Python31\DLLs\bz2.pyd|68096|2009-08-17T17:03:50|2009-10-20T16:37:52.281250|2009-08-17T17:03:50
C:\Python31\DLLs\py.ico|19790|2007-12-06T08:47:58|2009-10-20T16:37:52.265625|2007-12-06T08:47:58
C:\Python31\DLLs\pyc.ico|19790|2007-12-06T08:47:58|2009-10-20T16:37:52.265625|2007-12-06T08:47:58
C:\Python31\DLLs\pyexpat.pyd|152576|2009-08-17T17:04:36|2009-10-20T16:37:52.281250|2009-08-17T17:04:36
C:\Python31\DLLs\select.pyd|11776|2009-08-17T17:04:46|2009-10-20T16:37:52.281250|2009-08-17T17:04:46
C:\Python31\DLLs\sqlite3.dll|302080|2009-08-13T19:57:14|2009-10-20T16:37:52.328125|2009-08-13T19:57:14
C:\Python31\DLLs\tcl85.dll|867328|2008-11-06T20:29:16|2009-10-20T16:37:52.343750|2008-11-06T20:29:16
C:\Python31\DLLs\tclpip85.dll|8192|2008-06-12T18:15:40|2009-10-20T16:37:52.343750|2008-06-12T18:15:40
This data can be redirected from the command prompt or written to a file and imported cleanly into Excel. Note that we used “|” as a delimiter, as it cannot appear in Windows path strings.
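If you would rather write the report from inside the script than redirect the output, one possible approach is Python’s csv module with “|” as the delimiter. This is a sketch under a few assumptions: the function name write_report and the header row are inventions for the example, and an in-memory buffer stands in for a real output file. The sample row is taken from the results above.

```python
import csv
import io


def write_report(rows, out_file):
    """Write a header plus one pipe-delimited line per row to out_file."""
    writer = csv.writer(out_file, delimiter="|", lineterminator="\n")
    # Hypothetical column names; adjust to whatever your reviewers expect.
    writer.writerow(["path", "size_bytes", "modified", "accessed", "created"])
    writer.writerows(rows)


rows = [
    (
        "C:\\Python31\\DLLs\\py.ico",
        19790,
        "2007-12-06T08:47:58",
        "2009-10-20T16:37:52.265625",
        "2007-12-06T08:47:58",
    ),
]

# io.StringIO stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_report(rows, buffer)
report_text = buffer.getvalue()
print(report_text)
```

A side benefit of the csv module is that it quotes any field that happens to contain the delimiter, which matters if the script is ever run on a file system where “|” is allowed in names.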