Category: how_tos

Polish up your scripts with Optparse

If you've ever written an especially useful or popular script, you've noticed that features tend to creep into the codebase as you encounter variations in the input. As the code evolves to handle more and more variation, you may notice that distinct 'modes' of operation arise. One way to accommodate these different modes is to use values hard-coded into the source. Settings such as field delimiters, input paths, recursive operation and output paths are often wired directly into quickly written scripts.
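As a rough illustration of the alternative, here is a minimal sketch of moving those hard-coded settings into command-line options with the standard library's optparse module. The option names (`--delimiter`, `--output`, `--recursive`) and the `parse_args` helper are hypothetical, chosen only to mirror the settings listed above; the full post may structure things differently.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser for settings that might otherwise be hard-coded."""
    parser = OptionParser(usage="usage: %prog [options] INPUT_PATH")
    # Hypothetical option names, mirroring the settings named in the text.
    parser.add_option("-d", "--delimiter", dest="delimiter", default=",",
                      help="field delimiter used in the input file")
    parser.add_option("-o", "--output", dest="output", default=None,
                      help="path to write results to")
    parser.add_option("-r", "--recursive", dest="recursive",
                      action="store_true", default=False,
                      help="descend into sub-folders")
    return parser


def parse_args(argv):
    """Parse a list of arguments and return (options, input_path)."""
    parser = build_parser()
    options, args = parser.parse_args(argv)
    if len(args) != 1:
        # parser.error prints the usage message and exits.
        parser.error("exactly one input path is required")
    return options, args[0]


# Example call with an explicit argument list; a real script would
# pass sys.argv[1:] instead.
options, input_path = parse_args(["-d", "|", "--recursive", "loadfile.txt"])
print(options.delimiter, options.recursive, input_path)
```

Keeping the parsing in one small function means the rest of the script reads its settings from `options` and never needs to be edited between runs.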

Capturing TIFF metadata

Building from the same basic structure as the file system metadata gatherer (http://logik.com/whats_new/entry/capturing_file_system_metadata/), we can incorporate functionality to pull information from within the file. Once documents have been reviewed and produced, it is very common for them to be converted from their native or ‘dynamic’ form into a more static page-oriented form such as a TIF image. When the number of pages in a production approaches the millions, it becomes impossible to check every file for small details like compression, page orientation and resolution. Using the ‘for’ loop from the previous example and incorporating a third-party library will make it possible to quickly generate a useful summary of all TIF images in a folder.

Capturing File System Metadata

This script will be a little shorter than some of the previous examples. However, it represents a fairly common use case within the field of eDiscovery. As data moves from party to party in the collection/preservation stage of a matter, related files are often lumped into folders according to organizational need. Summaries of the information in these folders are often crucial to everything from formulating a review strategy to determining timelines. In this post, we’ll look at a technique for capturing file system metadata and collecting it for reporting purposes.
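To give a flavor of the idea, the following is a minimal sketch, not the script from the full post. It assumes the metadata of interest is each file's path, size and last-modified time, and that a CSV file is an acceptable report format; the function names `collect_metadata` and `write_report` are hypothetical.

```python
import csv
import os
import time


def collect_metadata(root):
    """Walk a folder tree and gather basic file system metadata."""
    rows = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            modified = time.strftime("%Y-%m-%d %H:%M:%S",
                                     time.localtime(info.st_mtime))
            rows.append({
                "path": path,
                "size_bytes": info.st_size,
                "modified": modified,
            })
    return rows


def write_report(rows, report_path):
    """Write the collected rows to a CSV file with a header line."""
    fields = ["path", "size_bytes", "modified"]
    with open(report_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_report(collect_metadata("some_folder"), "report.csv")` would produce one row per file, where `some_folder` and `report.csv` are placeholder names. Because the script only reads metadata through `os.stat`, it does not open the files themselves.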

LFP…WTF?

In this post, we’ll build on the previous post’s technique of iterating through a file line by line. LFP files are an extremely common form of data interchange as document sets trade hands in litigation. Their popularity is probably due in part to their simplicity. As a review, LFP files are ...

How to check a file for duplicate lines - part 2

This will just be a quick update to the last post. In the previous version of the duplicate record detector, the input file is specified statically (or “hard-coded”) inside the script. This means that the source code must be modified each time users want to run the analysis on a new load file.

Unlike compiled languages like C++ or Java, Python doesn’t have a lengthy build cycle associated with making changes. While editing the script isn’t too inconvenient, your users might not be comfortable directly modifying source code, and there’s also the potential to introduce bugs by changing the wrong line. Fortunately, Python provides a method for passing data to a program via the command line...
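The excerpt stops before naming the mechanism, so the following is only a sketch under the assumption that it means `sys.argv`, the list of command-line arguments that Python makes available to every script. The helper name `get_input_path` and the file names are hypothetical.

```python
import sys


def get_input_path(argv):
    """Return the load file path given as the first command-line argument."""
    # argv[0] is the script name, so the first real argument is argv[1].
    if len(argv) < 2:
        raise SystemExit("usage: %s LOAD_FILE" % argv[0])
    return argv[1]


# A real script would call get_input_path(sys.argv); an explicit list is
# used here so the example behaves the same however it is run.
input_path = get_input_path(["dupes.py", "production.dat"])
print(input_path)
```

With this in place, users run the script against a new load file by changing the argument on the command line rather than editing the source.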

How to check a file for duplicate lines

In this edition of “eDiscovery-related Python Tricks,” we’ll cover some fundamental techniques and operations that you’ll likely find yourself using repeatedly. Suppose you’ve been given the task of merging load files from several productions together. You’re fairly sure that merging several files together has left the load file with duplicative lines, but the file is large and this would be difficult to determine manually. While this example may seem a little contrived, it will provide a simple setup for laying a foundation that will likely be re-used when we get to more interesting examples...
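One simple way to approach the task, sketched here under the assumption that two lines count as duplicates when their text matches exactly once the line ending is stripped, is to count every line and keep those seen more than once. The function name `find_duplicate_lines` is hypothetical, and the full post may use a different approach.

```python
from collections import Counter


def find_duplicate_lines(path):
    """Return a dict mapping each repeated line to its number of occurrences."""
    counts = Counter()
    with open(path, "r") as handle:
        # Iterate line by line so the whole file is never read at once.
        for line in handle:
            counts[line.rstrip("\r\n")] += 1
    return {line: n for line, n in counts.items() if n > 1}
```

The counter still holds one entry per distinct line, so memory use grows with the number of unique lines rather than the size of the file.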

How to make a quick-n-dirty histogram

Most people know that Microsoft Excel can produce a wide variety of charts to visualize data. However, if you need to summarize more rows than Excel can load, or you need SQL for more flexible data manipulation, Microsoft Access also provides a feature called "pivot charts" that lets users generate quick visual summaries of queries.

We'll start by importing a sample data set obtained from the Internet. The data is in the form of Comma Separated Values, or CSV, which is a common data interchange format.

How to hook it up right

Prevent data spoliation by using a simple write-blocking device. These devices are fairly cheap (~$270 @ tableau.com) and well worth the price, considering that spoiling data may just ruin your whole day. Connecting a hard drive to a computer seems simple enough. But if you want to avoid modifying the metadata on the drive, you will need to use a write-blocking device that prevents the computer from updating that metadata. This is very important, especially for legal discovery, where metadata should always be preserved to avoid spoliation.

How to do some powerful dos commands part1

Despite its simple appearance, the humble Command shell can be an extremely powerful tool for automating repetitive or difficult system tasks. Many people are scared away by the lack of GUI elements, but this can be a tremendous asset in terms of making processes consistent and repeatable.

The first command we'll look at may be familiar: most people have seen, heard of or learned the dir command at some point. When run without any arguments, it prints a list of files in the current directory along with some file-system metadata. You may not be aware that dir can be run with several flags and parameters that modify its behavior. For instance, typing dir *.txt will filter the list of files according to a pattern; in this case it will only list files with a txt extension...

Did you know?

  • That Adobe Photoshop files contain multiple layers of information, most of which are hidden from view and cannot be seen without the use of Photoshop?

  • That MAPI = Messaging Application Programming Interface, and it allows access to email content and metadata?

  • That a standard single-layer DVD-R can only hold ~4.7GB of information and takes ~30 minutes to fully burn, whereas copying 4.7GB of files to a hard drive will only take 5 minutes?

  • That many of the off-the-shelf eDiscovery programs can only extract a limited number of embedded files?

  • That hard drives can deteriorate in a few years if not used, because the disks need to spin?

  • That removing near duplicate documents without first reviewing them could risk missing important information?

  • That Bloomberg email systems can also contain instant messages and that all of the data is in simple text format?

  • That most near-dupe technologies cannot group foreign language documents together?

  • That retrieving passwords by asking the person who created the password is usually much faster than trying to break it with software?

  • That Microsoft Exchange (.edb) databases can be easily opened by a variety of software products?

  • That Microsoft XLS files can contain hidden spreadsheets?

  • That MS Excel documents can have charts layered on top of each other, hiding potentially relevant data?

  • That Microsoft Outlook PST files can contain foreign language characters even if the PST file isn’t Unicode?

  • That Google Gmail emails can be downloaded to Microsoft Outlook using a POP3 or IMAP connection?

  • That Lotus Notes has a soft delete option that activates when you open an NSF file, automatically deleting emails marked with soft delete?