Maximum Page Count

Just the facts please

  • 270GB of NSFs and eDocs
  • 400 search terms
  • 300k docs delivered natively
  • 16,000,000 estimated pages (@ 53 pages/doc)
  • 140,000 documents to be produced
  • 11,100,000 pages TIFFd (@ 79 pages/doc)
  • Less than 3 weeks to process 270GB
  • Less than 1 week to TIFF 11.1 million pages
  • Less than 1 week to endorse/deliver 11.1 million pages
  • = 1.4 miles of pages
  • = 2.2m sticks of butter

Challenge:

It’s amazing how such a relatively small number of documents can explode into 1.4 miles of pages (we’ll explain in a moment). That’s the challenge our client, a worldwide document hosting company, faced recently. What’s even more amazing is that deadlines stay the same, regardless of how many pages your eDiscovery project generates. It’s probably not very fair or logical, but we don’t make the rules, so take it up with Uncle Sam.

We completed a quick turnaround (less than 3 weeks) to natively process 270GB of Lotus Notes (.NSF) email and Microsoft Office documents (eDocs) for our hosting partner, and the request to start productions followed immediately. De-duplication and running search terms on the emails dramatically reduced the data to a mere 300,000 documents for review, a fairly small number of documents considering the original 270GB volume of the data. But the really telling statistic was the number of estimated pages Gridlogik™ collected. With more than 16,000,000 estimated pages and only 300,000 documents (an average of 53 pages per document), the page counts were off the charts. This count was nowhere near “industry averages.”

So what’s an average amount? A common question we hear is “How many pages are in a document?” quickly followed by “Ok, then how many pages in a gigabyte?” The answers are simple and surprisingly boring: “Between 3 and 10 pages per document,” and “It depends.” Using these averages, the 300,000 documents would produce anywhere between 900,000 and 3,000,000 pages, a fraction of the 16,000,000 actually counted. Yeah… wow. Lucky for us, we have Gridlogik, which extracts the native page count of every document before printing. This is extremely helpful for our clients’ discovery process, where knowing page counts ahead of time truly matters. Ultimately, deadlines rarely change even if the case explodes with pages.
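To see how far off the rule of thumb lands, here is a minimal Python sketch of the arithmetic above. The constants come straight from this case study; the function and variable names are ours, made up for illustration, and have nothing to do with how Gridlogik itself counts pages.

```python
# Figures quoted in this case study.
DOCS_FOR_REVIEW = 300_000
ESTIMATED_PAGES = 16_000_000
RULE_OF_THUMB_PAGES_PER_DOC = (3, 10)  # "between 3 and 10 pages per document"


def rule_of_thumb_range(doc_count, pages_per_doc=RULE_OF_THUMB_PAGES_PER_DOC):
    """Return the (low, high) page estimate implied by a pages-per-document range."""
    low, high = pages_per_doc
    return doc_count * low, doc_count * high


low, high = rule_of_thumb_range(DOCS_FOR_REVIEW)
actual_avg = ESTIMATED_PAGES / DOCS_FOR_REVIEW

print(f"Rule-of-thumb estimate: {low:,} to {high:,} pages")
print(f"Extracted page count:   {ESTIMATED_PAGES:,} pages ({actual_avg:.0f} pages/doc)")
print(f"Actual vs. high end of the estimate: {ESTIMATED_PAGES / high:.1f}x")
```

Even the generous end of the rule of thumb undershoots this data set by more than a factor of five, which is the whole argument for counting pages instead of guessing.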

Our client received some “tough love” from the court: relevant documents had to be produced in less than one month. Our client, well aware of the 16,000,000 pages, turned to us for a solution. Smart client.

So, what’s the problem:


What we did:

We knocked it out of the park.

Since we were the processing provider on this project, getting up to speed on the production requirements was easy. We had all the native documents, metadata and tag lists ready to go. However, TIFFing and endorsing the 140,000 documents in the required tight time frame meant that we needed super-software or a herculean effort. Oh, we have that solved: it’s called Gridlogik.

Gridlogik was originally designed to be a TIFFing powerhouse, using all native applications to print complex and large file types like spreadsheets. The problem with this data, however, was that the responsive documents that needed to be TIFFd were mostly text-based source-code files. Text files are usually small in file size, but can generate thousands of pages. TIFFing a large text document meant the print driver needed to be extremely fast. Ok, so we thought on our feet and cleared the hurdle: we modified one of our higher-speed print drivers to support streaming print spools. This enabled us to TIFF a 5,000-page text file in under 3 minutes. To put that into perspective, using a queue-based print driver to print a 5,000-page document might take 3 hours or more to complete.

Gridlogik did it in 3 minutes. That’s 1,666 pages per minute. Lightning fast. Insane fast. With this added modification we were able to convert all 140,000 documents into 11,100,000 TIFFs in about a week. It took another week to complete the multi-level endorsements and deliver the data on 10 hard drives with 2 backup copies for our client. We really love what we do.
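For anyone who wants to check our speed claims, here is a rough Python sketch of the throughput arithmetic. The page counts and times are the ones quoted above; the sustained figure assumes a full 7-day week of around-the-clock processing, which is our simplifying assumption for illustration and not a logged measurement.

```python
MINUTES_PER_HOUR = 60
MINUTES_PER_WEEK = 7 * 24 * MINUTES_PER_HOUR  # assumption: nonstop processing


def pages_per_minute(pages, minutes):
    """Average throughput in pages per minute."""
    return pages / minutes


# 5,000-page text file: streaming spool ("under 3 minutes") vs.
# queue-based driver ("3 hours or more").
streaming_ppm = pages_per_minute(5_000, 3)
queued_ppm = pages_per_minute(5_000, 3 * MINUTES_PER_HOUR)
speedup = streaming_ppm / queued_ppm

# 11.1 million pages TIFFd in about a week.
sustained_ppm = pages_per_minute(11_100_000, MINUTES_PER_WEEK)

print(f"Streaming spool: {int(streaming_ppm):,} pages/min")
print(f"Queue-based:     {queued_ppm:.0f} pages/min")
print(f"Speedup:         about {speedup:.0f}x")
print(f"Sustained:       about {sustained_ppm:,.0f} pages/min over the week")
```

Because the quoted times are bounds ("under 3 minutes", "3 hours or more"), the real speedup on that file would be at least the figure printed here.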

The results:


So, how did we get 1.4 miles of pages? Here’s the math:

(((11,100,000 / 3,000) x 2) / 5,280)
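The same math as a small Python sketch. The 5,280 is feet per mile; reading the other two constants as "about 2 feet of length for every 3,000 pages" is our interpretation of the formula's units, and the constant and function names are made up for illustration.

```python
PAGES_TIFFD = 11_100_000
PAGES_PER_UNIT = 3_000   # from the formula; assumed to be pages per 2-foot unit
FEET_PER_UNIT = 2        # from the formula; assumed length of one unit in feet
FEET_PER_MILE = 5_280


def pages_to_miles(pages):
    """Convert a page count to miles using the formula's constants."""
    feet = pages / PAGES_PER_UNIT * FEET_PER_UNIT
    return feet / FEET_PER_MILE


units = PAGES_TIFFD / PAGES_PER_UNIT
feet = units * FEET_PER_UNIT
miles = pages_to_miles(PAGES_TIFFD)

print(f"{units:,.0f} units x {FEET_PER_UNIT} ft = {feet:,.0f} ft = {miles:.1f} miles")
```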

And while we’re on a roll, here are some other random equivalents:

Did you know?

  • That if you redact a document, you should re-OCR the document before producing the text of that document?

  • That transferring sensitive data on a device (like a hard drive) in a cardboard box (like a bankers box) leaves the disk highly susceptible to failure?

  • That efficient and timely pre-trial eDiscovery is a huge strategic advantage in litigation?

  • That Guidance EnCase images can be opened and mounted by other forensic software?

  • That Google Gmail emails can be downloaded to Microsoft Outlook using a POP3 or IMAP connection?

  • That Microsoft Word 97-2002 documents can contain deleted data hidden within the binary of the file if “allow fast saves” is enabled?

  • That there is no realistic way to redact native files without first converting the file to an image?

  • That Google just started performing OCR on PDF documents to make them Google-searchable in late 2008?

  • That the European Union’s Directive on Data Protection mandates that any non-EU recipient of EU-based personal data must provide the required levels of privacy protection? Logik is Safe Harbor Certified.

  • That Microsoft Outlook PST files can contain foreign language characters even if the PST file isn’t Unicode?

  • That Microsoft Outlook doesn’t actually compress data, so how can it possibly expand after processing?

  • That Lotus Notes (in comparison to Microsoft Outlook) emails usually contain a very high number of embedded images in the body text of the email, like desktop screen-shots?

  • That estimating page counts based on file type and file size is arbitrary and can lead to wildly inaccurate estimates?

  • That transporting your sensitive evidence in an unsafe container, like a cardboard box, is ok until that box is dropped on the floor or lands in a puddle?

  • That data burned to a disc, like a DVD or CD, is much more likely to be corrupted than files copied to a hard drive?