Maximum Page Count

Just the facts please

  • 270GBs of NSFs and eDocs
  • 400 search terms
  • 300k docs delivered natively
  • 16,000,000 estimated pages (@ 53 pages/doc)
  • 140,000 documents to be produced
  • 11,100,000 pages TIFFd (@ 79 pages/doc)
  • Less than 3 weeks to process 270GB
  • Less than 1 week to TIFF 11.1 million pages
  • Less than 1 week to endorse/deliver 11.1 million pages
  • = 1.4 miles of pages
  • = 2.2m sticks of butter

Challenge:

It’s amazing how such a relatively small number of documents can explode into 1.4 miles of pages (we’ll explain in a moment). That’s the challenge our client, a worldwide document hosting company, faced recently. What’s even more amazing is that deadlines stay the same, regardless of how many pages your eDiscovery project generates. It’s probably not very fair or logical, but we don’t make the rules, so take it up with Uncle Sam.

We had just completed a quick turnaround (less than 3 weeks) to natively process 270GB of Lotus Notes (.NSF) email and Microsoft Office documents (eDocs) for our hosting partner when the request to start productions came in. De-duplication and running search terms on the emails dramatically reduced the data to a mere 300,000 documents for review, a fairly small number of documents considering the original 270GB volume. But the truly telling statistic was the number of estimated pages Gridlogik™ collected. With more than 16,000,000 estimated pages and only 300,000 documents (an average of 53 pages per document), the page counts were off the charts. This count was nowhere near “industry averages.”

So what’s an average count? A common question we hear is “How many pages are in a document?” quickly followed by “OK, then how many pages in a gigabyte?” The answers are simple and surprisingly boring: “Between 3 and 10 pages per document,” and “It depends.” Using these averages, the 300,000 documents would produce anywhere between 900,000 and 3,000,000 pages, a fraction of the 16,000,000 pages actually estimated. Yeah… wow. Lucky for us, we have Gridlogik, which extracts the native page count of every document before printing. This is extremely helpful for our clients’ discovery process, where knowing page counts ahead of time truly matters. Ultimately, deadlines rarely change even if the case explodes with pages.
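For readers who want to check the numbers, here is a minimal sketch of that estimate in Python. The figures come straight from this case study; the function and variable names are our own illustration and are not part of Gridlogik.

```python
# Figures quoted in this case study.
DOCUMENTS_FOR_REVIEW = 300_000
ESTIMATED_PAGES = 16_000_000

# Rule-of-thumb range quoted above: 3 to 10 pages per document.
AVG_PAGES_LOW = 3
AVG_PAGES_HIGH = 10


def estimate_page_range(doc_count, low=AVG_PAGES_LOW, high=AVG_PAGES_HIGH):
    """Return the (low, high) page estimate for a document count."""
    return doc_count * low, doc_count * high


rule_of_thumb_low, rule_of_thumb_high = estimate_page_range(DOCUMENTS_FOR_REVIEW)
actual_pages_per_doc = ESTIMATED_PAGES / DOCUMENTS_FOR_REVIEW

print(f"Rule-of-thumb estimate: {rule_of_thumb_low:,} to {rule_of_thumb_high:,} pages")
print(f"Extracted page counts:  {ESTIMATED_PAGES:,} pages "
      f"(about {actual_pages_per_doc:.0f} pages per document)")
```

Even the high end of the rule-of-thumb range lands at less than a fifth of the extracted count, which is why having page counts before printing matters.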

Our client received some “tough love” from the court: the relevant documents had to be produced in less than one month. Our client, well aware of the 16,000,000 pages, turned to us for a solution. Smart client.

So, what’s the problem:


What we did:

We knocked it out of the park.

Since we were the processing provider on this project, getting up to speed on the production requirements was easy. We had all the native documents, metadata and tag lists ready to go. However, TIFFing and endorsing the 140,000 documents in the required tight time frame meant that we needed either super-software or a herculean effort. Oh, we have that solved. It’s called Gridlogik.

Gridlogik was originally designed to be a TIFFing powerhouse, using all native applications to print complex and large file types like spreadsheets. The problem with this data, however, was that the responsive documents were mostly text-based source-code files that needed to be TIFFd. Text files are usually small in file size, but can generate thousands of pages. TIFFing a large text document meant the print driver needed to be extremely fast. OK, so we thought on our feet and cleared the hurdle: we modified one of our higher-speed print drivers to support streaming print spools. This enabled us to TIFF a 5,000-page text file in under 3 minutes. To put that into perspective, using a queue-based print driver to print a 5,000-page document might take 3 hours or more to complete.

Gridlogik did it in 3 minutes. That’s roughly 1,666 pages per minute. Lightning fast. Insanely fast. With this added modification we were able to convert all 140,000 documents into 11,100,000 TIFFs in about a week. It took another week to complete the multi-level endorsements and deliver the data on 10 hard drives with 2 backup copies for our client. We really love what we do.
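As a back-of-the-envelope check on those speeds, here is a short Python sketch. It uses only the figures quoted above (5,000 pages, 3 minutes, 3 hours, 11,100,000 pages). The variable names are ours, and the last calculation assumes every page moves at the single-file streaming rate, which is an illustration only and not a description of how the job was actually scheduled.

```python
# Figures quoted in this case study.
PAGES_IN_TEST_FILE = 5_000
STREAMING_MINUTES = 3          # "under 3 minutes" with the streaming print spool
QUEUE_BASED_MINUTES = 3 * 60   # "3 hours or more" with a queue-based driver
TOTAL_PAGES_TIFFED = 11_100_000

# Pages per minute for each approach.
streaming_ppm = PAGES_IN_TEST_FILE / STREAMING_MINUTES
queue_based_ppm = PAGES_IN_TEST_FILE / QUEUE_BASED_MINUTES

# How many times faster the streaming spool is for this one file.
speedup = streaming_ppm / queue_based_ppm

# Hypothetical: hours needed if all pages were TIFFd at the streaming rate
# on a single print stream (an assumption made only for illustration).
hours_at_streaming_rate = TOTAL_PAGES_TIFFED / streaming_ppm / 60

print(f"Streaming spool:   {streaming_ppm:,.0f} pages per minute")
print(f"Queue-based spool: {queue_based_ppm:,.0f} pages per minute")
print(f"Speedup:           {speedup:.0f}x")
print(f"All pages at the streaming rate: about {hours_at_streaming_rate:,.0f} hours")
```

Because the source figures are bounds ("under 3 minutes", "3 hours or more"), the real speedup is at least what this sketch reports.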

The results:


So, how did we get 1.4 miles of pages? Here’s the math:

(((11,100,000 / 3,000) x 2) / 5,280)
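The same math, spelled out as a small Python sketch. The constants 3,000 and 2 come directly from the formula above; reading them as "pages per unit of paper" and "feet per unit" is our assumption, since the formula does not label them. The 5,280 is the number of feet in a mile.

```python
PAGES_TIFFED = 11_100_000

# Constants from the formula above. Their labels are assumptions:
# the formula itself does not say what 3,000 and 2 represent.
PAGES_PER_UNIT = 3_000   # assumed: pages in one unit of paper (e.g. a box)
FEET_PER_UNIT = 2        # assumed: length of that unit in feet
FEET_PER_MILE = 5_280


def pages_to_miles(pages):
    """Convert a page count to miles using the constants above."""
    units = pages / PAGES_PER_UNIT
    feet = units * FEET_PER_UNIT
    return feet / FEET_PER_MILE


miles = pages_to_miles(PAGES_TIFFED)
print(f"{PAGES_TIFFED:,} pages is about {miles:.1f} miles")
```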

And while we’re on a roll, here are some other random equivalents:

Did you know?

  • That instant messages are discoverable information and are slowly overtaking email as the dominant form of business communication?

  • That a vendor’s claim to be Unicode compliant doesn’t necessarily mean they can truly handle foreign-language data?

  • That Bloomberg email systems keep attachments disconnected from the actual email and in a compressed .tar.gz file?

  • That PSTs with a size of 256 KB or less likely have no data in them or are not actual PST containers?

  • That all Microsoft Office document formats can contain embedded files and that those files too can contain embedded files?

  • That if you redact a document, you should re-OCR the document before producing the text of that document?

  • That “Size” and “Size on Disk” are two different measurements when you right-click a file or folder and view its properties?

  • That Mozilla Thunderbird emails can be easily processed by most eDiscovery applications?

  • That copying 5GB of tiny files is much slower than copying 1 large 5GB file?

  • That Microsoft Word 97-2002 documents can contain deleted data hidden within the binary of the file if “allow fast saves” is enabled?

  • That the internet header of an email can tell you a lot about where the email came from and who it went to?

  • That Lotus Notes (in comparison to Microsoft Outlook) emails usually contain a very high number of embedded images in the body text of the email, like desktop screen-shots?

  • That PDF files have multiple levels of security, where you can open a PDF, but might not be able to print it?

  • That page counts represent the amount of content that needs to be reviewed, and that without that information your document review estimates will be skewed?

  • That the European Union’s Directive on Data Protection mandates that any non-EU recipient of EU-based personal data must provide the required levels of privacy protection? Logik is Safe Harbor Certified.