Data Tsunami

Just the facts, please

  • Four terabytes of Japanese data
  • English and Japanese search terms
  • 14,000,000 pages for review
  • 8,000,000 pages produced to ITC
  • $500,000 in savings
  • < 2 months to complete
  • Happy client, happy customer

Challenge:

One of the world’s largest producers of wind turbines needed to collect, process, analyze, review, and produce more than four terabytes of “real” data (email and office files) to the ITC in a matter of months. What they had was a windfall of data in a mix of character encodings, email formats (Lotus, Outlook, EML, text-based), and proprietary Japanese document formats. They clearly needed help. Our client, one of the top three IP law firms on the planet, was tasked with managing this complex process from beginning to end. The data was collected in Japan and the US from more than 100 people. Because of the volume, approved keywords in both English and Japanese (in multiple encodings) had to be applied to the full data set after processing, a huge effort in its own right. Our client came to Logik to get the work done quickly and accurately.
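To give a feel for what applying English and Japanese keywords across multiple encodings involves, here is a minimal Python sketch. It is an illustration only, not the tooling used on this project: the list of candidate encodings, the function names (decode_best_effort, normalize, find_hits), and the plain substring matching are all assumptions made for the example.

```python
import unicodedata

# Candidate encodings, tried in order. This list is an assumption for
# illustration; a real collection may need others. ISO-2022-JP goes first
# because it is 7-bit and would otherwise decode "cleanly" as UTF-8.
CANDIDATE_ENCODINGS = ["iso2022_jp", "utf-8", "euc_jp", "shift_jis"]


def decode_best_effort(raw: bytes) -> str:
    """Return text from the first candidate encoding that decodes without error."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Nothing fit: keep what we can instead of dropping the document.
    return raw.decode("utf-8", errors="replace")


def normalize(text: str) -> str:
    """Fold full-width/half-width variants and letter case so terms compare equal."""
    return unicodedata.normalize("NFKC", text).casefold()


def find_hits(raw: bytes, terms: list[str]) -> list[str]:
    """Return the search terms that occur in a document's decoded text."""
    text = normalize(decode_best_effort(raw))
    return [term for term in terms if normalize(term) in text]


if __name__ == "__main__":
    terms = ["wind turbine", "風力タービン"]
    document = "Wind Turbine 仕様書: 風力タービン".encode("shift_jis")
    print(find_hits(document, terms))
```

The same terms match whether the document bytes arrive as Shift_JIS, ISO-2022-JP, or UTF-8, because matching happens on normalized Unicode text rather than on raw bytes. A first-fit decoder like this can still guess wrong on short or ambiguous files, and substring matching ignores Japanese word boundaries, so treat it as a starting point rather than a review-grade search.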

So, what’s the problem?


What we did:

Great project management is needed for a project of this size and scope. The first thing we did was assemble a team to work directly with our law firm client and the customer’s upper management to devise a realistic schedule. Normally, four terabytes of data trickles in as it is collected over time; in this case, we were able to get all of the data delivered within a month. The schedule we created allowed us to provide massive rolling deliveries of data (hundreds of thousands of documents), so the client was never without documents to review (always a good thing).

The results:


Did you know?

  • That Apple Macintosh files usually don’t have file extensions?

  • That AutoCAD documents should be viewed in native, not TIFF, format because of their 3-dimensional layouts?

  • That right-clicking on a file in Windows will alter the Last Accessed Date?

  • That estimating page counts based on file type and file size is arbitrary and can lead to wildly inaccurate estimates?

  • That producing in native format isn’t all that it’s cracked up to be, and sometimes producing in TIFF with metadata can be faster and easier?

  • That converting Lotus Notes databases to MS Outlook will lose important metadata and formats?

  • That USB 3.0 is coming in 2009 and is 10 times faster than the current USB 2.0?

  • That when requesting another party’s metadata, timing is everything?

  • That the “All Documents” view in Lotus Notes doesn’t always reveal ALL the documents, because it is a query and can be modified?

  • That you can use a mapped drive letter (e.g., X:\) to gain access to a Windows file whose path has accidentally gone over the 256-character limit?

  • That a journalist at the New York Times OCR’d 4 terabytes of TIFF images in under 24 hours using Amazon’s EC2 cloud services?

  • That Outlook Express .EML files can contain foreign language characters?

  • That Google just started performing OCR on PDF documents to make them Google-searchable in late 2008?

  • That many enterprise search applications don’t extract embedded files?

  • That page counts represent the amount of content to be reviewed, and without that information your document review projects will be skewed?