Japanese Data Tsunami

Just the facts, please

  • Four terabytes of Japanese data
  • English and Japanese search terms
  • 14,000,000 pages for review
  • 8,000,000 pages produced to ITC
  • $500,000 in savings
  • < 2 months to complete
  • Happy client, happy customer

Challenge:

One of the world’s largest producers of wind turbines needed to collect, process, analyze, review, and produce over four terabytes of “real” data (email and office files) to the ITC in a matter of months. What they had was a wave of data in a mix of character encodings, email formats (Lotus, Outlook, EML, text-based), and proprietary Japanese document formats. They clearly needed help. Our client, one of the top three IP law firms on the planet, was tasked with managing this complex process from beginning to end. The data was collected in Japan and the US from over 100 people. Because of the volume, the approved keywords in both English and Japanese (in multiple encodings) had to be applied to the full data set after processing, a huge effort in its own right. Our client came to Logik to get the work done quickly and accurately.
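
Applying Japanese keywords across mixed encodings can be sketched roughly as follows. This is a minimal illustration only, not the tooling used on this project: the function names, the list of candidate encodings, and the order in which they are tried are all assumptions, and trial decoding is a heuristic that can misjudge short or ambiguous files.

```python
import unicodedata

# Assumed candidate encodings, tried in this order (heuristic, not exhaustive).
CANDIDATE_ENCODINGS = ("utf-8", "euc_jp", "shift_jis")


def decode_text(raw: bytes) -> tuple[str, str]:
    """Return (text, encoding_used) for raw bytes of unknown Japanese encoding."""
    # ISO-2022-JP is 7-bit and relies on ESC sequences, so check for it first.
    if b"\x1b" in raw:
        try:
            return raw.decode("iso2022_jp"), "iso2022_jp"
        except UnicodeDecodeError:
            pass
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, but flag that characters may have been lost.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


def normalize(text: str) -> str:
    """Fold case and full-width/half-width variants so terms compare equal."""
    return unicodedata.normalize("NFKC", text).casefold()


def find_hits(raw: bytes, terms: list[str]) -> list[str]:
    """Return the search terms that occur in the decoded document text."""
    text, _ = decode_text(raw)
    haystack = normalize(text)
    return [term for term in terms if normalize(term) in haystack]
```

Decoding everything to Unicode first means each keyword only has to be written once, rather than once per encoding. A real pipeline would also need tokenization rules and review of low-confidence decodes.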

So, what’s the problem?


What we did:

A project of this size and scope needs great project management. The first thing we did was assemble a team to work directly with our law firm client and the customer’s upper management to devise a realistic schedule. Normally, four terabytes of data trickles in as it is collected over time; here, we were able to get all of the data delivered within a month. The schedule we created allowed us to provide massive rolling deliveries of data (hundreds of thousands of documents), so the client was never without documents to review (always a good thing).

The results:


Did you know?

  • That a thorough data map can help you to implement your data retention policy, and can equip you for your “meet and confer” conference?

  • That Apple Macintosh files usually don’t have file extensions?

  • That removing near-duplicate documents without first reviewing them could risk missing important information?

  • That a standard single-layer DVD-R can only hold ~4.7 GB of information and takes ~30 minutes to fully burn, whereas copying 4.7 GB of files to a hard drive will only take about 5 minutes?

  • That many of the off-the-shelf eDiscovery programs cannot detect the encoding of documents and thus cannot properly handle foreign-language character sets?

  • That you could probably save your clients hundreds of thousands of dollars in eDiscovery costs by hosting the documents within your own firm?

  • That voicemail now falls within the scope of discovery and records retention?

  • That it would take a team of 1,000 attorneys 100 years to review a petabyte of information?

  • That if you redact a document, you should re-OCR the document before producing the text of that document?

  • That efficient and timely pre-trial eDiscovery is a huge strategic advantage in litigation?

  • That you can easily reduce the amount of information to review by doing a domain name analysis on your data (e.g., removing all mail from @amazon.com)?

  • That the internet header of an email can tell you a lot about where the email came from and who it went to?

  • That not all OCR software is created equal and that many don’t work very well?

  • That many of the off-the-shelf eDiscovery programs can extract only a limited number of embedded files?

  • That Google Gmail emails can be downloaded to Microsoft Outlook using a POP3 or IMAP connection?
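
The domain name analysis mentioned in the list above can be sketched as follows. This is a minimal illustration using Python's standard `email` package; the function names are hypothetical, messages are assumed to be available as raw RFC 822 text, and deciding which domains are safe to exclude remains a judgment for the review team.

```python
from collections import Counter
from email import message_from_string
from email.utils import parseaddr


def sender_domain(raw_message: str) -> str:
    """Return the lowercased domain of the From header ('' if there is none)."""
    message = message_from_string(raw_message)
    _, address = parseaddr(message.get("From", ""))
    if "@" not in address:
        return ""
    return address.rpartition("@")[2].lower()


def domain_counts(messages: list[str]) -> Counter:
    """Count messages per sender domain to show where the volume comes from."""
    return Counter(sender_domain(m) for m in messages)


def drop_domains(messages: list[str], excluded: set[str]) -> list[str]:
    """Keep only messages whose sender domain is not in the excluded set."""
    blocked = {domain.lower() for domain in excluded}
    return [m for m in messages if sender_domain(m) not in blocked]
```

Running `domain_counts` first shows which domains dominate the collection, and `drop_domains` then removes the ones agreed to be irrelevant, such as bulk retail notifications.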