Japanese Data Tsunami
Just the facts please
- Four terabytes of Japanese data
- English and Japanese search terms
- 14,000,000 pages for review
- 8,000,000 pages produced to ITC
- $500,000 in savings
- < 2 months to complete
- Happy client, happy customer
Challenge:
One of the world’s largest producers of wind turbines needed to collect, process, analyze, review and produce over four terabytes of “real” data (email and office files) to the ITC in a matter of months. What they had was a wave of data full of different encodings, email formats (Lotus, Outlook, EML, text-based), and Japanese proprietary document formats. They clearly needed help. Our client, one of the top three IP law firms on the planet, was tasked with managing this complex process from beginning to end. The data was collected in Japan and the US from over 100 people. Due to the volume of data, keywords in both English and Japanese (multiple encodings) were approved and needed to be applied to the large data set after processing, a huge effort in its own right. Our client came to Logik to get the work done quickly and accurately.
So, what’s the problem?
- Four terabytes of emails and office files = tens of millions of documents pre-search
- Japanese documents have multiple encodings, making search and detection extremely difficult – plus the words need to be “tokenized” for accurate search
- ITC has tight deadlines and expects perfect productions without error
- Choosing a vendor that uses “extracted size” billing would double or triple the cost
- So…which documents are English and which are Japanese, Chinese, or Korean again?
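To make the encoding and tokenization problem concrete, here is a minimal Python sketch. It is an illustration only, not the tooling used on this project. The candidate encoding list, the function names (`decode_japanese`, `bigrams`, `matches`) and the character-bigram approach are all assumptions; production search engines typically rely on a morphological analyzer rather than bigrams.

```python
# Assumption: these four codecs cover the Japanese text in a collection.
# ISO-2022-JP goes first because its 7-bit bytes would also pass as UTF-8.
# Side effect: plain ASCII input is reported as "iso2022_jp" (text is unchanged).
CANDIDATE_ENCODINGS = ("iso2022_jp", "utf-8", "euc_jp", "shift_jis")


def decode_japanese(raw: bytes) -> tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


def bigrams(text: str) -> list[str]:
    """Split text into overlapping two-character tokens, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    return [a + b for a, b in zip(chars, chars[1:])]


def matches(term: str, document: str) -> bool:
    """True when every bigram of the search term appears in the document."""
    doc_tokens = set(bigrams(document))
    term_tokens = bigrams(term)
    return bool(term_tokens) and all(t in doc_tokens for t in term_tokens)


if __name__ == "__main__":
    raw = "風力タービンの設計".encode("shift_jis")  # hypothetical file contents
    text, encoding = decode_japanese(raw)
    print(encoding, matches("タービン", text))
```

Trial decoding is a heuristic: some byte sequences are valid in more than one encoding, so the order of the candidates matters and a real pipeline would want a stronger detector plus spot checks.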
What we did:
A project of this size and scope needs great project management. The first thing we did was assemble a team to work directly with our law firm client and upper management from the customer to devise a realistic schedule. Normally, four terabytes of data trickles in as it is collected over time; we were able to get all of the data delivered within a month. The schedule we created allowed us to provide massive rolling deliveries of data (hundreds of thousands of documents), meaning the client was never without documents to review (always a good thing).
The results:
- Using language detection, we were able to flag all non-English documents with their respective language (e.g. Japanese, Chinese, Korean), thus facilitating a more efficient document review
- We delivered ~14,000,000 pages, post search, in native + TIFF format to our client for review
- Over 8,000,000 pages were flagged as responsive, numbered, endorsed and provided to the ITC
- Production of the 8,000,000 pages took less than 24 hours for us to complete, ready to be delivered to the ITC
- All data was processed, searched and delivered in under 2 months on a rolling delivery schedule, easily making the tight ITC deadline
- Against other bids for this project, we saved the client over $500,000 in processing fees
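For readers curious how the language flagging in the first result can work in principle, here is a rough Python sketch based on Unicode script ranges. It is a simplified heuristic under stated assumptions, not the detector used on this project: the function name `flag_language` and the decision order are hypothetical, any Latin-only text is labeled English, and a Japanese document written only in kanji would be mislabeled as Chinese.

```python
def flag_language(text: str) -> str:
    """Guess a document's language from the scripts its characters belong to."""
    kana = hangul = han = latin = 0
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:      # Hiragana and Katakana blocks
            kana += 1
        elif 0xAC00 <= cp <= 0xD7A3:    # Hangul syllables
            hangul += 1
        elif 0x4E00 <= cp <= 0x9FFF:    # CJK Unified Ideographs
            han += 1
        elif ch.isascii() and ch.isalpha():
            latin += 1

    # Kana only occurs in Japanese, so it wins even when kanji are present.
    if kana:
        return "Japanese"
    if hangul:
        return "Korean"
    if han:
        return "Chinese"
    if latin:
        return "English"  # assumption: Latin-only text is treated as English
    return "Unknown"


if __name__ == "__main__":
    for sample in ("風力タービンの設計", "风力发电机", "풍력 터빈", "wind turbine"):
        print(sample, "->", flag_language(sample))
```

A flag like this lets each document be routed to a reviewer who reads that language, which is where the review efficiency comes from.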