Transcript
“DEEP DIVE EDISCOVERY PROCESSING”
WWW.JNNGROUP.COM 225 South Lake Ave 3rd Floor, Pasadena, CA 91101 T: (626) - 788-9638 F: (626) - 788-9630
JNN GROUP, INC.
DEEP DIVE PROCESS
A| FULL PROCESSING OF ELECTRONICALLY STORED INFORMATION
Processing involves extracting all searchable information, including metadata and full text, from documents and email and formatting them for lawyer review. JNN Group processes email from Exchange and Outlook, including files in PST, MBOX, EML, and MSG format; Excel, PowerPoint, and Word documents; PDFs; some CAD and CAM files; images, including JPGs, GIFs, PNGs, and TIFFs; and audio and video files. No preprocessing is required before loading native data into the JNN Group platform. Files that JNN Group does not process, for example, proprietary databases, are not full-text searchable and cannot be imaged by JNN Group, but can still be searched by their metadata, downloaded, tagged, and produced.
Processing is a twelve-step process:
1. Extract files from all containers and compressed archives. JNN Group extracts files from containers nested inside other containers, down to arbitrary depth. For example, JNN Group will extract an Excel file embedded in a Word document contained in a ZIP file attached to an email contained in a PST. In addition to ordinary containers and compressed archives, JNN Group can extract files from forensic images.
2. Remove system files. JNN Group removes “system files,” like the copy of Windows or Word collected as part of an image of a custodian’s computer. The National Institute of Standards and Technology (NIST) publishes the industry-standard list of system files. Removing these files is sometimes called deNISTing.
3. Extract all text and metadata. JNN Group extracts all available text and metadata from native documents. This includes, for example, the full text of a Word document, the text in all the cells of an Excel spreadsheet, the sent date of an email, and the created and last modified dates of files on a file system. The extracted text and metadata become fully searchable in JNN Group.
4. OCR all images.
JNN Group runs optical character recognition (OCR) on all images, including scanned documents, and uses the results of OCR as the full text for the images. To make the quality of the OCR as high as possible, JNN Group rotates images and can be configured for foreign languages. JNN Group uses the Tesseract OCR engine, which was originally developed at HP, later released as open source, and subsequently developed with Google's sponsorship. JNN Group also recognizes printed or imaged email from its OCR and creates parent–child and conversation relationships for email recognized in this way.
5. Normalize time zones. Documents and email collected in different parts of the world have times expressed in different time zones, often with no notation of the time zone. JNN Group normalizes all times to the time zone of the reviewer or to a single time zone for the matter so that documents and email appear in the correct order without reviewers having to convert time zones mentally.
6. Detect duplicates. Each file that JNN Group ingests is called an “instance.” JNN Group detects duplicate instances by comparing the hashes of certain parts of each instance. A hash is a fixed-length alphanumeric string generated by a hashing function from certain input data. Two inputs that produce the same hash are treated as identical, because with a well-chosen hashing function it is extremely unlikely that different inputs produce the same hash. For email, JNN Group detects duplicates by comparing the hashes of the concatenated sender address, date sent, normalized subject, normalized message body, and the hash of each attachment. This way, for example, JNN Group detects two copies of an email message, one collected from its sender and the other from its recipient, as duplicates even though the copy collected from the recipient has additional header information indicating when and by what address path it was delivered. For
images with load files, JNN Group detects duplicates by comparing the hash of the native plus the image. This way, when the same native is produced multiple times in a single production, like an attachment attached to multiple emails, JNN Group does not treat the copies as duplicates if they were produced with different images, for example, with different Bates stamps. For all other files, JNN Group detects duplicates by comparing the hash of the entire file. Essentially, JNN Group treats two instances as duplicates if they look identical in the review window and when printed out.
7. Detect near duplicates. JNN Group detects near duplicates by conducting a paragraph-wise text comparison of documents. Documents with substantial paragraph-wise overlap are flagged as near duplicates. Reviewers can see the number of near duplicates that a document has in the search results summary grid and can navigate from a document to its near duplicates in the review window.
8. Generate near-native renderings of all documents for review. One of the keys to JNN Group’s speed is creating near-native renderings of all documents during processing so that reviewers don’t have to wait for these to be created during review. During review, JNN Group displays these stored near-native renderings in the browser. JNN Group can create multiple near-native renderings for a document, for example, if the document contains redlines or can otherwise be viewed in multiple ways in its native format.
9. Create parent–child relationships. JNN Group creates parent–child relationships between emails and their attachments and between documents and their embedded objects. JNN Group shows the number of children that a document has in the search results summary grid and lets you navigate from parents to children and children to parents in the review window. You can also search for documents with or without children.
10. Create email conversations.
JNN Group normalizes email subjects by removing blank space and prefixes like Re: and Fwd:. JNN Group then groups as conversations all emails that share the same normalized subject and have at least one participant in common. This algorithm errs in favor of grouping emails together.
11. Create search indices and review database. JNN Group creates 17 or more different search indices on ingest to make searching and generating statistics along a variety of dimensions as fast as possible. JNN Group stores native files and their near-native renderings on fast network-attached storage (NAS) and stores metadata, extracted or OCR text, and the search indices in a document-based NoSQL database on a database server.
12. Generate a complete ingest report. JNN Group generates an ingest report for every file it ingests. The ingest report shows the file’s hash, when it was ingested, its custodian, its file length, its file path and container path, how JNN Group treated it, and whether there were any ingestion problems, for example, if the file was password protected and the password was not supplied. The consolidated ingest report for an entire database is available for download on the analytics page in JNN Group.
JNN Group's processing speed depends on the kind of data. Dense container files, like PSTs, take longer per GB than flat files like Word documents or PDFs. On average, JNN Group can process about 150,000 documents, not pages, per hour. JNN Group processes the full Enron set, which is 60 GB of PSTs, in about 4 hours. JNN Group processing can add documents to a database while the database is live. Because processing involves multiple passes through the data, you will see new documents appear first and then, after all the documents have appeared, you will see conversation counts and other second-pass information appear.
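The email duplicate detection described in step 6 can be illustrated with a minimal sketch. This is not JNN Group's implementation: the SHA-256 hash, the whitespace and case normalization, the field separator, and the names Email, normalize_text, and email_dedup_hash are all assumptions made for illustration. The sketch only shows why hashing selected fields, rather than the whole file, lets a sender's copy and a recipient's copy match despite different delivery headers.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest as a hex string (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def normalize_text(text: str) -> str:
    """Assumed normalization: collapse runs of whitespace and lowercase."""
    return " ".join(text.split()).lower()


@dataclass
class Email:
    sender: str
    date_sent: str  # assumed to be already normalized to one time zone
    subject: str
    body: str
    attachments: List[bytes] = field(default_factory=list)
    headers: str = ""  # delivery headers: deliberately NOT part of the hash


def email_dedup_hash(msg: Email) -> str:
    """Hash the concatenation of the fields listed in step 6."""
    parts = [
        msg.sender.strip().lower(),
        msg.date_sent,
        normalize_text(msg.subject),
        normalize_text(msg.body),
    ]
    # Each attachment contributes its own hash, not its raw bytes.
    parts.extend(sha256_hex(a) for a in msg.attachments)
    # A unit-separator character keeps adjacent fields from running together.
    return sha256_hex("\x1f".join(parts).encode("utf-8"))


sent_copy = Email(
    sender="alice@example.com",
    date_sent="2001-05-14T16:39:00Z",
    subject="Quarterly  numbers",
    body="See attached.\r\n",
    attachments=[b"spreadsheet bytes"],
)
received_copy = Email(
    sender="alice@example.com",
    date_sent="2001-05-14T16:39:00Z",
    subject="Quarterly numbers",
    body="See attached.",
    attachments=[b"spreadsheet bytes"],
    headers="Received: from mail.example.com",
)

print(email_dedup_hash(sent_copy) == email_dedup_hash(received_copy))  # prints True
```

Because the delivery headers never enter the hash, the two copies compare equal, while any change to the sender, date, subject, body, or an attachment produces a different hash.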
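The conversation grouping described in step 10 can be sketched in a similar way. The code below is an illustration under stated assumptions, not JNN Group's algorithm: the prefix list (Re, Fw, Fwd), the lowercasing, the transitive merging through a union-find structure, and the names normalize_subject and group_conversations are all hypothetical. Transitive merging is chosen here because the text says the algorithm errs in favor of grouping emails together.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Assumed prefix list; real mail clients use more variants.
_PREFIX = re.compile(r"^\s*(re|fwd?)\s*:\s*", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    """Strip leading reply/forward prefixes, then remove all blank space."""
    previous = None
    while previous != subject:  # repeat to handle stacked prefixes like "Re: Fwd:"
        previous = subject
        subject = _PREFIX.sub("", subject)
    return "".join(subject.split()).lower()  # lowercasing is an assumption


def group_conversations(emails: List[Tuple[str, Set[str]]]) -> List[List[int]]:
    """Group email indices that share a normalized subject and a participant.

    Each email is a (subject, participants) pair. Sharing is applied
    transitively, so A-B and B-C overlaps put A, B, and C in one conversation.
    """
    parent = list(range(len(emails)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # First email seen for each (normalized subject, participant) pair.
    first_seen: Dict[Tuple[str, str], int] = {}
    for idx, (subject, participants) in enumerate(emails):
        key_subject = normalize_subject(subject)
        for person in participants:
            key = (key_subject, person.strip().lower())
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    groups: Dict[int, List[int]] = defaultdict(list)
    for idx in range(len(emails)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


emails = [
    ("Budget", {"a@x.com", "b@x.com"}),
    ("RE: Budget", {"b@x.com", "c@x.com"}),
    ("Fwd: Re: Budget", {"c@x.com", "d@x.com"}),
    ("Budget", {"z@y.com"}),  # same subject, no shared participant
    ("Lunch", {"a@x.com"}),   # shared participant, different subject
]
print(group_conversations(emails))  # prints [[0, 1, 2], [3], [4]]
```

The first three emails form one conversation even though the first and third share no participant directly; they are linked through the second. The fourth email stays separate because a matching subject alone is not enough.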
B| FREQUENTLY ASKED QUESTIONS
1) QUESTION: Can or should I preprocess data before loading it into JNN Group? ANSWER: No preprocessing is required before loading data into JNN Group. We discourage preprocessing native data because preprocessing alters, destroys, or obfuscates information that is better extracted by JNN Group processing.
2) QUESTION: Can I load productions or exports from other eDiscovery software into JNN Group? ANSWER: Yes. Productions from other parties and exports from other eDiscovery software are compatible with JNN Group if they are in native, PDF, TIFF, or JPG format, single-page or multi-page, accompanied by a load file in any industry-standard format. We recommend a DAT load file with accompanying OPT. Note that JNN Group will process this kind of data so that, for example, duplicates not detected by other eDiscovery software or noted in the incoming load file will be detected by JNN Group. As a result, the document counts in JNN Group may differ from the document counts in the load file or in other eDiscovery software. JNN Group can optionally use a load-file-supplied hash field for deduplication, in which case only those entries with the same hash in the load file will be deduplicated.
3) QUESTION: Can I cull data before processing? ANSWER: JNN Group processes all data and allows you to conduct early case assessment and cull data from the full-featured review tool. Because there is no separate charge for early case assessment versus full JNN Group processing, you have access to fully processed data and a full-featured review tool when making culling decisions.
4) QUESTION: How is data size measured for billing purposes? ANSWER: JNN Group measures data size after expanding any top-level compressed file, for example, a ZIP file containing all the data, but before any other processing.
Some JNN Group processing, like deduplication, reduces the data size; other JNN Group processing, like generating near-native renderings up front, increases the data size. On average, the size of the data on JNN Group servers is 2x the size of the data before processing. But because JNN Group bills on the size of the data before processing, you know what the data size on your invoice will be before you submit your data to JNN Group. This billing transparency is one of JNN Group's strengths.