Predictive Coding

Predictive coding uses supervised machine learning (a type of artificial intelligence) to assist an attorney in the review and classification of electronically stored information (ESI). Predictive coding type software analyzes whole documents in a dataset, not just keywords, and uses advanced mathematics, including near-infinite-dimensional vector space probability analysis, to order, compare, and rank them.

In predictive coding driven CARs (computer assisted reviews), attorneys train a computer to find documents identified by the attorney as a target, typically as relevant to a particular lawsuit, or some other classification, such as privileged. The next illustration shows a six-step process of using this review tool. The gears indicate the interlocking nature of the ESI production processes used with predictive coding. (Note: illustration derived from a standard Microsoft diagram on project management, although the words are ours.) Later the same process will be described in slightly greater detail using eight steps.

Please remember that before you begin to actually carry out a predictive coding project as described, you need to plan for it. This is critical to the success of the project. We suggest you consult this detailed outline of a Form Plan for a Predictive Coding Project for a complete checklist.


The first step is Relevancy Dialogues with opposing counsel. This is based on a Cooperative approach to discovery. Cooperation is not only a best practice, but is, to a certain extent at least, a minimum standard required by rules of professional ethics and civil procedure. The primary goal of these dialogues for Predictive Coding purposes is to clarify the e-discovery requests and reach agreement on the scope of relevancy and production. Additional conferences to make disclosures designed to protect clients’ interests are also a best practice for proportional review.

The additional disclosure may in some cases require disclosure of the actual search techniques used, which is traditionally protected by work product. The disclosures may also sometimes include limited disclosure of some of the seed set documents used, both relevant and irrelevant. Nothing in the rules requires disclosure of irrelevant ESI, but if adequate privacy protections are provided, it may be in the best interests of all parties to do so. Such discretionary disclosures may be advantageous as risk mitigation and efficiency tactics. If an agreement on search protocol is reached by the parties, or imposed by the court, the parties are better protected from the risk of expensive motion practice and repetitions of discovery search and production. Agreement on search protocols can also be used to implement bottom line driven proportional review practices. See, e.g., the first case approving predictive coding search protocols: Da Silva Moore v. Publicis Groupe, _ F. Supp. 2d _, 2012 WL 607412 (S.D.N.Y. Feb. 24, 2012) (approved and adopted in Da Silva Moore v. Publicis Groupe, 2012 WL 1446534, at *2 (S.D.N.Y. Apr. 26, 2012)). Also see Global Aerospace Inc. v. Landow Aviation, L.P., 2012 WL 1431215 (Va. Cir. Ct. Apr. 23, 2012).

Initial Sample Reviews by SME Team. The use of Subject Matter Experts (SMEs) is a critical new part of predictive coding review. The samples are both random and judgmental. Judgmental samples use all of the various pre-predictive coding search methods discussed in the Hybrid Multimodal page of CAR, including parametric Boolean keyword searches, similarity searches, and concept searches, and even strategic linear reviews of select custodians and date ranges. The random samples broaden the search and also make possible various types of random sample based statistical analysis. For instance, the random sample can provide a baseline calculation of the prevalence of relevant documents in the corpus for quality control purposes. The review by SMEs of random and judgmental samples provides the first machine training input for the predictive coding software. The first round of machine training is also sometimes called the initial Seed Set Build.
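The prevalence baseline mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample size, the count of relevant documents, and the corpus size are hypothetical, and the Wilson score interval is one common choice for a binomial confidence interval, not necessarily the calculation any particular review platform performs.

```python
import math

def prevalence_estimate(sample_codes, z=1.96):
    """Estimate prevalence from SME codes on a simple random sample.

    sample_codes: list of booleans (True = coded relevant).
    Returns (point_estimate, lower, upper) using a Wilson score interval;
    z=1.96 corresponds to roughly 95% confidence.
    """
    n = len(sample_codes)
    if n == 0:
        raise ValueError("sample is empty")
    p = sum(sample_codes) / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)

# Hypothetical numbers: 1,500 randomly sampled documents, 45 coded relevant.
codes = [True] * 45 + [False] * 1455
p, low, high = prevalence_estimate(codes)

corpus_size = 1_000_000  # hypothetical corpus size
print(f"prevalence: {p:.1%} (interval {low:.1%} to {high:.1%})")
print(f"projected relevant documents: {int(low * corpus_size):,} to {int(high * corpus_size):,}")
```

The interval, not just the point estimate, is what makes the baseline useful for later quality control comparisons.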

At the commencement of the project, but using different documents selected by random sample, good predictive coding software will also create what is called a control set. The SMEs code documents in both the first training set and the control set, and may be unaware which of the documents selected at random for them to code are designated for the control set, and which for the initial training set. Typically, both the control documents and the random documents used in the seed set are part of the first random sample. The control set is used solely for testing the work of the SME during the iterative training process. It is not used for training. The control set documents test for SME consistency and for overall training effectiveness. The documents marked by the SME for the control set cannot also be used for training as this can introduce bias in the testing. See Losey, R., Robots From The Not-Too-Distant Future Explain How They Use Random Sampling For Artificial Intelligence Based Evidence Search.
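As a rough sketch of the separation described above, the first random sample can be split into two disjoint groups, one held out as the control set and one used for training. The function name, the 50/50 split, and the document identifiers are assumptions made for illustration; actual software typically handles this internally.

```python
import random

def split_random_sample(doc_ids, control_fraction=0.5, seed=42):
    """Split one random sample into disjoint control and training sets.

    The split is made before coding, so reviewers need not know which
    group a given document belongs to.
    """
    rng = random.Random(seed)
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * control_fraction)
    control = set(shuffled[:cut])    # used only for testing
    training = set(shuffled[cut:])   # used only for machine training
    return control, training

# Hypothetical identifiers for a 1,000-document random sample.
sample = [f"DOC-{i:05d}" for i in range(1000)]
control_set, training_set = split_random_sample(sample)

# No document may serve both purposes, or the test results would be biased.
assert control_set.isdisjoint(training_set)
print(len(control_set), "control documents;", len(training_set), "training documents")
```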

Three-Cylinder Training Set Reviews. The next step continues the machine training in an iterative process. In an active learning type of predictive coding the machine selects documents for you that it would like input on. Typically that is a selection of documents whose classification it is unsure of. That is known as one of the three cylinders of a predictive coding search engine. See Losey, Three-Cylinder Multimodal Approach To Predictive Coding. The other two are random sampling, to keep the search open, and judgmental sampling, just as in the initial round of training.

Machine Learning Iterations is where the software takes the input provided by the SME team and extrapolates and applies those classifications to the entire collection of documents. The predictive coding software then ranks all documents in the collection, usually on a percentage basis, from 100% probable relevant down to 0%. This key predictive coding step is repeated as needed for quality control (QC) purposes. These iterations continue until the training is complete within the proportional constraints of the case. At that point the SME in charge of the search may declare the search complete and ready for the next quality assurance test.

Quality Assurance Tests are based primarily on random sampling to verify the effectiveness of the final rankings. They are used to verify, or not, the reasonability of the search and predictive coding parameters developed. If the test is not passed, the review is reactivated for additional rounds of SME review and machine learning iterations. If the review passes the tests, then the first-pass relevancy review is complete, and the project moves to the next and final step. For more on sampling and its use in both quality control and quality assurance, see the Review Quality Control page of EDBP, and the KPMG publications: Gabriel, Manfred, Quality Control For Predictive Coding In e-Discovery (2013); and, The Case For Statistical Sampling In e-Discovery (2012).

Protection Reviews and Productions is the last step. It comes only after the Quality Assurance tests have been satisfied. This ties into the eighth Protections step of the EDBP. Predictive coding in this step is used as a way to make efficient assignments to manual reviewers. Before production, they do the final reviews of the documents that were coded relevant by the SME team in the prior steps, and only those documents. In this last step SMEs are not required, unless there are issues regarding any of their first-pass relevancy determinations. This step is done by attorneys knowledgeable about privilege and confidentiality issues in the case. This may be done by outside contract review attorneys.

In this final step the predicted relevant coding is confirmed and redactions made on the documents within budgetary limits as described, and privileged documents are identified and removed from production for logging. The documents culled out by predictive coding, or other search methods, are not subject to expensive final manual review. Neither are documents predicted relevant, but ranked below the budgeted amount under Bottom Line Driven Proportional Strategy. The last step in Final Review is to spot check the final production media before delivery.

The next chart is another high level summary of how to use predictive coding for legal search. It is still somewhat simplified, but provides more detail than the prior six-step model. It provides an idea of how it really works. The full legal method on how to use predictive coding is more complex. For instance, the one Losey uses and still keeps proprietary is twenty-three steps, not eight, and the checklist description takes four pages. But complex or not, Losey insists that it can be taught to any experienced lawyer, especially if they are hands-on, trying it out with a software platform. Here is the simplified eight-step version of Predictive Coding best practice. The circular flows depict the iterative steps specific to the predictive coding features.

In step-1, the process starts with ESI Discovery Communications, or Dialogues, not only with opposing counsel or other requesting parties, but also with the client and within the e-discovery team assigned to the case. Good communications are critical to all successful project management. The ESI Discovery Communications should be facilitated by the lead e-Discovery specialist attorney assigned to the case, but include active participation by the team trial lawyers.

In step-2, the Subject Matter Experts (“SME”) on the case (typically the partners, senior associates, and sometimes also the e-Discovery specialist attorney assigned to the case), perform manual reviews of search samples of the data. The samples are selected by their skilled judgments, not random samples. The selections are made with the help of various software search features, including keyword searches, similarity searches, and concept searches. See Losey, R., Robots From The Not-Too-Distant Future Explain How They Use Random Sampling For Artificial Intelligence Based Evidence Search.

Step-3 in the diagram above, Random Baseline, is where statistically random sampling is used to establish a baseline for quality control purposes. Most software also uses this random sampling selection and SME coding for machine training.

Step-4 is the Auto Coding Run where the software’s predictive coding calculations begin. This is also known as the First Iteration of seed set training. Here the predictive coding software analyzes all of the categorizations made by SMEs in the prior steps so long as the documents were designated by them as training documents. Based on this input an Auto Coding Run begins where the software scans all of the data uploaded onto the review platform (the corpus) and assigns a probability value from 0 to 100 to each document in the corpus. A value of 100 represents the highest probability (100%) that the document matches the category trained, such as relevant, or highly relevant; a value of 0 means no likelihood of matching; whereas 50 represents equal likelihood. In the initial Auto Coding Runs the software predictions as to a document’s categorization are often wrong, sometimes wildly so, depending on the kind of search and data involved. That is why spot checking and further training are always needed for predictive coding to work properly.
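To make the 0 to 100 scoring concrete, here is a toy sketch in Python. It uses a small naive Bayes text classifier written from scratch, which is an assumption chosen for brevity and is not the algorithm of any particular predictive coding product. The training documents, corpus, and function names are all hypothetical.

```python
import math
from collections import Counter

def train(training_docs):
    """training_docs: list of (text, is_relevant) pairs coded by the SME."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in training_docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def score(text, model):
    """Return a 0-100 value: the estimated probability of relevance."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_score = {}
    for label in (True, False):
        # Laplace-smoothed class prior and word likelihoods.
        logp = math.log((doc_counts[label] + 1) / (total_docs + 2))
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                logp += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        log_score[label] = logp
    # Convert the two log scores into a probability of relevance.
    diff = max(min(log_score[False] - log_score[True], 700), -700)
    return round(100 / (1 + math.exp(diff)))

# Hypothetical SME-coded training documents.
training = [
    ("pricing agreement with competitor", True),
    ("confidential pricing discussion", True),
    ("lunch menu for friday", False),
    ("office party friday", False),
]
model = train(training)

# Hypothetical corpus, ranked from most to least probably relevant.
corpus = {"D1": "competitor pricing memo", "D2": "friday lunch plans", "D3": "quarterly report"}
ranking = sorted(corpus, key=lambda d: score(corpus[d], model), reverse=True)
for doc_id in ranking:
    print(doc_id, score(corpus[doc_id], model))
```

In this toy data, a document that shares no words with the training set lands at 50, the point of equal likelihood. With so little training, such scores say little, which is why further rounds of review are needed.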

Step-5 is where Prediction Error Corrections are made. Lawyers and paralegals find and correct the computer errors by a variety of methods. The predictive coding software learns from the corrections. Steps 4 and 5 then repeat as shown in the diagram. This iterative process is considered a virtuous feedback loop that continues until the computer predictions are accurate enough to satisfy the proportional demands of the case.
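One way to judge whether the predictions are "accurate enough" after an iteration is to compare them with the SME's coding of the control set. The sketch below computes precision and recall for that comparison. The cutoff of 50, the scores, and the document identifiers are hypothetical, and what counts as sufficient accuracy remains a judgment tied to the proportional demands of the case.

```python
def precision_recall(predicted, actual):
    """Compare predictions with SME codes on the control set.

    predicted, actual: dicts mapping doc_id -> bool (True = relevant).
    """
    tp = sum(1 for d in actual if predicted[d] and actual[d])
    fp = sum(1 for d in actual if predicted[d] and not actual[d])
    fn = sum(1 for d in actual if not predicted[d] and actual[d])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical control set: SME codes and the software's current 0-100 scores.
sme_codes = {"C1": True, "C2": True, "C3": True, "C4": False, "C5": False}
scores = {"C1": 91, "C2": 64, "C3": 30, "C4": 72, "C5": 5}

cutoff = 50  # assumed threshold for "predicted relevant"
predicted = {doc_id: s >= cutoff for doc_id, s in scores.items()}

precision, recall = precision_recall(predicted, sme_codes)
print(f"precision {precision:.0%}, recall {recall:.0%}")
```

Tracking these two numbers across iterations shows whether additional rounds of correction are still improving the predictions.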

Step-6, Random QC Test, is where the reasonability of the decision to stop the training is evaluated by an objective quality control test. The test is based on a random sample of all documents to be excluded from the Final Review for possible production. The exclusion can be based on category prediction (i.e., probable irrelevant), on probable ranking of documents with proportionate cut-offs, or on both. The focus is on a search for any false negatives (i.e., relevant documents incorrectly predicted to be irrelevant) that are Hot or otherwise of significance. 100% recall of all relevant documents is both scientifically impossible and legally unnecessary under best practices for proportional review. But the goal is to avoid all false negatives of Hot documents. If this error is encountered, an additional iteration of Steps 4 & 5 is usually required.
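The test just described can be sketched as follows: draw a random sample from the documents slated for exclusion, have the SME code it, and check for false negatives, especially Hot ones. The function names, sample sizes, and the three code labels are assumptions for illustration. The pass rule used here (no Hot documents in the sample) mirrors the stated goal, but a real project may apply additional criteria.

```python
import random

def draw_qc_sample(excluded_ids, sample_size, seed=7):
    """Draw a simple random sample from the documents to be excluded."""
    rng = random.Random(seed)
    pool = sorted(excluded_ids)
    return rng.sample(pool, min(sample_size, len(pool)))

def evaluate_qc_sample(sample_codes):
    """sample_codes: SME codes for the sample, each 'irrelevant', 'relevant', or 'hot'."""
    n = len(sample_codes)
    false_negatives = sum(1 for c in sample_codes if c in ("relevant", "hot"))
    hot_found = sum(1 for c in sample_codes if c == "hot")
    return {
        "false_negative_rate": false_negatives / n if n else 0.0,
        "hot_found": hot_found,
        "passed": hot_found == 0,  # any Hot false negative sends the project back to Steps 4 and 5
    }

# Hypothetical excluded set of 200,000 documents; sample 1,500 of them.
excluded = [f"DOC-{i:06d}" for i in range(200_000)]
qc_sample = draw_qc_sample(excluded, 1500)

# Hypothetical SME coding of the sample: 10 merely relevant, none Hot.
result = evaluate_qc_sample(["irrelevant"] * 1490 + ["relevant"] * 10)
print(result)
```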

Step-7 is where Proportional Final Review is performed, and the final decisions are made on the number of documents to be reviewed for possible production. A decision is made on the number of documents predicted by IR to be Relevant or Hot for review within budget. The standard of $3.00 to $5.00 per file may be used, or specific calculations made based upon metrics gathered in that project. After agreement with the requesting party is reached, or a court order is attained, the Final Review is done, including redactions and privileged document spotting. In large cases the Final Review may be outsourced to a document review team, with higher-rate, more highly skilled attorneys playing only a supervisory role.
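The budget arithmetic behind this step is simple enough to sketch. Given a ranked list and a cost per file, the budget fixes how far down the ranking the Final Review can go. The budget figure and document identifiers below are hypothetical; the $3.00 and $5.00 rates are the ones quoted above.

```python
def documents_within_budget(ranked_ids, budget_dollars, cost_per_file):
    """Return the top-ranked documents that the review budget can cover.

    ranked_ids must already be sorted from most to least probably relevant.
    """
    if cost_per_file <= 0:
        raise ValueError("cost_per_file must be positive")
    affordable = int(budget_dollars // cost_per_file)
    return ranked_ids[:affordable]

# Hypothetical: 4,000 documents predicted relevant, already ranked.
ranked = [f"DOC-{i:05d}" for i in range(4000)]
budget = 15_000  # hypothetical review budget in dollars

at_high_rate = documents_within_budget(ranked, budget, 5.00)
at_low_rate = documents_within_budget(ranked, budget, 3.00)
print(len(at_high_rate))  # 3000 documents at $5.00 per file
print(len(at_low_rate))   # 4000: the budget covers all of them at $3.00 per file
```

Documents below the cut-off are the ones that are predicted relevant but fall outside the budgeted review.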

Step-8 of Phased Production is where the documents are actually produced after a last quality control check of the media on which the production is made, typically CDs or DVDs. The final work includes preparation of a privilege log, which is typically delayed until after production. Also, production is usually done in rolling stages as review is completed in order to buy more time and good will.

I have found that the best software and training methods use what is known as an active learning process in steps four and five in the chart. My preferred active learning process in the iterative machine learning steps is threefold:

  1. The computer selects documents for review where the software classifier is uncertain of the correct classification. This helps the classifier algorithms to learn by adding diversity to the documents presented for review. This in turn helps to locate outliers of a type your initial judgmental searches in steps two and five may have missed.
  2. Some reasonable percentage of the documents presented for human review in step five are selected at random. This again helps maximize recall and avoid premature focus on the relevant documents initially retrieved.
  3. All other relevant documents that the skilled reviewer can find using a variety of search techniques suggested by the documents the computer selects. That may include some linear review of selected custodians or dates, parametric Boolean keyword searches, similarity searches of all kinds, concept searches, as well as several unique predictive coding probability searches. I call that a multimodal approach.
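The threefold selection above can be sketched as a function that assembles one round of training documents. Everything here is an assumption for illustration: the function name, the batch sizes, the use of distance from 0.5 as the measure of uncertainty, and the sample data. The judgmental hits are simply passed in, since they come from the reviewer's own searches.

```python
import random

def build_training_round(scores, judgmental_hits, n_uncertain=4, n_random=2, seed=1):
    """Assemble one training round from three sources, without duplicates.

    scores: dict doc_id -> probability of relevance in [0, 1].
    judgmental_hits: doc_ids found by the reviewer's own multimodal searches.
    """
    rng = random.Random(seed)
    # 1. Documents the classifier is least sure about (closest to 0.5).
    uncertain = sorted(scores, key=lambda d: (abs(scores[d] - 0.5), d))[:n_uncertain]
    # 2. A random selection from everything else, to keep the search open.
    remaining = sorted(set(scores) - set(uncertain))
    randoms = rng.sample(remaining, min(n_random, len(remaining)))
    # 3. Everything the reviewer found by other search methods.
    already = set(uncertain) | set(randoms)
    judgmental = [d for d in judgmental_hits if d not in already]
    return {"uncertain": uncertain, "random": randoms, "judgmental": judgmental}

# Hypothetical scores for 20 documents and three reviewer-found documents.
scores = {f"D{i:02d}": i / 20 for i in range(20)}
hits = ["D19", "D18", "D10"]

round_docs = build_training_round(scores, hits)
for source, docs in round_docs.items():
    print(source, docs)
```

Keeping the three sources separate in the result makes it easy to report how much of each round came from machine selection, random sampling, and expert judgment.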

The initial seed set generation, step two in the chart, should also use some random samples, plus judgmental multimodal type searches. Steps three and six in the chart always use pure random samples and rely on statistical analysis.

My insistence on the use of multimodal judgmental sampling in steps two and five to locate relevant documents follows the consensus view of information scientists specializing in information retrieval, but is not followed by several prominent predictive coding software vendors in e-discovery. They instead rely entirely on machine selected documents for training, or even worse, rely entirely on randomly selected documents to train the software. In my writings I call this process the Borg approach. It unnecessarily minimizes the role of legal expert input, and the usefulness of other types of search to supplement an active learning process. I instead use a hybrid approach where the expert reviewers remain in control of the process, and their expertise is leveraged for greater accuracy and speed of review.

For a good introductory overview of some important considerations in using predictive coding see Ten Essential Best Practices in Predictive Coding by Warwick Sharp (Today’s General Counsel, May 2013). I summarize the ten general points made in this article as follows: 1. carefully choose the SME(s); 2. begin with collaborative training; 3. know the limitations of your software; 4. understand differences of the issues trained; 5. and 6. build a separate random control set to test and measure the training set (this is usually built into the software); 7. use hybrid training over the entire corpus; 8. use quality controls for the SME training that focus on consistency; 9. use software that provides graduated ranking of documents, for instance 0% to 100%, not simple relevant or not-relevant; and, 10. validate results (step-6 in the diagram above).

For more background on predictive coding search methods and Losey’s background in the area, see the e-Discovery Team CAR page.

If you have any suggestions and care to contribute to this project, or any questions (nothing case specific please), please leave a comment below.

Comments on “Predictive Coding”

  1. An observation I have in the trenches (litigation, civil defense) is that we are not talking about this phrase used above: “all documents uploaded for review.” My enterprise clients cannot believe we would need to collect select custodians’ email and productivity documents from a date range for evaluation; they want to hand me far less material, as they would do with paper. How do I convince them that we know what kind of look is required, particularly in a single plaintiff case/non-aggregated action or “simple” investigation? To put it bluntly, they think that what I suggest is invasive, costly overkill. Adding fuel to the fire: My older colleagues never ask them for that stuff. Maybe we need to price out a sample iterative review for a “basic” controversy, i.e., “Here’s how much cost and effort my recommendation is likely to require.” Five custodians, ten year period, no document destruction policy having pruned the active data, robust in-house review tool . . . We cannot get to the rest before getting past the collection barriers, which we encounter in nearly every case. When I tenderly offer (free) legal advice on litigation readiness, crickets.

  2. Ralph:

    This is a great start on the process. Our research scientists question whether we need senior attorney experts to do much of the random sampling and training. Another approach uses regular (albeit trained) reviewers to tag documents during the training phase and even the initial phase. The system can then evaluate the tags and present the likely mistakes to a senior expert to review. This cuts down on the front end time for that senior lawyer and can actually allow you to quickly tag more seeds, which can improve the rankings quickly. More is more is our motto (and I believe yours as well).

    We believe that the attorneys should find and include as many judgmental seeds as they can find. A random sample can help identify documents which were overlooked but this can also be found during the iterations if the system is smart enough to present a diverse set of documents during the iteration process. The random element comes during that period even if not used at the start.

    Or, you can supplement judgmental seeds (hot docs) with a random sample right from the start. You will get there either way.

    Thanks for all of your insights and great writing.