Predictive Coding

Predictive coding uses supervised machine learning (a type of artificial intelligence) to assist an attorney in the review and classification of electronically stored information (ESI). Predictive coding software analyzes whole documents in a dataset, not just keywords, and uses advanced mathematics, including very high-dimensional vector space probability analysis, to order, compare, and rank them.

In predictive coding driven CARs (computer assisted reviews), attorneys train a computer to find documents identified by the attorney as a target, typically as relevant to a particular lawsuit, or some other classification, such as privileged. The next illustration shows a six-step process of using this review tool. The gears indicate the interlocking nature of the ESI production processes used with predictive coding. (Note: illustration derived from a standard Microsoft diagram on project management, although the words are ours.) Later the same process will be described in slightly greater detail using eight steps.

Please remember that before you begin to actually carry out a predictive coding project as described, you need to plan for it. This is critical to the success of the project. We suggest you consult this detailed outline of a Form Plan for a Predictive Coding Project for a complete checklist.


Step-1: Relevancy Dialogues. You begin with analysis and discussions with your client, your internal team, and opposing counsel as to what it is you are looking for. The point is to clarify the information sought, the target. You cannot just stumble around and hope you know it when you find it (and yet this happens all too often in legal search). You must first know what you are looking for. The target of most searches is the information relevant to disputed issues of fact in a case or investigation. But what exactly does that mean? If you encounter unresolvable disputes with opposing counsel on the scope of relevance, which can happen during any stage of the review despite your best efforts up front, you may have to include the Judge in these discussions and seek a ruling.

This dialogue approach is based on a Cooperative approach to discovery. Cooperation is not only a best practice, but is, to a certain extent at least, a minimum standard required by rules of professional ethics and civil procedure. The primary goal of these dialogues for Predictive Coding purposes is to obtain a common understanding of the e-discovery requests, and reach agreement on the scope of relevancy and production. Additional conferences to make disclosures designed to protect clients’ interests are also a best practice for proportional review.

The additional disclosure may in some cases require disclosure of the actual search techniques used, which is traditionally protected as work product. The disclosures may also sometimes include limited disclosure of some of the seed set documents used, both relevant and irrelevant. Nothing in the rules requires disclosure of irrelevant ESI, but if adequate privacy protections are provided, it may be in the best interests of all parties to do so. Such discretionary disclosures may be advantageous as risk mitigation and efficiency tactics. If an agreement on search protocol is reached by the parties, or imposed by the court, the parties are better protected from the risk of expensive motion practice and repetitions of discovery search and production. Agreement on search protocols can also be used to implement bottom line driven proportional review practices.

See, e.g., the first case approving predictive coding search protocols: Da Silva Moore v. Publicis Groupe, 2012 WL 607412 (S.D.N.Y. Feb. 24, 2012) (approved and adopted in Da Silva Moore v. Publicis Groupe, 2012 WL 1446534, at *2 (S.D.N.Y. Apr. 26, 2012)). Also see Global Aerospace Inc. v. Landow Aviation, L.P., 2012 WL 1431215 (Va. Cir. Ct. Apr. 23, 2012). Many other cases have followed, and they can easily be found with Google. Generally, these early cases are now disfavored with respect to the amount of disclosure the courts required to approve the use of predictive coding. The clear trend today is away from full disclosure, especially for irrelevant documents. Counsel is advised today to offer translucency, not transparency. That means full disclosure of the relevant documents used for training, and even then only when asked, but only partial or otherwise limited disclosure of the irrelevant documents used in training.

Step-2: Initial Sample Reviews by SME Team. The use of Subject Matter Experts (SMEs) is a critical part of predictive coding review. The samples are both random and judgmental. Judgmental samples use all of the various pre-predictive coding search methods discussed in the Hybrid Multimodal page of CAR, including parametric Boolean keyword searches, similarity searches, and concept searches, and even strategic linear reviews of select custodians and date ranges. Random samples broaden the search somewhat and make possible various types of statistical analysis. For instance, the random sample can provide a baseline calculation of the prevalence of relevant documents in the corpus for quality control purposes. The review by SMEs of random and judgmental samples provides the first machine training input for the predictive coding software.
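To make the prevalence baseline concrete, here is a minimal sketch of how a random sample can be turned into a prevalence estimate with a confidence range. It is an illustration only, not the calculation any particular review platform performs. The Wilson score interval is our choice of method, and the function names and the sample figures (30 relevant documents in a sample of 1,500, drawn from a corpus of one million) are hypothetical.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95% confidence)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def prevalence_baseline(relevant_in_sample: int, sample_size: int, corpus_size: int) -> dict:
    """Project the sample prevalence onto the whole corpus as a range, not a single number."""
    low, high = wilson_interval(relevant_in_sample, sample_size)
    return {
        "point": relevant_in_sample / sample_size,
        "low": low,
        "high": high,
        "est_relevant_low": round(low * corpus_size),
        "est_relevant_high": round(high * corpus_size),
    }


# Hypothetical figures: SMEs found 30 relevant documents in a random sample of 1,500.
baseline = prevalence_baseline(30, 1500, 1_000_000)
print(baseline)
```

The point of reporting a range is that the projected number of relevant documents in the corpus is only as precise as the sample allows.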

The first round of machine training is also sometimes called the initial Seed Set Build. The following training rounds are identified by number, such as the second round of training, the third, etc. There is nothing particularly special about the first round of training, except for the fact that it cannot include rank-based document searches because no ranking has yet occurred. The ranking of documents according to probable relevance is only established by the machine training.

At the commencement of the project, but using different documents selected by random sample, some predictive coding software will also create what is called a control set. The SMEs code documents in both the first training set and the control set, and may be unaware which of the documents selected at random for them to code are designated for the control set, and which for the initial training set. Typically, both the control documents and the random documents used in the seed set are part of the first random sample. The control set is used solely for testing the work of the SME during the iterative training process. It is not used for training. The control set documents test for SME consistency and for overall training effectiveness. The documents marked by the SME for the control set cannot also be used for training, as this can introduce bias in the testing. See Losey, R., Robots From The Not-Too-Distant Future Explain How They Use Random Sampling For Artificial Intelligence Based Evidence Search. The value of control sets is now very much in question due to the problem of relevance shift, namely that the scope of relevance normally changes as a review progresses. This makes the initial adjudications a suspect basis on which to measure the accuracy of all later document classifications. There are also serious statistical limitations to control set measures, especially in the low prevalence datasets common to legal search.
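The statistical limitation at low prevalence can be shown with simple arithmetic. The sketch below is ours and rests on stated assumptions: recall is measured only on the relevant documents that happen to fall in the control set, and the worst-case normal-approximation margin of error is used, which is itself rough when the count is small. The function name and the control set size of 1,500 are hypothetical.

```python
import math


def control_set_recall_margin(control_size: int, prevalence: float, z: float = 1.96):
    """Return (expected relevant docs in the control set, worst-case margin of error on recall).

    Recall can only be measured on the relevant documents in the control set,
    so the effective sample size is control_size * prevalence, not control_size.
    """
    expected_relevant = control_size * prevalence
    if expected_relevant <= 0:
        raise ValueError("control set must be expected to contain relevant documents")
    # Worst case for a proportion is p = 0.5, giving variance 0.25 / n.
    margin = z * math.sqrt(0.25 / expected_relevant)
    return expected_relevant, min(1.0, margin)


for prevalence in (0.01, 0.05, 0.20):
    relevant, margin = control_set_recall_margin(1500, prevalence)
    print(f"prevalence {prevalence:.0%}: about {relevant:.0f} relevant docs, "
          f"recall margin roughly +/- {margin:.0%}")
```

At 1% prevalence a 1,500 document control set is expected to hold only about 15 relevant documents, so any recall figure computed from it carries a very wide margin.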

Step-3: Three-Cylinder Training Set Reviews. The next step continues the machine training in an iterative process. In a hybrid active learning type of predictive coding, the machine selects some of the documents for you to train and you select some. The machine usually selects the documents on which it would like your input. Typically that is a selection of documents whose classification the machine is unsure of. They are usually the documents with a probability ranking between 40% and 60%. That is one of the three cylinders of a predictive coding search engine. See Losey, Three-Cylinder Multimodal Approach To Predictive Coding. The other two are random sampling, to keep the search open, which has questionable value, and judgmental sampling, i.e., human selection, just as in the initial round of training.
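The three cylinders can be sketched as a simple batch selector. This is an illustration under our own assumptions, not the selection logic of any actual product: the function name, the batch size, the 20% random share, and the order in which the cylinders fill the batch are all hypothetical. Only the 40% to 60% uncertainty band comes from the description above.

```python
import random


def select_training_batch(scores, judgmental_ids, batch_size=10,
                          random_share=0.2, band=(0.40, 0.60), seed=0):
    """Build one training batch from three sources; returns {doc_id: source}.

    scores: dict of doc_id -> current probability of relevance (0.0 to 1.0).
    judgmental_ids: documents the human reviewer chose by any search method.
    """
    rng = random.Random(seed)
    batch = {}

    # Cylinder 1: judgmental (human-selected) documents go in first.
    for doc_id in judgmental_ids:
        if doc_id in scores and len(batch) < batch_size:
            batch[doc_id] = "judgmental"

    # Cylinder 2: machine-selected documents the classifier is unsure about,
    # taking those closest to 50% first and leaving room for the random quota.
    random_quota = int(batch_size * random_share)
    uncertain = sorted(
        (d for d, p in scores.items() if band[0] <= p <= band[1] and d not in batch),
        key=lambda d: abs(scores[d] - 0.5),
    )
    room = max(0, batch_size - random_quota - len(batch))
    for doc_id in uncertain[:room]:
        batch[doc_id] = "uncertain"

    # Cylinder 3: random documents fill whatever space is left.
    remaining = [d for d in scores if d not in batch]
    rng.shuffle(remaining)
    for doc_id in remaining[: batch_size - len(batch)]:
        batch[doc_id] = "random"
    return batch


# Toy corpus: 100 documents with probabilities 0.00, 0.01, ... 0.99.
scores = {f"doc{i}": i / 100 for i in range(100)}
batch = select_training_batch(scores, judgmental_ids=["doc97", "doc3"])
for doc_id, source in batch.items():
    print(doc_id, scores[doc_id], source)
```

In practice the mix between the three sources is a judgment call that changes as the review progresses.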

Step-4: Machine Learning Iterations is where the software takes the input provided by the SME team and extrapolates and applies those classifications to the entire collection of documents. The predictive coding software then ranks all documents in the collection, usually on a percentage basis, from 99.9% probable relevant down to 0.01%. This key predictive coding step is repeated as needed for quality control (QC) purposes. These iterations continue until the training is complete within the proportional constraints of the case. At that point the attorney in charge of the search may declare the search complete and ready for the next quality assurance test.

Step-5: Quality Assurance Tests are based primarily on random sampling to verify the effectiveness of the final rankings. They are used to verify, or not, the reasonability of the search and predictive coding parameters developed. If the test is not passed, the review is reactivated for additional rounds of review and machine learning iterations. If the review passes the tests, then the first-pass relevancy review is complete, and the project moves to the next and final step. For more on sampling and its use in both quality control and quality assurance, see the Review Quality Control page of EDBP; Introducing “ei-Recall” – A New Gold Standard for Recall Calculations in Legal Search, Part One, Part Two and Part Three; In Legal Search Exact Recall Can Never Be Known; and the KPMG publications: Gabriel, Manfred, Quality Control For Predictive Coding In e-Discovery (2013); and, The Case For Statistical Sampling In e-Discovery (2012).

Step-6: Protection Reviews and Productions is the last step. It typically comes only after the Quality Assurance tests have been satisfied, but sometimes may take place as part of the Machine Learning Iterations. The latter is sometimes called a CAL or Continuous Active Learning approach. The manual review is usually focused on the eighth Protections step of the EDBP. Predictive coding in this step is used as a way to make efficient assignments to manual reviewers of documents likely privileged, or confidential, or of questionable relevance. The final manual reviews of the documents before production are limited to documents already coded as relevant, or predicted as relevant without manual coding. This step is done by attorneys knowledgeable about privilege and confidentiality issues in the case, usually outside contract review attorneys. If documents are found that were predicted to be relevant, and thus assigned to reviewers, but were not in fact relevant in the reviewer’s judgment, then additional machine training may be required. This is common in CAL approaches.

In this final step the predicted relevant coding is confirmed and redactions are made on the documents within budgetary limits as described, and privileged documents are identified and removed from production for logging. The documents culled by predictive coding, or other search methods, are excluded and not subject to expensive final manual review. So too are documents predicted relevant, but ranked below the budgeted amount under Bottom Line Driven Proportional Strategy. The last step in Final Review is to spot check the final production media before delivery.

The next chart is another high level summary of how to use predictive coding for legal search. It assumes a CAL approach, which is typically the most effective technique to use. This eight-step chart is still somewhat simplified, but provides more detail than the prior six-step model. The full legal method on how to use predictive coding is more complex. For instance, the one Losey uses and still keeps proprietary is twenty-three steps, not eight, and the checklist description takes four pages. Also see Losey’s Form Plan of a Predictive Coding Project, which is a detailed Outline for project planning purposes.

But complex or not, Losey has found that it can be taught to any experienced lawyer in a second-chair type of hands-on training. Mere intellectual descriptions, such as the one we are giving now, can serve as good preparation for later effective apprenticeship training. Here is the simplified eight-step version of Predictive Coding best practice. The circular flows depict the iterative steps specific to the predictive coding features.

Step-1: ESI Discovery Communications.  As in the six-step description, the process starts with ESI Discovery Communications, or Dialogues, not only with opposing counsel or other requesting parties, but also with the client and within the e-discovery team assigned to the case. Analysis of the scope of the discovery, and clear communications on relevance and other review procedures, are critical to all successful project management. The ESI Discovery Communications should be facilitated by the lead e-Discovery specialist attorney assigned to the case, but include active participation by the team trial lawyers. See the description of step-1 in the six-step description above for further clarification.

Step-2: Multimodal Search Review. Again, this step is the same as Step-2 in the six-step description, which we called Initial Sample Reviews by SME Team. Here it is renamed to emphasize the fact that all types of search methods are used to try to find as many relevant documents as possible for the training rounds. In the process many irrelevant documents are incidentally found and coded, although usually only select samples are used in later training.

The multimodal search review is carried out under the general supervision of Subject Matter Experts on the case who make final decisions where there is doubt concerning the relevance of a document or document type. The SME role is typically performed by the partners, senior associates, and sometimes also the e-Discovery specialist attorney(s) assigned to the case. The SME team then trains and supervises the document review specialists, aka, contract review attorneys who usually do most of the manual reviews (Step-Five). It is, however, very important for all members of the SME team to have direct and substantial contact with the actual documents. For instance, they should see all hot documents found in any step of the process.

The hands-on involvement of the SMEs is especially needed in the second step, and thereafter may be limited. In most cases, to ensure that proper expertise is generated for a project, the contract review attorneys are not brought in until Step Three, Random Baseline, or even Step Five, document review after training has begun. In others, at least one contract review expert is brought in at Step-Two to assist the SMEs and later help in training of additional reviewers if and when they are included in Step-Five of the project. The e-Discovery specialist with special expertise and experience with search, along with an expert on the software being used, the Power-User, should be involved in all stages of the project.

The samples found by the multimodal search methods in Step-Two are selected by human judgment, not by random samples. The selections are made with the help of various software search features, including keyword search, similarities, and concept searches.

Step-3 in the diagram above, Random Baseline, is where statistically random sampling is used to establish a baseline for quality control purposes. Most software also uses this random sampling selection and SME coding for machine training.

Step-4 is the Auto Coding Run where the software’s predictive coding calculations begin. This is also known as the First Iteration of seed set training. Here the predictive coding software analyzes all of the categorizations made by SMEs in the prior steps so long as the documents were designated by them as training documents. Based on this input an Auto Coding Run begins where the software scans all of the data uploaded onto the review platform (the corpus) and assigns a probability value from 0 to 100 to each document in the corpus. A value of 100 represents the highest probability (100%) that the document matches the category trained, such as relevant, or highly relevant; a value of 0 means no likelihood of matching; whereas 50 represents equal likelihood. In the initial Auto Coding Runs the software predictions as to a document’s categorization are often wrong, sometimes wildly so, depending on the kind of search and data involved. That is why spot checking and further training are always needed for predictive coding to work properly.
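As a rough picture of what an Auto Coding Run produces, here is a toy scorer that learns from a handful of coded training documents and assigns every document in a small corpus a value from 0 to 100. It uses a simple word-count naive Bayes model, which is our stand-in for illustration; commercial predictive coding software uses its own, generally undisclosed, algorithms. All document texts, names, and functions below are hypothetical.

```python
import math
from collections import Counter


def train(labeled):
    """labeled: list of (text, is_relevant). Returns per-class word counts and doc counts."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in labeled:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts


def score(text, model):
    """Return a 0 to 100 probability-style score that the text is relevant."""
    word_counts, doc_counts = model
    vocab = set(word_counts[True]) | set(word_counts[False])
    total_rel = sum(word_counts[True].values())
    total_irr = sum(word_counts[False].values())
    # Start from the (smoothed) class balance, then add evidence word by word.
    log_odds = math.log((doc_counts[True] + 1) / (doc_counts[False] + 1))
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        p_rel = (word_counts[True][word] + 1) / (total_rel + len(vocab))
        p_irr = (word_counts[False][word] + 1) / (total_irr + len(vocab))
        log_odds += math.log(p_rel / p_irr)
    log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() well behaved
    return round(100 / (1 + math.exp(-log_odds)))


# Hypothetical SME coding from the earlier steps.
training = [
    ("merger price fixing agreement", True),
    ("price fixing meeting notes", True),
    ("lunch menu friday", False),
    ("holiday party lunch", False),
]
corpus = {
    "a": "notes on price agreement",
    "b": "friday lunch party",
    "c": "quarterly newsletter",
}

model = train(training)
scores = {doc_id: score(text, model) for doc_id, text in corpus.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Document "c" shares no words with the training set, so the toy model has no basis to lean either way and leaves it at 50, the equal likelihood value.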

Step-5 is where Prediction Error Corrections are made. Lawyers and paralegals find and correct the computer errors by a variety of methods. The predictive coding software learns from the corrections.  Steps 4 and 5 then repeat as shown in the diagram. This iterative process is considered a virtuous feedback loop that continues until the computer predictions are accurate enough to satisfy the proportional demands of the case.
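The repeat of Steps 4 and 5 can be sketched as a loop. The version below is a bare skeleton under our own assumptions: the stopping rule (stop when the share of reviewed predictions that needed correction falls to 5% or less) is hypothetical and far cruder than a real proportionality judgment, and the scoring, batch selection, and human review functions are toy stand-ins supplied by the caller.

```python
def training_loop(score_corpus, human_review, pick_batch,
                  max_rounds=10, max_error_rate=0.05):
    """Alternate machine scoring (Step 4) and human correction (Step 5).

    score_corpus(labels) -> {doc_id: 0-100 score}
    pick_batch(scores, labels) -> list of doc_ids to review this round
    human_review(doc_id) -> True if the reviewer codes the document relevant
    """
    labels, history = {}, []
    for _ in range(max_rounds):
        scores = score_corpus(labels)                 # Step 4: auto coding run
        batch = pick_batch(scores, labels)
        if not batch:
            break
        errors = 0
        for doc_id in batch:                          # Step 5: error corrections
            predicted = scores[doc_id] >= 50
            actual = human_review(doc_id)
            errors += predicted != actual
            labels[doc_id] = actual                   # corrections feed the next round
        history.append(errors / len(batch))
        if history[-1] <= max_error_rate:             # hypothetical stopping rule
            break
    return labels, history


# Toy stand-ins, for illustration only.
texts = {"d1": "price fixing memo", "d2": "lunch menu", "d3": "price list",
         "d4": "party invite", "d5": "fixing prices call", "d6": "gym schedule"}
truth = {"d1": True, "d2": False, "d3": True, "d4": False, "d5": True, "d6": False}


def toy_scorer(labels):
    # Score 100 if a document shares a word with any document coded relevant so far.
    relevant_words = {w for d, rel in labels.items() if rel for w in texts[d].split()}
    return {d: 100 if relevant_words & set(t.split()) else 0 for d, t in texts.items()}


def next_two_unreviewed(scores, labels):
    return [d for d in scores if d not in labels][:2]


labels, history = training_loop(toy_scorer, truth.get, next_two_unreviewed)
print(history, labels)
```

In this toy run the first round gets one of two predictions wrong, the correction teaches the scorer the relevant vocabulary, and the second round needs no corrections.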

Step-6, Random QC Test, is where the reasonability of the decision to stop the training is evaluated by an objective quality control test. The test is based on a random sample of all documents to be excluded from the Final Review for possible production. The exclusion can be based on category prediction (i.e., probable irrelevant), on probable ranking of documents with proportionate cut-offs, or on both. The focus is on a search for any false negatives (i.e., relevant documents incorrectly predicted to be irrelevant) that are Hot or otherwise of significance. 100% recall of all relevant documents is both scientifically impossible and legally unnecessary under best practices for proportional review. But the goal is to avoid all false negatives of Hot documents. If this error is encountered, an additional iteration of Steps 4 & 5 is usually required.
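A random QC test of the excluded documents can be sketched as follows. This is a simplified illustration, not a prescribed protocol: the pass rule (no Hot false negatives in the sample) follows the goal stated above, while the Wilson upper bound, the recall floor calculation, the function names, and all the numbers are our own assumptions.

```python
import math
import random


def wilson_upper(hits: int, n: int, z: float = 1.96) -> float:
    """Upper end of the Wilson score interval for a proportion."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return min(1.0, center + half)


def random_qc_test(null_set, human_review, sample_size, found_relevant, seed=0):
    """Sample the documents slated for exclusion and look for false negatives.

    human_review(doc_id) returns "irrelevant", "relevant", or "hot".
    found_relevant is the number of relevant documents already located.
    """
    ids = sorted(null_set)
    if not ids:
        raise ValueError("nothing to sample")
    rng = random.Random(seed)
    sample = rng.sample(ids, min(sample_size, len(ids)))
    calls = [human_review(d) for d in sample]
    hot = calls.count("hot")
    false_negatives = hot + calls.count("relevant")
    # Project a cautious (upper bound) count of missed documents onto the whole null set.
    fn_upper = wilson_upper(false_negatives, len(sample)) * len(ids)
    return {
        "sample_size": len(sample),
        "false_negatives": false_negatives,
        "hot_false_negatives": hot,
        "elusion_rate": false_negatives / len(sample),
        "recall_floor": found_relevant / (found_relevant + fn_upper),
        "passed": hot == 0,
    }


# Hypothetical null set of 10,000 documents, 1% of which are (non-hot) relevant.
def toy_review(doc_id):
    return "relevant" if doc_id % 100 == 0 else "irrelevant"


result = random_qc_test(range(10_000), toy_review, sample_size=1500, found_relevant=5000)
print(result)
```

A failed test, meaning any Hot document in the sample, would send the project back for another iteration of Steps 4 and 5.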

Step-7 is where Proportional Final Review is performed, and the final decisions are made on the number of documents to be reviewed for possible production. A decision is made on the number of documents predicted by IR to be Relevant or Hot for review within budget. The standard of $3.00 to $5.00 per file may be used, or specific calculations made based upon metrics gathered in that project. After agreement with the requesting party is reached, or a court order is obtained, the Final Review is done, including redactions and privileged document spotting. In large cases the Final Review may be outsourced to a document review team, with higher-rate and more highly skilled attorneys playing only a supervisory role.
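The budget arithmetic behind a proportional cut-off is straightforward. The sketch below assumes a ranked list of documents and a flat per-file review cost in the $3.00 to $5.00 range mentioned above; the function name, the toy ranking, and the $150 budget are hypothetical.

```python
def plan_final_review(ranked_docs, budget, cost_per_file):
    """Pick the top-ranked documents that fit the review budget.

    ranked_docs: list of (doc_id, score) sorted from highest score to lowest.
    """
    if cost_per_file <= 0:
        raise ValueError("cost per file must be positive")
    affordable = int(budget // cost_per_file)
    selected = ranked_docs[:affordable]
    return {
        "affordable": affordable,
        "selected": [doc_id for doc_id, _ in selected],
        # Lowest score that still makes it into Final Review, if any.
        "cutoff_score": selected[-1][1] if selected else None,
        "estimated_cost": len(selected) * cost_per_file,
    }


# Toy ranking: 100 documents scored 100 down to 1.
ranked = [(f"doc{i}", 100 - i) for i in range(100)]

plan_high_cost = plan_final_review(ranked, budget=150.0, cost_per_file=5.0)
plan_low_cost = plan_final_review(ranked, budget=150.0, cost_per_file=3.0)
print(plan_high_cost["affordable"], plan_high_cost["cutoff_score"])
print(plan_low_cost["affordable"], plan_low_cost["cutoff_score"])
```

The same budget reaches deeper into the ranking when the per-file cost is lower, which is why project-specific cost metrics matter when setting the cut-off.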

Step-8 of Phased Production is where the documents are actually produced after a last quality control check of the media on which the production is made, typically CDs or DVDs. The final work includes preparation of a privilege log, which is typically delayed until after production. Also, production is usually done in rolling stages as review is completed in order to buy more time and good will.

I have found that the best software and training methods use what is known as an active learning process in steps four and five in the chart. My preferred active learning process in the iterative machine learning steps is threefold:

  1. The computer selects documents for review where the software classifier is uncertain of the correct classification. This helps the classifier algorithms to learn by adding diversity to the documents presented for review. This in turn helps to locate outliers of a type your initial judgmental searches in steps two and five have missed.
  2. Some reasonable percentage of the documents presented for human review in step five are selected at random. This again helps maximize recall and avoid premature focus on the relevant documents initially retrieved.
  3. All other relevant documents that the skilled reviewer can find using a variety of search techniques suggested by the documents the computer selects. That may include some linear review of selected custodians or dates, parametric Boolean keyword searches, similarity searches of all kinds, concept searches, as well as several unique predictive coding probability searches. I call that a multimodal approach.

The initial seed set generation, step two in the chart, should also use some random samples, plus judgmental multimodal type searches. Steps three and six in the chart always use pure random samples and rely on statistical analysis.

My insistence on the use of multimodal judgmental sampling in steps two and five to locate relevant documents follows the consensus view of information scientists specializing in information retrieval, but is not followed by several prominent predictive coding software vendors in e-discovery. They instead rely entirely on machine selected documents for training, or even worse, rely entirely on randomly selected documents to train the software. In my writings I call this process the Borg approach. It unnecessarily minimizes the role of legal expert input, and the usefulness of other types of search to supplement an active learning process. I instead use a hybrid approach where the expert reviewers remain in control of the process, and their expertise is leveraged for greater accuracy and speed of review.

For a good introductory overview of some important considerations in using predictive coding see Ten Essential Best Practices in Predictive Coding by Warwick Sharp (Today’s General Counsel, May 2013). I summarize the ten general points made in this article as follows: 1. carefully choose the SME(s); 2. begin with collaborative training; 3. know the limitations of your software; 4. understand differences of the issues trained; 5. and 6. build a separate random control set to test and measure the training set (this is usually built into the software); 7. use hybrid training over the entire corpus; 8. use quality controls for the SME training that focus on consistency; 9. use software that provides graduated ranking of documents, for instance 0% to 100%, not simple relevant or not-relevant; and, 10. validate results (step-6 in the diagram above).

For more background on predictive coding search methods and Losey’s background in the area, see the e-Discovery Team CAR page.

If you have any suggestions and care to contribute to this project, or any questions (nothing case specific please), please leave a comment below.

14 thoughts on “Predictive Coding”

  3. An observation I have in the trenches (litigation, civil defense) is that we are not talking about this phrase used above: “all documents uploaded for review.” My enterprise clients cannot believe we would need to collect select custodians’ email and productivity documents from a date range for evaluation; they want to hand me far less material, as they would do with paper. How do I convince them that we know what kind of look is required, particularly in a single plaintiff case/non-aggregated action or “simple” investigation? To put it bluntly, they think that what I suggest is invasive, costly overkill. Adding fuel to the fire: My older colleagues never ask them for that stuff. Maybe we need to price out a sample iterative review for a “basic” controversy, i.e., “Here’s how much cost and effort my recommendation is likely to require.” Five custodians, ten year period, no document destruction policy having pruned the active data, robust in-house review tool . . . We cannot get to the rest before getting past the collection barriers, which we encounter in nearly every case. When I tenderly offer (free) legal advice on litigation readiness, crickets.

  4. Ralph:

    This is a great start on the process. Our research scientists question whether we need senior attorney experts to do much of the random sampling and training. Another approach uses regular (albeit trained) reviewers to tag documents during the training phase and even the initial phase. The system can then evaluate the tags and present the likely mistakes to a senior expert to review. This cuts down on the front end time for that senior lawyer and can actually allow you to quickly tag more seeds, which can improve the rankings quickly. More is more is our motto (and I believe yours as well).

    We believe that the attorneys should find and include as many judgmental seeds as they can find. A random sample can help identify documents which were overlooked but this can also be found during the iterations if the system is smart enough to present a diverse set of documents during the iteration process. The random element comes during that period even if not used at the start.

    Or, you can supplement judgmental seeds (hot docs) with a random sample right from the start. You will get there either way.

    Thanks for all of your insights and great writing.

