Predictive Coding

Predictive coding uses supervised machine learning (a type of artificial intelligence) to assist an attorney in the review and classification of electronically stored information. Predictive coding software analyzes whole documents in a dataset, not just keywords, and uses advanced mathematics, including near-infinite-dimensional vector space probability analysis, to order, compare, and rank them.

In predictive coding driven CARs, attorneys train a computer to find documents identified by the attorney as a target, typically documents relevant to a particular lawsuit, or documents in some other classification, such as privileged.
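The core idea can be sketched in a few lines of code. This is a toy illustration, not any vendor's actual algorithm: real predictive coding software uses far richer feature spaces and classifiers, but the workflow is the same — attorneys code a seed set, the software builds a model of "relevant," and the remaining documents are ranked against it. All document text and labels below are hypothetical.

```python
# Toy sketch of predictive coding as supervised text classification.
# Documents become term-count vectors; attorney-coded seeds define a
# "relevant" centroid; unreviewed documents are ranked by similarity to it.
from collections import Counter
from math import sqrt

def vectorize(text):
    """Bag-of-words vector: the document as lowercase term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Average of the training vectors for one class."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})

# Attorney-coded seed set: (document text, relevant?)
seeds = [
    ("merger negotiation draft agreement", True),
    ("merger price terms negotiation", True),
    ("office holiday party schedule", False),
    ("cafeteria menu for the week", False),
]
rel = centroid([vectorize(t) for t, y in seeds if y])

# Rank the unreviewed corpus by similarity to the relevant centroid.
corpus = [
    "agreement on merger terms",
    "updated cafeteria menu",
]
ranked = sorted(corpus, key=lambda d: cosine(vectorize(d), rel), reverse=True)
print(ranked[0])  # the merger document ranks first
```

In a real project the ranking is recomputed after each training round, which is why the iterative steps of the workflow below matter so much.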

Below is a diagram of the latest Predictive Coding 3.0 workflow for use in a typical CAR project.


For a full description of the eight steps see Predictive Coding 3.0 part two.

Please remember that before you begin to actually carry out a predictive coding project as described, you need to plan for it. This is critical to the success of the project. We suggest you consult this detailed outline of a Form Plan for a Predictive Coding Project for a complete checklist.


The use of multimodal judgmental sampling in steps three and seven to locate relevant documents follows the consensus view of information scientists specializing in information retrieval, but it is not followed by several prominent predictive coding software vendors in e-discovery. They instead rely entirely on machine-selected documents for training or, even worse, entirely on randomly selected documents. See Part One of Predictive Coding 3.0, where some of the errors in Predictive Coding 1.0 and 2.0 are described.
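The difference between these approaches is easy to see in code. The sketch below is a hypothetical illustration of a "multimodal" training round: the next batch mixes attorney-picked (judgmental) documents with machine-ranked uncertain documents and a small random draw, rather than relying on any single source. The function name, parameters, and document IDs are all invented for this example.

```python
# Hypothetical multimodal training-batch selection for a predictive
# coding round. Contrast with vendors that use only the uncertain
# (machine-selected) slice, or only the random slice.
import random

def next_training_batch(scores, judgmental_ids, k_uncertain=2, k_random=1, seed=0):
    """scores maps doc_id -> predicted probability of relevance."""
    rng = random.Random(seed)
    # Machine-selected: documents whose score is closest to 0.5,
    # i.e. where the model is most uncertain.
    uncertain = sorted(scores, key=lambda d: abs(scores[d] - 0.5))[:k_uncertain]
    # Random-selected: a small draw from everything not already chosen.
    pool = [d for d in scores if d not in uncertain and d not in judgmental_ids]
    randoms = rng.sample(pool, min(k_random, len(pool)))
    # Judgmental: documents the attorney found by other search methods
    # (keyword, concept, similarity) and chose to add to training.
    return list(judgmental_ids) + uncertain + randoms

scores = {"d1": 0.95, "d2": 0.52, "d3": 0.48, "d4": 0.10, "d5": 0.30}
batch = next_training_batch(scores, judgmental_ids=["d9"])
print(batch)  # d9 (judgmental) plus d2, d3 (uncertain) plus one random doc
```

Dropping either the `judgmental_ids` argument or the random draw reproduces the narrower 1.0/2.0-style protocols the text criticizes.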

For more background on predictive coding search methods and Losey’s background in the area, see the e-Discovery Team CAR page. Click here to see a complete list of his articles on Predictive Coding.

If you have any suggestions and care to contribute to this project, or any questions (nothing case specific please), please leave a comment below.

14 thoughts on “Predictive Coding”

  1. Pingback: Attorneys Admonished by Judge Nolan Not to “Confuse Advocacy with Adversarial Conduct” and Instructed on the Proportionality Doctrine « e-Discovery Team ®

  2. Pingback: Does Your CAR (“Computer Assisted Review”) Have a Full Tank of Gas? « e-Discovery Team ®

  3. An observation I have in the trenches (litigation, civil defense) is that we are not talking about this phrase used above: “all documents uploaded for review.” My enterprise clients cannot believe we would need to collect select custodians’ email and productivity documents from a date range for evaluation; they want to hand me far less material, as they would do with paper. How do I convince them that we know what kind of look is required, particularly in a single plaintiff case/non-aggregated action or “simple” investigation? To put it bluntly, they think that what I suggest is invasive, costly overkill. Adding fuel to the fire: My older colleagues never ask them for that stuff. Maybe we need to price out a sample iterative review for a “basic” controversy, i.e., “Here’s how much cost and effort my recommendation is likely to require.” Five custodians, ten year period, no document destruction policy having pruned the active data, robust in-house review tool . . . We cannot get to the rest before getting past the collection barriers, which we encounter in nearly every case. When I tenderly offer (free) legal advice on litigation readiness, crickets.

  4. Ralph:

    This is a great start on the process. Our research scientists question whether we need senior attorney experts to do much of the random sampling and training. Another approach uses regular (albeit trained) reviewers to tag documents during the training phase and even the initial phase. The system can then evaluate the tags and present the likely mistakes to a senior expert to review. This cuts down on the front end time for that senior lawyer and can actually allow you to quickly tag more seeds, which can improve the rankings quickly. More is more is our motto (and I believe yours as well).

    We believe that the attorneys should find and include as many judgmental seeds as they can find. A random sample can help identify documents which were overlooked but this can also be found during the iterations if the system is smart enough to present a diverse set of documents during the iteration process. The random element comes during that period even if not used at the start.

    Or, you can supplement judgmental seeds (hot docs) with a random sample right from the start. You will get there either way.

    Thanks for all of your insights and great writing.


  5. Pingback: The First Few Weeks of EDBP, the Shoulders of Giants, and the Great Debates of TAR v. CAR and Ball v. Jaar « e-Discovery Team ®

  6. Pingback: Short Personal Summary of My Predictive Coding Methods and Services | e-Discovery Team ®

  7. Pingback: Reinventing the Wheel: My Discovery of Scientific Support for “Hybrid Multimodal” Search | e-Discovery Team ®

  8. Pingback: My Basic Plan for Document Reviews: The “Bottom Line Driven” Approach | e-Discovery Team ®

  9. Pingback: Legal Search Science | e-Discovery Team ®

  10. Pingback: Does Your CAR (“Computer Assisted Review”) Have a Full Tank of Gas?

  11. Pingback: PRESUIT: How Corporate Counsel Could Use “Smart Data” to Predict and Prevent Litigation | e-Discovery Team ®

  12. Pingback: Visualizing Data in a Predictive Coding Project – Part Three | e-Discovery Team ®

  13. Pingback: AI-Enhanced Review | ZEN of Document Review

  14. Pingback: Legal Search Science | ZEN of Document Review

Please leave a comment and help improve EDBP