Predictive coding uses supervised machine learning (a type of artificial intelligence) to assist an attorney in the review and classification of electronically stored information. Predictive coding software analyzes whole documents in a dataset, not just keywords, and uses advanced mathematics, including near-infinite-dimensional vector space probability analysis, to order, compare, and rank them.
In predictive coding driven CARs, attorneys train a computer to find documents identified by the attorney as a target, typically as relevant to a particular lawsuit, or some other classification, such as privileged. The next illustration shows a six-step process of using this review tool. The gears indicate the interlocking nature of the ESI production processes used with predictive coding. (Note: illustration derived from a standard Microsoft diagram on project management, although the words are ours.) Later the same process will be described in slightly greater detail using eight steps.
Please remember that before you begin to actually carry out a predictive coding project as described, you need to plan for it. This is critical to the success of the project. We suggest you consult this detailed outline of a Form Plan for a Predictive Coding Project for a complete checklist.
Step-1: Relevancy Dialogues. You begin with analysis and discussions with your client, your internal team, and opposing counsel as to what it is you are looking for. The point is to clarify the information sought, the target. You cannot just stumble around and hope you know it when you find it (and yet this happens all too often in legal search). You must first know what you are looking for. The target of most searches is the information relevant to disputed issues of fact in a case or investigation. But what exactly does that mean? If you encounter unresolvable disputes with opposing counsel on the scope of relevance, which can happen during any stage of the review despite your best efforts up-front, you may have to include the Judge in these discussions and seek a ruling.
This dialogue approach is based on a Cooperative approach to discovery. Cooperation is not only a best practice, but is, to a certain extent at least, a minimum standard required by rules of professional ethics and civil procedure. The primary goal of these dialogues for Predictive Coding purposes is to obtain a common understanding of the e-discovery requests, and reach agreement on the scope of relevancy and production. Additional conferences to make disclosures designed to protect clients’ interests are also a best practice for proportional review.
The additional disclosure may in some cases require disclosure of the actual search techniques used, which are traditionally protected as work product. The disclosures may also sometimes include limited disclosure of some of the seed set documents used, both relevant and irrelevant. Nothing in the rules requires disclosure of irrelevant ESI, but if adequate privacy protections are provided, it may be in the best interests of all parties to do so. Such discretionary disclosures may be advantageous as risk mitigation and efficiency tactics. If an agreement on search protocol is reached by the parties, or imposed by the court, the parties are better protected from the risk of expensive motion practice and repetitions of discovery search and production. Agreement on search protocols can also be used to implement bottom line driven proportional review practices. See, e.g., the first case approving predictive coding search protocols: Da Silva Moore v. Publicis Groupe, 2012 WL 607412 (S.D.N.Y. Feb. 24, 2012) (approved and adopted in Da Silva Moore v. Publicis Groupe, 2012 WL 1446534, at *2 (S.D.N.Y. Apr. 26, 2012)). Also see Global Aerospace Inc. v. Landow Aviation, L.P., 2012 WL 1431215 (Va. Cir. Ct. Apr. 23, 2012). Many other cases have followed, and they can be easily found with Google. Generally these early cases are now disfavored with respect to the amount of disclosure the courts required to approve the use of predictive coding. The clear trend today is away from full disclosure, especially for irrelevant documents. Counsel is advised today to offer translucency, not transparency. That means full disclosure of relevant documents used for training, and even then only when asked, but only partial or otherwise limited disclosure of irrelevant documents used in training.
Step-2: Initial Sample Reviews by SME Team. The use of Subject Matter Experts (SMEs) is a critical part of predictive coding review. The samples are both random and judgmental. Judgmental samples use all of the various pre-predictive coding search methods discussed in the Hybrid Multimodal page of CAR, including parametric Boolean keyword searches, similarity searches, and concept searches, and even strategic linear reviews of select custodians and date ranges. Random samples broaden the search somewhat and make possible various types of statistical analysis. For instance, the random sample can provide a baseline calculation of the prevalence of relevant documents in the corpus for quality control purposes. The review by SMEs of random and judgmental samples provides the first machine training input for the predictive coding software.
The first round of machine training is also sometimes called the initial Seed Set Build. The following training rounds are identified by number, such as the second round of training, the third, etc. There is nothing particularly special about the first round of training, except for the fact that it cannot include rank-based document searches because no ranking has yet occurred. The ranking of documents according to probable relevance is only established by the machine training.
At the commencement of the project, but using different documents selected by random sample, some predictive coding software will also create what is called a control set. The SMEs code documents in both the first training set and the control set, and may be unaware which of the documents selected at random for them to code are designated for the control set, and which for the initial training set. Typically, both the control set documents and the random documents used in the seed set are part of the first random sample. The control set is used solely for testing the work of the SME during the iterative training process. It is not used for training. The control set documents test for SME consistency and for overall training effectiveness. The documents marked by the SME for the control set cannot also be used for training as this can introduce bias in the testing. See Losey, R., Robots From The Not-Too-Distant Future Explain How They Use Random Sampling For Artificial Intelligence Based Evidence Search. The value of control sets is now very much in question due to the problem of relevance shift, namely that the scope of relevance normally changes as a review progresses. This makes the initial adjudications a suspect basis on which to measure the accuracy of all later document classifications. There are also serious statistical limitations to control set measures, especially in the low prevalence datasets common to legal search.
Step-3: Three-Cylinder Training Set Reviews. This step continues the machine training in an iterative process. In a hybrid active learning type of predictive coding the machine selects some of the documents for you to review for training, and you select the rest. The machine usually selects the documents on which it would like your input. Typically that is a selection of documents whose classification the machine is unsure of. They are usually the documents with a probability ranking between 40% and 60%. With other designs the machine can also select high-ranking documents, and perhaps only high-ranking, for you to review. The computer selection method, either high-ranking or uncertainty, is one of the three cylinders of a predictive coding search engine. See Losey, Three-Cylinder Multimodal Approach To Predictive Coding. The other two are random sampling, to keep the search open, which has questionable value, and judgmental sampling, i.e., human selection, just as in the initial round of training.
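The two machine selection methods described in this step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_for_review`, the document IDs, and the probability values are all hypothetical, and real review platforms expose this selection through their own interfaces rather than through code like this.

```python
def select_for_review(rankings, mode="uncertainty", low=0.40, high=0.60, batch_size=2):
    """Pick the next documents for human review from a probability ranking.

    rankings: dict mapping a document ID to its probability of relevance (0.0 to 1.0).
    mode: "uncertainty" picks documents in the low..high band, closest to 50% first;
          "high_ranking" picks the highest ranked documents.
    """
    if mode == "uncertainty":
        candidates = [(abs(p - 0.5), doc_id)
                      for doc_id, p in rankings.items() if low <= p <= high]
    elif mode == "high_ranking":
        candidates = [(-p, doc_id) for doc_id, p in rankings.items()]
    else:
        raise ValueError("mode must be 'uncertainty' or 'high_ranking'")
    candidates.sort()  # smallest key first: most uncertain, or highest ranked
    return [doc_id for _, doc_id in candidates[:batch_size]]


# Hypothetical rankings for six documents.
rankings = {
    "DOC-001": 0.97, "DOC-002": 0.52, "DOC-003": 0.08,
    "DOC-004": 0.45, "DOC-005": 0.88, "DOC-006": 0.59,
}
uncertain_batch = select_for_review(rankings, mode="uncertainty")
top_batch = select_for_review(rankings, mode="high_ranking")
```

The human-driven cylinders (random and judgmental sampling) are deliberately left out of the sketch, since they depend on the searcher rather than on the ranking.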
Step-4: Machine Learning Iterations is where the software takes the input provided by the SME team and extrapolates and applies those classifications to the entire collection of documents. The predictive coding software then ranks all documents in the collection, usually on a percentage basis, from 99.9% probable relevance down to 0.01%. This key predictive coding step is repeated as needed for quality control (QC) purposes. These iterations continue until the training is complete within the proportional constraints of the case. At that point the attorney in charge of the search may declare the search complete and ready for the next quality assurance test.
Step-5: Quality Assurance Tests are based primarily on random sampling to verify the effectiveness of the final rankings. They are used to verify, or not, the reasonability of the search and predictive coding parameters developed. If the test is not passed, the review is reactivated for additional rounds of review and machine learning iterations. If the review passes the tests, then the first-pass relevancy review is complete, and the project moves to the next and final step. For more on sampling and its use in both quality control and quality assurance, see the Review Quality Control page of EDBP; Introducing “ei-Recall” – A New Gold Standard for Recall Calculations in Legal Search – Part One, Part Two and Part Three; and In Legal Search Exact Recall Can Never Be Known; and the KPMG publications: Gabriel, Manfred, Quality Control For Predictive Coding In e-Discovery (2013); and, The Case For Statistical Sampling In e-Discovery (2012).
Step-6: Protection Reviews and Productions is the last step. It used to take place only after the Quality Assurance tests had been satisfied, but now often occurs on an ongoing basis as part of the Machine Learning Iterations. The latter is sometimes called a CAL or Continuous Active Learning approach. The manual review is usually focused on the eighth Protections step of the EDBP. Predictive coding in this step is used as a way to make efficient assignments to manual reviewers of documents likely privileged, or confidential, or of questionable relevance. The final manual reviews of the documents before production are limited to documents already coded as relevant, or predicted as relevant without manual coding. This step is done by attorneys knowledgeable about privilege and confidentiality issues in the case. This is usually done by outside contract review attorneys. If documents are found that were predicted to be relevant, and thus assigned to reviewers, but were not in fact relevant in the reviewer’s judgment, then additional machine training may be required. This is common in CAL approaches.
In this final step the predicted relevant coding is confirmed and redactions made on the documents within budgetary limits as described, and privileged documents are identified and removed from production for logging. The documents culled by predictive coding, or other search methods, are not subject to expensive final manual review. Neither are documents predicted relevant, but ranked below the budgeted amount under Bottom Line Driven Proportional Strategy. The last step in Final Review is to spot check the final production media before delivery.
The next chart is another high level summary of how to use predictive coding for legal search. It assumes a CAL approach, which is typically the most effective technique to use. This eight-step chart is still somewhat simplified, but provides more detail than the prior six-step model. The full legal method on how to use predictive coding is more complex. For instance, the one Losey uses and still keeps proprietary is twenty-three steps, not eight, and the checklist description takes four pages. Also see Losey’s Form Plan of a Predictive Coding Project, which is a detailed Outline for project planning purposes.
But complex or not, Losey has found that it can be taught to any experienced lawyer in a second-chair type of hands-on training. Mere intellectual descriptions, as we are doing now, can serve as a good preparation for later effective apprenticeship training. Here is the simplified eight-step version of Predictive Coding best practice. The circular flows depict the iterative steps specific to the predictive coding features.
Step-1: ESI Discovery Communications. As in the six-step description, the process starts with ESI Discovery Communications, or Dialogues, not only with opposing counsel or other requesting parties, but also with the client and within the e-discovery team assigned to the case. Analysis of the scope of the discovery, and clear communications on relevance and other review procedures, are critical to all successful project management. The ESI Discovery Communications should be facilitated by the lead e-Discovery specialist attorney assigned to the case, but include active participation by the team trial lawyers. See the description of step-1 in the six-step description above for further clarification.
Step-2: Multimodal Search Review. Again, this step is the same as Step-2 in the six-step description, which we called Initial Sample Reviews by SME Team. Here it is renamed to emphasize the fact that all types of search methods are used to try to find as many relevant documents as possible for the training rounds. In the process many irrelevant documents are incidentally found and coded, although usually only select samples are used in later training.
The multimodal search review is carried out under the general supervision of Subject Matter Experts on the case who make final decisions where there is doubt concerning the relevance of a document or document type. The SME role is typically performed by the partners, senior associates, and sometimes also the e-Discovery specialist attorney(s) assigned to the case. The SME team then trains and supervises the document review specialists, aka, contract review attorneys who usually do most of the manual reviews (Step-Five). It is, however, very important for all members of the SME team to have direct and substantial contact with the actual documents. For instance, they should see all hot documents found in any step of the process.
The hands-on involvement of the SMEs is especially needed in the second step, and thereafter may be limited. The SME involvement up-front is needed to ensure that proper expertise on relevance, and the then expected story to be told at trial, guides all later activities. For that reason contract review attorneys are often not brought in until Step Three, Random Baseline, or even Step Five, Document Review, well after training has begun. In some projects, at least one contract review expert is brought in at Step-Two to assist the SMEs and later help in training of additional reviewers when they are included in Step-Five of the project. The e-Discovery specialist with expertise and experience with search, the Experienced Searcher, along with an expert on the software being used, the Power-User, should be involved in all stages of the project. Often these two roles are performed by one search expert, but rarely is that person also an SME of the legal issues.
The samples found by the multimodal search methods in Step-Two are selected by human judgment, not by random samples. As stated, the selections are made with the help of various software search features, including keyword search, similarities, and concept searches. Documents outside of the dataset, such as subpoenas or complaints, may be included for training purposes. Even synthetic documents may be used as ideal exemplars.
Step-3: Random Baseline. This step in the work-flow is essentially a computer function with statistical analysis. Here you create a random sample and analyze the results of expert review of the sample. Some review is thus involved in this step, and you have to be very careful that it is done correctly. This sample is taken for statistical purposes to establish a baseline for quality control purposes in Step-Seven. Typically prevalence calculations are made at this point. Some software also uses this random sampling selection for purposes of a control set creation. As mentioned, many no longer use a control set, myself included, because it is so unreliable due to relevance shift and low prevalence issues. Instead, my sole purpose is to determine prevalence. This Random Baseline step was included in Step-Two of the Six-Step description above. Also see: In Legal Search Exact Recall Can Never Be Known.
This step requires an experienced searcher and power user. An SME is only needed to verify the classifications of any grey area documents found in the random sample. Otherwise the review can be done by your one contract reviewer. Document review is limited to the sample documents, but aside from that, this step is a computer function and mathematical analysis. In complex cases a consulting statistician or scientist may be needed to supervise this step, as well as to assist in Step Seven, quality control.
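The prevalence calculation made from the random baseline can be illustrated with a short Python sketch. It assumes a simple random sample and uses the Wilson score interval at roughly 95% confidence, which is a choice made here for illustration and not a method stated in the text. The function names and all numbers are hypothetical.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def prevalence_baseline(relevant_in_sample, sample_size, corpus_size):
    """Point estimate and range for prevalence, projected onto the whole corpus."""
    low, high = wilson_interval(relevant_in_sample, sample_size)
    return {
        "prevalence": relevant_in_sample / sample_size,
        "prevalence_range": (low, high),
        "projected_relevant_range": (round(low * corpus_size), round(high * corpus_size)),
    }


# Hypothetical project: 1,500 documents sampled from 1,000,000, 30 coded relevant.
baseline = prevalence_baseline(relevant_in_sample=30, sample_size=1500, corpus_size=1_000_000)
```

With so few relevant documents in the sample the projected range is wide, which is one way to see the low prevalence limitation mentioned above.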
Step-4: AI Predictive Ranking. This is the Auto Coding Run where the software’s predictive coding calculations begin. This is the same as Step-Four, Machine Learning Iterations, described in the Six-Step work flow above. The first time the training runs is known as the seed set training. Step-Four repeats, with Steps Five and Six, in an iterative process, which is also known as Continuous Active Learning (CAL). The first repetition of the training is known as the Second Round of training, the next the Third Round, etc.
Step-Four in the eight-step workflow is a purely algorithmic function. The ranking of a million documents may take as long as an hour, or even more, depending on the complexity and software. You could fairly call Step-Four a coffee break, as far as the humans involved in the project are concerned. With some projects and software, however, the computer processing and ranking may be done in just a few minutes.
The predictive coding software in this step is analyzing all of the document categorizations made in Steps Two and Three for the initial run, the seed set. Thereafter in all subsequent training rounds, when Step-Four repeats, the machine not only uses the input from Steps Two and Three, but also the new documents reviewed in Step-Five, and those found and selected for training in Step-Six. Note that skilled searchers rarely use all coded documents as training documents, and that is where the art and experience come in. The concern is to avoid over-training on any one document type and thus lowering recall and missing a key black-swan document.
All documents selected for training are included in this Step-Four computer processing. The software studies the documents marked for training, and then scans all of the data uploaded onto the review platform (aka, the corpus). It then ranks all of the documents according to probable relevance. It essentially assigns each document in the corpus a value from 0.01% to 99.9% probable relevance. (Note, some software uses different ranking values, but this is essentially what it is doing.) A value of 99.9% represents the highest probability that the document matches the category trained, such as relevant, or highly relevant, or privileged. A value of 0.01% means no likelihood of matching. A probability ranking of 50% represents equal likelihood: the machine is uncertain as to the document classification.
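To make the ranking idea concrete, here is a toy Python sketch that trains on a handful of coded documents and then assigns every document in a small corpus a probability between 0.01% and 99.9%. It uses a simple Naive Bayes word model purely as a stand-in. The text does not say which algorithms commercial predictive coding software uses, and they are certainly more sophisticated than this. All names, documents, and codings are hypothetical.

```python
import math
from collections import Counter


def train(training_docs):
    """training_docs: list of (text, is_relevant) pairs coded by human reviewers."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, is_relevant in training_docs:
        doc_counts[is_relevant] += 1
        word_counts[is_relevant].update(text.lower().split())
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def probability_relevant(text, model):
    """Naive Bayes estimate of P(relevant), clamped to the 0.01%..99.9% range."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_score = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        score = math.log((doc_counts[label] + 1) / (total_docs + 2))  # smoothed prior
        for word in text.lower().split():
            if word in vocab:  # words never seen in training carry no evidence
                score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        log_score[label] = score
    diff = max(-700.0, min(700.0, log_score[False] - log_score[True]))  # avoid overflow
    return min(max(1 / (1 + math.exp(diff)), 0.0001), 0.999)


def rank_corpus(corpus, model):
    """Return (doc_id, probability) pairs, highest probable relevance first."""
    scored = [(doc_id, probability_relevant(text, model)) for doc_id, text in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical training documents and corpus.
model = train([
    ("merger pricing agreement confidential", True),
    ("pricing agreement draft merger", True),
    ("lunch menu friday", False),
    ("office party friday lunch", False),
])
ranking = rank_corpus({"A": "merger pricing memo", "B": "friday lunch order",
                       "C": "quarterly newsletter"}, model)
```

Document C shares no words with the training documents, so the toy model lands at 50% for it, which is the uncertain middle of the ranking described above.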
The first few times this AI-Ranking step is run, the software predictions as to a document’s categorization are often wrong, sometimes wildly so. It depends on the kind of search and data involved, and the number of documents already classified and included for training. That is why spot-checking and further training are always needed for predictive coding to work properly.
AI Predictive Ranking is now necessarily an iterative process where human feedback is provided throughout the process. Analytic software in the future may be far less dependent on human involvement in the iterative process, but for now it is critical. That is where the next two Steps Five and Six come in, Document Review and Hybrid Active Training.
Step-5: Document Review.
This is the step where most of the actual document review is done, where the documents are seen and classified by human reviewers. In my experience, this can take as little as one second, assuming your software is good and fast, for an obvious document, to as long as a half hour if you have to fast-read a long document to be sure of its classification. Depending on the classification, a document is either produced, if relevant and not privileged, or not produced if irrelevant. If relevant and privileged, then it is logged, but not produced. If relevant, not privileged, but confidential for some reason, then it is redacted and/or specially labeled before production. The special labeling performed is typically to prominently affix the word CONFIDENTIAL on the TIFF image production, or the phrase CONFIDENTIAL – ATTORNEYS EYES ONLY. The actual wording of the legends depends upon the parties’ confidentiality agreement or court order. When redaction is required, the total time to review a document can go way up.
When reviewing a document not already manually classified, the reviewer is usually presented with a document that the expert searcher running the project has determined is probably relevant. Typically this means it has higher than a 50% probable relevant ranking. The reviewer may, or may not, know the ranking. Whether you disclose this depends on a number of other factors. Since I usually only use highly skilled reviewers, I trust them with disclosure. But sometimes you may not want to disclose the ranking.
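The production rules in this step amount to a small decision table, sketched below in Python. The function and field names are hypothetical, and the default legend is only a placeholder, since the actual wording depends on the confidentiality agreement or court order.

```python
def production_disposition(relevant, privileged=False, confidential=False,
                           redact=False, legend="CONFIDENTIAL"):
    """Return how one reviewed document is handled, following the rules in the text."""
    if not relevant:
        # Irrelevant documents are not produced.
        return {"produce": False, "privilege_log": False, "redact": False, "legend": None}
    if privileged:
        # Relevant and privileged: logged, but not produced.
        return {"produce": False, "privilege_log": True, "redact": False, "legend": None}
    if confidential:
        # Relevant, not privileged, confidential: redacted and/or labeled, then produced.
        return {"produce": True, "privilege_log": False, "redact": redact, "legend": legend}
    # Relevant, not privileged, not confidential: produced as is.
    return {"produce": True, "privilege_log": False, "redact": False, "legend": None}


# Hypothetical example: a relevant, confidential document that needs the stricter legend.
example = production_disposition(relevant=True, confidential=True,
                                 legend="CONFIDENTIAL – ATTORNEYS EYES ONLY")
```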
During the review many documents predicted to be relevant, will not be. The reviewers will code them correctly, as they see them, but special quality controls in the form of second reviews on a random or judgmental selection process may be imposed on disagreements. They often involve close questions and are often used in the next round of training for irrelevance. That is a decision made in Step-Six. Prediction Error Corrections are made. Lawyers and paralegals find and correct the computer errors by a variety of methods. The predictive coding software learns from the corrections. Steps 4 and 5 then repeat as shown in the diagram. This iterative process is considered a virtuous feedback loop that continues until the computer predictions are accurate enough to satisfy the proportional demands of the case.
Step-6: Hybrid Active Training. In this step new documents are selected for review in the next iteration of Step-Five. Moreover, in Step-Six decisions are made as to what documents to include in training in the next round of Step-Four, AI Predictive Ranking. Step-Six is much like Step-2, Multimodal Search Review, except that now new types of document ranking search are possible. Since the documents are now all probability ranked in Step-Four, you can use this ranking to select documents for the next round of document review (Step Five). For instance, the research of Cormack and Grossman has shown that selection of the highest ranked documents can be a very effective method to continuously find and train on relevant documents. See Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, at pg. 9. Also see Latest Grossman and Cormack Study Proves Folly of Using Random Search for Machine Training – Parts One, Two, Three and Four. Another popular method, also tested and reported on by Grossman and Cormack, is to select mid-ranked documents, the ones the computer is uncertain about.
My preferred active multimodal learning process in the iterative machine learning steps is threefold:
- I sometimes, but not in all rounds, choose to have documents reviewed that are selected by the computer. These are documents for which the software classifier is uncertain of the correct classification. This helps the classifier algorithms to learn by adding diversity to the documents presented for review. This in turn also helps to locate outliers of a type your initial judgmental searches in Steps Two and Five have missed.
- Sometimes I also select documents at random, either by proper computer random sampling or by informal random selection. This again helps maximize recall and avoid premature focus on the relevant documents initially retrieved. Random samples taken in Steps Three and Seven are typically also all included for training, and, of course, are always very carefully reviewed.
- All other relevant documents that the skilled reviewer can find using a variety of search techniques suggested by the documents the computer selects. That may include some linear review of selected custodians or dates, parametric Boolean keyword searches, similarity searches of all kinds, concept searches, as well as several unique predictive coding probability searches. I call that a multimodal approach. Again, you need not limit these searches to ESI in the original dataset, but can also use outside documents such as subpoenas or complaints. Even synthetic documents may be used as ideal exemplars.
Step-7: ZEN Quality Assurance Tests. ZEN here stands for Zero Error Numerics. This requires quality control activities in all steps, but the efforts peak in this Step-Seven. In this step a random sample is taken to try to evaluate the recall range attained in the project. The method currently favored is described in detail in Introducing “ei-Recall” – A New Gold Standard for Recall Calculations in Legal Search – Part One, Part Two and Part Three. Also see: In Legal Search Exact Recall Can Never Be Known.
The ei-Recall test is based on a random sample of all documents to be excluded from the Final Review for possible production. The exclusion can be based on category prediction (i.e., probable irrelevant) and/or probable ranking of documents with proportionate cut-offs. The focus is on a search for any false negatives (i.e., relevant documents incorrectly predicted to be irrelevant) that are Hot or otherwise of significance. 100% recall of all relevant documents is both scientifically impossible (unless you produce all documents, 0% precision) and legally unnecessary under best practices for proportional review. But the goal is to avoid all false negatives of Hot documents. If this error is encountered, an additional iteration of Steps 4, 5 and 6 is required.
In this step you also make and test the decision to stop the training (the repetition of Steps Four, Five and Six). This decision is evaluated by the random sample, but determined by a complex variety of factors that can be case specific. Typically it is determined when the software has attained a highly stratified distribution of documents. See License to Kull: Two-Filter Document Culling and Visualizing Data in a Predictive Coding Project – Part One, Part Two and Part Three, and Introducing a New Website, a New Legal Service, and a New Way of Life / Work; Plus a Postscript on Software Visualization and Thanks to Kroll Ontrack (see end of this blog for the Postscript).
When the stratification has stabilized you will see very few new documents found as predicted relevant that have not already been human reviewed and coded as relevant. You essentially run out of documents for Step Five review. Put another way, your Step Six no longer uncovers new relevant documents. This exhaustion marker to stop is fairly rare. Instead, in most projects the rate of newly found documents will slow, but not stop entirely.
Another key determinant of when to stop is the cost of further review. Is it worth it to continue on with more iterations of Steps Four, Five and Six? See Predictive Coding and the Proportionality Doctrine: a Marriage Made in Big Data, 26 Regent U. Law Review 1 (2013-2014). Another criterion in the stop decision is whether you have found the information needed. If so, what is the purpose of continuing a search? Again, the law never requires finding all relevant documents, only reasonable efforts to find the relevant documents needed to decide an important fact issue in a case.
Step-8: Phased Production. This last step is where the documents are actually produced. This is done after a last quality control check of the media on which the production is made, typically CDs or DVDs. If you have to use FTP to meet a tight deadline, also produce on a tangible medium to keep a permanent record of the production. Always use a WORM medium for the production, meaning write once, and read many times. That means the data you produced cannot be altered. This might be helpful later for forensic purposes, along with hash, to confirm ESI authenticity and detect any changes. The final work included here is to prepare a privilege log. The logging is typically delayed until after production. Also, production is usually done in rolling stages as review is completed in order to buy more time and good will.
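The arithmetic behind a recall range of this kind can be sketched in Python. This is only an approximation of the idea under stated assumptions: the false negative rate found in the sample of excluded documents is turned into a low and high projection, and each projection is combined with the count of relevant documents already found. The Wilson score interval is used here for convenience, and the cited ei-Recall articles should be consulted for the interval method and procedure they actually specify. All names and numbers are hypothetical.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def recall_range(true_positives, excluded_count, sample_size, false_negatives_in_sample):
    """Estimate a low..high recall range from a sample of the excluded documents.

    true_positives: relevant documents already found and confirmed by review.
    excluded_count: number of documents excluded from final review.
    """
    if true_positives <= 0:
        raise ValueError("true_positives must be positive")
    low_rate, high_rate = wilson_interval(false_negatives_in_sample, sample_size)
    fn_low = low_rate * excluded_count    # fewest relevant documents likely missed
    fn_high = high_rate * excluded_count  # most relevant documents likely missed
    return (true_positives / (true_positives + fn_high),
            true_positives / (true_positives + fn_low))


# Hypothetical project: 9,000 relevant found, 900,000 excluded,
# 1,500 of the excluded sampled, 3 false negatives found in the sample.
recall_low, recall_high = recall_range(9000, 900_000, 1500, 3)
```

A range like this says nothing about whether any of the missed documents are Hot. That still requires looking at the false negatives actually found in the sample.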
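Two of the stop signals discussed above, exhaustion and cost, can be expressed as a rough Python check. This is a hypothetical heuristic for illustration only. The thresholds are made up, and the sketch does not capture the other factors the text describes, such as the quality assurance sample or whether the needed information has already been found, which remain matters of attorney judgment.

```python
def should_stop(rounds, max_cost_per_relevant, min_new_relevant=1):
    """Suggest whether to stop iterating, based on the most recent training round.

    rounds: list of (new_relevant_found, review_cost) tuples, oldest round first.
    max_cost_per_relevant: hypothetical budget ceiling per newly found relevant document.
    Returns (stop, reason).
    """
    if not rounds:
        return False, "no rounds completed yet"
    new_relevant, review_cost = rounds[-1]
    if new_relevant < min_new_relevant:
        return True, "exhausted: the last round found no new relevant documents"
    if review_cost / new_relevant > max_cost_per_relevant:
        return True, "disproportionate: cost per new relevant document exceeds the ceiling"
    return False, "continue: new relevant documents are still found at acceptable cost"


# Hypothetical history: the yield slows from 120 to 40 to 5 documents per round.
history = [(120, 3000.0), (40, 3000.0), (5, 3000.0)]
decision = should_stop(history, max_cost_per_relevant=250.0)
```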
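The hash check mentioned in this step can be sketched with Python's standard `hashlib` module. The sketch records a SHA-256 value for every file in a production folder and later reports any file whose hash no longer matches. The choice of SHA-256, the function names, and the folder layout are assumptions made for illustration, since the text does not name a hash algorithm or tool.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large production files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(production_dir):
    """Map each file's path (relative to the production folder) to its SHA-256 hash."""
    root = Path(production_dir)
    return {path.relative_to(root).as_posix(): sha256_of_file(path)
            for path in sorted(root.rglob("*")) if path.is_file()}


def changed_files(production_dir, manifest):
    """List files that were added, removed, or altered since the manifest was built."""
    current = build_manifest(production_dir)
    return sorted(name for name in set(manifest) | set(current)
                  if manifest.get(name) != current.get(name))
```

A manifest built at the time of production and kept with the file copy gives a simple way to show later that a given file is, or is not, the one that was produced.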
My insistence on the use of multimodal judgmental sampling in steps three and seven to locate relevant documents follows the consensus view of information scientists specializing in information retrieval, but is not followed by several prominent predictive coding software vendors in e-discovery. They instead rely entirely on machine selected documents for training, or even worse, rely entirely on randomly selected documents to train the software. In my writings I call this process the Borg approach. It unnecessarily minimizes the role of legal expert input, and the usefulness of other types of search to supplement an active learning process. As described here, I instead use a hybrid multimodal approach where the legal search expert remains in control of the process, and their expertise is leveraged for greater accuracy and speed of review.
For more background on predictive coding search methods and Losey’s background in the area, see the e-Discovery Team CAR page.