Case Law Restrictions on the Use of Full Text Searching

By Joe Howie

Courts are taking a closer look at how full text searching is used in litigation support — litigation support professionals ought to be aware of what these cases say and where they’re going. A high-level summary: parties who use full text searching as part of their production methodology ought to, at the very least, document the terms that were used. Parties using full text searching to identify privileged records ought to use statistical sampling to examine the unselected records as a reasonable measure to validate the effectiveness of the full text searches.

First, a few words on full text searching; then we’ll look at three leading cases: O’Keefe, Equity Analytics and Victor Stanley.

Full Text Searching

Full text searching is one of the basic tools used to find and manage documents within litigation support databases. However, it has been recognized for many years that full text searching has two significant limitations. The first is especially critical when working with collections that are “new” in the sense that the searcher hasn’t worked with the documents before: the searcher won’t know which terms are actually used in the documents to refer to items of interest. The second is that words can have multiple meanings, so even when full text searching finds documents with the specified terms, those documents may not be what the searcher wanted to find.

The article published by David C. Blair and M. E. Maron, “An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System,” Communications of the ACM, March 1985, Vol. 28, No. 3, pp. 289-299, is the seminal reference when discussing full text effectiveness. It used the term “precision” to indicate the percentage of the retrieved records that were relevant, and it used the term “recall” to indicate the percentage of all of the relevant records in the database that were located as a result of searching. A search that finds only one record, but that record is relevant, would be 100 percent precise — yet not identify the bulk of the relevant records. A search that finds most of the relevant records would, by definition, have high recall, but may require weeding through many irrelevant results.
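To make the two measures concrete, here is a minimal Python sketch of how precision and recall are computed. The document IDs and counts are invented for illustration and are not taken from the Blair and Maron study.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one search result set."""
    hits = retrieved & relevant  # relevant records the search actually found
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection with 10 relevant documents.
relevant = {f"doc{i}" for i in range(1, 11)}

# A narrow search: one record, and it is relevant.
narrow = {"doc1"}

# A broad search: 9 of the 10 relevant records plus 91 irrelevant ones.
broad = {f"doc{i}" for i in range(1, 10)} | {f"junk{i}" for i in range(1, 92)}

print(precision_recall(narrow, relevant))  # (1.0, 0.1)
print(precision_recall(broad, relevant))   # (0.09, 0.9)
```

The narrow search is perfectly precise but finds only a tenth of what matters; the broad search finds most of it but buries the reviewer in irrelevant results.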

Precision can be measured by examining the results themselves. To measure recall, however, the researcher has to either examine all of the records that weren’t included in the result set (which defeats the purpose of using searches to limit what the searcher has to look at) or examine a random sample of the initially unselected records to estimate recall.
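The sampling approach can be sketched in a few lines of Python. This is only an illustration under simple assumptions: a simple random sample, a reviewer who reads each sampled record, and a straight extrapolation with no confidence interval. The function names and all of the numbers are hypothetical.

```python
import random


def draw_sample(unselected_ids: list[str], sample_size: int, seed: int = 0) -> list[str]:
    """Pick a reproducible simple random sample of unselected document IDs."""
    rng = random.Random(seed)
    return rng.sample(unselected_ids, sample_size)


def estimate_recall(relevant_found: int, unselected_total: int,
                    sample_size: int, relevant_in_sample: int) -> float:
    """Extrapolate the sample's miss rate to all unselected records."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    estimated_missed = relevant_in_sample * unselected_total / sample_size
    estimated_relevant = relevant_found + estimated_missed
    return relevant_found / estimated_relevant if estimated_relevant else 0.0


# Hypothetical numbers: the search found 200 relevant records and left
# 39,000 records unselected. Reviewers read a sample of 500 of those
# and judged 10 of them relevant.
unselected = [f"DOC-{i:05d}" for i in range(39000)]
sample = draw_sample(unselected, 500)
print(round(estimate_recall(200, len(unselected), len(sample), 10), 3))  # 0.204
```

In this made-up example the sample suggests roughly 780 relevant records were missed, so the search found only about a fifth of what was there.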

The Blair and Maron study used a database containing 40,000 documents representing about 350,000 pages of text. Full text searching was used to locate records responsive to 51 different information requests, and precision was measured by examining the results. Recall was then estimated by using sampling techniques on the unselected records. The results showed that although the lawyers in the study thought they were achieving 75 percent recall, they were in fact finding only about 20 percent of the relevant records.

The study detailed the reasons for the low recall, e.g., searching for the term “accident” when the evidentiary records also used terms such as “event,” “incident,” “situation,” “problem,” “difficulty,” or just “what happened last week,” often without mentioning the relevant proper names. Another example was that “trap correction” was also referred to as the “wire warp” and the “shunt correction system.” Other problems included the use of slang and misspellings.
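The vocabulary problem described above is easy to reproduce. The following Python sketch runs a simple whole-word keyword search over four invented memos; the memo texts and the `keyword_search` helper are assumptions made for illustration, not material from the study.

```python
import re


def keyword_search(docs: dict[str, str], terms: list[str]) -> set[str]:
    """Return IDs of documents containing any term as a whole word or phrase."""
    patterns = [re.compile(r"\b" + re.escape(t) + r"\b", re.IGNORECASE) for t in terms]
    return {doc_id for doc_id, text in docs.items()
            if any(p.search(text) for p in patterns)}


# Hypothetical documents; the first three all concern the same occurrence.
docs = {
    "memo-1": "Report on the accident at the plant.",
    "memo-2": "Follow-up on the incident from Tuesday.",
    "memo-3": "We need to talk about what happened last week.",
    "memo-4": "Cafeteria menu for March.",
}

print(sorted(keyword_search(docs, ["accident"])))
# ['memo-1']

expanded = ["accident", "incident", "event", "situation", "problem", "difficulty"]
print(sorted(keyword_search(docs, expanded)))
# ['memo-1', 'memo-2']
```

Even the expanded term list misses memo-3, which refers to the occurrence without using any of the anticipated words.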

Through work done by the Sedona Conference and elsewhere, judges are aware of the limitations of full text searching and are taking a closer look at how it is being used, as discussed below.

Case Law

Three relatively recent opinions have addressed the use of full-text searching in meeting discovery obligations. In one, a producing party’s documented use of key terms to select records to produce was upheld against a challenge that was not supported by expert testimony. In another, a party seeking to employ full text search technology in a forensic examination of the producing party’s computer was required to support that request with at least an affidavit from an expert. In the third, the court held that privilege was waived when the producing party failed to document its use of full text search technology and failed to use random sampling on unselected records to validate the effectiveness of the selected search terms.

Read It Yourself: I believe there is significant value in reading opinions and not relying on commentators to filter what they say. For example, some commentators might have you believe that these cases say that full text searching is dead or that you’ll have to produce an expert any time that you use it. That’s not always the case. I encourage you to read the opinions themselves; links are located at the end of this article.

O’Keefe

The first case is United States of America vs. O’Keefe, 537 F. Supp. 2d 14, Cr. No. 06-249 (D.D.C. Feb. 18, 2008). This was a criminal case alleging that O’Keefe, a U.S. State Department employee in Canada, essentially accepted bribes for expediting visa requests from a co-defendant’s company. The defendants had requested exculpatory records that showed that expedited requests were commonly granted without the requestors paying anything of value. The government had used the following search terms to locate ESI that was potentially responsive to those requests:

“early or expedite* or appointment or early & interview or expedite* & interview.”

This is how Judge Facciola dealt with this issue:

3. Search Terms and Other Deficiencies

As noted above, defendants protest the search terms the government used.[6] Whether search terms or “keywords” will yield the information sought is a complicated question involving the interplay, at least, of the sciences of computer technology, statistics and linguistics. See George L. Paul & Jason R. Baron, Information Inflation: Can the Legal System Adapt?, 13 RICH. J.L. & TECH. 10 (2007). Indeed, a special project team of the Working Group on Electronic Discovery of the Sedona Conference is studying that subject and their work indicates how difficult this question is. See The Sedona Conference, Best Practices Commentary on the Use of Search and Information Retrieval, 8 THE SEDONA CONF. J. 189 (2008)... Given this complexity, for lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than the terms that were used is truly to go where angels fear to tread. This topic is clearly beyond the ken of a layman and requires that any such conclusion be based on evidence that, for example, meets the criteria of Rule 702 of the Federal Rules of Evidence. Accordingly, if defendants are going to contend that the search terms used by the government were insufficient, they will have to specifically so contend in a motion to compel and their contention must be based on evidence that meets the requirements of Rule 702 of the Federal Rules of Evidence.

Footnote 6 to the Opinion: Note that the defendants also take the government to task for not interviewing the employees to ascertain how often they used electronic means to create any electronic documents regarding expedited interviews. Reply at 6. But if the search terms used actually captured everything there was to capture, such interviews would be unnecessary.

Note that the holding denied a challenge to the use of Boolean searches that had been used by the producing party (the government) and that appeared at least somewhat likely to find relevant records. However, unlike some other cases, the government had documented its approach and was at least able to tell which search terms were used. Note also that the government apparently did not produce experts to justify what it had done.
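For readers who want to see how a search string like the one quoted above behaves, here is a rough Python sketch. It rests on assumptions the opinion does not confirm: that “or” means OR, that “&” means AND and binds more tightly than “or”, and that “*” matches any word ending. The government’s actual search tool may have parsed the string differently, and the sample sentences are invented.

```python
import re


def words(text: str) -> set[str]:
    """Lower-case a text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def has(term: str, tokens: set[str]) -> bool:
    """True if the term is present; a trailing * matches any word ending."""
    if term.endswith("*"):
        return any(tok.startswith(term[:-1]) for tok in tokens)
    return term in tokens


def matches(text: str) -> bool:
    """Assumed reading of: early or expedite* or appointment
    or early & interview or expedite* & interview."""
    t = words(text)
    return (has("early", t)
            or has("expedite*", t)
            or has("appointment", t)
            or (has("early", t) and has("interview", t))
            or (has("expedite*", t) and has("interview", t)))


print(matches("Applicant asked for an expedited appointment"))  # True
print(matches("Can we rush this visa as a favor?"))              # False
print(matches("Thanks for expediting the request"))              # False
```

Under this reading the two “&” clauses add nothing beyond the first two terms, and the wildcard still misses “expediting” because that word does not begin with “expedite”. Whether either point applied to the real search is exactly the kind of question the court said requires evidence rather than a lawyer’s opinion.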

Equity Analytics

The second case is Equity Analytics v. Lundin, Civ. No. 07-2033, 248 F.R.D. 331, 2008 U.S. Dist. LEXIS 17407 (D.D.C. Mar. 7, 2008). It was a civil case brought by a company (Equity Analytics) alleging that its ex-employee (Lundin) gained illegal access to electronic data on Salesforce.com after he was fired. Both parties had agreed that a forensic examination of Lundin’s computer drives would be required to determine what information, if any, he had copied.

Lundin wanted the examination restricted to certain file types and then only files that contained specified key terms. He claimed this was necessary to maintain the privacy of files containing attorney-client communications, business, medical, tax and banking records and images created for his professional photography business. Equity contended there should be no file type restriction because files can be converted to other file types and that keywords would not be adequate to find files that may have been fragmented when Lundin loaded a new operating system on his computer. This is how Judge Facciola resolved this issue:

III. Resolution of the Controversy

I recently commented that lawyers express as facts what are actually highly debatable propositions as to efficacy of various methods used to search electronically stored information. United States v. O’Keefe, No. 06-CR-249, 2008 WL 44972, at *8 (D.D.C. Feb. 18, 2008).

As I explained in that case, determining whether a particular search methodology, such as keywords, will or will not be effective certainly requires knowledge beyond the ken of a lay person (and a lay lawyer) and requires expert testimony that meets the requirements of Rule 702 of the Federal Rules of Evidence. Obviously, determining the significance of the loading of a new operating system upon file structure and retention and why the contemplated forensic search will yield information that will not be yielded by a search limited by file types or keywords are beyond any experience or knowledge I can claim.

Accordingly, I am going to require Equity to submit an affidavit from its examiner explaining why the limitations proposed by plaintiff are unlikely to capture all the information Equity seeks and the impact, if any, of the loading of the new operating system upon Lundin’s computer and the data that was on it before the new operating system was loaded. The expert shall also describe in detail how the search will be conducted. Armed with that information, supplemented if necessary by a hearing at which the expert will be cross examined, I can make the best possible judgment as to how to balance Equity’s need for information against Lundin’s privacy.

In both of these cases, Judge Facciola required more than a lawyer’s assertion about what was needed or effective in the technical aspects of ESI.

Victor Stanley

The third opinion (written by Judge Grimm) is Victor Stanley, Inc. v. Creative Pipe, Inc., Civ. No. MJG-06-2662 (D. Md. May 29, 2008). Defendant had produced 165 documents that it claimed were privileged and argued that there should be no waiver of privilege because the production of the 165 documents was inadvertent. Plaintiff of course argued that defendant had not taken proper precautions to guard against inadvertent production. The Court held that even assuming that the documents were privileged in the first instance, and assuming that the defendant had properly particularized its claims for privilege, the defendant had waived privilege by not taking adequate precautions. Some of the factors that led to this finding included:

  • Defendant had earlier requested a claw-back agreement but had then abandoned those efforts, thereby losing any protection such an agreement might have afforded;
  • Defendant’s excuse that it did not have time to complete a more exhaustive review was undercut by the fact that it did not request additional time;
  • It was the plaintiff’s counsel who discovered that defendants had produced the potentially privileged records;
  • Defendant identified some privileged records, but not the 165 records that were produced, through the use of a 70-term full text search, but claimed that a significant number of records were not full-text searchable; as to those non-searchable records, defendant examined only the title page of the documents. Judge Grimm took defendant to task for not explaining any of the particulars about the full-text search to locate the privileged records:

First, the Defendants are regrettably vague in their description of the seventy keywords used for the text-searchable ESI privilege review, how they were developed, how the search was conducted, and what quality controls were employed to assess their reliability and accuracy. … there is a growing body of literature that highlights the risks associated with conducting an unreliable or inadequate keyword search or relying exclusively on such searches for privilege review. Additionally, the Defendants do not assert that any sampling was done of the text searchable ESI files that were determined not to contain privileged information on the basis of the keyword search to see if the search results were reliable. Common sense suggests that even a properly designed and executed keyword search may prove to be over-inclusive or under-inclusive, resulting in the identification of documents as privileged which are not, and non-privileged which, in fact, are. The only prudent way to test the reliability of the keyword search is to perform some appropriate sampling of the documents determined to be privileged and those determined not to be in order to arrive at a comfort level that the categories are neither overinclusive nor under-inclusive. There is no evidence on the record that the Defendants did so in this case. Rather, it appears from the information that they provided to the court that they simply turned over to the Plaintiff all the text searchable ESI files that were identified by the keyword search Turner performed as non-privileged, as well as the non-text searchable files that Monkman and M. Pappas’ limited title page search determined not to be privileged. (pages 11–12 of opinion)

  • Plaintiffs contended that the supposedly non-text searchable records could have been converted to searchable format by the use of readily available OCR software or the native OCR Text Recognition Tool within Adobe Acrobat, and also contended that most of the 165 records were in fact text-searchable.

In a privilege waiver dispute, the burden of proof is on the producing party to show that it has undertaken reasonable efforts to prevent disclosure. At the very least, Victor Stanley establishes that a party that uses full text search to identify privileged records needs to keep a record of the searches that were conducted and needs to validate the effectiveness of the searches by random sampling of unselected records. Victor Stanley also points out the wisdom of having a court-approved claw-back agreement.
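As a rough illustration of what “keeping a record” and “random sampling of unselected records” might look like in practice, here is a minimal Python sketch. The record fields, the `run_and_document` function, the search terms and the sample documents are all hypothetical, and the matching is naive substring matching; a real review platform will work differently.

```python
import json
import random
from dataclasses import dataclass, asdict


@dataclass
class SearchRecord:
    """What was searched, when, and which unselected records were spot-checked."""
    terms: list[str]
    run_on: str            # ISO date the search was run
    selected_count: int
    unselected_count: int
    sample_seed: int       # lets the same sample be re-drawn later
    sample_ids: list[str]


def run_and_document(docs: dict[str, str], terms: list[str], run_on: str,
                     sample_size: int, seed: int) -> tuple[set[str], SearchRecord]:
    """Flag potentially privileged documents and log how it was done."""
    lowered = [t.lower() for t in terms]
    selected = {doc_id for doc_id, text in docs.items()
                if any(t in text.lower() for t in lowered)}
    # Sort before sampling so the same seed always yields the same sample.
    unselected = sorted(set(docs) - selected)
    rng = random.Random(seed)
    sample = rng.sample(unselected, min(sample_size, len(unselected)))
    record = SearchRecord(list(terms), run_on, len(selected),
                          len(unselected), seed, sample)
    return selected, record


# Hypothetical documents; D3 is arguably privileged but uses none of the terms.
docs = {
    "D1": "Privileged and confidential: advice of counsel regarding the patent",
    "D2": "Lunch order for Friday",
    "D3": "Per our lawyer, do not sign until she reviews the draft",
    "D4": "Shipping schedule for May",
}
terms = ["privileged", "counsel", "attorney"]

selected, record = run_and_document(docs, terms, "2009-03-01", sample_size=2, seed=42)
print(sorted(selected))                      # ['D1']
print(json.dumps(asdict(record), indent=2))  # the log to keep with the production file
```

The saved record answers the questions the court asked about which terms were used and what quality checks were done, and a human review of the sampled IDs is what gives a chance of catching a document like D3 before it is produced.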

Read It Yourself: The following links will take you to the text of the opinions discussed above:

O’Keefe: https://ecf.dcd.uscourts.gov/cgi-bin/show_public_doc?2006cr0249-90
Equity Analytics: https://ecf.dcd.uscourts.gov/cgi-bin/show_public_doc?2007cv2033-32
Victor Stanley: http://www.mdd.uscourts.gov/Opinions/Opinions/VictorStanley052908.pdf

To read what the commentators have to say about these decisions, Google “Facciola O'Keefe Equity Analytics Grimm Victor Stanley”.

Postscript on locating the full text of opinions
With some patient Internet searching you can often find the text of significant e-Discovery opinions even without subscriptions to expensive online legal research services. For example, in the federal court system, PACER (Public Access to Court Electronic Records) often enables you to find docket entries as well as the text of opinions themselves. There is a small fee for accessing certain records, but it is far less than what the main commercial legal research systems charge.

The search for the O’Keefe decision on PACER was made a bit more challenging because it was listed under the name of a co-defendant, Sunil Agrawal, and it wasn’t actually filed on the PACER system until a few days after the date of the opinion. The point is that when you rely on PACER, you sometimes have to be a bit flexible. It will often be helpful to have the case number as well as the date the opinion was signed.

This article appeared originally in the March 2009 ALSP Update, the monthly publication of the Association of Litigation Support Professionals, and is reprinted with permission. Read more about this nonprofit membership organization at www.alsponline.org.

For more information, email Joe Howie, Joe@HowieConsulting.com