Courts are taking a
closer look at how full text searching is used in
litigation support, and litigation support professionals
ought to be aware of what these cases say and where
they’re going. A high-level summary: parties who use
full text searching as part of their production
methodology ought to, at the very least, document the
terms that were used. Parties using full text searching
to identify privileged records ought to use statistical
sampling to examine the unselected records as a
reasonable measure to validate the effectiveness of the
full text searches.
First, a few words on
full text searching; then we’ll look at three leading
cases: O’Keefe, Equity Analytics and
Victor Stanley.
Full Text Searching
Full text searching is
one of the basic tools used to find and manage documents
within litigation support databases. However, there has
been recognition for many years that full text searching
has two significant limitations. The first is especially
critical in working with collections that are “new” in
the sense that the searcher hasn’t worked with the
documents before: the searcher won’t know which terms
are actually used in the documents to refer to items of
interest. The second is that words can have multiple
meanings, so even when full text searching finds
documents with the specified terms, the documents may
not really be what the searcher wanted to find.
The article published by
David C. Blair and M. E. Maron, “An Evaluation of
Retrieval Effectiveness for a Full-Text
Document-Retrieval System,” Communications of the ACM,
March 1985, Vol. 28, No. 3, pp. 289-299, is the seminal
reference when discussing full text effectiveness. It
used the term “precision” to indicate the percentage of
the retrieved records that were relevant, and it used
the term “recall” to indicate the percentage of all of
the relevant records in the database that were located
as a result of searching. A search that finds only one
record, but that record is relevant, would be 100
percent precise — yet not identify the bulk of the
relevant records. A search that finds most of the
relevant records would, by definition, have high recall,
but may require weeding through many irrelevant results.
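The two measures can be made concrete with a short calculation. The following is a minimal sketch in Python, not anything taken from the Blair and Maron article: the function name `precision_recall` and the record IDs are hypothetical, and it assumes the full set of relevant records is already known, which in a real collection it is not.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    retrieved: set of record IDs returned by the search
    relevant:  set of record IDs that are actually relevant
    """
    hits = retrieved & relevant  # relevant records the search found
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection with 10 relevant records, IDs 1 through 10.
relevant = set(range(1, 11))

# A search that finds a single relevant record: fully precise, low recall.
narrow = precision_recall({1}, relevant)

# A search that finds 9 of the 10 relevant records plus 91 irrelevant ones:
# high recall, low precision.
broad = precision_recall(set(range(1, 10)) | set(range(100, 191)), relevant)

print(narrow)  # (1.0, 0.1)
print(broad)   # (0.09, 0.9)
```

The first search is the "one relevant record" case described above; the second is the "weeding through many irrelevant results" case.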
Precision can be
measured by examining the results themselves. To
measure recall, however, the researcher has to either
examine all of the records that weren’t included in the
result set (which defeats the purpose of using searches
to limit what the searcher has to look at) or examine a
random sample of the initially unselected records to
estimate recall.
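The sampling approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `estimate_recall`, its parameters, and the stand-in `reviewer` are all hypothetical, and the reviewer callable takes the place of a person reading each sampled record. A real validation would also report a margin of error for the estimate, which this sketch omits.

```python
import random


def estimate_recall(relevant_found, unselected_ids, sample_size, is_relevant, seed=0):
    """Estimate recall by reviewing a random sample of the unselected records.

    relevant_found: number of relevant records in the result set (from review)
    unselected_ids: list of IDs of records the search did not select
    sample_size:    how many unselected records to review
    is_relevant:    callable standing in for a human reviewer's judgment
    """
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    sample = rng.sample(unselected_ids, min(sample_size, len(unselected_ids)))
    relevant_in_sample = sum(1 for doc_id in sample if is_relevant(doc_id))

    # Extrapolate the rate seen in the sample to all unselected records.
    miss_rate = relevant_in_sample / len(sample) if sample else 0.0
    estimated_missed = miss_rate * len(unselected_ids)

    total = relevant_found + estimated_missed
    return relevant_found / total if total else 0.0


# Hypothetical numbers: 50 relevant records found by the search, 10,000
# unselected records, and a pretend reviewer who marks every 50th one relevant.
unselected = list(range(10_000))
reviewer = lambda doc_id: doc_id % 50 == 0
estimate = estimate_recall(50, unselected, 500, reviewer)
print(f"Estimated recall: {estimate:.0%}")
```

In this made-up collection the search actually missed 200 relevant records, so true recall is 50 out of 250, or 20 percent. The estimate from a 500-record sample will land near that figure, but not exactly on it.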
In the Blair and Maron
study, they had a database containing 40,000 documents
representing about 350,000 pages of text. They used full
text searching to locate records that were responsive to
51 different information requests and then measured
precision by examining the results. They then estimated
recall by using sampling techniques on the unselected
records. The results showed that although the lawyers in
the study thought they were achieving 75 percent recall,
they were in fact finding only about 20 percent
of the relevant records.
The study detailed the
reasons for the low recall, e.g. looking for the term
“accident” when the evidentiary records also used terms
such as “event,” “incident,” “situation,” “problem,”
“difficulty,” or just “what happened last week,” often
without mentioning the relevant proper names. Another
example was that “trap correction” was also referred to
as the “wire warp” and the “shunt correction system.”
Other problems included the use of slang and
misspellings.
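The vocabulary problem is easy to reproduce on a toy scale. The sketch below is illustrative only: the four documents are invented, the function `keyword_search` is hypothetical, and all four documents are assumed to be relevant to the same request.

```python
import re

# Four invented documents, all assumed relevant to the same request.
documents = {
    1: "Report on the accident at the plant.",
    2: "Summary of the incident on Tuesday.",
    3: "Notes about what happened last week.",
    4: "The problem was fixed after the event.",
}


def keyword_search(docs, term):
    """Return IDs of documents containing the term as a whole word."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return sorted(doc_id for doc_id, text in docs.items() if pattern.search(text))


# Searching on the obvious term finds one of the four relevant documents.
print(keyword_search(documents, "accident"))  # [1]

# Adding the synonyms the searcher happens to think of helps, but the
# document that never names the subject at all is still missed.
terms = ["accident", "incident", "event", "problem"]
expanded = sorted(set().union(*(keyword_search(documents, t) for t in terms)))
print(expanded)  # [1, 2, 4]
```

Even with the expanded list, the document that says only "what happened last week" is not found, which is the kind of gap the study described.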
Through work done by the
Sedona Conference and elsewhere, judges are aware of the
limitations of full text searching and are taking a
closer look at how it is being used, as discussed below.
Case Law
Three relatively recent
opinions have addressed the use of full-text searching
in meeting discovery obligations. In one, a producing
party’s documented use of key terms to select records to
produce was upheld against a challenge that was not
supported by expert testimony. In another, a party
seeking to employ full text search technology in a
forensic examination of the producing party’s computer
was required to support that request with at least an
affidavit from an expert. In the third, the court held
that privilege was waived when the producing party
failed to document its use of full text search
technology and failed to use random sampling on
unselected records to validate the effectiveness of the
selected search terms.
Read It Yourself:
I believe there is significant value in reading opinions
and not relying on commentators to filter what they say.
For example, some commentators might have you believe
that these cases say that full text searching is dead or
that you’ll have to produce an expert any time that you
use it. That’s not always the case. I encourage you to
read the opinions themselves; links are located at the
end of this article.
O’Keefe
The first case is
United States of America vs. O’Keefe, 537 F. Supp.
2d 14, Cr. No. 06-249 (D.D.C. Feb. 18, 2008). This was a
criminal case alleging that O’Keefe, a U.S. State
Department employee in Canada, essentially accepted
bribes for expediting visa requests from a
co-defendant’s company. The defendants had requested
exculpatory records that showed that expedited requests
were commonly granted without the requestors paying
anything of value. The government had used the following
search terms to locate ESI that was potentially
responsive to those requests:
“early or expedite* or
appointment or early & interview or expedite* &
interview.”
This is how Judge
Facciola dealt with this issue:
3. Search Terms and
Other Deficiencies
|
As noted above, defendants protest the
search terms the government used.[6]
Whether search terms or “keywords” will yield
the information sought is a complicated question
involving the interplay, at least, of the
sciences of computer technology, statistics and
linguistics. See George L. Paul & Jason
R. Baron,
Information
Inflation: Can the Legal System Adapt?,
13 RICH. J.L. & TECH. 10 (2007). Indeed, a
special project team of the Working Group on
Electronic Discovery of the Sedona Conference is
studying that subject and their work indicates
how difficult this question is.
See The Sedona
Conference, Best Practices Commentary on the Use
of Search and Information Retrieval,
8 THE SEDONA CONF. J. 189 (2008)... Given this
complexity, for lawyers and judges to dare opine
that a certain search term or terms would be
more likely to produce information than the
terms that were used is truly to go where angels
fear to tread. This topic is clearly beyond the
ken of a layman and requires that any such
conclusion be based on evidence that, for
example, meets the criteria of Rule 702 of the
Federal Rules of Evidence. Accordingly, if
defendants are going to contend that the search
terms used by the government were insufficient,
they will have to specifically so contend in a
motion to compel and their contention must be
based on evidence that meets the requirements of
Rule 702 of the Federal Rules of Evidence.
Footnote 6 to the Opinion: Note that the
defendants also take the government to task for
not interviewing the employees to ascertain how
often they used electronic means to create any
electronic documents regarding expedited
interviews. Reply at 6. But if the search terms
used actually captured everything there was to
capture, such interviews would be unnecessary. |
Note that the holding
denied a challenge to the use of Boolean searches
that had been used by the producing party (the
government) and that appeared at least somewhat likely
to find relevant records. However, unlike some other
cases, the government had documented its approach and
was at least able to tell which search terms were used.
Note also that the government apparently did not produce
experts to justify what it had done.
Equity Analytics
The second case is
Equity Analytics v. Lundin, Civ. No. 07-2033, 248
F.R.D. 331, 2008 U.S. Dist. LEXIS 17407 (D.D.C. Mar. 7,
2008). It was a civil case brought by a company (Equity
Analytics) alleging that its ex-employee (Lundin) gained
illegal access to electronic data on Salesforce.com
after he was fired. Both parties had agreed that a
forensic examination of Lundin’s computer drives would
be required to determine what information, if any, he
had copied.
Lundin wanted the
examination restricted to certain file types and then
only files that contained specified key terms. He
claimed this was necessary to maintain the privacy of
files containing attorney-client communications,
business, medical, tax and banking records and images
created for his professional photography business.
Equity contended there should be no file type
restriction because files can be converted to other file
types and that keywords would not be adequate to find
files that may have been fragmented when Lundin loaded a
new operating system on his computer. This is how Judge
Facciola resolved this issue:
III. Resolution of
the Controversy
|
I recently commented that lawyers
express as facts what are actually highly
debatable propositions as to efficacy of various
methods used to search electronically stored
information. United States v. O’Keefe, No.
06-CR-249, 2008 WL 44972, at *8 (D.D.C. Feb. 18,
2008).
As I explained in that case, determining whether
a particular search methodology, such as
keywords, will or will not be effective
certainly requires knowledge beyond the ken of a
lay person (and a lay lawyer) and requires
expert testimony that meets the requirements of
Rule 702 of the Federal Rules of Evidence.
Obviously, determining the significance of the
loading of a new operating system upon file
structure and retention and why the contemplated
forensic search will yield information that will
not be yielded by a search limited by file types
or keywords are beyond any experience or
knowledge I can claim.
Accordingly, I am going to require Equity to
submit an affidavit from its examiner explaining
why the limitations proposed by plaintiff are
unlikely to capture all the information Equity
seeks and the impact, if any, of the loading of
the new operating system upon Lundin’s computer
and the data that was on it before the new
operating system was loaded. The expert shall
also describe in detail how the search will be
conducted. Armed with that information,
supplemented if necessary by a hearing at which
the expert will be cross examined, I can make
the best possible judgment as to how to balance
Equity’s need for information against Lundin’s
privacy. |
In both of these cases,
Judge Facciola required more than a lawyer’s assertion
about what was needed or effective in the technical
aspects of ESI.
Victor Stanley
The third opinion
(written by Judge Grimm) is Victor Stanley, Inc. v.
Creative Pipe, Inc., Civ. No. MJG-06-2662 (D. Md.
May 29, 2008). Defendant had produced 165 documents that
it claimed were privileged and argued that there should
be no waiver of privilege because the production of the
165 documents was inadvertent. Plaintiff of course
argued that defendant had not taken proper precautions
to guard against inadvertent production. The Court held
that even assuming that the documents were privileged in
the first instance, and assuming that the defendant had
properly particularized its claims for privilege, the
defendant had waived privilege by not taking adequate
precautions. Some of the factors that led to this
finding included:
- Defendant had earlier requested a claw-back agreement
  but had then abandoned those efforts, thereby losing
  any protection such an agreement might have afforded;
- Defendant’s excuse that it did not have time to
  complete a more exhaustive review was undercut by the
  fact that it did not request additional time;
- It was the plaintiff’s counsel who discovered that
  defendants had produced the potentially privileged
  records;
- Defendant used a 70-term full text search to identify
  some privileged records, but that search did not
  catch the 165 records that were produced. Defendant
  also claimed that a significant number of records
  were not full-text searchable; as to those
  non-searchable records, defendant examined only the
  title page of the documents. Judge Grimm took
  defendant to task for not explaining any of the
  particulars about the full-text search used to locate
  the privileged records:
|
First, the Defendants are regrettably
vague in their description of the seventy
keywords used for the text-searchable ESI
privilege review, how they were developed, how
the search was conducted, and what quality
controls were employed to assess their
reliability and accuracy. … there is a growing
body of literature that highlights the risks
associated with conducting an unreliable or
inadequate keyword search or relying exclusively
on such searches for privilege review.
Additionally, the Defendants do not assert that
any sampling was done of the text searchable ESI
files that were determined not to contain
privileged information on the basis of the
keyword search to see if the search results were
reliable. Common sense suggests that even a
properly designed and executed keyword search
may prove to be over-inclusive or
under-inclusive, resulting in the identification
of documents as privileged which are not, and
non-privileged which, in fact, are. The only
prudent way to test the reliability of the
keyword search is to perform some appropriate
sampling of the documents determined to be
privileged and those determined not to be in
order to arrive at a comfort level that the
categories are neither overinclusive nor
under-inclusive. There is no evidence on the
record that the Defendants did so in this case.
Rather, it appears from the information that
they provided to the court that they simply
turned over to the Plaintiff all the text
searchable ESI files that were identified by the
keyword search Turner performed as
non-privileged, as well as the non-text
searchable files that Monkman and M. Pappas’
limited title page search determined not to be
privileged. (page 11–12 of opinion) |
- Plaintiff contended that the supposedly non-text
  searchable records could have been converted to
  searchable format by the use of readily available OCR
  software or the native OCR Text Recognition Tool
  within Adobe Acrobat, and also contended that most of
  the 165 records were in fact text-searchable.
In a privilege waiver
dispute, the burden of proof is on the producing party
to show that it has undertaken reasonable efforts to
prevent disclosure. At the very least, Victor Stanley
establishes that a party that uses full text search to
identify privileged records needs to keep a record of
the searches that were conducted and needs to validate
the effectiveness of the searches by random sampling of
unselected records. Victor Stanley also points out the
wisdom of having a court-approved claw-back agreement.
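What such a record might contain can be sketched as a simple structure. Victor Stanley does not prescribe any format, so everything here is an assumption: the class `SearchLogEntry`, its field names, and the sample values are hypothetical and chosen only to mirror the points Judge Grimm found missing (the keywords, how they were developed, how the search was run, and what sampling was done).

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SearchLogEntry:
    """One documented search run. All field names are illustrative assumptions."""
    run_date: str          # when the search was run
    purpose: str           # e.g. privilege review or production
    terms: list            # the exact search terms used
    how_developed: str     # who chose the terms and on what basis
    records_searched: int  # size of the text-searchable collection
    records_selected: int  # records the search terms hit
    sample_size: int       # unselected records reviewed as a check
    sample_seed: int       # seed used to draw the random sample
    hits_in_sample: int    # sampled records the search should have caught


# Hypothetical values, for illustration only.
entry = SearchLogEntry(
    run_date="2008-06-01",
    purpose="privilege review",
    terms=["attorney", "counsel", "privileged"],
    how_developed="Chosen by the review team after interviewing custodians",
    records_searched=10000,
    records_selected=400,
    sample_size=500,
    sample_seed=42,
    hits_in_sample=3,
)

# Keep the entry as JSON so it can be produced if the search is challenged.
print(json.dumps(asdict(entry), indent=2))
```

Recording the sample seed alongside the counts means the same random sample can be redrawn later, which makes the validation step easier to explain to the other side or to the court.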
Read It Yourself:
The following links will take you to the text of the
opinions discussed above:
O’Keefe:
https://ecf.dcd.uscourts.gov/cgi-bin/show_public_doc?2006cr0249-90
Equity
Analytics:
https://ecf.dcd.uscourts.gov/cgi-bin/show_public_doc?2007cv2033-32
Victor Stanley:
http://www.mdd.uscourts.gov/Opinions/Opinions/VictorStanley052908.pdf
To read what the
commentators have to say about these decisions, Google
“Facciola O'Keefe Equity Analytics Grimm Victor
Stanley”.
Postscript on locating
the full text of opinions
With some patient Internet searching you can often find
the text of significant e-Discovery opinions even
without subscriptions to expensive online legal research
services. For example, in the federal court system, PACER
(Public Access to Court Electronic Records) often
enables you to find docket entries as well as the text
of opinions themselves. There is a small fee for
accessing certain records, but it is far less than the
cost of the main commercial legal research systems.
The search for the
O’Keefe decision on PACER was made a bit more
challenging because it was listed under the name of a
co-defendant, Sunil Agrawal, and it wasn’t actually
filed on the PACER system until a few days after the
date of the opinion. The point is that when you rely
on PACER, you sometimes have to be a bit flexible. It
will often be helpful to have the case number as well as
the date the opinion was signed.