Searching…Sorting Through the Tool Box
Posted By Daniel Kaiser, Esq. on August 28, 2009
You may know what you’re looking for, but do you know how to look?
For those engaged in eDiscovery, two cases touching on search methodologies have held our attention over the past year: Magistrate Judge John Facciola’s decision in U.S. v. O’Keefe[1], and Magistrate Judge Paul Grimm’s decision in Victor Stanley, Inc. v. Creative Pipe, Inc.[2].
Facciola’s harangue regarding the complex nature of ESI searches may have assured his immortality, and it is too good to resist quoting yet again:
Whether search terms or ‘keywords’ will yield the information sought is a complicated question involving the interplay, at least, of the sciences of computer technology, statistics and linguistics…. Given this complexity, for lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than the terms that were used is truly to go where angels fear to tread.[3]
In this vein, Facciola noted that searching was best left to the experts. On the other hand, Grimm, emphasizing cross-party collaboration, sees the creation of search protocols as potentially falling within attorney competency – so long as the attorney has performed quality assurance testing on the methodology selected, can explain the rationale for selecting the methodology, and can show proper implementation.[4]
Facciola and Grimm come by their wariness honestly. The Text Retrieval Conference (TREC) series is a research body co-sponsored by NIST and IARPA (Logik is a 2009 TREC participant). The TREC Legal Track “focuses on evaluation of search technology for discovery of electronically stored information in litigation and regulatory settings.”[5] The Overview of the TREC 2008 Legal Track reports that “the consensus Boolean query found 42% of the highly relevant documents, on average per topic, . . . [and 33%] of all relevant documents.”[6] Further, negotiated Boolean keyword searches were found to be on par with the newer and more complex search methods tested.[7] In fact, keyword searches can be notably strengthened when they are performed iteratively: sampling the search results, and then adjusting the negotiated keywords to improve the results. Yet it has been observed that although various search methodologies may achieve comparable recall, the actual responsive documents each one retrieves vary, so that applying a mix of search technologies to the same data set can yield a higher rate of recall.[8]
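The point about mixed methods is easier to see with a little arithmetic. The sketch below is only an illustration: the document IDs, the two result sets, and the function name `recall` are all made up for the example and are not drawn from the TREC data. It shows two hypothetical searches that each find four of ten responsive documents, but not the same four, so that combining them raises recall.

```python
def recall(retrieved: set[str], relevant: set[str]) -> float:
    """Fraction of the relevant documents that a search retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical review set: ten documents a reviewer marked responsive.
relevant = {f"DOC{i:02d}" for i in range(1, 11)}

# Two hypothetical methods. Each finds four responsive documents plus
# some non-responsive ones (DOC21 to DOC23), and they overlap only partly.
boolean_hits = {"DOC01", "DOC02", "DOC03", "DOC04", "DOC21"}
concept_hits = {"DOC03", "DOC04", "DOC05", "DOC06", "DOC22", "DOC23"}

# Running both methods on the same data set and pooling the results.
combined_hits = boolean_hits | concept_hits

print("Boolean recall: ", recall(boolean_hits, relevant))
print("Concept recall: ", recall(concept_hits, relevant))
print("Combined recall:", recall(combined_hits, relevant))
```

In this toy example each method alone reaches 40% recall, while the pooled results reach 60%, because each method surfaces two responsive documents the other misses. The same calculation, run on a reviewed sample, is one simple way to check whether an adjusted keyword list has actually improved results from one iteration to the next.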
This emerging data, along with recent judicial enthusiasm for the incorporation of concept searching[9], reinforces the idea that attorneys need to develop a comfortable working knowledge of the array of electronic data search technologies. The following non-exclusive list of search methodologies and vocabulary is intended as a reference for those who are finding their way through the terminological wrangle and getting to know the eDiscovery landscape:
- Keyword Search: A search through a body of data for a stipulated word or set of words. Keywords are useful in finding documents containing a specific term.
- Boolean Search: Keyword searching with the aid of Boolean operators such as “AND”, “OR”, “NOT”, “W/#”, “( )”, “NEAR”, “TI=( )”, “BEFORE”, “AFTER”, “*”, or “!” (proximity designators, phrase designators, sequencing instructions, and word-trunk expanding instructions) to increase the searcher’s precision in included or excluded results.
- Fuzzy Logic: A search method using non-exact word matching to capture results that include variations of, or misspellings of, stipulated search terms.
- Concept Search: The use of sophisticated (and often proprietary) mathematical and linguistic analysis to return results pertaining to the concept and context suggested by your search term(s). The concept upon which your search results are based may or may not be literally present in your search terms, or in your search results.
- Algebraic Search: A search using mathematical models, including Boolean proximity operators, to interpret meaning in a document and to retrieve results accordingly.
- Clustering: The grouping of documents with related content into “clusters,” within which documents are often given a statistical ranking in their relationship to a template or seed document. These documents may be found to be related through an overlap of concepts and contexts, or through an overlap of specific terms. The use of this search method may provide the searcher access to the entire cluster, or may provide the searcher with related, alternative search terms.
- Concept and Categorization Tools: Search methods based on the use of a given thesaurus to return results from documents that express the same concept contained in the search term(s), in an alternative fashion.
- Linguistic Methods: Search methodologies that classify or select text documents based on a given taxonomy, ontology, or thesaurus.
- Naive Bayes Classifier: Based on Bayes’ theorem, a predictive relevance value is assigned to particular words according to their interrelationships, recurrence, a word’s position within a document, and proximity to other search terms.
- Ontologies: An ontology is similar to a taxonomy, but the relationships between terms need not be hierarchical and are broad (including synonyms and associated ideas). Using this search methodology, a searcher entering the term “tort” could pull results from documents containing the terms “litigation” or “damages.”
- Probabilistic Latent Semantic Analysis: In brief, this method of analysis (or indexing) uses a probabilistic model to retrieve text despite polysemy (a single word having multiple meanings) and synonymy (different words having the same meaning).
- Probabilistic Search Models (including Bayesian Classifiers): Probability formulas, including Bayesian methods, are used to determine the relevance of documents within a search pool – often incorporating a term’s historical relevance to the particular search performed to rank the search results.
- Social Network Analysis: An analysis and mapping of the interactions or associations amongst sets of nodes (actors, people, entities, information sources) into a complex grid representation of a network. Significance may be found in various factors such as the centrality of a node.
- Taxonomies: The hierarchical classification of terms and ideas into categories or sets, and subcategories or subsets. The use of this tool enables the searcher, for example, to retrieve results from any subcategory of their search query. A search for “tort” could pull results from documents containing the terms “negligence” or “nuisance.”
- Vector Space Retrieval: A search methodology based on the Vector Space Model. This method measures the similarity between documents, premised upon the idea that similarity may be used to indicate relevance. The model represents various documents as vectors in space, with those deemed to be more similar being positioned closer together in space.
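To make the last entry concrete, here is a minimal sketch of vector space retrieval. It is an illustration under stated assumptions, not how any particular eDiscovery product works: the three sample documents, the query, and the function names (`to_vector`, `cosine_similarity`, `rank`) are invented for the example, and each document is represented by raw word counts with no stemming or term weighting.

```python
import math
import re
from collections import Counter


def to_vector(text: str) -> Counter:
    """Represent a document as a bag-of-words vector of raw term counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term vectors (0.0 = no shared terms)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: str, documents: dict[str, str]) -> list[tuple[str, float]]:
    """Score every document against the query, most similar first."""
    query_vector = to_vector(query)
    scored = [
        (doc_id, cosine_similarity(query_vector, to_vector(text)))
        for doc_id, text in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical three-document collection.
documents = {
    "memo": "The contractor was negligent in installing the pipe fittings.",
    "email": "Lunch on Friday? The new place near the office is good.",
    "letter": "Damages from the negligent pipe installation exceed our estimate.",
}

for doc_id, score in rank("negligent pipe installation", documents):
    print(f"{doc_id}: {score:.2f}")
```

The letter ranks first because it shares all three query terms, the memo second because it shares two ("installing" does not match "installation" without stemming), and the email last with a score of zero. Production systems typically weight terms, for example with TF-IDF, rather than relying on raw counts as this sketch does.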
[1] U.S. v. O’Keefe, 537 F. Supp. 2d 14 (D.D.C. 2008).
[2] Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251 (D. Md. 2008).
[3] U.S. v. O’Keefe, 537 F. Supp. 2d 14, 24 (D.D.C. 2008).
[4] Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251 (D. Md. 2008) (citing The Sedona Conference Best Practices Commentary on the Use of Search & Information Retrieval Methods in E-Discovery, 8 Sedona Conf. J. 189 (2007)).
[5] http://trec.nist.gov/pubs/trec17/papers/LEGAL.OVERVIEW08.pdf.
[6] http://trec.nist.gov/pubs/trec17/papers/LEGAL.OVERVIEW08.pdf at 5.
[7] Jason Krause, In Search of the Perfect Search, A.B.A.J. (Apr. 2009), http://www.abajournal.com/magazine/in_search_of_the_perfect_search/.
[8] Id.
[9] See Disability Rights Council of Greater Wash. v. Wash. Metro. Area Transit Auth., 2007 WL 1585452 (D.D.C. June 1, 2007).