Archive for the ‘e-Discovery’ Category

Research topics in e-discovery

Monday, August 25th, 2014

Research topics in e-discovery by William Webber.

From the post:

Dr. Dave Lewis is visiting us in Melbourne on a short sabbatical, and yesterday he gave an interesting talk at RMIT University on research topics in e-discovery. We also had Dr. Paul Hunter, Principal Research Scientist at FTI Consulting, in the audience, as well as research academics from RMIT and the University of Melbourne, including Professor Mark Sanderson and Professor Tim Baldwin. The discussion amongst attendees was almost as interesting as the talk itself, and a number of suggestions for fruitful research were raised, many with fairly direct relevance to application development. I thought I’d capture some of these topics here:

E-discovery, if you don’t know, is found in civil litigation and government investigations. Think of it as hacking with rules, since the purpose of e-discovery is to find information that supports your claims or defense. E-discovery is high-stakes data mining that pays very well. Need I say more?

Webber lists the following research topics:

  1. Classification across heterogeneous document types
  2. Automatic detection of document types
  3. Faceted categorization
  4. Label propagation across related documents
  5. Identifying unclassifiable documents
  6. Identifying poor training examples
  7. Identifying significant fragments in non-significant text
  8. Routing of documents to specialized trainers
  9. Total cost of annotation

“Label propagation across related documents” looks like a natural for topic maps but searching over defined properties that identify subjects as opposed to opaque tokens would enhance the results for a number of these topics.
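To make “label propagation across related documents” a little more concrete, here is a minimal sketch in Python. It copies a reviewer’s label to unreviewed documents in the same email thread. The `thread_id` field, the sample records and the rule of propagating only when every reviewed document in a thread agrees are all my assumptions for illustration; none of them come from Webber’s post.

```python
from collections import defaultdict

def propagate_labels(docs):
    """Copy a label to unlabeled documents in the same thread.

    docs: list of dicts with hypothetical keys 'id', 'thread_id', 'label'
    (label is None when the document has not been reviewed).
    Returns {doc_id: label} for newly labeled documents only.
    """
    # Collect the distinct labels reviewers assigned within each thread.
    labels_by_thread = defaultdict(set)
    for doc in docs:
        if doc["label"] is not None:
            labels_by_thread[doc["thread_id"]].add(doc["label"])

    propagated = {}
    for doc in docs:
        seen = labels_by_thread.get(doc["thread_id"], set())
        # Propagate only when every reviewed document in the thread agrees.
        if doc["label"] is None and len(seen) == 1:
            propagated[doc["id"]] = next(iter(seen))
    return propagated

docs = [
    {"id": "a1", "thread_id": "t1", "label": "relevant"},
    {"id": "a2", "thread_id": "t1", "label": None},
    {"id": "b1", "thread_id": "t2", "label": "relevant"},
    {"id": "b2", "thread_id": "t2", "label": "irrelevant"},
    {"id": "b3", "thread_id": "t2", "label": None},
]
print(propagate_labels(docs))  # {'a2': 'relevant'}
```

Thread `t2` has conflicting labels, so `b3` is left for a human. Keying on an opaque `thread_id` is exactly the “opaque token” approach; keying on defined properties that identify the subject would be the topic map version.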


Tuesday, December 17th, 2013

2013 End-of Year List of People Who Make a Difference in eDiscovery by Gerard J. Britton.

Gerard has created a list of six (6) people who made a difference in ediscovery in 2013.

If ediscovery is unfamiliar, you have all of the issues of data/big data with an additional layer of legal rules and requirements.

Typically seen in litigation with high stakes.

A fruitful area for the application of semantic integration technologies, topic maps in particular.

dtsearch Tutorial Videos

Tuesday, February 19th, 2013

Tutorials for the dtsearch engine have been posted to ediscovery TV.

In five parts:

Part 1

Part 2

Part 3

Part 4

Part 5

I skipped over the intro videos only to find:

Not being able to “select all” in Excel doesn’t increase my confidence in the presentation. (part 3)

The copying of files that are “responsive” to a search request is convenient but not all that impressive. (part 4)

The presenter isn’t familiar with basic operations in dtsearch, such as finding the list of files not copied. It does finally appear. (part 5)

Disappointing because I remember dtsearch from years ago and it was (and still is) an impressive bit of work.

Suggestion: Don’t judge dtsearch by these videos.

I started to suggest you download all the brochures/white papers you will find at:

There is a helpful “Download All: PDF Portfolio” link, except that it doesn’t work, in Chrome at least. It keeps giving me a “Download Adobe Acrobat 10” window, even after I install Adobe Acrobat 10.

Here’s a general hint for vendors: Don’t try to help. You will get it wrong. If you want to give users access to a file, great, but let viewing/use be on their watch.

So, download the brochures/white papers individually until dtsearch recovers from another self-inflicted marketing wound.

Then grab a 30-day evaluation copy of the software.

It may or may not fit your needs but you will get a fairer sense of the product than you will from the videos or parts of the dtsearch website.

Maybe that’s the key: They are great search engineers, not so hot at marketing or websites.

I first saw this at dtSearch Harnesses TV Power. Where videos are cited, but not watched.

Day Nine of a Predictive Coding Narrative: A scary search…

Wednesday, August 8th, 2012

Day Nine of a Predictive Coding Narrative: A scary search for false-negatives, a comparison of my CAR with the Griswold’s, and a moral dilemma by Ralph Losey.

From the post:

In this sixth installment I continue my description, this time covering day nine of the project. Here I do a quality control review of a random sample to evaluate my decision in day eight to close the search.

Ninth Day of Review (4 Hours)

I began by generating a random sample of 1,065 documents from the entire null set (95% +/- 3%) of all documents not reviewed. I was going to review this sample as a quality control test of the adequacy of my search and review project. I would personally review all of them to see if any were False Negatives, in other words, relevant documents, and if relevant, whether any were especially significant or Highly Relevant.

I was looking to see if there were any documents left on the table that should have been produced. Remember that I had already personally reviewed all of the documents that the computer had predicted were likely to be relevant (51% probability). I considered the upcoming random sample review of the excluded documents to be a good way to check the accuracy of reliance on the computer’s predictions of relevance.

I know it is not the only way, and there are other quality control measures that could be followed, but this one makes the most sense to me. Readers are invited to leave comments on the adequacy of this method and other methods that could be employed instead. I have yet to see a good discussion of this issue, so maybe we can have one here.

I can appreciate Ralph’s apprehension at a hindsight review of decisions already made. In legal proceedings, decisions are made and they move forward. Some judgements/mistakes can be corrected, others are simply case history.
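For the curious, the “95% +/- 3%” behind the 1,065-document sample can be roughly checked with the standard sample-size formula for a proportion, n0 = z² · p(1 − p) / e², plus a finite population correction. This is a sketch under my own assumptions: the worst-case p = 0.5, and the 699,082 total collection size from Ralph’s series standing in for the null set size, which the excerpt does not give. I do not know how his software actually arrived at 1,065.

```python
from statistics import NormalDist

def sample_size(confidence, margin, population=None, p=0.5):
    """Sample size for estimating a proportion (returned as a float).

    confidence: e.g. 0.95; margin: e.g. 0.03 for +/- 3%.
    population: optional collection size for the finite population correction.
    p: assumed proportion; 0.5 is the worst case and gives the largest sample.
    """
    # Two-sided z value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z * z * p * (1 - p) / (margin * margin)
    if population is None:
        return n0
    # Finite population correction shrinks the sample slightly.
    return n0 / (1 + (n0 - 1) / population)

print(sample_size(0.95, 0.03))                      # a little over 1,067
print(sample_size(0.95, 0.03, population=699_082))  # between 1,065 and 1,066
```

Both figures land within a document or two of the 1,065 quoted, which suggests (but does not prove) that a calculation of this kind is behind the number.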

Days Seven and Eight of a Predictive Coding Narrative [Re-Use of Analysis?]

Wednesday, August 8th, 2012

Days Seven and Eight of a Predictive Coding Narrative: Where I have another hybrid mind-meld and discover that the computer does not know God by Ralph Losey.

From the post:

In this fifth installment I will continue my description, this time covering days seven and eight of the project. As the title indicates, progress continues and I have another hybrid mind-meld moment. I also discover that the computer does not recognize the significance of references to God in an email. This makes sense logically, but is unexpected and kind of funny when encountered in a document review.

Ralph discovered new terms to use for training as the analysis of the documents progressed.

While Ralph captures those for his use, my question would be how to capture what he learned for re-use?

As in re-use by other parties, perhaps in other litigation.

Thinking of reducing the cost of discovery by sharing analysis of data sets, rather than every discovery process starting at ground zero.

Days Five and Six of a Predictive Coding Narrative

Friday, July 27th, 2012

Days Five and Six of a Predictive Coding Narrative: Deep into the weeds and a computer mind-meld moment by Ralph Losey.

From the post:

This is my fourth in a series of narrative descriptions of an academic search project of 699,082 Enron emails and attachments. It started as a predictive coding training exercise that I created for Jackson Lewis attorneys. The goal was to find evidence concerning involuntary employee terminations. The first day of search is described in Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. The second day is described in Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane. The third and fourth days are described in Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree.

In this fourth installment I continue to describe what I did in days five and six of the project. In this narrative I go deep into the weeds and describe the details of multimodal search. Near the end of day six I have an affirming hybrid multimodal mind-meld moment, which I try to describe. I conclude by sharing some helpful advice I received from Joseph White, one of Kroll Ontrack’s (KO) experts on predictive coding and KO’s Inview software. Before I launch into the narrative, a brief word about vendor experts. Don’t worry, it is not going to be a commercial for my favorite vendors; more like a warning based on hard experience.

You will learn a lot about predictive analytics and e-discovery from this series of posts but the most important paragraphs I have read thus far:

When talking to the experts, be sure that you understand what they say to you, and never just nod in agreement when you do not really get it. I have been learning and working with new computer software of all kinds for over thirty years, and am not at all afraid to say that I do not understand or follow something.

Often you cannot follow because the explanation is so poor. For instance, often the words I hear from vendor tech experts are too filled with company specific jargon. If what you are being told makes no sense to you, then say so. Keep asking questions until it does. Do not be afraid of looking foolish. You need to be able to explain this. Repeat back to them what you do understand in your own words until they agree that you have got it right. Do not just be a parrot. Take the time to understand. The vendor experts will respect you for the questions, and so will your clients. It is a great way to learn, especially when it is coupled with hands-on experience.

Insisting that experts explain until you understand what is being said will help you avoid costly mistakes and make you more sympathetic to a client’s questions when you are the expert.

The technology and software for predictive coding will change beyond recognition in a few short years.

Demanding and giving explanations that “explain” is a skill that will last a lifetime.

Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree

Friday, July 27th, 2012

Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree by Ralph Losey.

From the post:

This is the third in a series of detailed descriptions of a legal search project. The project was an academic training exercise for Jackson Lewis e-discovery liaisons conducted in May and June 2012. I searched a set of 699,082 Enron emails and attachments for possible evidence pertaining to involuntary employee terminations. The first day of search is described in Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. The second day is described in Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane.

The description of day-two was short, but it was preceded by a long explanation of my review plan and search philosophy, along with a rant in favor of humanity and against over-dependence on computer intelligence. Here I will just stick to the facts of what I did in days three and four of my search using Kroll Ontrack’s (KO) Inview software.

Interesting description of where Ralph and the computer disagree on relevant/irrelevant judgement on documents.

Unless I just missed it, Ralph is only told by the software what rating a document was given, not why the software arrived at that rating. Yes?

If you knew what terms drove a particular rating, it would be interesting to “comment out” those terms in a document to see the impact on its relevance rating.
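Here is a minimal sketch of that “comment out” experiment in Python. The scorer is a toy of my own invention (a sum of hypothetical per-term weights) and has nothing to do with how Inview actually rates documents. The point is only the procedure: remove a term, re-score, and compare.

```python
def score(text, weights):
    """Toy relevance score: sum the weights of the known terms present."""
    terms = set(text.lower().split())
    return sum(w for term, w in weights.items() if term in terms)

def term_impact(text, weights):
    """Score drop when each weighted term is 'commented out' of the text."""
    base = score(text, weights)
    impact = {}
    for term in weights:
        # Rebuild the document without this one term, then re-score it.
        remaining = " ".join(t for t in text.split() if t.lower() != term)
        impact[term] = base - score(remaining, weights)
    return impact

# Hypothetical weights, as if learned from reviewer training.
weights = {"terminated": 2.0, "severance": 1.5, "lunch": -0.5}
doc = "She was terminated without severance last week"
print(term_impact(doc, weights))
# {'terminated': 2.0, 'severance': 1.5, 'lunch': 0.0}
```

With a linear toy like this one, the impact of a term is just its weight. Against a real classifier, where terms interact, the same remove-and-re-score loop would be far more informative.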

Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane

Friday, July 13th, 2012

Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane by Ralph Losey.

From the post:

Day One of the search project ended when I completed review of the initial 1,507 machine-selected documents and initiated the machine learning. I mentioned in the Day One narrative that I would explain why the sample size was that high. I will begin with that explanation and then, with the help of William Webber, go deeper into math and statistical sampling than ever before. I will also give you the big picture of my review plan and search philosophy: it’s hybrid and multimodal. Some search experts disagree with my philosophy. They think I do not go far enough to fully embrace machine coding. They are wrong. I will explain why and rant on in defense of humanity. Only then will I conclude with the Day Two narrative.

More than you are probably going to want to know about sample sizes and their calculation but persevere until you get to the defense of humanity stuff. It is all quite good.

If I had to add a comment on the defense of humanity rant, it would be that machines have a flat view of documents and not the richly textured one of a human reader. While it is true that machines can rapidly compare documents without tiring, they will miss an executive referring to a secretary as his “cupcake,” a reference that would jump out at a human reader. Same text, different result.

Perhaps because in one case the text is being scanned for tokens and in the other case it is being read.

Predictive Coding Patented, E-Discovery World Gets Jealous

Wednesday, June 27th, 2012

Predictive Coding Patented, E-Discovery World Gets Jealous by Christopher Danzig

From the post:

The normally tepid e-discovery world felt a little extra heat of competition yesterday. Recommind, one of the larger e-discovery vendors, announced Wednesday that it was issued a patent on predictive coding (which Gabe Acevedo, writing in these pages, named the Big Legal Technology Buzzword of 2011).

In a nutshell, predictive coding is a relatively new technology that allows large chunks of document review to be automated, a.k.a. done mostly by computers, with less need for human management.

Some of Recommind’s competitors were not happy about the news. See how they responded (grumpily), and check out what Recommind’s General Counsel had to say about what this means for everyone who uses e-discovery products….

Predictive coding has received a lot of coverage recently as a new way to save buckets of money during document review (a seriously expensive endeavor, for anyone who just returned to Earth).

I am always curious why a patent or even patent number will be cited but no link to the patent given?

In case you are curious, it is patent 7,933,859, as a hyperlink.

The abstract reads:

Systems and methods for analyzing documents are provided herein. A plurality of documents and user input are received via a computing device. The user input includes hard coding of a subset of the plurality of documents, based on an identified subject or category. Instructions stored in memory are executed by a processor to generate an initial control set, analyze the initial control set to determine at least one seed set parameter, automatically code a first portion of the plurality of documents based on the initial control set and the seed set parameter associated with the identified subject or category, analyze the first portion of the plurality of documents by applying an adaptive identification cycle, and retrieve a second portion of the plurality of documents based on a result of the application of the adaptive identification cycle test on the first portion of the plurality of documents.

If that sounds familiar to you, you are not alone.

Predictive coding, developed over the last forty years, is an excellent feed into a topic map. As a matter of fact, it isn’t hard to imagine a topic map seeding and being augmented by a predictive coding process.

I also mention it as a caution that the IP in this area, as in many others, is beset by the ordinary being approved as innovation.

A topic map would be ideal to trace claims, prior art and to attach analysis to a patent. I saw several patents assigned to Recommind and some pending applications. When I have a moment I will post a listing with links to those documents.

I first saw this at Beyond Search.

Hands-on examples of legal search

Saturday, May 19th, 2012

Hands-on examples of legal search by Michael J. Bommarito II.

From the post:

I wanted to share with the group some of my recent work on search in the legal space. I have been developing products and service models, but I thought many of the experiences or guides could be useful to you. I would love to share some of this work to help foster a “hacker” community in which we might collaborate on projects.

The first few posts are based on Amazon’s CloudSearch service. CloudSearch, as the name suggests, is a “cloud-based” search service. Once you decide what and how you would like to search, Amazon handles procuring the underlying infrastructure, scaling to required capacity, stemming, stop-wording, building indices, etc. For those of you who do not have access to “search appliances” or labor to configure products like Solr, this offers an excellent opportunity.

Pointers to several posts by Michael that range from searching U.S. Supreme Court decisions and email archives to statutory law.

From law to eDiscovery, something for everybody!

Monique da Silva Moore, et al. v. Publicis Groupe SA, et al, 11 Civ. 1279

Tuesday, May 8th, 2012

Monique da Silva Moore, et al. v. Publicis Groupe SA, et al, 11 Civ. 1279

The foregoing link is something of a novelty. It is a link to the opinion by US Magistrate Andrew Peck, approving the use of predictive coding (computer-assisted review) as part of e-discovery.

It is not a pointer to an article with no link to the opinion. It is not a pointer to an article on the district judge’s opinion, upholding the magistrate’s order but adding nothing of substance on the use of predictive coding. It is not a pointer to a law journal that requires “free” registration.

I think readers have a reasonable expectation that articles contain pointers to primary source materials. Otherwise, why not write for the tabloids?

Sorry, I just get enraged when resources do not point to primary sources. Not only is it poor writing, it is discourteous to readers.

Magistrate Peck’s opinion is said to be the first that approves the use of predictive coding as part of e-discovery.

In very summary form, the plaintiff (the person suing) has requested that the defendant (the person being sued) produce documents, including emails, in its possession that are responsive to a discovery request. A discovery request is where the plaintiff specifies what documents it wants the defendant to produce, usually described as members of a class of documents. For example, all documents with statements about [plaintiff’s name] employment with X, prior to N date.

In this case, there are 3 million emails to be searched and then reviewed by the defense lawyers (for claims of privilege, non-disclosure authorized by law, such as advice of counsel in some cases) prior to production for review by the plaintiff, who may then use one or more of the emails at trial.

The question is: Should the defense lawyers use a few thousand documents to train a computer to search the 3 million documents or should they use other methods, which will result in much higher costs because lawyers have to review more documents?

The law, facts and e-discovery issues weave in and out of Magistrate Peck’s decision but if you ignore the obviously legalese parts you will get the gist of what is being said. (If you have e-discovery issues, please seek professional assistance.)

I think topic maps could be very relevant in this situation because subjects permeate the discovery process, under different names and perspectives, to say nothing of sharing analysis and data with co-counsel.

I am also mindful that analysis of presentations, speeches, written documents, emails, discovery from other cases, could well develop profiles of potential witnesses in business litigation in particular. A topic map could be quite useful in mapping the terminology most likely to be used by a particular defendant.

BTW, it will be a long time coming, in part because it would reduce the fees of the defense bar, but I would say, “OK, here are the 3 million emails. We reserve the right to move to exclude any on the basis of privilege, relevancy, etc.”

That ends all the dancing around about discovery and if the plaintiff wants to slog through 3 million emails, fine. They still have to disclose what they intend to produce as exhibits at trial.

Electronic Discovery Reference Model

Tuesday, September 13th, 2011

Electronic Discovery Reference Model (EDRM)

From the webpage:

EDRM develops guidelines, sets standards and delivers resources to help e-discovery consumers and providers improve quality and reduce costs associated with e-discovery

EDRM consists of 9 projects, each designed to help reach those goals:

Data Set, Evergreen, IGRM (Information Governance Reference Model), Jobs, Metrics, Model Code of Conduct, Search, Testing, XML.

Definitely on your radar if you are working on topic maps and legal discovery.

I will be returning to the projects to treat them individually. The “Data Set” project alone may take longer than my usual post to simply summarize.

e-Discovery Zone

Monday, September 12th, 2011

e-Discovery Zone

Vendor sponsored site but looks like a fairly rich collection of links to e-discovery (law/legal) materials.

Saturday, August 27th, 2011

OpenSource eDiscovery Engine

Gartner projects that eDiscovery will be a $1.5 Billion market by 2013.

An open source project that compares to or exceeds the capabilities of other solutions would be a very interesting prospect.

Particularly if the software had an inherent capability to merge eDiscovery results from multiple sources, say multiple plaintiffs attorneys who had started on litigation separately, but now need to “merge” their discovery results.
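As a rough sketch of what such a “merge” might look like, the Python below combines review results from two hypothetical firms, keyed by a content hash of each document. The short hash strings, the tag names and the firm labels are made up for illustration, and real merging would need far better identity criteria than a single hash.

```python
def merge_results(*result_sets):
    """Merge review results keyed by content hash, unioning tags and sources.

    Each argument is a (source_name, {doc_hash: set_of_tags}) pair.
    """
    merged = {}
    for source, results in result_sets:
        for doc_hash, tags in results.items():
            # One entry per document, however many parties reviewed it.
            entry = merged.setdefault(doc_hash, {"tags": set(), "sources": set()})
            entry["tags"].update(tags)
            entry["sources"].add(source)
    return merged

# Hypothetical results from two plaintiffs' firms that started separately.
firm_a = {"9f2c": {"relevant", "termination"}, "41aa": {"privileged"}}
firm_b = {"9f2c": {"relevant", "hot"}, "77e0": {"irrelevant"}}

merged = merge_results(("firm_a", firm_a), ("firm_b", firm_b))
print(sorted(merged["9f2c"]["tags"]))  # ['hot', 'relevant', 'termination']
```

Note that this simply unions the tags. It does nothing to resolve disagreements, such as one firm marking a document relevant and another irrelevant, and that is where the interesting work would be.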

The Information Explosion and a Great Article by Grossman and Cormack on Legal Search

Tuesday, June 14th, 2011

The Information Explosion and a Great Article by Grossman and Cormack on Legal Search

A discussion of the “information explosion” and a review of Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, Richmond Journal of Law and Technology.

See what you think but I don’t read the article as debunking exhaustive manual review (by humans) so much as introducing technology to make human reviewers more effective.

Still human review, but assisted by technology to leverage it for large document collections.

As the article notes, the jury (sorry!) is still out on what assisted search methods work the best. This is an area where human recognition of subjects and recording that recognition for others, such as the use of different names for parties to litigation, would be quite useful. I would record that recognition using topic maps, but that isn’t surprising.

Justice Department … E-Discovery Review and Advanced Text Analytics

Wednesday, June 8th, 2011

United States Justice Department Implements Relativity for E-Discovery Review and Advanced Text Analytics

From the announcement:

…Relativity Analytics powers functionality such as clustering, the automatic grouping of documents by similar concepts, as well as concept search, and the ability for end users to train the system to group documents based on concepts and issues they define.

Relativity is being deployed in EOUSA’s Litigation Technology Service Center (LTSC) to provide electronic discovery services for all U.S. Attorneys’ Offices, which include over 6,000 attorneys nationwide. EOUSA will use Relativity Analytics to empower U.S. Attorney teams to do more with limited resources by allowing them to quickly locate key documents and increase their review speeds through enormous data sets in compressed time frames.

I like the training the system to group documents idea. Not that far from interactive merging based on user criteria. Would be more useful to colleagues if portions of documents could be grouped, so they don’t have to wade through documents for the relevant bits.

There is a lot of e-discovery management software on the market but two quick points:

1) The bar for good software goes up every year, and,

2) Topic maps have unique features that could make them players in this ever expanding market.