Archive for the ‘Legal Informatics’ Category

If You Believe in Parliaments

Wednesday, July 19th, 2017

If you believe in parliaments, other than as examples of how governments don’t “get it,” then the Law Library of Congress, Global Legal Research Center has a treat for you!

Fifty (50) countries and seventy (70) websites surveyed in: Features of Parliamentary Websites in Selected Jurisdictions.

From the summary:

In recent years, parliaments around the world have enhanced their websites in order to improve access to legislative information and other parliamentary resources. Innovative features allow constituents and researchers to locate and utilize detailed information on laws and lawmaking in various ways. These include tracking tools and alerts, apps, the use of open data technology, and different search functions. In order to demonstrate some of the developments in this area, staff from the Global Legal Research Directorate of the Law Library of Congress surveyed the official parliamentary websites of fifty countries from all regions of the world, plus the website of the European Parliament. In some cases, information on more than one website is provided where separate sites have been established for different chambers of the national parliament, bringing the total number of individual websites surveyed to seventy.

While the information on the parliamentary websites is primarily in the national language of the particular country, around forty of the individual websites surveyed were found to provide at least limited information in one or more other languages. The European Parliament website can be translated into any of the twenty-four official languages of the members of the European Union.

All of the parliamentary websites included in the survey have at least basic browse tools that allow users to view legislation in a list format, and that may allow for viewing in, for example, date or title order. All of the substantive websites also enable searching, often providing a general search box for the whole site at the top of each page as well as more advanced search options for different types of documents. Some sites provide various facets that can be used to further narrow searches.

Around thirty-nine of the individual websites surveyed provide users with some form of tracking or alert function to receive updates on certain documents (including proposed legislation), parliamentary news, committee activities, or other aspects of the website. This includes the ability to subscribe to different RSS feeds and/or email alerts.

The ability to watch live or recorded proceedings of different parliaments, including debates within the relevant chamber as well as committee hearings, is a common feature of the parliamentary websites surveyed. Fifty-eight of the websites surveyed featured some form of video, including links to dedicated YouTube channels, specific pages where users can browse and search for embedded videos, and separate video services or portals that are linked to or viewable from the main site. Some countries also make videos available on dedicated mobile-friendly sites or apps, including Denmark, Germany, Ireland, the Netherlands, and New Zealand.

In total, apps containing parliamentary information are provided in just fourteen of the countries surveyed. In comparison, the parliamentary websites of thirty countries are available in mobile-friendly formats, enabling easy access to information and different functionalities using smartphones and tablets.

The table also provides information on some of the additional special features available on the surveyed websites. Examples include dedicated sites or pages that provide educational information about the parliament for children (Argentina, El Salvador, Germany, Israel, Netherlands, Spain, Taiwan, Turkey); calendar functions, including those that allow users to save information to their personal calendars or otherwise view information about different types of proceedings or events (available on at least twenty websites); and open data portals or other features that allow information to be downloaded in bulk for reuse or analysis, including through the use of APIs (application programming interfaces) (at least six countries).

With differing legal vocabularies and local personification of multi-nationals, this is a starting point for transparency based upon topic maps.

I first saw this in a tweet by the Global Investigative Journalism Network (GIJN).

ODI – Access To Legal Data News

Friday, January 13th, 2017

Strengthening our legal data infrastructure by Amanda Smith.

Amanda recounts an effort between the Open Data Institute (ODI) and Thomson Reuters to improve access to legal data.

From the post:

Paving the way for a more open legal sector: discovery workshop

In September 2016, Thomson Reuters and the ODI gathered publishers of legal data, policy makers, law firms, researchers, startups and others working in the sector for a discovery workshop. Its aims were to explore important data types that exist within the sector, and map where they sit on the data spectrum, discuss how they flow between users and explore the opportunities that taking a more open approach could bring.

The notes from the workshop explore current mechanisms for collecting, managing and publishing data, benefits of wider access and barriers to use. There are certain questions that remain unanswered – for example, who owns the copyright for data collected in court. The notes are open for comments, and we invite the community to share their thoughts on these questions, the data types discussed, how to make them more open and what we might have missed.

Strengthening data infrastructure in the legal sector: next steps

Following this workshop we are working in partnership with Thomson Reuters to explore data infrastructure – datasets, technologies and processes and organisations that maintain them – in the legal sector, to inform a paper to be published later in the year. The paper will focus on case law, legislation and existing open data that could be better used by the sector.

The Ministry of Justice have also started their own data discovery project, which the ODI have been contributing to. You can keep up to date on their progress by following the MOJ Digital and Technology blog and we recommend reading their data principles.

Get involved

We are looking to the legal and data communities to contribute opinion pieces and case studies to the paper on data infrastructure for the legal sector. If you would like to get involved, contact us.
…(emphasis in original)

Encouraging news, especially for those interested in building value-added tools on top of data that is made available publicly. At least they can avoid the cost of collecting data already collected by others.

Take the opportunity to comment on the notes and participate as you are able.

If you think you have seen use cases for topic maps before, consider that the Code of Federal Regulations (US), as of December 12, 2016, has 54938 separate, but not unique, definitions of “person.” The impact of each regulation depends upon its definition of that term.

Other terms have similar semantic difficulties in both the Code of Federal Regulations and the US Code.

U.K. Parliament – U.S. Congress : Legislative Process Glossaries

Monday, August 22nd, 2016

I encountered the glossary for legislative activity for the U.S. Congress and remembered a post where I mentioned a similar resource for the U.K.

Rather than having to dig for both of them in the future:

U.K. Parliament – Glossary

U.S. Congress – Glossary

To be truly useful, applications displaying information from either source should automatically tag these terms for quick reference by readers.
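As a minimal sketch of what such automatic tagging might look like, assume the glossary has already been loaded into a dictionary. The entries, the `GLOSSARY` name, and the `[term|source]` marker format below are illustrative assumptions, not anything published by either legislature.

```python
import re

# Hypothetical sample entries; a real table would be loaded from the two glossaries.
GLOSSARY = {
    "cloture": "US",
    "filibuster": "US",
    "royal assent": "UK",
    "division": "UK",
}

def tag_terms(text, glossary=GLOSSARY):
    """Wrap each glossary term found in text in a [term|source] marker."""
    # Longest terms first, so multi-word terms win over any shorter overlap.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )

    def mark(match):
        term = match.group(0)
        # Look up by lower-cased form but keep the reader's original casing.
        return f"[{term}|{glossary[term.lower()]}]"

    return pattern.sub(mark, text)

print(tag_terms("The bill awaits Royal Assent after cloture."))
```

A display layer could turn each marker into a link or tooltip pointing at the matching glossary entry.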


Law Library Blogs

Friday, February 19th, 2016

Law Library Blogs by Aaron Kirschenfeld.

A useful collection of fifty-four (54) institutional law library blogs on Feedly.

Law library blogs are one of the online resources you should be following if you are interested in legal informatics.

Bluebook® vs. Baby Blue’s (Or, Bleak House “Lite”)

Friday, February 19th, 2016

The suspense over what objections The Bluebook® A Uniform System of Citation® could have to the publication of Baby Blue’s Manual of Legal Citation ended with a whimper and not a bang on the publication of Baby Blue’s.

You may recall I have written in favor of Baby Blue’s, sight unseen, Bloggers! Help Defend The Public Domain – Prepare To Host/Repost “Baby Blue”, and, Oxford Legal Citations Free, What About BlueBook?.

Of course, then Baby Blue’s Manual of Legal Citation was published.

I firmly remain of the opinion that legal citations are in the public domain. Moreover, the use of legal citations is the goal of any citation originator so assertion of copyright on the same would be self-defeating, if not insane.

Having said that, Baby Blue’s Manual of Legal Citation is more of a Bleak House “Lite” than a useful re-imagining of legal citation in a modern context.

I don’t expect you to take my word for that judgment so I have prepared mappings from Bluebook® to Baby Blue’s and Baby Blue’s to Bluebook®.

Caveat 1: Baby Blue’s is still subject to revision and may tinker with its table numbering to further demonstrate its “originality” for example, so consider these mappings as provisional and subject to change.

Caveat 2: The mappings are pointers to equivalent subject matter and not strictly equivalent content.

How closely the content of these two publications track each other is best resolved by automated comparison of the two.

As general assistance, pages 68-191 (out of 198) of Baby Blue’s are in substantial accordance with pages 233-305 and 491-523 of the Bluebook®. Foreign citations, covered by pages 307-490 in the Bluebook®, merit a scant two pages, 192-193, in Baby Blue’s.

The substantive content of Baby Blue’s doesn’t begin until page 10 and continues to page 67, with tables beginning on page 68. In terms of non-table content, there are only 57 pages of material for comparison to the Bluebook®. As you can see from the mappings, the ordering of rules has been altered from the Bluebook®, no doubt as a showing of “originality.”

The public does need greater access to primary legal resources but treating the ability to cite Tucker and Clephane (District of Columbia, 1892-1893) [Baby Blue’s page 89] on a par with Federal Reporter [Baby Blue’s page 67] is not a step in that direction.

PS: To explore the issues and possibilities at hand, you will need a copy of The Bluebook® A Uniform System of Citation®.

Some starter questions:

  1. What assumptions underlie the rules reported in the Bluebook®?
  2. How would you measure the impact of changing the rules it reports?
  3. What technologies drove its form and organization?
  4. What modern technologies could alter its form and organization?
  5. How can modern technologies differently display content that uses its citations?

A more specific question could be: Do we need 123 pages of abbreviations (Baby Blue’s) or 113 pages of abbreviations (Bluebook®) when software has the capability to display expanded abbreviations to any user? Even if written originally as an abbreviation.

Abbreviations being both a means of restricting access/understanding and partially a limitation of the printed page into which we sought to squeeze as much information as possible.
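To make the point about software concrete, here is a rough sketch of expanding a reporter abbreviation at display time. The four-entry `ABBREVIATIONS` table and the simple volume/reporter/page pattern are assumptions for illustration; real citations come in many more shapes than this handles.

```python
import re

# Hypothetical sample table; a real one would carry the full abbreviation lists.
ABBREVIATIONS = {
    "U.S.": "United States Reports",
    "S. Ct.": "Supreme Court Reporter",
    "F.2d": "Federal Reporter, Second Series",
    "F.3d": "Federal Reporter, Third Series",
}

# Assumed shape: volume number, reporter abbreviation, first page.
CITE = re.compile(r"^(\d+)\s+(.+?)\s+(\d+)$")

def expand(citation, table=ABBREVIATIONS):
    """Return a spelled-out form of a citation, or the input if it is not recognized."""
    match = CITE.match(citation.strip())
    if not match:
        return citation
    volume, reporter, page = match.groups()
    full_name = table.get(reporter)
    if full_name is None:
        return citation
    return f"{full_name}, volume {volume}, page {page}"

print(expand("347 U.S. 483"))
```

The stored text stays abbreviated; only the display changes, so writers and readers need not share the same memorized code.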

Should anyone raise the issue of “governance,” with you in regard to the Bluebook®, they are asking for a seat at the citation rule table for themselves, not you. My preference is to turn the table over in favor of modern mechanisms for citations that result in access, not promises of access if you learn a secret code.

PS: I use Bleak House as a pejorative above but it is one of my favorite novels. Bear in mind that I also enjoy reading the Bluebook and the Chicago Manual of Style. 😉

Baby Blue’s Manual of Legal Citation [Public Review Ends 15 March 2016]

Tuesday, February 9th, 2016

The Baby Blue’s Manual of Legal Citation, is available for your review and comments:

The manuscript currently resides at The manuscript is created from an HTML source file. Transformations of this source file are available in PDF and Word formats. You may submit point edits by editing the html source (from which we will create a diff) or using Word with track changes enabled. You may also provide comments on the PDF or Word documents, or as free-form text. Comments may be submitted before March 15, 2016 to:

Carl Malamud
Public.Resource.Org, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472 USA

Comment early and often!

More to follow.

Harvard Law Library Readies Trove of Decisions for Digital Age

Thursday, October 29th, 2015

Harvard Law Library Readies Trove of Decisions for Digital Age by Erik Eckholm.

From the post:

Shelves of law books are an august symbol of legal practice, and no place, save the Library of Congress, can match the collection at Harvard’s Law School Library. Its trove includes nearly every state, federal, territorial and tribal judicial decision since colonial times — a priceless potential resource for everyone from legal scholars to defense lawyers trying to challenge a criminal conviction.

Now, in a digital-age sacrifice intended to serve grand intentions, the Harvard librarians are slicing off the spines of all but the rarest volumes and feeding some 40 million pages through a high-speed scanner. They are taking this once unthinkable step to create a complete, searchable database of American case law that will be offered free on the Internet, allowing instant retrieval of vital records that usually must be paid for.

“Improving access to justice is a priority,” said Martha Minow, dean of Harvard Law School, explaining why Harvard has embarked on the project. “We feel an obligation and an opportunity here to open up our resources to the public.”

While Harvard’s “Free the Law” project cannot put the lone defense lawyer or citizen on an equal footing with a deep-pocketed law firm, legal experts say, it can at least guarantee a floor of essential information. The project will also offer some sophisticated techniques for visualizing relations among cases and searching for themes.

Complete state results will become publicly available this fall for California and New York, and the entire library will be online in 2017, said Daniel Lewis, chief executive and co-founder of Ravel Law, a commercial start-up in California that has teamed up with Harvard Law for the project. The cases will be available at Ravel is paying millions of dollars to support the scanning. The cases will be accessible in a searchable format and, along with the texts, they will be presented with visual maps developed by the company, which graphically show the evolution through cases of a judicial concept and how each key decision is cited in others.

A very challenging dataset for capturing and mapping semantics!

If you think current legal language is confusing, strap on a couple of centuries of decisions plus legislation as the meaning of words and concepts morph.

Some people will search it as flatly as they do Google Ngrams and that will be reflected in the quality of their results.

Yet another dataset where sharing search trails with commentary would enrich the data with every visit. Less experienced searchers could follow the trails of more accomplished searchers.

Whether capturing and annotating search trails and other non-WestLaw/LexisNexis features will make it into user facing interfaces remains to be seen.

There is some truth to the Westlaw claim that “Core primary law is only the beginning…” but the more court data becomes available, the greater the chance for innovative tools.

Computational Legal Studies Blog

Friday, October 9th, 2015

Computational Legal Studies Blog by Daniel Katz, Mike Bommarito & Jon Zelner.

From the about page:

The Computational Legal Studies Blog was founded on March 17, 2009. The CLS Blog is an attempt to disseminate legal or law related studies that employ a computational or complex systems component. We hope this venue will serve as a coordinating device for those interested in using such techniques to consider the development of legal systems and/or implementation of more reasoned public policy.

It isn’t important that you believe in “…reasoned public policy” but that you realize a number of people do.

This site collects information and analysis that may be persuasive to “…reasoned public policy” types.

There are a large number of resources and if even a quarter of them are as good as this site, the time spent mining them will be well worth it.

Ping me if you see something extraordinary.


Attention Law Students: You Can Change the Way People Interact with the Law…

Friday, September 25th, 2015

Attention Law Students: You Can Change the Way People Interact with the Law…Even Without a J.D. by Katherine Anton.

From the post:

A lot of people go to law school hoping to change the world and make their mark on the legal field. What if we told you that you could accomplish that, even as a 1L?

Today we’re launching the WeCite contest: an opportunity for law students to become major trailblazers in the legal field. WeCite is a community effort to explain the relationship between judicial cases, and will be a driving force behind making the law free and understandable.

To get involved, all you have to do is go to and choose the treatment that best describes a newer case’s relationship with an older case. Law student contributors, as well as the top contributing schools, will be recognized and rewarded for their contributions to WeCite.

Read on to learn why WeCite will quickly become your new favorite pastime and how to get started!

Shepard’s Citations began publication in 1873 and, by modern times, had such an insurmountable lead that the cost of creating a competing service was a barrier to anyone else entering the field.

To be useful to lawyers, a citation index can’t index just some of the citations; it must index all of them.

The WeCite project, based on crowd-sourcing, is poised to demonstrate that creating a public law citation index is doable.

While the present project is focused on law students, I am hopeful that the project opens up for contributions from more senior survivors of law school, practicing or not.

U.S. Congressional Documents and Debates (1774-1875)

Tuesday, December 23rd, 2014

U.S. Congressional Documents and Debates (1774-1875) by Barbara Davis and Robert Brammer (law library specialists at the Library of Congress).

A video introduction to the website A Century of Lawmaking For a New Nation.

I know you are probably wondering why I would post on this resource considering that I just posted on finding popular topics for topic maps! 😉

Popularity, beyond social media popularity, is in the eye of the beholder. This sort of material would appeal to anyone who debates the “intent” of the original framers of the constitution, the American Enterprise Institute for example.

Justice Scalia would be another likely consumer of a topic map based on these materials. He advocates what Wikipedia calls “…textualism in statutory interpretation and originalism in constitutional interpretation.”

But anyone seeking to persuade Justice Scalia of their cause is another likely consumer for such a topic map. Or prospective law clerks for that matter. Tying this material to Scalia’s opinions and other writings would increase the value of such a map.

The topic mapping theory part would be fun but imagining Scalia solving the problem of other minds and discerning their intent over two hundred (200) years later would require more imagination than I can muster on most days.

US Congress OKs ‘unprecedented’ codification of warrantless surveillance

Tuesday, December 16th, 2014

US Congress OKs ‘unprecedented’ codification of warrantless surveillance by Lisa Vaas.

From the post:

Congress last week quietly passed a bill to reauthorize funding for intelligence agencies, over objections that it gives the government “virtually unlimited access to the communications of every American”, without warrant, and allows for indefinite storage of some intercepted material, including anything that’s “enciphered”.

That’s how it was summed up by Rep. Justin Amash, a Republican from Michigan, who pitched and lost a last-minute battle to kill the bill.

The bill is titled the Intelligence Authorization Act for Fiscal Year 2015.

Amash said that the bill was “rushed to the floor” of the house for a vote, following the Senate having passed a version with a new section – Section 309 – that the House had never considered.

Lisa reports that the bill codifies Executive Order 12333, a Ronald Reagan remnant from an earlier attempt to dismantle the United States Constitution.

There is a petition underway to ask President Obama to veto the bill. Are you a large bank? Skip the petition and give the President a call.

From Lisa’s report, it sounds like Congress needs a DEW Line for legislation:

Rep. Zoe Lofgren, a California Democrat who voted against the bill, told the National Journal that the Senate’s unanimous passage of the bill was sneaky and ensured that the House would rubberstamp it without looking too closely:

If this hadn’t been snuck in, I doubt it would have passed. A lot of members were not even aware that this new provision had been inserted last-minute. Had we been given an additional day, we may have stopped it.

How do you “sneak in” legislation in a public body?

Suggestions on an early warning system for changes to legislation between the two houses of Congress?
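One naive starting point, offered only as a sketch: compare the section numbers in the two chambers’ texts and flag any section that appears in one version but not the other. The `SEC. n.` heading pattern and the placeholder bill text below are assumptions; real bill texts would need sturdier parsing, and a renumbered or reworded section would slip past this check.

```python
import re

# Assumed heading style: a line starting with "SEC. <number>."
SECTION = re.compile(r"^SEC\. (\d+)\.", re.MULTILINE)

def new_sections(house_text, senate_text):
    """Return section numbers present in the Senate text but absent from the House text."""
    house = set(SECTION.findall(house_text))
    senate = set(SECTION.findall(senate_text))
    return sorted(senate - house, key=int)

# Placeholder texts; headings are invented for illustration only.
house_version = "SEC. 308. Placeholder heading.\nSEC. 310. Placeholder heading.\n"
senate_version = (
    "SEC. 308. Placeholder heading.\n"
    "SEC. 309. Placeholder heading.\n"
    "SEC. 310. Placeholder heading.\n"
)

print(new_sections(house_version, senate_version))
```

An alert that lists such sections to every member before a vote would at least remove “nobody knew it was there” as an explanation.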

Caselaw is Set Free, What Next? [Expanding navigation/search targets]

Thursday, November 6th, 2014

Caselaw is Set Free, What Next? by Thomas Bruce, Director, Legal Information Institute, Cornell.

Thomas provides a great history of Google Scholar’s caselaw efforts and its impact on the legal profession.

More importantly, at least to me, were his observations on how to go beyond the traditional indexing and linking in legal publications:

A trivial example may help. Right now, a full-text search for “tylenol” in the US Code of Federal Regulations will find… nothing. Mind you, Tylenol is regulated, but it’s regulated as “acetaminophen”. But if we link up the data here in Cornell’s CFR collection with data in the DrugBank pharmaceutical collection, we can automatically determine that the user needs to know about acetaminophen — and we can do that with any name-brand drug in which acetaminophen is a component. By classifying regulations using the same system that science librarians use to organize papers in agriculture, we can determine which scientific papers may form the rationale for particular regulations, and link the regulations to the papers that explain the underlying science. These techniques, informed by emerging approaches in natural-language processing and the Semantic Web, hold great promise.

All successful information-seeking processes permit the searcher to exchange something she already knows for something she wants to know. By using technology to vastly expand the number of things that can meaningfully and precisely be submitted for search, we can dramatically improve results for a wide swath of users. In our shop, we refer to this as the process of “getting from barking dog to nuisance”, an in-joke that centers around mapping a problem expressed in real-world terms to a legal concept. Making those mappings on a wide scale is a great challenge. If we had those mappings, we could answer a lot of everyday questions for a lot of people.

(emphasis added)

The first line I bolded in the quote:

All successful information-seeking processes permit the searcher to exchange something she already knows for something she wants to know.

captures the essence of a topic map. Yes? That is, a user navigates or queries a topic map on the basis of terms they already know. In so doing, they can find other terms that are interchangeable with theirs, but more importantly, if information is indexed using a different term than theirs, they can still find the information.

In traditional indexing systems (think of the Readers’ Guide to Periodical Literature or Library of Congress Subject Headings), some users learned those systems in order to become better searchers. Still an interchange of what you know for what you don’t know, but with a large front-end investment.

Thomas is positing a system like topic maps that enables users to navigate by the terms they already know to find information they don’t know.
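The “tylenol” example can be sketched in a few lines, assuming a hand-made table that records which names identify the same subject. The `SAME_SUBJECT` table, the document ids, and the sample text are all hypothetical; this is plain synonym expansion, far short of a full topic map or of whatever Cornell actually built.

```python
# Hypothetical table: the searcher's term mapped to the names used in the indexed text.
SAME_SUBJECT = {
    "tylenol": {"acetaminophen"},
}

def expand_query(term, mapping=SAME_SUBJECT):
    """Return the searcher's term plus every name recorded for the same subject."""
    term = term.lower()
    return {term} | mapping.get(term, set())

def search(term, documents, mapping=SAME_SUBJECT):
    """Return ids of documents mentioning the term or any name mapped to it."""
    wanted = expand_query(term, mapping)
    return [
        doc_id
        for doc_id, text in documents.items()
        if any(name in text.lower() for name in wanted)
    ]

# Placeholder documents, invented for illustration.
regulations = {
    "sec-a": "Labeling requirements for acetaminophen products.",
    "sec-b": "Labeling requirements for aspirin products.",
}

print(search("Tylenol", regulations))
```

The searcher trades a term they know for a document they did not know how to ask for, without first learning the regulator’s vocabulary.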

The second block of text I bolded:

Making those mappings on a wide scale is a great challenge. If we had those mappings, we could answer a lot of everyday questions for a lot of people.

Making wide scale mappings certainly is a challenge. In part because there are so many mappings to be made and so many different ways to make them. Not to mention that the mappings will evolve over time as usages change.

There is growing realization that indexing or linking data results in a very large pile of indexed or linked data. You can’t really navigate it unless or until you hit upon the correct terms to make the next link. We could try to teach everyone the correct terms but as more correct terms appear every day, that seems an unlikely solution. Thomas has the right of it when he suggests expanding the target of “correct” terms.

Topic maps are poised to help expand the target of “correct” terms, and to do so in such a way as to combine with other expanded targets of “correct” terms.

I first saw this in a tweet by Aaron Kirschenfeld.

Update: Tarlton Law Library (University of Texas at Austin) Legal Research Guide has a great page of tips and pointers on the Google Scholar caselaw collection. Bookmark this guide.

Introduction to Basic Legal Citation (online ed. 2014)

Sunday, November 2nd, 2014

Introduction to Basic Legal Citation (online ed. 2014) by Peter W. Martin.

From the post:

This work first appeared in 1993. It was most recently revised in the fall of 2014 following a thorough review of the actual citation practices of judges and lawyers, the relevant rules of appellate practice of federal and state courts, and the latest edition of the ALWD Guide to Legal Citation, released earlier in the year. As has been true of all editions released since 2010, it is indexed to both the ALWD guide and the nineteenth edition of The Bluebook. However, it also documents the many respects in which contemporary legal writing, very often following guidelines set out in court rules, diverges from the citation formats specified by those academic texts.

The content of this guide is also available in three different e-book formats: 1) a pdf version that can be printed out in whole or part and also used with hyperlink navigation on an iPad or other tablet, indeed, on any computer; 2) a version designed specifically for use on the full range of Kindles as well as other readers or apps using the Mobi format; and 3) a version in ePub format for the Nook and other readers or apps that work with it. To access any of them, click here. (Over 50,000 copies of the 2013 edition were downloaded.)

Since the guide is online, its further revision is not tied to a rigid publication cycle. Any user seeing a need for clarification, correction, or other improvement is encouraged to “speak up.” What doesn’t work, isn’t clear, is missing, appears to be in error? Has a change occurred in one of the fifty states that should be reported? Comments of these and other kinds can sent by email addressed to (Please include “Citation” in the subject line.) Many of the features and some of the coverage of this reference are the direct result of past user questions and advice.

A complementary series of video tutorials offers a quick start introduction to citation of the major categories of legal sources. They may also be useful for review. Currently, the following are available:

  1. Citing Judicial Opinions … in Brief (8.5 minutes)
  2. Citing Constitutional and Statutory Provisions … in Brief (14 minutes)
  3. Citing Agency Material … in Brief (12 minutes)

Finally, for those with an interest in current issues of citation practice, policy, and instruction, there is a companion blog, “Citing Legally,” at:

Obviously legal citations are identifiers but Peter helpfully expands on the uses of legal citations:

A reference properly written in “legal citation” strives to do at least three things, within limited space:

  • identify the document and document part to which the writer is referring
  • provide the reader with sufficient information to find the document or document part in the sources the reader has available (which may or may not be the same sources as those used by the writer), and
  • furnish important additional information about the referenced material and its connection to the writer’s argument to assist readers in deciding whether or not to pursue the reference.

I would quibble with Peter’s description of a legal citation “identif[ying] a document or document part,” in part because of his second point, that a reader can find an alternative source for the document.

To me it is easier to say that legal citation identifies a legal decision, legislation or agency decision/rule, which may be reported by any number of sources. Some sources have their own unique reference systems that are mapped to other systems. Making a legal decision, legislation or agency decision/rule an abstraction identified by the citation avoids confusion with a particular source.
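A small sketch of that abstraction, assuming each decision simply carries the set of citations that identify it: the `Decision` and `CitationIndex` names are mine, and the only data is the familiar set of parallel citations for Brown v. Board of Education.

```python
from dataclasses import dataclass, field

@dataclass
class Decision:
    """The abstract legal decision, independent of any one reporter."""
    name: str
    citations: set = field(default_factory=set)

class CitationIndex:
    """Resolve any known citation to the single decision it identifies."""

    def __init__(self):
        self._by_citation = {}

    def add(self, decision):
        for cite in decision.citations:
            self._by_citation[cite] = decision

    def resolve(self, cite):
        # Returns None when the citation is unknown.
        return self._by_citation.get(cite)

brown = Decision(
    "Brown v. Board of Education",
    {"347 U.S. 483", "74 S. Ct. 686", "98 L. Ed. 873"},
)
index = CitationIndex()
index.add(brown)

# Two different reporters, one decision.
print(index.resolve("347 U.S. 483") is index.resolve("74 S. Ct. 686"))
```

Whichever reporter the reader has on the shelf, the citation leads to the same subject.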

A must read for law students, practitioners, judges and potential inventors of the Nth citation system for legal materials.

Research topics in e-discovery

Monday, August 25th, 2014

Research topics in e-discovery by William Webber.

From the post:

Dr. Dave Lewis is visiting us in Melbourne on a short sabbatical, and yesterday he gave an interesting talk at RMIT University on research topics in e-discovery. We also had Dr. Paul Hunter, Principal Research Scientist at FTI Consulting, in the audience, as well as research academics from RMIT and the University of Melbourne, including Professor Mark Sanderson and Professor Tim Baldwin. The discussion amongst attendees was almost as interesting as the talk itself, and a number of suggestions for fruitful research were raised, many with fairly direct relevance to application development. I thought I’d capture some of these topics here:

E-discovery, if you don’t know, is found in civil litigation and government investigations. Think of it as hacking with rules as the purpose of e-discovery is to find information that supports your claims or defense. E-discovery is high stakes data mining that pays very well. Need I say more?

Webber lists the following research topics:

  1. Classification across heterogeneous document types
  2. Automatic detection of document types
  3. Faceted categorization
  4. Label propagation across related documents
  5. Identifying unclassifiable documents
  6. Identifying poor training examples
  7. Identifying significant fragments in non-significant text
  8. Routing of documents to specialized trainers
  9. Total cost of annotation

“Label propagation across related documents” looks like a natural for topic maps but searching over defined properties that identify subjects as opposed to opaque tokens would enhance the results for a number of these topics.
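For readers who have not met the idea, a deliberately naive sketch of label propagation follows: a reviewer’s label on one document is copied to every document reachable through “related” links, such as replies in the same email thread. The document ids, the meaning of “related,” and the rule that the first label to arrive wins are my assumptions, not Webber’s or Lewis’s method.

```python
from collections import defaultdict, deque

def propagate_labels(labels, related_pairs):
    """Copy each known label to unlabeled documents reachable via related links."""
    graph = defaultdict(set)
    for a, b in related_pairs:
        graph[a].add(b)
        graph[b].add(a)

    result = dict(labels)
    queue = deque(labels)  # start from the documents a reviewer already labeled
    while queue:
        doc = queue.popleft()
        for neighbor in graph[doc]:
            if neighbor not in result:  # never overwrite an existing label
                result[neighbor] = result[doc]
                queue.append(neighbor)
    return result

# Hypothetical review set: one labeled email and two unrelated threads.
reviewed = {"email-1": "responsive"}
threads = [("email-1", "email-2"), ("email-2", "email-3"), ("email-4", "email-5")]

print(propagate_labels(reviewed, threads))
```

Documents in the second thread stay unlabeled because nothing connects them to a reviewed document. Defining “related” by subject-identifying properties rather than opaque tokens is where topic maps would earn their keep.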

What You Thought The Supreme Court…

Sunday, June 15th, 2014

Clever piece of code exposes hidden changes to Supreme Court opinions by Jeff John Roberts.

From the post:

Supreme Court opinions are the law of the land, and so it’s a problem when the Justices change the words of the decisions without telling anyone. This happens on a regular basis, but fortunately a lawyer in Washington appears to have just found a solution.

The issue, as Adam Liptak explained in the New York Times, is that original statements by the Justices about everything from EPA policy to American Jewish communities, are disappearing from decisions — and being replaced by new language that says something entirely different. As you can imagine, this is a problem for lawyers, scholars, journalists and everyone else who relies on Supreme Court opinions.

Until now, the only way to detect when a decision has been altered is a pain-staking comparison of earlier and later copies — provided, of course, that someone knew a decision had been changed in the first place. Thanks to a simple Twitter tool, the process may become much easier.

See Jeff’s post for more details, including a Twitter account to follow for the discovery of changes in the opinions of the Supreme Court of the United States.

In a nutshell, the court issues “slip” opinions in cases they decide and then later, sometimes years later, they provide a small group of publishers of their opinions with changes to be made to those opinions.

Which means the opinion you read as a “slip” opinion or in an advance sheet (a paperback issue that is followed by a hard copy volume combining one or more advance sheets) may not be the opinion of record down the road.
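
Once you hold both versions of an opinion, the comparison itself is mechanically simple. The following Python sketch is only an illustration and not the code behind the Twitter tool; the function name `changed_lines` and the two sample texts are invented for the example. It relies on the standard library’s `difflib.ndiff`.

```python
import difflib

def changed_lines(slip_text, later_text):
    """Return (removed, added) lines between two versions of an opinion."""
    diff = difflib.ndiff(slip_text.splitlines(), later_text.splitlines())
    removed, added = [], []
    for line in diff:
        if line.startswith("- "):
            removed.append(line[2:])   # present only in the slip version
        elif line.startswith("+ "):
            added.append(line[2:])     # present only in the later version
    return removed, added

# Invented sample text, not taken from any real opinion.
slip = "The agency acted within its authority.\nThe judgment is affirmed."
later = "The agency acted within its statutory authority.\nThe judgment is affirmed."

removed, added = changed_lines(slip, later)
```

The hard part is not the diff but knowing that a second version exists and obtaining a copy of the first.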

Two questions occur to me immediately:

  1. We can distinguish the “slip” opinion version of an opinion from the “final” published opinion, but how do we distinguish a “final” published decision from a later “more final” published decision? Given the stakes at hand in proceedings before the Supreme Court, certainty about the prior opinions of the Court is very important.
  2. While the Supreme Court always gets most of the attention, it occurs to me that the same process of silent correction has been going on for other courts with published opinions, such as the United States Courts of Appeal and the United States District Courts. Perhaps for the last century or more.

    Which makes it only a small step to ask about state supreme courts and their courts of appeal. What is their record on silent correction of opinions?

There are mechanical difficulties the older the records become, because the “slip” opinions may be lost to history, but in terms of volume that would certainly be a “big data” project for legal informatics: to discover and document the behavior of courts over time with regard to silent correction of opinions.

What you thought the Supreme Court said may not be what our current record reflects. Who wins? What you heard or what a silently corrected record reports?

A crowdsourcing approach to building a legal ontology from text

Tuesday, May 27th, 2014

A crowdsourcing approach to building a legal ontology from text by Anatoly P. Getman and Volodymyr V. Karasiuk.


This article focuses on the problems of application of artificial intelligence to represent legal knowledge. The volume of legal knowledge used in practice is unusually large, and therefore the ontological knowledge representation is proposed to be used for semantic analysis, presentation and use of common vocabulary, and knowledge integration of problem domain. At the same time some features of legal knowledge representation in Ukraine have been taken into account. The software package has been developed to work with the ontology. The main features of the program complex, which has a Web-based interface and supports multi-user filling of the knowledge base, have been described. The crowdsourcing method is due to be used for filling the knowledge base of legal information. The success of this method is explained by the self-organization principle of information. However, as a result of such collective work a number of errors are identified, which are distributed throughout the structure of the ontology. The results of application of this program complex are discussed in the end of the article and the ways of improvement of the considered technique are planned.

I am curious how you would compare this attempt to extract an ontology from legal texts to the efforts in the 1960s and 1970s to extract logic from the United States Internal Revenue Code. Apologies, but my undergraduate notes aren’t accessible, so I can’t give you article titles and citations.

If you do dig out some of that literature, pointers would be appreciated. As I recall, capturing the “logic” of those passages was fraught with difficulty.

Annotating, Extracting, and Linking Legal Information

Sunday, April 20th, 2014

Annotating, Extracting, and Linking Legal Information by Adam Wyner. (slides)

Great slides, provided you have enough background in the area to fill in the gaps.

I first saw this at: Wyner: Annotating, Extracting, and Linking Legal Information, which has collected up the links/resources mentioned in the slides.

Despite decades of electronic efforts and several centuries of manual effort before that, legal information retrieval remains an open challenge.

Placement of Citations [Discontinuity and Users]

Friday, April 11th, 2014

If the Judge Will Be Reading My Brief on a Screen, Where Should I Place My Citations? by Peter W. Martin.

From the post:

In a prior post I explored how the transformation of case law to linked electronic data undercut Brian Garner’s longstanding argument that judges should place their citations in footnotes. As that post promised, I’ll now turn to Garner’s position as it applies to writing that lawyers prepare for judicial readers.

Implicitly, Garner’s position assumes a printed page, with footnote calls embedded in the text and the related notes placed at the bottom. In print that entirety is visible at once. The eyes must move, but both call and footnote remain within a single field of vision. Secondly, when the citation sits inert on a printed page and the cited source is online, the decision to inspect that source and when to do so is inevitably influenced by the significant discontinuity that transaction will entail. In print, citation placement contributes little to that discontinuity. The situation is altered – significantly, it seems to me – when a brief or memorandum is submitted electronically and will most likely be read from a screen. In 2014 that is the case with a great deal of litigation.

This is NOT a discussion of interest only to lawyers and judges.

While Peter has framed the issue in terms of contrasting styles of citation, as he also points out, there is a question of “discontinuity” and I would argue comprehension for the reader in these styles.

At first blush, if you are a regular hypertext maven, you may think that inline citations are “the way to go” on this citation issue.

To some degree I would agree with you, but leaving the current display to consult a citation, or other material that could appear in a footnote, introduces another form of discontinuity.

You are no longer reading a brief prepared by someone familiar with the law and facts at hand but someone who is relying on different facts and perhaps even a different legal context for their statements.

If you are a regular reader of hypertexts, try writing down the opinion of one author on a note card, follow a hyperlink in that post to another resource, record the second author’s opinion on the same subject on a second note card and then follow a link from the second resource to a third and repeat the note card opinion recording. Set all three cards aside, with no marks to associate them with a particular author.

After two (2) days return to the cards and see if you can distinguish the card you made for the first author from the next two.

Yes, after a very short while you are unable to identify the exact source of information that you were trying to remember. Now imagine that in a legal context where facts and/or law are in dispute. Exactly how much “other” content do you want to display with your inline reference?

The same issue comes up for topic map interfaces. Do you really want to display all the information on a subject or do you want to present the user with a quick overview and enable them to choose greater depth?

Personally I would use citations with pop-ups that contain a summary of the cited authority, with a link to the fuller resource. So a judge could quickly confirm their understanding of a case without waiting for resources to load, etc.

But in any event, how much visual or cognitive discontinuity your interface is inflicting on users is an important issue.


Monday, March 17th, 2014

ACTUS (Algorithmic Contract Types Unified Standards)

From the webpage:

The Alfred P. Sloan Foundation awarded Stevens Institute of Technology a grant to work on the proposal entitled “Creating a standard language for financial contracts and a contract-centric analytical framework”. The standard follows the theoretical groundwork laid down in the book “Unified Financial Analysis” (1) – UFA. The goal of this project is to build a financial instrument reference database that represents virtually all financial contracts as algorithms that link changes in risk factors (market risk, credit risk, and behavior, etc.) to cash flow obligations of financial contracts. This reference database will be the technological core of a future open source community that will maintain and evolve standardized financial contract representations for the use of regulators, risk managers, and researchers.

The objective of the project is to develop a set of about 30 unique contract types (CT’s) that represent virtually all existing financial contracts and which generate state contingent cash flows at a high level of precision. The term of art that describes the impact of changes in the risk factors on the cash flow obligations of a financial contract is called “state contingent cash flows,” which are the key input to virtually all financial analysis including models that assess financial risk.

1- Willi Brammertz, Ioannis Akkizidis, Wolfgang Breymann, Rami Entin, Marco Rustmann; Unified Financial Analysis – The Missing Links of Finance, Wiley 2009.

This will help with people who are not cheating in the financial markets.

After the revelations of the past couple of years, any guesses on the statistics of non-cheating members of the financial community?


Even if these are used by non-cheaters, we know that the semantics are going to vary from user to user.

The real questions are: 1) How will we detect semantic divergence? and 2) How much semantic divergence can be tolerated?

I first saw this in a tweet by Stefano Bertolo.

Cataloguing projects

Tuesday, March 11th, 2014

Cataloguing projects (UK National Archive)

From the webpage:

The National Archives’ Cataloguing Strategy

The overall objective of our cataloguing work is to deliver more comprehensive and searchable catalogues, thus improving access to public records. To make online searches work well we need to provide adequate data and prioritise cataloguing work that tackles less adequate descriptions. For example, we regard ranges of abbreviated names or file numbers as inadequate.

I was led to this delightful resource by a tweet from David Underdown, advising that his presentation from National Catalogue Day in 2013 was now online.

His presentation along with several others and reports about projects in prior years are available at this projects page.

I thought the presentation titled: Opening up of Litigation: 1385-1875 by Amanda Bevan and David Foster, was quite interesting in light of various projects that want to create new “public” citation systems for law and litigation.

I haven’t seen such a proposal yet that gives sufficient consideration to the enormity of the question: what do you do with old legal materials?

The litigation presentation could be a poster child for topic maps.

I am looking forward to reading the other presentations as well.

The FIRST Act, Retro Legislation?

Tuesday, March 11th, 2014

Language in FIRST act puts United States at Severe Disadvantage Against International Competitors by Ranit Schmelzer.

From the press release:

The Scholarly Publishing and Academic Research Coalition (SPARC), an international alliance of nearly 800 academic and research libraries, today announced its opposition to Section 303 of H.R. 4186, the Frontiers in Innovation, Research, Science and Technology (FIRST) Act. This provision would impose significant barriers to the public’s ability to access the results of taxpayer-funded research.

Section 303 of the bill would undercut the ability of federal agencies to effectively implement the widely supported White House Directive on Public Access to the Results of Federally Funded Research and undermine the successful public access program pioneered by the National Institutes of Health (NIH) – recently expanded through the FY14 Omnibus Appropriations Act to include the Departments Labor, Education and Health and Human Services. Adoption of Section 303 would be a step backward from existing federal policy in the directive, and put the U.S. at a severe disadvantage among our global competitors.

“This provision is not in the best interests of the taxpayers who fund scientific research, the scientists who use it to accelerate scientific progress, the teachers and students who rely on it for a high-quality education, and the thousands of U.S. businesses who depend on public access to stay competitive in the global marketplace,” said Heather Joseph, SPARC Executive Director. “We will continue to work with the many bipartisan members of the Congress who support open access to publicly funded research to improve the bill.”

[the parade of horribles follows]

SPARC’s press release never quotes a word from H.R. 4186. Not one. Commentary but nary a part of its object.

I searched at Thomas (the Congressional information service at the Library of Congress) for H.R. 4186 and came up empty by bill number. Switching to the Congressional Record for Monday, March 10, 2014, I did find the bill being introduced and the setting of a hearing on it. The GPO has not (as of today) posted the text of H.R. 4186, but when it does, follow this link: H.R. 4186.

Even more importantly, SPARC doesn’t point out who is responsible for the objectionable section appearing in the bill. Bills don’t write themselves and as far as I know, Congress doesn’t have a random bill generator.

The bottom line is that someone, an identifiable someone, asked for longer embargo wording to be included. If the SPARC press release is accurate, the most likely someones asked are Chairman Lamar Smith (R-TX 21st District) or Rep. Larry Bucshon (R-IN 8th District).

Bucshon represents Indiana’s 8th Congressional District, which covers the southwestern corner of the state, including Evansville. You might want to check Bucshon’s page at Wikipedia and links there to other resources.

Wikipedia on the 21st Congressional District of Texas places it north of San Antonio, the seventh largest city in the United States. Lamar Smith’s page at Wikipedia has some interesting reading.

Odds are in and around Evansville and San Antonio there are people interested in longer embargo periods on federally funded research.

Those are at least some starting points for effective opposition to this legislation, assuming it was reported accurately by SPARC. Let’s drop the pose of disinterested legislators trying valiantly to serve the public good. Not impossible, just highly unlikely. Let’s argue about who is getting paid and for what benefits.

Or as Captain Ahab advises:

All visible objects, man, are but as pasteboard masks. But in each event –in the living act, the undoubted deed –there, some unknown but still reasoning thing puts forth the mouldings of its features from behind the unreasoning mask. If man will strike, strike through the mask! [Melville, Moby Dick, Chapter XXXVI]

Legislation as a “pasteboard mask” is a useful image. There is not a contour, dimple, shade or expression that wasn’t bought and paid for by someone. You have to strike through the mask to discover who.

Are you game?

PS: Curious, where would you go next (data wise, I don’t have the energy to lurk in garages) in terms of searching for the buyers of longer embargoes in H.R. 4186?

Making the meaning of contracts visible…

Sunday, February 23rd, 2014

Making the meaning of contracts visible – Automating contract visualization by Stefania Passera, Helena Haapio, Michael Curtotti.


The paper, co-authored by Passera, Haapio and Curtotti, presents three demos of tools to automatically generate visualizations of selected contract clauses. Our early prototypes include common types of term and termination, payment and liquidated damages clauses. These examples provide proof-of-concept demonstration tools that help contract writers present content in a way readers pay attention to and understand. These results point to the possibility of document assembly engines compiling an entirely new genre of contracts, more user-friendly and transparent for readers and not too challenging to produce for lawyers.



From slides 2 and 3:

Need for information to be accessible, transparent, clear and easy to understand
   Contracts are no exception.

Benefits of visualization

  • Information encoded explicitly is easier to grasp & share
  • Integrating pictures & text prevents cognitive overload by distributing effort on 2 different processing systems
  • Visual structures and cues act as paralanguage, reducing the possibility of misinterpretation

Sounds like the output from a topic map doesn’t it?

A contract is “explicit and transparent” to a lawyer, but that doesn’t mean everyone reading it sees the contract as “explicit and transparent.”

Making what the lawyer “sees” explicit, in other words, is another identification of the same subject, just a different way to describe it.

What’s refreshing is the recognition that not everyone understands the same description, hence the need for alternative descriptions.

Some additional leads to explore on these authors:

Stefania Passera Homepage with pointers to her work.

Helena Haapio Profile at Lexpert, pointers to her work.

Michael Curtotti – Computational Tools for Reading and Writing Law.

There is a growing interest in making the law transparent to non-lawyers, which is going to require a lot more than “this is the equivalent of that, because I say so.” Particularly for re-use of prior mappings.

Looks like a rapid growth area for topic maps to me.


I first saw this at: Passera, Haapio and Curtotti: Making the meaning of contracts visible – Automating contract visualization.

Identifying Case Law

Wednesday, January 29th, 2014

Costs of the (Increasingly) Lengthy Path to U.S. Report Pagination by Peter W. Martin.

If you are not familiar with the U.S. Supreme Court, the thumbnail sketch is that the court publishes its opinions without official page numbers and they remain that way for years. When the final printed version appears, all the cases citing a case without official page numbers have to be updated. Oh joy! 😉

Peter does a great job illustrating the costs of this approach.

From the post:

On May 17, 2010, the U.S. Supreme Court decided United States v. Comstock, holding that Congress had power under the Necessary and Proper Clause of the U.S. Constitution to authorize civil commitment of a mentally ill, sexually dangerous federal prisoner beyond his release date. (18 U.S.C. § 4248). Three and a half years later, the Court communicated the Comstock decision’s citation pagination with the shipment of the “preliminary print” of Part 1 of volume 560 of the United States Reports. That paperbound publication was logged into the Cornell Law Library on January 3 of this year. (According to the Court’s web site the final bound volume shouldn’t be expected for another year.) United States v. Comstock, appears in that volume at page 126, allowing the full case finally to be cited: United States v. Comstock, 560 U.S. 126 (2010) and specific portions of the majority, concurring and dissenting opinions to be cited by means of official page numbers.

This lag between opinion release and attachment of official volume and page numbers along the slow march to a final bound volume has grown in recent years, most likely as a result of tighter budgets at the Court and the Government Printing Office. Less than two years separated the end of the Court’s term in 2001 and our library’s receipt of the bound volume containing its last decisions. By 2006, five years later, the gap had widened to a full three years. Volume 554 containing the last decisions from the term ending in 2008 didn’t arrive until July 9 of last year. That amounts to nearly five years of delay.

If the printed volumes of the Court’s decisions served solely an archival function, this increasingly tardy path to print would warrant little concern or comment. But because the Court provides no means other than volume and page numbers to cite its decisions and their constituent parts, the increasing delays cast a widening ripple of costs on the federal judiciary, the services that distribute case law, and the many who need to cite it.

The nature of those costs can be illustrated using the Comstock case itself.

Besides detailing the costs of delayed formal citation, Peter’s analysis applies equally to other interim identifiers, for example the multiple gene names that precede any attempt at an official name.

What happens to all the literature that was published using the “interim” names?

Yes, we can map between them or create synonym tables, but who knows on what basis we created those tables or mappings?
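
One way to keep the basis from getting lost is to store it next to each mapping. The Python sketch below is only an assumption about how such a table might look; the `Mapping` class, its field names, the `resolve` function, and the `recorded_by` value are hypothetical. The Comstock citation and the source of its pagination come from Peter’s post quoted above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mapping:
    interim: str       # identifier used before the official one existed
    official: str      # identifier of record
    basis: str         # why we believe both identify the same subject
    recorded_by: str   # who asserted the mapping (hypothetical field)

def resolve(identifier, mappings):
    """Return (official identifier, basis); unknown identifiers pass through."""
    for m in mappings:
        if m.interim == identifier:
            return m.official, m.basis
    return identifier, None

mappings = [
    Mapping(
        interim="United States v. Comstock (U.S. May 17, 2010)",
        official="United States v. Comstock, 560 U.S. 126 (2010)",
        basis="pagination from the preliminary print of Part 1 of volume 560, United States Reports",
        recorded_by="example editor",
    ),
]

official, basis = resolve("United States v. Comstock (U.S. May 17, 2010)", mappings)
```

A plain two-column synonym table would resolve the identifier just as well; what it cannot do is tell a later reader whether to trust the mapping.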

Legal citations aren’t changing rapidly, but the fact they are changing at all is fairly remarkable. Taken as lessons in the management of identifiers, it is an area to watch closely.

Is Link Rot Destroying Stare Decisis…

Monday, December 30th, 2013

Is Link Rot Destroying Stare Decisis as We Know It? The Internet-Citation Practice of the Texas Appellate Courts by Arturo Torres (Journal of Appellate Practice and Process, Vol 13, No. 2, Fall 2012 )


In 1995 the first Internet-based citation was used in a federal court opinion. In 1996, a state appellate court followed suit; one month later, a member of the United States Supreme Court cited to the Internet; finally, in 1998 a Texas appellate court cited to the Internet in one of its opinions. In less than twenty years, it has become common to find appellate courts citing to Internet-based resources in opinions. Because of the current extent of Internet-citation practice varies by courts across jurisdictions, this paper will examine the Internet-citation practice of the Texas Appellate courts since 1998. Specifically, this study surveys the 1998 to 2011 published opinions of the Texas appellate courts and describes their Internet-citation practice.

A study that confirms what was found in …Link and Reference Rot in Legal Citations for the Harvard Law Review and the U.S. Supreme Court.

Curious that West Key Numbers remain viable after more than a century of use (manual or electronic resolution) whereas Internet citations expire over the course of a few years.

What do you think is the difference in those citations, West Key Numbers versus URLs, that accounts for one being viable and the other only ephemerally so?

Cross-categorization of legal concepts…

Tuesday, December 17th, 2013

Cross-categorization of legal concepts across boundaries of legal systems: in consideration of inferential links by Fumiko Kano Glückstad, Tue Herlau, Mikkel N. Schmidt, Morten Mørup.


This work contrasts Giovanni Sartor’s view of inferential semantics of legal concepts (Sartor in Artif Intell Law 17:217–251, 2009) with a probabilistic model of theory formation (Kemp et al. in Cognition 114:165–196, 2010). The work further explores possibilities of implementing Kemp’s probabilistic model of theory formation in the context of mapping legal concepts between two individual legal systems. For implementing the legal concept mapping, we propose a cross-categorization approach that combines three mathematical models: the Bayesian Model of Generalization (BMG; Tenenbaum and Griffiths in Behav Brain Sci 4:629–640, 2001), the probabilistic model of theory formation, i.e., the Infinite Relational Model (IRM) first introduced by Kemp et al. (The twenty-first national conference on artificial intelligence, 2006, Cognition 114:165–196, 2010) and its extended model, i.e., the normal-IRM (n-IRM) proposed by Herlau et al. (IEEE International Workshop on Machine Learning for Signal Processing, 2012). We apply our cross-categorization approach to datasets where legal concepts related to educational systems are respectively defined by the Japanese- and the Danish authorities according to the International Standard Classification of Education. The main contribution of this work is the proposal of a conceptual framework of the cross-categorization approach that, inspired by Sartor (Artif Intell Law 17:217–251, 2009), attempts to explain reasoner’s inferential mechanisms.

From the introduction:

An ontology is traditionally considered as a means for standardizing knowledge represented by different parties involved in communications (Gruber 1992; Masolo et al. 2003; Declerck et al. 2010). Kemp et al. (2010) also points out that some scholars (Block 1986; Field 1977; Quilian 1968) have argued the importance of knowledge structuring, i.e., ontologies, where concepts are organized into systems of relations and the meaning of a concept partly depends on its relationships to other concepts. However, real human to human communication cannot be absolutely characterized by such standardized representations of knowledge. In Kemp et al. (2010), two challenging issues are raised against such idea of systems of concepts. First, as Fodor and Lepore (1992) originally pointed out, it is beyond comprehension that the meaning of any concept can be defined within a standardized single conceptual system. It is unrealistic that two individuals with different beliefs have common concepts. This issue has also been discussed in semiotics (Peirce 2010; Durst-Andersen 2011) and in cognitive pragmatics (Sperber and Wilson 1986). For example, Sperber and Wilson (1986) discuss how mental representations are constructed diversely under different environmental and cognitive conditions. A second point which Kemp et al. (2010) specifically address in their framework is the concept acquisition problem. According to Kemp et al. (2010; see also: Hempel (1985), Woodfield (1987)):

if the meaning of each concept depends on its role within a system of concepts, it is difficult to see how a learner might break into the system and acquire the concepts that it contains. (Kemp et al. 2010)

Interestingly, the similar issue is also discussed by legal information scientists. Sartor (2009) argues that:

legal concepts are typically encountered in the context of legal norms, and the issue of determining their content cannot be separated from the issue of identifying and interpreting the norms in which they occur, and of using such norms in legal inference. (Sartor 2009)

This argument implies that if two individuals who are respectively belonging to two different societies having different legal systems, they might interpret a legal term differently, since the norms in which the two individuals belong are not identical. The argument also implies that these two individuals must have difficulties in learning a concept contained in the other party’s legal system without interpreting the norms in which the concept occurs.

These arguments motivate us to contrast the theoretical work presented by Sartor (2009) with the probabilistic model of theory formation by Kemp et al. (2010) in the context of mapping legal concepts between two individual legal systems. Although Sartor’s view addresses the inferential mechanisms within a single legal system, we argue that his view is applicable in a situation where a concept learner (reasoner) is, based on the norms belonging to his or her own legal system, going to interpret and adapt a new concept introduced from another legal system. In Sartor (2009), the meaning of a legal term results from the set of inferential links. The inferential links are defined based on the theory of Ross (1957) as:

  1. the links stating what conditions determine the qualification Q (Q-conditioning links), and
  2. the links connecting further properties to possession of the qualification Q (Q-conditioned links.) (Sartor 2009)

These definitions can be seen as causes and effects in Kemp et al. (2010). If a reasoner is learning a new legal concept in his or her own legal system, the reasoner is supposed to seek causes and effects identified in the new concept that are common to the concepts which the reasoner already knows. This way, common-causes and common-effects existing within a concept system, i.e., underlying relationships among domain concepts, are identified by a reasoner. The probabilistic model in Kemp et al. (2010) is supposed to learn these underlying relationships among domain concepts and identify a system of legal concepts from a view where a reasoner acquires new concepts in contrast to the concepts already known by the reasoner.

Pardon the long quote but the paper is pay-per-view.

I haven’t started to run down all the references but this is an interesting piece of work.

I was most impressed by the partial echoing of the topic map paradigm that: “meaning of each concept depends on its role within a system of concepts….”

True, a topic map can capture only “surface” facts and relationships between those facts but that merits a comment on a topic map instance and not topic maps in general.

Noting that you also shouldn’t pay for more topic map than you need. If all you need is a flat mapping between DHS and, say, the CIA, then doing no more than mapping terms is sufficient. If you need a maintainable and robust mapping, different techniques would be called for. Both results would be topic maps, but one of them would be far more useful.

One of the principal sources relied upon by the authors is: The Nature of Legal Concepts: Inferential Nodes or Ontological Categories? by Giovanni Sartor.

I don’t see any difficulty with Sartor’s rules of inference, any more than saying if a topic has X property (occurrence in TMDM speak), then of necessity it must have properties E, F, and G.
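
That kind of rule is easy to sketch. The Python below is a minimal illustration under my own assumptions, not Sartor’s formalism and not anything defined by the TMDM: properties are plain strings, a rule is a pair of sets (conditions, consequents), and the function name `apply_rules` is invented for the example.

```python
def apply_rules(properties, rules):
    """Forward-chain: add the consequents of every rule whose conditions all hold."""
    result = set(properties)
    changed = True
    while changed:
        changed = False
        for conditions, consequents in rules:
            if conditions <= result and not consequents <= result:
                result |= consequents
                changed = True
    return result

# "If a topic has property X, then of necessity it has E, F, and G."
rules = [(frozenset({"X"}), frozenset({"E", "F", "G"}))]

with_x = apply_rules({"X"}, rules)
without_x = apply_rules({"Y"}, rules)
```

The loop repeats until nothing new is added, so rules whose consequents trigger other rules are also followed.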

Where I would urge caution is with the notion that properties of a legal concept spring from a legal text alone. Or even from a legal ontology. In part because two people in the same legal system can read the same legal text and/or use the same legal ontology and expect to see different properties for a legal concept.

Consider the text of Paradise Lost by John Milton. If Stanley Fish, a noted Milton scholar, were to assign properties to the concepts in Book 1, his list of properties would be quite different from my list of properties. Same words, same text, but very different property lists.

To refine what I said about the topic map paradigm a bit earlier, it should read: “meaning of each concept depends on its role within a system of concepts [and the view of its hearer/reader]….”

The hearer/reader being the paramount consideration. Without a hearer/reader, there is no concept or system of concepts or properties of either one for comparison.

When topics are merged, there is a collecting of properties, some of which you may recognize and some of which I may recognize, as identifying some concept or subject.

No guarantees but better than repeating your term for a concept over and over again, each time in a louder voice. 😉
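
The collecting of properties on merge can be sketched as follows. This is a deliberately simplified Python illustration and not a conformant implementation of the TMDM merging rules: it assumes a topic is just a set of identifiers plus a set of properties, and that two topics merge whenever they share any identifier. The function name and sample data are invented.

```python
def merge_topics(topics):
    """Merge topics that share an identifier, taking the union of their properties."""
    merged = []
    for topic in topics:
        ids = set(topic["identifiers"])
        props = set(topic["properties"])
        remaining = []
        for m in merged:
            if m["identifiers"] & ids:
                # Same subject: absorb the earlier topic.
                ids |= m["identifiers"]
                props |= m["properties"]
            else:
                remaining.append(m)
        remaining.append({"identifiers": ids, "properties": props})
        merged = remaining
    return merged

# Hypothetical topics: yours and mine share one identifier, the third does not.
topics = [
    {"identifiers": {"id:contract"}, "properties": {"your term: agreement"}},
    {"identifiers": {"id:contract", "id:k"}, "properties": {"my term: bargain"}},
    {"identifiers": {"id:tort"}, "properties": {"civil wrong"}},
]
result = merge_topics(topics)
```

After the merge, a reader who recognizes only “agreement” and a reader who recognizes only “bargain” both arrive at the same topic.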

Everything is Editorial:..

Saturday, December 14th, 2013

Everything is Editorial: Why Algorithms are Hand-Made, Human, and Not Just For Search Anymore by Aaron Kirschenfeld.

From the post:

Down here in Durham, NC, we have artisanal everything: bread, cheese, pizza, peanut butter, and of course coffee, coffee, and more coffee. It’s great—fantastic food and coffee, that is, and there is no doubt some psychological kick from knowing that it’s been made carefully by skilled craftspeople for my enjoyment. The old ways are better, at least until they’re co-opted by major multinational corporations.

Aside from making you either hungry or jealous, or perhaps both, why am I talking about fancy foodstuffs on a blog about legal information? It’s because I’d like to argue that algorithms are not computerized, unknowable, mysterious things—they are produced by people, often painstakingly, with a great deal of care. Food metaphors abound, helpfully I think. Algorithms are the “special sauce” of many online research services. They are sets of instructions to be followed and completed, leading to a final product, just like a recipe. Above all, they are the stuff of life for the research systems of the near future.

Human Mediation Never Went Away

When we talk about algorithms in the research community, we are generally talking about search or information retrieval (IR) algorithms. A recent and fascinating VoxPopuLII post by Qiang Lu and Jack Conrad, “Next Generation Legal Search – It’s Already Here,” discusses how these algorithms have become more complicated by considering factors beyond document-based, topical relevance. But I’d like to step back for a moment and head into the past for a bit to talk about the beginnings of search, and the framework that we have viewed it within for the past half-century.

Many early information-retrieval systems worked like this: a researcher would come to you, the information professional, with an information need, that vague and negotiable idea which you would try to reduce to a single question or set of questions. With your understanding of Boolean search techniques and your knowledge of how the document corpus you were searching was indexed, you would then craft a search for the computer to run. Several hours later, when the search was finished, you would be presented with a list of results, sometimes ranked in order of relevance and limited in size because of a lack of computing power. Presumably you would then share these results with the researcher, or perhaps just turn over the relevant documents and send him on his way. In the academic literature, this was called “delegated search,” and it formed the background for the most influential information retrieval studies and research projects for many years—the Cranfield Experiments. See also “On the History of Evaluation in IR” by Stephen Robertson (2008).

In this system, literally everything—the document corpus, the index, the query, and the results—were mediated. There was a medium, a middle-man. The dream was to some day dis-intermediate, which does not mean to exhume the body of the dead news industry. (I feel entitled to this terrible joke as a former journalist… please forgive me.) When the World Wide Web and its ever-expanding document corpus came on the scene, many thought that search engines—huge algorithms, basically—would remove any barrier between the searcher and the information she sought. This is “end-user” search, and as algorithms improved, so too would the system, without requiring the searcher to possess any special skills. The searcher would plug a query, any query, into the search box, and the algorithm would present a ranked list of results, high on both recall and precision. Now, the lack of human attention, evidenced by the fact that few people ever look below result 3 on the list, became the limiting factor, instead of the lack of computing power.

The only problem with this is that search engines did not remove the middle-man—they became the middle-man. Why? Because everything, whether we like it or not, is editorial, especially in reference or information retrieval. Everything, every decision, every step in the algorithm, everything everywhere, involves choice. Search engines, then, are never neutral. They embody the priorities of the people who created them and, as search logs are analyzed and incorporated, of the people who use them. It is in these senses that algorithms are inherently human.

A delightful piece on search algorithms that touches at the heart of successful search and/or data integration.

Its first three words capture the issue: Everything is Editorial….

Despite the pretensions of scholars, sages and rogues, everything is editorial; there are no universal semantic primitives.

For convenience in data processing we may choose to treat some tokens as semantic primitives, but that is always a choice that we make.

Once you make that leap, it comes as no surprise that owl:sameAs wasn’t used the same way by everyone who used it.

See: When owl:sameAs isn’t the Same: An Analysis of Identity Links on the Semantic Web by Harry Halpin, Ivan Herman, and Patrick J. Hayes, for one take on the confusion around owl:sameAs.

If you are interested in moving beyond opaque keyword searching, consider Aaron’s post carefully.

…Link and Reference Rot in Legal Citations

Tuesday, September 24th, 2013

Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations by Jonathan Zittrain, Kendra Albert, Lawrence Lessig.


We document a serious problem of reference rot: more than 70% of the URLs within the Harvard Law Review and other journals, and 50% of the URLs found within U.S. Supreme Court opinions do not link to the originally cited information.

Given that, we propose a solution for authors and editors of new scholarship that involves libraries undertaking the distributed, long-term preservation of link contents.

Imagine trying to use a phone book where 70% of the addresses were wrong.

Or you are looking for your property deed and learn that only 50% of the references are correct.

Do those sound like acceptable situations?

Considering the Harvard Law Review and the U.S. Supreme Court put a good deal of effort into correct citations, the fate of the rest of the web must be far worse.

The about page for Perma reports:

Any author can go to the Perma website and input a URL. Perma downloads the material at that URL and gives back a new URL (a “Perma link”) that can then be inserted in a paper.

After the paper has been submitted to a journal, the journal staff checks that the provided link actually represents the cited material. If it does, the staff “vests” the link and it is forever preserved. Links that are not “vested” will be preserved for two years, at which point the author will have the option to renew the link for another two years.

Readers who encounter links can click on them like ordinary URLs. This takes them to the site where they are presented with a page that has links both to the original web source (along with some information, including the date of the link’s creation) and to the archived version stored by Perma.

I would caution that “forever” is a very long time.

What happens to the binding between an identifier and a URL when URLs are replaced by another network protocol?

After all the change over the history of the Internet, you don’t believe the current protocols will last “forever,” do you?

A more robust solution would divorce identifiers/citations from any particular network protocol, whether you think it will last forever or not.

That separation of identifier from network protocol preserves the possibility of an online database such as Perma, but also databases that have local caches of the citations and associated content, databases that point to multiple locations for associated content, and databases that support currently unknown protocols to access content associated with an identifier.

Just as a database of citations from Codex Justinianus could point to the latest printed Latin text, online versions or future versions.
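One way to picture that separation is the Python sketch below. It is not how Perma works; the `Citation` class, the `resolve` function, the locator schemes (`cache`, `futureproto`), and the example.org URL are all assumptions made up for illustration. The citation string is the identifier, and the places to find the content are just a replaceable list hanging off it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Citation:
    identifier: str                                    # protocol-free citation string
    locators: List[str] = field(default_factory=list)  # any number of places to look


# A fetcher takes the part of a locator after "scheme:" and returns content or None.
Fetcher = Callable[[str], Optional[str]]


def resolve(citation: Citation, fetchers: Dict[str, Fetcher]) -> Optional[str]:
    """Try each locator in order; skip schemes we cannot handle (yet)."""
    for locator in citation.locators:
        scheme, _, rest = locator.partition(":")
        fetch = fetchers.get(scheme)
        if fetch is None:
            continue  # unknown protocol today, perhaps supported tomorrow
        content = fetch(rest)
        if content is not None:
            return content
    return None


local_cache = {"22us1": "Gibbons v. Ogden (text held locally)"}
fetchers: Dict[str, Fetcher] = {
    "http": lambda rest: None,   # stands in for a rotted link; no network I/O here
    "cache": local_cache.get,    # a local copy keyed by a hypothetical cache id
}

cite = Citation("22 U.S. 1",
                ["http://example.org/22us1", "cache:22us1", "futureproto:22us1"])
print(resolve(cite, fetchers))  # Gibbons v. Ogden (text held locally)
```

When a protocol dies, its locators simply stop resolving; the identifier and the other locators are untouched.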

Citations can become permanent identifiers if they don’t rely on a particular network addressing system.

Court Listener

Tuesday, September 24th, 2013

Court Listener

From the about page:

Started as a part-time hobby in 2010, CourtListener is now a core project of the Free Law Project, a California Non-Profit corporation. The goal of the site is to provide powerful free legal tools for everybody while giving away all our data in bulk downloads.

We collect legal opinions from court websites and from data donations, and are aiming to have the best, most complete data on the open Web within the next couple years. We are slowly expanding to provide search and awareness tools for as many state courts as possible, and we already have tools for all of the Federal Appeals Courts. For more details on which jurisdictions we support, see our coverage page. If you’re able to help us acquire more cases, please get in touch.

This rather remarkable site has collected 905,842 court opinions as of September 24, 2013.

The default listing of cases is newest first but you can choose oldest first, most/least cited first and keyword relevance. Changing the listing order becomes interesting once you perform a keyword search (top search bar). The refinement (left hand side) works quite well, except that I could not filter search results by a judge’s name. On case names, separate the parties with “v.” as “vs” doesn’t work.

It is also possible to discover examples of changing legal terminology that impact your search results.

For example, try searching for the keyword phrase, “interstate commerce.” Now choose “Oldest first.” You will see Price v. Ralston (1790) and the next case is Crandall v. State of Nevada (1868). Hmmm, what happened to the early interstate commerce cases under John Marshall?

OK, so try “commerce.” Now set to “Oldest first.” Hmmm, a lot more cases. Yes? Under case name, type in “Gibbons” and press return. Now the top case is Gibbons v. Ogden (1824). The case name is a hyperlink so follow that now.

It is a long opinion by Chief Justice Marshall but at paragraph 5 he announces:

The power to regulate commerce extends to every species of commercial intercourse between the United States and foreign nations, and among the several States. It does not stop at the external boundary of a State.

The phrase “among the several States,” occurs 21 times in Gibbons v. Ogden, with no mention of the modern “interstate commerce.”

What we now call the “interstate commerce clause” played a major role in the New Deal legislation that ended the 1930s depression in the United States. See Commerce Clause. Following the cases cited under “New Deal” will give you an interesting view of the conflicting sides. A conflict that still rages today.

The terminology problem, “among the several states” vs. “interstate commerce,” is one that makes me doubt the efficacy of public access to law programs. Short of knowing the “right” search words, it is unlikely you would have found Gibbons v. Ogden. Well, short of reading through the entire corpus of Supreme Court decisions. 😉

Public access to law would be enhanced by mappings such as “interstate commerce” to “among the several states,” by noting that “due process” didn’t always mean what it means today, and by further mappings to colloquial search expressions.

A topic map could capture those nuances and many more.
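As a toy illustration of what even the simplest such mapping buys a searcher, here is a Python sketch. It is far short of a topic map, and everything in it is assumed for the example: the hand-built `EQUIVALENTS` list, the `expand` and `search` functions, and the two-document corpus (the second document is invented).

```python
from typing import Dict, FrozenSet, List

# Hypothetical hand-built mapping: each set groups phrases treated as naming one subject.
EQUIVALENTS: List[FrozenSet[str]] = [
    frozenset({"interstate commerce", "among the several states"}),
]


def expand(query: str) -> List[str]:
    """Return the query followed by any phrases mapped to the same subject."""
    normalized = query.strip().lower()
    for group in EQUIVALENTS:
        if normalized in group:
            return [normalized] + sorted(group - {normalized})
    return [normalized]


def search(query: str, corpus: Dict[str, str]) -> List[str]:
    """Return titles of documents containing any of the expanded phrases."""
    phrases = expand(query)
    return [title for title, text in corpus.items()
            if any(phrase in text.lower() for phrase in phrases)]


corpus = {
    "Gibbons v. Ogden (1824)": (
        "The power to regulate commerce extends to every species of commercial "
        "intercourse between the United States and foreign nations, and among "
        "the several States."
    ),
    "Unrelated opinion": "A dispute over a promissory note.",
}

print(search("interstate commerce", corpus))  # ['Gibbons v. Ogden (1824)']
```

Someone had to decide that those two phrases belong together, which is the editorial choice a plain keyword search leaves to the searcher.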

I guess the question is whether people should be free to search for the law or be freed by finding the law.

Legislative XML Data Mapping [$10K]

Friday, September 13th, 2013

Legislative XML Data Mapping (Library of Congress)

First, the important stuff:

First Place: $10K

Entry due by: December 31 at 5:00pm EST

Second, the details:

The Library of Congress is sponsoring two legislative data challenges to advance the development of international data exchange standards for legislative data. These challenges are an initiative to encourage broad participation in the development and application of legislative data standards and to engage new communities in the use of legislative data. Goals of this initiative include:
• Enabling wider accessibility and more efficient exchange of the legislative data of the United States Congress and the United Kingdom Parliament,
• Encouraging the development of open standards that facilitate better integration, analysis, and interpretation of legislative data,
• Fostering the use of open source licensing for implementing legislative data standards.

The Legislative XML Data Mapping Challenge invites competitors to produce a data map for US bill XML and the most recent Akoma Ntoso schema and UK bill XML and the most recent Akoma Ntoso schema. Gaps or issues identified through this challenge will help to shape the evolving Akoma Ntoso international standard.

The winning solution will win $10,000 in cash, as well as opportunities for promotion, exposure, and recognition by the Library of Congress. For more information about prizes please see the Official Rules.

Can you guess what tool or technique I would suggest that you use? 😉

The winner is announced February 12, 2014 at 5:00pm EST.

Too late for the holidays this year, too close to Valentine’s Day; what holiday will you be wanting to celebrate?

Input Requested: Survey on Legislative XML

Wednesday, September 11th, 2013

Input Requested: Survey on Legislative XML

A request for survey participants who are familiar with XML and law to comment on the Crown Legislation Markup Language (CLML), which is used for the content at:


By way of background, the Crown Legislation Mark-up Language (CLML) is used to represent UK legislation in XML. It’s the base format for all legislation published on the website. We make both the schema and all our data freely available for anyone to use, or re-use, under the UK government’s Open Government Licence. CLML is currently expressed as a W3C XML Schema which is owned and maintained by The National Archives. A version of the schema can be accessed online at . Legislation as CLML XML can be accessed from the website using the API. Simply add “/data.xml” to any legislation content page, e.g. .
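The “/data.xml” convention described above is easy to script. The Python sketch below only builds the URL and does not fetch anything; the `data_xml_url` helper is hypothetical, and the host and path in the example are placeholders, not the real site’s address.

```python
from urllib.parse import urlsplit, urlunsplit


def data_xml_url(page_url: str) -> str:
    """Append "/data.xml" to the path of a legislation content page URL."""
    parts = urlsplit(page_url)
    # Drop any trailing slash first so the result never contains "//data.xml".
    path = parts.path.rstrip("/") + "/data.xml"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


# Placeholder host and path, used only to show the transformation.
print(data_xml_url("https://legislation.example/ukpga/2010/1/"))
# https://legislation.example/ukpga/2010/1/data.xml
```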

Why is this important for topic maps?

Would you believe that the markup semantics of CLML are different from the semantics of United States Legislative Markup (USLM)?

And those are just the markup differences. Hard to say what substantive semantic variations there are in the laws themselves.

Mapping legal semantics becomes important when the United States claims extraterritorial jurisdiction for the application of its laws.

Or when the United States uses its finance laws to inflict harm on others. (Treasury’s war: the unleashing of a new era of financial warfare by Juan Carlos Zarate.)

Mapping legal semantics won’t make U.S. claims any less extreme but may help convince others of a clear and present danger.