Everything is Editorial: Why Algorithms are Hand-Made, Human, and Not Just For Search Anymore by Aaron Kirschenfeld.
From the post:
Down here in Durham, NC, we have artisanal everything: bread, cheese, pizza, peanut butter, and of course coffee, coffee, and more coffee. It’s great—fantastic food and coffee, that is, and there is no doubt some psychological kick from knowing that it’s been made carefully by skilled craftspeople for my enjoyment. The old ways are better, at least until they’re co-opted by major multinational corporations.
Aside from making you either hungry or jealous, or perhaps both, why am I talking about fancy foodstuffs on a blog about legal information? It’s because I’d like to argue that algorithms are not computerized, unknowable, mysterious things—they are produced by people, often painstakingly, with a great deal of care. Food metaphors abound, helpfully I think. Algorithms are the “special sauce” of many online research services. They are sets of instructions to be followed and completed, leading to a final product, just like a recipe. Above all, they are the stuff of life for the research systems of the near future.
Human Mediation Never Went Away
When we talk about algorithms in the research community, we are generally talking about search or information retrieval (IR) algorithms. A recent and fascinating VoxPopuLII post by Qiang Lu and Jack Conrad, “Next Generation Legal Search – It’s Already Here,” discusses how these algorithms have become more complicated by considering factors beyond document-based, topical relevance. But I’d like to step back for a moment and head into the past for a bit to talk about the beginnings of search, and the framework that we have viewed it within for the past half-century.
Many early information-retrieval systems worked like this: a researcher would come to you, the information professional, with an information need, that vague and negotiable idea which you would try to reduce to a single question or set of questions. With your understanding of Boolean search techniques and your knowledge of how the document corpus you were searching was indexed, you would then craft a search for the computer to run. Several hours later, when the search was finished, you would be presented with a list of results, sometimes ranked in order of relevance and limited in size because of a lack of computing power. Presumably you would then share these results with the researcher, or perhaps just turn over the relevant documents and send him on his way. In the academic literature, this was called “delegated search,” and it formed the background for the most influential information retrieval studies and research projects for many years—the Cranfield Experiments. See also “On the History of Evaluation in IR” by Stephen Robertson (2008).
In this system, literally everything—the document corpus, the index, the query, and the results—was mediated. There was a medium, a middle-man. The dream was to some day dis-intermediate, which does not mean to exhume the body of the dead news industry. (I feel entitled to this terrible joke as a former journalist… please forgive me.) When the World Wide Web and its ever-expanding document corpus came on the scene, many thought that search engines—huge algorithms, basically—would remove any barrier between the searcher and the information she sought. This is “end-user” search, and as algorithms improved, so too would the system, without requiring the searcher to possess any special skills. The searcher would plug a query, any query, into the search box, and the algorithm would present a ranked list of results, high on both recall and precision. Now, the lack of human attention, evidenced by the fact that few people ever look below result 3 on the list, became the limiting factor, instead of the lack of computing power.
The only problem with this is that search engines did not remove the middle-man—they became the middle-man. Why? Because everything, whether we like it or not, is editorial, especially in reference or information retrieval. Everything, every decision, every step in the algorithm, everything everywhere, involves choice. Search engines, then, are never neutral. They embody the priorities of the people who created them and, as search logs are analyzed and incorporated, of the people who use them. It is in these senses that algorithms are inherently human.
A delightful piece on search algorithms that goes to the heart of successful search and data integration.
Its first three words capture the issue: Everything is Editorial….
Despite the pretensions of scholars, sages and rogues, everything is editorial: there are no universal semantic primitives.
For convenience in data processing we may choose to treat some tokens as semantic primitives, but that is always a choice that we make.
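To make that choice concrete, here is a minimal sketch in Python of a Boolean AND search over a toy inverted index. Nothing in it comes from Aaron's post: the three documents, the function names, and the two normalizers (`exact` and `folded`) are hypothetical, invented only to show that deciding which tokens count as "the same" changes what the search returns.

```python
from collections import defaultdict


def build_index(docs, normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index


def boolean_and(index, terms, normalize):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(normalize(term), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Two hypothetical editorial choices about token identity.
def exact(token):
    # Treat the raw string as the primitive: "Court" and "court" differ.
    return token


def folded(token):
    # Fold case and trim trailing punctuation before comparing.
    return token.lower().strip(".,;:")


# Toy corpus, invented for illustration only.
docs = {
    1: "Court affirms the ruling.",
    2: "The court reversed.",
    3: "Ruling affirmed on appeal",
}

query = ["court", "ruling"]
exact_hits = boolean_and(build_index(docs, exact), query, exact)
folded_hits = boolean_and(build_index(docs, folded), query, folded)

print("exact :", sorted(exact_hits))
print("folded:", sorted(folded_hits))
```

The same query over the same documents finds nothing under `exact` and finds document 1 under `folded`. Neither normalizer is the correct one; each is a decision someone made on the searcher's behalf.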
Once you make that leap, it comes as no surprise that owl:sameAs wasn’t used the same way by everyone who used it.
See: When owl:sameAs isn’t the Same: An Analysis of Identity Links on the Semantic Web by Harry Halpin, Ivan Herman, and Patrick J. Hayes, for one take on the confusion around owl:sameAs.
If you are interested in moving beyond opaque keyword searching, consider Aaron’s post carefully.