The techniques we discussed in the Cleanup and Reconciliation parts come in very handy when your data is already in a structured format. However, many fields (notoriously description) contain unstructured text, yet they usually convey a high amount of interesting information. To capture this in machine-processable format, named entity recognition can be used.

A Google Refine / OpenRefine extension developed by Multimedia Lab (ELIS, Ghent University / iMinds) and MasTIC (Université Libre de Bruxelles).

Unstructured metadata fields such as ‘description’ offer tremendous value for users to understand cultural heritage objects. However, this type of narrative information is of little direct use within a machine-readable context due to its unstructured nature. This paper explores the possibilities and limitations of Named-Entity Recognition (NER) to mine such unstructured metadata for meaningful concepts. These concepts can be used to leverage otherwise limited searching and browsing operations, but they can also play an important role to foster Digital Humanities research. In order to catalyze experimentation with NER, the paper proposes an evaluation of the performance of three third-party NER APIs through a comprehensive case study, based on the descriptive fields of the Smithsonian Cooper-Hewitt National Design Museum in New York. A manual analysis is performed of the precision, recall, and F-score of the concepts identified by the third-party NER APIs. Based on the outcomes of the analysis, the conclusions present the added value of NER services, but also point out the dangers of uncritically using NER, and by extension Linked Data principles, within the Digital Humanities. All metadata and tools used within the paper are freely available, making it possible for researchers and practitioners to repeat the methodology. By doing so, the paper offers a significant contribution towards understanding the value of NER for the Digital Humanities.
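For readers unfamiliar with the three measures named in the abstract, here is a minimal sketch of how precision, recall, and F-score are conventionally computed for one record. It is not the paper's evaluation code: the function name and the entity sets are hypothetical, made up purely for illustration, and the F-score shown is the balanced F1 variant, which I am assuming rather than quoting.

```python
def precision_recall_f1(extracted, gold):
    """Compare entities a NER service returned against a manually checked set.

    Both arguments are hypothetical illustrations, not data from the paper.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)  # entities the service got right

    # Precision: what share of the returned entities are correct?
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: what share of the correct entities were returned?
    recall = true_positives / len(gold) if gold else 0.0

    # F1: harmonic mean of precision and recall (assumed balanced weighting).
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: entities from one description field.
service_output = {"New York", "Cooper-Hewitt", "Art Deco", "Paris"}
manual_annotation = {"New York", "Cooper-Hewitt", "Art Deco", "Tiffany"}

p, r, f = precision_recall_f1(service_output, manual_annotation)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

A service that returns many entities can score well on recall while precision suffers, and the reverse, which is why the two are reported together.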

I commend the paper to you for a very close reading, particularly those of you in the humanities.

To conclude, the Digital Humanities need to launch a broader debate on how we can incorporate within our work the probabilistic character of tools such as NER services. Drucker eloquently states that ‘we use tools from disciplines whose epistemological foundations are at odds with, or even hostile to, the humanities. Positivistic, quantitative and reductive, these techniques preclude humanistic methods because of the very assumptions on which they are designed: that objects of knowledge can be understood as ahistorical and autonomous.’

…that objects of knowledge can be understood as ahistorical and autonomous.

Certainly possible, but lossy, very lossy, in my view.


