As part of our recent Digital Humanities Doctoral School programme, participants were asked to write a blog post capturing their experiences with the digital humanities. This week we hear from Dr. Mario Slugan, a Marie Skłodowska-Curie Actions Fellow at the Centre for Cinema and Media Studies, Ghent University, Belgium. Mario is working on the European Union Horizon 2020-funded project "Fiction, Imagination, and Early Cinema" under the supervision of Professor Daniël Biltereyst. He holds degrees in Computer Science, Comparative Literature, Philosophy, Slavic Studies, German Studies, and Film and Television Studies. He is the author of two books: Montage as Perceptual Experience: Berlin Alexanderplatz from Döblin to Fassbinder (Camden House, 2017) and Noël Carroll on Film: A Philosophy of Art and Popular Culture (I. B. Tauris, 2018, forthcoming). He is also the managing editor of the open-access, peer-reviewed academic journal Apparatus: Film, Media and Digital Cultures of Central and Eastern Europe.
Digital Humanities and Film History
The Doctoral School on Digital Humanities gave me a great opportunity to think not only about how I could apply my Computer Science background to my current research in film history, but also about whether I should do the coding myself (and how much of it) or whether it would be more efficient to defer to the IT department for the practical implementation.
On the first question: my present research revolves around archival work, much of it made possible by recently digitized sources such as the Media History Digital Library and Lantern, the tool for searching and visualizing it. The Media History Digital Library is an open-access treasure trove for film historians, providing millions of pages of film magazines, trade press, books, and more, all of which are downloadable, searchable, and browsable online for free. When collecting data on topics of my research such as Hale’s Tours or lecturers, for instance, I often start by searching for the key terms of interest. Although the search is generally far more effective than any pedestrian browsing or reading of the material, its precision and recall leave something to be desired. If one searches for “Tours”, for instance, false positives such as “Yours” regularly turn up among the results. And oftentimes “Tours” does not register as a true positive when it should, because the OCR has misrecognized the word. This gave me the far-from-original idea that the effectiveness of the search function could perhaps be improved.
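To make the precision and recall problem concrete: “Yours” is a single character substitution away from “Tours”, which is exactly the kind of one-letter confusion OCR produces; and by the same logic, a search tolerant of one edit can recover a misrecognized spelling. A minimal Python sketch, using a hand-rolled edit distance and an invented list of OCR tokens (the misreading “Tonrs” is hypothetical), illustrates both sides:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# "Yours" is one substitution away from "Tours" -- hence the false positives.
print(levenshtein("Tours", "Yours"))  # 1

# A hypothetical OCR misreading like "Tonrs" is equally close, so a search
# that tolerates one edit recovers it as a true positive.
ocr_tokens = ["Hale's", "Tonrs", "and", "the", "travelogue"]
hits = [t for t in ocr_tokens if levenshtein(t.lower(), "tours") <= 1]
print(hits)  # ['Tonrs']
```

Of course, the same tolerance that recovers “Tonrs” would also admit “Yours”, which is why fuzzy matching alone cannot fix the problem and context-aware correction becomes attractive.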
This brings me to the second question: how much of the IT work could and/or should I do myself, and how much should be done in consultation with experts? I clearly do not have access to the original pre-processed material, only to the final output after OCR has taken place. Given that it is highly unlikely I could gain access to the pre-OCR data, the most viable and simplest solution is to work with the text version of the output and try to correct it. Various post-processing error-correction algorithms for improving OCR output already exist. Some use Google’s online spelling suggestions. Others use corpora to correct mistakes specific to particular historical English-language texts. (And this does not even scratch the surface.) With that in mind, the best course of action would be to identify the procedure that could be expected to work best with the type of text I am dealing with: early twentieth-century English-language material on the film industry. I suspect this is the part for which I would need proper consultations with experts in the field. Once an existing method is identified, however, it should not be too complicated for me to write the code to test the chosen procedure.
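As a rough illustration of what the corpus-based flavour of post-correction might look like, here is a minimal Python sketch that replaces out-of-vocabulary tokens with their closest match from a domain lexicon. The lexicon and the sample OCR line are invented for the example; a real implementation would build the lexicon from period film trade press and use a far more sophisticated correction model than a similarity cutoff:

```python
import difflib

# Hypothetical domain lexicon; in practice this would be derived from a
# corpus of early twentieth-century film trade press.
LEXICON = {"hale's", "tours", "lecturer", "nickelodeon", "picture", "the", "and"}

def correct_token(token: str, cutoff: float = 0.8) -> str:
    """Replace an out-of-vocabulary token with its closest lexicon entry,
    if one is similar enough; otherwise leave the token unchanged."""
    low = token.lower()
    if low in LEXICON:
        return token  # already a known word; keep original casing
    matches = difflib.get_close_matches(low, LEXICON, n=1, cutoff=cutoff)
    return matches[0] if matches else token

# Invented OCR output with two plausible misrecognitions.
ocr_line = "Hale's Tonrs and the lecturcr"
print(" ".join(correct_token(t) for t in ocr_line.split()))
# → Hale's tours and the lecturer
```

This naive version normalizes corrected words to lowercase and considers each token in isolation; the published approaches cited below add context, frequency information, and historical spelling variants on top of this basic idea.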
For an analysis of the search function on the database see Eric Hoyt, Kit Hughes, Derek Long, Kevin Ponto, and Anthony Tran, “Scaled Entity Search: A Method for Media Historiography and Response to Critiques of Big Humanities Data Research,” Proceedings of IEEE Big Humanities Data (2014).
Youssef Bassil and Mohammad Alwani, “OCR Post-Processing Error Correction Algorithm Using Google Online Spelling Suggestion,” Journal of Emerging Trends in Computing and Information Sciences 3, no. 1 (2012).
B. Alex, C. Grover, E. Klein, and R. Tobin, “Digitised Historical Text: Does It Have to Be MediOCRe?,” in Proceedings of KONVENS (2012), 401–409.