The Royal Library of Belgium (KBR) houses an enormous collection of historical popular press magazines (similar to today’s Cosmopolitan, Time or Dag Allemaal). However, getting an overview of the contents of the whole corpus would be almost impossible if done manually. Current knowledge about these sources is based on qualitative research whose scope is kept narrow so that the work remains feasible by hand.
At the Ghent Centre for Digital Humanities we partnered with a team from KBR to face this challenge on a particular case: the ARTPRESSE corpus. This corpus is a digitized collection of Belgian magazines created during the interwar period (1920s – 1930s). The team at KBR particularly wanted to find comics in this corpus, because the style of comics changed significantly during this era.
My colleague, AI researcher Krishna Kumar Thirukokaranam Chandrasekar, and I created a workflow that uses computer vision to identify and gather comics, alongside other illustrations such as cartoons and photographs, from a sample of around 80,000 pages. In the future, we hope to extend this approach to extract all graphical elements from the entire 500,000-page corpus.
This workflow is a great example of the synergy between humans and computers, and of how Digital Humanities can advance research. While researchers or volunteers would get bored and tired of such a monumental task, computers can process every image faster and more reliably. Yet because a computer does not inherently understand what a comic or a cartoon is, researchers still need to break these concepts down into examples a computer can learn from.
A quality-hungry workflow
The workflow is built on YOLO (You Only Look Once) version 11, a funnily named yet highly advanced object detector that uses deep learning to visually identify objects in an image. Once the model is trained, it is extremely fast, taking less than half a second to scan a whole page and find all illustrations.
Together with the team at KBR and five enthusiastic interns with diverse academic backgrounds, we composed a dataset on which to train a YOLO model. Our model’s performance depends heavily on the quality of the training data, and humans often make mistakes. As the classic saying in machine learning goes: “garbage in, garbage out”. To ensure high-quality training data, all annotators therefore labelled the same set of data, using the Easy Label and Box Annotator (ELABA) tool developed by our team.
We then implemented a methodology to merge the different annotators' work into a single aggregated dataset. This process compared all labels assigned to a specific illustration and used a ‘majority voting’ approach to determine the correct answer.
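To make the idea of majority voting concrete, here is a minimal Python sketch. It is not the code used in the project: the illustration identifiers, the function names and the rule that a label must win a strict majority (with ties left undecided) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional


def majority_label(labels: list[str]) -> Optional[str]:
    """Return the label chosen by more than half of the annotators.

    Returns None when there are no labels or when no label reaches a
    strict majority (an assumed rule; such cases would need human review).
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def aggregate(annotations: dict[str, list[str]]) -> dict[str, Optional[str]]:
    """Apply majority voting to every illustration in the dataset."""
    return {
        illustration_id: majority_label(labels)
        for illustration_id, labels in annotations.items()
    }


# Hypothetical votes: one list of annotator labels per illustration.
votes = {
    "page_001/box_0": ["comic", "comic", "cartoon", "comic", "comic"],
    "page_001/box_1": ["photograph", "cartoon", "photograph", "cartoon"],
}
consensus = aggregate(votes)
```

In this sketch the first illustration is resolved to "comic", while the second ends in a tie and is left undecided so that a person can take a second look.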
Finding the comics
Once we had refined the workflow, we deployed it on the Ghent High Performance Computing Infrastructure of the Vlaams Supercomputer Centrum (VSC). This cutting-edge infrastructure allowed us to scale up our processing power significantly and process the magazines much faster. With it, we were able to deliver a high-quality dataset tailored to the specific needs of the researchers.
The continuous collaboration between the GhentCDH team and the comics scholars during development is best illustrated by the evolution of our workflow's output. Initially, the workflow extracted comics from the pages and saved them in a dataset. However, we realised that this approach risked losing some information, such as the title of a comic or details about its author or publisher. In response, we adapted the workflow to save the entire page and record the model's findings in a separate metadata file. Researchers can thus perform close reading directly from the enriched dataset without losing contextual information, achieving the project's ultimate goal of eliminating the need to browse through the magazines manually.
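The "whole page plus separate metadata file" output can be pictured with a small Python sketch. The JSON layout, field names, file name and helper function below are hypothetical and only illustrate the principle; the actual format of the project's metadata files is not described here.

```python
import json
import tempfile
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class Detection:
    """One illustration found on a page (assumed fields)."""
    label: str                               # e.g. "comic", "cartoon", "photograph"
    confidence: float                        # model confidence between 0 and 1
    box: tuple[float, float, float, float]   # x_min, y_min, x_max, y_max in pixels


def write_page_metadata(page_image: Path, detections: list[Detection], out_dir: Path) -> Path:
    """Write a JSON file next to the untouched page image.

    The page itself is never cropped; the metadata only points to the
    regions the model found, so titles and credits stay in view.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "page_image": page_image.name,
        "detections": [asdict(d) for d in detections],
    }
    out_path = out_dir / f"{page_image.stem}.json"
    out_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return out_path


# Usage with made-up values, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = write_page_metadata(
        Path("example_page_003.jpg"),
        [Detection("comic", 0.91, (120.0, 340.0, 980.0, 1460.0))],
        Path(tmp),
    )
    loaded = json.loads(path.read_text(encoding="utf-8"))
```

Keeping the detections in a sidecar file like this means the same page scan can be re-read with the surrounding title, author or publisher information still intact.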
This research was made possible with the support of the KBR Digital research lab. We would like to thank Benoît Crucifix (KULeuven, KBR), Erwin Dejasse (ULB, KBR), Sébastien Hermans (KBR) and the dedicated interns at KBR for this fruitful collaboration.
KBR - ARTPRESSE, "Hebdo, 26 Februari 1932"