Automated Coding of Historical Danish Cause of Death Data Using String Similarity

Louise Villefrance Isted Ludvigsen, Mads Linnet Perner, Bjørn-Richard Pedersen, Rafael Nozal Cañadas, Anders Sildnes, Nikita Shvetsov, Trygve Andersen, Lars Ailo Bongo, Hilde Leikny Sommerseth

Publication: Contribution to journal, Journal article, Research, peer-reviewed


The study of causes of death has been central to some of the most influential studies of the modern mortality decline in the nineteenth and twentieth centuries. The digitization of individual-level cause-of-death data has been game-changing. However, the data present a major challenge: how do we efficiently code the thousands of unique strings for analysis? This paper examines how far we can get with automated coding based on string similarity. We apply a Jaro-Winkler string similarity algorithm in Python (pyjarowinkler) to code our cause-of-death data from the Copenhagen Burial Register 1861-1911 to DK1875, a contemporary coding and classification system from nineteenth-century Denmark. We then compare the performance of the algorithm to that of a manual (historian) coder in three ways: at the level of each unique cause-of-death string, at the level of each cause-of-death group, and for the overall cause-of-death pattern for all burials in Copenhagen 1861-1911. Our results show that a minimum-effort algorithm coded approximately half of the causes of death correctly when compared to the manually coded dataset. The method applied here is therefore not accurate enough for actual analysis of mortality patterns, as it is not possible to examine individual causes within larger causal groups. However, the results are promising for other uses of the method as an aid to the manual coder. One way forward could be to use cut-off points on the Jaro-Winkler scores, coding only those causes where the string similarity match is relatively certain. Another could be to use the automated method to catch most of the initial cases of a disease with highly standardized phrasing, such as cancer. In both cases, the remaining unique cause-of-death strings could then be coded by a manual coder.
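The cut-off idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: Jaro-Winkler is written out in plain Python here instead of calling pyjarowinkler, the function names (jaro, jaro_winkler, code_cause) are made up for illustration, and the reference entries and their codes are hypothetical placeholders, not real DK1875 codes. The 0.9 cut-off is likewise an assumed value.

```python
from typing import Dict, Optional, Tuple


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def code_cause(raw: str, reference: Dict[str, str],
               cutoff: float = 0.9) -> Tuple[Optional[str], float]:
    """Return (code, score) for the best reference match.

    The code is None when the best score falls below the cut-off, which
    marks the string as one to leave for the manual coder. Assumes the
    reference dictionary is non-empty.
    """
    cleaned = raw.strip().lower()
    best_term = max(reference, key=lambda term: jaro_winkler(cleaned, term))
    score = jaro_winkler(cleaned, best_term)
    return (reference[best_term] if score >= cutoff else None), score


# Hypothetical reference list: the codes are placeholders, not DK1875 codes.
REFERENCE = {
    "tuberkulose": "HYPOTHETICAL_TB",
    "kræft": "HYPOTHETICAL_CANCER",
    "lungebetændelse": "HYPOTHETICAL_PNEUMONIA",
}

if __name__ == "__main__":
    for text in ["Tuberculose", "xyz"]:
        print(text, code_cause(text, REFERENCE))
```

Raising the cut-off sends more strings to the manual coder but reduces the risk of a wrong automatic match; lowering it does the opposite. A suitable value would have to be chosen by checking scores against a manually coded sample.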
Journal: Digital Humanities in the Nordic and Baltic Countries Publications
Issue number: 1
Pages (from-to): 203-221
Number of pages: 18
Status: Published - 10 Oct 2023