Teaching machines to think like conservators - machine learning as a tool for predicting the stability of paper based archive and library collections

Ulla Bøgvad Kejser, Birgit Vinther Hansen, Morten Ryhl-Svendsen, Christian Boesgaard, Søren Mollerup

Publication: Contribution to book/anthology/report; Conference abstract in proceedings; Research; Peer-reviewed


Conservators and conservation scientists collect large amounts of data to support heritage preservation. Typically, the data relates to the cultural significance and nature of museum objects and their surrounding environment. In many disciplines, machine learning is increasingly used to enhance and automate data analysis. Machine learning is a subfield of artificial intelligence (AI). It brings together computer science and statistics to enable computers to do what is natural to humans, namely to learn from experience. Our hypothesis is that this technology can also strengthen the analysis of conservation-related data, thereby creating a better foundation for decision-making around preservation. To better understand the challenges and opportunities of machine learning, we designed a case study based on a dataset created in 2007 to assess the need for mass de-acidification of paper-based collections in Danish national archives and libraries. The dataset consisted of 756 randomly selected samples. For each sample, the hand folding number, acidity (pH) and color (CIE Lab b*) had been measured to evaluate the condition of the paper. In addition, each sample had an identification number and a creation year (Fig. 1). The purpose of the study was to predict the number of hand foldings, i.e. the brittleness of the paper, based on the features year, acidity and color, and to test the relative importance of these features. Based on the characteristics of the dataset, we grouped the number of hand foldings into three classes: 1-3 (very brittle paper), 4-12 (brittle), and more than 12 foldings (not brittle), and applied supervised classification. In this method, a machine-learning algorithm is trained to model the relationship between known input (year, pH, color) and output (hand foldings) data. The model’s ability to accurately classify output is then tested on additional known input and output data (Fig. 2).
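The grouping of hand folding numbers into three classes can be sketched as a small Python function. This is a minimal illustration, not the authors' code: the function name `fold_class` and the decision to reject values below 1 are assumptions, and the example fold counts are made up.

```python
def fold_class(n_folds: int) -> str:
    """Map a hand folding number to one of the three brittleness classes."""
    if n_folds < 1:
        # Assumption: the abstract's classes start at 1, so lower values are rejected.
        raise ValueError("hand folding number must be at least 1")
    if n_folds <= 3:
        return "very brittle"
    if n_folds <= 12:
        return "brittle"
    return "not brittle"


# Hypothetical fold counts, for illustration only (not from the 2007 dataset).
for n in (1, 3, 4, 12, 13):
    print(n, fold_class(n))
```

The resulting class labels would serve as the output that a supervised classifier learns to predict from year, pH and color.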
More specifically, we compared the accuracy of two methods, K-Nearest Neighbors (KNN) and Random Forest Classification (RFC). Since the dataset is relatively small, we applied ten-fold cross-validation, i.e. we split the dataset into ten parts and, in each of ten rounds, used 90% of the data to train the model and the reserved 10% for testing. We used a Jupyter Notebook in Google Colaboratory for data analysis (https://colab.research.google.com/). The results show that the accuracy of the models is 74% for KNN and 79% for RFC (Fig. 3-7). In itself, this first result is not good enough to be useful in practice, but it demonstrates the potential of integrating machine learning and data science in the field of conservation. Our next goal is to train the model on another large dataset including the same features. An important future improvement would also be to strengthen domain-specific knowledge about the input data, as, for example, the feature “year” not only reflects age in linear terms, but also holds underlying information on how the paper was manufactured, which suggests how stable it was from the beginning. Likewise, we are interested in exploring the use of images of paper as an alternative to color measurement, and other types of image-oriented machine learning algorithms.
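The ten-fold cross-validation procedure can be sketched as follows. This is a standard-library-only illustration under stated assumptions, not the notebook used in the study: the abstract does not name the library or the KNN settings, so the helper names, the choice of k=5 neighbours, the fixed random seeds, and the synthetic three-feature data are all hypothetical. Only KNN is shown; a random forest would be evaluated with the same fold loop.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and split them into n_folds nearly equal test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def knn_predict(train_X, train_y, x, k=5):
    """Majority vote among the k training points closest to x (squared Euclidean distance)."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cross_val_accuracy(X, y, n_folds=10, k=5, seed=0):
    """Mean test accuracy over the folds; each fold is held out exactly once."""
    scores = []
    for test_idx in k_fold_indices(len(X), n_folds, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


# Synthetic placeholder data: three features already scaled to a common range.
# These values are invented for illustration and are NOT the study's measurements.
rng = random.Random(1)
X, y = [], []
for label, centre in (("very brittle", 0.2), ("brittle", 0.5), ("not brittle", 0.8)):
    for _ in range(60):
        X.append([rng.gauss(centre, 0.05) for _ in range(3)])
        y.append(label)

print("fold sizes for 756 samples:", [len(f) for f in k_fold_indices(756)])
print("mean accuracy on synthetic data:", cross_val_accuracy(X, y))
```

With 756 samples, this split gives six folds of 76 samples and four of 75, so each round tests on roughly 10% of the data. Because KNN relies on distances, features on very different scales (such as year and pH) would normally be rescaled first; the synthetic data above sidesteps this by being generated on a common scale.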
Title: ICOM-CC 19th Triennial Conference: Transcending Boundaries: Integrated Approaches to Conservation
Status: E-pub ahead of print - 2020
Event: ICOM-CC Triennial Conference: 19th Triennial Conference 2021 Beijing - Virtual, Beijing, China
Duration: 17 May 2021 - 21 May 2021


Conference: ICOM-CC Triennial Conference