Digging into Big Web Archive Data: The Development of the Danish Web 2006-2015

Niels Brügger, Janne Nielsen, Ditte Laursen

Publication: Conference contribution, conference abstract


ABSTRACT. In this paper we examine how an entire national web domain has developed, with the Danish web as a case. In brief, we investigate the research question: What has the Danish web domain looked like in the past, and how has it developed in the period 2006-2015? Studying national web domains and their development at scale is a novel approach to web studies as well as to the writing of media and communication history, although some studies exist (e.g. Ben-David, 2016; Hale et al., 2014; Rogers et al., 2013). It is therefore necessary to introduce a number of methodological issues related to this new type of study, including reflections on: a) how a national web can be delimited, b) what characterizes the archived web as a historical source for academic studies, and c) the general characteristics of our data source, the archived web in the national Danish web archive Netarkivet. Once this is in place, the paper introduces in more detail the data sources of the case study and how the data were processed to enable the study. A selection of the analytical results and insights is then presented and discussed, and, finally, possible next steps are outlined. A number of the general methodological themes related to this type of study have been discussed in the literature (Brügger, 2017; Brügger & Laursen, 2018, 2019), and this paper therefore focuses on how these general themes were translated into an analytical design, and on the resulting historical analysis, the first of its kind at this scale. The paper is based on an ongoing research project whose first phases have been highly exploratory. One of the aims was to become familiar with the source material, including developing the methods needed to unlock the material and to make the first digs into large amounts of this new type of digital cultural heritage, the archived web (for a brief research history of studies of national web domains, see Brügger & Laursen, 2018: 415-416).
This study of the historical development of the Danish web is based on the material in the Danish web archive Netarkivet, and we delimit ‘the Danish web’ to what was present on the country code Top-Level Domain (ccTLD) .dk as well as the material on other TLDs that Netarkivet has identified and collected as relevant for the Danish web. The use of Netarkivet is also the main reason why the investigated period starts in 2006, since the first relevant crawl in Netarkivet dates from that year. Working with data of this amount and complexity demands an analytical design that is rigorously and thoroughly thought out to make the analysis manageable. We distinguish between three main phases: 1) extracting, transforming and loading (ETL), 2) selecting the corpus, 3) translating research questions to code. Each of these three phases will be presented briefly. In the research project on which this paper is based, a large number of metrics were generated to get a better understanding of the historical development of the Danish web. To provide an overview of what these types of results look like, we have selected five sub-research questions to be investigated in the paper: 1) the size of the Danish web, 2) Web Danica outside the Danish ccTLD, 3) maintenance of the Danish web, 4) the degree of restricted access, 5) content types: written text and images.
Status: Published - 2019
