Abstract
This paper describes a framework that supports defining how to automatically identify national webpages outside a country’s top-level domain. The framework aims at a definition that can be made operational, so that national webpages can be detected automatically. At the same time, the framework aims at a definition that can be reused regardless of changes in behaviour on the net, in jurisdiction, and in technology. A crucial point of the framework is that the perspectives of collection, technology and scholarship are all present in decision making.
The framework originates from a study that aimed to evaluate two different strategies for automatic identification of national webpages outside a country’s top-level domain: one strategy was based on data from the Internet Archive’s wide_005 worldwide web crawl, and the other on a local web crawl based on bulk harvests from the Danish national web archive, Netarkivet. In both cases, however, a definition of national webpages was needed. Creating the framework was thus a prerequisite for the rest of the study.
The study and the framework are motivated by the fact that human communication is moving more and more onto the internet. This means that much present and future research into 21st-century information flows depends on optimised collection and archiving of such information in web archives. Web archives often reside within national cultural heritage institutions, and their collection scope is typically outlined in some form of legal deposit legislation.
Defining “national webpages” turned out to be far from trivial, and while creating the framework it quickly became obvious that such a definition requires three important perspectives in order to make qualified decisions. In this paper the definition is based on input from three fields, each represented by one of the authors: scholarship, the Danish web archive, and computer science. These correspond to the perspectives of scholarship, collection and technology, which are very different but all crucial when formulating a definition of national webpages that serves as the basis for actual collection and thus shapes a web archive.
Besides the non-trivial need for a formal definition, the study also found reason to argue that web collection strategies within a web archive must be adjusted repeatedly. The conditions for web collection are constantly changing. Even over a five-year period we see changes in the technology that can assist in collection, changes in human behaviour as activity moves away from countries’ top-level domains and onto .com, .org, etc., and changes in jurisdiction that influence how the web can be collected. Regular adjustments of what counts as national webpages are therefore likely to be needed. The presented framework thus consists of a list of general criteria that serve as a basis for adjusting web collection strategies and that can be made operational in a specific context, taking the three perspectives into account.
Original language | English |
---|---|
Publication date | 10 Jun 2015 |
Number of pages | 7 |
Status | Published - 10 Jun 2015 |
Event | RESAW conference: Web Archives as Scholarly Sources: Issues, Practices and Perspectives - Århus Universitet, Århus, Denmark. Duration: 8 Jun 2015 → 10 Jun 2015 |
Conference
Conference | RESAW conference |
---|---|
Location | Århus Universitet |
Country/Territory | Denmark |
City | Århus |
Period | 08/06/2015 → 10/06/2015 |
Projects
- 1 Finished
- WebDanica: Collecting Danish online cultural heritage outside the .dk domain
Zierau, E. (Project manager, academic)
01/01/2014 → 31/12/2014
Projects: Project › Research