Abstract
How does a national webarchive identify relevant national webpages outside a country’s Top Level Domain? As a growing share of national webpages moves to generic Top Level Domains such as .com or .org, this question becomes increasingly crucial for national webarchives. One suggested approach is to base the extraction of national material on a snapshot of the entire internet in the form of a worldwide crawl. Another suggested approach is more silo-oriented and is based on harvests of webpages referred to by webpages within a national Top Level Domain.
This paper describes the outcome of a Danish research project that aimed to identify automatic procedures for evaluating the two suggested approaches and for identifying Danish web content on websites outside the national Top Level Domain “.dk”. The datasets used were links from a 30 TB Danish 2012 bulk harvest and the 360 TB Internet Archive wide-0005 crawl, chosen because the two harvests are comparable in time frame. At the current refinement stage of the project, there are indications that the two approaches are supplementary.
An important lesson learned is that the process of comparison is much more complex than anticipated, both in establishing technical infrastructure and automatic detection and in the more theoretical question of what can be defined as a national webpage.
The basis for any harvesting approach is defining a collection scope by deciding what counts as a national webpage. Automating such definitions is far more difficult than originally anticipated. The automation here is based on a wide range of factors, including language and character encoding. An additional outcome of the project is a generally applicable list of collection criteria. This list is the result of a cooperative effort among representatives from the fields of scholarship, the Danish webarchive, and computer science.
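As a rough illustration only, the link-based approach and two of the factors named above (language and Danish-specific characters) could be sketched as follows. This is not the project's procedure: the function names, the stopword list, and the idea of reporting a stopword ratio are all assumptions made for the example, and real criteria would need to be far richer.

```python
from urllib.parse import urlparse

# Hypothetical signals chosen for illustration; the project's actual
# criteria are not listed in this abstract.
DANISH_LETTERS = set("æøåÆØÅ")
DANISH_STOPWORDS = {"og", "det", "ikke", "jeg", "til", "af", "på", "med", "den", "som"}


def is_outside_tld(url: str, tld: str = ".dk") -> bool:
    """Return True if the URL has a host and that host is not under the TLD."""
    host = (urlparse(url).hostname or "").lower()
    return bool(host) and not host.endswith(tld)


def candidate_links(outlinks, tld: str = ".dk"):
    """Keep unique outlinks whose host lies outside the TLD, preserving order."""
    seen = set()
    result = []
    for url in outlinks:
        if is_outside_tld(url, tld) and url not in seen:
            seen.add(url)
            result.append(url)
    return result


def danish_signals(text: str) -> dict:
    """Compute two crude indicators that a page text may be Danish."""
    words = text.lower().split()
    stop_hits = sum(
        1 for w in words if w.strip(".,;:!?\"'()") in DANISH_STOPWORDS
    )
    return {
        "has_danish_letters": any(ch in DANISH_LETTERS for ch in text),
        "stopword_ratio": stop_hits / len(words) if words else 0.0,
    }
```

Given the outlinks extracted from pages in a .dk harvest, `candidate_links` would yield the pages outside .dk that are worth inspecting, and `danish_signals` gives per-page indicators that a human or a later classifier could weigh. Neither signal is decisive on its own, since letters such as æ, ø and å also occur in Norwegian.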
Original language | English |
---|---|
Publication date | 28 Apr 2015 |
Number of pages | 1 |
Publication status | Published - 28 Apr 2015 |
Event | International Internet Preservation Consortium: General Assembly - Stanford University, Palo Alto, California, United States. Duration: 27 Apr 2015 → 1 May 2015. http://netpreserve.org/general-assembly/2015/overview |
Conference
Conference | International Internet Preservation Consortium |
---|---|
Location | Stanford University |
Country/Territory | United States |
City | Palo Alto, California |
Period | 27/04/2015 → 01/05/2015 |
Internet address | http://netpreserve.org/general-assembly/2015/overview |
Keywords
- Cultural heritage
- Online
- Outside TLD
Projects
- 1 Finished
- WebDanica: Indsamling af dansk onlinekulturarv uden for .dk-domænet (WebDanica: Collecting Danish online cultural heritage outside the .dk domain)
Zierau, E. (Project manager, academic)
01/01/2014 → 31/12/2014
Project: Research