Identifying National Parts of the Internet Outside a Country’s Top Level Domain

    Publication: Conference contribution, conference abstract for conference, research, peer-reviewed



    How does a national webarchive identify relevant national webpages outside a country’s Top Level Domain? As an increasing number of national webpages move to generic Top Level Domains such as .com or .org, this question becomes increasingly crucial for national webarchives. One suggested approach is to extract national material from a snapshot of the entire internet in the form of a worldwide crawl. Another suggested approach is more silo-oriented: it is based on harvests of webpages referred to by webpages within a national Top Level Domain.

    This paper describes the outcome of a Danish research project that aimed to identify automatic procedures for evaluating the two suggested approaches and for identifying Danish web content on websites outside the national Top Level Domain “.dk”. The datasets used were links from a 30 TB Danish bulk harvest from 2012 and the 360 TB Internet Archive wide-0005 crawl, chosen because the two harvests are comparable in time frame. At the current refinement stage of the project, there are indications that the two approaches supplement each other.
    Important lessons learned are that the comparison process is much more complex than anticipated, not only in establishing technical infrastructure and automatic detection, but also in the more theoretical question of what can be defined as a national webpage.
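
The link-based approach described above can be illustrated with a minimal sketch. It assumes the link data is available as (source URL, target URL) pairs; that input format and the function names `host_of`, `in_tld`, and `outbound_candidates` are hypothetical and are not taken from the project. The sketch only collects hosts outside ".dk" that are linked from pages inside ".dk", which would give candidates for further evaluation rather than a final decision on what is Danish.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host name of a URL, or '' if none can be parsed."""
    try:
        host = urlsplit(url).hostname
    except ValueError:
        # Malformed URLs (for example a broken IPv6 literal) are treated as host-less.
        return ""
    return (host or "").lower().rstrip(".")


def in_tld(url, tld="dk"):
    """True if the URL's host lies inside the given top level domain."""
    host = host_of(url)
    return host == tld or host.endswith("." + tld)


def outbound_candidates(links, tld="dk"):
    """Collect hosts outside `tld` that are linked from pages inside `tld`.

    `links` is an iterable of (source_url, target_url) pairs (assumed format).
    """
    hosts = set()
    for source, target in links:
        if in_tld(source, tld) and not in_tld(target, tld):
            host = host_of(target)
            if host:  # skip mailto: links and other host-less targets
                hosts.add(host)
    return sorted(hosts)


links = [
    ("http://www.example.dk/a", "https://Example.com/da/"),
    ("http://www.example.dk/a", "http://other.dk/"),
    ("http://foreign.org/", "http://third.net/"),
    ("http://shop.dk/", "mailto:info@shop.dk"),
]
print(outbound_candidates(links))
```

Only the first pair contributes a candidate here: the second stays inside ".dk", the third starts outside it, and the fourth has no host.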

    The very basis for any harvesting approach is defining a collection scope by deciding what counts as a national webpage. Automating such definitions is far more difficult than originally anticipated. The automation here is based on a wide range of factors, including language and character encoding. An additional outcome of the project is a generally applicable list of collection criteria, the result of a cooperative effort among representatives of scholarship, the Danish webarchive, and computer science.
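
As a rough illustration of how factors such as language and character encoding might be combined, the sketch below computes a simple score for a page. Everything specific in it is an assumption made for illustration: the word list, the set of legacy encodings, the weights, and the name `danish_score` are hypothetical and do not reproduce the project's actual criteria. A heuristic this crude would also score closely related languages such as Norwegian highly, which hints at why automating the definition is difficult.

```python
import re

# Hypothetical list of frequent Danish function words (an assumption, not the project's list).
DANISH_WORDS = {"og", "det", "ikke", "er", "jeg", "til", "på", "af", "med", "som", "har", "den"}

# Legacy Western European encodings, treated here as a weak supporting signal (assumption).
LEGACY_CHARSETS = {"iso-8859-1", "iso-8859-15", "windows-1252"}


def danish_score(text, charset=None):
    """Return a score between 0 and 1; higher suggests Danish content."""
    lowered = text.lower()
    words = re.findall(r"[^\W\d_]+", lowered)  # runs of Unicode letters
    if not words:
        return 0.0
    word_ratio = sum(w in DANISH_WORDS for w in words) / len(words)
    word_signal = min(1.0, word_ratio * 4)  # saturate at 25 % function words
    letter_signal = 1.0 if re.search(r"[æøå]", lowered) else 0.0
    charset_signal = 1.0 if (charset or "").lower() in LEGACY_CHARSETS else 0.0
    # Illustrative weights only; real criteria would need tuning and validation.
    return 0.6 * word_signal + 0.3 * letter_signal + 0.1 * charset_signal


danish = danish_score("Det er ikke til at vide, og jeg har på fornemmelsen", "iso-8859-1")
english = danish_score("The quick brown fox jumps over the lazy dog", "utf-8")
print(danish > english)
```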
    Publication date: 28 Apr 2015
    Number of pages: 1
    Status: Published - 28 Apr 2015
    Event: International Internet Preservation Consortium: General Assembly - Stanford University, Palo Alto, California, USA
    Duration: 27 Apr 2015 to 1 May 2015


    Conference: International Internet Preservation Consortium
    Location: Stanford University
    City: Palo Alto, California