Identifying National Parts of the Internet Outside a Country’s Top Level Domain

    Research output: Contribution to conference › Conference abstract for conference › Research › peer-review

    Abstract

    How does a national webarchive identify relevant national webpages outside a country’s Top Level Domain? As an increasing number of national webpages move to generic Top Level Domains like .com or .org, this question becomes increasingly crucial for national webarchives. One suggested approach has been to base extraction of national material on a snapshot of the entire internet in the form of a worldwide crawl. Another suggested approach is more silo-oriented, based on harvests of webpages referred to by webpages within a national Top Level Domain.

    This paper describes the outcome of a Danish research project aiming to identify automatic procedures for evaluating the two suggested approaches and for identifying Danish web content on websites outside the national Top Level Domain “.dk”. The datasets used were links from a 30 TB Danish 2012 bulk harvest and the 360 TB Internet Archive wide-0005 crawl, since these two harvests are comparable in time frame. At this refinement stage of the project, there are indications that the two approaches supplement each other.
    Important lessons learned are that the process of comparison is much more complex than anticipated, not only in establishing technical infrastructure and automatic detection, but also in the more theoretical question of what can be defined as a national webpage.

    The very basis for any harvesting approach is defining a collection scope by deciding what counts as a national webpage. Automating such definitions is far more difficult than originally anticipated. The automation here is based on a wide range of factors, including language and character encoding, among several others. An additional outcome of the project has been a generally applicable list of collection criteria. This list is the result of a cooperative effort among representatives of the fields of scholarship, the Danish webarchive, and computer science.
    Original language: English
    Publication date: 28 Apr 2015
    Number of pages: 1
    Publication status: Published - 28 Apr 2015
    Event: International Internet Preservation Consortium: General Assembly - Stanford University, Palo Alto, California, United States
    Duration: 27 Apr 2015 to 1 May 2015
    http://netpreserve.org/general-assembly/2015/overview

    Conference

    Conference: International Internet Preservation Consortium
    Location: Stanford University
    Country: United States
    City: Palo Alto, California
    Period: 27/04/2015 to 01/05/2015

    Keywords

    • Cultural heritage
    • Online
    • Outside TLD
