Abstract
An increasing number of national web pages are moving to generic Top Level Domains such as .com or .org. This is happening so fast that we risk losing much of our cultural heritage, because we cannot identify it in time to preserve it. The question of how to identify such material is therefore increasingly crucial for organizations that cover digital national heritage, including web archives for a specific country.
This poster presents the results of a research project that evaluated two automated approaches to recognising the part of a country’s web that lies outside its national Top Level Domain. One suggested approach is to base extraction of national material on a snapshot of the entire Internet in the form of a worldwide crawl. The other is more silo-oriented, based on harvests of web pages referred to by web pages within the national Top Level Domain.
More specifically, the research project aimed to identify automatic procedures for evaluating the two suggested approaches and for identifying Danish web content on websites outside the national Top Level Domain “.dk”. The datasets used were links from a 30 TB Danish 2012 bulk harvest and the 360 TB Internet Archive wide-0005 crawl, since these two harvests are comparable in time frame.
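As a rough illustration of the link-based approach, the sketch below filters a list of outgoing links down to the hosts that lie outside the national Top Level Domain. It is not the project's actual procedure: the function name `hosts_outside_tld` and the sample URLs are hypothetical, and a real harvest would read links from archive files rather than from a Python list.

```python
from urllib.parse import urlsplit


def hosts_outside_tld(links, tld="dk"):
    """Return the sorted hostnames among `links` that are not under `tld`."""
    suffix = "." + tld.lower()
    hosts = set()
    for url in links:
        host = urlsplit(url).hostname  # None for relative links
        if host is None:
            continue
        host = host.rstrip(".").lower()
        if host == tld or host.endswith(suffix):
            continue  # still inside the national domain
        hosts.add(host)
    return sorted(hosts)


# Hypothetical sample of links found on pages within .dk
sample_links = [
    "http://www.example.dk/om-os",
    "https://danskforening.example.org/nyheder",
    "https://shop.example.com/kurv?id=1",
    "/relative/path",
]
print(hosts_outside_tld(sample_links))
```

The resulting hosts are only candidates; whether their content counts as Danish still has to be decided by separate criteria.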
The poster will present:
• the two methods and the differences in their results
• indications that the two approaches find very different material
• the general method used to evaluate the nationality of web material over time
This general method is important here, since the basis of any harvesting approach is to define a collection scope by deciding what counts as a national web page. Automating such definitions is far more difficult than originally anticipated. The automation here is based on a wide range of implemented general criteria (e.g. language recognition, national terms like ‘je suis Charlie’, or phone number patterns). An additional outcome of the project is a generally applicable list of collection criteria, produced in cooperation between representatives of scholarship, the Danish web archive, and computer science.
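To make the idea of such criteria concrete, here is a minimal sketch of three indicators for Danish content: a phone number pattern, the presence of the letters æ, ø and å, and the share of frequent Danish function words. All three are assumptions chosen for illustration; the pattern, the word list and the name `danish_indicators` are hypothetical and are not the criteria the project implemented.

```python
import re

# Hypothetical pattern: "+45" or "0045" followed by eight digits, optionally in pairs.
DANISH_PHONE = re.compile(r"(?:\+45|0045)[ ]?(?:\d{2}[ ]?){3}\d{2}\b")

# Hypothetical, very small list of frequent Danish function words.
DANISH_WORDS = {"og", "ikke", "det", "er", "jeg", "til", "af", "en", "på", "med"}


def danish_indicators(text):
    """Compute a few simple signals that a text may be Danish."""
    lowered = text.lower()
    tokens = re.findall(r"[a-zæøå]+", lowered)
    word_hits = sum(1 for token in tokens if token in DANISH_WORDS)
    return {
        "phone": bool(DANISH_PHONE.search(text)),
        "danish_letters": any(ch in "æøå" for ch in lowered),
        "word_ratio": word_hits / len(tokens) if tokens else 0.0,
    }


print(danish_indicators("Ring til os på +45 12 34 56 78, vi er ikke langt væk."))
print(danish_indicators("Call us at +1 202 555 0100 today."))
```

No single indicator is decisive (Norwegian also uses æ, ø and å, for instance), which is one reason a combination of criteria is needed.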
It should be noted that the method part has been presented at the RESAW 2015 conference, but in a closed forum, and the results have been presented at the IIPC GA, but as a presentation rather than a poster. A poster offers a better opportunity to discuss and understand the work in depth and to exchange ideas based on the results.
Original language | English |
---|---|
Publication date | 4 Nov 2015 |
Number of pages | 1 |
Status | Published - 4 Nov 2015 |
Event | 12th International Conference on Preservation of Digital Objects - Chapel Hill, North Carolina, USA Duration: 2 Nov 2015 → 6 Nov 2015 https://ipres-conference.org/ |
Conference
Conference | 12th International Conference on Preservation of Digital Objects |
---|---|
Country/Territory | USA |
City | Chapel Hill, North Carolina |
Period | 02/11/2015 → 06/11/2015 |
Internet address |
Projects
- 1 Finished
-
WebDanica: Collecting Danish online cultural heritage outside the .dk domain
Zierau, E. (Project manager, academic)
01/01/2014 → 31/12/2014
Projects: Project › Research