Abstract
National web spheres have now been harvested for at least a decade, a period in which the internet has changed rapidly in both content and use. This has challenged not only the harvesting techniques but also the question of where to look for material relevant to a national web sphere.
This presentation starts with a historical overview of changes in collection strategies at the Danish web archive and ends with a description of the latest implementations made in Denmark to cover web material outside the national top-level domain.
During the last decade an increasing number of national web pages have moved to generic top-level domains such as .com or .org (Mjøs 2012). This challenge has grown as the use of foreign web hosting services, blogs and social media such as Twitter and Facebook has exploded, with hosts geographically located outside the country’s borders.
The challenge is far bigger than anticipated: a study last year indicated that different methods find different web material. The study used one approach based on Internet Archive data and one based on out-links from a national web archive (Zierau 2015). It concluded that several methods should be used to find data and include them in a web archive.
This presentation will include a description of an operational setup to meet this challenge (to be implemented in 2016). The setup is designed to deal with different sources, present and future. A source can be any set of URLs to be investigated, derived data (e.g. text extracts) to be investigated, or known national URLs that need preparation before ingestion into a web archive. The output of the setup will be seeds, domains and sub-domains that can be fed into national bulk and selective harvests of the national web sphere.
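As a rough illustration of the kind of output described above, the following minimal Python sketch groups candidate URLs outside a national top-level domain into domains, sub-domains and seeds. It is not the setup presented at the conference: the function name `candidates_outside_tld`, the use of `.dk` as the national TLD, and the rule that the last two host labels form the domain are all assumptions made for illustration. Deciding whether a candidate actually belongs to the national web sphere (for example by language or content) is outside the scope of the sketch.

```python
from urllib.parse import urlsplit

NATIONAL_TLD = "dk"  # assumption: Denmark's ccTLD, chosen only for illustration


def candidates_outside_tld(urls, national_tld=NATIONAL_TLD):
    """Group URLs hosted outside the national TLD by (naively derived) domain.

    Hypothetical helper: returns {domain: {"subdomains": set, "seeds": list}}.
    """
    grouped = {}
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        # Skip unparsable entries and hosts already under the national TLD.
        if not host or host.endswith("." + national_tld):
            continue
        labels = host.split(".")
        # Naive assumption: the last two labels form the registered domain.
        # This is wrong for suffixes such as .co.uk; a real setup would need
        # a public suffix list.
        domain = ".".join(labels[-2:])
        entry = grouped.setdefault(domain, {"subdomains": set(), "seeds": []})
        if host != domain:
            entry["subdomains"].add(host)
        entry["seeds"].append(url)
    return grouped
```

Each key of the returned dictionary is a candidate domain, with the sub-domains seen under it and the original URLs kept as possible seeds for a later harvest.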
Mjøs, O. J. (2012). Music, social media and global mobility: MySpace, Facebook, YouTube. Routledge Advances in Internationalizing Media Studies.
Zierau, E. (2015). Identifying National Parts of the Internet Outside a Country’s Top Level Domain. Presented at IIPC GA 2015, Stanford, California, USA.
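Translated title of the contribution | Se tilbage, se fremad: Nye strategier for dækning af den nationale web-sfære (Danish; in English: "Look back, look forward: New strategies for covering the national web sphere") |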
---|---|
Original language | English |
Publication date | 14 Apr 2016 |
Number of pages | 1 |
Status | Published - 14 Apr 2016 |
Event | IIPC Web Archiving Conference 2016 - Radisson Blu Saga Hotel, Reykjavík, Iceland. Duration: 13 Apr 2016 → 15 Apr 2016 https://netpreserveblog.wordpress.com/2015/11/24/2016-iipc-general-assembly-web-archiving-conference/ |
Conference
Conference | IIPC Web Archiving Conference 2016 |
---|---|
Location | Radisson Blu Saga Hotel |
Country/Territory | Iceland |
City | Reykjavík |
Period | 13/04/2016 → 15/04/2016 |
Internet address |