Making web collections for research Sustainable & Reusable: Possibilities and Challenges experienced

Translated title of the contribution: Skabelse af forsknings web-samlinger der er holdbare og genbrugelige: Erfarede muligheder og udfordringer

Eld Zierau, Per Møldrup-Dalum

Publication: Conference contribution; Conference abstract for conference; Research; peer-reviewed

Abstract

This presentation concerns the lessons learned from the work of ensuring the persistency of web corpora (web collections) for later reuse and result verification.

The experiences were gained when finalizing the project “Probing a Nation's Web Domain” (presented at IIPC-2016) which looked at changes in the Danish web archive (Netarkivet) over time.

Persistency of the web collections is essential for future use of the project results. The only way to reconstruct the corpora, or to create new statistical results on the basis of them, would be to run the extraction tool developed in the project. However, the architecture around Netarkivet may change (requiring changes to the extraction tool), and data in Netarkivet may be enriched. In both cases, it is uncertain whether a new extract will be the same as the original one, even over a short three-year horizon.

Persistency of the web corpora was achieved by following the recommendations of the RESAW-2017 paper “Data Management of Web Archive Research Data”. This includes the use of Persistent Web Identifiers (PWIDs) for each corpus element (for Netarkivet in the form pwid:netarkivet.dk:<UTC archiving time>:part:<archived URL>). In general, Netarkivet recommends the use of PWIDs for references to its materials. Previously, the recommendation was to refer to file/offset, but this broke when Netarkivet data was migrated to compressed files. Furthermore, we foresee that traditional archive URLs will break soon, since changes are planned for the access tools.
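
To make the identifier form above concrete, the following minimal Python sketch assembles a PWID string from an archive identifier, an archiving time, and an archived URL. The function name `make_pwid`, the example URL, and the rendering of the UTC time as `YYYY-MM-DDThh:mm:ssZ` are assumptions made for illustration; they are not taken from the project's extraction tool.

```python
from datetime import datetime, timezone


def make_pwid(archive_id: str, archived_at: datetime, archived_url: str,
              coverage: str = "part") -> str:
    """Build a PWID of the form pwid:<archive>:<UTC time>:<coverage>:<URL>.

    Hypothetical helper: the timestamp layout (ISO 8601 with a trailing Z)
    is an assumption, not something stated in the abstract.
    """
    if archived_at.tzinfo is None:
        # A naive timestamp cannot be converted to UTC reliably, so refuse it.
        raise ValueError("archived_at must be timezone-aware")
    utc_time = archived_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"pwid:{archive_id}:{utc_time}:{coverage}:{archived_url}"


# Example with a made-up archiving time and URL.
example = make_pwid(
    "netarkivet.dk",
    datetime(2019, 6, 1, 12, 0, 0, tzinfo=timezone.utc),
    "http://example.dk/",
)
print(example)  # pwid:netarkivet.dk:2019-06-01T12:00:00Z:part:http://example.dk/
```

Requiring a timezone-aware timestamp is a deliberate choice in this sketch: it forces the caller to state which time zone the archive time was recorded in before the identifier is built.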

At first, we thought that generation of the web corpora specifications would be easy, since the extraction tool had created metadata for each web element in the corpora, including archived URL and time. However, a number of challenges arose when we went into the details:

- The timestamp in the final metadata for analysis was the crawl time and not the WARC time (archive time), which is the one used in PWIDs and in the Solr-index for the access tools. A temporary result containing the WARC time was recovered, although the time was stored in the CET time zone rather than in UTC. Transformation between time zones is trivial.

- Another issue with the recovered data was that the Solr-index does not have sufficient information about de-duplication to make PWIDs for de-duplicated elements. Complicated merges and joins had been performed on the Solr data to solve the problem, but only for pre-2011 data. So, even though this was only a matter of programming and compute time, more than the available time was needed. This will be completed in 2020. This is also another example of why Netarkivet would benefit from shifting to the use of revisit records.

Based on the persistent web corpora, the exact same metadata for the corpora can be produced as long as Netarkivet exists.
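
The time zone transformation mentioned in the first item of the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `cet_to_utc` and the input layout `YYYY-MM-DD hh:mm:ss` are hypothetical, and CET is treated as a fixed UTC+1 offset. If the stored values were in fact Danish local time including summer time, a zone-aware source such as `zoneinfo.ZoneInfo("Europe/Copenhagen")` would have to be passed instead.

```python
from datetime import datetime, timedelta, timezone, tzinfo

# Assumption: "CET" means a fixed offset of UTC+1 with no summer time.
CET = timezone(timedelta(hours=1), "CET")


def cet_to_utc(timestamp: str, source_tz: tzinfo = CET) -> datetime:
    """Parse a timestamp recorded in CET and return it as an aware UTC datetime.

    The input layout "YYYY-MM-DD hh:mm:ss" is assumed for illustration only.
    """
    naive = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    # Attach the source zone first, then convert the instant to UTC.
    return naive.replace(tzinfo=source_tz).astimezone(timezone.utc)


# A time shortly after midnight in CET falls on the previous day in UTC.
print(cet_to_utc("2010-01-15 00:30:00").isoformat())  # 2010-01-14T23:30:00+00:00
```

The example shows why the conversion matters for identifiers: the UTC value can differ from the stored value in its date as well as its hour, so a PWID built from the unconverted time would point at a different archiving instant.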

Netarkivet will also work on extensions to the SolrWayback tool to better support web collections defined by PWIDs, by offering search and rendering of web parts limited to a specified web collection.

Original language: English
Publication date: 14 Jun 2021
Number of pages: 1
Status: Published - 14 Jun 2021
Event: IIPC General Assembly & Web Archiving Conference 2021 - The National Library of Luxembourg/Virtual, Luxembourg, Luxembourg
Duration: 14 Jun 2021 to 16 Jun 2021
https://netpreserve.org/ga2021/

Conference

Conference: IIPC General Assembly & Web Archiving Conference 2021
Location: The National Library of Luxembourg/Virtual
Country/Territory: Luxembourg
City: Luxembourg
Period: 14/06/2021 to 16/06/2021

Keywords

  • web collection
  • data management
  • persistent reference
  • PWID
  • Netarkivet
  • Web Sphere
  • research data
