Making web collections for research Sustainable & Reusable: Possibilities and Challenges experienced

Eld Zierau, Per Møldrup-Dalum

Research output: Contribution to conference › Conference abstract for conference › Research › peer-review

Abstract

This presentation concerns the lessons learned from the work of ensuring the persistency of web corpora (web collections) for later reuse and result verification.

The experiences were gained when finalizing the project “Probing a Nation's Web Domain” (presented at IIPC-2016) which looked at changes in the Danish web archive (Netarkivet) over time.

Persistency of the web collections is essential for future use of the project results. The only way to reconstruct or create new statistical results on the basis of the corpora would be to run the extraction tool developed in the project. However, the architecture around Netarkivet may change (requiring changes to the extraction tool) and data in Netarkivet may be enriched. In both cases, it is uncertain whether a new extract will be the same as the original one, even over a short three-year horizon.

Persistency of the web corpora was achieved by following the recommendations of the RESAW-2017 paper “Data Management of Web Archive Research Data”. This includes the use of Persistent Web Identifiers (PWIDs) for each corpus element (for Netarkivet, in the form pwid:netarkivet.dk:<UTC archiving time>:part:<archived URL>). In general, Netarkivet recommends the use of PWIDs for references to the materials. Previously, the recommendation was to refer to file/offset, but this broke when Netarkivet data was migrated to compressed files. Furthermore, we foresee that traditional archive URLs will break soon, since changes are planned for the access tools.
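
As an illustration of the PWID form given above, a minimal Python sketch might look like the following. The function name make_pwid, its parameters, and the example URL are hypothetical, and the ISO 8601 rendering of the UTC archiving time (with a trailing Z) is an assumption; the abstract only states that the time is in UTC.

```python
from datetime import datetime, timezone


def make_pwid(archive_id: str, archived_at: datetime, archived_url: str,
              coverage: str = "part") -> str:
    """Build a PWID string of the form
    pwid:<archive>:<UTC archiving time>:<coverage>:<archived URL>.

    Assumption: the UTC time is rendered as YYYY-MM-DDThh:mm:ssZ.
    """
    if archived_at.tzinfo is None:
        # Refuse naive timestamps: the time zone must be known to get UTC right.
        raise ValueError("archived_at must be timezone-aware")
    utc_time = archived_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"pwid:{archive_id}:{utc_time}:{coverage}:{archived_url}"


# Hypothetical corpus element, for illustration only.
pwid = make_pwid(
    "netarkivet.dk",
    datetime(2016, 1, 22, 11, 20, 29, tzinfo=timezone.utc),
    "http://www.example.dk/",
)
print(pwid)  # pwid:netarkivet.dk:2016-01-22T11:20:29Z:part:http://www.example.dk/
```

Rejecting timestamps without a time zone is a deliberate choice in this sketch: a silently misread zone would yield an identifier that points at a different archiving time.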

At first, we thought that generation of the web corpora specifications would be easy, since the extraction tool had created metadata for each web element in the corpora, including archived URL and time. However, a number of challenges arose when we went into the details:

- The timestamp in the final metadata for analysis was the crawl time and not the WARC time (archive time), which is the one used in PWIDs and in the Solr-index for the access tools. An intermediate result with the WARC time was recovered, although the time was stored in the CET time zone and not in UTC. Converting between the two time zones is trivial.

- Another issue with the recovered data was that the Solr-index does not have sufficient information about de-duplication to make PWIDs for de-duplicated elements. Complicated merges and joins had been performed on the Solr data to solve the problem, but only for pre-2011 data. So, even though this was only a matter of programming and compute time, more time was needed than was available. This will be completed in 2020. This is also another example of why Netarkivet would benefit from shifting to the use of revisit records.

Based on the persistent web corpora, exactly the same metadata for the corpora can be produced as long as Netarkivet exists.
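
The time zone conversion mentioned in the first challenge above can be sketched in Python as follows. This is an illustration only: the function name cet_to_utc and the example timestamp are hypothetical, and the sketch assumes that "CET" means a fixed offset of UTC+1. If the stored times actually followed Danish local time including daylight saving, a zone such as zoneinfo.ZoneInfo("Europe/Copenhagen") would be needed instead of the fixed offset.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CET is treated as a fixed UTC+1 offset (no daylight saving).
CET = timezone(timedelta(hours=1), name="CET")


def cet_to_utc(naive_cet: datetime) -> datetime:
    """Interpret a naive timestamp as CET and return the same instant in UTC."""
    if naive_cet.tzinfo is not None:
        raise ValueError("expected a naive timestamp recorded in CET")
    return naive_cet.replace(tzinfo=CET).astimezone(timezone.utc)


# Hypothetical WARC time recorded in CET, shortly after midnight.
example = cet_to_utc(datetime(2010, 3, 1, 0, 30, 0))
print(example.isoformat())  # 2010-02-28T23:30:00+00:00
```

The example shows why the conversion matters even though it is simple: a timestamp just after midnight CET falls on the previous date in UTC, so an unconverted value would produce a PWID with the wrong archiving time.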

Netarkivet will also work on extensions to the SolrWayback tool to better support web collections defined by PWIDs, by offering search and rendering of web parts limited to a specified web collection.
Translated title of the contribution: Skabelse af forsknings web-samlinger der er holdbare og genbrugelige: Erfarede muligheder og udfordringer
Original language: English
Publication date: 14 Jun 2021
Number of pages: 1
Publication status: Published - 14 Jun 2021
Event: IIPC General Assembly & Web Archiving Conference 2021 - The National Library of Luxembourg/Virtual, Luxembourg, Luxembourg
Duration: 14 Jun 2021 → 16 Jun 2021
https://netpreserve.org/ga2021/

Conference

Conference: IIPC General Assembly & Web Archiving Conference 2021
Location: The National Library of Luxembourg/Virtual
Country/Territory: Luxembourg
City: Luxembourg
Period: 14/06/2021 → 16/06/2021

Keywords

  • web collection
  • research data management
  • persistent reference
  • PWID
  • Netarkivet
  • Web Sphere
