Automate web archive crawl accessioning pipeline #124

aaron-collier · 2018-12-10T16:14:55Z

From: https://jirasul.stanford.edu/jira/browse/DEVQUEUE-189

Once web archive crawl data is retrieved from Archive-It, a manifest must be created manually and accessioning must then be run manually. An inventory of what has been accessioned must also be maintained by hand, to ensure that the same content isn't retrieved and accessioned twice.

This work would extend the WASAPI local import utility (DEVQUEUE-14) to automate the rest of the crawl data accessioning. This would reduce the amount of effort needed to preserve our web archives and make them available for access through SWAP.
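As a rough illustration of the manifest and inventory steps that would need automating, here is a minimal sketch. It assumes the inventory is a JSON list of crawl filenames and that the manifest is simply the list of files not yet accessioned. The function names, the JSON format, and the filename-based duplicate check are all hypothetical and do not reflect how the WASAPI utility or the accessioning services actually work.

```python
import json
from pathlib import Path


def load_inventory(path: Path) -> set[str]:
    """Return filenames already accessioned (empty set if no inventory file exists)."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def build_manifest(retrieved: list[str], inventory: set[str]) -> list[str]:
    """Keep only files not yet accessioned, preserving order and dropping repeats."""
    manifest = []
    seen = set(inventory)
    for name in retrieved:
        if name not in seen:
            manifest.append(name)
            seen.add(name)
    return manifest


def record_accessioned(path: Path, inventory: set[str], manifest: list[str]) -> None:
    """Persist the updated inventory; intended to run only after accessioning succeeds."""
    path.write_text(json.dumps(sorted(inventory | set(manifest)), indent=2))
```

A run would load the inventory, build the manifest from the newly retrieved files, hand the manifest to accessioning, and only then call `record_accessioned`, so a failed run does not mark content as preserved.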

The level-of-effort estimate from @ndushay and @jmartin-sul (who both originally worked on the wasapi-downloader) is 3-4 developers for 4-6 weeks. Once DEVQUEUE-209 is completed, the services will largely be in place and will only need to be connected. It is still an open question, however, whether every manual step can be fully automated without capturing additional information that isn't currently stored.

aaron-collier added the Maint P2 label Dec 10, 2018

ndushay removed the Maint P2 label May 13, 2022
