Automate web archive crawl accessioning pipeline #124

Open
aaron-collier opened this issue Dec 10, 2018 · 0 comments

From: https://jirasul.stanford.edu/jira/browse/DEVQUEUE-189

Once web archive crawl data is retrieved from Archive-It, a manifest must be created by hand and accessioning must then be run manually. An inventory of what has been accessioned must also be maintained by hand, to ensure that the same content isn't retrieved and accessioned a second time.

This work would extend the WASAPI local import utility (DEVQUEUE-14) to automate the rest of the crawl data accessioning. This would reduce the amount of effort needed to preserve our web archives and make them available for access through SWAP.

The level-of-effort estimate from @ndushay and @jmartin-sul (who both originally worked on the wasapi-downloader) is 3-4 developers for 4-6 weeks. Once DEVQUEUE-209 is completed, the services will largely be in place and will just need to be connected. It is still an open question, though, whether every manual step can be fully automated at this point without capturing additional information that isn't currently stored.
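
As a rough illustration of the manual steps described above (building a manifest and keeping an inventory so nothing is accessioned twice), here is a minimal Python sketch. Everything in it is an assumption for discussion: the JSON inventory file, the one-file-name-per-line manifest, and the function names are hypothetical, and none of it reflects the actual wasapi-downloader code or the manifest format the accessioning step expects.

```python
from __future__ import annotations

import json
from pathlib import Path


def load_inventory(inventory_path: Path) -> set[str]:
    """Return the names of crawl files already accessioned (empty if no inventory exists)."""
    if not inventory_path.exists():
        return set()
    return set(json.loads(inventory_path.read_text()))


def select_new_files(downloaded: list[str], inventory: set[str]) -> list[str]:
    """Keep only files not yet accessioned, preserving order and dropping repeats."""
    seen = set(inventory)
    new_files = []
    for name in downloaded:
        if name not in seen:
            seen.add(name)
            new_files.append(name)
    return new_files


def write_manifest(new_files: list[str], manifest_path: Path) -> None:
    """Write a hypothetical manifest: one crawl file name per line."""
    manifest_path.write_text("".join(f"{name}\n" for name in new_files))


def record_accessioned(new_files: list[str], inventory: set[str], inventory_path: Path) -> None:
    """Add the newly accessioned files to the inventory and save it as sorted JSON."""
    updated = sorted(inventory | set(new_files))
    inventory_path.write_text(json.dumps(updated, indent=2))
```

In this sketch, a run would call `load_inventory`, pass the downloaded file names through `select_new_files`, write the manifest, kick off accessioning, and only call `record_accessioned` once accessioning succeeds, so that a failed run does not mark content as done.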
