Long chains of revisit records still causing problems #73

Open
anjackson opened this issue Dec 21, 2021 · 4 comments

@anjackson
Contributor

anjackson commented Dec 21, 2021

Ideally, we would include revisit records at playback time, as they indicate when we visited a page even if the content did not change. As of PyWB 2.6.2, large chains of redirects still seem to cause problems, and it is not clear that the closest_limit is working as expected. See webrecorder/pywb#606

Not sure how to handle this. For now, skipping redirects from CDX queries.

The redirect_to_exact setting doesn't seem to be working now either.

@anjackson
Contributor Author

Ah, so the example I was looking at had a chain of over 300 warc/revisit entries before hitting the most recent copy of the GOV.UK robots.txt file. This is over the hard-coded limit of 100, which is why it didn't resolve. But even upping that, it's really slow.
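
To illustrate why a chain longer than the limit fails to resolve, here is a minimal sketch. It is a simplified model and not PyWB's actual lookup code: the `resolve_revisit` function, the record layout, and the backwards scan are all assumptions made for illustration. Only the `warc/revisit` type, the limit of 100, and the chain of roughly 300 entries come from the case described above.

```python
from typing import Dict, List, Optional

REVISIT = "warc/revisit"


def resolve_revisit(records: List[Dict], index: int, limit: int = 100) -> Optional[Dict]:
    """Scan backwards from records[index] for the capture a revisit points at.

    Hypothetical helper: it inspects at most `limit` earlier records and
    gives up if no non-revisit record with the same digest is found.
    """
    digest = records[index]["digest"]
    stop = max(index - limit, 0) - 1  # exclusive lower bound of the scan
    for i in range(index - 1, stop, -1):
        rec = records[i]
        if rec["mime"] != REVISIT and rec["digest"] == digest:
            return rec
    return None


# One original capture followed by 300 unchanged revisits (made-up data).
records = [{"timestamp": 0, "mime": "text/plain", "digest": "sha1:ABC"}]
records += [
    {"timestamp": t, "mime": REVISIT, "digest": "sha1:ABC"} for t in range(1, 301)
]

latest = len(records) - 1
print(resolve_revisit(records, latest, limit=100))  # prints None: original is out of reach
print(resolve_revisit(records, latest, limit=400))  # finds the original capture
```

In this model, raising the limit makes the lookup succeed, but every resolution then has to walk the whole chain, which is consistent with it being slow.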

@ikreymer
Contributor

Hm. This is caused specifically by revisit lookups or bouncing between 3xx redirects, or a combination of both?
Probably the main optimization is just to include the redirect URL in the CDXJ, especially in case of redirects.
If it is a chain of revisits that ends up just being a 200, then probably not much that can be done?
Perhaps something to discuss also in the context of reindexing?
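
As a rough sketch of the idea of storing the redirect URL in the CDXJ, the example below follows a redirect using only index entries, without opening any WARC records. The `redirect_to` field name, the `follow_redirects` helper, and the sample lines are hypothetical and are not an existing PyWB feature; only the general CDXJ layout (key, timestamp, JSON block) is assumed.

```python
import json


def parse_cdxj(line):
    """Split a CDXJ line into its key, timestamp and JSON fields."""
    urlkey, timestamp, blob = line.split(" ", 2)
    fields = json.loads(blob)
    fields["urlkey"] = urlkey
    fields["timestamp"] = timestamp
    return fields


def follow_redirects(index, url, max_hops=10):
    """Follow 3xx entries via the index alone.

    `redirect_to` is a hypothetical field holding the redirect target.
    """
    hops = [url]
    seen = {url}
    while len(hops) <= max_hops:
        entry = index.get(hops[-1])
        if entry is None or not entry.get("status", "").startswith("3"):
            break
        target = entry.get("redirect_to")
        if not target or target in seen:  # stop on missing target or loop
            break
        hops.append(target)
        seen.add(target)
    return hops


# Made-up sample data for illustration only.
lines = [
    'com,example)/old 20211221000000 {"url": "http://example.com/old", '
    '"status": "301", "redirect_to": "http://example.com/new"}',
    'com,example)/new 20211221000001 {"url": "http://example.com/new", '
    '"status": "200", "mime": "text/plain"}',
]
index = {entry["url"]: entry for entry in map(parse_cdxj, lines)}
print(follow_redirects(index, "http://example.com/old"))
```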

@anjackson
Contributor Author

It's the latter, I think. The only option would be to make closest_limit configurable so I can set it to some high value easily.

But it's not urgent, and arguably not needed in the playback service.

@anjackson anjackson self-assigned this Jan 4, 2022
@anjackson
Contributor Author

Current status is that I'm filtering out all revisit records at playback time. This is sub-optimal, as you can't see when pages were seen unchanged, but it can't be changed until this issue is resolved.
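
For illustration, filtering of this kind could look roughly like the sketch below. It is not the actual deployment configuration: the `drop_revisits` helper and the sample lines are assumptions, and it presumes CDXJ lines whose JSON block carries a `mime` field set to `warc/revisit` for revisit records.

```python
import json


def drop_revisits(cdxj_lines):
    """Yield only the CDXJ lines that are not revisit records."""
    for line in cdxj_lines:
        fields = json.loads(line.split(" ", 2)[2])
        if fields.get("mime") != "warc/revisit":
            yield line


# Made-up sample data: one full capture, one revisit, one later full capture.
lines = [
    'com,example)/robots.txt 20211201000000 {"mime": "text/plain", "status": "200"}',
    'com,example)/robots.txt 20211202000000 {"mime": "warc/revisit", "status": "200"}',
    'com,example)/robots.txt 20211203000000 {"mime": "text/plain", "status": "200"}',
]
kept = list(drop_revisits(lines))
print(len(kept))  # prints 2
```

The cost is visible in the sample: the capture from 20211202 disappears from the results, so the fact that the page was seen unchanged on that date is lost.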
