Implement memento api #408

Open
wants to merge 86 commits into
base: master

VictorHarbo
Collaborator

Implements the Memento framework at the endpoint services/memento.
Two features are central to Memento:
DatetimeNegotiation and Timemaps.

DatetimeNegotiation can be implemented in several ways. I've implemented Pattern 2.2, which is recommended for web archives: https://www.rfc-editor.org/rfc/rfc7089.html#page-24

Pattern 2.1 can be selected instead through the property memento.redirect.
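
To illustrate what Pattern 2.1 involves, here is a minimal sketch of the negotiation step: parse the Accept-Datetime header, pick the closest capture, and build the headers of the 302 response. The class `TimeGateSketch`, its method names, and the base URLs passed in are hypothetical and not taken from this PR; only the header names and the date format follow RFC 7089.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not the code in this PR. */
final class TimeGateSketch {

    // Wayback-style 14-digit timestamp; assumed here for building playback URLs.
    private static final DateTimeFormatter WAYBACK =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss").withZone(ZoneOffset.UTC);

    /** Parses an Accept-Datetime value such as "Thu, 31 May 2007 20:35:00 GMT". */
    static Instant parseAcceptDatetime(String header) {
        return ZonedDateTime.parse(header, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
    }

    /** Returns the capture whose time is closest to the requested time. */
    static Instant closest(List<Instant> captures, Instant requested) {
        return captures.stream()
                .min(Comparator.comparing((Instant c) -> Duration.between(c, requested).abs()))
                .orElseThrow();
    }

    /** Builds the headers of a Pattern 2.1 TimeGate response (302 redirect, no body). */
    static Map<String, String> redirectHeaders(String playbackBase, String timemapBase,
                                               String originalUrl, Instant memento) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Location", playbackBase + WAYBACK.format(memento) + "/" + originalUrl);
        headers.put("Vary", "accept-datetime");
        headers.put("Link", "<" + originalUrl + ">; rel=\"original\", "
                + "<" + timemapBase + originalUrl + ">; rel=\"timemap\"; type=\"application/link-format\"");
        return headers;
    }
}
```

In a servlet or JAX-RS resource these headers would be set on a response with status 302 and no entity.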

Timemaps can be delivered in two formats, link-type and JSON, as specified in the Memento specification: http://mementoweb.org/guide/rfc/#Pattern6
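
For reference, a link-type timemap is a comma-separated list of link entries. The sketch below shows roughly how such a document could be assembled. `LinkTimemapSketch` and its methods are hypothetical names, and the date pattern is spelled out explicitly because `DateTimeFormatter.RFC_1123_DATE_TIME` does not zero-pad the day of month.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration only; not the code in this PR. */
final class LinkTimemapSketch {

    // Fixed-width HTTP date in GMT, e.g. "Tue, 03 Jun 2008 11:05:30 GMT".
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
                    .withZone(ZoneOffset.UTC);

    /** Formats one memento entry of a link-type timemap. */
    static String mementoLine(String mementoUri, Instant captured) {
        return "<" + mementoUri + ">; rel=\"memento\"; datetime=\"" + HTTP_DATE.format(captured) + "\"";
    }

    /** Joins the original, self and memento entries into one link-format document. */
    static String timemap(String originalUrl, String timemapUri, List<String> mementoLines) {
        List<String> lines = new ArrayList<>();
        lines.add("<" + originalUrl + ">; rel=\"original\"");
        lines.add("<" + timemapUri + ">; rel=\"self\"; type=\"application/link-format\"");
        lines.addAll(mementoLines);
        return String.join(",\n", lines);
    }
}
```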

Closes #42

Thomas Egense added 8 commits December 26, 2023 10:15
Code compiles, but 3 unittest still fails because the expect "localhost
as return from the embedded solr, but it is hostname instead.
happens to be a solrwaybackweb.properties in home.

added line to initialize webproperties.
Minor unittest improvement+refactoring
does not try to access warc-files for payload by setting playback
allowed.
that expected localhost:8080 but on my maven build machine it
was <hostname>:8080 instead.
Contributor

@thomasegense thomasegense left a comment

The hard part of this task is figuring out what the API really is.
Good job with all the unit tests.

The "timegate" (direct playback linking) part of the API works and requires no further work,
but only for the 2.1 spec with a 302 redirect. I have removed the payload, since the
payload of a 302 is never read. So this is a simple redirect to the playback page.
There is no need to also support the 2.2 spec. The current solution of returning the
playback payload directly does not even work, even though it is the correct payload.
The reason is that the playback API must run under solrwayback/services/web/ and not solrwayback/services/memento. All link handling (root servlet, service worker, referrer fixing, etc.) requires this specific URL.
If we want to support 2.2 (which there is no need for), the only solution is to do as PyWb does
and return a minimal HTML page with a single frame that points to the playback URL (as constructed in 2.1). This would also avoid mixing Memento header fields with important playback header fields. (But no need to implement!)
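
Just to make the frame idea concrete, a rough sketch of such a wrapper page could look like this. `FramePageSketch` and the markup are hypothetical and not copied from PyWb; the Memento headers would be set on the response that serves this page, while the framed playback URL keeps its own headers.

```java
/** Hypothetical helper for illustration only; not PyWb or SolrWayback code. */
final class FramePageSketch {

    /** Escapes a value for use inside a double-quoted HTML attribute. */
    static String escapeAttribute(String value) {
        return value.replace("&", "&amp;")
                .replace("\"", "&quot;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    /** Returns a minimal HTML page whose only content is a frame showing the playback URL. */
    static String framePage(String playbackUrl) {
        return "<!DOCTYPE html>\n"
                + "<html>\n"
                + "<head><title>Memento</title></head>\n"
                + "<body style=\"margin:0\">\n"
                + "<iframe src=\"" + escapeAttribute(playbackUrl)
                + "\" style=\"border:0;width:100%;height:100vh\"></iframe>\n"
                + "</body>\n"
                + "</html>\n";
    }
}
```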

But for the timemap API I think a few fixes to the output are required.

Timemap, link
Compared the two responses from PyWB and SW:

https://solrwb-test.kb.dk:4000/solrwayback/services/memento/timemap/link/http://prak10k.dk/?page_id=13
https://pywb-test.kb.dk/myindex/timemap/link/http://prak10k.dk/?page_id=13
Apart from the order of the lines, there is one difference:
rel="memento" vs rel="first memento"
Also, the collection name is missing.
But why so few results in SolrWayback when I compared?
https://solrwb-test.kb.dk:4000/solrwayback/services/memento/timemap/link/http://news.dk/
https://pywb-test.kb.dk/myindex/timemap/link/http://news.dk/

Timemap, json
These two responses are very different:
https://solrwb-test.kb.dk:4000/solrwayback/services/memento/timemap/json/http://news.dk/
https://pywb-test.kb.dk/myindex/timemap/json/http://news.dk/

@VictorHarbo
Collaborator Author

VictorHarbo commented Feb 1, 2024

I am looking at this this evening. I don't have access to the test servers at KB. Maybe you can help me gain access tomorrow.

For once I found a good description of the API: here
Addressing the timemap link comments:

  • When examining the description of a timemap, I see that the "correct" way is actually to include the "first memento" and "last memento" in the timemap. Why PyWb doesn't do this, I don't know. Should we do it the PyWb way or follow the spec (see Figure 28)?
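
If we follow the spec, the rel value of each entry could be derived from its position in the chronologically sorted list, roughly like this. `MementoRelSketch` is a hypothetical name, and the sketch assumes the captures are already sorted oldest first.

```java
/** Hypothetical helper for illustration only; assumes mementos are sorted oldest first. */
final class MementoRelSketch {

    /** Returns the rel value for the memento at the given position in a list of the given size. */
    static String relFor(int index, int size) {
        if (index < 0 || index >= size) {
            throw new IllegalArgumentException("index out of range: " + index);
        }
        boolean first = index == 0;
        boolean last = index == size - 1;
        if (first && last) {
            return "first last memento"; // a single capture is both first and last
        }
        if (first) {
            return "first memento";
        }
        if (last) {
            return "last memento";
        }
        return "memento";
    }
}
```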

I will look at the timemap json and link collections when I get access to the test server as this makes comparing way easier.

@VictorHarbo
Collaborator Author

Looking into this today. The timemap link implementation in SolrWayback pages its results, while PyWb doesn't; that's why the results look different there. I'll change the page size from 2, which it seems to be right now. I'm thinking of making it 20 or something like that.
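
For clarity, this is the kind of paging I mean, as a generic sketch rather than the actual SolrWayback code. `TimemapPagingSketch` is a hypothetical name and the page size is just a parameter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not the paging code in this PR. */
final class TimemapPagingSketch {

    /** Splits the items into consecutive pages of at most pageSize entries each. */
    static <T> List<List<T>> pages(List<T> items, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < items.size(); start += pageSize) {
            int end = Math.min(start + pageSize, items.size());
            result.add(List.copyOf(items.subList(start, end)));
        }
        return result;
    }
}
```

With a page size of 2, a URL with 45 captures is spread over 23 pages; with 20 it fits in 3.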

I'll do a deeper dive into the JSON format, as the two responses are completely different.

Development

Successfully merging this pull request may close these issues.

Support memento API
2 participants