The single-concurrent lock system allows ukwa-pywb to 'lock' access to an individual url, such that only a single session can access the resource, in accordance with UK Legal Deposit restrictions. The restriction applies to 'top-level' pages only, and can be cleared manually. An admin interface is available to view all locked resources and sessions and to clear them as needed.
The system is implemented using a Redis server (or Redis mock-up) for cacheing, which provides key expiry support. The use of Redis ensures that session information is persisted across application restarts.
To enable the single-concurrent lock, set the single-use-lock: true
in the desired collections.
collection:
ukwa:
...
single-use-lock: true
The locks are per-url per-collection. (Embedded resources and AJAX requests are also excluded from the lock).
The REDIS_URL should be set to a url representing the redis server. A default setting for a local server on database 0 might be:
redis://localhost/0
.
If REDIS_URL is not set, an internal redis mock (using Python fakeredis
) will be created and a warning printed on startup.
It's possible to use the fakeredis mock, but the session data will not be persisted across restarts.
Configuring the Redis container with Docker Compose can be done as follows. Any recent version of Redis should work.
pywb:
...
depends_on:
- redis
environment:
- REDIS_URL=redis://redis:6379/0
redis:
image: redis:3.2.4
Redis automatically handles the expiration of session keys, clearing the locks. The expiry time is rounded up to the end of the next day by default.
For testing, it is possible to set a shorter expiry time by setting TEST_SESSION_LOCK_INTERVAL
environment variable.
For example, it is set to TEST_SESSION_LOCK_INTERVAL=30
to allow for testing expiry after 30 seconds.
This environment variable should not be set for production.
The locks can be expired more quickly via a client-side ping mechanism.
A client-side ping to /_locks/ping
is made to reset the expiry for the current page, based on the referrer.
If the LOCK_PING_EXTEND_TIME
env var is set, and the current session owns the referring page, the expiry is set
to LOCK_PING_EXTEND_TIME
seconds into the future.
The client-side ping is performed every LOCK_PING_INTERVAL
seconds (defaulting to 10)
The SELECT_WORD_LIMIT
env variable can be set to limit selection to a specified number of words
to restrict clipboard copying.
On modern browsers that support the Clipboard APIs, the restriction is made at the time of copying the text.
As a fallback and on other browsers, a restriction is applied when selecting the text. In this mode,
the selection is shrunk until it is no more than SELECT_WORD_LIMIT
words.
As a worse case, the clipboard copy operation fails altogether instead of allowing unrestricted copy.
Extra headers can be added to all responses via the main config.yaml
collection:
ukwa:
...
add_headers:
Cache-Control: 'max-age=0, no-cache, must-revalidate, proxy-revalidate, private'
Expires: 'Thu, 01 Jan 1970 00:00:00 GMT'
This option can apply to all collections, not only those that have single-use-lock: true
set.
The following config option allows for intercepting certain content types and issueing a redirect to a designated page.
collection:
ukwa:
...
content_type_redirects:
'text/rtf': 'https://example.com/viewer?{query}'
'application/pdf': 'https://example.com/viewer?{query}'
'application/': 'https://example.com/blocked?{query}'
'<any-download>': 'https://example.com/blocked?{query}'
With this config, any text/rtf
resource or application/pdf
resources encountered during replay are redirected to https://example.com/viewer?{query}
,
where the {query}
expands to the pywb url for downloading the raw/identity resource (with the id_
modifier).
(The raw urls are not redirected to allow a viewer to access the data).
For example, a request to https://myarchive.example.com/ukwa/2020010203mp_/http://example.com/myfile.rtf
might result in a redirect to http://example.com/viewer?url=http://myarchive.example.com/ukwa/2020010203id_/http://example.com/myfile.rtf
If an exact match is not found, the MIME prefix, such as application/
is also checked to allow for restricting a class of mime types.
To prevent arbitrary downloads, the <any-download>
match is made for any response that contains a Conteent-Disposition: attachment...
regardless of mime type.
For example, a text/plain
document with Content-Dispositon
would get redirected to https://example.com/blocked?...
The content_type_redirects
field is available to all collections, not only those that have single-use-lock: true
set.
The single-use lock does apply to the initial url, and will result in a 403 instead of a redirect if the lock can not be acquired. The lock will last for the full session expiry time (end of the day), as there is no client-side ping happening for redirects.
The admin page to view all locked urls and user sessions is available at http://<pywb-host>/_locks
.
If any sessions exist, it provides urls to clear all sessions, clear all urls in a session, clear individual url, or clear current
user session (logout).
The following routes are provided:
/_locks
-- Admin Page/_locks/reset
-- Clear All Sessions/_locks/clear/<id>
-- Clears session/_locks/clear_url/<url>
-- Clears locked url/_logout
-- Clears current session, logout current user
Access to the admin API paths (except /_logout
) can be restricted by defining the
LOCKS_AUTH
environment in the username:password form, eg. LOCKS_AUTH=admin:password
If LOCKS_AUTH
is properly set, access to any of the /_locks*
paths will return a 401 unless the username and password are provided as Basic Auth.
When accessing a locked resource, a 403 error code is returned. When accessing an admin page without proper credentials, a 401 error code is returned.
These are used to set customized error messages in the error.html template.
The template can be customized further as needed.