This folder contains the components used for access to our web archives. It's made up of a number of separate stacks, with the first, 'Access Data', providing support for the others.
These services can be deployed in different contexts (dev/beta/prod/etc.) but in all cases are designed to run (read-only!) against:
- The TrackDB, which knows where all the WARCs are and provides WebHDFS URLs for them.
- The WebHDFS API, that serves the WARC records from each HDFS cluster.
- The OutbackCDX API, that links URL searches to the WARCs that contain the records for that URL.
- The Solr full-text search API(s), that indicates which URLs contain which search terms.
- The Prometheus Push Gateway metrics API, that is used to monitor what's happening.
These are defined in the stack launch scripts, and can be changed as needed for each deployment context.
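In practice these endpoints are passed into the stacks as environment variables by the launch scripts. As a rough illustration only (the variable names and hosts below are made up for this example, not the actual values used in any deployment context):

```bash
#!/bin/sh
# Illustrative sketch of per-context endpoint configuration; the real launch
# scripts use their own variable names and internal hostnames.
export DEPLOYMENT_CONTEXT=dev
export TRACKDB_URL=http://trackdb.example.internal:8983/solr/tracking
export WEBHDFS_PREFIX=http://webhdfs.example.internal:14000/webhdfs/v1
export CDX_SERVER=http://cdx.example.internal:8080/fc
export SOLR_FULLTEXT_URL=http://solr.example.internal:8983/solr/fulltext
export PROMETHEUS_PUSH_GATEWAY=http://monitor.example.internal:9091
```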
The access service also depends on a number of data files, generated by the `w3act_export` workflow run under Apache Airflow. These include:
- The `allows.aclj` and `blocks.aclj` files needed by the pywb access control system. The `allows.aclj` file is generated from the data in W3ACT, based on the license status. The `blocks.aclj` file is managed in GitLab, and is downloaded from there.
- The `allows.txt` and `annotations.json` files needed for full-text Solr indexing.
- The data used to populate the Collections Solr index that generates the Topics & Themes pages of the UKWA website.
The web site part is designed to be run behind an edge server that handles the SSL/non-SSL transition and proxies the requests downstream. More details are provided in the relevant Deployment section.
The access_website stack runs the services that actually provide the end-user website for https://www.webarchive.org.uk/ or https://beta.webarchive.org.uk/ or https://dev.webarchive.org.uk.
The stack is deployed using:
cd website/
./deploy-access-website.sh dev
As with the data stack, this script must be set up to cover the variations across deployment contexts. For example, the DEV version is password-protected, and the script is configured to pick this up from our internal repository.
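For orientation, the core of a deploy script like this is usually just a `docker stack deploy` of the compose file, with the context-specific settings loaded first. The sketch below is an outline under stated assumptions, not the actual contents of `deploy-access-website.sh` (the env file name is hypothetical):

```bash
#!/bin/sh
# Rough outline only; see deploy-access-website.sh for the real logic.
CONTEXT=$1  # dev / beta / prod
if [ -z "$CONTEXT" ]; then
    echo "Usage: $0 <dev|beta|prod>"
    exit 1
fi
# Load context-specific settings (hypothetical filename):
. "./${CONTEXT}.env"
# Deploy under the stack name used elsewhere in this document:
docker stack deploy -c docker-compose.yml access_website
```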
NOTE that this website stack generates and caches images of archived web pages, and hence will require a reasonable amount of storage for this cache (see below for details).
The website is designed to be run behind a boundary web proxy that handles SSL etc. To make use of this stack of services, the server that provides e.g. `dev.webarchive.org.uk` will need to be configured to point to the right API endpoint, which by convention is `website.dapi.wa.bl.uk`.
The set of current proxies and historical redirects associated with the website is now contained in the internal nginx.conf. This sets up a service on port 80 where all the site components can be accessed. Once running, the entire system should be exposed properly via the API gateway. For example, for accessing the dev system we want `website.dapi.wa.bl.uk` to point to `dev-swarm-members:80`.
Because most of the complexity of the NGINX setup is in the internal NGINX, the proxy setup at the edge is much simpler. For example, for DEV, the external-facing NGINX configuration looks like:
location / {
    # Used to tell downstream services what external host/port/etc. is:
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Forwarded-Host $host;
    proxy_set_header X-Forwarded-Port $server_port;
    proxy_set_header X-Forwarded-For $remote_addr;
    # Used for rate-limiting Mementos lookups:
    proxy_set_header X-Real-IP $remote_addr;
    proxy_pass http://website.dapi.wa.bl.uk/;
}
(Internal users can see the `dev_443.conf` setup for details.)
The internal NGINX configuration is more complex, merging together the various back-end systems and passing on the configuration as appropriate. For example, the configuration for the public PyWB service includes:
uwsgi_param UWSGI_SCHEME $http_x_forwarded_proto;
uwsgi_param SCRIPT_NAME /wayback;
The service picks up the host name from the standard HTTP `Host` header, but here we add the scheme (http/https, passed from the upstream NGINX server via the `X-Forwarded-Proto` header) and fix the deployment path using the `SCRIPT_NAME` CGI variable.
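One quick way to sanity-check that the scheme and path are being passed through correctly is to look at the headers pywb sends back through the proxy chain; the URL below is purely illustrative:

```bash
# Any Location headers issued by pywb should use https:// and stay under
# /wayback/, rather than leaking the internal container scheme or path.
curl -sI "https://dev.webarchive.org.uk/wayback/https://www.bl.uk/" | grep -i '^location'
```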
Having set this chain up, if we visit e.g. `dev.webarchive.org.uk` the traffic should show up on the API server as well as the Docker container.
NOTE that changes to the internal NGINX configuration are only picked up when it starts, so it is necessary to run:
docker service update --force access_website_nginx
After this, NGINX should restart, pick up any configuration changes, and re-check whether it can connect to the proxied services inside the stack.
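To confirm the restart went through and the configuration was accepted, the standard Docker Swarm commands can be used, e.g.:

```bash
docker service ps access_website_nginx                 # check the new task is running
docker service logs --tail 50 access_website_nginx     # look for configuration errors
```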
Because the chain of proxies is quite complicated, we also add a `Via` header at each layer, e.g.
# Add header for tracing where issues occur:
add_header Via $hostname always;
This adds a hostname for every successful proxy request, so the number of `Via` headers and their values can be used to trace problems with the proxies.
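For example, a quick way to see how many proxy layers a request passed through is to inspect the response headers (hostnames will differ per deployment):

```bash
curl -sI "https://dev.webarchive.org.uk/" | grep -i '^via'
```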
Behind the NGINX, we have a set of modular components:
- The ukwa-ui service that provides the main user interface.
- The ukwa-pywb service that provides access to archived web pages.
- The mementos service that allows users to look up URLs via Memento.
- The shine and shinedb services that provide our older prototype researcher interface.
- The ukwa-access-api and related services (pywb-nobanner, webrender-api, Cantaloupe) that provide API services.
- The API services include a caching image server (Cantaloupe) that takes rendered versions of archived websites and exposes them via the standard IIIF Image API. This will need substantial disk space (~1TB).
- The Crawl Log Analyser: the `analyse` service connects to the Kafka crawl log of the frequent crawler, and aggregates statistics on recent crawling activity. This is summarised into a regularly-updated JSON file that the UKWA Access API part of the website stack makes available for users. This is used by the https://ukwa-vis.glitch.me/ live crawler glitch experiment.
Shine requires a PostgreSQL database, so additional setup is required using the scripts in ./scripts/postgres.
When modifying the database, and having deployed the stack, you first need to stop Shine itself from running; otherwise it will attempt to start up and insert an empty database into PostgreSQL, which will interfere with the restore process. So, use:
$ docker service scale access_website_shine=0
This will drop the Shine service but leave all the rest of the stack running.
Within `scripts/postgres/` there are three helper scripts:
- `create-db.sh`
- `create-user.sh`
- `list-db.sh`

You can run `create-db.sh` to create the database itself. Then run `create-user.sh`, which runs the `setup_user.sql` script and sets up a suitable user with access to the database. Use `list-db.sh` to check the database is there at this point.
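Put together, the setup step looks something like this (assuming the scripts are run from the folder that contains them and can reach the PostgreSQL service):

```bash
cd scripts/postgres/
./create-db.sh     # create the Shine database
./create-user.sh   # runs setup_user.sql to create a suitable user
./list-db.sh       # confirm the database is present
```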
To do a restore, you need to grab a database dump from HDFS, using two further scripts:
- `download-shine-db-dump.sh`
- `restore-shine-db-from-dump.sh`

Currently, the backups are dated and kept in the HDFS `/2_backups/access/access_shinedb/` folder, so you'll need to edit `download-shine-db-dump.sh` to use the most recent dated version, then run it to actually get the database dump. Running `restore-shine-db-from-dump.sh` should then populate the database.
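In other words, the restore sequence looks roughly like this (the dump dates available on HDFS will vary):

```bash
# First edit download-shine-db-dump.sh so it points at the most recent dated
# dump under /2_backups/access/access_shinedb/ on HDFS, then:
./download-shine-db-dump.sh        # fetch the chosen dump from HDFS
./restore-shine-db-from-dump.sh    # load the dump into PostgreSQL
```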
Once you have created and restored the database as needed, re-scale the service and Shine will restart using the restored database.
$ docker service scale access_website_shine=1
An additional helper script, `backup-shine-db-to-hdfs.sh`, will download a dated dump file of the live database and push it to HDFS:
./backup-shine-db-to-hdfs.sh dev
This should be run daily, giving an early-morning backup of the Shine database.
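This is assumed to be handled by a cron job; a hedged example of the kind of crontab entry that would do it (the path and exact schedule here are placeholders, not the actual deployment values):

```bash
# m  h  dom mon dow  command
30   5  *   *   *    cd /path/to/ukwa-services/access/website/scripts && ./backup-shine-db-to-hdfs.sh dev
```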
The website regression tests can be used to ensure core functionality is in place. See the parent folder for details.
Once all of the above has been deployed, the cron jobs mentioned above should be in place.
The `ukwa-monitor` service should be used to check that these are running, and that the W3ACT database export file on HDFS is being updated.
...monitoring setup TBC...