The Access Stacks

Introduction

This folder contains the components used for access to our web archives. It's made up of a number of separate stacks, with the first, 'Access Data', providing support for the others.

Integration Points

These services can be deployed in different contexts (dev/beta/prod/etc.) but in all cases are designed to run (read-only!) against:

  • The TrackDB, which knows where all the WARCs are and provides WebHDFS URLs for them.
  • The WebHDFS API, that serves the WARC records from each HDFS cluster.
  • The OutbackCDX API, that links URL searches to the WARCs that contain the records for that URL.
  • The Solr full-text search API(s), that indicates which URLs contain which search terms.
  • The Prometheus Push Gateway metrics API, that is used to monitor what's happening.

These endpoints are defined in the stack launch scripts, and can be changed as needed for each deployment context.
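
As an illustration of how these pieces fit together, the sketch below resolves a URL to a WARC record via OutbackCDX and then fetches the raw bytes over WebHDFS. The hostnames, ports, collection name and file path are assumptions for the example, not the deployed values:

    # 1. Ask OutbackCDX which WARC holds a capture of the URL (collection name is assumed):
    curl "http://outbackcdx.example:8080/mycollection?url=http://example.com/&limit=1"
    # The CDX line returned names the WARC file, plus the record offset and length.

    # 2. Fetch that record through WebHDFS (the WARC's HDFS path comes from the TrackDB):
    curl -L "http://webhdfs.example:14000/webhdfs/v1/path/to/example.warc.gz?op=OPEN&offset=123456&length=7890"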

The access service also depends on a number of data files, generated by the w3act_export workflow run under Apache Airflow. These include:

  • The allows.aclj and blocks.aclj files needed by the pywb access control system (see the example lines after this list). The allows.aclj file is generated from the data in W3ACT, based on the license status. The blocks.aclj file is managed in GitLab, and is downloaded from there.
  • The allows.txt and annotations.json files needed for full-text Solr indexing.
  • The data used to populate the Collections Solr index that generates the Topics & Themes pages of the UKWA website.
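
For reference, pywb .aclj access control files pair a SURT-prefixed key with a small JSON rule on each line, kept sorted so that the most specific prefix match wins. The entries below are purely illustrative; check the pywb documentation for the exact syntax of the generated files:

    com,example)/some/blocked/path - {"access": "block", "url": "http://example.com/some/blocked/path"}
    com,example)/ - {"access": "allow", "url": "http://example.com/"}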

The web site part is designed to be run behind an edge server that handles the SSL/non-SSL transition and proxies the requests downstream. More details are provided in the relevant Deployment section.

The Website Stack

The access_website stack runs the services that actually provide the end-user website at https://www.webarchive.org.uk/, https://beta.webarchive.org.uk/ or https://dev.webarchive.org.uk/.

Deployment

The stack is deployed using:

cd website/
./deploy-access-website.sh dev

As with the data stack, this script must be set up to handle the variations across deployment contexts. For example, the DEV version is password protected and is configured to pick this up from our internal repository.

NOTE that this website stack generates and caches images of archived web pages, and hence will require a reasonable amount of storage for this cache (see below for details).

NGINX Proxies

The website is designed to be run behind a boundary web proxy that handles SSL etc. To make use of this stack of services, the server that provides e.g. dev.webarchive.org.uk will need to be configured to point to the right API endpoint, which by convention is website.dapi.wa.bl.uk.

The set of current proxies and historical redirects associated with the website are now contained in the internal nginx.conf. This sets up a service on port 80 where all the site components can be accessed. Once running, the entire system should be exposed properly via the API gateway. For example, for accessing the dev system we want website.dapi.wa.bl.uk to point to dev-swarm-members:80.
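
Before the API gateway mapping is in place, it can be useful to confirm that the internal NGINX is answering on port 80 directly from within the network; a quick check along these lines (using the dev swarm hostname mentioned above):

    # Expect an HTTP status code back from the internal NGINX on the swarm nodes:
    curl -s -o /dev/null -w "%{http_code}\n" http://dev-swarm-members:80/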

Because most of the complexity of the NGINX setup is in the internal NGINX, the proxy setup at the edge is much simpler. e.g. for DEV, the external-facing NGINX configuration looks like:

    location / {
        # Used to tell downstream services what external host/port/etc. is:
        proxy_set_header        Host                    $host;
        proxy_set_header        X-Forwarded-Proto       $scheme;
        proxy_set_header        X-Forwarded-Host        $host;
        proxy_set_header        X-Forwarded-Port        $server_port;
        proxy_set_header        X-Forwarded-For         $remote_addr;
        # Used for rate-limiting Mementos lookups:
        proxy_set_header        X-Real-IP               $remote_addr;
        proxy_pass              http://website.dapi.wa.bl.uk/;
    }

(Internal users can see the dev_443.conf setup for details.)

The internal NGINX configuration is more complex, merging together the various back-end systems and passing on the configuration as appropriate. For example, the configuration for the public PyWB service includes:

            uwsgi_param UWSGI_SCHEME $http_x_forwarded_proto;
            uwsgi_param SCRIPT_NAME /wayback;

The service picks up the host name from the standard HTTP Host header, but here we add the scheme (http/https, passed from the upstream NGINX server via the X-Forwarded-Proto header) and fix the deployment path using the SCRIPT_NAME CGI variable.
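
For context, these directives sit inside a location block that passes requests to the pywb container over uwsgi. A minimal sketch is shown below; the upstream service name and port are assumptions, not the values from the internal nginx.conf:

    location /wayback {
        # Sketch only: 'ukwa-pywb:8081' is an assumed service name and port.
        include     uwsgi_params;
        uwsgi_pass  ukwa-pywb:8081;
        uwsgi_param UWSGI_SCHEME $http_x_forwarded_proto;
        uwsgi_param SCRIPT_NAME  /wayback;
    }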

Having set this chain up, if we visit e.g. dev.webarchive.org.uk, the traffic should show up on the API server as well as in the Docker container.

NOTE that changes to the internal NGINX configuration are only picked up when it starts, so it is necessary to run:

docker service update --force access_website_nginx

After this, NGINX should restart, pick up any configuration changes, and re-check whether it can connect to any proxied services inside the stack.
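
To confirm that the forced update has taken effect, the service's task list can be inspected with standard Swarm tooling:

    # The newest access_website_nginx task should show a recent 'Running' state:
    docker service ps access_website_nginx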

Because the chain of proxies is quite complicated, we also add a Via header at each layer, e.g.

    # Add header for tracing where issues occur:
    add_header Via $hostname always;

This adds the hostname at each successful proxy hop, so the number of Via headers and their values can be used to trace problems with the proxies.
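
The resulting headers can be checked from outside the chain with a simple request, e.g.:

    # Each proxy layer that handled the request should contribute a Via header:
    curl -sI https://dev.webarchive.org.uk/ | grep -i '^via'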

Components

Behind the NGINX, we have a set of modular components:

  • The ukwa-ui service that provides the main user interface.
  • The ukwa-pywb service that provides access to archived web pages.
  • The mementos service that allows users to look up URLs via Memento.
  • The shine and shinedb services that provide our older prototype researcher interface.
  • The ukwa-access-api and related services (pywb-nobanner, webrender-api, Cantaloupe) that provide API services.
    • The API services include a caching image server (Cantaloupe) that takes rendered versions of archived websites and exposes them via the standard IIIF Image API (see the example request after this list). This will need substantial disk space (~1TB).
  • The Crawl Log Analyser: The analyse service connects to the Kafka crawl log of the frequent crawler, and aggregates statistics on recent crawling activity. This is summarised into a regularly-updated JSON file that the UKWA Access API part of the website stack makes available for users. This is used by the https://ukwa-vis.glitch.me/ live crawler glitch experiment.
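
As an illustration of the IIIF Image API (2.x-style) that Cantaloupe exposes, a rendered page image can be fetched by identifier; the path prefix and identifier here are invented for the example:

    # IIIF Image API pattern: .../{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    curl -o page-image.jpg "https://dev.webarchive.org.uk/iiif/2/EXAMPLE-IDENTIFIER/full/full/0/default.jpg"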

Shine Database

Shine requires a PostgreSQL database, so additional setup is required using the scripts in ./scripts/postgres.

Stop the Shine service

When modifying the database, and having deployed the stack, you first need to stop Shine itself from running. Otherwise, it will attempt to start up and insert an empty database into PostgreSQL, which will interfere with the restore process. So, use:

$ docker service scale access_website_shine=0

This will drop the Shine service but leave all the rest of the stack running.

Creating the Shine database

  • create-db.sh
  • create-user.sh
  • list-db.sh

Within scripts/postgres/, you can run create-db.sh to create the database itself. Then, run create-user.sh to run the setup_user.sql script and set up a suitable user with access to the database. Use list-db.sh to check the database is there at this point.
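
Putting those steps together (assuming the scripts are run from the website folder and take no extra arguments; check each script before running):

    cd scripts/postgres/
    ./create-db.sh     # create the empty Shine database
    ./create-user.sh   # runs setup_user.sql to create a user with access to it
    ./list-db.sh       # confirm the database is now listed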

Restoring the Shine database from a backup

  • Edit download-shine-db-dump.sh to use the most recent date version from HDFS
  • download-shine-db-dump.sh
  • restore-shine-db-from-dump.sh

To do a restore, you need to grab a database dump from HDFS. Currently, the backups are dated and live in the HDFS /2_backups/access/access_shinedb/ folder, so you'll need to edit download-shine-db-dump.sh to use the appropriate date, then run it to actually get the database dump. Then, running restore-shine-db-from-dump.sh should populate the database.
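
As a sketch of the whole sequence (the hadoop invocation assumes a configured HDFS client on the host; the WebHDFS API could be used instead):

    # See which dated dumps are available on HDFS:
    hadoop fs -ls /2_backups/access/access_shinedb/
    # Edit download-shine-db-dump.sh to reference the chosen date, then:
    ./download-shine-db-dump.sh
    ./restore-shine-db-from-dump.sh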

Restart the Shine service

Once you have created and restored the database as needed, re-scale the service and Shine will restart using the restored database.

$ docker service scale access_website_shine=1

Creating a backup of the Shine database

An additional helper script, backup-shine-db-to-hdfs.sh, downloads a dated dump file of the live database and pushes it to HDFS:

 ./backup-shine-db-to-hdfs.sh dev

This should be run daily.

Cron Jobs

There should be a daily (early morning) backup of the Shine database.
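
No canonical crontab is recorded here, but an entry along the following lines would cover it; the path, context argument, log location and timing are assumptions to adapt per deployment:

    # Illustrative crontab entry: daily early-morning backup of the Shine database to HDFS
    30 5 * * * cd /path/to/website/scripts/postgres && ./backup-shine-db-to-hdfs.sh prod >> /var/log/shine-backup.log 2>&1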

Testing

The website regression tests can be used to ensure core functionality is in place. See the parent folder for details.

Monitoring

Having deployed all of the above, the cron jobs mentioned in the previous section should be in place.

The ukwa-monitor service should be used to check that these are running, and that the W3ACT database export file on HDFS is being updated.

...monitoring setup TBC...