diff --git a/2023-11-UKWA-Tech-Arch-Overview.png b/2023-11-UKWA-Tech-Arch-Overview.png new file mode 100644 index 0000000..47ae1c9 Binary files /dev/null and b/2023-11-UKWA-Tech-Arch-Overview.png differ diff --git a/CHANGELOG.md b/CHANGELOG.md deleted file mode 100644 index 4fe22d2..0000000 --- a/CHANGELOG.md +++ /dev/null @@ -1,23 +0,0 @@ -# Changelog - -All notable changes to this project will be documented in this file. - -The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), -and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). - -Headings are [Added, Changed, Deprecated, Removed, Fixed, Security](https://keepachangelog.com/en/1.0.0/#how) - -## [Unreleased] - -### Added - -### Changed - -* Re-arranged content into service areas (ingest/access/etc.) Structure is now `area/stack` for each deployable service stack. -* Updated the README to refer to this CHANGELOG. -* Setting up the website service stack to deploy the beta version as-is. -* Added initial cross-deployment automated test suite based on Robot Framework. -* Added example proxy config to ensure we can build and use images at work. - -### Removed - diff --git a/README.md b/README.md index 093de43..806ad3e 100755 --- a/README.md +++ b/README.md @@ -1,32 +1,147 @@ # ukwa-services +Deployment configuration for almost all UKWA services. + +## Contents + - [Introduction](#introduction) -- [Structure](#structure) -- [Deployment Process](#deployment-process) +- [Service Stacks](#service-stacks) +- [High-Level Technical Architecture](#high-level-technical-architecture) + - [Overview](#overview) + - [Areas](#areas) + - [Manage](#manage) + - [Ingest](#ingest) + - [Storage](#storage) + - [Process](#process) + - [Access](#access) + - [Monitoring](#monitoring) + - [Interfaces](#interfaces) + - [Networks](#networks) +- [Software](#software) + - [Deployment Process](#deployment-process) -## Introduction -Deployment configuration for all UKWA services stacks. +## Introduction These [Docker Stack](https://docs.docker.com/engine/reference/commandline/stack/) configurations and related scripts are used to launch and manage our main services. No internal or sensitive data is kept here -- that is stored in the internal `ukwa-services-env` repository as environment variable scripts required for deployment, or as part of the CI/CD system. -See the [change log](./CHANGELOG.md) for information on how this setup has changed over time. +Note that some services are not deployed via containers, e.g. the Hadoop clusters, and the Solr and OutbackCDX indexes. Those are documented elsewhere, but the interaction with those other services will be made clear. -## Structure +## Service Stacks Service stacks are grouped by broad service area, e.g. [`access`](./access) contains the stacks that provide the access services, and the [access README](./access/README.md) provides detailed documentation on how the access services are deployed. The service areas are: - [`ingest`](./ingest) covers all services relating to the curation and ingest of web archives -- [`access`](./access) covers all services relating to how we make the web archives accessible to the public +- [`access`](./access) covers all services relating to how we make the web archives accessible to our audiences - [`manage`](./manage) covers all internal services relating to the management of the web archive, including automation and workflows that orchestrate activities from ingest to storage and then to access -Within each sub-folder, e.g. 
`access/website`, we should have a single `docker-compose.yml` file which should be used for all deployment contexts (e.g. `dev`,`beta` and `prod`). Any necessary variations should be defined via environment variables. +_For a high-level overview of how these service stacks interact, see the [section on technical architecture below](#high-level-technical-architecture)._ + +Within each sub-folder, e.g. `access/website`, we have a `docker-compose.yml` file which should be used for all deployment contexts (e.g. `dev`,`beta` and `prod`). Any necessary variations should be defined via environment variables. These variables, and any other context-specific configuration, should be held in subdirectories. For example, if `access/website/docker-compose.yml` is the main stack definition file, any additional services needed only on `dev` might be declared in `access/website/dev/docker-compose.yml` and would be deployed separately. -## Deployment Process +The process for updating and deploying components is described in [the deployment section below](#deployment-process). + +## High-Level Technical Architecture + +This is a high-level introduction to the technical components that make up our web archiving services. The primary purpose of this documentation is to ensure the whole team has an overview of the whole system, and can work out which components are involved when something goes wrong. + +Some wider contextual information can be found at: + +* [http://data.webarchive.org.uk/ukwa-documentation/how-ukwa-works/\_index.html](http://data.webarchive.org.uk/ukwa-documentation/how-ukwa-works/_index.html) (source [https://github.com/ukwa/ukwa-documentation/tree/master/content/how-ukwa-works](https://github.com/ukwa/ukwa-documentation/tree/master/content/how-ukwa-works)) +* ...TBA... + +Note that the images on this page can be found at: + +* [This Google Slides presentation.](https://docs.google.com/presentation/d/1MnJfldL7MvJYJ28genZqjmDoHhOlo8dRNqmuZGqa5fc/edit?usp=sharing) +* ...TBA... + +### Overview + +![High-level technical overview of the UKWA systems](2023-11-UKWA-Tech-Arch-Overview.png) + +The life-cycle of our web archives can be broken down into five main stages, along with management and monitoring processes covering the whole life-cycle. Each stage is defined by its interfaces, with the data standards and protocols that define what goes into and out of that stage ([see below for more details](#interfaces)). This allows each stage to evolve independently, as long as its 'contract' with the other stages is maintained. + +There are multiple ingest streams, covering different aspects of a single overall workflow, starting with the curation tools that we use to drive the web crawlers. Those harvesting processes pull resources off the web and store them in archival form, to be transferred to HDFS. From there, we can ingest the content into other long-term stores, and it can then be used to provide access to individual resources both internally and externally, for all the Legal Deposit libraries. As the system complexities and service levels vary significantly across the different access channels, we identify them as distinct services, while we have only one (unified) harvesting service. 
+ +In order to be able to find items of interest among the billions of resources we hold, we run a range of data-mining processes on our collections that generate appropriate metadata, which is then combined with manually-generated annotations (supplied by our curators) and used to build our catalogue records and indexes. These records drive the discovery process, allowing users to find content which can then be displayed over the open web or via the reading room access service (as appropriate). + +### Areas + +#### Manage + +The critical management component is Apache Airflow, which orchestrates almost all web archive activity. For staff, it is accessible at [http://airflow.api.wa.bl.uk](http://airflow.api.wa.bl.uk). Each workflow (or DAG in Airflow terminology) is accessible via the management interface, and the description supplied with each one provides documentation on what the task does. Where possible, each individual task in a workflow involves running a command-line application wrapped in a versioned Docker container. Developing our tools as individual command-line applications is intended to make them easier to maintain. The Airflow deployment and workflows are defined in the `./manage` folder, in [./manage/airflow](./manage/airflow). + +Another important component is `TrackDB`, which contains a list of all the files on our storage systems, and is used by Airflow tasks to keep track of what's been indexed, etc. + +See [`manage`](./manage/) for more details. + +#### Ingest + +Covers curation services and crawl services: everything leading to WARCs and logs to store, and metadata for access. + +See [`ingest`](./ingest/) for more details. + +#### Storage + +Storage systems are not deployed as containers, so there are no details here. We currently have multiple Hadoop clusters, and many of the tasks and components here rely on interacting with those clusters through their normal APIs. + +#### Process + +There are various Airflow tasks that process the data from W3ACT or from the Hadoop storage. We use the Python MrJob library to run tasks, which are defined in the `ukwa/ukwa-manage` repository. That is quite a complex system, as it supports Hadoop 0.20.x and Hadoop 3.x, and supports tasks written in Java and Python. See [`ukwa/ukwa-manage`](https://github.com/ukwa/ukwa-manage) for more information. + +#### Access + +Our two main access services are: + +* The _UK Web Archive_ open access service, online at https://www.webarchive.org.uk/ +* The _Legal Deposit Access Service_, only available in Legal Deposit Library reading rooms. + +See [`access`](./access/) for more details. + +#### Monitoring + +Runs independently of all other systems, on separate dedicated hardware. Uses the Prometheus stack with alerts defined for major critical processes. See [https://github.com/ukwa/ukwa-monitor](https://github.com/ukwa/ukwa-monitor) for details. + + +### Interfaces + +| Interface | Protocol | Further Details | +| --------- | -------- | --------------- | +| Curate > Crawl | Crawl feeds (seeds, frequencies, etc.), NEVER-CRAWL list. | Generated from W3ACT, see the [w3act\_export workflow](http://airflow.api.wa.bl.uk/dags/w3act_export/grid). | +| Crawl > Storage | WARC/WACZ files and logs. | These are stored locally then moved to HDFS using Cron jobs (FC) and Airflow (DC, see [copy\_to\_hdfs\_crawler08](http://airflow.api.wa.bl.uk/dags/copy_to_hdfs_crawler08/grid)). 
See the [HDFS layout](HDFS-file-system-layout-and-content_154765461.html) page which describes how we expect content to be laid out so its provenance and nature are clear. | +| Storage > Process | WARC/WACZ files and logs, Metadata from W3ACT exports. | This covers indexing tasks like CDX generation, full-text indexing, etc. | +| Process > Access | WARCs/WACZ on HDFS via HTTP API + TrackDB. OutbackCDX API. Solr Full-text and Collections APIs. Data exported by w3act\_export (allows.aclj, blocks.aclj) | As the collection is large, access is powered by APIs rather than file-level standards. | + + +### Networks + +The systems configured or maintained by the web archiving technical team are located on the following networks. + +| Network Name | IP Range | Description | +| ------------ | -------- | ----------- | +| WA Public Network | 194.66.232.82 to .94 | All public services and crawlers. Note that the crawlers require unrestricted access to the open web, and so outgoing connections on any port are allowed from this network without going through the web proxy. However, very few incoming connections are allowed, each corresponding to a curatorial or access service component. These restrictions are implemented by the corporate firewall. +| WA Internal Network | - | Internal service component network. Service components are held here to keep them off the public network, but provide various back-end services for our public network and for systems held on other internal networks. This means the components that act as integration points with other service teams are held here. +| WA Private Network | - | The private network's primary role is to isolate the Hadoop cluster and HDFS from the rest of the networks, providing dedicated network capacity for cluster processes without affecting the wider network. +| DLS Access Network | - | The BSP, LDN, NLW and NLW Access VLANs. Although we are not responsible for these network areas, we also deploy service components onto specific machines within the DLS access infrastructure, as part of the _Legal Deposit Access Service_. | + + +## Software + +Almost our entire stack is open source, and the most critical components are co-maintained with other IIPC members. Specifically, the Heritrix crawler and the PyWB playback components (along with the standards and practices that they depend upon, like [WARC](http://iipc.github.io/warc-specifications/)) are crucial to the work of all the IIPC members, and to maintaining access to this content over the long term. + +Current upgrade work in progress: + +* Reading Room access currently depends on OpenWayback but should be replaced with a modernized PyWB service through the [TP0012 Legal Deposit Access Solution](https://wiki.bl.uk:8443/display/WAG/TP0012+Legal+Deposit+Access+Solution) project. +* Adoption of Browsertrix Cloud for one-off crawls, with the intent to move all Frequent Crawls into it eventually. +* A new approach is needed to manage monitoring and replication of content across H020, H3 BSP and H3 NLS. +* Full-scale fulltext indexing remains a challenge and new workflows are needed. +* All servers and containers need forward migration, e.g. to the latest version of RedHat, dependent libraries, etc. As we have a fairly large estate, this is an ongoing task. Generally, this can be done without major downtime, e.g. using Hadoop means it's relatively straightforward to take a storage node out and upgrade its operating system without interrupting the service. 
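Most of the service interfaces described above (the CDX, TrackDB and Solr APIs in the Interfaces table) are plain HTTP APIs, so they can be smoke-tested with `curl`. A minimal sketch follows, assuming the endpoints named elsewhere in this document and the usual OutbackCDX/CDX-server query parameters; the example URL is purely illustrative, so adjust as needed:

    # Check whether a URL is present in the access-time CDX index (OutbackCDX)
    curl -s 'http://cdx.api.wa.bl.uk/data-heritrix?url=https://www.gov.uk/&limit=5'

    # Check whether the same URL was seen by the frequent crawler's crawl-time index
    curl -s 'http://crawler06.bl.uk:8081/fc?url=https://www.gov.uk/&limit=5'

If these return CDX lines, the URL has been captured and indexed; an empty response usually means it has not yet made it through the corresponding stage.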
+ +### Deployment Process First, individual components should be developed and tested on developers' own machines/VMs, using the [Docker Compose](https://docs.docker.com/compose/compose-file/) files within each tool's repository. e.g. [w3act](https://github.com/ukwa/w3act/blob/master/docker-compose.yml). @@ -38,10 +153,13 @@ Once we're happy with the set of Swarm stacks, we can tag the whole configuratio Whoever is performing the roll-out will then review the tagged `ukwa-services` configuration: -- check they understand what has been changed, which should be indicated in the [change log](./CHANGELOG.md) +- check they understand what has been changed, which should be indicated in the relevant pull request(s) or commit(s) - review setup, especially the prod/beta/dev-specific configurations, and check they are up to date and sensible - check no sensitive data or secrets have crept into this repository (rather than `ukwa-services-env`) - check all containers specify a tagged version to deploy - check the right API endpoints are in use - run any tests supplied for the component + + + diff --git a/ingest/README.md b/ingest/README.md index adea22b..816cfbe 100644 --- a/ingest/README.md +++ b/ingest/README.md @@ -1,6 +1,28 @@ The Ingest Stacks ================= +- [Introduction](#introduction) +- [Workflows](#workflows) + - [How the Frequent Crawler works](#how-the-frequent-crawler-works) + - [How the Document Harvester works](#how-the-document-harvester-works) + - [Known Failure Modes](#known-failure-modes) + - [Debugging Approach](#debugging-approach) +- [Operations](#operations) + - [Crawler Service Operations](#crawler-service-operations) + - [Launching the Services](#launching-the-services) + - [Waiting for Kafka](#waiting-for-kafka) + - [Shutdown](#shutdown) + - [Crawl Operations](#crawl-operations) + - [Starting Crawls](#starting-crawls) + - [Stopping Crawls](#stopping-crawls) + - [Pause the crawl job(s)](#pause-the-crawl-jobs) + - [Checkpoint the job(s)](#checkpoint-the-jobs) + - [Shutdown](#shutdown-1) + + +Introduction +------------ + This section covers the service stacks that are used for curation and for crawling. - [`w3act`](./w3act/) - where curators define what should be crawled, and describe what has been crawled. @@ -8,3 +30,164 @@ This section covers the service stacks that are used for curation and for crawli - [`dc`](./dc/) - the Domain Crawler, which is used to crawl all UK sites, once per year. The [`crawl_log_db`](./crawl_log_db/) service is not in use, but contains a useful example of how a Solr service and its associated schema can be set up using the Solr API rather than maintaining XML configuration files. + +Workflows +--------- + +The Ingest services work together in quite complicated ways, so this section attempts to describe some of the core workflows. This should help determine what's happened if anything goes wrong. + +### How the Frequent Crawler works + + + +### How the Document Harvester works + +1. Curators mark Targets as being Watched in W3ACT. +2. The [`w3act_export` workflow](http://airflow.api.wa.bl.uk/dags/w3act_export/grid) running on Airflow exports the data from W3ACT into files that contain this information. +3. The usual move-to-hdfs scripts move WARCs and logs onto the Hadoop store. +4. The TrackDB file tracking database gets updated so recent WARCs and crawl logs are known to the system. (See the `update_trackdb_*` tasks on [http://airflow.api.wa.bl.uk](http://airflow.api.wa.bl.uk/home)/). +5. 
The usual web archiving workflow indexes WARCs into the CDX service so items become available. +6. The Document Harvester [`ddhapt_log_analyse` workflow](http://airflow.api.wa.bl.uk/dags/ddhapt_log_analyse/grid) runs Hadoop jobs that take the W3ACT export data and use it to find potential documents in the crawl log. + 1. This currently means PDF files on Watched Targets. + 2. For each, a record is pushed to a dedicated PostgreSQL Document Database (a part of the W3ACT stack), with a status of _NEW_. +7. The Document Harvester [ddhapt\_process\_docs workflow](http://airflow.api.wa.bl.uk/dags/ddhapt_process_docs/grid) gets the most recent _NEW_ documents from the Document Database and attempts to enrich the metadata and post them to W3ACT. + 1. Currently, the metadata enrichment process talks to the live web rather than the web archive. + 2. In general, PDFs are associated with the website they are found from (the landing page), linked to the Target. + 3. For GOV.UK, we rely on the PDFs having a rel=up HTTP header that unambiguously links a PDF to its landing page. + 4. The enriched metadata is then used to push a request to W3ACT. This metadata includes an access URL that points to the UKWA website on the public web ([see here for details](https://github.com/ukwa/ukwa-services/blob/aa95df6854382e6b6e84edc697dcb4da2804ef9c/access/website/config/nginx.conf#L154-L155)). + 5. W3ACT checks that the file in question can be accessed via Wayback and calculates the checksum of the payload, or throws an error if it's not ready yet. + 6. If the submission works, the record is updated in the Document Database so it's no longer _NEW_. + 7. If it fails, it will be re-run in the future, so once it's available in Wayback it should turn up in W3ACT. +8. Curators review the Documents found for the Targets they own, and update the metadata as needed. +9. Curators then submit the Documents, which creates an XML SIP file that is passed to a DLS ingest process. +10. The DLS ingest process passes the metadata to MER and to Aleph. +11. The MER version is not used further. +12. The Aleph version then becomes the master metadata record, and is passed to Primo and LDLs via the Metadata Aggregator. +13. Links in e.g. Primo point to the access URLs included with the records, meaning users can find and access the documents. + +#### Known Failure Modes + +The Document Harvester has been fairly reliable in recent years, but awareness of some known failure modes may help resolve issues. + +* Under certain circumstances, Heritrix has been known to stop rotating crawl logs properly. If this happens, crawl log files may stop appearing or get lost. Fixing this may require creating an empty crawl.log file in the right place so a checkpoint can rotate the files correctly, or in the worst cases, a full crawler restart. If this happens, crawl logs will stop arriving on HDFS. +* If there is a problem with the file tracking database getting updated too slowly, then the Document Harvester Airflow workflows may run but see nothing to process. This can be determined by checking the logs via Airflow, and checking that the expected number of crawl log files for that day were found. Clearing the job so Airflow re-runs it will resolve any gaps. +* If there is a problem with W3ACT (either directly, or with how it talks to the curators' Wayback instance), then jobs may fail to upload processed Documents to W3ACT. 
This can be spotted by checking the logs via Airflow, but note that any Documents that have not yet been CDX indexed are expected to be logged as errors at this point, so it can be difficult to tell things apart. It may be necessary to inspect the W3ACT container logs to determine if there's a problem with W3ACT itself. + +#### Debugging Approach + +Problems will generally be raised by Jennie Grimshaw, who is usually able and happy to supply some example Document URLs that should have been spotted. This is very useful in that it provides some test URLs to run checks with, e.g. + +* Check the URLs actually work and use `curl -v` to see if the `Link: rel=up` header is present (for GOV.UK) which helps find the landing page URL. +* Check the crawl-time CDX index (currently at [http://crawler06.bl.uk:8081/fc](http://crawler06.bl.uk:8081/fc)) to check if the URLs have been crawled at all. +* Check the access-time CDX index (currently at [http://cdx.api.wa.bl.uk/data-heritrix](http://cdx.api.wa.bl.uk/data-heritrix)) to check if the items have been indexed correctly. +* Check the Curator Wayback service ([https://www.webarchive.org.uk/act/wayback/archive/](https://www.webarchive.org.uk/act/wayback/archive/)) to see if the URLs are accessible. +* Query the PostgreSQL Document Database to see if the URL was found by the crawl log processor and what the status of it is. + +Overall, the strategy is to work out where the problem has occurred in the chain of events outlined in the first section, and then modify and/or re-run the workflows as needed. + + +Operations +---------- + +This section covers some common operations when interacting with the Ingest services. In particular, the operations for the Frequent Crawler and the Domain Crawler are very similar, so these are documented here. + +### Crawler Service Operations + + +TBA move-to-S3? + +#### Launching the Services + + + docker system prune -f + +#### Waiting for Kafka + + docker service logs --tail 100 -f fc_kafka_kafka + +Depending on the + + ...Loading producer state from snapshot files... + +Check in UI too. Restart if not showing up. + + docker service update --force fc_kafka_ui + + +Check surts and exclusions. +Check GeoIP DB (GeoLite2-City.mmdb) is installed and up to date. + +JE cleaner threads +je.cleaner.threads to 16 (from the default of 1) - note large numbers went very badly causing memory exhaustion +Bloom filter +MAX_RETRIES=4 + +#### Shutdown + +At this point, all activity should have stopped, so it should not make much difference how exactly the service is halted. To attempt to keep things as clean as possible, first terminate and then teardown the job(s) via the Heritrix UI. + +Then remove the crawl stack: + + docker stack rm fc_crawl + +If this is not responsive, it may be necessary to restart Docker itself. This means all the services get restarted with the current deployment configuration. + + service docker restart + +Even this can be quite slow sometimes, so be patient. + + +### Crawl Operations + +The current crawl engine relies on Heritrix3 state management to keep track of crawl state, and this was not designed to cope with unsupervised system restarts, i.e. rather than being stateless, or delegating state management to something that ensures the live state is preserved immediately, we need to ensure the runtime state is recorded on disk. This is why crawler operations are more complex than other areas. 
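The operations below are described in terms of the Heritrix3 web UI, but the same actions can be scripted against the Heritrix3 engine REST API, which can be handy when checking or pausing several jobs. A minimal sketch, assuming the standard Heritrix3 REST actions, placeholder credentials, and the `frequent-npld` job name used in the examples below (adjust host, port and job as appropriate):

    # Show the current status of a job as XML (digest auth over a self-signed certificate)
    curl -s -k -u admin:PASSWORD --anyauth -H 'Accept: application/xml' \
        https://crawler06.bl.uk:8443/engine/job/frequent-npld

    # Pause and then checkpoint the job (the scripted equivalent of the UI steps below)
    curl -s -k -u admin:PASSWORD --anyauth -d action=pause \
        https://crawler06.bl.uk:8443/engine/job/frequent-npld
    curl -s -k -u admin:PASSWORD --anyauth -d action=checkpoint \
        https://crawler06.bl.uk:8443/engine/job/frequent-npld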
+ +#### Starting Crawls + +As stated above, before going any further, we need to ensure that Kafka has completed starting up and is ready for producers and consumers to connect. + +- Build. +- Select Checkpoint. If expected checkpoints are not present, this means something went wrong while writing them. This should be reported to try to determine and address the root cause, but there's not much to be done other than select the most recent valid checkpoint. +- Launch. +- + +#### Stopping Crawls + +If possible, we wish to preserve the current state of the crawl, so we try to cleanly shut down while making a checkpoint to restart from. + +Note that for our frequent crawls, we run two Heritrix services, one for NPLD content and one for by-permission crawling. When performing a full stop of the frequent crawls, both services need to be dealt with cleanly. When running on crawler06, this means: + +- https://crawler06.bl.uk:8443/ is NPLD crawling. +- https://crawler06.bl.uk:9443/ is By-Permission crawling. + +#### Pause the crawl job(s) + +For all Heritrixes in the Docker Stack: log into the Heritrix3 control UI, and pause any job(s) on the crawler that are in the `RUNNING` state. This can take a while (say up to two hours) as each worker thread tries to finish its work neatly. Sometimes pausing never completes because of some bug, in which case we proceed anyway and accept some inaccuracies in the crawl state. If it works, all `RUNNING` jobs will now be in the state `PAUSED`. + +#### Checkpoint the job(s) + +Via the UI, request a checkpoint. If there's not been one for a while, this can be quite slow (tens of minutes). If it works, a banner should flash up with the checkpoint ID, which should be noted so the crawl can be resumed from the right checkpoint. If the checkpointing fails, the logs will need to be checked for errors, as unless a new checkpoint is successfully completed, it will likely not be valid. + +As an example, under some circumstances the log rotation does not work correctly. This means non-timestamped log files may be missing, which means when the next checkpoint runs, there are errors like: + + $ docker logs --tail 100 fc_crawl_npld-heritrix-worker.1.h21137sr8l31niwsx3m3o7jri + .... + SEVERE: org.archive.crawler.framework.CheckpointService checkpointFailed Checkpoint failed [Wed May 19 12:47:13 GMT 2021] + java.io.IOException: Unable to move /heritrix/output/frequent-npld/20210424211346/logs/runtime-errors.log to /heritrix/output/frequent-npld/20210424211346/logs/runtime-errors.log.cp00025-20210519124709 + +These errors can be avoided by adding empty files in the right place, e.g. + + touch /mnt/gluster/fc/heritrix/output/frequent-npld/20210424211346/logs/runtime-errors.log + +But immediately re-attempting to checkpoint a paused crawl will usually fail with: + + Checkpoint not made -- perhaps no progress since last? (see logs) + +This is because the system will not attempt a new checkpoint if the crawl state has not changed. Therefore, to force a new checkpoint, it is necessary to briefly un-pause the crawl so some progress is made, then re-pause and re-checkpoint. + + +#### Shutdown + +At this point, all activity should have stopped, so it should not make much difference how exactly the service is halted. To attempt to keep things as clean as possible, first terminate and then teardown the job(s) via the Heritrix UI. + +You can now shut down the services... 
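For reference, a minimal sketch of that final teardown, assuming the stack names used in this document (`fc_crawl` for the crawl stack; `fc_kafka` is an assumption based on the `fc_kafka_*` service names above):

    # Remove the crawl stack once the jobs have been terminated and torn down
    docker stack rm fc_crawl

    # If the whole crawler host is being shut down, remove the Kafka stack too
    # (stack name assumed from the fc_kafka_* service names used earlier)
    docker stack rm fc_kafka

    # Confirm that nothing is left running before powering down
    docker stack ls
    docker service ls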
diff --git a/ingest/fc/OPS_CRAWLS.md b/ingest/fc/OPS_CRAWLS.md deleted file mode 100644 index b76214a..0000000 --- a/ingest/fc/OPS_CRAWLS.md +++ /dev/null @@ -1,54 +0,0 @@ -# Crawl Operations - -The current crawl engine relies on Heritrix3 state management to keep track of crawl state, and this was not designed to cope under un-supervised system restarts. i.e. rather than being stateless, or delegating state management to something that ensures the live state is preserved immediately, we need to manage ensuring the runtime state is recorded on disk. This is why crawler operations are more complex than other areas. - -## Starting Crawls - -As stated above, before going any further, we need to ensure that Kafka has completed starting up and is ready for producers and consumers to connect. - -- Build. -- Select Checkpoint. If expected checkpoints are not present, this means something went wrong while writing them. This should be reported to try to determine and address the root cause, but there's not much to be done other than select the most recent valid checkpoint. -- Launch. -- - -## Stopping Crawls - -If possible, we wish to preserve the current state of the crawl, so we try to cleanly shut down while making a checkpoint to restart from. - -Note that for our frequent crawls, we run two Heritrix services, one for NPLD content and one for by-permission crawling. When performing a full stop of the frequent crawls, both services need to be dealt with cleanly. When running on crawler06, this means: - -- https://crawler06.bl.uk:8443/ is NPLD crawling. -- https://crawler06.bl.uk:9443/ is By-Permission crawling. - -### Pause the crawl job(s) - -For all Heritrixes in the Docker Stack: log into the Heritrix3 control UI, and pause any job(s) on the crawler that are in the `RUNNING` state. This can take a while (say up to two hours) as each worker thread tries to finish it's work neatly. Sometimes pausing never completes because of some bug, in which case we proceed anyway and accept some inaccuracies in the crawl state. If it works, all `RUNNING` jobs will now be in the state `PAUSED`. - -### Checkpoint the job(s) - -Via the UI, request a checkpoint. If there's not been one for a while, this can be quite slow (tens of minutes). If it works, a banner should flash up with the checkpoint ID, which should be noted so the crawl can be resumed from the right checkpoint. If the checkpointing fails, the logs will need to be checked for errors, as unless a new checkpoint is succefully completed, it will likely not be valid. - -As an example, under some circumstances the log rotation does not work correctly. This means non-timestamped log files may be missing, which means when the next checkpoint runs, there are errors like: - - $ docker logs --tail 100 fc_crawl_npld-heritrix-worker.1.h21137sr8l31niwsx3m3o7jri - .... - SEVERE: org.archive.crawler.framework.CheckpointService checkpointFailed Checkpoint failed [Wed May 19 12:47:13 GMT 2021] - java.io.IOException: Unable to move /heritrix/output/frequent-npld/20210424211346/logs/runtime-errors.log to /heritrix/output/frequent-npld/20210424211346/logs/runtime-erro - rs.log.cp00025-20210519124709 - -These errors can be avoided by adding empty files in the right place, e.g. - - touch /mnt/gluster/fc/heritrix/output/frequent-npld/20210424211346/logs/runtime-errors.log - -But immediately re-attempting to checkpoint a paused crawl will usually fail with: - - Checkpoint not made -- perhaps no progress since last? 
(see logs) - -This is because the system will not attempt a new checkpoint if the crawl state has not changed. Therefore, to force a new checkpoint, it is necessary to briefly un-pause the crawl so some progress is made, then re-pause and re-checkpoint. - - -### Shutdown - -At this point, all activity should have stopped, so it should not make much difference how exactly the service is halted. To attempt to keep things as clean as possible, first terminate and then teardown the job(s) via the Heritrix UI. - -You can now shut down the services... diff --git a/ingest/fc/OPS_SERVICES.md b/ingest/fc/OPS_SERVICES.md deleted file mode 100644 index d1a010d..0000000 --- a/ingest/fc/OPS_SERVICES.md +++ /dev/null @@ -1,48 +0,0 @@ -Service Operations -================== - -move-to-S3 - -## Docker Stacks - -### Launching the Services - - - docker system prune -f - -#### Waiting for Kafka - - docker service logs --tail 100 -f fc_kafka_kafka - -Depending on the - - ...Loading producer state from snapshot files... - -Check in UI too. Restart if not showing up. - - docker service update --force fc_kafka_ui - - -Check surts and exclusions. -Check GeoIP DB (GeoLite2-City.mmdb) is installed and up to date. - -JE cleaner threads -je.cleaner.threads to 16 (from the default of 1) - note large numbers went very badly causing memory exhaustion -Bloom filter -MAX_RETRIES=4 - - - -#### Shutdown - -At this point, all activity should have stopped, so it should not make much difference how exactly the service is halted. To attempt to keep things as clean as possible, first terminate and then teardown the job(s) via the Heritrix UI. - -Then remote the crawl stack: - - docker stack rm fc_crawl - -If this is not responsive, it may be necessary to restart Docker itself. This means all the services get restarted with the current deployment configuration. - - service docker restart - -Even this can be quite slow sometimes, so be patient.