OLRC migration #34

Open, wants to merge 34 commits into base: main
620bf94
stopgap kluge: fix large file handling
jefferya Dec 16, 2020
4c0abbd
fix rubocop errors
jefferya Dec 17, 2020
857579f
rubocop fix
jefferya Dec 17, 2020
a1aef65
fix race conndition: audit excutes while items added to Swift
jefferya Feb 17, 2021
0984097
update gems
jefferya Feb 18, 2021
f45e3b9
update CI to ruby 2.5
jefferya Feb 18, 2021
019851b
update travis config to remove warnings
jefferya Feb 18, 2021
5a1c3e1
update rubocop config and fix warnings after version update
jefferya Feb 19, 2021
e6ec5aa
update codebase to conform to new rubocop
jefferya Feb 22, 2021
5597f21
cleanup response logic
jefferya Feb 22, 2021
02a0436
cleanup response logic
jefferya Feb 22, 2021
c9e978f
cleanup response logic
jefferya Feb 22, 2021
1b7220a
cleanup response logic
jefferya Feb 22, 2021
b6f3e9d
refactor audit report
jefferya Mar 14, 2021
d28b379
added ideas on refactoring audit report for speed and memory usage
jefferya Mar 19, 2021
a75a9e2
enhance preformance; reduce OpenStack API calls #30
jefferya Jan 11, 2022
89bc695
Update config to conform with OLRC Swift authentication
jefferya Aug 14, 2023
b069998
Fixes audit error by specifying the continer name as part as the project
jefferya Aug 15, 2023
d571132
Update readme: grammar and OLRC details
jefferya Aug 15, 2023
906b61a
Remove discontinued Travis CI
jefferya Aug 16, 2023
1abf9d1
Fixes wrong content type added to Swift item plus
jefferya Aug 16, 2023
4bb11d6
Tweak README
jefferya Aug 17, 2023
0a9312c
Add proper content type handling depending response
jefferya Aug 18, 2023
daebffa
Draft: migrate script from local to OLRC
jefferya Sep 13, 2023
5178829
Adds validation and csv uploaded logging
jefferya Sep 15, 2023
490c66f
refactor tests
jefferya Sep 18, 2023
fa7d969
Update .gitignore: remove Python intermediate files
jefferya Sep 19, 2023
412c05c
Revise error handling
jefferya Sep 19, 2023
3e973ac
Cleanup source data: missing meta-project
jefferya Sep 20, 2023
b4c4ba1
Add README notes on how to run the migration process
jefferya Oct 13, 2023
ffc68bf
Set project to the project name instead of the similarly named contai…
jefferya Nov 9, 2023
fb154f6
Fix use of project instead of container name
jefferya Nov 10, 2023
7f0c7a9
Add script to compare two swift containers
jefferya Nov 14, 2023
e66ad22
Tweak output for easier debugging
jefferya Nov 15, 2023
6 changes: 6 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,8 @@
/test/tmp/
/test/version_tmp/
/tmp/
/log/


# Used by dotenv library to load environment variables.
# .env
Expand Down Expand Up @@ -48,3 +50,7 @@ build-iPhoneSimulator/

# unless supporting rvm < 1.11.0 or doing something fancy, ignore this:
.rvmrc

#
*.pyc
.pytest_cache
25 changes: 13 additions & 12 deletions .rubocop.yml
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,8 @@ AllCops:
- 'tmp/**/*'
- 'vendor/**/*'
ExtraDetails: true
TargetRubyVersion: 2.3
NewCops: enable
TargetRubyVersion: 2.5

# readability is Actually Good
Layout/EmptyLinesAroundClassBody:
Expand All @@ -21,6 +22,10 @@ Layout/IndentationConsistency:
Enabled: true
EnforcedStyle: normal

Layout/LineLength:
Enabled: true
Max: 120 # default is 80

# A calculated magnitude based on number of assignments,
# branches, and conditions.
Metrics/AbcSize:
Expand All @@ -34,10 +39,6 @@ Metrics/ClassLength:
Metrics/CyclomaticComplexity:
Enabled: false

Metrics/LineLength:
Enabled: true
Max: 120 # default is 80

# Avoid methods longer than 10 lines of code.
Metrics/MethodLength:
Enabled: false
Expand All @@ -52,20 +53,20 @@ Metrics/ModuleLength:
Metrics/PerceivedComplexity:
Enabled: false

Naming/FileName:
Exclude:
- Dangerfile
- Rakefile
- Gemfile

# indentation is an endangered resource
Style/ClassAndModuleChildren:
EnforcedStyle: compact

Style/Documentation:
Enabled: false

Naming/FileName:
Exclude:
- Dangerfile
- Rakefile
- Gemfile

# Checks if there is a magic comment to enforce string literals
# Checks if there is a magic comment to enforce string literals
Style/FrozenStringLiteralComment:
Enabled: false

Expand Down
12 changes: 0 additions & 12 deletions .travis.yml

This file was deleted.

6 changes: 3 additions & 3 deletions Gemfile
Original file line number Diff line number Diff line change
@@ -1,10 +1,10 @@
source 'https://rubygems.org'

gem 'bundler', '~> 1.17'
gem 'bundler', '~> 2.0'
gem 'http-cookie'
gem 'json'
gem 'json', '>= 2.3.0'
gem 'logger'
gem 'rake'
gem 'rake', '>= 12.3.3'
gem 'swift_ingest', '~> 0.4.0'

group :development, :test do
Expand Down
80 changes: 41 additions & 39 deletions Gemfile.lock
Original file line number Diff line number Diff line change
@@ -1,79 +1,81 @@
GEM
remote: https://rubygems.org/
specs:
activesupport (5.2.3)
activesupport (5.2.4.5)
concurrent-ruby (~> 1.0, >= 1.0.2)
i18n (>= 0.7, < 2)
minitest (~> 5.1)
tzinfo (~> 1.1)
addressable (2.6.0)
public_suffix (>= 2.0.2, < 4.0)
ast (2.4.0)
concurrent-ruby (1.1.5)
crack (0.4.3)
safe_yaml (~> 1.0.0)
domain_name (0.5.20180417)
addressable (2.7.0)
public_suffix (>= 2.0.2, < 5.0)
ast (2.4.2)
concurrent-ruby (1.1.8)
crack (0.4.5)
rexml
domain_name (0.5.20190701)
unf (>= 0.0.5, < 1.0.0)
hashdiff (0.3.8)
hashdiff (1.0.1)
http-cookie (1.0.3)
domain_name (~> 0.5)
i18n (1.6.0)
i18n (1.8.9)
concurrent-ruby (~> 1.0)
jaro_winkler (1.5.2)
json (2.2.0)
logger (1.3.0)
minitest (5.11.3)
json (2.5.1)
logger (1.4.3)
minitest (5.14.3)
mysql2 (0.4.10)
openstack (3.3.20)
openstack (3.3.21)
json
parallel (1.17.0)
parser (2.6.2.1)
ast (~> 2.4.0)
power_assert (1.1.4)
psych (3.1.0)
public_suffix (3.0.3)
parallel (1.20.1)
parser (3.0.0.0)
ast (~> 2.4.1)
power_assert (2.0.0)
public_suffix (4.0.6)
rainbow (3.0.0)
rake (12.3.2)
rubocop (0.67.2)
jaro_winkler (~> 1.5.1)
rake (13.0.3)
regexp_parser (2.0.3)
rexml (3.2.4)
rubocop (0.93.1)
parallel (~> 1.10)
parser (>= 2.5, != 2.5.1.1)
psych (>= 3.1.0)
parser (>= 2.7.1.5)
rainbow (>= 2.2.2, < 4.0)
regexp_parser (>= 1.8)
rexml
rubocop-ast (>= 0.6.0)
ruby-progressbar (~> 1.7)
unicode-display_width (>= 1.4.0, < 1.6)
unicode-display_width (>= 1.4.0, < 2.0)
rubocop-ast (1.4.1)
parser (>= 2.7.1.5)
rubocop-rspec (1.15.1)
rubocop (>= 0.42.0)
ruby-progressbar (1.10.0)
safe_yaml (1.0.5)
ruby-progressbar (1.11.0)
swift_ingest (0.4.1)
activesupport (~> 5.0)
mysql2 (~> 0.4.6)
openstack (~> 3.3, >= 3.3.10)
test-unit (3.3.1)
test-unit (3.4.0)
power_assert
thread_safe (0.3.6)
tzinfo (1.2.5)
tzinfo (1.2.9)
thread_safe (~> 0.1)
unf (0.1.4)
unf_ext
unf_ext (0.0.7.5)
unicode-display_width (1.5.0)
unf_ext (0.0.7.7)
unicode-display_width (1.7.0)
vcr (3.0.3)
webmock (3.5.1)
webmock (3.11.2)
addressable (>= 2.3.6)
crack (>= 0.3.2)
hashdiff
hashdiff (>= 0.4.0, < 2.0.0)

PLATFORMS
ruby

DEPENDENCIES
bundler (~> 1.17)
bundler (~> 2.0)
http-cookie
json
json (>= 2.3.0)
logger
rake
rake (>= 12.3.3)
rubocop (~> 0.51)
rubocop-rspec (~> 1.15.1)
swift_ingest (~> 0.4.0)
Expand All @@ -82,4 +84,4 @@ DEPENDENCIES
webmock

BUNDLED WITH
1.17.3
2.2.11
59 changes: 43 additions & 16 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
# CWRC Preservation

The CWRC Preservation toolkit contains Ruby applications for preserve content from the CWRC (cwrc.ca) repository. The primary objective is to manage the flow of content from the CWRC repository into an OpenStack Swift repository for preservation. Also, the repository provides an application to audit the contents of the source and preserved objects. The preservation tool is meant to run behind a firewall thus pulling content from CWRC.
> :warning: **These command-line scripts are only compatible with CWRC v1.0**. The next release of CWRC (Islandora v2.0 / Drupal 9+) renders these scripts obsolete, so this repo is minimally supported: the plan is to limit code clean-up and to leave the URI obsolete warning unfixed (prefix the associated command with `RUBYOPT='-W0'` to suppress the warning).

The CWRC Preservation toolkit contains Ruby applications for preserving content from the CWRC (cwrc.ca) repository. The primary objective is to manage the flow of content from the CWRC repository into an OpenStack Swift repository for preservation. Also, the repository provides an application to audit the contents of the source and preserved objects. The preservation tool is meant to run behind a firewall thus pulling content from CWRC.

The two main applications are:

Expand All @@ -10,30 +12,30 @@ The two main applications are:
## Workflow

- cwrc_preserver.rb executes at a regular interval
- sends request to the CWRC repository with authentication parameters that produces a manifest list of objects residing with the CWRC repository as a response
- sends a request to the CWRC repository with authentication parameters that produces a manifest list of objects residing with the CWRC repository as a response
- for each CWRC repository object, inspect the preserved object
- if the preserved copy does not exist or is outdated (comparing CWRC manifest timestamp to the timestamp on the preserved copy), request a new AIP (Bag) from the CWRC repository and deposit within the preservation environment
- if the preserved copy does not exist or is outdated (comparing CWRC manifest timestamp to the timestamp on the preserved copy, the swift object custom metadata field 'last-mod-timestamp'), request a new AIP (Bag) from the CWRC repository and deposit within the preservation environment
- generate an audit report via cwrc_audit_report.rb
- sends request to CWRC repository with authentication parameters to produce a manifest list of objects residing with the CWRC repository
- sends request to preservation environment with authentication parameters to produce a manifest list of objects residing with the preservation environment
- sends a request to CWRC repository with authentication parameters to produce a manifest list of objects residing with the CWRC repository
- sends a request to preservation environment with authentication parameters to produce a manifest list of objects residing with the preservation environment
- merge lists and output as a CSV file for interpretation / review (e.g., within a spreadsheet tool)
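
The "missing or outdated" check in the workflow above can be sketched roughly as follows. This is a minimal illustration, not the code in cwrc_preserver.rb: the method name `needs_preservation?` and the shape of the metadata hash are assumptions, and only the `last-mod-timestamp` key comes from the description above.

```ruby
require 'time'

# Hypothetical helper (not part of this repo): decides whether a CWRC object
# should be (re)deposited in Swift.
#   cwrc_timestamp - ISO 8601 modification time from the CWRC manifest
#   swift_metadata - custom metadata hash of the preserved Swift object,
#                    or nil when no preserved copy exists
def needs_preservation?(cwrc_timestamp, swift_metadata)
  # No preserved copy at all: always deposit.
  return true if swift_metadata.nil?

  preserved = swift_metadata['last-mod-timestamp']
  # A copy without the timestamp cannot be verified, so treat it as stale.
  return true if preserved.nil? || preserved.empty?

  # Stale when CWRC reports a newer modification time than the preserved copy.
  Time.iso8601(cwrc_timestamp) > Time.iso8601(preserved)
end
```

Comparing parsed `Time` values rather than raw strings keeps the check independent of fractional-second precision in either timestamp.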

## Requirements

- Ruby 2.3+

- Associated Gems via `bundle install`
- CWRC API endpoint: https://github.com/cwrc/islandora_bagit_extension

- Configuration file - use [secrets_example.yml](secrets_example.yml) as a starting point and the `-C --config PATH` to specify the config file to utilize.

```
``` txt
# Openstack Swift parameters
SWIFT_TENANT:
SWIFT_AUTH_URL:
SWIFT_USERNAME:
SWIFT_PASSWORD:
SWIFT_AUTH_URL:
SWIFT_USER_DOMAIN_NAME:
SWIFT_PROJECT_DOMAIN_NAME:
SWIFT_PROJECT:
SWIFT_PROJECT_DOMAIN_ID:
SWIFT_PROJECT_NAME:
CWRC_SWIFT_CONTAINER:
CWRC_PROJECT_NAME:

Expand Down Expand Up @@ -66,7 +68,7 @@ SWIFT_ARCHIVED_OK:

![Preservation System Diagram (PNG 50px/cm)](docs/images/cwrc_preservation.png)

This application connects to the CWRC repository and the preservation environment, determines which CWRC objects need preservation (e.g., missing from the preservation environment or the preservation environment contains a stale copy) and deposits a copy within the preservation environment. Optionally, the command-line allows defining a list of object ids to trigger a forced preservation event for each specified object. The application uses a config file specified on the command-line to contain properties (e.g. authentication). Two files are created:
This application connects to the CWRC repository and the preservation environment, determines which CWRC objects need preservation (e.g., missing from the preservation environment or the preservation environment contains a stale copy) and deposits a copy within the preservation environment (with CWRC modification time metadata in ['last-mod-timestamp']). Optionally, the command-line allows defining a list of object ids to trigger a forced preservation event for each specified object. The application uses a config file specified on the command-line to contain properties (e.g. authentication). Two files are created:

- swift_archived_objs.txt: lists the IDs, size and archive rate of all CWRC successfully preserved objects,
- swift_failed_objs.txt: lists all CWRC objects that failed preservation - this will need review and are candidates for reprocessing (hence -r parameter)
Expand All @@ -77,7 +79,7 @@ Common usage:
- query CWRC repository items modified since a given date/time and preserve if needed (example #2)
- pass defined list of items and force preservation (example #3)

```
``` txt
Usage: cwrc_preserver [options]

options:
Expand Down Expand Up @@ -108,11 +110,36 @@ Example #3 - process objects via a list and with a forced update (i.e., deposit
./cwrc_preserver.rb -d --config="/opt/conf/cwrc_preserver_conf.yml" --reprocess=/tmp/cwrc_pid_list_one_per_line | tee /tmp/stdout_debug.txt
```

#### Results in Swift

Note the custom Swift object metadata item `Last-Mod-Timestamp`: this is the cwrc.ca Islandora7 last modification timestamp on the object (used by the audit as one factor in determining whether or not the Swift instance needs to be updated).

``` bash
$ swift stat cwrc-test islandora:root
Account: AUTH_0d17ddb0b6834fc5be902e1a2df6f17b
Container: cwrc-test
Object: islandora:root
Content Type: application/zip
Content Length: 10083
Last Modified: Tue, 15 Aug 2023 15:08:50 GMT
ETag: 04009a3f93fd2c9b38706bddda4f86ea
Meta Project-Id: islandora:root
Meta Promise: bronze
Meta Aip-Version: 1.0
Meta Last-Mod-Timestamp: 2018-05-16T14:36:26.691Z
X-Timestamp: 1692112129.62648
Accept-Ranges: bytes
X-Trans-Id: txa72dc123c3784c959fa89-0064db97f9
X-Openstack-Request-Id: txa72dc123c3784c959fa89-0064db97f9
Strict-Transport-Security: max-age=15768000
```


<a name="cwrc_audit_report.rb"/>

### Reporting / Auditing: cwrc_audit_report.rb

# ![Audit System Diagram (PNG 50px/cm)](docs/images/cwrc_preservation_audit.png)
## ![Audit System Diagram (PNG 50px/cm)](docs/images/cwrc_preservation_audit.png)

```shell
Usage: cwrc_audit_report [options]
Expand All @@ -123,15 +150,15 @@ Usage: cwrc_audit_report [options]

Builds a CSV formatted audit report comparing content within the CWRC repository relative to UAL's OpenStack Swift preserved content.

The report pulls input from two disparate sources: CWRC repository and UAL OpenStack Swift preservation service. The report links the content based on object id and outputs the linked information in csv rows that included the fields: the CWRC object PIDs and modification date/times, UAL Swift ID, modification time, and size along with a column indicating the preservation status (i.e., indicating if modification time comparison between Swift and CWRC indicates a need for preservation, or if the size of the Swift object is zero, etc).
The report pulls input from two disparate sources: CWRC repository and UAL OpenStack Swift preservation service. The report links the content based on object id and outputs the linked information in csv rows that include the fields: the CWRC object PIDs and modification date/times, UAL Swift ID, modification time (metadata['last-mod-timestamp']), and size along with a column indicating the preservation status (i.e., indicating if modification time comparison between Swift and CWRC indicates a need for preservation, or if the size of the Swift object is zero, etc.).
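
As a rough sketch of the linking step described above, the two manifests can be joined on object id and written out as CSV. Everything here is assumed for illustration: the method names, the input hash shapes, and the status strings are hypothetical and do not reflect what cwrc_audit_report.rb actually emits, and the column list may be incomplete.

```ruby
require 'csv'
require 'time'

# Hypothetical status labels; the real report may use different wording.
def audit_status(cwrc_mod, swift)
  return 'missing from CWRC' if cwrc_mod.nil?
  return 'missing from Swift' if swift.nil?
  return 'zero size in Swift' if swift[:size].to_i.zero?
  return 'needs preservation' if Time.iso8601(cwrc_mod) > Time.iso8601(swift[:last_mod])

  'ok'
end

# cwrc:  { pid => ISO 8601 modification time }              (assumed shape)
# swift: { id  => { last_mod: ISO 8601 time, size: bytes } } (assumed shape)
def audit_csv(cwrc, swift)
  CSV.generate do |csv|
    csv << ['CWRC PID', 'CWRC modification', 'Swift ID',
            'Swift modification time', 'Swift size', 'Status']
    # The union of ids covers objects that exist on only one side.
    (cwrc.keys | swift.keys).sort.each do |id|
      s = swift[id]
      csv << [cwrc.key?(id) ? id : nil, cwrc[id],
              s && id, s && s[:last_mod], s && s[:size],
              audit_status(cwrc[id], s)]
    end
  end
end
```

Iterating over the union of ids, rather than over the CWRC manifest alone, surfaces objects that exist in Swift but no longer appear in CWRC.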

The output format is CSV with the following header columns:

```
CWRC PID,
CWRC modification,
Swift ID,
Swift modification time,
Swift modification time (metadata['last-mod-timestamp']),
Swift size,
Status

Expand Down