
In 1096 ocw workflow #91

Open
jonavellecuerdo wants to merge 3 commits into main from IN-1096-ocw-workflow

Conversation

@jonavellecuerdo (Contributor) commented Jan 23, 2025

Purpose and background context

This PR creates the OpenCourseWare DSC workflow.

This also introduces the following changes:

How can a reviewer manually see the effects of these changes?

A. Review the added unit tests.
Note: The only custom method defined for OpenCourseWare without a unit test is the item_metadata_iter method. See option B below for testing it against a local MinIO server.

B. Optional but highly recommended (especially for future development):
Run the OpenCourseWare commands against a local MinIO server.

Prerequisites

  1. Follow the instructions in the README: Running a Local MinIO Server.
    Note: As of this writing, the root password set for the local MinIO server must be at least 8 characters long. I didn't want to add this requirement to the README because it is subject to change if/when we pull updated versions of the MinIO Docker image.

  2. Populate the local MinIO server with test zip files (a sketch for generating these zips locally follows this list).
    Note: I did these steps via the WebUI.

    • Create paths (i.e., prefixes) in the dsc bucket:
      • dsc/opencourseware/batch-00/
        • Upload two (2) sample zip files with metadata.
          It is not important to mock other files, as the bitstream for OpenCourseWare deposits is the zip file itself.
          • abc123.zip: zip file containing a single data.json.
          • def456.zip: zip file containing a single data.json.
      • dsc/opencourseware/batch-01/
        • Upload one (1) sample zip file without metadata.
  3. Add the following environment variables in your .env file.

    AWS_ENDPOINT_URL=http://localhost:9000/
    AWS_ACCESS_KEY_ID=<local-minio-username>
    AWS_SECRET_ACCESS_KEY=<local-minio-password>
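
Below is a minimal sketch for generating the sample zip files locally before uploading them via the WebUI. The data.json schema shown here is an assumption inferred from the expected item_metadata_iter() output further down; adjust it to the real OpenCourseWare schema as needed.

import json
import zipfile

# abc123.zip / def456.zip: zips whose only member is data.json
sample_metadata = {
    "course_title": "Matrix Calculus for Machine Learning and Beyond",
    "course_description": "We all know that calculus courses.",
    "site_uid": "2318fd9f-1b5c-4a48-8a04-9c56d902a1f8",
    "instructors": [  # assumed raw structure; the workflow formats these names
        {"first_name": "Alan", "last_name": "Edelman"},
        {"first_name": "Steven", "last_name": "Johnson", "middle_initial": "G."},
    ],
}
with zipfile.ZipFile("abc123.zip", "w") as zf:
    zf.writestr("data.json", json.dumps(sample_metadata))

# ghi789.zip (batch-01): a zip with no data.json, to exercise the error path
with zipfile.ZipFile("ghi789.zip", "w") as zf:
    zf.writestr("placeholder.txt", "intentionally missing data.json")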
    

OpenCourseWare commands
Launch Python in your terminal: pipenv run python

  1. Check item_metadata_iter() result for batch-00.
from dsc.workflows import OpenCourseWare
opencourseware_workflow_instance = OpenCourseWare(
    collection_handle="blah", batch_id="batch-00", email_recipients="[email protected]"
)
item_metadata_iter = opencourseware_workflow_instance.item_metadata_iter()
list(item_metadata_iter)

You should see the following output:

[
    {
        "item_identifier": "abc123",
        "course_title": "Matrix Calculus for Machine Learning and Beyond",
        "course_description": "We all know that calculus courses.",
        "site_uid": "2318fd9f-1b5c-4a48-8a04-9c56d902a1f8",
        "instructors": "Edelman, Alan|Johnson, Steven G.",
    },
    {
        "item_identifier": "def456",
        "course_title": "Burgers and Beyond",
        "course_description": "Investigating the paranormal, one burger at a time.",
        "site_uid": "2318fd9f-1b5c-4a48-8a04-9c56d902a1f8",
        "instructors": "Burger, Cheese E.",
    },
]
  2. Check item_metadata_iter() result for batch-01.
from dsc.workflows import OpenCourseWare
opencourseware_workflow_instance = OpenCourseWare(
    collection_handle="blah", batch_id="batch-01", email_recipients="[email protected]"
)
item_metadata_iter = opencourseware_workflow_instance.item_metadata_iter()
list(item_metadata_iter)

You should see the following output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jcuerdo/Documents/repos/dspace-submission-composer/dsc/workflows/opencourseware.py", line 60, in item_metadata_iter
    **self._extract_metadata_from_zip_file(zip_file, item_identifier),
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jcuerdo/Documents/repos/dspace-submission-composer/dsc/workflows/opencourseware.py", line 76, in _extract_metadata_from_zip_file
    raise FileNotFoundError(
FileNotFoundError: The required file 'data.json' file was not found in the zip file: s3://dsc/opencourseware/batch-01/ghi789.zip

A FileNotFoundError is raised if any zip file is missing metadata (i.e., the data.json file).
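
For reference, a sketch of what the metadata extraction might look like. The function name and error message are taken from the traceback above; the simplified signature and the body (including the use of boto3) are assumptions, not the actual implementation.

import io
import json
import zipfile

import boto3

def _extract_metadata_from_zip_file(zip_file: str) -> dict:
    """Read 'data.json' out of a zip stored in S3 (illustrative sketch only)."""
    bucket, _, key = zip_file.removeprefix("s3://").partition("/")
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    with zipfile.ZipFile(io.BytesIO(body)) as zf:
        if "data.json" not in zf.namelist():
            raise FileNotFoundError(
                f"The required file 'data.json' file was not found "
                f"in the zip file: {zip_file}"
            )
        return json.loads(zf.read("data.json"))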

CLI command: reconcile

  1. Run reconcile for batch-00 in your terminal:
pipenv run dsc -w "opencourseware" -c "abc123" -b "batch-00" -e "[email protected]" reconcile

You should see the following output [REDACTED]:

Loading .env environment variables...
2025-01-27 10:03:11,074 INFO root.configure_logger(): INFO
2025-01-27 10:03:11,075 INFO dsc.cli.main(): Logger 'root' configured with level=INFO
2025-01-27 10:03:11,075 INFO dsc.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
2025-01-27 10:03:11,075 INFO dsc.cli.main(): Running process
2025-01-27 10:03:11,094 INFO botocore.credentials.load(): Found credentials in environment variables.
2025-01-27 10:03:11,340 INFO botocore.configprovider.provide(): Found endpoint for s3 via: environment_global.
2025-01-27 10:03:11,426 INFO botocore.configprovider.provide(): Found endpoint for s3 via: environment_global.
2025-01-27 10:03:11,438 INFO botocore.httpchecksum.handle_checksum_body(): Skipping checksum validation. Response did not contain one of the following algorithms: ['crc32', 'sha1', 'sha256'].
. . .
2025-01-27 10:03:11,515 INFO botocore.httpchecksum.handle_checksum_body(): Skipping checksum validation. Response did not contain one of the following algorithms: ['crc32', 'sha1', 'sha256'].
2025-01-27 10:03:11,515 INFO dsc.cli.reconcile(): All item identifiers and bitstreams successfully matched
2025-01-27 10:03:11,515 INFO dsc.cli.post_main_group_subcommand(): Application exiting
2025-01-27 10:03:11,515 INFO dsc.cli.post_main_group_subcommand(): Total time elapsed: 0:00:00.440784
  2. Run reconcile for batch-01 in your terminal:
pipenv run dsc -w "opencourseware" -c "abc123" -b "batch-01" -e "[email protected]" reconcile

You should see the following output [REDACTED]:

Loading .env environment variables...
2025-01-27 10:06:44,845 INFO root.configure_logger(): INFO
2025-01-27 10:06:44,845 INFO dsc.cli.main(): Logger 'root' configured with level=INFO
2025-01-27 10:06:44,845 INFO dsc.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
2025-01-27 10:06:44,845 INFO dsc.cli.main(): Running process
2025-01-27 10:06:44,857 INFO botocore.credentials.load(): Found credentials in environment variables.
2025-01-27 10:06:44,977 INFO botocore.configprovider.provide(): Found endpoint for s3 via: environment_global.
2025-01-27 10:06:45,015 INFO botocore.configprovider.provide(): Found endpoint for s3 via: environment_global.
2025-01-27 10:06:45,023 INFO botocore.httpchecksum.handle_checksum_body(): Skipping checksum validation. Response did not contain one of the following algorithms: ['crc32', 'sha1', 'sha256'].
...
2025-01-27 10:06:45,033 INFO botocore.httpchecksum.handle_checksum_body(): Skipping checksum validation. Response did not contain one of the following algorithms: ['crc32', 'sha1', 'sha256'].
2025-01-27 10:06:45,035 ERROR dsc.workflows.opencourseware._identify_bitstreams_with_metadata(): The required file 'data.json' file was not found in the zip file: s3://dsc/opencourseware/batch-01/ghi789.zip
2025-01-27 10:06:45,036 ERROR dsc.cli.reconcile(): No item identifiers found for these bitstreams: {'ghi789'}
2025-01-27 10:06:45,036 INFO dsc.cli.post_main_group_subcommand(): Application exiting
2025-01-27 10:06:45,036 INFO dsc.cli.post_main_group_subcommand(): Total time elapsed: 0:00:00.191134

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

  • IN-1096: https://mitlibraries.atlassian.net/browse/IN-1096

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

@jonavellecuerdo self-assigned this Jan 23, 2025
@jonavellecuerdo force-pushed the IN-1096-ocw-workflow branch 2 times, most recently from 9f54f9f to 5097b39 on January 23, 2025 18:09
@coveralls commented Jan 23, 2025

Pull Request Test Coverage Report for Build 13116701898

Details

  • 58 of 64 (90.63%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-1.1%) to 98.674%

Changes missing coverage (covered lines / changed or added lines / %):
  • dsc/workflows/opencourseware.py: 56 / 62 (90.32%)

Totals:
  • Change from base Build 13080178348: -1.1%
  • Covered Lines: 521
  • Relevant Lines: 528

💛 - Coveralls

Comment on lines 32 to 26
"dc.contributor.author": {
"source_field_name": "instructor",
"language": "en_US",
"delimiter": "|"
}

Comment on lines +132 to +163
def _construct_instructor_name(instructor: dict[str, str]) -> str:
    """Given a dictionary of name fields, derive instructor name."""
    if not (last_name := instructor.get("last_name")) or not (
        first_name := instructor.get("first_name")
    ):
        return ""
    return f"{last_name}, {first_name} {instructor.get("middle_initial", "")}".strip()
@jonavellecuerdo (Contributor Author) commented Jan 23, 2025

While it is plausible that the metadata in data.json will always be formatted as needed (i.e., all instructor name fields provided), it would be a good idea to check in with stakeholders (IN-1156) on the minimum required instructor name fields for constructing an instructor name.

The sample mapping file we received, ocw_json_to_dspace_mapping.xlsx, indicates that instructor names must be formatted as:

<last_name>, <first_name> <middle_initial>

The code above will return an empty string if either the last_name or first_name is missing; it allows for missing middle_initial values.
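
To illustrate that behavior, here are a few example calls against the helper quoted above (the input dictionaries are hypothetical):

_construct_instructor_name({"last_name": "Edelman", "first_name": "Alan"})
# -> "Edelman, Alan"

_construct_instructor_name(
    {"last_name": "Johnson", "first_name": "Steven", "middle_initial": "G."}
)
# -> "Johnson, Steven G."

_construct_instructor_name({"last_name": "Burger"})
# -> "" (first_name is missing, so no name is constructed)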

@jonavellecuerdo force-pushed the IN-1096-ocw-workflow branch 4 times, most recently from 7238447 to abcf2ae on January 24, 2025 20:50
@jonavellecuerdo (Contributor Author) commented Jan 27, 2025

Met with @ghukill last Friday and wanted to share some thoughts from our discussion:

  1. Clarify language around the reconcile method: The base Workflow.reconcile_bitstreams_and_metadata method uses the terms "item_identifiers_without_bitstreams" and "bitstreams_without_item_identifiers". The use of these terms felt a bit awkward with the OpenCourseWare workflow for the following reason:

    . . . the 'reconcile' method only determines whether there are any bitstreams without metadata (any zip files without a 'data.json'). Metadata without bitstreams is basically impossible because the metadata ('data.json') is inside the bitstream (zip file).

    My interpretation of reconcile is that it is really about determining whether metadata is provided for each bitstream. In the case of OpenCourseWare, all the bitstreams (the zip files) have "item_identifiers" because these are provided in the bitstream (zip file) filename. For this reason, I chose different naming conventions in OpenCourseWare.reconcile_bitstreams_and_metadata.

    However, the messages logged by the reconcile CLI command also use the terms "item_identifiers_without_bitstreams" and "bitstreams_without_item_identifiers", which, again, feels a bit awkward for the OpenCourseWare workflow (and potentially other future workflows 🤔).

    @ghukill mentioned that you and he had discussed moving the logging out of the CLI command and into the workflow classes. As this work concerns the base Workflow class, I propose a separate ticket to clarify the language -- variable names and logged messages -- around the reconcile method.

  2. Update the method for creating DSpace metadata to support lists: The OpenCourseWare workflow includes a step that creates a delimited string of instructor names: a list of formatted names is retrieved and then joined by a delimiter. @ghukill proposed that the function could instead return a list if Workflow.create_dspace_metadata were updated to handle lists. As this work also concerns the base Workflow class, I propose a separate ticket for it -- and to update the OpenCourseWare workflow as part of that new ticket. (A hypothetical sketch follows below.)
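
A hypothetical sketch of point 2, assuming Workflow.create_dspace_metadata learned to accept lists (all names here are assumptions pending the proposed ticket):

def _get_instructor_names(self, instructors: list[dict[str, str]]) -> list[str]:
    """Return the formatted names as a list instead of a delimited string."""
    return [
        name
        for instructor in instructors
        if (name := _construct_instructor_name(instructor))
    ]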

@jonavellecuerdo marked this pull request as ready for review January 27, 2025 15:08
@ehanson8 (Contributor) commented:

Agree that 1 & 2 should be handled as separate tickets!

@ehanson8 (Contributor) left a comment:

Looking great but a few requested changes!

Comment on lines +201 to +183
def process_deposit_results(self) -> list[str]:
    """TODO: Stub method."""
    return [""]

@ehanson8 (Contributor) commented:

I think this can probably just default to Workflow.process_results but let's confirm with stakeholders!


@jonavellecuerdo (Contributor Author) replied:

To be addressed via ticket IN-1156.

@ghukill left a comment:

Overall, looking good to me. Thanks for the discussion the other day, which was quite helpful.

I left a couple of comments/suggestions for fairly minor, syntactical things. None are required or blocking.

I did have another comment that this PR surfaced for me. I'll start by saying this could perhaps be another ticket for exploration.

Should the CLI command reconcile potentially return a non-zero exit code? Thinking forward to automation, returning an exit code like 1 or 2 would indicate that reconciliation had failed in some way. This could be helpful for humans, but even more helpful for automation like StepFunctions. (A minimal sketch follows below.)

I think this came to mind in this PR given the conversations about reconciling and what it's conceptually communicating.
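
A minimal sketch of the exit-code idea, assuming the CLI is built on click (the command body and the run_reconcile helper are hypothetical):

import click

def run_reconcile() -> set[str]:
    """Hypothetical: return item identifiers for bitstreams missing metadata."""
    return {"ghi789"}

@click.command()
@click.pass_context
def reconcile(ctx: click.Context) -> None:
    """Reconcile bitstreams and metadata for a batch."""
    bitstreams_without_metadata = run_reconcile()
    if bitstreams_without_metadata:
        click.echo(
            "No item identifiers found for these bitstreams: "
            f"{bitstreams_without_metadata}"
        )
        ctx.exit(1)  # non-zero exit signals failure to automation (e.g., StepFunctions)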

jonavellecuerdo added a commit that referenced this pull request Jan 28, 2025
* Update docstring for OpenCourseWare workflow class
* Use 'removeprefix' over 'replace'
* Include assertion to check for logged 'FileNotFoundError'
@jonavellecuerdo marked this pull request as draft January 29, 2025 14:06
Why these changes are being introduced:
* Support OpenCourseWare deposits requested by Technical Services staff.

How this addresses that need:
* Define custom methods to extract metadata from 'data.json'
* Define custom 'get_bitstream_s3_uris' to filter to zip files
* Define custom methods to reconcile bitstreams with item metadata
(i.e., identify zip files without 'data.json' files)
* Create OpenCourseWare metadata mapping JSON file
* Add unit tests

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1096
* Update docstring for OpenCourseWare workflow class
* Use 'removeprefix' over 'replace'
* Include assertion to check for logged 'FileNotFoundError'
Comment on lines +41 to +46
s3_client = S3Client()
for file in s3_client.files_iter(
    bucket=self.s3_bucket, prefix=self.batch_path, file_type=".zip"
):
    zip_file = f"s3://{self.s3_bucket}/{file}"
    item_identifier = file.split("/")[-1].removesuffix(".zip")

@jonavellecuerdo (Contributor Author) commented:

Note: This function contains steps similar to OpenCourseWare.item_metadata_iter; only, instead of yielding a dictionary of metadata, it checks whether metadata can be extracted from the zip file -- tracking item identifiers in two lists:

  1. item_identifiers: item identifiers for all zip files
  2. bitstreams_without_metadata: item identifiers for all zip files without a 'data.json' file

@ghukill replied:

I was just going to comment here! Given the reconcile updates, and the code as-is, I think my proposal is much more refined now (where last time I was talking about passing around tuples of identifiers).

What about refactoring these shared, duplicated -- but important -- two lines into a standalone method?

item_identifier = file.split("/")[-1].removesuffix(".zip")
item_identifiers.append(item_identifier)

This would ensure that item identifiers are always constructed in the same way. Then both reconcile_bitstreams_and_metadata() and item_metadata_iter() could utilize this method.

I don't want to go so far as to suggest that every workflow have a parse_item_identifier() method... but it might not be bad form to establish that pattern here and then organically see if it's helpful for other workflows.
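
One possible shape for the suggested standalone method (a sketch built from the two quoted lines, not a final implementation):

def parse_item_identifier(self, file: str) -> str:
    """Derive an item identifier from an S3 key, e.g. 'opencourseware/batch-00/abc123.zip' -> 'abc123'."""
    return file.split("/")[-1].removesuffix(".zip")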

@ehanson8 (Contributor) replied:

@ghukill Did you mean these 2 lines? If so, I agree

zip_file = f"s3://{self.s3_bucket}/{file}"
item_identifier = file.split("/")[-1].removesuffix(".zip")

@jonavellecuerdo marked this pull request as ready for review February 3, 2025 15:25
@ghukill left a comment:

Think it's looking great!

My only request is a standalone method for the item identifier parsing, to ensure it's always done the same way for any part of the workflow that attempts to establish it for an item.


},
}
logger.error(json.dumps(reconcile_error_message))
raise ReconcileError(json.dumps(reconcile_error_message))

@ghukill commented:

I think all the reconcile discussions are paying dividends. This method feels true to this workflow now, raising exceptions when there is a problem.

)
return source_metadata

def _get_instructors_delimited_string(self, instructors: list[dict[str, str]]) -> str:

@ghukill commented:

Noting @jonavellecuerdo's comment #2 here, about exploring whether multi-value lists/arrays can be passed for metadata parsing, to avoid creating a delimited string just to parse it again.

I would be in favor of either an issue or a new ticket, and not incorporating that in this PR. But noting it here, as I think this PR exposes a situation where we're creating delimited strings from structured data only to re-structure them moments later. (A sketch of the current delimited-string approach follows this thread.)

@ehanson8 (Contributor) replied:

Agree that a ticket would be the best approach
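
For context, a sketch of what the current delimited-string approach might look like. The method signature is quoted above; the body and the "|" delimiter are inferred from the mapping file and the sample output, so treat this as an assumption rather than the actual implementation.

def _get_instructors_delimited_string(self, instructors: list[dict[str, str]]) -> str:
    """Join formatted instructor names into a single '|'-delimited string."""
    return "|".join(
        name
        for instructor in instructors
        if (name := _construct_instructor_name(instructor))
    )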

@ehanson8 (Contributor) left a comment:

Looks good to me, I concur with Graham's method and ticket suggestions and then full approval!

(i.e., '<item_identifier>'.zip)
bitstreams without metadata (any zip files without a 'data.json').
Metadata without bitstreams is not calculated as for a 'data.json' file to
exist, the zip file must also exist.

@ehanson8 (Contributor) commented:

Good docstring!

