
In 1096 ocw workflow #91

Open
jonavellecuerdo wants to merge 3 commits into main from IN-1096-ocw-workflow

Conversation

@jonavellecuerdo (Contributor) commented Jan 23, 2025

Purpose and background context

This PR creates the OpenCourseWare DSC workflow.

This also introduces the following changes:

How can a reviewer manually see the effects of these changes?

A. Review the added unit tests.
Note: The only custom method defined for OpenCourseWare without a unit test is the item_metadata_iter method. See option B below for testing it against a local MinIO server.

B. Optional but highly recommended (especially for future development):
Run the OpenCourseWare commands against a local MinIO server.

Prerequisites

  1. Follow the instructions in the README: Running a Local MinIO Server.
    Note: As of this writing, the root password set for the local MinIO server must be at least 8 characters long. I didn't want to add this requirement to the README because it is subject to change if/when we pull updated versions of the MinIO Docker image.

  2. Populate the local MinIO server with test zip files (a sketch for generating these zips locally follows this list).
    Note: I did these steps via the WebUI.

    • Create paths (i.e., prefixes) in the dsc bucket:
      • dsc/opencourseware/batch-00/
        • Upload two (2) sample zip files with metadata.
          It is not important to mock other files, as the bitstream for OpenCourseWare deposits is the zip file itself.
          • abc123.zip: zip file containing a single data.json.
          • def456.zip: zip file containing a single data.json.
      • dsc/opencourseware/batch-01/
        • Upload one (1) sample zip file without metadata.
  3. Add the following environment variables in your .env file.

    AWS_ENDPOINT_URL=http://localhost:9000/
    AWS_ACCESS_KEY_ID=<local-minio-username>
    AWS_SECRET_ACCESS_KEY=<local-minio-password>
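
Below is a minimal sketch for generating the sample zip files locally before uploading them via the WebUI. The data.json schema shown here is an assumption inferred from the expected item_metadata_iter() output further down; adjust it to the real OpenCourseWare schema as needed.

import json
import zipfile

# abc123.zip / def456.zip: zips whose only member is data.json
sample_metadata = {
    "course_title": "Matrix Calculus for Machine Learning and Beyond",
    "course_description": "We all know that calculus courses.",
    "site_uid": "2318fd9f-1b5c-4a48-8a04-9c56d902a1f8",
    "instructors": [  # assumed raw structure; the workflow formats these names
        {"first_name": "Alan", "last_name": "Edelman"},
        {"first_name": "Steven", "last_name": "Johnson", "middle_initial": "G."},
    ],
}
with zipfile.ZipFile("abc123.zip", "w") as zf:
    zf.writestr("data.json", json.dumps(sample_metadata))

# ghi789.zip (batch-01): a zip with no data.json, to exercise the error path
with zipfile.ZipFile("ghi789.zip", "w") as zf:
    zf.writestr("placeholder.txt", "intentionally missing data.json")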
    

OpenCourseWare commands
Launch Python in your terminal: pipenv run python

  1. Check item_metadata_iter() result for batch-00.
from dsc.workflows import OpenCourseWare
opencourseware_workflow_instance = OpenCourseWare(
    collection_handle="blah", batch_id="batch-00", email_recipients="[email protected]"
)
item_metadata_iter = opencourseware_workflow_instance.item_metadata_iter()
list(item_metadata_iter)

You should see the following output:

[
    {
        "item_identifier": "abc123",
        "course_title": "Matrix Calculus for Machine Learning and Beyond",
        "course_description": "We all know that calculus courses.",
        "site_uid": "2318fd9f-1b5c-4a48-8a04-9c56d902a1f8",
        "instructors": "Edelman, Alan|Johnson, Steven G.",
    },
    {
        "item_identifier": "def456",
        "course_title": "Burgers and Beyond",
        "course_description": "Investigating the paranormal, one burger at a time.",
        "site_uid": "2318fd9f-1b5c-4a48-8a04-9c56d902a1f8",
        "instructors": "Burger, Cheese E.",
    },
]
  2. Check item_metadata_iter() result for batch-01.
from dsc.workflows import OpenCourseWare
opencourseware_workflow_instance = OpenCourseWare(
    collection_handle="blah", batch_id="batch-01", email_recipients="[email protected]"
)
item_metadata_iter = opencourseware_workflow_instance.item_metadata_iter()
list(item_metadata_iter)

You should see the following output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jcuerdo/Documents/repos/dspace-submission-composer/dsc/workflows/opencourseware.py", line 60, in item_metadata_iter
    **self._extract_metadata_from_zip_file(zip_file, item_identifier),
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jcuerdo/Documents/repos/dspace-submission-composer/dsc/workflows/opencourseware.py", line 76, in _extract_metadata_from_zip_file
    raise FileNotFoundError(
FileNotFoundError: The required file 'data.json' file was not found in the zip file: s3://dsc/opencourseware/batch-01/ghi789.zip

A FileNotFoundError is raised if any zip file is missing metadata (i.e., the data.json file).
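
For reference, a sketch of what the metadata extraction might look like. The function name and error message are taken from the traceback above; the simplified signature and the body (including the use of boto3) are assumptions, not the actual implementation.

import io
import json
import zipfile

import boto3

def _extract_metadata_from_zip_file(zip_file: str) -> dict:
    """Read 'data.json' out of a zip stored in S3 (illustrative sketch only)."""
    bucket, _, key = zip_file.removeprefix("s3://").partition("/")
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    with zipfile.ZipFile(io.BytesIO(body)) as zf:
        if "data.json" not in zf.namelist():
            raise FileNotFoundError(
                f"The required file 'data.json' file was not found "
                f"in the zip file: {zip_file}"
            )
        return json.loads(zf.read("data.json"))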

CLI command: reconcile

  1. Run reconcile for batch-00 in your terminal:
pipenv run dsc -w "opencourseware" -c "abc123" -b "batch-00" -e "[email protected]" reconcile

You should see the following output [REDACTED]:

Loading .env environment variables...
2025-01-27 10:03:11,074 INFO root.configure_logger(): INFO
2025-01-27 10:03:11,075 INFO dsc.cli.main(): Logger 'root' configured with level=INFO
2025-01-27 10:03:11,075 INFO dsc.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
2025-01-27 10:03:11,075 INFO dsc.cli.main(): Running process
2025-01-27 10:03:11,094 INFO botocore.credentials.load(): Found credentials in environment variables.
2025-01-27 10:03:11,340 INFO botocore.configprovider.provide(): Found endpoint for s3 via: environment_global.
2025-01-27 10:03:11,426 INFO botocore.configprovider.provide(): Found endpoint for s3 via: environment_global.
2025-01-27 10:03:11,438 INFO botocore.httpchecksum.handle_checksum_body(): Skipping checksum validation. Response did not contain one of the following algorithms: ['crc32', 'sha1', 'sha256'].
. . .
2025-01-27 10:03:11,515 INFO botocore.httpchecksum.handle_checksum_body(): Skipping checksum validation. Response did not contain one of the following algorithms: ['crc32', 'sha1', 'sha256'].
2025-01-27 10:03:11,515 INFO dsc.cli.reconcile(): All item identifiers and bitstreams successfully matched
2025-01-27 10:03:11,515 INFO dsc.cli.post_main_group_subcommand(): Application exiting
2025-01-27 10:03:11,515 INFO dsc.cli.post_main_group_subcommand(): Total time elapsed: 0:00:00.440784
  2. Run reconcile for batch-01 in your terminal:
pipenv run dsc -w "opencourseware" -c "abc123" -b "batch-01" -e "[email protected]" reconcile

You should see the following output [REDACTED]:

Loading .env environment variables...
2025-01-27 10:06:44,845 INFO root.configure_logger(): INFO
2025-01-27 10:06:44,845 INFO dsc.cli.main(): Logger 'root' configured with level=INFO
2025-01-27 10:06:44,845 INFO dsc.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
2025-01-27 10:06:44,845 INFO dsc.cli.main(): Running process
2025-01-27 10:06:44,857 INFO botocore.credentials.load(): Found credentials in environment variables.
2025-01-27 10:06:44,977 INFO botocore.configprovider.provide(): Found endpoint for s3 via: environment_global.
2025-01-27 10:06:45,015 INFO botocore.configprovider.provide(): Found endpoint for s3 via: environment_global.
2025-01-27 10:06:45,023 INFO botocore.httpchecksum.handle_checksum_body(): Skipping checksum validation. Response did not contain one of the following algorithms: ['crc32', 'sha1', 'sha256'].
...
2025-01-27 10:06:45,033 INFO botocore.httpchecksum.handle_checksum_body(): Skipping checksum validation. Response did not contain one of the following algorithms: ['crc32', 'sha1', 'sha256'].
2025-01-27 10:06:45,035 ERROR dsc.workflows.opencourseware._identify_bitstreams_with_metadata(): The required file 'data.json' file was not found in the zip file: s3://dsc/opencourseware/batch-01/ghi789.zip
2025-01-27 10:06:45,036 ERROR dsc.cli.reconcile(): No item identifiers found for these bitstreams: {'ghi789'}
2025-01-27 10:06:45,036 INFO dsc.cli.post_main_group_subcommand(): Application exiting
2025-01-27 10:06:45,036 INFO dsc.cli.post_main_group_subcommand(): Total time elapsed: 0:00:00.191134

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

  • IN-1096: https://mitlibraries.atlassian.net/browse/IN-1096

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

@jonavellecuerdo self-assigned this Jan 23, 2025
@jonavellecuerdo force-pushed the IN-1096-ocw-workflow branch 2 times, most recently from 9f54f9f to 5097b39 on January 23, 2025 18:09
@coveralls commented Jan 23, 2025

Pull Request Test Coverage Report for Build 13116701898

Details

  • 58 of 64 (90.63%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-1.1%) to 98.674%

Changes missing coverage (covered lines / changed or added lines / %):
  • dsc/workflows/opencourseware.py: 56 / 62 (90.32%)

Totals:
  • Change from base Build 13080178348: -1.1%
  • Covered Lines: 521
  • Relevant Lines: 528

💛 - Coveralls

Comment on lines 32 to 26
"dc.contributor.author": {
"source_field_name": "instructor",
"language": "en_US",
"delimiter": "|"
}

Comment on lines +132 to +163
def _construct_instructor_name(instructor: dict[str, str]) -> str:
    """Given a dictionary of name fields, derive instructor name."""
    if not (last_name := instructor.get("last_name")) or not (
        first_name := instructor.get("first_name")
    ):
        return ""
    return f"{last_name}, {first_name} {instructor.get("middle_initial", "")}".strip()
@jonavellecuerdo (Contributor Author) commented Jan 23, 2025

While it is plausible that the metadata in data.json will always be formatted as needed (i.e., all instructor name fields provided), it would be a good idea to check in with stakeholders (IN-1156) on the minimum required instructor name fields for constructing an instructor name.

The sample mapping file we received, ocw_json_to_dspace_mapping.xlsx, indicates that instructor names must be formatted as:

<last_name>, <first_name> <middle_initial>

The code above will return an empty string if either the last_name or first_name is missing; it allows for missing middle_initial values.
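
To illustrate that behavior, here are a few example calls against the helper quoted above (the input dictionaries are hypothetical):

_construct_instructor_name({"last_name": "Edelman", "first_name": "Alan"})
# -> "Edelman, Alan"

_construct_instructor_name(
    {"last_name": "Johnson", "first_name": "Steven", "middle_initial": "G."}
)
# -> "Johnson, Steven G."

_construct_instructor_name({"last_name": "Burger"})
# -> "" (first_name is missing, so no name is constructed)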

@jonavellecuerdo force-pushed the IN-1096-ocw-workflow branch 4 times, most recently from 7238447 to abcf2ae on January 24, 2025 20:50
@jonavellecuerdo (Contributor Author) commented Jan 27, 2025

Met with @ghukill last Friday and wanted to share some thoughts from our discussion:

  1. Clarify language around the reconcile method: The base Workflow.reconcile_bitstreams_and_metadata method uses the terms "item_identifiers_without_bitstreams" and "bitstreams_without_item_identifiers". The use of these terms felt a bit awkward with the OpenCourseWare workflow for the following reason:

    . . . the 'reconcile' method only determines whether there are any bitstreams without metadata (any zip files without a 'data.json'). Metadata without bitstreams is basically impossible because the metadata ('data.json') is inside the bitstream (zip file).

    My interpretation of reconcile is that it is really about determining whether metadata is provided for each bitstream. In the case of OpenCourseWare, all the bitstreams (the zip files) have "item_identifiers" because these are provided in the bitstream (zip file) filename. For this reason, I chose different naming conventions in OpenCourseWare.reconcile_bitstreams_and_metadata.

    However, the messages logged by the reconcile CLI command also use the terms "item_identifiers_without_bitstreams" and "bitstreams_without_item_identifiers", which, again, feels a bit awkward for the OpenCourseWare workflow (and potentially other future workflows 🤔).

    @ghukill mentioned that you and he had discussed moving the logging out of the CLI command and into the workflow classes. As this work concerns the base Workflow class, I propose a separate ticket to clarify the language -- variable names and logged messages -- around the reconcile method.

  2. Update the method for creating DSpace metadata to support lists: The OpenCourseWare workflow includes a step that creates a delimited string of instructor names: a list of formatted names is retrieved and then joined by a delimiter. @ghukill proposed that the function could instead return a list if Workflow.create_dspace_metadata were updated to handle lists. As this work also concerns the base Workflow class, I propose a separate ticket for it -- and to update the OpenCourseWare workflow as part of that new ticket. (A hypothetical sketch follows below.)
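
A hypothetical sketch of point 2, assuming Workflow.create_dspace_metadata learned to accept lists (all names here are assumptions pending the proposed ticket):

def _get_instructor_names(self, instructors: list[dict[str, str]]) -> list[str]:
    """Return the formatted names as a list instead of a delimited string."""
    return [
        name
        for instructor in instructors
        if (name := _construct_instructor_name(instructor))
    ]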

@jonavellecuerdo marked this pull request as ready for review January 27, 2025 15:08
@ehanson8 (Contributor) commented:

Agree that 1 & 2 should be handled as separate tickets!

@ehanson8 (Contributor) left a comment:

Looking great but a few requested changes!

Comment on lines +201 to +183
def process_deposit_results(self) -> list[str]:
    """TODO: Stub method."""
    return [""]

@ehanson8 (Contributor) commented:

I think this can probably just default to Workflow.process_results but let's confirm with stakeholders!


@jonavellecuerdo (Contributor Author) replied:

To be addressed via ticket IN-1156.

@ghukill left a comment:

Overall, looking good to me. Thanks for the discussion the other day, which was quite helpful.

I left a couple of comments/suggestions for fairly minor, syntactical things. None are required or blocking.

I did have another comment that this PR surfaced for me. I'll start by saying this could perhaps be another ticket for exploration.

Should the CLI command reconcile potentially return a non-zero exit code? Thinking forward to automation, returning an exit code like 1 or 2 would indicate that reconciliation had failed in some way. This could be helpful for humans, but even more helpful for automation like StepFunctions. (A minimal sketch follows below.)

I think this came to mind in this PR given the conversations about reconciling and what it's conceptually communicating.
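
A minimal sketch of the exit-code idea, assuming the CLI is built on click (the command body and the run_reconcile helper are hypothetical):

import click

def run_reconcile() -> set[str]:
    """Hypothetical: return item identifiers for bitstreams missing metadata."""
    return {"ghi789"}

@click.command()
@click.pass_context
def reconcile(ctx: click.Context) -> None:
    """Reconcile bitstreams and metadata for a batch."""
    bitstreams_without_metadata = run_reconcile()
    if bitstreams_without_metadata:
        click.echo(
            "No item identifiers found for these bitstreams: "
            f"{bitstreams_without_metadata}"
        )
        ctx.exit(1)  # non-zero exit signals failure to automation (e.g., StepFunctions)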

jonavellecuerdo added a commit that referenced this pull request Jan 28, 2025
* Update docstring for OpenCourseWare workflow class
* Use 'removeprefix' over 'replace'
* Include assertion to check for logged 'FileNotFoundError'
@jonavellecuerdo marked this pull request as draft January 29, 2025 14:06
Why these changes are being introduced:
* Support OpenCourseWare deposits requested by Technical Services staff.

How this addresses that need:
* Define custom methods to extract metadata from 'data.json'
* Define custom 'get_bitstream_s3_uris' to filter to zip files
* Define custom methods to reconcile bitstreams with item metadata
(i.e., identify zip files without 'data.json' files)
* Create OpenCourseWare metadata mapping JSON file
* Add unit tests

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1096
* Update docstring for OpenCourseWare workflow class
* Use 'removeprefix' over 'replace'
* Include assertion to check for logged 'FileNotFoundError'
Comment on lines +41 to +46
s3_client = S3Client()
for file in s3_client.files_iter(
    bucket=self.s3_bucket, prefix=self.batch_path, file_type=".zip"
):
    zip_file = f"s3://{self.s3_bucket}/{file}"
    item_identifier = file.split("/")[-1].removesuffix(".zip")

@jonavellecuerdo (Contributor Author) commented:

Note: This function contains steps similar to OpenCourseWare.item_metadata_iter; only, instead of yielding a dictionary of metadata, it checks whether metadata can be extracted from the zip file -- tracking item identifiers in two lists:

  1. item_identifiers: item identifiers for all zip files
  2. bitstreams_without_metadata: item identifiers for all zip files without a 'data.json' file

@ghukill replied:

I was just going to comment here! Given the reconcile updates, and the code as-is, I think my proposal is much more refined now (where last time I was talking about passing around tuples of identifiers).

What about refactoring these shared, duplicated -- but important -- two lines into a standalone method?

item_identifier = file.split("/")[-1].removesuffix(".zip")
item_identifiers.append(item_identifier)

This would ensure that item identifiers are always constructed in the same way. Then both reconcile_bitstreams_and_metadata() and item_metadata_iter() could utilize this method.

I don't want to go so far as to suggest that every workflow have a parse_item_identifier() method... but it might not be bad form to establish that pattern here and then organically see if it's helpful for other workflows.
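
One possible shape for the suggested standalone method (a sketch built from the two quoted lines, not a final implementation):

def parse_item_identifier(self, file: str) -> str:
    """Derive an item identifier from an S3 key, e.g. 'opencourseware/batch-00/abc123.zip' -> 'abc123'."""
    return file.split("/")[-1].removesuffix(".zip")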

@ehanson8 (Contributor) replied:

@ghukill Did you mean these 2 lines? If so, I agree

zip_file = f"s3://{self.s3_bucket}/{file}"
item_identifier = file.split("/")[-1].removesuffix(".zip")

@jonavellecuerdo marked this pull request as ready for review February 3, 2025 15:25
@ghukill left a comment:

Think it's looking great!

My only request is a standalone method for the item identifier parsing, to ensure it's always done the same way for any part of the workflow that attempts to establish it for an item.


},
}
logger.error(json.dumps(reconcile_error_message))
raise ReconcileError(json.dumps(reconcile_error_message))

@ghukill commented:

I think all the reconcile discussions are paying dividends. This method feels true to this workflow now, raising exceptions when there is a problem.

)
return source_metadata

def _get_instructors_delimited_string(self, instructors: list[dict[str, str]]) -> str:

@ghukill commented:

Noting @jonavellecuerdo's comment #2 here, about exploring whether multi-value lists/arrays can be passed for metadata parsing, to avoid creating a delimited string just to parse it again.

I would be in favor of either an issue or a new ticket, and not incorporating that in this PR. But noting it here, as I think this PR exposes a situation where we're creating delimited strings from structured data only to re-structure them moments later. (A sketch of the current delimited-string approach follows this thread.)

@ehanson8 (Contributor) replied:

Agree that a ticket would be the best approach
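
For context, a sketch of what the current delimited-string approach might look like. The method signature is quoted above; the body and the "|" delimiter are inferred from the mapping file and the sample output, so treat this as an assumption rather than the actual implementation.

def _get_instructors_delimited_string(self, instructors: list[dict[str, str]]) -> str:
    """Join formatted instructor names into a single '|'-delimited string."""
    return "|".join(
        name
        for instructor in instructors
        if (name := _construct_instructor_name(instructor))
    )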

@ehanson8 (Contributor) left a comment:

Looks good to me, I concur with Graham's method and ticket suggestions and then full approval!

(i.e., '<item_identifier>'.zip)
bitstreams without metadata (any zip files without a 'data.json').
Metadata without bitstreams is not calculated as for a 'data.json' file to
exist, the zip file must also exist.

@ehanson8 (Contributor) commented:

Good docstring!

