In 1096 ocw workflow #91
base: main
Conversation
Force-pushed from 9f54f9f to 5097b39
Pull Request Test Coverage Report for Build 13116701898
💛 - Coveralls
"dc.contributor.author": { | ||
"source_field_name": "instructor", | ||
"language": "en_US", | ||
"delimiter": "|" | ||
} |
This mapping is made possible by the transformation of the `instructors` property when the `_read_metadata_json_file()` method is called.
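For context, a minimal sketch of what that transformation might look like (the method shape and field handling are assumptions; `_construct_instructor_name` is the helper shown in the next hunk):

```python
import json
from pathlib import Path


def _read_metadata_json_file(path: str) -> dict:
    """Load data.json and collapse 'instructors' into a pipe-delimited string."""
    metadata = json.loads(Path(path).read_text())
    # Transform the list of instructor dicts into the single delimited
    # "instructor" field that the mapping above splits on "|".
    metadata["instructor"] = "|".join(
        name
        for entry in metadata.pop("instructors", [])
        if (name := _construct_instructor_name(entry))
    )
    return metadata
```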
def _construct_instructor_name(instructor: dict[str, str]) -> str:
    """Given a dictionary of name fields, derive instructor name."""
    if not (last_name := instructor.get("last_name")) or not (
        first_name := instructor.get("first_name")
    ):
        return ""
    return f"{last_name}, {first_name} {instructor.get('middle_initial', '')}".strip()
While it is plausible that all the metadata in `data.json` will always be formatted as needed (i.e., all instructor name fields provided), it would be a good idea to check in with stakeholders (IN-1156) on the minimum required instructor name fields to construct an instructor name.

The sample mapping file we received, ocw_json_to_dspace_mapping.xlsx, indicates that instructor names must be formatted as:

`<last_name>, <first_name> <middle_initial>`

The code above will return an empty string if either the `last_name` or `first_name` is missing; it allows for missing `middle_initial` values.
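For illustration, how the method above behaves on a few sample inputs:

```python
_construct_instructor_name(
    {"last_name": "Doe", "first_name": "Jane", "middle_initial": "Q"}
)
# -> "Doe, Jane Q"

_construct_instructor_name({"last_name": "Doe", "first_name": "Jane"})
# -> "Doe, Jane" (trailing space removed by .strip())

_construct_instructor_name({"first_name": "Jane"})
# -> "" (missing last_name)
```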
Force-pushed from 7238447 to abcf2ae
Met with @ghukill last Friday and wanted to share some thoughts from our discussion:
Force-pushed from abcf2ae to 762997e
Agree that 1 & 2 should be handled as separate tickets! |
Looking great but a few requested changes!
def process_deposit_results(self) -> list[str]:
    """TODO: Stub method."""
    return [""]
I think this can probably just default to `Workflow.process_results`, but let's confirm with stakeholders!
To be addressed via ticket IN-1156.
Overall, looking good to me. Thanks for the discussion the other day, which was quite helpful.
I left a couple of comments/suggestions for fairly minor, syntactical things. None are required or blocking.
I did have another comment that this PR surfaced for me. I'll start by saying that perhaps this could be another ticket for exploring.
Should the CLI command `reconcile` potentially return a non-zero exit code? Thinking forward to automation, throwing an exit code like `1` or `2` would indicate that reconciliation had failed in some way. This could be helpful for humans, but more helpful for automation like StepFunctions.
I think this came to mind in this PR given the conversations about reconciling and what it's conceptually communicating.
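For example, a rough sketch of what that could look like, assuming a click-based CLI (`run_reconcile` and the wiring here are placeholders, not the actual API):

```python
import sys

import click


class ReconcileError(Exception):
    """Raised when bitstreams and metadata do not reconcile."""


def run_reconcile() -> None:
    """Placeholder for the workflow's reconcile step."""


@click.command()
def reconcile() -> None:
    try:
        run_reconcile()
    except ReconcileError as exc:
        click.echo(f"Reconcile failed: {exc}", err=True)
        sys.exit(1)  # non-zero exit code that automation (e.g., StepFunctions) can act on
    click.echo("Reconcile successful.")
```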
* Update docstring for OpenCourseWare workflow class
* Use 'removeprefix' over 'replace'
* Include assertion to check for logged 'FileNotFoundError'
Why these changes are being introduced:
* Support OpenCourseWare deposits requested by Technical Services staff.

How this addresses that need:
* Define custom methods to extract metadata from 'data.json'
* Define custom 'get_bitstream_s3_uris' to filter to zip files
* Define custom methods to reconcile bitstreams with item metadata (i.e., identify zip files without 'data.json' files)
* Create OpenCourseWare metadata mapping JSON file
* Add unit tests

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1096
Force-pushed from 145447e to 655da2e
s3_client = S3Client()
for file in s3_client.files_iter(
    bucket=self.s3_bucket, prefix=self.batch_path, file_type=".zip"
):
    zip_file = f"s3://{self.s3_bucket}/{file}"
    item_identifier = file.split("/")[-1].removesuffix(".zip")
Note: This function contains similar steps to `OpenCourseWare.item_metadata_iter`; only, instead of yielding a dictionary of metadata, it checks whether metadata can be extracted from the zip file -- tracking item identifiers in two lists:

* `item_identifiers`: list of item identifiers for all zip files
* `bitstreams_without_metadata`: list of item identifiers for all zip files without a 'data.json' file
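In other words, something along these lines (a simplified sketch; `_zip_contains_metadata` is a hypothetical stand-in for the actual data.json check):

```python
def reconcile_bitstreams_and_metadata(self) -> None:
    """Track identifiers for all zip files and for zips lacking 'data.json'."""
    item_identifiers: list[str] = []
    bitstreams_without_metadata: list[str] = []
    s3_client = S3Client()
    for file in s3_client.files_iter(
        bucket=self.s3_bucket, prefix=self.batch_path, file_type=".zip"
    ):
        item_identifier = file.split("/")[-1].removesuffix(".zip")
        item_identifiers.append(item_identifier)
        # hypothetical helper: can a 'data.json' be read from this zip?
        if not self._zip_contains_metadata(f"s3://{self.s3_bucket}/{file}"):
            bitstreams_without_metadata.append(item_identifier)
    if bitstreams_without_metadata:
        raise ReconcileError(
            f"Bitstreams without metadata: {bitstreams_without_metadata}"
        )
```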
I was just going to comment here! Given the reconcile updates, and the code as-is, I think my proposal is much more refined now (where last time I was talking about passing around tuples of identifiers).

What about refactoring these shared, duplicated -- but important -- two lines into a standalone method?

item_identifier = file.split("/")[-1].removesuffix(".zip")
item_identifiers.append(item_identifier)

This would ensure that item identifiers are always constructed in the same way. Then both `reconcile_bitstreams_and_metadata()` and `item_metadata_iter()` could utilize this method.

I don't want to go so far as to suggest that every workflow have a `parse_item_identifier()` method... but it might not be bad form to establish that pattern here and then organically see if it's helpful for other workflows.
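For illustration, a minimal sketch of the proposed helper (exact placement and signature to be decided):

```python
def parse_item_identifier(self, file: str) -> str:
    """Derive an item identifier from a zip file's S3 key.

    e.g., "opencourseware/batch-00/abc-123.zip" -> "abc-123"
    """
    return file.split("/")[-1].removesuffix(".zip")
```

Both `reconcile_bitstreams_and_metadata()` and `item_metadata_iter()` could then call `self.parse_item_identifier(file)` instead of repeating the split/removesuffix logic.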
@ghukill Did you mean these 2 lines? If so, I agree
zip_file = f"s3://{self.s3_bucket}/{file}"
item_identifier = file.split("/")[-1].removesuffix(".zip")
Think it's looking great!
My only request is a standalone method for the item identifier parsing, to ensure it's always done the same way for any part of the workflow that attempts to establish it for an item.
    },
}
logger.error(json.dumps(reconcile_error_message))
raise ReconcileError(json.dumps(reconcile_error_message))
I think all the reconcile discussions are paying dividends. This method feels true to this workflow now, raising exceptions when there is a problem.
)
return source_metadata

def _get_instructors_delimited_string(self, instructors: list[dict[str, str]]) -> str:
Noting @jonavellecuerdo's comment #2 here, around exploring whether multi-value lists/arrays can be passed for metadata parsing, to avoid creating a delimited string just to parse it again.

I would be in favor of either an issue or a new ticket rather than incorporating that in this PR. But noting here, as I think this PR exposes a situation where we're creating delimited strings from structured data only to re-structure it moments later.
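To illustrate the round trip in question (field names assumed from the mapping above):

```python
instructors = [
    {"last_name": "Doe", "first_name": "Jane"},
    {"last_name": "Smith", "first_name": "John"},
]

# structured data -> delimited string...
delimited = "|".join(f"{i['last_name']}, {i['first_name']}" for i in instructors)
# 'Doe, Jane|Smith, John'

# ...which downstream metadata parsing splits right back apart
names = delimited.split("|")
# ['Doe, Jane', 'Smith, John']
```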
Agree that a ticket would be the best approach
Looks good to me! I concur with Graham's method and ticket suggestions, and then full approval!
(i.e., '<item_identifier>'.zip) | ||
bitstreams without metadata (any zip files without a 'data.json'). | ||
Metadata without bitstreams is not calculated as for a 'data.json' file to | ||
exist, the zip file must also exist. |
Good docstring!
Purpose and background context
This PR creates the `OpenCourseWare` DSC workflow.

This also introduces the following changes:

* `SimpleCSV` workflow. The result is that when we run `make test`, workflow test results appear in the terminal together.

How can a reviewer manually see the effects of these changes?
A. Review the added unit tests.
Note: The only custom method defined for `OpenCourseWare` without a unit test is the `item_metadata_iter` method. See method B for testing `OpenCourseWare` methods with a MinIO server.

B. Optional but highly recommended (especially for future development).
Run `OpenCourseWare` commands using a local MinIO server.

Prerequisite
Follow instructions in README: Running a Local MinIO Server.
Note: As of this writing, the root password set for the local MinIO server must be at least 8 characters long. Didn't want to write this requirement in the README as it is subject to change if/when we download updated versions of the MinIO Docker image.
Mock out the local MinIO server with test zip files.
Note: I did these steps via the WebUI.
* In the `dsc` bucket, upload test zip files to `dsc/opencourseware/batch-00/`. It is not important to mock other files, as the bitstream for OpenCourseWare deposits is the zip file itself. Each of these zip files should contain a `data.json` file.
* Upload test zip files that are missing a `data.json` file to `dsc/opencourseware/batch-01/`.
Add the following environment variables in your `.env` file.

Run `OpenCourseWare` commands

Launch Python in your terminal:
pipenv run python
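Then, something along these lines (the import path and constructor arguments are assumptions; check the actual workflow signature):

```python
from dsc.workflows import OpenCourseWare  # import path assumed

workflow = OpenCourseWare(
    s3_bucket="dsc",  # bucket mocked in MinIO above
    batch_path="opencourseware/batch-00/",
)
for item_metadata in workflow.item_metadata_iter():
    print(item_metadata)
```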
Check the `item_metadata_iter()` result for `batch-00`. You should see the following output:
Check the `item_metadata_iter()` result for `batch-01`. You should see the following output:
A `FileNotFoundError` is raised if any zip file is missing metadata (i.e., the `data.json` file).
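A plausible shape for that check (a sketch only, not the actual implementation):

```python
import zipfile


def read_data_json(zip_path: str) -> bytes:
    """Return the bytes of 'data.json' from a course zip file."""
    with zipfile.ZipFile(zip_path) as zf:
        try:
            return zf.read("data.json")
        except KeyError as exc:  # zipfile raises KeyError for missing members
            raise FileNotFoundError(f"No data.json found in {zip_path}") from exc
```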
CLI command: `reconcile`
Run `reconcile` for `batch-00` in your terminal. You should see the following output [REDACTED]:
Run `reconcile` for `batch-01` in your terminal. You should see the following output [REDACTED]:
Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)