[FCL-167] Import bailii tab from bailii-docx s3 bucket #172

Merged
dragon-dxw merged 1 commit into main from docs/import-spreadsheet on Aug 27, 2024

@dragon-dxw dragon-dxw commented Jul 9, 2024

Why are we using -e staging?
Because we're using Dalmatian v2; we could use -i test and it'd work.

What's going on with the --acl?
We need each file to be world readable for the public bucket, but this isn't valid for the private one.

Example output (with dryrun on)

Skipping ewca/civ/2003/1/ewca_civ_2003_1.docx, exists
Skipping ewca/civ/2003/1002/ewca_civ_2003_1002.docx, exists
Writing rtf/EWCA/Civ/2003/1005.docx to ewca/civ/2003/1005/ewca_civ_2003_1005.docx...
['dalmatian', 'aws-sso', 'run-command', '-i', 'caselaw', '-e', 'staging', 's3', 'cp', 's3://bailii-docx/rtf/EWCA/Civ/2003/1005.docx', 's3://tna-caselaw-unpublished-assets/ewca/civ/2003/1005/ewca_civ_2003_1005.docx', '--acl', 'bucket-owner-full-control', '--dryrun']
(dryrun) copy: s3://bailii-docx/rtf/EWCA/Civ/2003/1005.docx to s3://tna-caselaw-unpublished-assets/ewca/civ/2003/1005/ewca_civ_2003_1005.docx
['dalmatian', 'aws-sso', 'run-command', '-i', 'caselaw', '-e', 'staging', 's3', 'cp', 's3://bailii-docx/rtf/EWCA/Civ/2003/1005.docx', 's3://tna-caselaw-assets/ewca/civ/2003/1005/ewca_civ_2003_1005.docx', '--acl', 'bucket-owner-full-control', '--acl', 'public-read', '--dryrun']
(dryrun) copy: s3://bailii-docx/rtf/EWCA/Civ/2003/1005.docx to s3://tna-caselaw-assets/ewca/civ/2003/1005/ewca_civ_2003_1005.docx
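
The two commands in the output above differ only in the target bucket and the extra `--acl public-read` flag. As a minimal sketch of how such an argument list might be assembled (this is not the PR's actual code; `build_copy_command` and its parameters are hypothetical names, while the bucket names and flags are copied from the output above):

```python
SOURCE_BUCKET = "bailii-docx"
PRIVATE_BUCKET = "tna-caselaw-unpublished-assets"
PUBLIC_BUCKET = "tna-caselaw-assets"


def build_copy_command(source_key, target_key, target_bucket, public=False, dry_run=True):
    """Hypothetical helper: build the argument list for one S3 copy."""
    command = [
        "dalmatian", "aws-sso", "run-command",
        "-i", "caselaw", "-e", "staging",
        "s3", "cp",
        f"s3://{SOURCE_BUCKET}/{source_key}",
        f"s3://{target_bucket}/{target_key}",
        "--acl", "bucket-owner-full-control",
    ]
    if public:
        # Only the public bucket gets world-readable objects.
        command += ["--acl", "public-read"]
    if dry_run:
        command.append("--dryrun")
    return command
```

Passing the command as a list (rather than a shell string) means keys with unusual characters do not need quoting.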

Jira

FCL-167

@dragon-dxw dragon-dxw marked this pull request as draft July 18, 2024 16:36
@dragon-dxw dragon-dxw force-pushed the docs/import-spreadsheet branch from 1317cdf to decc59a Compare July 18, 2024 16:46
@dragon-dxw dragon-dxw marked this pull request as ready for review July 19, 2024 11:59
@dragon-dxw dragon-dxw changed the title from "Document the import spreadsheet" to "Import bailii tab from bailii-docx s3 bucket" Jul 19, 2024
@dragon-dxw
Collaborator Author

Discussion in standup: we should also check whether each document is published, and log those that are not.

@jlhdxw jlhdxw left a comment

LGTM, some minor comments that may not be an issue


class Row(BaseRow):
    def source_key(self):
        extension_match = re.search(r"^(.*)\.([a-z]*)$", self.filename)
Collaborator

Is it ever possible that this won't match? Do we want to catch the case where the extension doesn't match?

Collaborator Author

I have checked and the extensions should be exactly the ones on L50. Anything else means we've changed the file. I'm 100% okay with this being terribly, terribly brittle (as long as it explodes safely) since it should only ever be running against one file.
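
One way to make a non-match "explode safely" is to raise explicitly, rather than rely on the `AttributeError` that calling `.group()` on `None` would produce. A minimal sketch, assuming a hypothetical standalone helper `split_extension` (the regex is the one from the diff; the PR's own handling may differ):

```python
import re


def split_extension(filename):
    """Split 'name.ext' into ('name', 'ext'); raise if the pattern does not match."""
    extension_match = re.search(r"^(.*)\.([a-z]*)$", filename)
    if extension_match is None:
        # No dot, or an extension containing anything other than a-z.
        raise ValueError(f"Unexpected filename: {filename!r}")
    return extension_match.group(1), extension_match.group(2)
```

Note that the pattern only accepts lowercase extensions, so a name such as `1005.DOC` would raise here.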

        return f"{path}/{filename}"

    def has_docx_in_s3(self):
        response = requests.head(f"{ASSETS_BASE}/{self.target_key()}", timeout=30)
Collaborator

If this request fails, does the whole script end? Are we happy for that to happen, or could we add a try/except here and allow processing of the other assets? (Same for is_published.)

Collaborator Author

Yes. We're going to be running this manually; I'm personally OK with brittleness over uncertain behaviour.
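
For comparison, the try/except alternative raised in the review might look roughly like the sketch below. `check_rows` and `check` are hypothetical names; in the PR the check would be a method such as `has_docx_in_s3`, which makes the `requests.head` call. The reply above indicates the script is deliberately left to fail fast instead.

```python
def check_rows(rows, check):
    """Apply `check` to each row, collecting failures instead of stopping."""
    results = {}
    failures = []
    for row in rows:
        try:
            results[row] = check(row)
        except Exception as error:  # deliberately broad: this is only a sketch
            failures.append((row, error))
    return results, failures
```

The trade-off is the one discussed in this thread: collecting failures lets a run continue, but a partial run is harder to reason about than one that stops at the first error.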


nice_data = []
for row in raw_data[1:]:
    row_object = Row(**dict(zip(headers, row)))
Collaborator

I'm not sure about the data here (assuming it's generated by the user of this script), but is it possible that some rows don't have all the data? We could check here first with if len(row) == len(headers): ... to ensure they have all the columns.

Collaborator Author

Good shout.
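
The check matters because `zip` silently stops at the shorter of its inputs, so a short row would otherwise produce a dict with missing keys. A minimal sketch of the length check, assuming a hypothetical helper `rows_to_dicts` that raises on a mismatch (in keeping with the fail-fast preference elsewhere in this review; the PR itself passes each dict to `Row(**...)` and may skip or report bad rows differently):

```python
def rows_to_dicts(raw_data):
    """Pair each data row with the header row, rejecting rows of the wrong length."""
    headers = raw_data[0]
    nice_data = []
    # Data rows start on line 2 of the spreadsheet, after the header row.
    for line_number, row in enumerate(raw_data[1:], start=2):
        if len(row) != len(headers):
            raise ValueError(
                f"Row {line_number} has {len(row)} columns, expected {len(headers)}"
            )
        nice_data.append(dict(zip(headers, row)))
    return nice_data
```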

@jacksonj04 jacksonj04 changed the title from "Import bailii tab from bailii-docx s3 bucket" to "[FCL-167] Import bailii tab from bailii-docx s3 bucket" Aug 1, 2024
@jacksonj04 jacksonj04 self-requested a review August 1, 2024 14:19
@jacksonj04 jacksonj04 left a comment

Couple of immediate comments, now going through in depth

else:
    raise RuntimeError

dryrun_bonus = [DRY_RUN] if DRY_RUN else []
Collaborator

I'm unclear on the logic here. DRY_RUN is set to "--dryrun", so it will always evaluate to True. Should there be some argument parsing going on?

Collaborator Author

I've hard coded it to avoid screwups during development; but yes, argument parsing sounds like a plan.
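
A minimal sketch of what that argument parsing could look like with the standard library's `argparse`. The function names and the description string are hypothetical; `--dryrun` is the flag already passed to `s3 cp` in the example output, and the merged script may handle this differently.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(
        description="Copy BAILII .docx files between S3 buckets"
    )
    parser.add_argument(
        "--dryrun",
        action="store_true",
        help="pass --dryrun to 's3 cp' so that nothing is written",
    )
    return parser.parse_args(argv)


def dryrun_arguments(args):
    """Return the extra arguments to append to each copy command."""
    return ["--dryrun"] if args.dryrun else []
```

With `store_true`, omitting the flag performs a real copy. For a script that is run manually, inverting the default (dry run unless an explicit flag is given) may be the safer choice.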

@dragon-dxw dragon-dxw force-pushed the docs/import-spreadsheet branch 2 times, most recently from 463445d to 784d417 Compare August 1, 2024 17:18
@dragon-dxw dragon-dxw force-pushed the docs/import-spreadsheet branch 2 times, most recently from d81ef02 to 11e8bdb Compare August 27, 2024 11:54
@dragon-dxw dragon-dxw force-pushed the docs/import-spreadsheet branch from 11e8bdb to 87b107a Compare August 27, 2024 11:56
@dragon-dxw dragon-dxw merged commit 6ebac1b into main Aug 27, 2024
3 checks passed