-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FCL-167] Import bailii tab from bailii-docx s3 bucket #172
Conversation
1317cdf
to
decc59a
Compare
Discussion in standup: should also check is published. Log those that are not published. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, some minor comments that may not be an issue
|
||
class Row(BaseRow): | ||
def source_key(self): | ||
extension_match = re.search(r"^(.*)\.([a-z]*)$", self.filename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ever possible this won't match? do we want to catch when the extension doesn't match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have checked and the extensions should be exactly the ones on L50. Anything else means we've changed the file. I'm 100% okay with this being terribly, terribly brittle (as long as it explodes safely) since it should only ever be running against one file.
return f"{path}/{filename}" | ||
|
||
def has_docx_in_s3(self): | ||
response = requests.head(f"{ASSETS_BASE}/{self.target_key()}", timeout=30) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this request fails, does the whole script end? are we happy for that to happen or could we ad a try/except here and allow processing other assets? (same for is_published)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. We're going to be running this manually; I'm personally OK with brittleness over uncertain behaviour.
|
||
nice_data = [] | ||
for row in raw_data[1:]: | ||
row_object = Row(**dict(zip(headers, row))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure on the data here (assuming it's generated by the user of this script) but is it possible rows don't have all the data? We could check here first with if len(row) == len(headers): ...
to ensure they have all the columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good shout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of immediate comments, now going through in depth
else: | ||
raise RuntimeError | ||
|
||
dryrun_bonus = [DRY_RUN] if DRY_RUN else [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unclear on the logic here, DRY_RUN
is set to "--dryrun"
so will always evaluate to True
, should there be some argument parsing going on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've hard coded it to avoid screwups during development; but yes, argument parsing sounds like a plan.
463445d
to
784d417
Compare
d81ef02
to
11e8bdb
Compare
11e8bdb
to
87b107a
Compare
Why are we using
-e staging
?Because we're using Dalmatian v2; we could use
-i test
and it'd work.What's going on with the
--acl
?We need each file to be world readable for the public bucket, but this isn't valid for the private one.
Example output (with dryrun on)
Jira
FCL-167