Add post-process labels stage #4
Conversation
Commits 274d8ff to 50109f2
Added small fixes after reading through the PR code one last time. Everything for post-processing looks good.
The next PR will tackle the TODOs about rewriting outputs.
    from google.api_core.exceptions import Conflict
    from google.cloud import bigquery

    # Assumes a module-level client, e.g. client = bigquery.Client().
    def create_dataset(dataset_id):
        try:
            dataset_path = f"{client.project}.{dataset_id}"
            dataset = bigquery.Dataset(dataset_path)
            dataset.location = "US"
            dataset = client.create_dataset(dataset, timeout=30)
            print(f"Created dataset {client.project}.{dataset.dataset_id}")
        except Conflict as e:
            # BigQuery raises Conflict when the dataset already exists;
            # fall back to fetching the existing dataset.
            if "Already Exists" in str(e):
                dataset = client.get_dataset(dataset_id)
                print(f"Dataset {client.project}.{dataset.dataset_id} already exists. Continuing.")
            else:
                raise e
        return dataset
Swap the order: first check whether the table exists, then try to create it only when it does not, instead of the other way around, which requires a try/except.
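A minimal sketch of that check-first shape for the dataset case, assuming the same module-level client. Note that google-cloud-bigquery has no boolean existence call, so the check itself still catches NotFound; alternatively, client.create_dataset(dataset, exists_ok=True) avoids the branching entirely.

    from google.api_core.exceptions import NotFound
    from google.cloud import bigquery

    def create_dataset(dataset_id):
        dataset_path = f"{client.project}.{dataset_id}"
        try:
            # Check first: get_dataset returns the dataset if it exists.
            dataset = client.get_dataset(dataset_path)
            print(f"Dataset {dataset_path} already exists. Continuing.")
        except NotFound:
            # Create only when the existence check fails.
            dataset = bigquery.Dataset(dataset_path)
            dataset.location = "US"
            dataset = client.create_dataset(dataset, timeout=30)
            print(f"Created dataset {dataset_path}")
        return dataset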
    def run(args):
        dataset = create_dataset(args.dataset_id)
        table = create_table(dataset.dataset_id, args.table_id)
        return table
Expand to include checks for all stages that will be writing to GCP.
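A hypothetical sketch of that expansion, looping over per-stage table names. STAGE_TABLE_IDS is invented here for illustration; the pipeline's real stage names may differ.

    # Hypothetical stage table names; the pipeline's real identifiers may differ.
    STAGE_TABLE_IDS = ["detections", "classifications", "postprocessed_labels"]

    def run(args):
        dataset = create_dataset(args.dataset_id)
        # Ensure every stage that writes to GCP has its table ready up front.
        tables = {
            table_id: create_table(dataset.dataset_id, table_id)
            for table_id in STAGE_TABLE_IDS
        }
        return tables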
Final stage of the pipeline.
Will need to decide on how and what to store in the final outputs.
If we are storing all intermediate stage outputs, we can just join on key or encounter ids, to avoid storing duplicate audio data. Only need to store encounter id and pooled-classification (average).
Need to think about this a bit before implementing.
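For example, the pooling could be a single BigQuery query over a hypothetical classifications table; table and column names below are placeholders, since the final schema is undecided.

    # Hypothetical pooling query; table and column names are placeholders
    # because the final schema is still undecided.
    query = f"""
        SELECT
          encounter_id,
          AVG(classification_score) AS pooled_classification
        FROM `{client.project}.{args.dataset_id}.classifications`
        GROUP BY encounter_id
    """
    pooled_rows = client.query(query).result()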