Add post-process labels stage #4
Conversation
Commits 274d8ff to 50109f2
Added small fixes after reading through the PR code one last time. Everything for post-processing looks good.
The next PR will tackle the TODOs about rewriting outputs.
    from google.api_core.exceptions import Conflict
    from google.cloud import bigquery

    # Assumes a module-level client, e.g. client = bigquery.Client().
    def create_dataset(dataset_id):
        try:
            dataset_path = f"{client.project}.{dataset_id}"
            dataset = bigquery.Dataset(dataset_path)
            dataset.location = "US"
            dataset = client.create_dataset(dataset, timeout=30)
            print(f"Created dataset {client.project}.{dataset.dataset_id}")
        except Conflict as e:
            # BigQuery raises Conflict when the dataset already exists;
            # fall back to fetching the existing dataset.
            if "Already Exists" in str(e):
                dataset = client.get_dataset(dataset_id)
                print(f"Dataset {client.project}.{dataset.dataset_id} already exists. Continuing.")
            else:
                raise e
        return dataset
Swap the order: first check whether the table exists, then try to create it only when it does not, instead of the other way around, which requires a try/except.
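A minimal sketch of that check-first shape for the dataset case, assuming the same module-level client. Note that google-cloud-bigquery has no boolean existence call, so the check itself still catches NotFound; alternatively, client.create_dataset(dataset, exists_ok=True) avoids the branching entirely.

    from google.api_core.exceptions import NotFound
    from google.cloud import bigquery

    def create_dataset(dataset_id):
        dataset_path = f"{client.project}.{dataset_id}"
        try:
            # Check first: get_dataset returns the dataset if it exists.
            dataset = client.get_dataset(dataset_path)
            print(f"Dataset {dataset_path} already exists. Continuing.")
        except NotFound:
            # Create only when the existence check fails.
            dataset = bigquery.Dataset(dataset_path)
            dataset.location = "US"
            dataset = client.create_dataset(dataset, timeout=30)
            print(f"Created dataset {dataset_path}")
        return dataset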
    def run(args):
        dataset = create_dataset(args.dataset_id)
        table = create_table(dataset.dataset_id, args.table_id)
        return table
Expand to include checks for all stages that will be writing to GCP.
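A hypothetical sketch of that expansion, looping over per-stage table names. STAGE_TABLE_IDS is invented here for illustration; the pipeline's real stage names may differ.

    # Hypothetical stage table names; the pipeline's real identifiers may differ.
    STAGE_TABLE_IDS = ["detections", "classifications", "postprocessed_labels"]

    def run(args):
        dataset = create_dataset(args.dataset_id)
        # Ensure every stage that writes to GCP has its table ready up front.
        tables = {
            table_id: create_table(dataset.dataset_id, table_id)
            for table_id in STAGE_TABLE_IDS
        }
        return tables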
Final stage of the pipeline.
Will need to decide on how and what to store in the final outputs.
If we are storing all intermediate stage outputs, we can just join on key or encounter ids, to avoid storing duplicate audio data. Only need to store encounter id and pooled-classification (average).
Need to think about this a bit before implementing.
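For example, the pooling could be a single BigQuery query over a hypothetical classifications table; table and column names below are placeholders, since the final schema is undecided.

    # Hypothetical pooling query; table and column names are placeholders
    # because the final schema is still undecided.
    query = f"""
        SELECT
          encounter_id,
          AVG(classification_score) AS pooled_classification
        FROM `{client.project}.{args.dataset_id}.classifications`
        GROUP BY encounter_id
    """
    pooled_rows = client.query(query).result()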