
Adds shell script for unbounded table read, modifies table_read to execute the same. #77

Conversation

prashastia (Collaborator)
Adds a shell script that:

- Creates the asynchronously running unbounded job: the Dataproc job is created in unbounded mode.
- Dynamically adds partitions: insert_dynamic_partitions.py is executed with the necessary parameters.
- Kills the Dataproc job once the read has completed, so that its correctness can be checked further.
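The three steps above might be sketched roughly as follows. The gcloud flags and the `$UPPER_CASE` parameters are illustrative assumptions, not the PR's actual values; only the `insert_dynamic_partitions.py` path is quoted from this PR. `RUN=echo` (the default here) makes it a dry run that prints the commands instead of executing them:

```shell
#!/usr/bin/env bash
# Hedged sketch of the script's three steps; resource names and flags
# are placeholders, not the PR's actual values.
RUN="${RUN:-echo}"  # dry-run by default; set RUN= (empty) to execute

unbounded_table_read() {
  # 1. Create the asynchronously running unbounded Dataproc job;
  #    --async returns immediately instead of streaming driver output.
  $RUN gcloud dataproc jobs submit spark \
    --cluster="${CLUSTER_NAME:-test-cluster}" \
    --region="${REGION:-us-central1}" \
    --jar="${TEST_JAR:-connector-tests.jar}" \
    --async -- --mode unbounded

  # 2. Dynamically add partitions while the job is reading
  #    (script path taken from this PR).
  $RUN python3 cloudbuild/python-scripts/insert_dynamic_partitions.py -- \
    --project_name "${PROJECT_NAME:-my-project}" \
    --dataset_name "${DATASET_NAME:-my_dataset}" \
    --table_name "${TABLE_NAME:-simpleTable}" \
    --refresh_interval "${PARTITION_DISCOVERY_INTERVAL:-10}"

  # 3. Kill the Dataproc job: the read has completed and its
  #    correctness is checked afterwards.
  $RUN gcloud dataproc jobs kill "${JOB_ID:-placeholder-job-id}" \
    --region="${REGION:-us-central1}" --quiet
}
```

Gating every command behind `$RUN` keeps the sketch safe to run without a real cluster.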

/gcbrun

This module is similar to the BigQueryExample, with a few changes to count the number of records and log them.
This test reads a simpleTable.
Shell script and python script to check the number of records read.
comments CODECOV_TOKEN usage.
…ds to different tables required for the e2e tests.
…nded-read-2-3

# Conflicts:
#	cloudbuild/python-scripts/utils/utils.py
Comment on lines 39 to 41
else
echo "Unbounded Mode!"
source cloudbuild/e2e-test-scripts/unbounded_table_read.sh
Collaborator
We should explicitly check if mode is unbounded, just like the bounded case. If mode is neither bounded nor unbounded, throw an error.

python3 cloudbuild/python-scripts/insert_dynamic_partitions.py -- --project_name "$PROJECT_NAME" --dataset_name "$DATASET_NAME" --table_name "$TABLE_NAME" --refresh_interval "$PARTITION_DISCOVERY_INTERVAL"

# Wait for a bit, as mapping and output of records takes some time.
sleep 3m
Collaborator

How did we arrive at this number?

Collaborator Author

In our code we wait 2.5 minutes for the read streams to form (prior to insertion). This also makes sure that the previously inserted records are read properly. But after the last insertion there is no wait in the Python code, so we wait via the script instead.

We can remove the extra 30 seconds.

Collaborator

Having a cushion is alright. Just wanted to understand the reason behind this number.

@@ -36,7 +36,13 @@ if [ "$MODE" == "bounded" ]
then
echo "Bounded Mode!"
source cloudbuild/e2e-test-scripts/bounded_table_read.sh

elif [ "$MODE" == "unbounded" ]
Collaborator

Let's use a shell case statement, as here.
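The suggested case statement could look like the sketch below. The script paths are the ones quoted in this PR but are commented out here so the snippet stands alone; the function name and error message are illustrative:

```shell
#!/usr/bin/env bash
# Hedged sketch: dispatch on MODE with a case statement and fail fast
# on unrecognized values, as suggested in the review.
run_mode() {
  case "$1" in
    bounded)
      echo "Bounded Mode!"
      # source cloudbuild/e2e-test-scripts/bounded_table_read.sh
      ;;
    unbounded)
      echo "Unbounded Mode!"
      # source cloudbuild/e2e-test-scripts/unbounded_table_read.sh
      ;;
    *)
      echo "Unknown MODE: '$1' (expected 'bounded' or 'unbounded')" >&2
      return 1
      ;;
  esac
}
```

Unlike the if/elif chain, the `*)` arm guarantees an unrecognized mode fails loudly instead of falling through silently.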

@jayehwhyehentee jayehwhyehentee merged commit a50014b into GoogleCloudDataproc:main Jan 9, 2024
4 checks passed