
Adds python script for incremental partition insertion. #76

Merged
Changes from all commits
Commits
97 commits
a4d0d25
Adds a new module for nightly tests.
prashastia Dec 18, 2023
f674638
Modifies the docstring.
prashastia Dec 19, 2023
6ec22dc
Modifies the docstring.
prashastia Dec 19, 2023
f0cd825
Adds simple e2e test, Adds parse_logs.py, Adds table_read.sh
prashastia Dec 19, 2023
c814bd9
Adds simple e2e test, Adds parse_logs.py, Adds table_read.sh
prashastia Dec 19, 2023
f5d3f31
Adds simple e2e test, Adds parse_logs.py, Adds table_read.sh
prashastia Dec 19, 2023
b9cad97
Adds simple e2e test, Adds parse_logs.py, Adds table_read.sh
prashastia Dec 19, 2023
b3ce92b
Adds simple e2e test, Adds parse_logs.py, Adds table_read.sh
prashastia Dec 19, 2023
4e6ef67
Adds simple e2e test, Adds parse_logs.py, Adds table_read.sh
prashastia Dec 19, 2023
dd8e5be
Modifies IntegrationTest to check query correctness.
prashastia Dec 19, 2023
137030d
Adds spotless:apply.
prashastia Dec 20, 2023
957ebf0
Renames the table_read file to bounded_table_read.sh
prashastia Dec 21, 2023
2ff4032
Addresses review comments.
prashastia Dec 21, 2023
394132d
Creates separate shell script for bounded jobs.
prashastia Dec 21, 2023
5425806
Adds a new common shell script for common actions performed for both …
prashastia Dec 21, 2023
ea79601
Addresses review comments.
prashastia Dec 21, 2023
c6a564f
Addresses review comments.
prashastia Dec 21, 2023
a2a49b3
Formats the file.
prashastia Dec 21, 2023
3da1b14
Fixes checkstyle violations.
prashastia Dec 21, 2023
0f34c03
Fixes checkstyle violations.
prashastia Dec 21, 2023
0822681
Fixes checkstyle violations.
prashastia Dec 21, 2023
1770526
Fixes checkstyle violations.
prashastia Dec 21, 2023
e416b1f
Changes test name, adds a new e2e test for checking table read for ta…
prashastia Dec 21, 2023
471aca2
Changes test name, adds a new e2e test for checking table read for ta…
prashastia Dec 21, 2023
68d48d0
Adds a new e2e test for checking query read.
prashastia Dec 21, 2023
44f39f6
Fixes cause of error in query read execution.
prashastia Dec 21, 2023
b8a5883
Fixes cause of error in query read execution.
prashastia Dec 21, 2023
46ac1e6
Adds a new e2e test for checking large table ~200GBs read.
prashastia Dec 21, 2023
74afc54
Adds utils.py - a class containing implementations for writing recor…
prashastia Dec 21, 2023
ea0cba5
Addresses review comments in parse_logs.py.
prashastia Dec 22, 2023
05bbd95
Addresses review comments in nightly.sh and modifies nightly.yaml for…
prashastia Dec 22, 2023
b731be7
Addresses review comments in requirements.txt
prashastia Dec 22, 2023
be260e4
Addresses review comments in table_read.sh
prashastia Dec 22, 2023
f172b67
Fixes checkstyle violations, addresses review comments.
prashastia Dec 22, 2023
37bcd09
Update pom.xml
prashastia Dec 22, 2023
485d4b7
Addresses review comments in table_read.sh
prashastia Dec 22, 2023
61e4709
Fixes metric value regex to capture digits.
prashastia Dec 22, 2023
8138578
Addresses review comments on parse_logs.py
prashastia Dec 23, 2023
ecf38d0
Fixes indentation problems in pom.xml
prashastia Dec 23, 2023
bd4d02c
Fixes indentation problems in pom.xml
prashastia Dec 23, 2023
926a9d4
Adds utils.py containing helper function for dynamic record addition.
prashastia Dec 23, 2023
e9a0526
Merge remote-tracking branch 'origin/nightly-tests' into nightly-test…
prashastia Dec 23, 2023
dce318f
undo.
prashastia Dec 23, 2023
77a3994
Merge remote-tracking branch 'origin/nightly-tests-large-table-read' …
prashastia Dec 23, 2023
8adfb9b
Adapting parse_log to use the utils argument input class.
prashastia Dec 23, 2023
3dd668d
Adds a new abstract class table_type.py contains records creation and…
prashastia Dec 23, 2023
b456523
Adds a python script to create a partitioned table and add initial va…
prashastia Dec 23, 2023
2c2875e
insert_dynamic_partitions.py A python script to insert partitions inc…
prashastia Dec 23, 2023
18bc65f
fixes error.
prashastia Dec 24, 2023
045cfd9
Fixes error in parse_logs.py to accept the argument.
prashastia Dec 24, 2023
d36e17c
Merge remote-tracking branch 'origin/nightly-tests-unbounded-read-1' …
prashastia Dec 24, 2023
d822421
Merge remote-tracking branch 'origin/nightly-tests-unbounded-read-1-2…
prashastia Dec 24, 2023
695f629
Merge remote-tracking branch 'origin/nightly-tests-unbounded-read-2-3…
prashastia Dec 24, 2023
d2bc679
Fixes error in the script regarding string size.
prashastia Dec 26, 2023
b53de8e
Merge remote-tracking branch 'origin/nightly-tests-unbounded-read-1' …
prashastia Dec 26, 2023
1b153ed
Merge remote-tracking branch 'origin/nightly-tests-unbounded-read-1-2…
prashastia Dec 26, 2023
faae51a
Merge remote-tracking branch 'dataproc/main' into nightly-tests-unbou…
prashastia Dec 28, 2023
6ad3fbc
Modifies utils.py to remove redundant error messages.
prashastia Dec 28, 2023
7498bd7
Modifies parse_logs.py to input on the utils.
prashastia Dec 28, 2023
f4bb759
Merge remote-tracking branch 'origin/nightly-tests-unbounded-read-1' …
prashastia Dec 29, 2023
8c10500
Adds table_type.py. An abstract class containing implementations of w…
prashastia Dec 29, 2023
5aada3c
Addresses a few review comments,
prashastia Dec 29, 2023
e56ed19
Addresses a few review comments,
prashastia Dec 29, 2023
442a8af
Addresses review comments.
prashastia Dec 29, 2023
a6a4f68
Moves the avro file identifier to the file which uses it.
prashastia Dec 29, 2023
ceb531e
removes table_type.py since it is not used in the e2e tests.
prashastia Dec 29, 2023
af215bd
Merge remote-tracking branch 'origin/nightly-tests-unbouned-read-1' i…
prashastia Dec 29, 2023
dd0cf39
Works on create_partitioned_table.py to accommodate for utils.py rest…
prashastia Dec 29, 2023
b3b3dbf
Reformats utils.py
prashastia Dec 29, 2023
fcd91fb
Addresses review comments.
prashastia Dec 29, 2023
ab856aa
Addresses review comments.
prashastia Dec 29, 2023
f344f5d
Adds argparse - for trial
prashastia Dec 29, 2023
446dbd6
Adds argparse - for trial
prashastia Dec 29, 2023
3f86339
Adds argparse - for trial
prashastia Dec 29, 2023
321b3f3
Adds argparse. Removes ArgumentInputUtils from utils.py corrects argu…
prashastia Dec 29, 2023
f5231f6
Adds argparse. Removes ArgumentInputUtils from utils.py corrects argu…
prashastia Dec 29, 2023
227319c
Merge remote-tracking branch 'origin/nightly-tests-unbouned-read-1' i…
prashastia Dec 29, 2023
06fe799
Adds argparse. Fixes formatting.
prashastia Dec 29, 2023
3830028
Merge remote-tracking branch 'origin/nightly-tests-unbounded-read-1-2…
prashastia Dec 29, 2023
361e148
Adds argparse. Fixes formatting.
prashastia Dec 29, 2023
4127553
Merge remote-tracking branch 'dataproc/main' into nightly-tests-unbou…
prashastia Jan 2, 2024
6078ebe
Changes insert_dynamic_partitions.py to account for increased rows. N…
prashastia Jan 2, 2024
54170ed
Addresses review comments.
prashastia Jan 2, 2024
f2f21d5
Addresses review comments.
prashastia Jan 2, 2024
543817e
Addresses review comments.
prashastia Jan 2, 2024
b814589
Addresses review comments.
prashastia Jan 2, 2024
b68ea5b
Addresses review comments.
prashastia Jan 2, 2024
820dcc8
Addresses review comments.
prashastia Jan 2, 2024
2bd9fad
Reformats the file.
prashastia Jan 3, 2024
a02570f
Addresses review comments.
prashastia Jan 3, 2024
2c7cab2
Addresses review comments.
prashastia Jan 3, 2024
d48c345
Addresses review comments.
prashastia Jan 3, 2024
4118b97
Addresses review comments.
prashastia Jan 3, 2024
5eee53a
Addresses review comments. Takes number of rows per partition as an a…
prashastia Jan 3, 2024
3f5367e
Addresses review comments.
prashastia Jan 4, 2024
a8fbfb8
Addresses review comments.
prashastia Jan 4, 2024
393a33f
Reduces the wait time - an experiment
prashastia Jan 4, 2024
144 changes: 144 additions & 0 deletions cloudbuild/python-scripts/insert_dynamic_partitions.py
@@ -0,0 +1,144 @@
"""Python script to dynamically add partitions to a BigQuery partitioned table."""

import argparse
from collections.abc import Sequence
import datetime
import logging
import threading
import time
from absl import app
from utils import utils


def sleep_for_seconds(duration):
    logging.info(
        'Going to sleep, waiting for connector to read existing, Time: %s',
        datetime.datetime.now()
    )
    # Buffer time to ensure that new partitions are created
    # after the previous read session and before the next split discovery.
    time.sleep(duration)


def main(argv: Sequence[str]) -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--refresh_interval',
        dest='refresh_interval',
        help='Minutes between checking for new data.',
        type=int,
        required=True,
    )
    parser.add_argument(
        '--project_name',
        dest='project_name',
        help='Project Id which contains the table to be read.',
        type=str,
        required=True,
    )
    parser.add_argument(
        '--dataset_name',
        dest='dataset_name',
        help='Dataset Name which contains the table to be read.',
        type=str,
        required=True,
    )
    parser.add_argument(
        '--table_name',
        dest='table_name',
        help='Table Name of the table which is read in the test.',
        type=str,
        required=True,
    )
    parser.add_argument(
        '-n',
        '--number_of_rows_per_partition',
        dest='number_of_rows_per_partition',
        help='Number of rows to insert per partition.',
        type=int,
        required=False,
        default=30000,
    )

    args = parser.parse_args(argv[1:])

    # Providing the values.
    project_name = args.project_name
    dataset_name = args.dataset_name
    table_name = args.table_name
    number_of_rows_per_partition = args.number_of_rows_per_partition

    execution_timestamp = datetime.datetime.now(tz=datetime.timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    refresh_interval = int(args.refresh_interval)

    # Set the partitioned table.
    table_id = f'{project_name}.{dataset_name}.{table_name}'

    # Now add the partitions to the table.
    # Hardcoded schema. Needs to be the same as that of the pre-created table.
    simple_avro_schema_fields_string = (
        '"fields": [{"name": "name", "type": "string"},{"name": "number",'
        '"type": "long"},{"name" : "ts", "type" : {"type" :'
        '"long","logicalType": "timestamp-micros"}}]'
    )
    simple_avro_schema_string = (
        '{"namespace": "project.dataset","type": "record","name":'
        ' "table","doc": "Avro Schema for project.dataset.table",'
        f'{simple_avro_schema_fields_string}'
        '}'
    )

    # Hardcoded for the e2e test.
    # partitions[i] * number_of_rows_per_partition rows are inserted per phase.
    partitions = [2, 1, 2]
    # Split each insert across threads so a single large load does not
    # exceed the BQ rate limit.
    number_of_threads = 2
    number_of_rows_per_thread = number_of_rows_per_partition // number_of_threads

    avro_file_local = 'mockData.avro'
    table_creation_utils = utils.TableCreationUtils(
        simple_avro_schema_string,
        number_of_rows_per_thread,
        table_id,
    )

    # Insert iteratively.
    prev_partitions_offset = 0
    for number_of_partitions in partitions:
        start_time = time.time()
        # Wait for read stream formation.
        sleep_for_seconds(2.5 * 60)

        # This represents one iteration.
        for partition_number in range(number_of_partitions):
            threads = []
            # Insert via concurrent threads.
            for thread_number in range(number_of_threads):
                avro_file_local_identifier = avro_file_local.replace(
                    '.', '_' + str(thread_number) + '.'
                )
                thread = threading.Thread(
                    target=table_creation_utils.avro_to_bq_with_cleanup,
                    kwargs={
                        'avro_file_local_identifier': avro_file_local_identifier,
                        'partition_number': partition_number + prev_partitions_offset,
                        'current_timestamp': execution_timestamp,
                    },
                )
                threads.append(thread)
                thread.start()
            for thread in threads:
                thread.join()

        time_elapsed = time.time() - start_time
        prev_partitions_offset += number_of_partitions
Collaborator

Why is prev_partitions_offset being incremented multiple times in the same iteration?

Collaborator Author

Within the same iteration we are adding rows spread across one or more partitions, so that on a new read we make sure multiple partitions are being read from.

Collaborator

Let's take an example.

First iteration:

    for number_of_partitions in partitions:   # number_of_partitions is 2
        ...
        prev_partitions_offset += 1   # prev_partitions_offset is 1
        ...
        # called avro_to_bq_with_cleanup with partition_number as 1
        # called avro_to_bq_with_cleanup with partition_number as 2
        ...
        prev_partitions_offset += number_of_partitions   # prev_partitions_offset is 3
        ...

Second iteration:

    for number_of_partitions in partitions:   # number_of_partitions is 1
        ...
        prev_partitions_offset += 1   # prev_partitions_offset is 4
        ...
        # called avro_to_bq_with_cleanup with partition_number as 4
        ...
        prev_partitions_offset += number_of_partitions   # prev_partitions_offset is 5
        ...

Third iteration:

    for number_of_partitions in partitions:   # number_of_partitions is 2
        ...
        prev_partitions_offset += 1   # prev_partitions_offset is 6
        ...
        # called avro_to_bq_with_cleanup with partition_number as 6
        # called avro_to_bq_with_cleanup with partition_number as 7
        ...
        prev_partitions_offset += number_of_partitions   # prev_partitions_offset is 8
        ...

So, we've skipped partition offsets 3 and 5. If that is intentional, then why?
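The trace above can be run as a standalone sketch. The `prev_partitions_offset += 1` line comes from an earlier revision of the script, not the final diff; variable names are shortened here for brevity:

```python
# Sketch of the earlier offset logic the reviewer is tracing.
partitions = [2, 1, 2]
prev = 0          # prev_partitions_offset
called = []       # partition_number values passed to avro_to_bq_with_cleanup
for n in partitions:
    prev += 1     # the extra increment flagged in the review
    for p in range(n):
        called.append(prev + p)
    prev += n

# called == [1, 2, 4, 6, 7]; offsets 3 and 5 are never used.
```

This reproduces the reviewer's numbers: partitions 3 and 5 are skipped because the offset advances twice per iteration.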

Collaborator Author

Yeah, this is correct. To maintain time consistency, I took the time at UTC, which would be 18:30 hrs. So (18:30 + 2 = 20:30) to (18:30 + 3 = 21:30) will generate values in the 20 hrs and 21 hrs partitions.

So in the next phase, if we generate for 18:30 + 3 to 18:30 + 4, the partitions would clash.

I think this is getting too confusing. I'll fix this.

Collaborator Author

This is fixed now.
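For comparison, the fixed offset logic in the diff above, where the offset advances only once per phase, reduces to this minimal sketch (variable names from the script, BigQuery calls omitted):

```python
# Sketch of the fixed logic: one offset increment per phase.
partitions = [2, 1, 2]
prev_partitions_offset = 0
assigned = []  # partition numbers handed to the insert threads
for number_of_partitions in partitions:
    for partition_number in range(number_of_partitions):
        assigned.append(partition_number + prev_partitions_offset)
    prev_partitions_offset += number_of_partitions

# assigned == [0, 1, 2, 3, 4]: contiguous, no offsets skipped.
```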


        # We wait until the read streams are formed again,
        # so that the records just created can be read.
        sleep_for_seconds(float(60 * refresh_interval) - time_elapsed)


if __name__ == '__main__':
    app.run(main)
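A minimal sketch of how the script's argument parsing behaves. The flag names and the 30000 default are taken from the diff above; the project, dataset, and table values are hypothetical:

```python
import argparse

# Rebuild the parser exactly as the script defines it.
parser = argparse.ArgumentParser()
parser.add_argument('--refresh_interval', type=int, required=True)
parser.add_argument('--project_name', type=str, required=True)
parser.add_argument('--dataset_name', type=str, required=True)
parser.add_argument('--table_name', type=str, required=True)
parser.add_argument('-n', '--number_of_rows_per_partition',
                    type=int, required=False, default=30000)

# Simulate the argv[1:] slice the script passes to parse_args.
args = parser.parse_args([
    '--refresh_interval', '10',
    '--project_name', 'my-project',
    '--dataset_name', 'my_dataset',
    '--table_name', 'my_table',
])

# -n was omitted, so the default applies.
rows = args.number_of_rows_per_partition        # 30000
table_id = f'{args.project_name}.{args.dataset_name}.{args.table_name}'
```

Omitting any required flag makes `parse_args` exit with a usage error, which is why the e2e shell scripts must always pass all four.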