Normalize and Improve Operator Runtime Statistics Handling #3171

Open
wants to merge 3 commits into
base: master

Conversation

kunwp1
Collaborator

@kunwp1 kunwp1 commented Dec 19, 2024

This PR normalizes the schema used to track operator runtime statistics for workflow executions and improves the logic that records them. It changes the database schema, the migration scripts, and the Scala code that inserts and manages runtime statistics. The goal is to reduce redundancy, improve maintainability, and keep operator_executions and operator_runtime_statistics consistent with each other.

Schema Design

  1. New Table Design:
  • operator_executions: Tracks execution metadata for each operator in a workflow execution. Each row contains operator_execution_id, workflow_execution_id, operator_id, and num_workers. This table ensures that operator executions are uniquely identifiable.
  • operator_runtime_statistics: Tracks runtime statistics for each operator execution at specific timestamps. It includes operator_execution_id as a foreign key, ensuring a direct reference to operator_executions.
  2. Normalization Improvements:
  • Replaced repeated execution_id and operator_id in workflow_runtime_statistics with a single foreign key operator_execution_id, pointing to operator_executions.
  • Split the large workflow_runtime_statistics table into the two smaller tables described above, eliminating redundancy and improving data integrity.
  3. Indexes and Keys:
  • Added a composite index on operator_execution_id and time in operator_runtime_statistics to speed up joins and queries ordered by time.
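
For reviewers, the design above can be sketched roughly as follows. This is an illustration only, not the contents of 19.sql: it uses Python's sqlite3 with an in-memory database as a stand-in for the project's database, and the output_tuple_cnt column, the UNIQUE constraint, the index name, and the helper function names are all assumptions made for the example.

```python
import sqlite3

# Table and key names follow the PR description. Everything marked
# "assumption" or "hypothetical" is illustrative and not taken from 19.sql.
SCHEMA = """
CREATE TABLE operator_executions (
    operator_execution_id INTEGER PRIMARY KEY AUTOINCREMENT,
    workflow_execution_id INTEGER NOT NULL,
    operator_id           TEXT    NOT NULL,
    num_workers           INTEGER NOT NULL,
    -- Assumption: one row per (workflow execution, operator) pair.
    UNIQUE (workflow_execution_id, operator_id)
);

CREATE TABLE operator_runtime_statistics (
    operator_execution_id INTEGER NOT NULL,
    time                  TEXT    NOT NULL,
    output_tuple_cnt      INTEGER NOT NULL DEFAULT 0,  -- hypothetical statistic
    FOREIGN KEY (operator_execution_id)
        REFERENCES operator_executions (operator_execution_id)
);

-- Composite index on (operator_execution_id, time); the name is hypothetical.
CREATE INDEX idx_operator_runtime_statistics_time
    ON operator_runtime_statistics (operator_execution_id, time);
"""


def open_db():
    """Create an in-memory database with foreign key enforcement enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


def get_or_create_operator_execution(conn, workflow_execution_id, operator_id, num_workers):
    """Return the operator_execution_id for the pair, inserting a row if needed."""
    conn.execute(
        "INSERT OR IGNORE INTO operator_executions "
        "(workflow_execution_id, operator_id, num_workers) VALUES (?, ?, ?)",
        (workflow_execution_id, operator_id, num_workers),
    )
    row = conn.execute(
        "SELECT operator_execution_id FROM operator_executions "
        "WHERE workflow_execution_id = ? AND operator_id = ?",
        (workflow_execution_id, operator_id),
    ).fetchone()
    return row[0]


def record_statistics(conn, operator_execution_id, time, output_tuple_cnt):
    """Insert one statistics sample; fails if the operator execution is unknown."""
    conn.execute(
        "INSERT INTO operator_runtime_statistics "
        "(operator_execution_id, time, output_tuple_cnt) VALUES (?, ?, ?)",
        (operator_execution_id, time, output_tuple_cnt),
    )
```

In this sketch each statistics row carries only operator_execution_id instead of repeating the execution and operator identifiers, and the foreign key rejects samples that do not belong to a known operator execution.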

Testing

The migration script core/scripts/sql/update/19.sql creates the two new tables, operator_executions and operator_runtime_statistics, and migrates the existing data from workflow_runtime_statistics into them. Once the review is approved, I will add a DROP TABLE workflow_runtime_statistics statement to the script to remove the old table.
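
The shape of such a migration can be sketched as below. This is a simplified illustration under stated assumptions, not the actual 19.sql: it runs against an in-memory SQLite database through Python's sqlite3, it assumes the old table stores num_workers alongside execution_id and operator_id, it uses a hypothetical output_tuple_cnt column as the only statistic, and taking MAX(num_workers) per operator is an arbitrary choice made for the example.

```python
import sqlite3

# Assumed (simplified) layout of the old table; only execution_id and
# operator_id are named in the PR description, the rest is hypothetical.
OLD_TABLE = """
CREATE TABLE workflow_runtime_statistics (
    execution_id     INTEGER NOT NULL,
    operator_id      TEXT    NOT NULL,
    time             TEXT    NOT NULL,
    num_workers      INTEGER NOT NULL,
    output_tuple_cnt INTEGER NOT NULL
);
"""

NEW_TABLES = """
CREATE TABLE operator_executions (
    operator_execution_id INTEGER PRIMARY KEY AUTOINCREMENT,
    workflow_execution_id INTEGER NOT NULL,
    operator_id           TEXT    NOT NULL,
    num_workers           INTEGER NOT NULL,
    UNIQUE (workflow_execution_id, operator_id)
);
CREATE TABLE operator_runtime_statistics (
    operator_execution_id INTEGER NOT NULL
        REFERENCES operator_executions (operator_execution_id),
    time                  TEXT    NOT NULL,
    output_tuple_cnt      INTEGER NOT NULL
);
"""

MIGRATION = """
-- Step 1: one operator_executions row per distinct (execution, operator) pair.
INSERT INTO operator_executions (workflow_execution_id, operator_id, num_workers)
SELECT execution_id, operator_id, MAX(num_workers)
FROM workflow_runtime_statistics
GROUP BY execution_id, operator_id;

-- Step 2: copy every sample, replacing the repeated pair with the new key.
INSERT INTO operator_runtime_statistics (operator_execution_id, time, output_tuple_cnt)
SELECT oe.operator_execution_id, wrs.time, wrs.output_tuple_cnt
FROM workflow_runtime_statistics AS wrs
JOIN operator_executions AS oe
  ON oe.workflow_execution_id = wrs.execution_id
 AND oe.operator_id = wrs.operator_id;
"""


def migrate(conn):
    """Create the new tables, copy the data, and return the migrated row count."""
    conn.executescript(NEW_TABLES)
    conn.executescript(MIGRATION)
    old = conn.execute("SELECT COUNT(*) FROM workflow_runtime_statistics").fetchone()[0]
    new = conn.execute("SELECT COUNT(*) FROM operator_runtime_statistics").fetchone()[0]
    if old != new:
        # Every old sample should map to exactly one new row.
        raise RuntimeError(f"row count mismatch: {old} old rows, {new} migrated")
    return new
```

Comparing row counts before dropping the old table is one cheap sanity check; a real migration would likely also compare the migrated values themselves.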

@kunwp1 kunwp1 requested a review from shengquan-ni December 19, 2024 22:18
@kunwp1 kunwp1 self-assigned this Dec 19, 2024