Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/#232 redesign sql stage input output #237

Merged
merged 11 commits into from
Dec 5, 2024

Conversation

ckunki
Copy link
Contributor

@ckunki ckunki commented Dec 4, 2024

Closes #232

"""

data_partitions: Dict[DataPartitionName, DataPartition]
table_like: TableLike
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically,

  • data_partition was removed and
  • sample_columns and target_columns have been merged into a simple columns

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This made some tests obsolete.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docsting was also removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I re-added the docstring and updated its content to match the redesign.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My take from it:
The table_like is the data container. As its type name suggests, the data comes in the form of multiple columns. identifier_columns and columns are the descriptions of these columns. The meaning of the dependencies is unclear, but hopefully the docsting of the Dependencies class gives an explanation. If I am wrong please rewrite the docstring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must admit, that to me the meaning of dependencies is also a bit unclear.

The plural is only defined as a type:

Dependencies = Dict[Enum, Dependency]

The Docstring of class Dependency has some issues in wording etc., so I would prefer to update it in the scope of the current PR.

The docstring to me sounds more like a structured comment targeted for human readers rather than to be used by an algorithm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See last push.



@dataclasses.dataclass(frozen=True, eq=True)
@dataclass(frozen=True)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, class SQLStageInputOutput has been simplified by leaving most implementation details up to the user.
Again this made some tests obsolete and also required tests to instantiate MultiDatasetSQLStageInputOutput instead of the abstract parent class.

check_dataclass_types(self)
@dataclass(frozen=True)
class MultiDatasetSQLStageInputOutput(SQLStageInputOutput):
datasets: Dict[object, Dataset]
Copy link
Contributor Author

@ckunki ckunki Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a single dataset (potentially containing multiple partitions) we now have a dict of potentially multiple datasets indexed by arbitrary objects.

This simplifies the tests but required some updates to the tests.

and updated planned release date
ahsimb
ahsimb previously approved these changes Dec 4, 2024
"""

data_partitions: Dict[DataPartitionName, DataPartition]
table_like: TableLike
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My take from it:
The table_like is the data container. As its type name suggests, the data comes in the form of multiple columns. identifier_columns and columns are the descriptions of these columns. The meaning of the dependencies is unclear, but hopefully the docsting of the Dependencies class gives an explanation. If I am wrong please rewrite the docstring.

An instance of this class expresses that a object depends on something
which in fact can depend on something else. The exact meaning of a
dependency is user-defined. For example, a dependency could express that
a view depends on a certain table.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely a better English, but the phrase "object depends on something which in fact can depend on something else" still sounds a bit mystical. An instance of this class represents a node in the dependency graph, right? That's how I would describe it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that is probably a better wording. Had first to think, if it is really a graph or if it is only a tree, but yes it can be a graph.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds a bit mystical

I definitely agree 🙂
Please see updated PR.

@ckunki ckunki merged commit 223a124 into main Dec 5, 2024
23 checks passed
@ckunki ckunki deleted the feature/#232-Redesign-SQLStageInputOutput branch December 5, 2024 10:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Redesign SQLStageInputOutput
3 participants