[GLUTEN-7028][CH][Part-8] Support one pipeline write for partition mergetree #7924

Merged
merged 4 commits into from
Nov 13, 2024

baibaichen
Contributor

@baibaichen baibaichen commented Nov 12, 2024

What changes were proposed in this pull request?

(Fixes: #7028)
The following diagram shows the current class hierarchy. SparkPartitionedBaseSink inherits from ClickHouse's DB::PartitionedSink.

WriteStatsBase
  |- MergeTreeStats  <--- collect stats at finish  -----------------------|
  |- WriteStats      <--- collect stats at consume ---|                   |
                                                      |                   |
SparkPartitionedBaseSink                              |                   |
  |- SubstraitPartitionedFileSink      ---create --> SubstraitFileSink    |
  |- SparkMergeTreePartitionedFileSink ---create --> SparkMergeTreeSink --|
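
The relationship in the diagram can be sketched with a few simplified classes. This is only an illustration of the pattern (a partitioned base sink that creates one child sink per partition, with stats collected either at consume or at finish), not the actual Gluten or ClickHouse code. Every name below (Row, ChildSink, PartitionedBaseSink, FileLikeSink, MergeTreeLikeSink, Stats, createSinkForPartition) is hypothetical and chosen for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a block of rows; the real code works on ClickHouse chunks.
struct Row { std::string partition; int value; };
using Rows = std::vector<Row>;

// Hypothetical shared stats object, playing the role of a WriteStatsBase subclass.
struct Stats { std::size_t rows_written = 0; };

// Hypothetical per-partition sink interface.
struct ChildSink {
    virtual ~ChildSink() = default;
    virtual void consume(const Rows & rows) = 0;
    virtual void finish() {}
};

// Reports stats as soon as rows are consumed (the "collect stats at consume" arrow).
class FileLikeSink : public ChildSink {
    Stats & stats_;
public:
    explicit FileLikeSink(Stats & s) : stats_(s) {}
    void consume(const Rows & rows) override { stats_.rows_written += rows.size(); }
};

// Buffers counts and reports only when finished (the "collect stats at finish" arrow).
class MergeTreeLikeSink : public ChildSink {
    Stats & stats_;
    std::size_t pending_ = 0;
public:
    explicit MergeTreeLikeSink(Stats & s) : stats_(s) {}
    void consume(const Rows & rows) override { pending_ += rows.size(); }
    void finish() override { stats_.rows_written += pending_; pending_ = 0; }
};

// Routes rows by partition key and lazily creates one child sink per partition.
class PartitionedBaseSink {
    std::map<std::string, std::unique_ptr<ChildSink>> sinks_;
public:
    virtual ~PartitionedBaseSink() = default;
    void consume(const Rows & rows) {
        std::map<std::string, Rows> by_partition;
        for (const auto & row : rows) by_partition[row.partition].push_back(row);
        for (const auto & kv : by_partition) {
            auto it = sinks_.find(kv.first);
            if (it == sinks_.end())
                it = sinks_.emplace(kv.first, createSinkForPartition(kv.first)).first;
            it->second->consume(kv.second);
        }
    }
    void finish() { for (auto & kv : sinks_) kv.second->finish(); }
    std::size_t partitionCount() const { return sinks_.size(); }
protected:
    // Subclasses decide which kind of child sink each partition gets.
    virtual std::unique_ptr<ChildSink> createSinkForPartition(const std::string & key) = 0;
};

class PartitionedMergeTreeLikeSink : public PartitionedBaseSink {
    Stats & stats_;
public:
    explicit PartitionedMergeTreeLikeSink(Stats & s) : stats_(s) {}
protected:
    std::unique_ptr<ChildSink> createSinkForPartition(const std::string &) override {
        return std::unique_ptr<ChildSink>(new MergeTreeLikeSink(stats_));
    }
};
```

In this sketch, a PartitionedMergeTreeLikeSink fed rows from two partitions creates two child sinks, and the shared Stats object is only updated once finish() is called. Swapping the factory to return FileLikeSink would update it during consume() instead.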

The pipeline write for a partitioned MergeTree looks like the following; it squashes blocks across the whole input before partitioning:

  // spark 3.5
  Input pipeline 
    => PlanSquashingTransform 
      => ApplySquashingTransform 
       => SparkMergeTreePartitionedFileSink
          => SparkMergeTreeSink
          => SparkMergeTreeSink
          => ...
        => MergeTreeStats

This differs from Spark 3.3, which squashes blocks after partitioning, separately for each partition, because there partitioning is triggered by the JVM.

The new implementation is the same as ClickHouse's.
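
To make the difference between the two orderings concrete, here is a minimal sketch that applies a simplified squash step and a partition step in both orders. It is an assumption-laden toy model: squash() only uses a row-count threshold, and the names Row, Block, squash, partitionBlocks, squashThenPartition and partitionThenSquash are hypothetical and do not correspond to functions in Gluten or ClickHouse.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Row { std::string partition; int value; };
using Block = std::vector<Row>;
using Partitioned = std::map<std::string, std::vector<Block>>;

// Simplified squashing: concatenate incoming blocks and emit one block each time
// at least min_rows rows have accumulated; flush any remainder at the end.
std::vector<Block> squash(const std::vector<Block> & input, std::size_t min_rows) {
    std::vector<Block> out;
    Block acc;
    for (const auto & block : input) {
        acc.insert(acc.end(), block.begin(), block.end());
        if (acc.size() >= min_rows) { out.push_back(acc); acc.clear(); }
    }
    if (!acc.empty()) out.push_back(acc);
    return out;
}

// Split every block by partition key, keeping one output block per input block
// for each partition that appears in it.
Partitioned partitionBlocks(const std::vector<Block> & input) {
    Partitioned out;
    for (const auto & block : input) {
        std::map<std::string, Block> split;
        for (const auto & row : block) split[row.partition].push_back(row);
        for (const auto & kv : split) out[kv.first].push_back(kv.second);
    }
    return out;
}

// Ordering described for Spark 3.5: squash the whole input, then partition.
Partitioned squashThenPartition(const std::vector<Block> & input, std::size_t min_rows) {
    return partitionBlocks(squash(input, min_rows));
}

// Ordering described for Spark 3.3: partition first, then squash per partition.
Partitioned partitionThenSquash(const std::vector<Block> & input, std::size_t min_rows) {
    Partitioned parts = partitionBlocks(input);
    for (auto & kv : parts) kv.second = squash(kv.second, min_rows);
    return parts;
}

// Count rows across all partitions and blocks.
std::size_t totalRows(const Partitioned & parts) {
    std::size_t n = 0;
    for (const auto & kv : parts)
        for (const auto & block : kv.second) n += block.size();
    return n;
}
```

Both orderings write the same rows, but the block shapes reaching each per-partition sink differ. With four input blocks that each hold one row for partition "a" and one for "b", and a threshold of four rows, squashing first hands each partition two blocks of two rows, while partitioning first hands each partition a single block of four rows.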

How was this patch tested?

Using existing UTs.

#7028

Run Gluten Clickhouse CI on x86

@baibaichen baibaichen merged commit c1a3f7b into apache:main Nov 13, 2024
11 checks passed
@baibaichen baibaichen deleted the feature/partition_mergetree branch November 13, 2024 01:32
PHILO-HE pushed a commit to PHILO-HE/gluten that referenced this pull request Nov 13, 2024
…rgetree (apache#7924)

* [Refactor] simple refactor
* [Refactor] Remove setStats
* [Refactor] SparkPartitionedBaseSink and WriteStatsBase
* [Refactor] Add explicit SparkMergeTreeWriteSettings(const DB::ContextPtr & context);
* [New] Support writing partition mergetree in one pipeline
Development

Successfully merging this pull request may close these issues.

[CH] Fully Support writing parquet and mergetree in spark 3.5.x with delta protocol