[pull] main from StarRocks:main #5

Merged
merged 116 commits into from
May 17, 2024


@pull pull bot commented May 9, 2024

See Commits and Changes for more details.


Created by pull[bot]


trueeyu and others added 13 commits May 9, 2024 17:20
Signed-off-by: Albert T. Wong <[email protected]>
Signed-off-by: evelyn.zhaojie <[email protected]>
Signed-off-by: evelynzhaojie <[email protected]>
Co-authored-by: evelyn.zhaojie <[email protected]>
Co-authored-by: evelyn.zhaojie <[email protected]>
Signed-off-by: simo <[email protected]>
Signed-off-by: evelynzhaojie <[email protected]>
Co-authored-by: evelyn.zhaojie <[email protected]>
Signed-off-by: starrocks-xupeng <[email protected]>
Signed-off-by: evelynzhaojie <[email protected]>
Co-authored-by: evelyn.zhaojie <[email protected]>
Co-authored-by: evelyn.zhaojie <[email protected]>
Signed-off-by: Kevin Xiaohua Cai <[email protected]>
@github-actions github-actions bot added documentation Improvements or additions to documentation title needs [type] labels May 9, 2024
@pull pull bot added ⤵️ pull and removed documentation Improvements or additions to documentation title needs [type] labels May 9, 2024
…um` (#43616)

Why I'm doing:
Right now, the HDFS scanner optimization for count(1) is to output a constant column whose row count equals the expected count.

In extreme cases (large datasets), the number of chunks flowing through the pipeline becomes extremely large, and the operator time and overhead time are not negligible.

Here is a profile of select count(*) from hive.hive_ssb100g_parquet.lineorder. To reproduce this extreme case, I changed the code to scale morsels by 20x and repeat row groups by 10x.

With concurrency=1, the total time is 51s:

         - OverheadTime: 25s37ms
           - __MAX_OF_OverheadTime: 25s111ms
           - __MIN_OF_OverheadTime: 24s962ms

             - PullTotalTime: 12s376ms
               - __MAX_OF_PullTotalTime: 13s147ms
               - __MIN_OF_PullTotalTime: 11s885ms
What I'm doing:
Rewrite the count(1) query into a sum-like aggregation, so that each row group reader emits only one chunk (size = 1) carrying its row count.

With this change, the total time is 9s.
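
To illustrate why the rewrite reduces pipeline traffic, here is a minimal Python sketch, not StarRocks code. It compares the two strategies on made-up row group sizes: emitting constant-column chunks that the aggregate counts row by row, versus emitting one single-row chunk per row group that the aggregate sums. The chunk capacity of 4096, the function names, and the row group sizes are all assumptions chosen for illustration.

```python
from typing import Iterator, List, Sequence, Tuple

CHUNK_SIZE = 4096  # assumed chunk capacity, for illustration only


def const_column_chunks(rows: int, chunk_size: int = CHUNK_SIZE) -> Iterator[int]:
    """Old strategy: emit constant-column chunks; each yielded int is a chunk's row count."""
    remaining = rows
    while remaining > 0:
        n = min(chunk_size, remaining)
        yield n
        remaining -= n


def count_value_chunk(rows: int) -> Iterator[List[int]]:
    """New strategy: emit one chunk with a single row whose value is the row count."""
    yield [rows]


def count_via_const_chunks(row_groups: Sequence[int]) -> Tuple[int, int]:
    """Return (total rows, chunks sent through the pipeline) for the old strategy."""
    total, chunks = 0, 0
    for rows in row_groups:
        for chunk_rows in const_column_chunks(rows):
            chunks += 1
            total += chunk_rows  # count(*) counts the rows of every chunk
    return total, chunks


def count_via_sum(row_groups: Sequence[int]) -> Tuple[int, int]:
    """Return (total rows, chunks sent through the pipeline) for the new strategy."""
    total, chunks = 0, 0
    for rows in row_groups:
        for chunk in count_value_chunk(rows):
            chunks += 1
            total += sum(chunk)  # sum(___count___) adds the per-row-group counts
    return total, chunks


# Hypothetical row group sizes.
row_groups = [1_000_000, 750_000, 123_456]
old_total, old_chunks = count_via_const_chunks(row_groups)
new_total, new_chunks = count_via_sum(row_groups)
print(old_total, old_chunks)
print(new_total, new_chunks)
```

Both strategies produce the same total, but the second sends one chunk per row group regardless of how many rows the row group holds, which is where the saving in operator and overhead time would come from.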

The original plan looks like this:

+----------------------------------+
| Explain String                   |
+----------------------------------+
| PLAN FRAGMENT 0                  |
|  OUTPUT EXPRS:18: count          |
|   PARTITION: UNPARTITIONED       |
|                                  |
|   RESULT SINK                    |
|                                  |
|   4:AGGREGATE (merge finalize)   |
|   |  output: count(18: count)    |
|   |  group by:                   |
|   |                              |
|   3:EXCHANGE                     |
|                                  |
| PLAN FRAGMENT 1                  |
|  OUTPUT EXPRS:                   |
|   PARTITION: RANDOM              |
|                                  |
|   STREAM DATA SINK               |
|     EXCHANGE ID: 03              |
|     UNPARTITIONED                |
|                                  |
|   2:AGGREGATE (update serialize) |
|   |  output: count(*)            |
|   |  group by:                   |
|   |                              |
|   1:Project                      |
|   |  <slot 20> : 1               |
|   |                              |
|   0:HdfsScanNode                 |
|      TABLE: lineorder            |
|      partitions=1/1              |
|      cardinality=600037902       |
|      avgRowSize=5.0              |
+----------------------------------+
The rewritten plan looks like this:

+-----------------------------------+
| Explain String                    |
+-----------------------------------+
| PLAN FRAGMENT 0                   |
|  OUTPUT EXPRS:18: count           |
|   PARTITION: UNPARTITIONED        |
|                                   |
|   RESULT SINK                     |
|                                   |
|   3:AGGREGATE (merge finalize)    |
|   |  output: sum(18: count)       |
|   |  group by:                    |
|   |                               |
|   2:EXCHANGE                      |
|                                   |
| PLAN FRAGMENT 1                   |
|  OUTPUT EXPRS:                    |
|   PARTITION: RANDOM               |
|                                   |
|   STREAM DATA SINK                |
|     EXCHANGE ID: 02               |
|     UNPARTITIONED                 |
|                                   |
|   1:AGGREGATE (update serialize)  |
|   |  output: sum(19: ___count___) |
|   |  group by:                    |
|   |                               |
|   0:HdfsScanNode                  |
|      TABLE: lineorder             |
|      partitions=1/1               |
|      cardinality=1                |
|      avgRowSize=1.0               |
+-----------------------------------+
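
The two aggregate stages in the rewritten plan can be read as a two-phase sum. The following Python sketch is a simplified model under assumptions, not the actual executor: `update_serialize` and `merge_finalize` are hypothetical names borrowed from the plan labels, and the per-instance values stand in for the `___count___` column produced by the scan.

```python
from typing import Iterable, List


def update_serialize(count_values: Iterable[int]) -> int:
    """Partial aggregation in one scan fragment instance: sum its ___count___ values."""
    return sum(count_values)


def merge_finalize(partials: Iterable[int]) -> int:
    """Final aggregation after the exchange: sum the partial sums."""
    return sum(partials)


# Hypothetical: three fragment instances, each holding one ___count___
# value per row group it scanned.
instances: List[List[int]] = [[600, 400], [250], [1000, 50, 25]]

partials = [update_serialize(values) for values in instances]
total = merge_finalize(partials)
print(partials, total)
```

Because addition is associative, summing per-instance partial sums gives the same result as counting every row, so the final `count` output is unchanged by the rewrite.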
Fixes #45242

Signed-off-by: yanz <[email protected]>
@github-actions github-actions bot added the documentation Improvements or additions to documentation label May 10, 2024
xiangguangyxg and others added 10 commits May 10, 2024 10:31
…45266)

Signed-off-by: xiangguangyxg <[email protected]>
Signed-off-by: 絵空事スピリット <[email protected]>
Co-authored-by: 絵空事スピリット <[email protected]>
Signed-off-by: trueeyu <[email protected]>
Signed-off-by: evelynzhaojie <[email protected]>
Co-authored-by: evelyn.zhaojie <[email protected]>
Signed-off-by: hellolilyliuyi <[email protected]>
Signed-off-by: 絵空事スピリット <[email protected]>
Signed-off-by: evelyn.zhaojie <[email protected]>
Co-authored-by: 絵空事スピリット <[email protected]>
Co-authored-by: evelyn.zhaojie <[email protected]>
MatthewH00 and others added 29 commits May 15, 2024 18:57
Why I'm doing:
To address the CVE, we need to upgrade the Hadoop SDK from 3.3.6 to 3.4.0.
The upgrade introduces AWS Java SDK v2, so we can remove SDK v1.

Signed-off-by: Smith Cruise <[email protected]>
Add titles for the intro pages. These will be required if we switch to auto-generated navigation in the future.

Signed-off-by: DanRoscigno <[email protected]>
@node node merged commit 3e1b7e2 into vivo:main May 17, 2024
6 of 7 checks passed
Labels
⤵️ pull documentation Improvements or additions to documentation title needs [type]