Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Add support for controlling the NettyTransport's byteBuf alloc… #1709

Open
wants to merge 1 commit into
base: 1.1.x
Choose a base branch
from

Conversation

He-Pin
Copy link
Member

@He-Pin He-Pin commented Jan 14, 2025

…ator type. (#1707)

  • chore: Add support for controlling the NettyTransport's byteBuf allocator type.

  • chore: extract deriveByteBufAllocator method

(cherry picked from commit dbc9ed3)

related thread: https://lists.apache.org/thread/6lj5lbg18hz4n9zckn0b89p9mv9nz14g

…ator type. (apache#1707)

* chore: Add support for controlling the NettyTransport's byteBuf allocator type.

* chore: extract deriveByteBufAllocator method

(cherry picked from commit dbc9ed3)
@pjfanning
Copy link
Contributor

This is grey area for me. Whether a change like this be in a patch. If it is really useful then I'd support merging it. That said, I'd prefer if the Flink team test with a 1.2 snapshot and then if they confirm that they get the behaviour that they want then we can merge this.

@He-Pin
Copy link
Member Author

He-Pin commented Jan 18, 2025

image

@pjfanning
Copy link
Contributor

pjfanning commented Jan 18, 2025

image

I'm the type of person who puts more faith in opinions if they are backed up by work like in testing Flink with a Pekko snapshot that supports this config before we go to the effort of doing releases.

Example build with this change.

https://repository.apache.org/content/groups/snapshots/org/apache/pekko/pekko-actor_2.13/1.2.0-M0+55-a75bc7a7-SNAPSHOT/

@He-Pin
Copy link
Member Author

He-Pin commented Jan 21, 2025

@pjfanning I still think we should merge this and make sure we get it in the next 1.1.4 to help the Flink community.

@He-Pin
Copy link
Member Author

He-Pin commented Jan 21, 2025

refs: apache/flink#26042 too

@pjfanning
Copy link
Contributor

I am -1 on this. Let the Flink team approach us and convince us. I have seen noone say that this is a must for Flink.

@afedulov
Copy link

@pjfanning A brief explanation of why this is generally needed: Flink uses Netty not only for RPC but also for data shuffling. Migrating to Netty 4 for RPC caused some tests to become flaky, primarily due to differences in memory allocation in the newer version. The safest option for us would be to set Pekko allocation to unpooled, as this would roughly replicate the memory requirements of the previous version and thereby reduce the probability of tipping containers at the memory limit into OOM in production. The problem is that we cannot apply this as a global setting, because other parts of the framework, such as the core data exchange network stack, also use Netty 4 and would consequently be affected by the change.

@pjfanning
Copy link
Contributor

@afedulov have you tested with the pekko 1.2 snapshot jars that already support this?

@ferenc-csaky
Copy link

@pjfanning I just created a local Pekko build (more details in apache/flink#25955) from the 1.1.x branch with this PR applied on top and executed our failing test with setting the bytebuf-allocator-type to unpooled. With that setting, at least on my machine the failing test was successful from 5/5 runs.

So we believe this config option would be useful for us if you guys can get it into 1.1.4. From our side it is not something that is urgent, but it could be useful for use if in case there will be a 1.1.4 release, this change would be part of it.

@pjfanning
Copy link
Contributor

pjfanning commented Jan 22, 2025

This PR is not merged. You can only get this support by using the latest 1.2 snapshots.

See #1709 (comment)

@He-Pin
Copy link
Member Author

He-Pin commented Jan 22, 2025

This PR is not merged. You can only get this support by using the latest 1.2 snapshots.

See #1709 (comment)

And I built 1.1.x after I applied your commit from

Seems he did a local patch

@ferenc-csaky
Copy link

Yes, I built Pekko myself locally after I applied this patch to 1.1.x locally, and built Flink against that version with the added setting to run the test.

@pjfanning
Copy link
Contributor

@raboof @nvollmar What do you think? This looks not too bad for a patch - not ideal to have a non-bug fix in a patch but the change is small enough and doesn't have an API impact. The default behaviour remains unchanged and you have to opt in via a config change to see any behaviour change.

@mdedetrich
Copy link
Contributor

PR looks fine to me but I would prefer to have it released in a non patch update

@He-Pin
Copy link
Member Author

He-Pin commented Jan 24, 2025

@mdedetrich it's already in the main, here is to decide if we include it in 1.1.4

@pjfanning
Copy link
Contributor

There is an option to do a 1.2.0-M1 release soon.

@mdedetrich
Copy link
Contributor

Then let's do a 1.20 release, we can backport strictly fixes from main to 1.1.x branch if needed.

@He-Pin
Copy link
Member Author

He-Pin commented Jan 24, 2025

I'm not sure if I can have all my features ready in 1.2.0, Chinese Spring Festival is coming. I still have some features in mind but not yet there.

@pjfanning
Copy link
Contributor

pjfanning commented Jan 24, 2025

I'm not sure if I can have all my features ready in 1.2.0, Chinese Spring Festival is coming. I still have some features in mind but not yet there.

I'm thinking 1.2.0-M1, 1.2.0-M2, maybe some more milestones and later this year, a full 1.2.0 release.

Flink team should be fine using 1.2.0-M1 in production if we don't add any extra changes to what is already on main branch.

@He-Pin
Copy link
Member Author

He-Pin commented Jan 24, 2025

I think pekko's M1 release should be very stable too, @afedulov @ferenc-csaky wdyt about it.

@ferenc-csaky
Copy link

Does the 1.2 branch includes a Netty version update compared to 1.1.x? Cause if yes, it might be not as straightforward for Flink to jump to it.

@pjfanning
Copy link
Contributor

pjfanning commented Jan 27, 2025

Does the 1.2 branch includes a Netty version update compared to 1.1.x? Cause if yes, it might be not as straightforward for Flink to jump to it.

netty is a provided dependency of pekko-remote - so choose your own patch version of netty as long it is 4.1.x.

@ferenc-csaky
Copy link

@pjfanning in that case it should be no problem on our end to wait for 1.2, thanks! 👍

@afedulov
Copy link

afedulov commented Jan 28, 2025

@He-Pin @pjfanning @mdedetrich
First of all, thank you for your support of the Flink project and for attempting to deliver this patch in a timely manner! At this stage, we decided to proceed with upgrading to Netty 4 in our own patch release, accepting the potential increase in memory usage. Without @He-Pin's patch, we are unable to replicate the Netty 3 behavior since the allocator setting is global, and Netty 4 (with its default settings) is already used in the performance-critical data exchange network stack.

Since we are already at the stage of creating release candidates, we cannot wait for the 1.2-M1 releases. However, it would still be valuable to have the 1.2-M1 release available as a fallback option in case we underestimated the impact of the upgrade to Netty 4 and need to provide a timely patch on our side.

@ferenc-csaky
Copy link

ferenc-csaky commented Jan 29, 2025

@afedulov To be clear, I did not mean to bump Pekko for the currently in-flight patch releases, but for the next one against Flink 1.20. By that time, I am sure Pekko 1.2 will be released properly. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants