Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flyte-agent configure pod securityContext #4785

Merged

Conversation

ddl-ebrown
Copy link
Contributor

@ddl-ebrown ddl-ebrown commented Jan 27, 2024

  • Follow security best practices and disable privilege escalation by default in the agent. Also make both the pod and container security contexts user configurable. I wanted to drop caps on the container, but left that out given I'm not familiar with what the default agent needs. If running through CI is enough to detect whether anything breaks, then I'm happy to add those back and we can see what happens. Thoughts @davidmirror-ops ?

I will put up a separate PR for the flyte-core charts

Tracking issue

https://github.com/flyteorg/flyte/issues/

Why are the changes needed?

What changes were proposed in this pull request?

How was this patch tested?

Setup process

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

@ddl-ebrown ddl-ebrown force-pushed the add-helm-security-context branch 2 times, most recently from 0b849c3 to 6376a60 Compare January 27, 2024 07:51
Copy link

codecov bot commented Jan 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (abef3f5) 58.54% compared to head (b1e72c6) 59.18%.
Report is 17 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4785      +/-   ##
==========================================
+ Coverage   58.54%   59.18%   +0.63%     
==========================================
  Files         625      644      +19     
  Lines       53669    52491    -1178     
==========================================
- Hits        31423    31068     -355     
+ Misses      19731    18849     -882     
- Partials     2515     2574      +59     
Flag Coverage Δ
unittests 58.32% <ø> (-0.23%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

type: spc_t
# -- Security context for container
securityContext:
allowPrivilegeEscalation: false
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally the container also needs no Linux caps and this would drop like

capabilities:
     drop:
     - ALL

But given I'm not super familiar with what's inside your agent code (and considering there are more than just "default" agents), I left it off as a default. The most common caps I've seen containers need added back are CAP_CHOWN and CAP_MKNOD. CAP_CHOWN generally results from the container image being setup incorrectly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to test this in your branch? I don't think the CI will catch a potential problem, but in general, I don't see the Task containers needing root caps so dropping all should be fine.
@pingsutw do you see the Agent needing elevated permissions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a Python-based agent (based on the Flytekit SDK examples), we haven't seen it require any of the default container capabilities. Note that depending on whether or not you're on containerd or cri-o, the defaults are different. OpenShift maintains a smaller list -- so if you have customers on OpenShift that should be good signal that you don't need things like NET_ADMIN

@@ -54,6 +54,13 @@ serviceAccount:
annotations: {}
# -- ImagePullSecrets to automatically assign to the service account
imagePullSecrets: []
# -- Security context for pod
podSecurityContext:
seLinuxOptions:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we could also set something like:

runAsNonRoot: true
runAsUser: 1000

When I ran the default container it died on startup though -- so there's some additional work to do to make it run non-root which is out of scope for making these settings configurable in Helm

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there were previous attempts to enable at least the flytesnacks containers to run rootless, but nothing concrete yet. Probably that could explain your Issue running it.
Agree that it probably is out of scope here, but definitely it should be configurable.

@ddl-ebrown ddl-ebrown marked this pull request as ready for review January 27, 2024 17:25
@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. documentation Improvements or additions to documentation enhancement New feature or request security Issues related to Security improvements labels Jan 27, 2024
@ddl-ebrown ddl-ebrown changed the title Configure Flyte agent pod securityContext Flyte-agent configure pod securityContext Feb 1, 2024
@@ -54,6 +54,13 @@ serviceAccount:
annotations: {}
# -- ImagePullSecrets to automatically assign to the service account
imagePullSecrets: []
# -- Security context for pod
podSecurityContext:
seLinuxOptions:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there were previous attempts to enable at least the flytesnacks containers to run rootless, but nothing concrete yet. Probably that could explain your Issue running it.
Agree that it probably is out of scope here, but definitely it should be configurable.

type: spc_t
# -- Security context for container
securityContext:
allowPrivilegeEscalation: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to test this in your branch? I don't think the CI will catch a potential problem, but in general, I don't see the Task containers needing root caps so dropping all should be fine.
@pingsutw do you see the Agent needing elevated permissions?

@@ -71,6 +71,9 @@ spec:
helm.sh/chart: flyteagent-v0.1.10
app.kubernetes.io/managed-by: Helm
spec:
securityContext:
seLinuxOptions:
type: spc_t
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering why we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This basically says "don't use SELinux" -- but happy to remove it if you don't find it useful. This is more likely to come up in RedHat / OpenShift land

https://jaosorior.dev/2019/selinux-and-kubernetes/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's useful. I'd just make it configurable and disabled by default. We haven't seen much on-prem users running RHEL/Openshift but if this is useful for them, at least making it available would help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was already configurable, so I just removed the seLinuxOptions altogether. Users can decide if they want it for their agents or not.

 - Follow security best practices and allow for more granular
   configuration. Right now, the default flyteagent container cannot run
   with the following desirable podSecurityContext:

   runAsNonRoot: true
   runAsUser: XXX

   The container crashes on startup

   It would also be desirable to set a container securityContext with:

   capabilities:
     drop:
     - ALL

   The container launched with caps dropped, but I wasn't familiar
   enough with the code in that container to know if that will actually
   work, so didn't set a default

Signed-off-by: ddl-ebrown <[email protected]>
@ddl-ebrown ddl-ebrown force-pushed the add-helm-security-context branch from 6376a60 to b1e72c6 Compare February 2, 2024 17:51
Copy link
Contributor

@davidmirror-ops davidmirror-ops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.
I guess this is just the beginning and we'll still need to revisit the security hardening for the Agent.

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Feb 5, 2024
@ddl-ebrown
Copy link
Contributor Author

Yeah I would generally advocate for secure by default, with explicit changes required to decrease security posture.

That said, you have a lot of different agents right now, with more on the way -- so I didn't want to guess wrong and break everything on you. :)

Longer term it might be good to take a swing at checking the various agents out to see if they behave properly as non-root

@davidmirror-ops davidmirror-ops merged commit 1bbe867 into flyteorg:master Feb 5, 2024
51 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request lgtm This PR has been approved by a maintainer security Issues related to Security improvements size:S This PR changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants