K8S client issues #2973
Comments
Anyway, I would be happy to work on the new implementation.
/assign alacuku
Thank you very much for this! We really need it as soon as possible; for this reason I will put it in the 0.11.0 milestone! This is huge, so of course, in case of blockers we can move it to the next milestone! /milestone 0.11.0 BTW, thank you very much for taking care of this!
What a well-written proposal. So kudos! It would definitely resolve our performance issues. Strong +1 from me.
I will be happy to help in any way that I can on this. Looking forward to seeing this implemented to resolve #2485. Thanks in advance :)
I think we are too late to have this in Falco 0.35, so /milestone 0.12.0
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Provide feedback via https://github.com/falcosecurity/community. /lifecycle stale
/remove-lifecycle stale |
Cross-linking falcosecurity/plugins#378 |
Yep, to be honest I would move it to Falco, because we will address it in Falco 0.37.0 rather than in libs 0.14.0.
Hi, since this is going to be a few separate components, and it will be talking to the API server, I wanted to ask you to consider putting these components into a separate namespace from the main Falco pods. Better still would be a separate namespace for each component, but I realize that a single shared namespace is a common pattern, so I hope to simply keep the syscall pods separate from the API-server pods. Due to security policies in our clusters, this will greatly simplify the rollout for our org once the new components are released.
Yes, this is the idea; we will see whether we face any issues in doing this during the integration. We will keep you posted!
I like the principle too. |
Thank you for clarifying. It looks like the issue I might be having is not going to be addressed by this change. For me, it is those basic k8s.* labels I mentioned that disappear, resulting in excessive alerts. I can resolve it by restarting the Falco pods. I am surprised the K8s namespace metadata comes from the container runtime and not the API server itself. I guess I may need to raise a separate issue for that if I determine the exact steps to reproduce it.
This might be related to drops or other issues related to the container runtime. Or perhaps, some conflict with the old k8s client (have you tried to remove -k/-K?). In any case, I agree yours is likely a different issue.
This would be great 👍
Magic 🤯 This was introduced a couple of years ago in Falco and is likely not well-documented. I hope we will do better in documenting the feature coming with the new release (cc @LucaGuerra).
This should be solved by Falco 0.37.0! Feel free to reopen if this is still an issue.
Hi @Andreagit97. Since the update, it looks like Falco isn't populating the namespace field. I restarted the daemonset but the situation continues.
Hey @mateuszdrab, some questions:
Hi @Andreagit97. Values for the deployment are:
The version is latest. In the query below, you can see a sample alert that used to have a namespace in the JSON fields but now no longer has it.
Not sure which container runtime you are using. As an alternative, you can enable the k8smeta plugin.
Hey Andrea. From our previous conversation, I thought this should already be working. I can see the socket is mounted. Startup log:
Looks like I need to fix up the deprecations in the rules as well.
Ok, I updated my previous answer because it was not clear. Yes, the helm chart does the magic under the hood, passing this value through the Falco command line: https://github.com/falcosecurity/charts/blob/68db0665712d657be00be36e5b7933626220d3e7/charts/falco/templates/pod-template.tpl#L67
From what I can see in Grafana, since Feb 6th the field was never populated.
Indeed that's correct
That's weird, because the socket is mounted.
Okay, this is fine because it's not extracted directly from the container runtime. What about the container image and name fields?
Yep, this is fine because internally Falco will add it.
That makes sense, and no, the container image and name are missing. Example log line (sorry I posted it above but it was hard to notice under the Falco startup warnings):
Here is a log line from January (during the time when the labels were sometimes missing until I restarted Falco):
Here is a log line from January (during the time when the labels were working):
Ok, thank you for all the info! Yes, this seems to be a regression. As a temporary workaround, you could try to enable the k8smeta plugin.
Sure @Andreagit97. I just did a quick (slow) Loki query over the past 90 days and you can see the result below. Since Feb 6th, there have been thousands of alerts with no namespace and none with a namespace.
Ok got it and Feb 6th is the date on which you switched from 0.36.0 to 0.37.1... pretty interesting
Unfortunately this seems yet another issue :/
True, let's focus on this one for now as it's not working at all. I've enabled the plugin. I'm upgrading k3s now to v1.27.11-k3s1 and will report back soon.
Please note that when you enable the k8smeta plugin, it exposes its own set of fields.
@mateuszdrab here you can find all the fields exposed by the k8smeta plugin: https://github.com/falcosecurity/plugins/tree/master/plugins/k8smeta#supported-fields |
I can confirm Falco 0.37.1 doesn't extract the field.
Unfortunately, despite enabling the plugin, no fields are populated.
Question: talking about k3s, shouldn't the socket be located at the k3s containerd path? For example, the CNCF Green Reviews TAG has k3s containerd on their cluster and I am getting the container information. See the setup here: we explicitly mount the k3s path as well (https://github.com/falcosecurity/cncf-green-review-testing/blob/main/kustomize/falco-driver/modern_ebpf/daemonset.yaml#L77 and https://github.com/falcosecurity/cncf-green-review-testing/blob/main/kustomize/falco-driver/modern_ebpf/daemonset.yaml#L124), and it's one of the default locations; see our Troubleshooting Guide: https://falco.org/docs/troubleshooting/missing-fields/#missing-container-images. Therefore there is no need to pass the socket location via the command line.
That was it; the socket path was the issue.
I was checking the paths earlier and I was trying to remember how it was set up, since I've not touched it in ages. I think the reason why I left it like that is because I wanted containerd to follow k3s's version, as sharing them required extra configuration. It remains bizarre how it used to work and now it doesn't. Thank you @incertum! Maybe it would be worth adding some sort of validation at start-up that the socket is working but has no pods, or something like that? Btw, the k8smeta collector still isn't populating its fields.
Oh ok, now it is clear: Falco 0.36.2 also works without the right container path because these fields were provided by the old k8s client.
I don't see any k8smeta fields in your rules.
That makes sense regarding the old client - I'm glad I got this working. Hopefully the other issue, about the k8s fields going missing after some time, will no longer be present due to the switch in metadata sourcing.
Shouldn't the alerts have those fields in the JSON though, even if the rules are not specifically using them at the moment?
Nope, at the moment you can use them only by adding them to the rules.
Ahh, okay. Would be nice to see them in the log. Thanks for your time and support.
Motivation
Falco has a built-in functionality called Kubernetes Metadata Enrichment. It provides k8s metadata, fetched from the k8s api-server, which Falco uses to enrich syscall events. Furthermore, this metadata is available to users as event fields to be used in the conditions and outputs of Falco rules.

Current implementation
The built-in k8s client is written in C++ and lives in the libs repo. It has a number of issues discussed in the next sections.
Poor performance
The metadata is fetched asynchronously from the api-server by opening a connection to the watch endpoint. Periodically, the client checks for new events and processes them. The processing is done in the main loop, blocking the processing of the syscall stream coming from the driver/probe. In large environments this causes syscall events to be dropped. For more info, please see #2129. The following scheme shows how the client handles events coming from the api-server (some calls have been omitted for the sake of clarity):

Stability issues
The network connections to the api-server are far from stable. The client does not handle temporary network failures, causing Falco to crash and restart. When deploying Falco in large clusters, all the Falco instances (one per node) connect to the api-server, and it can happen that some of them are throttled and their connection is closed. In such cases the Falco pods will crash and restart. For more info see #1909.
Scalability issues
A watch is opened toward the api-server for each resource. Currently, we collect metadata for the following resources:
In a 5K-node cluster, the Falco instances will open 40,000 watches. This approach does not scale well and contributes to control-plane outages in large clusters.
Only Pod objects are efficiently retrieved, by using the NodeSelector. For the other resource types there is no way to filter them, so we cache them all locally, leading to a significant waste of resources, especially memory.
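To make the NodeSelector point concrete: Pods can be filtered server-side by node, which is why they are the only resource that can be retrieved per node. Below is a minimal client-go sketch of such a node-scoped watch; the in-cluster configuration and the NODE_NAME environment variable are assumptions made for the example (the actual libs client is written in C++, this only illustrates the filtering pattern):

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster configuration; assumes the pod runs with a suitable service account.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch only the Pods scheduled on this node: the api-server filters
	// server-side, so the client sees a per-node subset instead of the whole
	// cluster. NODE_NAME is assumed to be injected via the downward API.
	node := os.Getenv("NODE_NAME")
	w, err := client.CoreV1().Pods(metav1.NamespaceAll).Watch(context.Background(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + node,
	})
	if err != nil {
		panic(err)
	}
	for ev := range w.ResultChan() {
		fmt.Printf("event: %s %T\n", ev.Type, ev.Object)
	}
}
```

No equivalent node-scoped selector exists for the other watched resource types, which is why each Falco instance ends up caching cluster-wide copies of them.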
Known issues:
Native libs (client-go) based implementation
The idea is to introduce a centralized component that retrieves the K8s metadata required by the Falco instances. The new component should be based on modern libraries, such as the ones provided by the k8s project.
Motivation
See the section about current implementation.
Goals
Fetch the metadata from the api-server using a centralized component;

Non-Goals
Proposal
Provide a new component, to be deployed in the K8s cluster, responsible for providing the metadata required by each Falco instance.

Metadata to be sent to Falco:

Each Falco instance will subscribe to the centralized component, providing the node name where it resides, and will receive metadata through a stream of updates. The communication between the Falco instances and the centralized component uses gRPC over HTTP/2.
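As an illustration of that subscription model, here is a small Go sketch of the client-side contract. The Update type, the Stream and Subscriber interfaces, and the Consume helper are placeholders standing in for whatever gRPC-generated stubs the final .proto will define; only the shape of the interaction (send the node name once, then read a stream that starts with the cached snapshot) is taken from the proposal.

```go
// Package falcoclient sketches how a Falco instance could consume the
// collector's update stream. All names here are invented for the example.
package falcoclient

import (
	"context"
	"errors"
	"io"
	"log"
)

// Update is a placeholder for one metadata event streamed by the collector.
type Update struct {
	Kind      string // e.g. "pod"
	UID       string
	Namespace string
	Name      string
}

// Stream mirrors the Recv() shape of a gRPC server-streaming client.
type Stream interface {
	Recv() (*Update, error)
}

// Subscriber mirrors the client side of the subscription RPC: the Falco
// instance identifies itself with its node name and gets back a stream that
// begins with the cached snapshot for that node, followed by live updates.
type Subscriber interface {
	Subscribe(ctx context.Context, nodeName string) (Stream, error)
}

// Consume drains the stream and hands each update to apply, i.e. the code
// that would refresh Falco's in-memory k8s state.
func Consume(ctx context.Context, sub Subscriber, nodeName string, apply func(*Update)) error {
	stream, err := sub.Subscribe(ctx, nodeName)
	if err != nil {
		return err
	}
	for {
		u, err := stream.Recv()
		if errors.Is(err, io.EOF) {
			return nil // the collector closed the stream
		}
		if err != nil {
			return err
		}
		log.Printf("received %s %s/%s", u.Kind, u.Namespace, u.Name)
		apply(u)
	}
}
```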
Design Details

The Meta Collector will be implemented in Go, relying on controller-runtime to efficiently retrieve and cache K8s resources. An operator/controller will be implemented for each resource needed by Falco. Each operator will connect to the api-server and fetch the initial state for the watched resource. It will create a state in memory and send updates to the Falco instances that need the data. Since we are limited to a subset of the fields of the watched resources, update events will be sent only when those fields change. Updates should be sent to Falco as soon as possible. Each operator will (see the sketch after this list):

- check whether a Modified event received from the api-server produces an update to be sent to the Falco instances;
- pass the resulting update to the message broker, which will send it to the relevant Falco instances.
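A rough controller-runtime sketch of one such operator, purely as an illustration: the PodUpdate and Broker types are invented for the example, while the reconciler shape (cached Get, trimmed-down payload, publish to a broker) is the part that mirrors the description above.

```go
package collector

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// PodUpdate carries only the subset of pod fields Falco cares about; the
// real message type is still to be defined.
type PodUpdate struct {
	UID, Namespace, Name string
	Deleted              bool
}

// Broker is a placeholder for the component that fans updates out to the
// subscribed Falco instances.
type Broker interface {
	Publish(nodeName string, u PodUpdate)
}

// PodReconciler is one of the per-resource operators. controller-runtime
// maintains a local cache, so the Get below is served from memory and does
// not hit the api-server.
type PodReconciler struct {
	client.Client
	Broker Broker
}

func (r *PodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var pod corev1.Pod
	if err := r.Get(ctx, req.NamespacedName, &pod); err != nil {
		if apierrors.IsNotFound(err) {
			// The pod is gone: tell the broker so the Falco instances can drop it.
			// (Node attribution for deletions is glossed over in this sketch.)
			r.Broker.Publish("", PodUpdate{Namespace: req.Namespace, Name: req.Name, Deleted: true})
			return ctrl.Result{}, nil
		}
		return ctrl.Result{}, err
	}
	// Forward only the fields Falco needs; in the real component, predicate
	// filters would make sure this fires only when those fields changed.
	r.Broker.Publish(pod.Spec.NodeName, PodUpdate{
		UID:       string(pod.UID),
		Namespace: pod.Namespace,
		Name:      pod.Name,
	})
	return ctrl.Result{}, nil
}

func (r *PodReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}).
		Complete(r)
}
```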
After the operators complete their initialization, a gRPC server (the message broker) will start and wait for Falco subscriptions. Upon a new subscription, the message broker will compute an initial list of events/updates to send to the Falco instance. Remember that the state is cached, so subscriptions will not generate requests to the api-server. After that, updates will be sent to the Falco instances as they arrive from the api-server.
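A similarly non-authoritative sketch of the broker side: subscriptions are keyed by node name, a new subscriber is first served a snapshot built from the cached state (so the api-server is never contacted on subscription), and subsequent updates are fanned out only to the subscribers of the node they belong to. All type and method names are invented for the illustration.

```go
package collector

import "sync"

// Update stands in for the trimmed-down metadata message sent to Falco.
type Update struct {
	Kind, UID, Namespace, Name string
}

// broker holds the per-node cached state and the per-node subscriber channels.
type broker struct {
	mu    sync.Mutex
	state map[string][]Update      // node name -> cached updates
	subs  map[string][]chan Update // node name -> subscriber channels
}

func newBroker() *broker {
	return &broker{
		state: make(map[string][]Update),
		subs:  make(map[string][]chan Update),
	}
}

// Subscribe registers a Falco instance for a node. The returned channel is
// pre-filled with the cached snapshot, so the subscription itself never
// touches the api-server.
func (b *broker) Subscribe(node string) <-chan Update {
	b.mu.Lock()
	defer b.mu.Unlock()
	cached := b.state[node]
	ch := make(chan Update, len(cached)+1024) // room for the snapshot plus live updates
	for _, u := range cached {
		ch <- u
	}
	b.subs[node] = append(b.subs[node], ch)
	return ch
}

// Publish appends the update to the cache and forwards it only to the
// subscribers of the node it belongs to.
func (b *broker) Publish(node string, u Update) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.state[node] = append(b.state[node], u)
	for _, ch := range b.subs[node] {
		select {
		case ch <- u:
		default: // a full channel means a slow reader; drop rather than block the operators
		}
	}
}
```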
NOTE: This issue will be updated with more implementation details in the upcoming weeks.