-
Notifications
You must be signed in to change notification settings - Fork 910
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update(userspace/falco): add libsinsp state metrics option #2883
Conversation
/hold until the libs PR falcosecurity/libs#1433 is merged Here is some example output showing the new counters and metrics when running Falco on a dev box while monitoring all spawned processes and any file opens: {
"output_fields": {
"evt.source": "syscall",
"evt.time": 1697844214047532110,
"falco.container_memory_used": 0,
"falco.cpu_usage_perc": 4.8,
"falco.cpu_usage_perc_total_host": 5.7,
"falco.duration_sec": 80,
"falco.evts_rate_sec": 793.6313179616168,
"falco.host_boot_ts": 1697665647000000000,
"falco.host_num_cpus": 20,
"falco.hostname": "",
"falco.kernel_release": "",
"falco.memory_pss": 65,
"falco.memory_rss": 68,
"falco.memory_used_host": 10734,
"falco.memory_vsz": 1133,
"falco.n_added_fds": 94985,
"falco.n_added_threads": 131,
"falco.n_cached_fd_lookups": 190444,
"falco.n_cached_thread_lookups": 485510,
"falco.n_containers": 1,
"falco.n_drops_full_threadtable": 0,
"falco.n_failed_fd_lookups": 8903,
"falco.n_failed_thread_lookups": 149,
"falco.n_fds": 154329,
"falco.n_missing_container_images": 0,
"falco.n_noncached_fd_lookups": 195010,
"falco.n_noncached_thread_lookups": 11954,
"falco.n_removed_fds": 95217,
"falco.n_removed_threads": 134,
"falco.n_retrieve_evts_drops": 0,
"falco.n_retrieved_evts": 97367,
"falco.n_store_evts_drops": 0,
"falco.n_stored_evts": 97403,
"falco.n_threads": 2425,
"falco.num_evts": 389371,
"falco.num_evts_prev": 385403,
"falco.open_fds_host": 22280,
"falco.outputs_queue_num_drops": 0,
"falco.procs_running_host": 2,
"falco.start_ts": 1697844133957361054,
"falco.version": "",
"scap.engine_name": "bpf"
},
"sample": 16
} |
a48c9ae
to
eadff28
Compare
6804506
to
1c7bad8
Compare
Signed-off-by: Melissa Kilby <[email protected]>
Signed-off-by: Melissa Kilby <[email protected]>
Signed-off-by: Melissa Kilby <[email protected]>
1c7bad8
to
6690436
Compare
6690436
to
4fc07d5
Compare
…e correction Signed-off-by: Melissa Kilby <[email protected]>
b1f6fe8
to
b7dc0ac
Compare
b7dc0ac
to
de0da82
Compare
… version Signed-off-by: Melissa Kilby <[email protected]>
de0da82
to
6b05bcf
Compare
@@ -34,4 +34,4 @@ limitations under the License. | |||
// It represents the fields supported by this version of Falco, | |||
// the event types, and the underlying driverevent schema. It's used to | |||
// detetect changes in engine version in our CI jobs. | |||
#define FALCO_ENGINE_CHECKSUM "dbc34e88ab420320994d85f155dee6baff2dd018aacc00e249f897edc8b1e0f4" | |||
#define FALCO_ENGINE_CHECKSUM "5d488b68856d70300ae37453295383821822d8423af170eb28e1bef52042f0b3" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jasondellaluce the instructions above to calculate the checksum seem outdated. I tried it locally before pushing and it was wrong, had to re-push the correct value based on CI error logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got the right number by running the command in this branch 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I will double check later, thanks for checking! Maybe I messed up somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I seem to have some stale Falco version in my local branch somehow.
FALCO="build/userspace/falco/falco -c ./falco.yaml";
echo $($FALCO --version | grep 'Engine:' | awk '{print $2}') $(echo $($FALCO --version | grep 'Schema version:' | awk '{print $3}') $($FALCO --list --markdown | grep '^`' | sort) $($FALCO --list-events | sort) | sha256sum);
Tue Nov 28 05:29:32 2023: Falco version: 0.33.0-1218+6b05bcf (x86_64)
Tue Nov 28 05:29:32 2023: Falco initialized with configuration file: ./falco.yaml
Tue Nov 28 05:29:32 2023: Falco version: 0.33.0-1218+6b05bcf (x86_64)
Tue Nov 28 05:29:32 2023: Falco initialized with configuration file: ./falco.yaml
Tue Nov 28 05:29:32 2023: Falco version: 0.33.0-1218+6b05bcf (x86_64)
Tue Nov 28 05:29:32 2023: Falco initialized with configuration file: ./falco.yaml
Tue Nov 28 05:29:32 2023: Falco version: 0.33.0-1218+6b05bcf (x86_64)
Tue Nov 28 05:29:32 2023: Falco initialized with configuration file: ./falco.yaml
0.28.0 1b160cdb7e22279bcc43cd381c66750e3db9ab4431c7fb74b93070d4cb37c7d8 -
cmake \
-DUSE_BUNDLED_DEPS=ON \
-DBUILD_LIBSCAP_GVISOR=ON \
-DBUILD_BPF=ON \
-DBUILD_DRIVER=ON \
-DBUILD_FALCO_MODERN_BPF=OFF \
-DCREATE_TEST_TARGETS=ON \
-DBUILD_FALCO_UNIT_TESTS=ON ..
Not sure what's up here, but it's also not that important as I don't really have any issues on my fedora box ...
@@ -764,6 +773,7 @@ metrics: | |||
output_rule: true | |||
# output_file: /tmp/falco_stats.jsonl | |||
resource_utilization_enabled: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about having a list of metrics instead of many boolean values? maybe something like
enabled_metrics: ["kernel_countes", ... ]
if we like the idea we could deprecate the old flags and introduce this new config. If we go in this way probably I would merge libbpf_stats_enabled
into kernel_event_counters_enabled
, in the end, to have the libbpf_stats_enabled
you need to enable the stats sudo sysctl -w kernel.bpf_stats_enabled=1
, so we already have a knob. Moreover, it is something only related to bpf
so maybe it is better to be more agnostic and hide it inside kernel_countes
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can discuss a enabled_metrics
list. Given we are deprecating more urgent old settings I am not be sure if this is worth it in the near-term. Since metrics is now "Stable" we would need to follow a deprecation cycle. I would propose to open a discussion issue, tag all libs and falco maintainers and if the majority votes in favor of enabled_metrics: ["kernel_countes", ... ]
start the proper deprecation cycle. WDYT?
Kernel settings can be shared amongst teams deploying their tools. Therefore, it can be fair to assume that maybe another team enabled libbpf stats, but you still have a use case of not wanting to collect libbpf stats for Falco. I would prefer giving the end user full control and be very selective with the metrics especially given it's quite a lot of metrics and perhaps you only need some of them sometimes. With that I would also not merge libbf stats and the kernel event counters. All these reasons were the motivation to have these flags that are all teh way pushed down to libs in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah we can discuss it without blocking the PR, i will tag all here so even if we merge this we can discuss with the clear scope of this PR @falcosecurity/falco-maintainers
Re libbpf_stats_enabled
: my idea was to use the sysctl option kernel.bpf_stats_enabled
as a knob, so if a team wants to enable libbpf stats together with the kernel stats it has just to type a sudo sysctl -w kernel.bpf_stats_enabled=1
on the node, while if they want to disable them they could use sudo sysctl -w kernel.bpf_stats_enabled=0
. The libsinsp code at every interval could check kernel.bpf_stats_enabled
(or maybe just at init time, it depends on what we want) and add the libbpf metrics if needed. BTW it was just an idea, no strong opinion here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree about using sysctl
value kernel.bpf_stats_enabled
to chose whether to enable/disable our kernel stats without further configuration by the user, it's a great idea actually! We could use its value as a default value for the configuration key and ship it commented? (same for kernel_event_counters_enabled
eventually)
We can discuss a enabled_metrics list
I don't have a strong opinion on this one; i think from an UX perspective it is better to already have all the keys and just set "true"/"false" whether i need them, than to add strings inside a set.
@@ -54,6 +54,11 @@ falco::app::run_result falco::app::actions::open_live_inspector( | |||
{ | |||
try | |||
{ | |||
if((s.config->m_metrics_flags & PPM_SCAP_STATS_STATE_COUNTERS)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we enable these stats with a source plugin should we expect all empty values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think metrics and plugins is currently a bit broken because of some recent libs refactors (linux platform which does not apply for plugins hence we don't fetch some constants that are needed to for example calculate the CPU utilization). Since the kernel drivers are Falco's primary use case, we can work on the plugin cases later. End goal: When running metrics with plugins you should still be able to fetch all libsinsp metrics that are applicable (depending on the nature of the plugin).
m_metrics_resource_utilization_enabled(true), | ||
m_metrics_kernel_event_counters_enabled(true), | ||
m_metrics_libbpf_stats_enabled(true), | ||
m_metrics_flags((PPM_SCAP_STATS_KERNEL_COUNTERS | PPM_SCAP_STATS_LIBBPF_STATS | PPM_SCAP_STATS_RESOURCE_UTILIZATION | PPM_SCAP_STATS_STATE_COUNTERS)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this change seems to go in the same direction proposed above, one unique config 'to rule them all'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As commented above we can discuss.
} | ||
if (m_writer->m_config->m_metrics_convert_memory_to_mb) | ||
{ | ||
if (strncmp(sinsp_stats_v2_snapshot[stat].name, "container_memory_used", 22) == 0) // exact str match |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe in the future, we could add a flag inside the sinsp struct saying that the metric is relative to the "memory" In this way we could avoid the specific strncmp
or maybe we could use just one base unit as suggested by Prometheus best practises https://prometheus.io/docs/practices/naming/#base-units
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes had something like that in mind for the libs refactor I will start in Dec.
For example, in libs there is still a todo comment to add a unit item to the new metrics schema struct. All these if conditions are not what we want as falco should just be a simple loop over the metrics. In addition, will look into the base units concept you shared.
output_fields[metric_name] = n_evts; | ||
output_fields["scap.n_evts_prev"] = m_last_n_evts; | ||
n_evts_delta = n_evts - m_last_n_evts; | ||
if (n_evts_delta != 0 && stats_snapshot_time_delta_sec > 0) | ||
{ | ||
/* n_evts is total number of kernel side events. */ | ||
output_fields["scap.evts_rate_sec"] = (double)(n_evts_delta / stats_snapshot_time_delta_sec); | ||
output_fields["scap.evts_rate_sec"] = std::round((double)(n_evts_delta / stats_snapshot_time_delta_sec) * 10.0) / 10.0; // round to 1 decimal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe at some point we could compute these metrics directly in sisnp and expose just a list to Falco in this way we could avoid specific strncmp
on various metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes totally agreed to push almost everything (or what we can) down into libs.
Do we want to expose all these metrics? some of them seem to expose some internal details, like Does this metric mean that we have 2 hosts running on the system? "falco.procs_running_host": 2, Side note: we added "falco.n_added_threads": 131,
"falco.n_failed_thread_lookups": 149, |
The idea was me deploying this so that we have data on what libsinsp state metrics are actually useful. I wouldn't be surprised if the data revealed some improvement opportunities for our state handling. Re |
It seems like we don't want users to rely on internal details stats because they might get confused but we are willing to experiment with them. Before final release we need to make sure we're not exposing stuff that is not useful at all, or possibly mark it as internal. It looks to me (but correct me if I'm wrong) that this PR does not change which metrics are exposed but only makes it build properly with the latest libs. I would suggest an issue to keep track of this. Especially after experimentation there is stuff that will be much clearer. |
We have already an issue falcosecurity/libs#1463 for this. |
ok, thank you. Yes maybe we can find a way to document what each metric means somewhere
Yep maybe we can exclude them by default with a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
LGTM label has been added. Git tree hash: 0a3ec79560f2960531c2e90e96702eb0f70393c7
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you all for the discussion and for opening an issue to check pending items for 0.37.0 🙏
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Andreagit97, incertum, LucaGuerra The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/unhold |
What type of PR is this?
/kind cleanup
/kind feature
Any specific area of the project related to this PR?
/area engine
What this PR does / why we need it:
See the corresponding libs PR falcosecurity/libs#1433
Which issue(s) this PR fixes:
falcosecurity/libs#1347
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: