-
Notifications
You must be signed in to change notification settings - Fork 168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cleanup(libsinsp): consolidation and extension of libsinsp stats / metrics sinsp_stats_v2
#1433
cleanup(libsinsp): consolidation and extension of libsinsp stats / metrics sinsp_stats_v2
#1433
Conversation
The corresponding Falco PR is up and here is some example output of the new metrics falcosecurity/falco#2883 (comment) |
532e92d
to
9e79c35
Compare
Had a new idea on how to expose additional stats / metrics around the container engine cache. Missing container images in the logs has been called out for a very very long time. While in some cases when a new container just comes up we can't do much about it, we at least want to make sure everything we can improve has been considered. It will help to investigate falcosecurity/falco#2708 better. |
ca550c7
to
3b2dc0b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this PR; generally speaking, i'd avoid the check for m_inspector != nullptr
since that is already taken for granted in so many places.
Also, i'd disable stats by default and i'd add a sinsp_enable_stats
to enable them, making the m_sinsp_stats_v2
field a unique_ptr that is only allocated if stats are enabled.
userspace/libsinsp/container.cpp
Outdated
@@ -88,11 +88,24 @@ bool sinsp_container_manager::remove_inactive_containers() | |||
}); | |||
|
|||
auto containers = m_containers.lock(); | |||
if (m_inspector != nullptr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think these checks are necessary since eg: L69 above is already assuming that m_inspector
cannot be null.
userspace/libsinsp/fdinfo.cpp
Outdated
#ifdef GATHER_INTERNAL_STATS | ||
m_inspector->m_stats->m_n_added_fds++; | ||
#endif | ||
if (m_inspector != nullptr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as other comment, 3 lines above we are already accessing m_inspector
with no checks; i'd avoid polluting the code :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same below.
userspace/libsinsp/parsers.cpp
Outdated
#ifdef GATHER_INTERNAL_STATS | ||
m_inspector->m_thread_manager->m_failed_lookups->decrement(); | ||
#endif | ||
if (m_inspector != nullptr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again!
userspace/libsinsp/sinsp.h
Outdated
@@ -1121,7 +1122,8 @@ VISIBILITY_PRIVATE | |||
bool m_is_dumping; | |||
const scap_machine_info* m_machine_info; | |||
const scap_agent_info* m_agent_info; | |||
scap_stats_v2 m_sinsp_stats_v2[SINSP_MAX_RESOURCE_UTILIZATION]; | |||
sinsp_stats_v2 m_sinsp_stats_v2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since stats were protected by a macro before, can we retains a similar approach by making this back a unique_ptr
and only creating it if requested (ie: by a sinsp_enable_stats()
API or similar that must be called before the open
ing of the inspector).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we don't need a unique_ptr, just ensure the struct is zero-initialized and have an option to disable the stats updates?
If you disable the stats and then read them and expect non-zero values, I'm not sure you have someone else to blame :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would save us lots of if m_sinsp_stats_v2 != nil
checks, neat!
EDIT: on a second thought, we still have the checks, and still have to embed a "huge" struct inside the inspector + a boolean to know if stats are enabled.
I think a single unique_ptr
would allow us:
- to not waste space when not needed
- do not use the bool at all since we already have the ptr != nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gnosek ^ you also onboard, if yes can go ahead and re-push later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@FedeDP pushed a suggestion, plz let me know. It seems checking the inspector would only be needed in one place and in fact in most cases it's already de-referenced anyways before.
In addition, luckily had a pod sandbox running in the background and realized we should not be counting sandboxes as missing container images ... that's the one important metric I really would like to check on in prod, so we better get it right.
@FedeDP thanks for answering the question if we want a flag to enable everything. Re the null checks, some unit tests create conditions where the inspector in fact would be NULL, hence follow up question:
Since you didn't object against removing the old implementation I assume that means it's ok? |
Uh i totally missed that and answered it randomly 🤣
I'd say either 1 or 3; checking if inspector is null "sometimes" makes no great sense since it has a penalty and we cannot still be sure that inspector is always not null.
I haven't got strong opinion, i think it is ok to remove/improve it like you did. |
764b093
to
140f5d7
Compare
@FedeDP and @gnosek I added the flag in for now and kept is as simple struct as before as @gnosek suggested, but I can always change it if needed. Personally feel these stats should be as simple as possible in any regards. Also cleaned up the git history a little bit. Once we decide on #1433 (comment) I'll make those changes. |
@@ -59,7 +59,8 @@ class sinsp_with_test_input : public ::testing::Test { | |||
sinsp m_inspector; | |||
|
|||
void open_inspector(scap_mode_t mode = SCAP_MODE_TEST) { | |||
m_inspector.open_test_input(m_test_data.get(), mode); | |||
m_inspector.open_test_input(m_test_data.get(), mode); | |||
m_inspector.set_sinsp_stats_v2_enabled(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: Setting this in the unit test only did not work.
140f5d7
to
c58282c
Compare
userspace/libsinsp/sinsp.cpp
Outdated
@@ -2708,6 +2660,27 @@ void sinsp::set_proc_scan_log_interval_ms(uint64_t val) | |||
m_proc_scan_log_interval_ms = val; | |||
} | |||
|
|||
void sinsp::set_sinsp_stats_v2_enabled() | |||
{ | |||
m_sinsp_stats_v2 = std::make_unique<sinsp_stats_v2>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@FedeDP forgot to add checks if it's NULL in case you would call it multiple times. Will re-push, but I'll wait for your feedback first in case there is more.
c2f67ce
to
1d50e28
Compare
@jasondellaluce is also on board with this interim refactor that will allow us to consume the new metrics. One more question: Do we or can we merge this and the Falco PR this week? Else I would propose the week of Nov 13th as Jason and I will be out at KubeCon NA. This Falco PR falcosecurity/falco#2879 also needs to be merged before the corresponding Falco PR falcosecurity/falco#2883 or I consolidate those 2 PRs in one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments. To be honest, I'm not fully convinced by the approach we are following with these stats, we have sinsp_stats, machine_info_stats, agent_info_stats, and resource_stats ... some in the platforms some with a shared pointer in the inspector and then populated with an external method libsinsp::stats::get_...
. Probably before going on with further steps, I would like to plan a final design to understand what is the final aim because at the moment it is not clear to me
{ | ||
if (m_sinsp_stats_v2 == nullptr) | ||
{ | ||
m_sinsp_stats_v2 = std::make_unique<sinsp_stats_v2>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want a shared or a unique pointer? if consumers want to manage these metrics probably we want to use a shared pointer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct needs to be a shared pointer exactly because of what you wrote here
uint64_t m_n_failed_thread_lookups; | ||
uint64_t m_n_added_threads; | ||
uint64_t m_n_removed_threads; | ||
uint32_t m_n_missing_container_images; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why uint32_t
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems large enough and uint64_t likely never needed. I doubt machines would go above 200-500 containers per host at a time anytime soon ... and let me know if we need additional comments to clarify that these container counters are not monotonic and the current view only similar to n_threads and n_fds. We could make them monotonic, but then we need to hook into the container engine differently. I know it's always a mess and believe me likely no winning ever with finding the perfect metrics.
userspace/libsinsp/sinsp.cpp
Outdated
@@ -44,6 +44,7 @@ limitations under the License. | |||
#include "plugin_filtercheck.h" | |||
#include "strl.h" | |||
#include "scap-int.h" | |||
#include "stats.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we include sinsp.h
so probably we don't need stats.h
also here
m_inspector.set_sinsp_stats_v2_enabled(); | ||
// Extra call to see we don;t fail | ||
m_inspector.set_sinsp_stats_v2_enabled(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would enable the stats directly on the tests we need, not in all tests by default
m_failed_lookups->increment(); | ||
#endif | ||
// Extra m_inspector nullptr check | ||
if (m_inspector != nullptr && m_inspector->m_sinsp_stats_v2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the thread manager is contained into the inspector so probably we don't need this null check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the problem child condition: asan will fail if we don't check here because of TEST(sinsp_thread_manager, remove_non_existing_thread)
, change the unit test?
@@ -1444,14 +1444,6 @@ void sinsp_thread_manager::clear() | |||
m_last_tid = 0; | |||
m_last_flush_time_ns = 0; | |||
m_n_drops = 0; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SInce we have removed all these fields at this point i would remove also m_n_drops
and add it to sinsp_stats. Maybe we could use a more meaningful name like full_threadtable_drops
. In this way we can handle this stat like all the others without a dedicated method get_m_n_drops
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, but came at a small costs in terms of changing a g_logger log that now only fires if sinsp stats v2 are enabled.
userspace/libsinsp/threadinfo.cpp
Outdated
if (m_inspector->m_sinsp_stats_v2) | ||
{ | ||
m_inspector->m_sinsp_stats_v2->m_n_added_threads++; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move this to the end of the function because here it is not counting the added_threads
, the table could be full. This counts the requests
of adding a thread
buffer[SINSP_STATS_V2_N_FDS].value.u64 = 0; | ||
threadinfo_map_t* threadtable = thread_manager->get_threads(); | ||
threadtable->loop([&] (sinsp_threadinfo& tinfo) { | ||
sinsp_fdtable* fdtable = tinfo.get_fd_table(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here we need a check sinsp_fdtable != nullptr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ more segfaults is the last thing we need, which makes we wonder if we could come up with a more proper project inspection in this regard as sometimes we say we trust the pointer to be never NULL and prefer to save the overhead of checking vs sometimes we definitely need to check ...
Before going on with code changes, I have some general questions not related to the code:
WDYT? |
- Prepare m_sinsp_stats_v2 buffer to hold more stats and metrics than resource utilization metrics - Expand resource utilization metrics (e.g. include overall host CPU and memory usages) Signed-off-by: Melissa Kilby <[email protected]>
Signed-off-by: Melissa Kilby <[email protected]>
- Finalize exposing valuable legacy libsinsp stats / metrics / counters related to the sinsp state tables (e.g. adding, removing, storing, failed lookup activities) to the consolidated libsinsp stats - Add counter of currently cached containers as well - Includes additional cleanups Signed-off-by: Melissa Kilby <[email protected]>
Signed-off-by: Melissa Kilby <[email protected]>
Signed-off-by: Melissa Kilby <[email protected]>
- Get container cache stats via hijacking the periodic container cache flush functionality - Includes some general cleanup Signed-off-by: Melissa Kilby <[email protected]>
Co-authored-by: Federico Di Pierro <[email protected]> Signed-off-by: Melissa Kilby <[email protected]>
* make m_sinsp_stats_v2 a smart pointer initialized only when enabling sinsp_stats_v2 counters * only keep m_inspector nullptr check for `sinsp_thread_manager::remove_thread(int64_t tid)` Co-authored-by: Federico Di Pierro <[email protected]> Signed-off-by: Melissa Kilby <[email protected]>
Signed-off-by: Melissa Kilby <[email protected]>
Co-authored-by: Andrea Terzolo <[email protected]> Signed-off-by: Melissa Kilby <[email protected]>
1d50e28
to
0c856e8
Compare
@Andreagit97 thank you, I have moved the discussion to relevant issues. |
thank you, I will take a look in these days! sorry, but I'm a bit busy with k8s plugin + falco config/falco-driver-loader + various cleanup 😖 In the meanwhile if someone has other ideas please jump in 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
LGTM label has been added. Git tree hash: b1e038d84fffa650a145d188fae3ef2244f92de1
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Andreagit97, incertum, jasondellaluce The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind cleanup
/kind feature
Any specific area of the project related to this PR?
/area libsinsp
Does this PR require a change in the driver versions?
What this PR does / why we need it:
See #1347: Expanding libsinsp stats will help to debug and optimize libsinsp functionality around the threadtable and other aspects in production settings.
In essence, this PR refactors / consolidates / simplifies the libsinsp stats interface and adds valuable metrics that already existed, but weren't easily accessible by clients such as Falco. Plus added overall CPU and memory usages of the host to put the agents' usage easier in relation to the current server load.
Justification for removing the old sinsp stats:
First every metric was preserved with the replacement solution, except the counters below were duplicative of the scap counters, hence we do not re-copy them in the new sinsp stats solution
The following 2 metrics are snapshot metrics anyways, therefore they now take place in
libsinsp::stats::get_sinsp_stats_v2
just like the resource utilization snapshots.As said all other counters that were on the hot path are preserved in the simple
sinsp_stats_v2
struct.Please note that
GATHER_INTERNAL_STATS
is currently broken in 2 places: When activated libsinsp would not compile because of missing updates infdinfo.h
and a legacy threadtable loop approachvoid sinsp_thread_manager::update_statistics()
-> This is why I believe no one could have seriously used those stats for some time.Additional information:
During the October core maintainer meeting we discussed, that we are all fine with having all these stats / metrics in the
sinsp_stats_v2_buffer
and while this PR at least removes and consolidate legacy libsinsp stats, we are all on the same page that as we further cleanup and refactor libsinsp the libsinsp stats functionality will be part of future cleanups.Question:
Are we ok with just having all these counters just like that on the hot path or do we want to gate them? Meaning another if condition or is it cheaper to just increment or decrement the respective count real fast?
Which issue(s) this PR fixes:
#1347
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: