Kata 2.0 Metrics Design

Kata implements the CRI API and supports the ContainerStats and ListContainerStats interfaces to expose container metrics. Users can call these interfaces to get basic metrics about containers.

But unlike runc, Kata is a VM-based runtime and has a different architecture.

Limitations of Kata 1.x and the target of Kata 2.0

Kata 1.x has a number of limitations related to observability that may be obstacles to running Kata Containers at scale.

In Kata 2.0, the following components will be able to provide more details about the system.

  • containerd shim v2 (effectively kata-runtime)
  • Hypervisor statistics
  • Agent process
  • Guest OS statistics

Note: In Kata 1.x, the main user-facing component was the runtime (kata-runtime). Starting with 1.5, Kata introduced the Kata containerd shim v2 (containerd-shim-kata-v2), which is essentially a modified runtime that is loaded by containerd to simplify and improve the way VM-based containers are created and managed.

For Kata 2.0, the main component is the Kata containerd shim v2, although the deprecated kata-runtime binary will be maintained for a period of time.

Any mention of the "Kata runtime" in this document should be taken to refer to the Kata containerd shim v2 unless explicitly noted otherwise (for example by referring to it explicitly as the kata-runtime binary).

Metrics architecture

Kata 2.0 metrics depend heavily on Prometheus, a CNCF graduated project.

Kata Containers 2.0 introduces a new Kata component called kata-monitor, which monitors the other Kata components on the host. It is the monitoring interface to the Kata runtime and is intended to support operations such as:

  • Get metrics
  • Get events

This document covers metrics only; at present, metrics is the only function that kata-monitor supports.

This is an architecture overview of metrics in Kata Containers 2.0.

[Figure: Kata Containers 2.0 metrics architecture]

And the sequence diagram is shown below:

[Figure: Kata Containers 2.0 metrics sequence diagram]

For a quick evaluation, you can check out this how-to.

Kata monitor

kata-monitor is a management agent that runs on a node where many Kata containers may be running.

Note: a node is a single host system or a node in a Kubernetes cluster.

kata-monitor's responsibilities include:

  • Aggregating the metrics of the sandboxes running on this node, adding a sandbox_id label to each
  • Acting as a Prometheus target, so that Prometheus collects all metrics from the Kata shims on this node indirectly. This reduces the number of targets in Prometheus and avoids exposing each shim's metrics on its own ip:port

Only one kata-monitor process runs on each node.

kata-monitor uses a communication channel separate from the one that containerd uses to communicate with the Kata shim: the Kata shim listens on a new socket address for communication with kata-monitor.

kata-monitor obtains the shim's metrics socket address (monitor_address) in the same way that containerd obtains the shim address. The socket is an abstract socket, and its address is saved in a file named abstract in the same directory as the address file used by containerd.

Note: If no Prometheus server is configured, that is, if there are no scrape operations, kata-monitor does nothing on its own initiative.
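The aggregation step listed above (adding a sandbox_id label to every sample collected from a shim) can be sketched as follows. This is a minimal illustration in Python, not kata-monitor's actual implementation: the function name `add_sandbox_label` and the sample lines are hypothetical, and the sketch assumes that label values contain no `}` characters.

```python
import re

# One sample line of the Prometheus text format:
#   name{labels} value    or    name value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{(.*)\})?\s+(.+)$')


def add_sandbox_label(metrics_text: str, sandbox_id: str) -> str:
    """Return metrics_text with a sandbox_id label added to every sample.

    Hypothetical helper for illustration only.
    """
    out = []
    for line in metrics_text.splitlines():
        if not line or line.startswith("#"):
            out.append(line)  # keep HELP/TYPE comments and blank lines
            continue
        m = SAMPLE_RE.match(line)
        if m is None:
            out.append(line)  # leave anything unrecognized untouched
            continue
        name, _, labels, value = m.groups()
        label = f'sandbox_id="{sandbox_id}"'
        labels = f"{labels},{label}" if labels else label
        out.append(f"{name}{{{labels}}} {value}")
    return "\n".join(out)


# Made-up sample input; real shim output contains many more series.
print(add_sandbox_label('kata_shim_proc_stat{item="utime"} 5', "abc"))
```

A complete aggregator would also have to merge the HELP and TYPE comments of metric families that appear in several sandboxes, which this sketch leaves out.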

Kata runtime

The runtime is responsible for:

  • Gathering metrics about the shim process
  • Gathering metrics about the hypervisor process
  • Gathering metrics about the running sandbox
  • Getting metrics from the Kata agent (through ttrpc)

Kata agent

The agent is responsible for:

  • Gathering agent process metrics
  • Gathering guest OS metrics

In Kata 2.0, the agent adds a new interface:

rpc GetMetrics(GetMetricsRequest) returns (Metrics);

message GetMetricsRequest {}

message Metrics {
	string metrics = 1;
}

The metrics field contains Prometheus-encoded content, which avoids defining a fixed structure in protocol buffers.
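Because the metrics field is plain Prometheus text, a consumer can decode it without a dedicated schema. The following Python sketch shows one way to do that. The function name `parse_metrics` and the sample payload are assumptions for illustration, and the parser assumes there are no timestamps and no commas or spaces inside label values.

```python
def parse_metrics(metrics: str) -> dict:
    """Parse Prometheus text content into {(name, labels): value}.

    Hypothetical helper; labels are returned as a sorted tuple of pairs
    so that the key is hashable.
    """
    samples = {}
    for line in metrics.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, _, raw_labels = series.partition("{")
        labels = {}
        for pair in raw_labels.rstrip("}").split(","):
            if pair:
                key, _, val = pair.partition("=")
                labels[key.strip()] = val.strip().strip('"')
        samples[(name.strip(), tuple(sorted(labels.items())))] = float(value)
    return samples


# Made-up payload in the shape of a Metrics.metrics string.
payload = (
    "# HELP kata_guest_load Guest system load.\n"
    "# TYPE kata_guest_load gauge\n"
    'kata_guest_load{item="load1"} 0.5\n'
    "kata_agent_scrape_count 3\n"
)
parsed = parse_metrics(payload)
```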

Performance and overhead

Metrics collection should not become a system bottleneck or degrade performance; it should run with minimal overhead.

Requirements:

  • Metrics MUST be quick to collect
  • Metrics MUST be small
  • Metrics MUST be generated only if there are subscribers to the Kata metrics service
  • Metrics MUST be stateless

In Kata 2.0, metrics are collected mainly from the /proc filesystem and consumed by Prometheus in pull mode. This means that if no Prometheus collector is running, nothing is collected, so there is zero overhead when nobody is interested in the metrics.

Metrics service also doesn't hold any metrics in memory.

| | No Sandbox | 1 Sandbox | 2 Sandboxes |
|---|---|---|---|
| Metrics count | 39 | 106 | 173 |
| Metrics size (bytes) | 9K | 144K | 283K |
| Metrics size (gzipped, bytes) | 2K | 10K | 17K |

Metrics size: Response size of one Prometheus scrape request.

From these figures it is easy to estimate that with 10 sandboxes running on the host, the size of one metrics fetch request issued by Prometheus will be about 9 + (144 - 9) * 10 = 1359K, or roughly 1.36M (not gzipped), or 2 + (10 - 2) * 10 = 82K (gzipped). Prometheus supports gzip compression, which reduces the response size of every request.
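The estimate above is a linear extrapolation: a fixed base size (no sandbox) plus a per-sandbox increment taken from the difference between the "1 Sandbox" and "No Sandbox" columns. A small Python sketch of that arithmetic follows; the function name is hypothetical and the inputs are the measured values from the table.

```python
def estimate_scrape_size_kb(base_kb: float, one_sandbox_kb: float, sandboxes: int) -> float:
    """Extrapolate the response size of one scrape, in K.

    base_kb:        measured size with no sandbox running
    one_sandbox_kb: measured size with exactly one sandbox running
    """
    per_sandbox_kb = one_sandbox_kb - base_kb
    return base_kb + per_sandbox_kb * sandboxes


plain_kb = estimate_scrape_size_kb(9, 144, 10)    # 9 + 135 * 10
gzipped_kb = estimate_scrape_size_kb(2, 10, 10)   # 2 + 8 * 10
print(plain_kb, gzipped_kb)  # 1359 82
```

As a sanity check, the same model predicts 279K (plain) and 18K (gzipped) for 2 sandboxes, close to the measured 283K and 17K.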

Here is some test data:

  • End-to-end (from the Prometheus server to kata-monitor, until kata-monitor writes the response back): 20ms (avg)
  • Agent (the whole RPC from shim to agent): 3ms (avg)

Test infrastructure:

  • OS: Ubuntu 20.04
  • Hardware: Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz, 6 Cores, and 16GB memory.

Scrape interval

The Prometheus default scrape_interval is 1 minute, and it is commonly set to 15s. A smaller scrape_interval causes more overhead, so users should set it according to their monitoring needs.
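To make the trade-off concrete, the sketch below computes how many scrapes happen per hour for a given scrape_interval and how much data that transfers. The function names are hypothetical, and the 82K response size is the gzipped 10-sandbox estimate from the previous section rather than a new measurement.

```python
def scrapes_per_hour(scrape_interval_s: float) -> float:
    """Number of scrape requests Prometheus issues in one hour."""
    return 3600 / scrape_interval_s


def transfer_per_hour_kb(response_kb: float, scrape_interval_s: float) -> float:
    """Data transferred per hour (in K) for one kata-monitor target."""
    return response_kb * scrapes_per_hour(scrape_interval_s)


# 82K per scrape (gzipped, 10 sandboxes, estimated earlier).
at_15s = transfer_per_hour_kb(82, 15)   # 240 scrapes per hour
at_60s = transfer_per_hour_kb(82, 60)   # 60 scrapes per hour
print(at_15s, at_60s)  # 19680.0 4920.0
```

Collection cost on the node scales the same way, since every scrape triggers a fresh read of the underlying /proc data.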

Metrics list

All metrics supported by Kata 2.0 are listed here. Some metrics depend on the guest kernel in the VM, so there may be some differences in your environment.

Metrics are categorized by the component that they are collected from and describe.

Note:

  • The labels listed here do not include the instance and job labels that are added by Prometheus.
  • Notes about metric units
    • Kibibytes, abbreviated KiB. 1 KiB equals 1024 B.
    • For some metrics (such as network device statistics from the file /proc/net/dev), the unit depends on the label (for example, recv_bytes and recv_packets have different units).
    • Most of these metrics are collected from the /proc filesystem, so they keep the same units as /proc. See the proc(5) manual page for further details.
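As an illustration of the unit notes above, the following Python sketch parses text in the format of /proc/<pid>/status into lowercase item names and keeps the values in the unit reported by /proc. Memory fields there are labelled kB but are counted in units of 1024 bytes. The helper names and the sample text are assumptions for the example; Kata's own collectors select a specific set of items rather than every numeric field.

```python
KIB = 1024  # 1 KiB equals 1024 B


def parse_proc_status(text: str) -> dict:
    """Map lowercase field names to integer values, keeping the /proc unit.

    Hypothetical helper: it keeps every field whose first token is an
    integer and skips the rest.
    """
    items = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or not fields:
            continue
        try:
            items[key.strip().lower()] = int(fields[0])
        except ValueError:
            continue  # non-numeric fields such as Name or State
    return items


def kib_to_bytes(kib: int) -> int:
    """Convert a KiB value (as reported by /proc) to bytes."""
    return kib * KIB


# Made-up excerpt in the format of /proc/<pid>/status.
sample = "Name:\tkata-agent\nVmRSS:\t    5120 kB\nvoluntary_ctxt_switches:\t42\n"
status = parse_proc_status(sample)
print(kib_to_bytes(status["vmrss"]))  # 5242880
```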

Metric types

Prometheus offers four core metric types.

  • Counter: A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.

  • Gauge: A gauge metric represents a single numerical value that can go up and down, typically used for measured values like current memory usage.

  • Histogram: A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets.

  • Summary: Like a histogram, a summary samples observations; it also calculates configurable quantiles over a sliding time window.

See Prometheus metric types for detailed explanations about these metric types.
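Of the four types, the histogram is the least obvious, because its buckets are exposed cumulatively: each bucket counts all observations less than or equal to its upper bound. The Python sketch below illustrates that bookkeeping. It is a simplified model written for this document, not the Prometheus client library, and the class name and bucket bounds are assumptions.

```python
import bisect


class Histogram:
    """Simplified model of a Prometheus histogram (illustration only)."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One slot per bound, plus a final slot for the implicit +Inf bucket.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.count = 0
        self.sum = 0.0

    def observe(self, value: float) -> None:
        # First bucket whose upper bound is >= value ("le" semantics).
        idx = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[idx] += 1
        self.count += 1
        self.sum += value

    def cumulative(self):
        """Return (upper bound, cumulative count) pairs, ending with +Inf."""
        total = 0
        out = []
        for bound, n in zip(self.upper_bounds + [float("inf")], self.bucket_counts):
            total += n
            out.append((bound, total))
        return out


# Example: RPC durations in milliseconds with made-up bucket bounds.
h = Histogram([1, 5, 10])
for duration_ms in [0.5, 3, 5, 20]:
    h.observe(duration_ms)
print(h.cumulative())  # [(1, 1), (5, 3), (10, 3), (inf, 4)]
```

The average observation can then be derived as `h.sum / h.count`, which mirrors how the `_sum` and `_count` series of a histogram are typically used.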

Kata agent metrics

The agent's metrics describe the agent process.

Metric name Type Units Labels Introduced in Kata version
kata_agent_io_stat:
Agent process IO stat.
GAUGE
  • item (see /proc/<pid>/io)
    • cancelled_write_byte
    • rchar
    • read_bytes
    • syscr
    • syscw
    • wchar
    • write_bytes
  • sandbox_id
2.0.0
kata_agent_proc_stat:
Agent process stat.
GAUGE
  • item (see /proc/<pid>/stat)
    • cstime
    • cutime
    • stime
    • utime
  • sandbox_id
2.0.0
kata_agent_proc_status:
Agent process status.
GAUGE
  • item (see /proc/<pid>/status)
    • hugetlbpages
    • nonvoluntary_ctxt_switches
    • rssanon
    • rssfile
    • rssshmem
    • vmdata
    • vmexe
    • vmhwm
    • vmlck
    • vmlib
    • vmpeak
    • vmpin
    • vmpte
    • vmrss
    • vmsize
    • vmstk
    • vmswap
    • voluntary_ctxt_switches
  • sandbox_id
2.0.0
kata_agent_process_cpu_seconds_total:
Total user and system CPU time spent in seconds.
COUNTER seconds
  • sandbox_id
2.0.0
kata_agent_process_max_fds:
Maximum number of open file descriptors.
GAUGE
  • sandbox_id
2.0.0
kata_agent_process_open_fds:
Number of open file descriptors.
GAUGE
  • sandbox_id
2.0.0
kata_agent_process_resident_memory_bytes:
Resident memory size in bytes.
GAUGE bytes
  • sandbox_id
2.0.0
kata_agent_process_start_time_seconds:
Start time of the process since unix epoch in seconds.
GAUGE seconds
  • sandbox_id
2.0.0
kata_agent_process_virtual_memory_bytes:
Virtual memory size in bytes.
GAUGE bytes
  • sandbox_id
2.0.0
kata_agent_scrape_count:
Metrics scrape count
COUNTER
  • sandbox_id
2.0.0
kata_agent_total_rss:
Agent process total rss size
GAUGE
  • sandbox_id
2.0.0
kata_agent_total_time:
Agent process total time
GAUGE
  • sandbox_id
2.0.0
kata_agent_total_vm:
Agent process total vm size
GAUGE
  • sandbox_id
2.0.0

Firecracker metrics

Metrics for the Firecracker VMM.

Metric name Type Units Labels Introduced in Kata version
kata_firecracker_api_server:
Metrics related to the internal API server.
GAUGE
  • item
    • process_startup_time_cpu_us
    • process_startup_time_us
    • sync_response_fails
    • sync_vmm_send_timeout_count
  • sandbox_id
2.0.0
kata_firecracker_block:
Block Device associated metrics.
GAUGE
  • item
    • activate_fails
    • cfg_fails
    • event_fails
    • execute_fails
    • flush_count
    • invalid_reqs_count
    • no_avail_buffer
    • queue_event_count
    • rate_limiter_event_count
    • rate_limiter_throttled_events
    • read_bytes
    • read_count
    • update_count
    • update_fails
    • write_bytes
    • write_count
  • sandbox_id
2.0.0
kata_firecracker_get_api_requests:
Metrics specific to GET API Requests for counting user triggered actions and/or failures.
GAUGE
  • item
    • instance_info_count
    • instance_info_fails
    • machine_cfg_count
    • machine_cfg_fails
  • sandbox_id
2.0.0
kata_firecracker_i8042:
Metrics specific to the i8042 device.
GAUGE
  • item
    • error_count
    • missed_read_count
    • missed_write_count
    • read_count
    • reset_count
    • write_count
  • sandbox_id
2.0.0
kata_firecracker_latencies_us:
Performance metrics related for the moment only to snapshots.
GAUGE
  • item
    • diff_create_snapshot
    • full_create_snapshot
    • load_snapshot
    • pause_vm
    • resume_vm
    • vmm_diff_create_snapshot
    • vmm_full_create_snapshot
    • vmm_load_snapshot
    • vmm_pause_vm
    • vmm_resume_vm
  • sandbox_id
2.0.0
kata_firecracker_logger:
Metrics for the logging subsystem.
GAUGE
  • item
    • log_fails
    • metrics_fails
    • missed_log_count
    • missed_metrics_count
  • sandbox_id
2.0.0
kata_firecracker_mmds:
Metrics for the MMDS functionality.
GAUGE
  • item
    • connections_created
    • connections_destroyed
    • rx_accepted
    • rx_accepted_err
    • rx_accepted_unusual
    • rx_bad_eth
    • rx_count
    • tx_bytes
    • tx_count
    • tx_errors
    • tx_frames
  • sandbox_id
2.0.0
kata_firecracker_net:
Network-related metrics.
GAUGE
  • item
    • activate_fails
    • cfg_fails
    • event_fails
    • mac_address_updates
    • no_rx_avail_buffer
    • no_tx_avail_buffer
    • rx_bytes_count
    • rx_count
    • rx_event_rate_limiter_count
    • rx_fails
    • rx_packets_count
    • rx_partial_writes
    • rx_queue_event_count
    • rx_rate_limiter_throttled
    • rx_tap_event_count
    • tap_read_fails
    • tap_write_fails
    • tx_bytes_count
    • tx_count
    • tx_fails
    • tx_malformed_frames
    • tx_packets_count
    • tx_partial_reads
    • tx_queue_event_count
    • tx_rate_limiter_event_count
    • tx_rate_limiter_throttled
    • tx_spoofed_mac_count
  • sandbox_id
2.0.0
kata_firecracker_patch_api_requests:
Metrics specific to PATCH API Requests for counting user triggered actions and/or failures.
GAUGE
  • item
    • drive_count
    • drive_fails
    • machine_cfg_count
    • machine_cfg_fails
    • network_count
    • network_fails
  • sandbox_id
2.0.0
kata_firecracker_put_api_requests:
Metrics specific to PUT API Requests for counting user triggered actions and/or failures.
GAUGE
  • item
    • actions_count
    • actions_fails
    • boot_source_count
    • boot_source_fails
    • drive_count
    • drive_fails
    • logger_count
    • logger_fails
    • machine_cfg_count
    • machine_cfg_fails
    • metrics_count
    • metrics_fails
    • network_count
    • network_fails
  • sandbox_id
2.0.0
kata_firecracker_rtc:
Metrics specific to the RTC device.
GAUGE
  • item
    • error_count
    • missed_read_count
    • missed_write_count
  • sandbox_id
2.0.0
kata_firecracker_seccomp:
Metrics for the seccomp filtering.
GAUGE
  • item
    • num_faults
  • sandbox_id
2.0.0
kata_firecracker_signals:
Metrics related to signals.
GAUGE
  • item
    • sigbus
    • sigsegv
  • sandbox_id
2.0.0
kata_firecracker_uart:
Metrics specific to the UART device.
GAUGE
  • item
    • error_count
    • flush_count
    • missed_read_count
    • missed_write_count
    • read_count
    • write_count
  • sandbox_id
2.0.0
kata_firecracker_vcpu:
Metrics specific to VCPUs' mode of functioning.
GAUGE
  • item
    • exit_io_in
    • exit_io_out
    • exit_mmio_read
    • exit_mmio_write
    • failures
    • filter_cpuid
  • sandbox_id
2.0.0
kata_firecracker_vmm:
Metrics specific to the machine manager as a whole.
GAUGE
  • item
    • device_events
    • panic_count
  • sandbox_id
2.0.0
kata_firecracker_vsock:
Vsock-related metrics.
GAUGE
  • item
    • activate_fails
    • cfg_fails
    • conn_event_fails
    • conns_added
    • conns_killed
    • conns_removed
    • ev_queue_event_fails
    • killq_resync
    • muxer_event_fails
    • rx_bytes_count
    • rx_packets_count
    • rx_queue_event_count
    • rx_queue_event_fails
    • rx_read_fails
    • tx_bytes_count
    • tx_flush_fails
    • tx_packets_count
    • tx_queue_event_count
    • tx_queue_event_fails
    • tx_write_fails
  • sandbox_id
2.0.0

Kata guest OS metrics

Metrics of the guest OS running in the hypervisor.

Metric name Type Units Labels Introduced in Kata version
kata_guest_cpu_time:
Guest CPU stat.
GAUGE
  • cpu (CPU no. and total for all CPUs)
    • 0 (CPU 0)
    • 1 (CPU 1)
    • total (for all CPUs)
  • item (Kernel/system statistics, from /proc/stat)
    • guest
    • guest_nice
    • idle
    • iowait
    • irq
    • nice
    • softirq
    • steal
    • system
    • user
  • sandbox_id
2.0.0
kata_guest_diskstat:
Disks stat in system.
GAUGE
  • disk (disk name)
  • item (see /proc/diskstats)
    • discards
    • discards_merged
    • flushes
    • in_progress
    • merged
    • reads
    • sectors_discarded
    • sectors_read
    • sectors_written
    • time_discarding
    • time_flushing
    • time_in_progress
    • time_reading
    • time_writing
    • weighted_time_in_progress
    • writes
    • writes_merged
  • sandbox_id
2.0.0
kata_guest_load:
Guest system load.
GAUGE
  • item
    • load1
    • load15
    • load5
  • sandbox_id
2.0.0
kata_guest_meminfo:
Statistics about memory usage on the system.
GAUGE
  • item (see /proc/meminfo)
    • active
    • active_anon
    • active_file
    • anon_hugepages
    • anon_pages
    • bounce
    • buffers
    • cached
    • cma_free
    • cma_total
    • commit_limit
    • committed_as
    • direct_map_1G
    • direct_map_2M
    • direct_map_4M
    • direct_map_4k
    • dirty
    • hardware_corrupted
    • high_free
    • high_total
    • hugepages_free
    • hugepages_rsvd
    • hugepages_surp
    • hugepages_total
    • hugepagesize
    • hugetlb
    • inactive
    • inactive_anon
    • inactive_file
    • k_reclaimable
    • kernel_stack
    • low_free
    • low_total
    • mapped
    • mem_available
    • mem_free
    • mem_total
    • mlocked
    • mmap_copy
    • nfs_unstable
    • page_tables
    • per_cpu
    • quicklists
    • s_reclaimable
    • s_unreclaim
    • shmem
    • shmem_hugepages
    • shmem_pmd_mapped
    • slab
    • swap_cached
    • swap_free
    • swap_total
    • unevictable
    • vmalloc_chunk
    • vmalloc_total
    • vmalloc_used
    • writeback
    • writeback_tmp
  • sandbox_id
2.0.0
kata_guest_netdev_stat:
Guest net devices stats.
GAUGE
  • interface (network device name)
  • item (see /proc/net/dev)
    • recv_bytes
    • recv_compressed
    • recv_drop
    • recv_errs
    • recv_fifo
    • recv_frame
    • recv_multicast
    • recv_packets
    • sent_bytes
    • sent_carrier
    • sent_colls
    • sent_compressed
    • sent_drop
    • sent_errs
    • sent_fifo
    • sent_packets
  • sandbox_id
2.0.0
kata_guest_tasks:
Guest system load.
GAUGE
  • item
    • cur
    • max
  • sandbox_id
2.0.0
kata_guest_vm_stat:
Guest virtual memory stat.
GAUGE
  • item (see /proc/vmstat)
    • allocstall_dma
    • allocstall_dma32
    • allocstall_movable
    • allocstall_normal
    • balloon_deflate
    • balloon_inflate
    • compact_daemon_free_scanned
    • compact_daemon_migrate_scanned
    • compact_daemon_wake
    • compact_fail
    • compact_free_scanned
    • compact_isolated
    • compact_migrate_scanned
    • compact_stall
    • compact_success
    • drop_pagecache
    • drop_slab
    • htlb_buddy_alloc_fail
    • htlb_buddy_alloc_success
    • kswapd_high_wmark_hit_quickly
    • kswapd_inodesteal
    • kswapd_low_wmark_hit_quickly
    • nr_active_anon
    • nr_active_file
    • nr_anon_pages
    • nr_anon_transparent_hugepages
    • nr_bounce
    • nr_dirtied
    • nr_dirty
    • nr_dirty_background_threshold
    • nr_dirty_threshold
    • nr_file_pages
    • nr_free_cma
    • nr_free_pages
    • nr_inactive_anon
    • nr_inactive_file
    • nr_isolated_anon
    • nr_isolated_file
    • nr_kernel_stack
    • nr_mapped
    • nr_mlock
    • nr_page_table_pages
    • nr_shmem
    • nr_shmem_hugepages
    • nr_shmem_pmdmapped
    • nr_slab_reclaimable
    • nr_slab_unreclaimable
    • nr_unevictable
    • nr_unstable
    • nr_vmscan_immediate_reclaim
    • nr_vmscan_write
    • nr_writeback
    • nr_writeback_temp
    • nr_written
    • nr_zone_active_anon
    • nr_zone_active_file
    • nr_zone_inactive_anon
    • nr_zone_inactive_file
    • nr_zone_unevictable
    • nr_zone_write_pending
    • oom_kill
    • pageoutrun
    • pgactivate
    • pgalloc_dma
    • pgalloc_dma32
    • pgalloc_movable
    • pgalloc_normal
    • pgdeactivate
    • pgfault
    • pgfree
    • pginodesteal
    • pglazyfree
    • pglazyfreed
    • pgmajfault
    • pgmigrate_fail
    • pgmigrate_success
    • pgpgin
    • pgpgout
    • pgrefill
    • pgrotated
    • pgscan_direct
    • pgscan_direct_throttle
    • pgscan_kswapd
    • pgskip_dma
    • pgskip_dma32
    • pgskip_movable
    • pgskip_normal
    • pgsteal_direct
    • pgsteal_kswapd
    • pswpin
    • pswpout
    • slabs_scanned
    • swap_ra
    • swap_ra_hit
    • unevictable_pgs_cleared
    • unevictable_pgs_culled
    • unevictable_pgs_mlocked
    • unevictable_pgs_munlocked
    • unevictable_pgs_rescued
    • unevictable_pgs_scanned
    • unevictable_pgs_stranded
    • workingset_activate
    • workingset_nodereclaim
    • workingset_refault
  • sandbox_id
2.0.0
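Several of the guest OS metrics above map directly onto fields of /proc files. As an illustration, the Python sketch below parses a line in the format of /proc/loadavg into the item names used by kata_guest_load and kata_guest_tasks. That these two metrics are sourced from /proc/loadavg is an assumption made for this example, and the function name and sample line are hypothetical.

```python
def parse_loadavg(text: str) -> dict:
    """Split a /proc/loadavg line into load averages and task counts.

    Format (see proc(5)): load1 load5 load15 runnable/total last_pid
    """
    load1, load5, load15, tasks, _last_pid = text.split()
    cur, _, max_ = tasks.partition("/")
    return {
        "load": {"load1": float(load1), "load5": float(load5), "load15": float(load15)},
        "tasks": {"cur": int(cur), "max": int(max_)},
    }


# Made-up sample line.
stats = parse_loadavg("0.20 0.18 0.12 1/80 11206\n")
print(stats["load"]["load1"], stats["tasks"]["max"])  # 0.2 80
```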

Hypervisor metrics

Hypervisor metrics, collected mainly from the proc filesystem of the hypervisor process.

Metric name Type Units Labels Introduced in Kata version
kata_hypervisor_fds:
Open FDs for hypervisor.
GAUGE
  • sandbox_id
2.0.0
kata_hypervisor_io_stat:
Process IO statistics.
GAUGE
  • item (see /proc/<pid>/io)
    • cancelledwritebytes
    • rchar
    • readbytes
    • syscr
    • syscw
    • wchar
    • writebytes
  • sandbox_id
2.0.0
kata_hypervisor_netdev:
Net devices statistics.
GAUGE
  • interface (network device name)
  • item (see /proc/net/dev)
    • recv_bytes
    • recv_compressed
    • recv_drop
    • recv_errs
    • recv_fifo
    • recv_frame
    • recv_multicast
    • recv_packets
    • sent_bytes
    • sent_carrier
    • sent_colls
    • sent_compressed
    • sent_drop
    • sent_errs
    • sent_fifo
    • sent_packets
  • sandbox_id
2.0.0
kata_hypervisor_proc_stat:
Hypervisor process statistics.
GAUGE
  • item (see /proc/<pid>/stat)
    • cstime
    • cutime
    • stime
    • utime
  • sandbox_id
2.0.0
kata_hypervisor_proc_status:
Hypervisor process status.
GAUGE
  • item (see /proc/<pid>/status)
    • hugetlbpages
    • nonvoluntary_ctxt_switches
    • rssanon
    • rssfile
    • rssshmem
    • vmdata
    • vmexe
    • vmhwm
    • vmlck
    • vmlib
    • vmpeak
    • vmpin
    • vmpmd
    • vmpte
    • vmrss
    • vmsize
    • vmstk
    • vmswap
    • voluntary_ctxt_switches
  • sandbox_id
2.0.0
kata_hypervisor_threads:
Hypervisor process threads.
GAUGE
  • sandbox_id
2.0.0

Kata monitor metrics

Metrics about kata-monitor itself.

Metric name Type Units Labels Introduced in Kata version
kata_monitor_go_gc_duration_seconds:
A summary of the pause duration of garbage collection cycles.
SUMMARY seconds 2.0.0
kata_monitor_go_goroutines:
Number of goroutines that currently exist.
GAUGE 2.0.0
kata_monitor_go_info:
Information about the Go environment.
GAUGE
  • version (golang version)
    • go1.13.9 (environment dependent variable)
2.0.0
kata_monitor_go_memstats_alloc_bytes:
Number of bytes allocated and still in use.
GAUGE bytes 2.0.0
kata_monitor_go_memstats_alloc_bytes_total:
Total number of bytes allocated, even if freed.
COUNTER bytes 2.0.0
kata_monitor_go_memstats_buck_hash_sys_bytes:
Number of bytes used by the profiling bucket hash table.
GAUGE bytes 2.0.0
kata_monitor_go_memstats_frees_total:
Total number of frees.
COUNTER 2.0.0
kata_monitor_go_memstats_gc_cpu_fraction:
The fraction of this program's available CPU time used by the GC since the program started.
GAUGE 2.0.0
kata_monitor_go_memstats_gc_sys_bytes:
Number of bytes used for garbage collection system metadata.
GAUGE bytes 2.0.0
kata_monitor_go_memstats_heap_alloc_bytes:
Number of heap bytes allocated and still in use.
GAUGE bytes 2.0.0
kata_monitor_go_memstats_heap_idle_bytes:
Number of heap bytes waiting to be used.
GAUGE bytes 2.0.0
kata_monitor_go_memstats_heap_inuse_bytes:
Number of heap bytes that are in use.
GAUGE bytes 2.0.0
kata_monitor_go_memstats_heap_objects:
Number of allocated objects.
GAUGE 2.0.0
kata_monitor_go_memstats_heap_released_bytes:
Number of heap bytes released to OS.
GAUGE bytes 2.0.0
kata_monitor_go_memstats_heap_sys_bytes:
Number of heap bytes obtained from system.
GAUGE bytes 2.0.0
kata_monitor_go_memstats_last_gc_time_seconds:
Number of seconds since 1970 of last garbage collection.
GAUGE seconds 2.0.0
kata_monitor_go_memstats_lookups_total:
Total number of pointer lookups.
COUNTER 2.0.0
kata_monitor_go_memstats_mallocs_total:
Total number of mallocs.
COUNTER 2.0.0
kata_monitor_go_memstats_mcache_inuse_bytes:
Number of bytes in use by mcache structures.
GAUGE bytes 2.0.0
kata_monitor_go_memstats_mcache_sys_bytes:
Number of bytes used for mcache structures obtained from system.
GAUGE bytes 2.0.0
kata_monitor_go_memstats_mspan_inuse_bytes:
Number of bytes in use by mspan structures.
GAUGE bytes 2.0.0
kata_monitor_go_memstats_mspan_sys_bytes:
Number of bytes used for mspan structures obtained from system.
GAUGE bytes 2.0.0
kata_monitor_go_memstats_next_gc_bytes:
Number of heap bytes when next garbage collection will take place.
GAUGE bytes 2.0.0
kata_monitor_go_memstats_other_sys_bytes:
Number of bytes used for other system allocations.
GAUGE bytes 2.0.0
kata_monitor_go_memstats_stack_inuse_bytes:
Number of bytes in use by the stack allocator.
GAUGE bytes 2.0.0
kata_monitor_go_memstats_stack_sys_bytes:
Number of bytes obtained from system for stack allocator.
GAUGE bytes 2.0.0
kata_monitor_go_memstats_sys_bytes:
Number of bytes obtained from system.
GAUGE bytes 2.0.0
kata_monitor_go_threads:
Number of OS threads created.
GAUGE 2.0.0
kata_monitor_process_cpu_seconds_total:
Total user and system CPU time spent in seconds.
COUNTER seconds 2.0.0
kata_monitor_process_max_fds:
Maximum number of open file descriptors.
GAUGE 2.0.0
kata_monitor_process_open_fds:
Number of open file descriptors.
GAUGE 2.0.0
kata_monitor_process_resident_memory_bytes:
Resident memory size in bytes.
GAUGE bytes 2.0.0
kata_monitor_process_start_time_seconds:
Start time of the process since unix epoch in seconds.
GAUGE seconds 2.0.0
kata_monitor_process_virtual_memory_bytes:
Virtual memory size in bytes.
GAUGE bytes 2.0.0
kata_monitor_process_virtual_memory_max_bytes:
Maximum amount of virtual memory available in bytes.
GAUGE bytes 2.0.0
kata_monitor_running_shim_count:
Running shim count(running sandboxes).
GAUGE 2.0.0
kata_monitor_scrape_count:
Scrape count.
COUNTER 2.0.0
kata_monitor_scrape_durations_histogram_milliseconds:
Time used to scrape from shims
HISTOGRAM milliseconds 2.0.0
kata_monitor_scrape_failed_count:
Failed scrape count.
COUNTER 2.0.0

Kata containerd shim v2 metrics

Metrics about the Kata containerd shim v2 process.

Metric name Type Units Labels Introduced in Kata version
kata_shim_agent_rpc_durations_histogram_milliseconds:
RPC latency distributions.
HISTOGRAM milliseconds
  • action (RPC actions of Kata agent)
    • grpc.CheckRequest
    • grpc.CloseStdinRequest
    • grpc.CopyFileRequest
    • grpc.CreateContainerRequest
    • grpc.CreateSandboxRequest
    • grpc.DestroySandboxRequest
    • grpc.ExecProcessRequest
    • grpc.GetMetricsRequest
    • grpc.GuestDetailsRequest
    • grpc.ListInterfacesRequest
    • grpc.ListProcessesRequest
    • grpc.ListRoutesRequest
    • grpc.MemHotplugByProbeRequest
    • grpc.OnlineCPUMemRequest
    • grpc.PauseContainerRequest
    • grpc.RemoveContainerRequest
    • grpc.ReseedRandomDevRequest
    • grpc.ResumeContainerRequest
    • grpc.SetGuestDateTimeRequest
    • grpc.SignalProcessRequest
    • grpc.StartContainerRequest
    • grpc.StartTracingRequest
    • grpc.StatsContainerRequest
    • grpc.StopTracingRequest
    • grpc.TtyWinResizeRequest
    • grpc.UpdateContainerRequest
    • grpc.UpdateInterfaceRequest
    • grpc.UpdateRoutesRequest
    • grpc.WaitProcessRequest
    • grpc.WriteStreamRequest
  • sandbox_id
2.0.0
kata_shim_fds:
Kata containerd shim v2 open FDs.
GAUGE
  • sandbox_id
2.0.0
kata_shim_go_gc_duration_seconds:
A summary of the pause duration of garbage collection cycles.
SUMMARY seconds
  • sandbox_id
2.0.0
kata_shim_go_goroutines:
Number of goroutines that currently exist.
GAUGE
  • sandbox_id
2.0.0
kata_shim_go_info:
Information about the Go environment.
GAUGE
  • sandbox_id
  • version (golang version)
    • go1.13.9 (environment dependent variable)
2.0.0
kata_shim_go_memstats_alloc_bytes:
Number of bytes allocated and still in use.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_go_memstats_alloc_bytes_total:
Total number of bytes allocated, even if freed.
COUNTER bytes
  • sandbox_id
2.0.0
kata_shim_go_memstats_buck_hash_sys_bytes:
Number of bytes used by the profiling bucket hash table.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_go_memstats_frees_total:
Total number of frees.
COUNTER
  • sandbox_id
2.0.0
kata_shim_go_memstats_gc_cpu_fraction:
The fraction of this program's available CPU time used by the GC since the program started.
GAUGE
  • sandbox_id
2.0.0
kata_shim_go_memstats_gc_sys_bytes:
Number of bytes used for garbage collection system metadata.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_go_memstats_heap_alloc_bytes:
Number of heap bytes allocated and still in use.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_go_memstats_heap_idle_bytes:
Number of heap bytes waiting to be used.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_go_memstats_heap_inuse_bytes:
Number of heap bytes that are in use.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_go_memstats_heap_objects:
Number of allocated objects.
GAUGE
  • sandbox_id
2.0.0
kata_shim_go_memstats_heap_released_bytes:
Number of heap bytes released to OS.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_go_memstats_heap_sys_bytes:
Number of heap bytes obtained from system.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_go_memstats_last_gc_time_seconds:
Number of seconds since 1970 of last garbage collection.
GAUGE seconds
  • sandbox_id
2.0.0
kata_shim_go_memstats_lookups_total:
Total number of pointer lookups.
COUNTER
  • sandbox_id
2.0.0
kata_shim_go_memstats_mallocs_total:
Total number of mallocs.
COUNTER
  • sandbox_id
2.0.0
kata_shim_go_memstats_mcache_inuse_bytes:
Number of bytes in use by mcache structures.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_go_memstats_mcache_sys_bytes:
Number of bytes used for mcache structures obtained from system.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_go_memstats_mspan_inuse_bytes:
Number of bytes in use by mspan structures.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_go_memstats_mspan_sys_bytes:
Number of bytes used for mspan structures obtained from system.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_go_memstats_next_gc_bytes:
Number of heap bytes when next garbage collection will take place.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_go_memstats_other_sys_bytes:
Number of bytes used for other system allocations.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_go_memstats_stack_inuse_bytes:
Number of bytes in use by the stack allocator.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_go_memstats_stack_sys_bytes:
Number of bytes obtained from system for stack allocator.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_go_memstats_sys_bytes:
Number of bytes obtained from system.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_go_threads:
Number of OS threads created.
GAUGE
  • sandbox_id
2.0.0
kata_shim_io_stat:
Kata containerd shim v2 process IO statistics.
GAUGE
  • item (see /proc/<pid>/io)
    • cancelledwritebytes
    • rchar
    • readbytes
    • syscr
    • syscw
    • wchar
    • writebytes
  • sandbox_id
2.0.0
kata_shim_netdev:
Kata containerd shim v2 network devices statistics.
GAUGE
  • interface (network device name)
  • item (see /proc/net/dev)
    • recv_bytes
    • recv_compressed
    • recv_drop
    • recv_errs
    • recv_fifo
    • recv_frame
    • recv_multicast
    • recv_packets
    • sent_bytes
    • sent_carrier
    • sent_colls
    • sent_compressed
    • sent_drop
    • sent_errs
    • sent_fifo
    • sent_packets
  • sandbox_id
2.0.0
kata_shim_pod_overhead_cpu:
Kata Pod overhead for CPU resources(percent).
GAUGE percent
  • sandbox_id
2.0.0
kata_shim_pod_overhead_memory_in_bytes:
Kata Pod overhead for memory resources(bytes).
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_proc_stat:
Kata containerd shim v2 process statistics.
GAUGE
  • item (see /proc/<pid>/stat)
    • cstime
    • cutime
    • stime
    • utime
  • sandbox_id
2.0.0
kata_shim_proc_status:
Kata containerd shim v2 process status.
GAUGE
  • item (see /proc/<pid>/status)
    • hugetlbpages
    • nonvoluntary_ctxt_switches
    • rssanon
    • rssfile
    • rssshmem
    • vmdata
    • vmexe
    • vmhwm
    • vmlck
    • vmlib
    • vmpeak
    • vmpin
    • vmpmd
    • vmpte
    • vmrss
    • vmsize
    • vmstk
    • vmswap
    • voluntary_ctxt_switches
  • sandbox_id
2.0.0
kata_shim_process_cpu_seconds_total:
Total user and system CPU time spent in seconds.
COUNTER seconds
  • sandbox_id
2.0.0
kata_shim_process_max_fds:
Maximum number of open file descriptors.
GAUGE
  • sandbox_id
2.0.0
kata_shim_process_open_fds:
Number of open file descriptors.
GAUGE
  • sandbox_id
2.0.0
kata_shim_process_resident_memory_bytes:
Resident memory size in bytes.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_process_start_time_seconds:
Start time of the process since unix epoch in seconds.
GAUGE seconds
  • sandbox_id
2.0.0
kata_shim_process_virtual_memory_bytes:
Virtual memory size in bytes.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_process_virtual_memory_max_bytes:
Maximum amount of virtual memory available in bytes.
GAUGE bytes
  • sandbox_id
2.0.0
kata_shim_rpc_durations_histogram_milliseconds:
RPC latency distributions.
HISTOGRAM milliseconds
  • action (Kata shim v2 actions)
    • checkpoint
    • close_io
    • connect
    • create
    • delete
    • exec
    • kill
    • pause
    • pids
    • resize_pty
    • resume
    • shutdown
    • start
    • state
    • stats
    • update
    • wait
  • sandbox_id
2.0.0
kata_shim_threads:
Kata containerd shim v2 process threads.
GAUGE
  • sandbox_id
2.0.0