To enable the Prometheus metrics exporter for CRI-O, either start crio
with --metrics-enable
or add the corresponding option to a config overwrite,
for example /etc/crio/crio.conf.d/01-metrics.conf
:
[crio.metrics]
enable_metrics = true
The metrics endpoint serves per default on port 9090
via HTTP. This can be
changed via the --metrics-port
command line argument or via the configuration
file:
metrics_port = 9090
If CRI-O runs with enabled metrics, then this can be verified by querying the endpoint manually via curl.
curl localhost:9090/metrics
It is also possible to serve the metrics via HTTPs, by providing an additional certificate and key:
[crio.metrics]
enable_metrics = true
metrics_cert = "/path/to/cert.pem"
metrics_key = "/path/to/key.pem"
Beside the default golang based metrics, CRI-O provides the following additional metrics:
Metric Key | Possible Labels or Buckets | Type | Purpose |
---|---|---|---|
crio_operations_total |
every CRI-O RPC* operation |
Counter | Cumulative number of CRI-O operations by operation type. |
crio_operations_latency_seconds_total |
every CRI-O RPC* operation ,network_setup_pod (CNI pod network setup time),network_setup_overall (Overall network setup time) |
Summary | Latency in seconds of CRI-O operations. Split-up by operation type. |
crio_operations_latency_seconds |
every CRI-O RPC* operation |
Gauge | Latency in seconds of individual CRI calls for CRI-O operations. Broken down by operation type. |
crio_operations_errors_total |
every CRI-O RPC* operation |
Counter | Cumulative number of CRI-O operation errors by operation type. |
crio_image_pulls_bytes_total |
mediatype , size sizes are in bucket of bytes for layer sizes of 1 KiB, 1 MiB, 10 MiB, 50 MiB, 100 MiB, 200 MiB, 300 MiB, 400 MiB, 500 MiB, 1 GiB, 10 GiB |
Counter | Bytes transferred by CRI-O image pulls. |
crio_image_pulls_skipped_bytes_total |
size sizes are in bucket of bytes for layer sizes of 1 KiB, 1 MiB, 10 MiB, 50 MiB, 100 MiB, 200 MiB, 300 MiB, 400 MiB, 500 MiB, 1 GiB, 10 GiB |
Counter | Bytes skipped by CRI-O image pulls by name. The ratio of skipped bytes to total bytes can be used to determine cache reuse ratio. |
crio_image_pulls_success_total |
Counter | Successful image pulls. | |
crio_image_pulls_failure_total |
error |
Counter | Failed image pulls by their error category. |
crio_image_pulls_layer_size_{sum,count,bucket} |
buckets in byte for layer sizes of 1 KiB, 1 MiB, 10 MiB, 50 MiB, 100 MiB, 200 MiB, 300 MiB, 400 MiB, 500 MiB, 1 GiB, 10 GiB | Histogram | Bytes transferred by CRI-O image pulls per layer. |
crio_image_layer_reuse_total |
Counter | Reused (not pulled) local image layer count by name. | |
crio_containers_dropped_events_total |
Counter | The total number of container events dropped. | |
crio_containers_oom_total |
Counter | Total number of containers killed because they ran out of memory (OOM). | |
crio_containers_oom_count_total |
name |
Counter | Containers killed because they ran out of memory (OOM) by their name. The label name can have high cardinality sometimes but it is in the interest of users giving them the ease to identify which container(s) are going into OOM state. Also, ideally very few containers should OOM keeping the label cardinality of name reasonably low. |
crio_containers_seccomp_notifier_count_total |
name , syscall |
Counter | Forbidden syscall count resulting in killed containers by name . |
crio_processes_defunct |
Gauge | Total number of defunct processes in the node | |
crio_operations |
every CRI-O RPC* | Counter | (DEPRECATED: in favour of crio_operations_total ) Cumulative number of CRI-O operations by operation type. |
crio_operations_latency_microseconds_total |
every CRI-O RPC*,network_setup_pod (CNI pod network setup time),network_setup_overall (Overall network setup time) |
Summary | (DEPRECATED: in favour of crio_operations_latency_seconds_total ) Latency in microseconds of CRI-O operations. Split-up by operation type. |
crio_operations_latency_microseconds |
every CRI-O RPC* | Gauge | (DEPRECATED: in favour of crio_operations_latency_seconds ) Latency in microseconds of individual CRI calls for CRI-O operations. Broken down by operation type. |
crio_operations_errors |
every CRI-O RPC* | Counter | (DEPRECATED: in favour of crio_operations_errors_total ) Cumulative number of CRI-O operation errors by operation type. |
crio_image_pulls_by_digest |
name , digest , mediatype , size |
Counter | (DEPRECATED: in favour of crio_image_pulls_bytes_total ) Bytes transferred by CRI-O image pulls by digest. |
crio_image_pulls_by_name |
name , size |
Counter | (DEPRECATED: in favour of crio_image_pulls_bytes_total ) Bytes transferred by CRI-O image pulls by name. |
crio_image_pulls_by_name_skipped |
name |
Counter | (DEPRECATED: in favour of crio_image_pulls_skipped_bytes_total ) Bytes skipped by CRI-O image pulls by name. |
crio_image_pulls_successes |
name |
Counter | (DEPRECATED: in favour of crio_image_pulls_success_total ) Successful image pulls by image name |
crio_image_pulls_failures |
name , error |
Counter | (DEPRECATED: in favour of crio_image_pulls_failure_total ) Failed image pulls by image name and their error category. |
crio_image_layer_reuse |
name |
Counter | (DEPRECATED: in favour of crio_image_layer_reuse_total ) Reused (not pulled) local image layer count by name. |
crio_containers_oom |
name |
Counter | (DEPRECATED: in favour of crio_containers_oom_count_total ) Containers killed because they ran out of memory (OOM) by their name |
-
Available CRI-O RPC's from the gRPC API:
Attach
,ContainerStats
,ContainerStatus
,CreateContainer
,Exec
,ExecSync
,ImageFsInfo
,ImageStatus
,ListContainerStats
,ListContainers
,ListImages
,ListPodSandbox
,PodSandboxStatus
,PortForward
,PullImage
,RemoveContainer
,RemoveImage
,RemovePodSandbox
,ReopenContainerLog
,RunPodSandbox
,StartContainer
,Status
,StopContainer
,StopPodSandbox
,UpdateContainerResources
,UpdateRuntimeConfig
,Version
-
Available error categories for
crio_image_pulls_failures
:UNKNOWN
: The default label which gets applied if the error is not knownCONNECTION_REFUSED
: The local network is down or the registry refused the connection.CONNECTION_TIMEOUT
: The connection timed out during the image download.NOT_FOUND
: The registry does not exist at the specified resourceBLOB_UNKNOWN
: This error may be returned when a blob is unknown to the registry in a specified repository. This can be returned with a standard get or if a manifest references an unknown layer during upload.BLOB_UPLOAD_INVALID
: The blob upload encountered an error and can no longer proceed.BLOB_UPLOAD_UNKNOWN
: If a blob upload has been cancelled or was never started, this error code may be returned.DENIED
: The access controller denied access for the operation on a resource.DIGEST_INVALID
: When a blob is uploaded, the registry will check that the content matches the digest provided by the client. The error may include a detail structure with the key "digest", including the invalid digest string. This error may also be returned when a manifest includes an invalid layer digest.MANIFEST_BLOB_UNKNOWN
: This error may be returned when a manifest blob is unknown to the registry.MANIFEST_INVALID
: During upload, manifests undergo several checks ensuring validity. If those checks fail, this error may be returned, unless a more specific error is included. The detail will contain information the failed validation.MANIFEST_UNKNOWN
: This error is returned when the manifest, identified by name and tag is unknown to the repository.MANIFEST_UNVERIFIED
: During manifest upload, if the manifest fails signature verification, this error will be returned.NAME_INVALID
: Invalid repository name encountered either during manifest. validation or any API operation.NAME_UNKNOWN
: This is returned if the name used during an operation is unknown to the registry.SIZE_INVALID
: When a layer is uploaded, the provided size will be checked against the uploaded content. If they do not match, this error will be returned.TAG_INVALID
: During a manifest upload, if the tag in the manifest does not match the uri tag, this error will be returned.TOOMANYREQUESTS
: Returned when a client attempts to contact a service too many times.UNAUTHORIZED
: The access controller was unable to authenticate the client. Often this will be accompanied by a Www-Authenticate HTTP response header indicating how to authenticate.UNAVAILABLE
: Returned when a service is not available.UNSUPPORTED
: The operation was unsupported due to a missing implementation or invalid set of parameters.
The CRI-O metrics exporter can be used to provide a cluster wide scraping
endpoint for Prometheus. It is possible to either build the container image
manually via make metrics-exporter
or directly consume the available image on
quay.io.
The deployment requires enabled RBAC within the target Kubernetes
environment and creates a new ClusterRole to be able to list available
nodes. Beside that a new Role will be created to be able to update a config-map
within the cri-o-exporter
namespace. Please be aware that the exporter only
works if the pod has access to the node IP from its namespace. This should
generally work but might be restricted due to network configuration or policies.
To deploy the metrics exporter within a new cri-o-metrics-exporter
namespace,
simply apply the cluster.yaml from the root directory of this repository:
kubectl create -f contrib/metrics-exporter/cluster.yaml
The CRIO_METRICS_PORT
environment variable is set per default to "9090"
and
can be used to customize the metrics port for the nodes. If the deployment is
up and running, it should log the registered nodes as well as that a new
config-map has been created:
$ kubectl logs -f cri-o-metrics-exporter-65c9b7b867-7qmsb
level=info msg="Getting cluster configuration"
level=info msg="Creating Kubernetes client"
level=info msg="Retrieving nodes"
level=info msg="Registering handler /master (for 172.1.2.0)"
level=info msg="Registering handler /node-0 (for 172.1.3.0)"
level=info msg="Registering handler /node-1 (for 172.1.3.1)"
level=info msg="Registering handler /node-2 (for 172.1.3.2)"
level=info msg="Registering handler /node-3 (for 172.1.3.3)"
level=info msg="Registering handler /node-4 (for 172.1.3.4)"
level=info msg="Updated scrape configs in configMap cri-o-metrics-exporter"
level=info msg="Wrote scrape configs to configMap cri-o-metrics-exporter"
level=info msg="Serving HTTP on :8080"
The config-map now contains the scrape configuration, which can be used for Prometheus:
kubectl get cm cri-o-metrics-exporter -o yaml
apiVersion: v1
data:
config: |
scrape_configs:
- job_name: "cri-o-exporter-master"
scrape_interval: 1s
metrics_path: /master
static_configs:
- targets: ["cri-o-metrics-exporter.cri-o-metrics-exporter"]
labels:
instance: "master"
- job_name: "cri-o-exporter-node-0"
scrape_interval: 1s
metrics_path: /node-0
static_configs:
- targets: ["cri-o-metrics-exporter.cri-o-metrics-exporter"]
labels:
instance: "node-0"
- job_name: "cri-o-exporter-node-1"
scrape_interval: 1s
metrics_path: /node-1
static_configs:
- targets: ["cri-o-metrics-exporter.cri-o-metrics-exporter"]
labels:
instance: "node-1"
- job_name: "cri-o-exporter-node-2"
scrape_interval: 1s
metrics_path: /node-2
static_configs:
- targets: ["cri-o-metrics-exporter.cri-o-metrics-exporter"]
labels:
instance: "node-2"
- job_name: "cri-o-exporter-node-3"
scrape_interval: 1s
metrics_path: /node-3
static_configs:
- targets: ["cri-o-metrics-exporter.cri-o-metrics-exporter"]
labels:
instance: "node-3"
- job_name: "cri-o-exporter-node-4"
scrape_interval: 1s
metrics_path: /node-4
static_configs:
- targets: ["cri-o-metrics-exporter.cri-o-metrics-exporter"]
labels:
instance: "node-4"
kind: ConfigMap
metadata:
creationTimestamp: "2020-05-12T08:29:06Z"
name: cri-o-metrics-exporter
namespace: cri-o-metrics-exporter
resourceVersion: "2862950"
selfLink: /api/v1/namespaces/cri-o-metrics-exporter/configmaps/cri-o-metrics-exporter
uid: 1409804a-78a2-4961-8205-c5f383626b4b
If the scrape configuration has been added to the Prometheus server, then the provided Grafana dashboard within this repository can be setup, too: