RFC | Title | Author | Status | Type |
---|---|---|---|---|
51 |
Chef Telemetry |
Ranjib Dey <[email protected]> |
On Hold |
Standards Track |
As a chef user and developer I want an easy way to get chef run performance data so that I can identify areas that can be optimized
As we automate more things with chef and control more system resources, chef run time increases, and the chef client process consumes more memory and cpu. As a configuration management system, it is important that a chef run does not impact the underlying host performance. Over the last few years Chef community has shared a number of optimization techniques. Aim of this RFC is to facilitate some of those common patterns by providing a robust telemetry system in core chef. If configured, the telemetry system will give users fine grained metrics about chef run that can be used to find out performance hiccups. This will also help users quantify how a chef run impacts the underlying system.
Telemetry module will provide following metrics:
-
Time spent on key chef run milestones: How long does the chef run take for each stage? Telemetry system will declare a subset of the current events as major milestones, these are: run_start, ohai_completed, node_load_completed, cookbook_resolution_complete, converge_start, converge_complete, run_completed/failed. Metrics will be captured during these milestones via event handlers. This can be used to estimate time spent on cookbook sync, compilation convergence phase etc.
-
Time spent on individual resource & recipe convergence. This information is already available using the API, telemetry system will just consolidate them, along side the other metrics.
-
GC related data. Ruby's internal memory usage obtained from GC.stat, captured during the chef run milestones.
Telemetry system will also let users define their custom metrics that can be computed during these milestones (like total chef client process memory consumption from /proc/PID), This will be offered via Chef class DSL. This will also be the recommended way to implement platform specific metrics. Example:
require 'sys/proctable'
Chef.telemetry do |telemetry|
telemetry.add_metric 'vsize' do
Sys::ProcTable.ps(Process.pid).vsize
end
end
In addition to capturing these metrics, telemetry system will also provide publishing API for metrics. Telemetry system will provide two publishing mechanism out of the box. They are
- node attributes: metrics will be saved as node attribute itself
- statsd : To publish metrics via statsd endpoints
Telemetry system can be configured via the standard chef config file or using a dedicated CLI flag for chef-client and chef-solo. It will be disabled by default. Following is an example of configuring the telemetry subsystem
enable_telemetry true
config_context(:telemetry) do
resource true # captures per resource execution time taken
recipe true # captures per recipe execution time taken
gc true # captures GC stats during main chef events
process true # captures process memory stats from /proc during main chef events
client_run true # captures time spent on major chef run milestones
publish_using(
Chef::Telemetry::Publisher::Statsd.new(host: '192.168.2.11', port: 7676), # emit data to statsd
Chef::Telemetry::Publisher::NodeAttribute.new('chef-metrics') # save all metrics under node['chef-metrics'] attribute
)
end
Example of enabling telemetry using the CLI flag
chef-client --enable-telemetry
chef-solo --enable-telemetry
Telemetry publishing API will be a single method named #publish that accept a hash that
represents the metrics. publish_using
method is used to register a metrics publisher
with the telemetry system.
This work is in the public domain. In jurisdictions that do not allow for this, this work is available under CC0. To the extent possible under law, the person who associated CC0 with this work has waived all copyright and related or neighboring rights to this work.