This document summarises a set of proposals triggered by the tracing documentation PR.
This section explains some terminology required to understand the proposals. Further details can be found in the tracing documentation PR.
Trace mode | Description | Use-case |
---|---|---|
Static | Trace agent from startup to shutdown | Entire lifespan |
Dynamic | Toggle tracing on/off as desired | On-demand "snapshot" |

Trace type | Description | Use-case |
---|---|---|
isolated | traces all relate to single component | Observing lifespan |
collated | traces "grouped" (runtime+agent) | Understanding component interaction |

Lifespan | Trace mode | Trace type |
---|---|---|
short-lived | static | collated if possible, else isolated? |
long-running | dynamic | collated? (to see interactions) |
- Implement all trace types and trace modes for the agent.
- Why?
  - Maximum flexibility.

    Counterargument: Due to the intrusive nature of adding tracing, we have learnt that landing small incremental changes is simpler and quicker!
  - Compatibility with Kata 1.x tracing.

    Counterargument: Agent tracing in Kata 1.x was extremely awkward to set up (to the extent that it's unclear how many users actually used it!). This point, coupled with the new architecture for Kata 2.x, suggests that we may not need to supply the same set of tracing features (in fact, they may not make sense).
- All tracing will be static.
- Why?
  - Because dynamic tracing will always be "partial".

    In fact, not only would it be only a "snapshot" of activity, it may not even be possible to create a complete "trace transaction". If this is true, the trace output would be partial and would appear "unstructured".
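
To illustrate why a dynamically enabled trace may appear "unstructured", here is a minimal sketch. It models spans as plain records rather than using the OpenTelemetry API; the `Span` type, the `find_orphans` helper and the span names are all hypothetical. It lists the spans whose parent was never recorded because tracing was switched on after the parent had started.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record (not an OpenTelemetry type)."""
    span_id: str
    parent_id: Optional[str]
    name: str


def find_orphans(recorded: List[Span]) -> List[Span]:
    """Return spans whose parent is missing from the recorded set."""
    known_ids = {s.span_id for s in recorded}
    return [s for s in recorded
            if s.parent_id is not None and s.parent_id not in known_ids]


# A complete (static) trace of one operation: root -> child -> grandchild.
# The span names are made up for illustration.
full_trace = [
    Span("1", None, "create_container"),
    Span("2", "1", "setup_rootfs"),
    Span("3", "2", "mount"),
]

# Assume dynamic tracing was enabled after "create_container" had started:
# the root span is never recorded, so its child has nothing to attach to.
partial_trace = full_trace[1:]

print([s.name for s in find_orphans(full_trace)])     # []
print([s.name for s in find_orphans(partial_trace)])  # ['setup_rootfs']
```

A back-end receiving the second set of spans cannot rebuild the full transaction tree, which is the "partial" outcome described above.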
- Agent tracing will be "isolated" by default.
- Agent tracing will be "collated" if runtime tracing is also enabled.
- Why?
  - Offers a graceful fallback for agent tracing if runtime tracing is disabled.
  - Simpler code!
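
The selection rule in this proposal is small enough to sketch directly. The helper below is hypothetical (it is not taken from the Kata code base) and only shows the intended fallback behaviour under the assumption that two boolean settings control agent and runtime tracing.

```python
from typing import Optional


def agent_trace_type(agent_tracing: bool, runtime_tracing: bool) -> Optional[str]:
    """Pick the agent trace type under the proposal (hypothetical helper)."""
    if not agent_tracing:
        return None  # agent tracing disabled: nothing to emit
    # Collate with the runtime trace when one exists, else fall back.
    return "collated" if runtime_tracing else "isolated"


for agent, runtime in [(False, False), (False, True), (True, False), (True, True)]:
    print(f"agent={agent} runtime={runtime} -> {agent_trace_type(agent, runtime)}")
```

No separate trace type setting is needed in this sketch: the type follows from which components have tracing enabled.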
- Are your containers long-running or short-lived?
- Would you ever need to turn on tracing "briefly"?
  - If "yes", is a "partial trace" useful or useless?

    Likely to be considered useless as it is a partial snapshot. Alternative tracing methods may be more appropriate than dynamic OpenTelemetry tracing.
- Are you happy to stop a container to enable tracing? If "no", dynamic tracing may be required.
- Would you ever want to trace the agent and the runtime "in isolation" at the same time?
  - If "yes", we need to fully implement `trace_mode=isolated`.

    This seems unlikely though.
The second set of proposals affects the way traces are collected.
Currently:
- The runtime sends trace spans to Jaeger directly.
- The agent will send trace spans to the `trace-forwarder` component.
- The trace forwarder will send trace spans to Jaeger.
Kata agent tracing overview:

    +-------------------------------------------+
    | Host                                      |
    |                                           |
    | +-----------+                             |
    | | Trace     |                             |
    | | Collector |                             |
    | +-----+-----+                             |
    |       ^                  +--------------+ |
    |       | spans            | Kata VM      | |
    | +-----+-----+            |              | |
    | | Kata      |   spans    |     +-----+  | |
    | | Trace     |<-----------------|Kata |  | |
    | | Forwarder |   VSOCK    |     |Agent|  | |
    | +-----------+   Channel  |     +-----+  | |
    |                          +--------------+ |
    +-------------------------------------------+

Currently:
- If agent tracing is enabled but the trace forwarder is not running, the agent will error.
- If the trace forwarder is started but Jaeger is not running, the trace forwarder will error.
- The runtime and agent should:
  - Use the same trace collection implementation.
  - Use common configuration items as far as possible.
- Kata should support more trace collection software or SaaS (for example Zipkin, datadog).
- Trace collection should not block normal runtime/agent operations (for example, if `vsock-exporter`/Jaeger is not running, Kata Containers should work normally).
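
One way to meet the "should not block" goal is to decouple span creation from span delivery. The following is a minimal sketch, not the Kata implementation: `NonBlockingExporter`, its `send` callback and the `unreachable_collector` stand-in are all hypothetical names. Spans are placed on a bounded queue and delivered by a background thread; if the queue is full or the collector cannot be reached, the span is dropped and the caller carries on.

```python
import queue
import threading
from typing import Callable


class NonBlockingExporter:
    """Hypothetical exporter that never blocks or raises into its caller."""

    def __init__(self, send: Callable[[str], None], max_pending: int = 128) -> None:
        self._send = send  # assumed transport, e.g. a vsock or HTTP sender
        self._pending = queue.Queue(maxsize=max_pending)
        self._lock = threading.Lock()
        self.dropped = 0
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _count_drop(self) -> None:
        with self._lock:
            self.dropped += 1

    def export(self, span: str) -> None:
        """Queue a span without waiting; drop it if the queue is full."""
        try:
            self._pending.put_nowait(span)
        except queue.Full:
            self._count_drop()

    def _run(self) -> None:
        """Background loop: send queued spans, dropping any that fail."""
        while True:
            span = self._pending.get()
            if span is None:  # sentinel queued by shutdown()
                return
            try:
                self._send(span)
            except OSError:
                self._count_drop()  # collector unreachable: drop the span

    def shutdown(self) -> None:
        """Flush what is queued, then stop the worker."""
        self._pending.put(None)
        self._worker.join()


def unreachable_collector(span: str) -> None:
    """Stand-in for a collector that is not running."""
    raise ConnectionRefusedError("collector is not running")


exporter = NonBlockingExporter(unreachable_collector)
for i in range(3):
    exporter.export(f"span-{i}")  # returns immediately and never raises
exporter.shutdown()
print(f"dropped spans: {exporter.dropped}")
```

The trade-off in this sketch is that spans are silently lost while the collector is down; a real implementation might also log or count drops so that operators can notice them.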
The Kata runtime and agent both send spans to the trace forwarder. The trace forwarder, acting as a tracing proxy, sends all spans to a tracing back-end, such as Jaeger or datadog.
Pros:
- Runtime/agent will be simple.
- Could update trace collection target while Kata Containers are running.
Cons:
- Requires the trace forwarder component to be running (which adds operational burden).
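
The proxy idea, including the ability to change the collection target while containers are running, can be sketched as follows. This is an illustration under stated assumptions rather than the trace forwarder's real design: the `TraceForwarder` class, its `set_backend` method and the span strings are hypothetical, and two in-memory lists stand in for real back-ends.

```python
import threading
from typing import Callable, List

Backend = Callable[[str], None]


class TraceForwarder:
    """Hypothetical proxy: one entry point, replaceable back-end."""

    def __init__(self, backend: Backend) -> None:
        self._backend = backend
        self._lock = threading.Lock()

    def set_backend(self, backend: Backend) -> None:
        """Swap the collection target without restarting the senders."""
        with self._lock:
            self._backend = backend

    def forward(self, span: str) -> None:
        """Called by the runtime and the agent for every span."""
        with self._lock:
            backend = self._backend
        backend(span)


# Two in-memory lists stand in for real back-ends such as Jaeger or datadog.
jaeger_spans: List[str] = []
datadog_spans: List[str] = []

forwarder = TraceForwarder(jaeger_spans.append)
forwarder.forward("runtime: create sandbox")
forwarder.forward("agent: create container")

forwarder.set_backend(datadog_spans.append)  # retarget while "running"
forwarder.forward("agent: exec process")

print(jaeger_spans)
print(datadog_spans)
```

Because the runtime and agent only ever call `forward`, neither needs to know which back-end is in use, which is the "Runtime/agent will be simple" benefit listed above.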
Send spans to the collector directly from the runtime/agent. This proposal requires the collector to be reachable over the network.
Pros:
- No additional trace forwarder component needed.
Cons:
- Needs more code/configuration to support all trace collectors.
- We could add dynamic and fully isolated tracing at a later stage, if required.
- See the new GitHub project.
- kata-containers-tracing-status gist.
- tracing documentation PR.
- 2021-07-01: A summary of the discussion was posted to the mailing list.
- 2021-06-22: These proposals were discussed in the Kata Architecture Committee meeting.
- 2021-06-18: These proposals were announced on the mailing list.
- Nobody opposed the agent proposals, so they are being implemented.
- The trace collection proposals are still being considered.