Merge pull request #3458 from bobrik/ivan/no-accelerator
Remove mentions of accelerator from the docs
iwankgb authored Jan 21, 2024
2 parents 786dbcf + 13df731 commit 27f1e92
Showing 4 changed files with 14 additions and 67 deletions.
30 changes: 0 additions & 30 deletions deploy/kubernetes/overlays/examples/gpu-privilages.yaml

This file was deleted.

25 changes: 1 addition & 24 deletions docs/running.md
@@ -19,7 +19,7 @@ sudo docker run \

cAdvisor is now running (in the background) on `http://localhost:8080/`. The setup includes directories with Docker state cAdvisor needs to observe.

**Note**:
- If the docker daemon is running with [user namespace enabled](https://docs.docker.com/engine/reference/commandline/dockerd/#starting-the-daemon-with-user-namespaces-enabled),
you need to add the `--userns=host` option for cAdvisor to monitor Docker containers;
otherwise cAdvisor cannot connect to the docker daemon.
@@ -122,26 +122,3 @@ cAdvisor is now running (in the foreground) on `http://localhost:8080/`.
## Runtime Options

cAdvisor has a series of flags that can be used to configure its runtime behavior. More details can be found in runtime [options](runtime_options.md).

## Hardware Accelerator Monitoring

cAdvisor can export some metrics for hardware accelerators attached to containers.
Currently only Nvidia GPUs are supported, and there are no machine-level metrics,
so metrics won't show up unless a container with accelerators attached is running.
Metrics will only show up if accelerators are explicitly attached to the container, e.g., by passing the `--device /dev/nvidia0:/dev/nvidia0` flag to docker.
If nothing is explicitly attached to the container, metrics will NOT show up; this can happen when you access accelerators from privileged containers.

cAdvisor needs two things to show Nvidia GPU metrics:
- access to the NVML library (`libnvidia-ml.so.1`).
- access to the GPU devices.

If you are running cAdvisor inside a container, you will need to do the following to give the container access to the NVML library:
```
-e LD_LIBRARY_PATH=<path-where-nvml-is-present>
--volume <above-path>:<above-path>
```

If you are running cAdvisor inside a container, you can do one of the following to give it access to the GPU devices:
- Run with `--privileged`
- If you are on docker v17.04.0-ce or above, run with `--device-cgroup-rule 'c 195:* mrw'`
- Run with `--device /dev/nvidiactl:/dev/nvidiactl /dev/nvidia0:/dev/nvidia0 /dev/nvidia1:/dev/nvidia1 <and-so-on-for-all-nvidia-devices>`
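
For reference, a minimal sketch combining the flags from this (now-removed) section with a standard cAdvisor invocation; the NVML path `/usr/lib/nvidia` and the volume set are illustrative assumptions, not part of the original docs:

```
# Hypothetical example: give cAdvisor access to the NVML library and to all
# Nvidia devices through a device cgroup rule (docker v17.04.0-ce or above).
# /usr/lib/nvidia is an assumed NVML location; adjust it for your host.
sudo docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/usr/lib/nvidia:/usr/lib/nvidia \
  -e LD_LIBRARY_PATH=/usr/lib/nvidia \
  --device-cgroup-rule='c 195:* mrw' \
  --publish=8080:8080 \
  --detach=true \
  --name=cadvisor \
  gcr.io/cadvisor/cadvisor
```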
24 changes: 12 additions & 12 deletions docs/runtime_options.md
@@ -10,7 +10,7 @@ This document describes a set of runtime flags available in cAdvisor.

* `--env_metadata_whitelist`: a comma-separated list of environment variable keys that need to be collected for containers; only the containerd and docker runtimes are supported for now.
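
For example, a quick sketch of the flag in use (the chosen variable names are arbitrary):

```
# Illustrative only: collect the PATH and HOSTNAME environment variables
# for each container (containerd and docker runtimes only).
cadvisor --env_metadata_whitelist=PATH,HOSTNAME
```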

## Limiting which containers are monitored
* `--docker_only=false` - do not report raw cgroup metrics, except the root cgroup.
* `--raw_cgroup_prefix_whitelist` - a comma-separated list of cgroup path prefixes that need to be collected even when `--docker_only` is specified
* `--disable_root_cgroup_stats=false` - disable collecting root Cgroup stats.
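
A sketch combining these flags (the `/kubepods` prefix is an assumption for illustration):

```
# Illustrative: report Docker containers plus raw cgroups under /kubepods,
# and skip collecting stats for the root cgroup.
cadvisor --docker_only \
  --raw_cgroup_prefix_whitelist=/kubepods \
  --disable_root_cgroup_stats
```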
@@ -134,8 +134,8 @@ cAdvisor stores the latest historical data in memory. How long of a history it s
--application_metrics_count_limit=100: Max number of application metrics to store (per container) (default 100)
--collector_cert="": Collector's certificate, exposed to endpoints for certificate based authentication.
--collector_key="": Key for the collector's certificate
- --disable_metrics=<metrics>: comma-separated list of metrics to be disabled. Options are accelerator,advtcp,app,cpu,cpuLoad,cpu_topology,cpuset,disk,diskIO,hugetlb,memory,memory_numa,network,oom_event,percpu,perf_event,process,referenced_memory,resctrl,sched,tcp,udp. (default advtcp,cpu_topology,cpuset,hugetlb,memory_numa,process,referenced_memory,resctrl,sched,tcp,udp)
- --enable_metrics=<metrics>: comma-separated list of metrics to be enabled. If set, overrides 'disable_metrics'. Options are accelerator,advtcp,app,cpu,cpuLoad,cpu_topology,cpuset,disk,diskIO,hugetlb,memory,memory_numa,network,oom_event,percpu,perf_event,process,referenced_memory,resctrl,sched,tcp,udp.
+ --disable_metrics=<metrics>: comma-separated list of metrics to be disabled. Options are advtcp,app,cpu,cpuLoad,cpu_topology,cpuset,disk,diskIO,hugetlb,memory,memory_numa,network,oom_event,percpu,perf_event,process,psi_avg,psi_total,referenced_memory,resctrl,sched,tcp,udp. (default advtcp,cpu_topology,cpuset,hugetlb,memory_numa,process,referenced_memory,resctrl,sched,tcp,udp)
+ --enable_metrics=<metrics>: comma-separated list of metrics to be enabled. If set, overrides 'disable_metrics'. Options are advtcp,app,cpu,cpuLoad,cpu_topology,cpuset,disk,diskIO,hugetlb,memory,memory_numa,network,oom_event,percpu,perf_event,process,psi_avg,psi_total,referenced_memory,resctrl,sched,tcp,udp.
--prometheus_endpoint="/metrics": Endpoint to expose Prometheus metrics on (default "/metrics")
--disable_root_cgroup_stats=false: Disable collecting root Cgroup stats
```
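
As an illustration, the exported metric set can be narrowed like this (the chosen metrics are arbitrary):

```
# Illustrative: export only CPU, memory, network, and disk I/O metrics;
# --enable_metrics overrides any --disable_metrics setting.
cadvisor --enable_metrics=cpu,memory,network,diskIO
```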
@@ -191,7 +191,7 @@ in mind that it is impossible to group more events than there are counters avail

#### Getting config values
Using perf tools:
* Identify the event in `perf list` output.
* Execute command: `perf stat -I 5000 -vvv -e EVENT_NAME`
* Find the `perf_event_attr` section in the `perf stat` output and copy the config and type fields to the configuration file.

Expand All @@ -208,7 +208,7 @@ perf_event_attr:
exclude_guest 1
------------------------------------------------------------
```
* The configuration file should look like:
```json
{
"core": {
@@ -242,15 +242,15 @@
}
```

Config values can also be obtained from:
* [Intel® 64 and IA32 Architectures Performance Monitoring Events](https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia32-architectures-performance-monitoring-events.html)


##### Uncore Events configuration
An uncore event name should be in the form `PMU_PREFIX/event_name`, where **PMU_PREFIX** means
that statistics will be counted on all PMUs with that prefix in their name.

Let's explain this with an example:

```json
{
@@ -260,7 +260,7 @@ Let's explain this by example:
"uncore_imc_0/cas_count_write",
"cas_count_all"
],
"custom_events": [
"custom_events": [
{
"config": [
"0x304"
@@ -419,11 +419,11 @@ See example configuration below:
```

In the example above:
* `instructions` will be measured as a non-grouped event and is specified using the human-friendly interface that can be
obtained by calling `perf list`. You can use any name that appears in the output of the `perf list` command. This is
the interface that the majority of users will rely on.
* `instructions_retired` will be measured as a non-grouped event and is specified using an advanced API that allows
specifying any available perf event (some of them are not named and can't be specified with a plain string). The event name
should be a human-readable string that will become a metric name.
* `cas_count_read` will be measured as an uncore non-grouped event on all Integrated Memory Controller Performance Monitoring Units because of the unset `type` field and
the `uncore_imc` prefix.
@@ -435,7 +435,7 @@ Resctrl file system is not hierarchical like cgroups, so users should set `--doc

```
--resctrl_interval=0: Resctrl mon groups updating interval. Zero value disables updating mon groups.
```
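
A sketch, assuming the flag accepts a Go duration string:

```
# Illustrative: update resctrl mon groups every 10 seconds; pair with
# --docker_only since the resctrl filesystem is not hierarchical.
cadvisor --docker_only --resctrl_interval=10s
```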

## Storage driver specific instructions:

2 changes: 1 addition & 1 deletion stats/types.go
@@ -22,7 +22,7 @@ import info "github.com/google/cadvisor/info/v1"
// For each container detected by the cAdvisor manager, it will call
// GetCollector() with the devices cgroup path for that container.
// GetCollector() is supposed to return an object that can update
-// accelerator stats for that container.
+// external stats for that container.
type Manager interface {
Destroy()
GetCollector(deviceCgroup string) (Collector, error)
