Commit

Merge pull request cortexproject#6329 from CharlieTLe/proofread-gossip-ring-getting-started
CharlieTLe authored Nov 13, 2024
2 parents 1d09628 + 5d3f5a9 commit 2c18450
Showing 18 changed files with 193 additions and 180 deletions.
15 changes: 8 additions & 7 deletions docs/guides/alert-manager-configuration.md
Original file line number Diff line number Diff line change
@@ -7,30 +7,30 @@ slug: alertmanager-configuration

## Context

Cortex Alertmanager notification setup follow mostly the syntax of Prometheus Alertmanager since it is based on the same codebase. The following is a description on how to load the configuration setup so that Alertmanager can use for notification when an alert event happened.
Cortex Alertmanager notification setup follows mostly the syntax of Prometheus Alertmanager since it is based on the same codebase. The following is a description on how to load the configuration setup so that Alertmanager can use it for notification when an alert event happens.

### Configuring the Cortex Alertmanager storage backend

With the introduction of Cortex 1.8 the storage backend config option shifted to the new pattern [#3888](https://github.com/cortexproject/cortex/pull/3888). You can find the new configuration [here](../configuration/config-file-reference.md#alertmanager_storage_config)
With the introduction of Cortex 1.8, the storage backend config option shifted to the new pattern [#3888](https://github.com/cortexproject/cortex/pull/3888). You can find the new configuration [here](../configuration/config-file-reference.md#alertmanager_storage_config)

Note that when using `-alertmanager.sharding-enabled=true`, the following storage backends are not supported: `local`, `configdb`.

When using the new configuration pattern it is important that any of the old configuration pattern flags are unset (`-alertmanager.storage`), as well as `-<prefix>.configs.url`. This is because the old pattern still takes precedence over the new one. The old configuration pattern (`-alertmanager.storage`) is marked as deprecated and will be removed by Cortex version 1.11. However this change doesn't apply to `-alertmanager.storage.path` and `-alertmanager.storage.retention`.
When using the new configuration pattern, it is important that any of the old configuration pattern flags are unset (`-alertmanager.storage`), as well as `-<prefix>.configs.url`. This is because the old pattern still takes precedence over the new one. The old configuration pattern (`-alertmanager.storage`) is marked as deprecated and will be removed by Cortex version 1.11. However, this change doesn't apply to `-alertmanager.storage.path` and `-alertmanager.storage.retention`.

### Cortex Alertmanager configuration

Cortex Alertmanager can be uploaded via Cortex [Set Alertmanager configuration API](../api/_index.md#set-alertmanager-configuration) or using [Cortex Tools](https://github.com/cortexproject/cortex-tools).

Follow the instruction at the `cortextool` link above to download or update to the latest version of the tool.
Follow the instructions at the `cortextool` link above to download or update to the latest version of the tool.

To obtain the full help of how to use `cortextool` for all commands and flags, use
`cortextool --help-long`.

The following example shows the steps to upload the configuration to Cortex `Alertmanager` using `cortextool`.

#### 1. Create the Alertmanager configuration `yml` file.
#### 1. Create the Alertmanager configuration YAML file.

The following is `amconfig.yml`, an example of a configuration for Cortex `Alertmanager` to send notification via email:
The following is `amconfig.yml`, an example of a configuration for Cortex `Alertmanager` to send notifications via email:

```
global:
@@ -50,7 +50,7 @@ receivers:
- to: 'someone@localhost'
```

[Example on how to setup Slack](https://grafana.com/blog/2020/02/25/step-by-step-guide-to-setting-up-prometheus-alertmanager-with-slack-pagerduty-and-gmail/#:~:text=To%20set%20up%20alerting%20in,to%20receive%20notifications%20from%20Alertmanager.) to support receiving Alertmanager notification.
[Example on how to set up Slack](https://grafana.com/blog/2020/02/25/step-by-step-guide-to-setting-up-prometheus-alertmanager-with-slack-pagerduty-and-gmail/#:~:text=To%20set%20up%20alerting%20in,to%20receive%20notifications%20from%20Alertmanager.) to support receiving Alertmanager notifications.

#### 2. Upload the Alertmanager configuration

@@ -76,3 +76,4 @@ cortextool alertmanager get \
```
--id=100 \
--key=<yourKey>
```
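
For readers who would rather call the HTTP API than `cortextool`, the upload step in this guide can be sketched in Python. This is a rough sketch under stated assumptions, not the project's own tooling: the `POST /api/v1/alerts` path and the `alertmanager_config` body key are assumed from the Cortex API reference and should be verified there, the base URL is a placeholder, and the small configuration below is hypothetical (it is not the elided `amconfig.yml` from the guide). The tenant ID `100` mirrors the `--id=100` flag shown above. The request is built but not sent.

```python
import textwrap
from urllib.request import Request

# Hypothetical minimal Alertmanager configuration, used only for illustration.
EXAMPLE_CONFIG = """\
route:
  receiver: example-email
receivers:
  - name: example-email
    email_configs:
      - to: 'someone@localhost'
"""


def build_upload_request(base_url: str, tenant_id: str, config_yaml: str) -> Request:
    """Build (but do not send) a request that uploads an Alertmanager config."""
    # Assumption: the API expects a YAML document whose `alertmanager_config`
    # key holds the whole Alertmanager config as a block scalar.
    body = "alertmanager_config: |\n" + textwrap.indent(
        config_yaml.rstrip("\n") + "\n", "  "
    )
    return Request(
        f"{base_url.rstrip('/')}/api/v1/alerts",  # assumed endpoint path
        data=body.encode("utf-8"),
        method="POST",
        headers={"X-Scope-OrgID": tenant_id},
    )


req = build_upload_request("http://cortex.example:8080", "100", EXAMPLE_CONFIG)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would perform the upload; any credentials required by a reverse proxy in front of Cortex would have to be added separately.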

13 changes: 7 additions & 6 deletions docs/guides/authentication-and-authorisation.md
@@ -9,14 +9,14 @@ All Cortex components take the tenant ID from a header `X-Scope-OrgID`
on each request. A tenant (also called "user" or "org") is the owner of
a set of series written to and queried from Cortex. All Cortex components
trust this value completely: if you need to protect your Cortex installation
from accidental or malicious calls then you must add an additional layer
from accidental or malicious calls, then you must add an additional layer
of protection.

Typically this means you run Cortex behind a reverse proxy, and you must
Typically, this means you run Cortex behind a reverse proxy, and you must
ensure that all callers, both machines sending data over the `remote_write`
interface and humans sending queries from GUIs, supply credentials
which identify them and confirm they are authorised. When configuring the
`remote_write` API in Prometheus, the user and password fields of http Basic
`remote_write` API in Prometheus, the user and password fields of HTTP Basic
auth, or Bearer token, can be used to convey the tenant ID and/or credentials.
See the [Cortex-Tenant](#cortex-tenant) section below for one way to solve this.
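
The reverse-proxy idea described in this hunk can be illustrated with a short Python sketch. It assumes a hypothetical setup in which the HTTP Basic auth username doubles as the tenant ID; the credential table, the function name, and the error type are all made up for illustration and are not part of Cortex. The function validates the `Authorization` value and returns the `X-Scope-OrgID` header a proxy would attach before forwarding the request.

```python
import base64
import hmac

# Hypothetical credential table: the username is also used as the tenant ID.
CREDENTIALS = {"team-a": "s3cret-a", "team-b": "s3cret-b"}


def tenant_headers(authorization: str, credentials=CREDENTIALS) -> dict:
    """Check an HTTP Basic Authorization value and return the tenant header."""
    scheme, _, encoded = authorization.partition(" ")
    if scheme.lower() != "basic":
        raise PermissionError("expected HTTP Basic auth")
    try:
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except ValueError as exc:  # covers invalid base64 and invalid UTF-8
        raise PermissionError("malformed credentials") from exc
    user, _, password = decoded.partition(":")
    expected = credentials.get(user)
    # Constant-time comparison avoids leaking how much of the password matched.
    if expected is None or not hmac.compare_digest(
        password.encode("utf-8"), expected.encode("utf-8")
    ):
        raise PermissionError("unknown user or wrong password")
    # Cortex trusts this header completely, so it is only set after the check.
    return {"X-Scope-OrgID": user}


token = base64.b64encode(b"team-a:s3cret-a").decode("ascii")
print(tenant_headers(f"Basic {token}"))
```

A real proxy would also strip any `X-Scope-OrgID` header supplied by the caller, so that clients cannot pick a tenant themselves.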

@@ -34,7 +34,7 @@ To disable the multi-tenant functionality, you can pass the argument
to the string `fake` for every request.

Note that the tenant ID that is used to write the series to the datastore
should be the same as the one you use to query the data. If they don't match
should be the same as the one you use to query the data. If they don't match,
you won't see any data. As of now, you can't see series from other tenants.

For more information regarding the tenant ID limits, refer to: [Tenant ID limitations](./limitations.md#tenant-id-naming)
@@ -48,6 +48,7 @@ It can be placed between Prometheus and Cortex and will search for a predefined
label and use its value as `X-Scope-OrgID` header when proxying the timeseries to Cortex.

This can help to run Cortex in a trusted environment where you want to separate your metrics
into distinct namespaces by some criteria (e.g. teams, applications, etc).
into distinct namespaces by some criteria (e.g. teams, applications, etc.).

Be advised that **cortex-tenant** is a third-party community project and it's not maintained by the Cortex team.

Be advised that **cortex-tenant** is a third-party community project and it's not maintained by Cortex team.
79 changes: 40 additions & 39 deletions docs/guides/capacity-planning.md
@@ -15,52 +15,53 @@ sent to Cortex.

Some key parameters are:

1. The number of active series. If you have Prometheus already you
can query `prometheus_tsdb_head_series` to see this number.
2. Sampling rate, e.g. a new sample for each series every minute
(the default Prometheus [scrape_interval](https://prometheus.io/docs/prometheus/latest/configuration/configuration/)).
Multiply this by the number of active series to get the
total rate at which samples will arrive at Cortex.
3. The rate at which series are added and removed. This can be very
high if you monitor objects that come and go - for example if you run
thousands of batch jobs lasting a minute or so and capture metrics
with a unique ID for each one. [Read how to analyse this on
Prometheus](https://www.robustperception.io/using-tsdb-analyze-to-investigate-churn-and-cardinality).
4. How compressible the time-series data are. If a metric stays at
the same value constantly, then Cortex can compress it very well, so
12 hours of data sampled every 15 seconds would be around 2KB. On
the other hand if the value jumps around a lot it might take 10KB.
There are not currently any tools available to analyse this.
5. How long you want to retain data for, e.g. 1 month or 2 years.
1. The number of active series. If you have Prometheus already, you
can query `prometheus_tsdb_head_series` to see this number.
2. Sampling rate, e.g. a new sample for each series every minute
(the default Prometheus [scrape_interval](https://prometheus.io/docs/prometheus/latest/configuration/configuration/)).
Multiply this by the number of active series to get the
total rate at which samples will arrive at Cortex.
3. The rate at which series are added and removed. This can be very
high if you monitor objects that come and go - for example, if you run
thousands of batch jobs lasting a minute or so and capture metrics
with a unique ID for each one. [Read how to analyse this on
Prometheus](https://www.robustperception.io/using-tsdb-analyze-to-investigate-churn-and-cardinality).
4. How compressible the time-series data are. If a metric stays at
the same value constantly, then Cortex can compress it very well, so
12 hours of data sampled every 15 seconds would be around 2KB. On
the other hand, if the value jumps around a lot, it might take 10KB.
There are not currently any tools available to analyse this.
5. How long you want to retain data for, e.g. 1 month or 2 years.

Other parameters which can become important if you have particularly
high values:

6. Number of different series under one metric name.
7. Number of labels per series.
8. Rate and complexity of queries.
6. Number of different series under one metric name.
7. Number of labels per series.
8. Rate and complexity of queries.

Now, some rules of thumb:

1. Each million series in an ingester takes 15GB of RAM. Total number
of series in ingesters is number of active series times the
replication factor. This is with the default of 12-hour chunks - RAM
required will reduce if you set `-ingester.max-chunk-age` lower
(trading off more back-end database IO).
There are some additional considerations for planning for ingester memory usage.
1. Memory increases during write ahead log (WAL) replay, [See Prometheus issue #6934](https://github.com/prometheus/prometheus/issues/6934#issuecomment-726039115). If you do not have enough memory for WAL replay, the ingester will not be able to restart successfully without intervention.
2. Memory temporarily increases during resharding since timeseries are temporarily on both the new and old ingesters. This means you should scale up the number of ingesters before memory utilization is too high, otherwise you will not have the headroom to account for the temporary increase.
2. Each million series (including churn) consumes 15GB of chunk
storage and 4GB of index, per day (so multiply by the retention
period).
3. The distributors CPU utilization depends on the specific Cortex cluster
setup, while they don't need much RAM. Typically, distributors are capable
to process between 20,000 and 100,000 samples/sec with 1 CPU core. It's also
highly recommended to configure Prometheus `max_samples_per_send` to 1,000
samples, in order to reduce the distributors CPU utilization given the same
total samples/sec throughput.
1. Each million series in an ingester takes 15GB of RAM. The total number
of series in ingesters is the number of active series times the
replication factor. This is with the default of 12-hour chunks - RAM
required will reduce if you set `-ingester.max-chunk-age` lower
(trading off more back-end database I/O).
There are some additional considerations for planning for ingester memory usage.
1. Memory increases during write-ahead log (WAL) replay, [See Prometheus issue #6934](https://github.com/prometheus/prometheus/issues/6934#issuecomment-726039115). If you do not have enough memory for WAL replay, the ingester will not be able to restart successfully without intervention.
2. Memory temporarily increases during resharding since timeseries are temporarily on both the new and old ingesters. This means you should scale up the number of ingesters before memory utilization is too high, otherwise you will not have the headroom to account for the temporary increase.
2. Each million series (including churn) consumes 15GB of chunk
storage and 4GB of index, per day (so multiply by the retention
period).
3. The distributors CPU utilization depends on the specific Cortex cluster
setup, while they don't need much RAM. Typically, distributors are capable
of processing between 20,000 and 100,000 samples/sec with 1 CPU core. It's also
highly recommended to configure Prometheus `max_samples_per_send` to 1,000
samples, in order to reduce the distributors CPU utilization given the same
total samples/sec throughput.

If you turn on compression between distributors and ingesters (for
example to save on inter-zone bandwidth charges at AWS/GCP) they will use
significantly more CPU (approx 100% more for distributor and 50% more
example, to save on inter-zone bandwidth charges at AWS/GCP), they will use
significantly more CPU (approx. 100% more for distributor and 50% more
for ingester).
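
The rules of thumb in this file lend themselves to a quick back-of-the-envelope calculation. The sketch below applies them literally: 15GB of ingester RAM per million series times the replication factor, 15GB of chunks plus 4GB of index per million series per day, and 20,000 to 100,000 samples/sec per distributor core. The function and field names are illustrative only, and the estimate ignores churn, WAL replay and resharding headroom, and the extra CPU cost of compression, so it should be read as a lower bound rather than a sizing tool.

```python
import math
from dataclasses import dataclass


@dataclass
class Estimate:
    ingester_ram_gb: float       # summed across all ingesters
    storage_gb: float            # chunks + index over the retention period
    samples_per_sec: float       # total ingest rate
    distributor_cores_low: int   # assuming 100,000 samples/sec per core
    distributor_cores_high: int  # assuming 20,000 samples/sec per core


def estimate(active_series: int, replication_factor: int,
             scrape_interval_s: float, retention_days: int) -> Estimate:
    """Apply the capacity-planning rules of thumb to one workload."""
    millions = active_series / 1_000_000
    # 15GB of RAM per million series held in ingesters, after replication.
    ingester_ram_gb = millions * replication_factor * 15
    # 15GB of chunks + 4GB of index per million series per day (churn ignored).
    storage_gb = millions * (15 + 4) * retention_days
    # One sample per series per scrape interval.
    samples_per_sec = active_series / scrape_interval_s
    return Estimate(
        ingester_ram_gb=ingester_ram_gb,
        storage_gb=storage_gb,
        samples_per_sec=samples_per_sec,
        distributor_cores_low=math.ceil(samples_per_sec / 100_000),
        distributor_cores_high=math.ceil(samples_per_sec / 20_000),
    )


# Example workload: 2 million series, replication factor 3,
# 15-second scrape interval, 30 days of retention.
print(estimate(2_000_000, 3, 15, 30))
```

For that example workload the rules give 90GB of ingester RAM in total, 1,140GB of storage, roughly 133,000 samples/sec, and between 2 and 7 distributor cores.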

7 changes: 4 additions & 3 deletions docs/guides/encryption-at-rest.md
@@ -48,8 +48,8 @@ The alertmanager S3 server-side encryption can be configured similarly to the bl

### Per-tenant config overrides

The S3 client used by the blocks storage, ruler and alertmanager supports S3 SSE config overrides on a per-tenant basis, using the [runtime configuration file](../configuration/arguments.md#runtime-configuration-file).
The following settings can ben overridden for each tenant:
The S3 client used by the blocks storage, ruler, and alertmanager supports S3 SSE config overrides on a per-tenant basis, using the [runtime configuration file](../configuration/arguments.md#runtime-configuration-file).
The following settings can be overridden for each tenant:

- **`s3_sse_type`**<br />
S3 server-side encryption type. It must be set to enable the SSE config override for a given tenant.
@@ -60,4 +60,5 @@ The following settings can ben overridden for each tenant:

## Other storages

Other storage backends may support encryption at rest configuring it directly at the storage level.
Other storage backends may support encryption at rest, configuring it directly at the storage level.

7 changes: 4 additions & 3 deletions docs/guides/encryption-at-rest.template
@@ -30,8 +30,8 @@ The alertmanager S3 server-side encryption can be configured similarly to the bl

### Per-tenant config overrides

The S3 client used by the blocks storage, ruler and alertmanager supports S3 SSE config overrides on a per-tenant basis, using the [runtime configuration file](../configuration/arguments.md#runtime-configuration-file).
The following settings can ben overridden for each tenant:
The S3 client used by the blocks storage, ruler, and alertmanager supports S3 SSE config overrides on a per-tenant basis, using the [runtime configuration file](../configuration/arguments.md#runtime-configuration-file).
The following settings can be overridden for each tenant:

- **`s3_sse_type`**<br />
S3 server-side encryption type. It must be set to enable the SSE config override for a given tenant.
@@ -42,4 +42,5 @@ The following settings can ben overridden for each tenant:

## Other storages

Other storage backends may support encryption at rest configuring it directly at the storage level.
Other storage backends may support encryption at rest, configuring it directly at the storage level.

6 changes: 3 additions & 3 deletions docs/guides/glossary.md
@@ -21,7 +21,7 @@ A single chunk contains timestamp-value pairs for several series.

Churn is the frequency at which series become idle.

A series become idle once it's not exported anymore by the monitored targets. Typically, series become idle when the monitored target itself disappear (eg. the process or node gets terminated).
A series becomes idle once it's not exported anymore by the monitored targets. Typically, series become idle when the monitored target itself disappears (eg. the process or node gets terminated).

### Flushing

@@ -35,7 +35,7 @@ For more information, please refer to the guide "[Config for sending HA Pairs da

### Hash ring

The hash ring is a distributed data structure used by Cortex for sharding, replication and service discovery. The hash ring data structure gets shared across Cortex replicas via gossip or a key-value store.
The hash ring is a distributed data structure used by Cortex for sharding, replication, and service discovery. The hash ring data structure gets shared across Cortex replicas via gossip or a key-value store.

For more information, please refer to the [Architecture](../architecture.md#the-hash-ring) documentation.

@@ -94,6 +94,6 @@ _See [Tenant](#tenant)._

### WAL

The Write-Ahead Log (WAL) is an append only log stored on disk used by ingesters to recover their in-memory state after the process gets restarted, either after a clear shutdown or an abruptly termination.
The Write-Ahead Log (WAL) is an append-only log stored on disk used by ingesters to recover their in-memory state after the process gets restarted, either after a clear shutdown or an abrupt termination.

For more information, please refer to [Ingesters with WAL](../blocks-storage/_index.md#the-write-path).