NOTE: This is the doc for developing this feature; for user docs please check the user guide.
- ecsobserver discovers ECS tasks
- Integration test
After building your own image `12346.dkr.ecr.us-west-2.amazonaws.com/aoc:ecssd-0.2`, you can use this cfn to launch the collector on an ECS EC2 cluster.
```bash
export CLUSTER_NAME=aoc-prometheus-dashboard-1
export CREATE_IAM_ROLES=True
export COLLECTOR_IMAGE=12346.dkr.ecr.us-west-2.amazonaws.com/aoc:ecssd-0.2

aws cloudformation create-stack --stack-name AOC-Prometheus-ECS-${CLUSTER_NAME} \
    --template-body file://aws-otel-container-insights-prometheus-ec2-deployment-cfn.yaml \
    --parameters ParameterKey=ClusterName,ParameterValue=${CLUSTER_NAME} \
                 ParameterKey=CreateIAMRoles,ParameterValue=${CREATE_IAM_ROLES} \
                 ParameterKey=CollectorImage,ParameterValue=${COLLECTOR_IMAGE} \
    --capabilities CAPABILITY_NAMED_IAM
```
It will create the following resources:

- SSM parameter
- IAM roles, `AWSOTelRolePrometheusECS` and `AWSOTelExecutionRolePrometheusECS` by default
- ECS task definition and replica service
If you need to test your image frequently, you need a script to update the SSM parameter, push the image, and scale the service down to 0 and back to 1, because the cloudformation stack is a bit slow for iteration. (The script is an exercise for the reader; hint: the aws cli can't upload an SSM parameter value from a file name.)
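For reference, a minimal iteration script could look like the sketch below. The service name, SSM parameter name, and config file path are assumptions that depend on how the cfn stack named things; the parameter value is inlined via command substitution because of the file-name limitation above.

```bash
#!/usr/bin/env bash
set -euo pipefail

CLUSTER_NAME=aoc-prometheus-dashboard-1
SERVICE_NAME=aoc-prometheus-service   # assumption: replica service created by the cfn stack
SSM_PARAM=otel-collector-config       # assumption: SSM parameter read by the task definition
IMAGE=12346.dkr.ecr.us-west-2.amazonaws.com/aoc:ecssd-0.2

# Push the freshly built collector image.
docker push "${IMAGE}"

# Update the collector config stored in SSM; inline the file content
# since the value can't be uploaded from a file name.
aws ssm put-parameter --name "${SSM_PARAM}" --type String --overwrite \
  --value "$(cat config.yaml)"

# Scale the service down to 0 and back to 1 so new tasks pick up
# the new image and config.
aws ecs update-service --cluster "${CLUSTER_NAME}" --service "${SERVICE_NAME}" --desired-count 0
aws ecs update-service --cluster "${CLUSTER_NAME}" --service "${SERVICE_NAME}" --desired-count 1
```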
NOTE: some problems (or problematic solutions...) also apply to (or are copied from) Container Insights EKS Prometheus.
To understand the codebase, check the README in ecsobserver. You can also use cloudwatch-agent as a reference.
Labels are key value pairs, e.g. `env=prod`. They are called Dimensions in CloudWatch. There is no direct translation from label to dimension because CloudWatch does not support too many dimensions. Metric declarations allow picking some labels as dimensions. There is also dimension rollup, but we disable it using `NoDimensionRollup`.
For the builtin dashboard to work, specific metric dimensions are required. In ecsobserver, we export labels with the `__meta_ecs_` prefix (e.g. `__meta_ecs_task_definition_family`), which is different from cloudwatch-agent. Using the `__` prefix is more popular in prometheus's builtin discovery implementations, so we followed that instead when porting the discovery logic. To get a dimension like `TaskDefinitionFamily` in CloudWatch we go through two steps:

- prometheus relabel ensures we carry the label down the pipeline, otherwise all labels with the `__` prefix are dropped. `__meta_ecs_task_definition_family=MyTask` becomes `TaskDefinitionFamily=MyTask`.
- emf's metric declarations pick some labels as metric dimensions; other labels become structured log fields. `MyTask` becomes a dimension value that will show up in the dashboard when you slice and dice.
```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "ecssd"
          relabel_configs: # Relabel here because label with __ prefix will be dropped by receiver.
            - source_labels: [ __meta_ecs_task_definition_family ] # TaskDefinitionFamily
              action: replace
              target_label: TaskDefinitionFamily
exporters:
  awsemf:
    metric_declarations:
      - dimensions: [ [ ClusterName, TaskDefinitionFamily, ServiceName ] ] # dimension names are same as our relabeled keys.
        label_matchers:
          - label_names:
              - ServiceName
            regex: '^.*nginx-service$'
        metric_name_selectors:
          - "^nginx_.*$"
```
We allow users to specify different names using `job_name` in config. They are NOT exported as `job`; instead, the value from `job_label_name` is used as the exported label key (e.g. `prometheus_job`). Then in processors we use the `metricstransform` processor to rename `prometheus_job` back to `job`.
Why don't we just use `job` directly? The short answer is that the prometheus receiver does not support specifying `job` in discovery output. We use `file_sd` as the actual discovery implementation to bridge our discovery result, so all the targets are under the job `ecssd` in prometheus config. However, the prometheus receiver does not behave exactly like prometheus: it relies on the job name to detect metric type. If we export a target with job `nginx-prometheus-exporter`, the receiver will look up the metadata cache using `nginx-prometheus-exporter` while the only job in the cache is `ecssd`, and the result is metric type unknown. The comment in this PR gives more detail and links to the upstream issue.
```yaml
extensions:
  ecs_observer: # extension type is ecs_observer
    # custom name for 'job' so we can rename it back to 'job' using metricstransform processor
    job_label_name: prometheus_job
    result_file: '/etc/ecs_sd_targets.yaml'
    services:
      - name_pattern: '^.*nginx-service$' # NGINX
        metrics_ports:
          - 9113
        job_name: nginx-prometheus-exporter
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "ecssd"
          file_sd_configs:
            - files:
                - '/etc/ecs_sd_targets.yaml' # MUST match the file name in ecs_observer.result_file
processors:
  metricstransform:
    transforms:
      - include: ".*" # Rename customized job label back to job
        match_type: regexp
        action: update
        operations:
          - label: prometheus_job # must match the value configured in ecs_observer
            new_label: job
            action: update_label
```
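For reference, the discovery result written to `result_file` is in prometheus `file_sd` format, roughly like the sketch below. The target address and label values are made up, and the exact `__meta_ecs_` label names are illustrative. Note the job name shows up under `prometheus_job` (the configured `job_label_name`), not `job`.

```yaml
- targets:
    - '10.0.0.1:9113'
  labels:
    __meta_ecs_cluster_name: aoc-prometheus-dashboard-1
    __meta_ecs_task_definition_family: my-nginx
    __meta_ecs_service_name: my-nginx-service
    prometheus_job: nginx-prometheus-exporter
```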
`prom_metric_type` is a label only used by CloudWatch builtin dashboards. To set it, we changed the EMF exporter to look up resource attributes and change its output when the receiver is `prometheus`. However, `receiver` is not a default attribute, so we insert it manually using the `resource` processor. In other words, our solution only works when the prometheus receiver is the only metrics receiver sending metrics to the CloudWatch EMF exporter in the pipeline.
```yaml
processors:
  resource:
    attributes:
      - key: receiver # Insert receiver: prometheus for CloudWatch EMF Exporter to add prom_metric_type
        value: "prometheus"
        action: insert
```
Unlike EKS, ECS has a reliable way to discover the current cluster using an endpoint provided by the ECS agent. We didn't include it in the initial release because we already have two components with duplicated code for the metadata client. To implement this feature, just check the metadata API when the user gives an empty cluster name. Scraping metrics in cluster A using a collector running in cluster B is a valid use case, so we shouldn't override the cluster name if the user has already provided one. In fact, the collector can run anywhere as long as it can connect to the AWS API and the ECS tasks.
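For example, when running inside ECS, the cluster could be read from the task metadata endpoint (a sketch assuming the v4 endpoint is available; the `Cluster` field may be a short name or a full ARN depending on the platform):

```bash
# The ECS agent injects ECS_CONTAINER_METADATA_URI_V4 into every container it starts.
curl -s "${ECS_CONTAINER_METADATA_URI_V4}/task" | jq -r '.Cluster'
```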
- 2021-06-23 @pingleig init the doc, ported from #435