Update apm-data and remap for OTel hostmetrics to elastic metrics #13196

Merged on Jun 7, 2024 (11 commits)

Contributor

@lahsivjar lahsivjar commented May 20, 2024

Motivation/summary

Remaps metrics produced by OTel's hostmetrics receiver to Elastic compatible metrics. This powers parts of Kibana's curated UIs around host and system.
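
To make the idea of "remapping" concrete, here is a minimal sketch. It is not the implementation used by this PR (the real remapping is done in the Go libraries that apm-server depends on). The mapping table, the `remap` function, and the shape of a data point are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical mapping from OTel hostmetrics names to Elastic-style names.
# These pairs are illustrative assumptions, not the authoritative table.
REMAP: Dict[str, str] = {
    "system.cpu.logical.count": "system.cpu.cores",
    "system.cpu.load_average.1m": "system.load.1",
}


def remap(points: List[dict]) -> List[dict]:
    """Return the additional Elastic-style points derived from OTel points.

    Points without an entry in REMAP produce no extra output in this sketch.
    """
    extra = []
    for point in points:
        target = REMAP.get(point["name"])
        if target is None:
            continue
        extra.append({"name": target, "value": point["value"]})
    return extra


otel_points = [
    {"name": "system.cpu.logical.count", "value": 8},
    {"name": "some.unmapped.metric", "value": 1},
]
print(remap(otel_points))
```

The curated Kibana UIs query the Elastic-style names, which is why producing them from the OTel metrics matters.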

To be merged after elastic/apm-data#277

Checklist

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

To test all the changes, the best approach is to run the collector on K8s. This makes it possible to test the Kibana UI for both hostmetrics and K8s metrics.

  1. Run the OTel collector on K8s. An example manifest that could be used to run the collector is attached below. Note that, since we are testing the APM path, the collector's exporter should point to APM Server.
  2. Verify that the curated UI is working:
    1. The UI is located at Observability -> Infrastructure/{Inventory, Hosts}. Both of these UIs should show the host running the OTel collector.
    2. Given that the collector is running on K8s, the Inventory UI should also show the Pod names (by clicking on Show -> Kubernetes Pods).
    3. In the Hosts UI, some panels will not work:
      1. Network panel (see "Network remappers should also handle host.network.* metrics", opentelemetry-lib#14)
      2. Log Rate (should be empty)
      3. Host OS version and Host OS name (empty unless passed explicitly)
      4. Cloud Provider (may be empty if run locally)
  3. Also ensure that the Overview dashboard in Observability shows the hosts list.
Example K8s manifest for running OTel collector
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elastic-otel-collector-agent
  namespace: default
  labels:
    app.kubernetes.io/name: elastic-opentelemetry-collector
    app.kubernetes.io/version: "8.15.0-SNAPSHOT"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: elastic-otel-collector-agent
  labels:
    app.kubernetes.io/name: elastic-opentelemetry-collector
    app.kubernetes.io/version: "8.15.0-SNAPSHOT"
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces", "nodes"]
    verbs: ["get", "watch", "list"]
  - apiGroups: ["apps"]
    resources: ["daemonsets", "deployments", "replicasets", "statefulsets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["extensions"]
    resources: ["daemonsets", "deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [ "" ]
    resources: [ "nodes/stats" ]
    verbs: [ "get", "watch", "list" ]
  - apiGroups: [ "" ]
    resources: [ "nodes/proxy" ]
    verbs: [ "get" ]
  - apiGroups: [ "" ]
    resources: ["configmaps"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: elastic-otel-collector-agent
  labels:
    app.kubernetes.io/name: elastic-opentelemetry-collector
    app.kubernetes.io/version: "8.15.0-SNAPSHOT"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: elastic-otel-collector-agent
subjects:
  - kind: ServiceAccount
    name: elastic-otel-collector-agent
    namespace: default
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-conf
  labels:
    app: opentelemetry
    component: otel-collector-conf
data:
  otel-collector-config: |
    receivers:
      hostmetrics:
        collection_interval: 10s
        # root_path: /hostfs
        scrapers:
          cpu:
            metrics:
              system.cpu.utilization:
                enabled: true
              system.cpu.logical.count:
                enabled: true
          memory:
            metrics:
              system.memory.utilization:
                enabled: true
          process:
            mute_process_exe_error: true
            mute_process_io_error: true
            mute_process_user_error: true
            metrics:
              process.threads:
                enabled: true
              process.open_file_descriptors:
                enabled: true
              process.memory.utilization:
                enabled: true
              process.disk.operations:
                enabled: true
          network:
          processes:
          load:
          disk:
          filesystem:
            exclude_mount_points:
              mount_points:
                - /dev/*
                - /proc/*
                - /sys/*
                - /run/k3s/containerd/*
                - /var/lib/docker/*
                - /var/lib/kubelet/*
                - /snap/*
              match_type: regexp
            exclude_fs_types:
              fs_types:
                - autofs
                - binfmt_misc
                - bpf
                - cgroup2
                - configfs
                - debugfs
                - devpts
                - devtmpfs
                - fusectl
                - hugetlbfs
                - iso9660
                - mqueue
                - nsfs
                - overlay
                - proc
                - procfs
                - pstore
                - rpc_pipefs
                - securityfs
                - selinuxfs
                - squashfs
                - sysfs
                - tracefs
              match_type: strict
      kubeletstats:
        auth_type: serviceAccount
        collection_interval: 20s
        endpoint: ${env:K8S_NODE_NAME}:10250
        node: '${env:K8S_NODE_NAME}'
        # Required to work for all CSPs without an issue
        insecure_skip_verify: true
        k8s_api_config:
          auth_type: serviceAccount
        metric_groups:
          - node
          - pod
          - container
        metrics:
          k8s.pod.cpu.node.utilization:
            enabled: true
          k8s.container.cpu_limit_utilization:
            enabled: true
          k8s.pod.cpu_limit_utilization:
            enabled: true
          k8s.container.cpu_request_utilization:
            enabled: true
          k8s.container.memory_limit_utilization:
            enabled: true
          k8s.pod.memory_limit_utilization:
            enabled: true
          k8s.container.memory_request_utilization:
            enabled: true
          k8s.node.uptime:
            enabled: true
          k8s.node.cpu.usage:
            enabled: true
          k8s.pod.cpu.usage:
            enabled: true
        extra_metadata_labels:
          - container.id
    processors:
      resource/k8s:
        attributes:
          - key: service.name
            from_attribute: app.label.component
            action: insert
      resource/cloud:
        attributes:
          - key: cloud.instance.id
            from_attribute: host.id
            action: insert
      resourcedetection/system:
        detectors: ["system", "ec2"]
        system:
          hostname_sources: [ "os" ]
          resource_attributes:
            host.name:
              enabled: true
            host.id:
              enabled: false
            host.arch:
              enabled: true
            host.ip:
              enabled: true
            host.mac:
              enabled: true
            host.cpu.vendor.id:
              enabled: true
            host.cpu.family:
              enabled: true
            host.cpu.model.id:
              enabled: true
            host.cpu.model.name:
              enabled: true
            host.cpu.stepping:
              enabled: true
            host.cpu.cache.l2.size:
              enabled: true
            os.description:
              enabled: true
            os.type:
              enabled: true
      k8sattributes:
        filter:
          node_from_env_var: K8S_NODE_NAME
        passthrough: false
        pod_association:
          - sources:
              - from: resource_attribute
                name: k8s.pod.ip
          - sources:
              - from: resource_attribute
                name: k8s.pod.uid
          - sources:
              - from: connection
        extract:
          metadata:
            - "k8s.namespace.name"
            - "k8s.deployment.name"
            - "k8s.statefulset.name"
            - "k8s.daemonset.name"
            - "k8s.cronjob.name"
            - "k8s.job.name"
            - "k8s.node.name"
            - "k8s.pod.name"
            - "k8s.pod.uid"
            - "k8s.pod.start_time"
          labels:
            - tag_name: app.label.component
              key: app.kubernetes.io/component
              from: pod
      batch:
      memory_limiter:
        # 80% of maximum memory up to 2G
        limit_mib: 1500
        # 25% of limit up to 2G
        spike_limit_mib: 512
        check_interval: 5s
    exporters:
      otlphttp:
        endpoint: <APM_endpoint>
        tls:
          insecure: false
        headers:
          Authorization: Bearer <APM_secret_token>
      # This should be replaced by the local apm-server
      # to allow the max flexibility in testing
      #otlphttp:
      #  endpoint: "http://apm.server.local:8200"
      #  tls:
      #    insecure: true
      debug:
        verbosity: basic
    service:
      pipelines:
        metrics:
          exporters:
          - debug
          - otlphttp
          processors:
          - k8sattributes
          - resourcedetection/system
          - resource/k8s
          - resource/cloud
          receivers:
          - kubeletstats
          - hostmetrics
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  labels:
    app: opentelemetry
    component: otel-collector
spec:
  ports:
  - name: otlp-grpc # Default endpoint for OpenTelemetry gRPC receiver.
    port: 4317
    protocol: TCP
    targetPort: 4317
  - name: otlp-http # Default endpoint for OpenTelemetry HTTP receiver.
    port: 4318
    protocol: TCP
    targetPort: 4318
  - name: metrics # Default endpoint for querying metrics.
    port: 8888
  selector:
    component: otel-collector
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  labels:
    app: opentelemetry
    component: otel-collector
spec:
  selector:
    matchLabels:
      app: opentelemetry
      component: otel-collector
  minReadySeconds: 5
  progressDeadlineSeconds: 120
  replicas: 1
  template:
    metadata:
      labels:
        app: opentelemetry
        component: otel-collector
    spec:
      serviceAccountName: elastic-otel-collector-agent
      securityContext:
        runAsUser: 0
        runAsGroup: 0
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      # Enable for sending data to locally running APM-Server
      # hostAliases:
      # - ip: "192.168.0.44"
      #   hostnames:
      #   - "apm.server.local"
      containers:
      - image: otel/opentelemetry-collector-contrib:0.104.0
        name: otel-collector
        resources:
          limits:
            cpu: 1
            memory: 2Gi
          requests:
            cpu: 200m
            memory: 400Mi
        ports:
        - containerPort: 55679 # Default endpoint for ZPages.
        - containerPort: 4317 # Default endpoint for OpenTelemetry receiver.
        - containerPort: 14250 # Default endpoint for Jaeger gRPC receiver.
        - containerPort: 14268 # Default endpoint for Jaeger HTTP receiver.
        - containerPort: 9411 # Default endpoint for Zipkin receiver.
        - containerPort: 8888  # Default endpoint for querying metrics.
        env:
          - name: MY_POD_IP
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: status.podIP
          - name: GOMEMLIMIT
            value: 1600MiB
          - name: K8S_NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
        volumeMounts:
        - name: otel-collector-config-vol
          mountPath: /etc/otelcol-contrib
      volumes:
        - configMap:
            name: otel-collector-conf
            items:
              - key: otel-collector-config
                path: config.yaml
          name: otel-collector-config-vol
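
As a rough check of the memory settings in the manifest above, the sketch below works through the arithmetic. It relies on the memory_limiter processor's documented behaviour that the soft limit is `limit_mib` minus `spike_limit_mib`; the helper names (`to_mib`, `memory_budget`) are made up for illustration. Note also that, as written, `memory_limiter` and `batch` are defined under `processors` but are not listed in the metrics pipeline, so they would only take effect once added there.

```python
def to_mib(quantity: str) -> int:
    """Convert a Kubernetes-style quantity ('2Gi', '400Mi') to MiB."""
    for suffix, factor in (("Gi", 1024), ("Mi", 1)):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported quantity: {quantity}")


def memory_budget(container_limit: str, limit_mib: int, spike_limit_mib: int) -> dict:
    """Summarise how the memory_limiter settings relate to the container limit."""
    total = to_mib(container_limit)
    return {
        "container_limit_mib": total,
        "hard_limit_mib": limit_mib,
        # Soft limit: the point at which the processor starts refusing data.
        "soft_limit_mib": limit_mib - spike_limit_mib,
        "hard_limit_share": round(limit_mib / total, 2),
        "spike_share_of_container": round(spike_limit_mib / total, 2),
    }


# Values taken from the manifest: limits.memory, limit_mib, spike_limit_mib.
budget = memory_budget("2Gi", limit_mib=1500, spike_limit_mib=512)
print(budget)
```

With the manifest's values this gives a soft limit of 988 MiB, a hard limit at roughly 73% of the 2Gi container limit, and a spike allowance of 25% of it. The `GOMEMLIMIT` of 1600MiB sits between the hard limit and the container limit.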

Related issues

@lahsivjar lahsivjar requested a review from a team as a code owner May 20, 2024 21:48
@lahsivjar lahsivjar marked this pull request as draft May 20, 2024 21:49
Contributor

mergify bot commented May 20, 2024

This pull request does not have a backport label. Could you fix it @lahsivjar? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.17 is the label to automatically backport to the 7.17 branch.
  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label May 20, 2024
@lahsivjar lahsivjar force-pushed the hostmetrics-poc branch 2 times, most recently from 34bce97 to 0cef088 Compare June 5, 2024 09:31
@lahsivjar lahsivjar changed the title Hostmetrics poc Update apm-data and remap for OTel hostmetrics to elastic metrics Jun 5, 2024
@lahsivjar lahsivjar marked this pull request as ready for review June 5, 2024 09:33
@@ -8,6 +8,7 @@ https://github.com/elastic/apm-server/compare/8.14\...main[View commits]

- Avoid data race due to reuse of `bytes.Buffer` in ES bulk requests {pull}13155[13155]
- APM Server now relies on the Elasticsearch apm-data plugin's index templates, which reverts some unsafe uses of `flattened` field types {pull}12066[12066]
- Add `error.id` to jaeger errors {pull}13196[13196]
Contributor Author

CC: @kruskall this is due to the new apm-data version update.

@lahsivjar lahsivjar enabled auto-merge (squash) June 5, 2024 09:44
kruskall previously approved these changes Jun 5, 2024
Contributor

mergify bot commented Jun 5, 2024

This pull request is now in conflicts. Could you fix it @lahsivjar? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b hostmetrics-poc upstream/hostmetrics-poc
git merge upstream/main
git push upstream hostmetrics-poc

@@ -10,7 +10,7 @@
"opentelemetry/go"
],
"agent.version": [
"1.25.0"
Contributor Author

[For reviewers] This change is due to the upgrade of the OTel agent used in the system test from v1.25.0 to v1.27.0.

@@ -1,4 +1,102 @@
[
{
"@timestamp": [
"dynamic"
Contributor Author

[For reviewers] This change is because PR elastic/apm-tools#61 now sorts the docs after handling dynamic fields.
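
For readers unfamiliar with why the ordering of the approved documents changes, the following is a small sketch of the general idea, not the apm-tools code itself. It assumes documents are flat dicts of field name to list of values (as in the snippet above) and that dynamic fields are replaced by the placeholder `["dynamic"]`; the function name and the field set are hypothetical.

```python
import json
from typing import Iterable, List, Set


def normalize(docs: Iterable[dict], dynamic_fields: Set[str]) -> List[dict]:
    """Replace dynamic field values with a placeholder, then sort the docs.

    Sorting after the replacement means the order no longer depends on
    values (such as timestamps) that differ from run to run.
    """
    replaced = [
        {k: (["dynamic"] if k in dynamic_fields else v) for k, v in doc.items()}
        for doc in docs
    ]
    # Canonical JSON gives a stable, content-based sort key.
    return sorted(replaced, key=lambda d: json.dumps(d, sort_keys=True))


docs = [
    {"@timestamp": ["2024-06-05T09:31:00Z"], "agent.version": ["1.27.0"]},
    {"@timestamp": ["2024-06-05T09:30:00Z"], "agent.version": ["1.25.0"]},
]
print(normalize(docs, {"@timestamp"}))
```

Because the sort key is computed after the placeholder substitution, feeding the same documents in a different order yields the same output, which is what makes the approved files stable.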

@lahsivjar lahsivjar requested review from kruskall and a team June 7, 2024 14:23
- Upgraded bundled APM Java agent attacher CLI to version 1.50.0 {pull}13326[13326]
- Enable Kibana curated UIs to work with hostmetrics from OpenTelemetry's [hostmetricsreceiver](https://pkg.go.dev/go.opentelemetry.io/collector/receiver/hostmetricsreceiver) {pull}13196[13196]
Member

Not sure if this link works in asciidoc

Contributor Author

Good catch, updated!

Contributor Author

The update is based on:

Please use the https://www.docker.elastic.co/r/apm[Elastic docker registry] to download the 8.5.0 APM Server image.

@lahsivjar lahsivjar disabled auto-merge June 7, 2024 14:41
@lahsivjar lahsivjar enabled auto-merge (squash) June 7, 2024 14:41
Member

@carsonip carsonip left a comment

lgtm, can you fill out the "How to test these changes" section?

@lahsivjar lahsivjar merged commit 1fdef04 into elastic:main Jun 7, 2024
14 checks passed
@lahsivjar lahsivjar deleted the hostmetrics-poc branch June 7, 2024 16:00
v1v added a commit that referenced this pull request Jun 9, 2024
* upstream/main:
  chore: Update .go-version with Golang version 1.22.4 (#13367)
  build(deps): bump github.com/jaegertracing/jaeger from 1.56.0 to 1.57.0 in /systemtest (#13316)
  [updatecli] Bump elastic stack version to 8.15.0-725cdb43 (#13363)
  feat: add wolfi based image (#12671)
  Add Amazon Linux 2023 to the smoke tests (#13358)
  Update apm-data and remap for OTel hostmetrics to elastic metrics (#13196)
  build(deps): bump github.com/elastic/go-elasticsearch/v8 from 8.13.1 to 8.14.0 (#13356)
@inge4pres inge4pres mentioned this pull request Jul 5, 2024
9 tasks
@carsonip
Member

Testing

✔️ test-plan-ok

Tested with otel collector running in k8s alongside opentelemetry-demo

Most graphs in Infrastructure.{Inventory, Hosts} work, with the exception of the network graph. (Two screenshots attached.)

Labels: backport-skip, test-plan, test-plan-ok, v8.15.0