datadog_metrics sink's traffic flows to our VPC endpoint but not to Datadog #21867

Open
nzxwang opened this issue Nov 21, 2024 · 6 comments
Labels
type: bug A code related bug.

Comments

nzxwang commented Nov 21, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

I would like to configure Vector to route traffic to Datadog’s PrivateLink offering in us-east-1 from other regions using VPC peering. I've configured datadog_metrics with endpoint = "https://metrics.agent.datadoghq.com/api/v1/series" and verified that my test metrics are flowing through my pipeline, both in the trace logs (as per previous advice in #21325 (comment)) and by running vector tap against the pipeline. Moreover, my VPC endpoint is correctly reporting traffic for my test metrics:
Screenshot 2024-11-21 at 3 03 32 PM

However, the test metrics are either being dropped by Datadog's VPC Endpoint Service or by Datadog itself.

Configuration

[sinks.datadog_metrics_sink_4]
type = "datadog_metrics"
inputs = ["remap_fanout_del_4"]
default_api_key = "${DD_API_KEY}"
batch.timeout_secs = 5
batch.max_events = 2000
buffer.max_events = 200000
request.concurrency = "adaptive"
endpoint = "https://metrics.agent.datadoghq.com/api/v1/series"

Version

vector 0.39.0 (x86_64-unknown-linux-musl 73da9bb 2024-06-17 16:00:23.791735272)

Debug Output

2024-11-21T22:40:30.948719Z DEBUG sink{component_kind="sink" component_id=datadog_metrics_sink_4 component_type=datadog_metrics}:request{request_id=1}:http: vector::internal_events::http_client: Sending HTTP request. uri=https://metrics.agent.datadoghq.com/api/v1/series/api/v1/series method=POST version=HTTP/1.1 headers={"dd-api-key": "REDACTED", "dd-agent-payload": "4.87.0", "content-type": "application/json", "content-encoding": "deflate", "user-agent": "Vector/0.39.0 (x86_64-unknown-linux-musl 73da9bb 2024-06-17 16:00:23.791735272)", "accept-encoding": "identity"} body=[245 bytes]

Example Data

No response

Additional Context

No response

References

No response

nzxwang added the "type: bug" label Nov 21, 2024
Author

nzxwang commented Nov 21, 2024

I also opened a ticket with Datadog support (id: 1937927) as I'm unsure whether this issue lies in Vector or in Datadog. Apologies in advance if Vector is working correctly.

Member

pront commented Nov 22, 2024

Hi @nzxwang, thank you for creating this issue.

Did you by any chance inspect other sink metrics e.g. https://vector.dev/docs/reference/configuration/sinks/datadog_metrics/#component_sent_events_total? I would also like to confirm that the response codes on the Vector service are OK.

Author

nzxwang commented Nov 22, 2024

Hi @pront thank you for the quick response! I should have thought to check the other sink metrics like component_sent_events_total earlier but I just assumed that it worked since there was traffic over our endpoint. I'm currently observing that our datadog_metrics sinks are receiving events but not sending them, despite there being traffic over our vpc endpoint (please don't mind the multiple sinks; #21373 (reply in thread)):
Screenshot 2024-11-22 at 10 19 43 AM

Do you have any insight into how this is possible?

Member

pront commented Nov 27, 2024

Another metric to check is https://vector.dev/docs/reference/configuration/sinks/datadog_metrics/#component_discarded_events_total.

> I'm currently observing that our datadog_metrics sinks are receiving events but not sending them, despite there being traffic over our vpc endpoint

When an event is received by the sink, the sink's service prepares a request. That request might fail, in which case component_sent_events_total will not be incremented. Do you see any errors that indicate failed requests?
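
One way to compare the received and sent counters side by side is to read them from Vector's internal metrics. The following is only a sketch: it assumes an internal_metrics source is wired to a prometheus_exporter sink (neither appears in the configuration posted above), that the exporter uses its default vector_ metric prefix, and that it listens on localhost:9598. sum_metric is a hypothetical helper name.

```shell
#!/bin/sh
# sum_metric METRIC COMPONENT_ID
# Hypothetical helper: reads Prometheus text-format samples on stdin and
# sums the values of METRIC for the given component_id label.
sum_metric() {
  awk -v m="$1" -v id="component_id=\"$2\"" '
    # Keep only samples of the requested metric that carry the label.
    index($0, m "{") == 1 && index($0, id) {
      # The value is the first field after the closing brace of the
      # label set; an optional timestamp may follow it.
      rest = substr($0, index($0, "} ") + 2)
      split(rest, v, " ")
      total += v[1]
    }
    END { print total + 0 }
  '
}

# Usage (the address is an assumption, not taken from the issue):
#   curl -s http://localhost:9598/metrics |
#     sum_metric vector_component_sent_events_total datadog_metrics_sink_4
```

If the received counter keeps growing while the sent counter stays at zero, that would be consistent with requests being prepared but never succeeding.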

Author

nzxwang commented Dec 3, 2024

Hi again @pront and thank you for your patience in debugging this with me. I tried again and am seeing errors indicating failed requests like:

component_id=datadog_metrics_sink_2 component_type=datadog_metrics}:request{request_id=1}: vector::sinks::util::retries: Retrying after error. error=Client request was invalid. internal_log_rate_limit=true

The metrics emitted by the datadog_metrics sinks are the same as before, and component_discarded_events_total is zero as well:
Screenshot 2024-12-03 at 12 15 36 PM

FWIW, I also tried to reproduce what we're observing by simply curling https://metrics.agent.datadoghq.com/api/v1/series. However, the metrics made it to datadog and the traffic over our VPC matched as well. This was my script:

count=0
while true; do
  count=$((count + 1))
  curl -X POST "https://metrics.agent.datadoghq.com/api/v1/series" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: REDACTED" \
  -d "{
    \"series\": [
      {
        \"metric\": \"test\",
        \"points\": [
          [$(date +%s), 3300]
        ],
        \"type\": \"gauge\",
        \"tags\": [\"location:globalqa.aws-usw2-dev.niwang\", \"service:loop.vector-collector-curl\"]
      }
    ]
  }"

  echo "Curl command executed $count times"
done

That being the case, could the issue be in how datadog_metrics is sending its requests to https://metrics.agent.datadoghq.com/api/v1/series? Do you know of other customers sending metrics via the datadog_metrics sink over Datadog's AWS Privatelink?

Member

pront commented Dec 6, 2024

Looking at the error log and the diagrams, I think this confirms that the sinks receive events but the service requests are failing. The requests are retried, which is why there are no discarded events.

> However, the metrics made it to datadog and the traffic over our VPC matched as well.

It is possible that this handmade request is different than the request Vector generates. You might be able to see metrics ingestion errors in your DD org.
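
One concrete difference is visible in the debug output in the original report: the logged uri ends in /api/v1/series/api/v1/series, whereas the curl script posts to a URL that contains the path only once. The sketch below reproduces that doubling under the assumption, inferred from the log line and not checked against Vector's source, that the sink appends its own API path to whatever endpoint is configured. build_uri is a hypothetical helper, not a Vector function.

```shell
#!/bin/sh
# Hypothetical helper: mimics a client that appends a fixed API path to
# the configured base endpoint (an assumption about the sink's behaviour).
build_uri() {
  base=${1%/}                      # drop one trailing slash, if any
  printf '%s%s\n' "$base" "/api/v1/series"
}

# Endpoint as configured in the issue: the path ends up doubled,
# matching the uri shown in the debug log.
build_uri "https://metrics.agent.datadoghq.com/api/v1/series"

# Endpoint given as scheme and host only: the path appears once,
# matching the URL used by the curl script.
build_uri "https://metrics.agent.datadoghq.com"
```

If that assumption holds, setting endpoint to the host alone may be worth trying before looking further into PrivateLink.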

Do you have VECTOR_LOG=debug on? That might reveal more errors.

> Do you know of other customers sending metrics via the datadog_metrics sink over Datadog's AWS Privatelink?

We do not have visibility into that since the Vector team does not offer support for on-premises deployments.
