Using multiple concurrent readers makes async metrics produce wrong results #4741
Can you please provide exact code snippets to reproduce the issue, and use them to further describe the actual and expected behavior?
I can provide a full example later. Here is the current code of the project.
I provided more info, please see the issue.
@pellared for me it looks like that the prometheus exporter was not designed to work with any other reader because it will lead to multiple callback executions; one for each export cycle 15s + for every prometheus scrape. Not sure, how this is supposed to work. |
@StarpTech This is correct. The Prometheus exporter is a metric reader itself. @dashpole I think that if we did not embed the reader (like I proposed in #4313) then we could avoid such issues and confusion.
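For illustration, a minimal sketch (not from this thread) showing that the Prometheus exporter satisfies the SDK's Reader interface, which is why passing it to WithReader registers it as an additional reader alongside any PeriodicReader:

```go
package main

import (
	"log"

	"go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	exp, err := prometheus.New()
	if err != nil {
		log.Fatal(err)
	}
	// The exporter embeds a metric Reader, so it can be registered directly.
	var _ sdkmetric.Reader = exp
	_ = sdkmetric.NewMeterProvider(sdkmetric.WithReader(exp))
}
```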
@pellared could you be a bit more specific? I did not modify the reader. I just use a MeterProvider with an OTEL exporter + Prometheus Exporter.
Then everything should work fine. Please create a Minimal, Reproducible Example (it would be a lot easier for all of us).
I find this strange.
Will do! Just for better understanding: is it correct that a manual reader also invokes the registered asynchronous callbacks?
AFAIK this line
will invoke the registered asynchronous callbacks.
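As an illustration (a minimal sketch, not taken from the issue, assuming the current sdk/metric API), calling Collect on a ManualReader triggers the registered asynchronous callbacks once per call:

```go
package main

import (
	"context"
	"fmt"

	otelmetric "go.opentelemetry.io/otel/metric"
	"go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

func main() {
	ctx := context.Background()
	reader := metric.NewManualReader()
	provider := metric.NewMeterProvider(metric.WithReader(reader))
	meter := provider.Meter("example")

	counter, _ := meter.Int64ObservableCounter("calls")
	_, _ = meter.RegisterCallback(func(ctx context.Context, o otelmetric.Observer) error {
		fmt.Println("callback invoked") // runs on every Collect
		o.ObserveInt64(counter, 1)
		return nil
	}, counter)

	var rm metricdata.ResourceMetrics
	_ = reader.Collect(ctx, &rm) // prints "callback invoked"
	_ = reader.Collect(ctx, &rm) // prints it again
}
```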
Here is a reproducible example. The values are fine as long as you don't call the Prometheus metrics endpoint.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel/exporters/prometheus"
	"go.opentelemetry.io/otel/exporters/stdout/stdoutmetric"
	otelmetric "go.opentelemetry.io/otel/metric"
	"go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

const meterName = "github.com/open-telemetry/opentelemetry-go/example/prometheus"

func main() {
	ctx := context.Background()

	r, err := resource.New(ctx,
		resource.WithAttributes(semconv.ServiceNameKey.String("my-service")),
		resource.WithProcessPID(),
		resource.WithTelemetrySDK(),
		resource.WithHost(),
	)
	if err != nil {
		log.Fatal(err)
	}

	promExporter, err := prometheus.New()
	if err != nil {
		log.Fatal(err)
	}

	exporter, err := stdoutmetric.New(stdoutmetric.WithPrettyPrint())
	if err != nil {
		log.Fatal(err)
	}

	provider := metric.NewMeterProvider(
		metric.WithReader(promExporter),
		metric.WithResource(r),
		metric.WithReader(metric.NewPeriodicReader(exporter,
			metric.WithTimeout(time.Second*10),
			metric.WithInterval(time.Second*15),
		)),
	)
	meter := provider.Meter(meterName)

	// Start the prometheus HTTP server and pass the exporter Collector to it
	go serveMetrics()

	uptime, err := meter.Float64ObservableCounter(
		"uptime",
		otelmetric.WithDescription("The duration since the application started."),
		otelmetric.WithUnit("s"),
	)
	if err != nil {
		log.Fatal(err)
	}

	a := time.Now()
	_, err = meter.RegisterCallback(
		func(ctx context.Context, o otelmetric.Observer) error {
			elapsed := time.Since(a)
			fmt.Println("elapsedSeconds", elapsed.Seconds())
			o.ObserveFloat64(uptime, elapsed.Seconds())
			return nil
		},
		uptime,
	)
	if err != nil {
		log.Fatal(err)
	}

	<-ctx.Done()
}

func serveMetrics() {
	log.Printf("serving metrics at localhost:2223/metrics")
	http.Handle("/metrics", promhttp.Handler())
	err := http.ListenAndServe(":2223", nil) //nolint:gosec // Ignoring G114: Use of net/http serve function that has no support for setting timeouts.
	if err != nil {
		fmt.Printf("error serving http: %v", err)
		return
	}
}
```

After visiting the Prometheus metrics endpoint, the reported values are wrong.
It looks like a bug. I do not remember the code, but it could even be possible that asynchronous instruments (metrics) are not correctly processed when there are multiple concurrent readers. Have you checked if it works fine if you remove the Prometheus reader?
Yes, it works fine as long as I don't have multiple readers.
If this is the case, it is a huge bug.
Can we ping anyone from the core team who can shed light on this?
I'll investigate.
Reporting progress so far. I've figured out that when Collect is invoked on a single reader, measure is invoked for all readers. But then, we only call the aggregation function (e.g. precomputedSum.delta()) for that reader's aggregation. This means that for each reader, the callback increments the value for all readers, but only a single reader's storage is measured and reset. Here are a bunch of print statements to illustrate:
0xc000118b60 is the address of one of the sums' valueMap, and 0xc000118b70 is the other. You can see 2 … The scenario above is for a callback that always returns 50. So you can see that when the second reader reads the value, it has already been incremented to 50 by the previous reader, and is (incorrectly) incremented again to 100. @MrAlias or @MadVikingGod may have a better intuition for how this could happen.
Measures are functions that can be called to record a new value. There is one Measure for each instrument, for each Reader. This is correct for synchronous instruments, because if I increment a counter in my code, it should be incremented for all Readers. But with asynchronous instruments, callbacks are invoked once for each Reader, so incrementing for all readers is not correct.
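To make the described behavior concrete, here is a hypothetical reproduction sketch (not part of the original report) using two ManualReaders and a callback that always observes 50; on the affected SDK version, the second reader to collect would see the value already incremented by the first reader's collection:

```go
package main

import (
	"context"
	"fmt"

	otelmetric "go.opentelemetry.io/otel/metric"
	"go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

func main() {
	ctx := context.Background()
	r1 := metric.NewManualReader()
	r2 := metric.NewManualReader()
	provider := metric.NewMeterProvider(metric.WithReader(r1), metric.WithReader(r2))
	meter := provider.Meter("example")

	obs, _ := meter.Float64ObservableCounter("fixed")
	_, _ = meter.RegisterCallback(func(ctx context.Context, o otelmetric.Observer) error {
		o.ObserveFloat64(obs, 50) // always report 50
		return nil
	}, obs)

	var rm1, rm2 metricdata.ResourceMetrics
	_ = r1.Collect(ctx, &rm1) // expected: 50
	_ = r2.Collect(ctx, &rm2) // expected: 50; on affected versions this read 100
	fmt.Printf("%+v\n%+v\n", rm1, rm2)
}
```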
@dashpole Great findings. I came to the same conclusion. So every reader should maintain its own state of aggregators for asynchronous instruments? If this is confirmed, it is a major bug.
Fix: #4742
Environment

Steps To Reproduce

Hi, I'm using the go.opentelemetry.io/otel/exporters/prometheus package to implement a custom Prometheus exporter. I successfully use the OTEL SDK to export synchronous instruments like counters and histograms. Today, I started to add a Float64ObservableCounter to measure the uptime of the server, but this has led to invalid values being reported to the collector. I use the ClickHouse exporter in the otelcollector. Here is the code of the uptime counter.

It seems that every time Prometheus scrapes the server, gather() (https://github.com/wundergraph/cosmo/blob/main/router/internal/metric/prom_registry.go#L26) is called, and therefore also the callback of the Float64ObservableCounter. This somehow results in weird values: 27.59, 51.28, 117.59, 111.28

Correct would be 14.97, 29.97, 44.98, 59.97 because my export interval is 15s. I get those values after removing the sdkmetric.WithReader(promExp) from the MeterProvider.

Expected behavior

I'd expect that the readers do not influence each other.