bind_boot_time_seconds appears to be shifting strangely #81

Open
rootwyrm opened this issue Jul 14, 2020 · 6 comments

Comments

rootwyrm commented Jul 14, 2020

image

The problem appears to be twofold: the dashboard uses max(node_time_seconds{instance=~"$node:.*"}), which does not seem to work in Prometheus 2.19, and Bind 9.16 may have changed how it reports uptime. Changing the queries to time() - max(bind_boot_time_seconds{instance=~"$node:.*"}) produces sensible-looking results, but they are actually still off by an order of magnitude.

For example, with a Bind 9.16 reported boot time of 2020-07-14T21:10:48.999Z and a current time of 2020-07-14T22:11:56.299Z, the query above reports an incorrect uptime of 5.8 hours. I thought I had found the order-of-magnitude error, but the value still was not updating correctly. That is when I noticed that bind_boot_time_seconds itself was moving: it went from 1594766106 to 1594740334, which is definitely not correct.

The corresponding values in Bind's own XML and JSON statistics do not change.
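
To make the numbers above easier to check, here is a minimal Go sketch that computes the uptime implied by the two reported timestamps and renders the two observed bind_boot_time_seconds values in UTC. The helper names (uptime, unixToUTC) are made up for this illustration and are not part of the exporter.

```go
package main

import (
	"fmt"
	"time"
)

// uptime returns the time elapsed between two RFC 3339 timestamps.
func uptime(boot, now string) (time.Duration, error) {
	b, err := time.Parse(time.RFC3339, boot)
	if err != nil {
		return 0, err
	}
	n, err := time.Parse(time.RFC3339, now)
	if err != nil {
		return 0, err
	}
	return n.Sub(b), nil
}

// unixToUTC renders a Unix timestamp (seconds) as an RFC 3339 string in UTC.
func unixToUTC(secs int64) string {
	return time.Unix(secs, 0).UTC().Format(time.RFC3339)
}

func main() {
	d, err := uptime("2020-07-14T21:10:48.999Z", "2020-07-14T22:11:56.299Z")
	if err != nil {
		panic(err)
	}
	fmt.Printf("uptime from timestamps: %.1f s (%.2f h)\n", d.Seconds(), d.Hours())

	// The two values observed for bind_boot_time_seconds in this report.
	for _, v := range []int64{1594766106, 1594740334} {
		fmt.Println(v, "=", unixToUTC(v))
	}
}
```

The timestamps give an uptime of about 1.02 hours rather than 5.8, and neither observed value equals 1594761048, the Unix time corresponding to the reported boot time.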

rootwyrm changed the title from "Associated Grafana dashboard does not display uptimes correctly" to "bind_boot_time_seconds appears to be shifting strangely" on Jul 14, 2020

SuperQ (Contributor) commented Jul 15, 2020

The exporter only takes what Bind reports in the boot-time XML field. It's a reasonably simple XML parse into a Go time.Time.

Without examining the raw metric data, it's hard to say what's going on.
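
For readers unfamiliar with that path, a minimal sketch of such a parse follows. It is not the exporter's actual code: the struct, the function name bootTimeSeconds, and the assumed element layout (a boot-time element inside server) are illustrative assumptions.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"time"
)

// stats mirrors only the part of the document needed here.
// The element layout is an assumption, not the exporter's real type.
type stats struct {
	Server struct {
		BootTime string `xml:"boot-time"`
	} `xml:"server"`
}

// bootTimeSeconds extracts the boot-time string, parses it as RFC 3339,
// and returns it as Unix seconds, the unit the metric name implies.
func bootTimeSeconds(doc []byte) (float64, error) {
	var s stats
	if err := xml.Unmarshal(doc, &s); err != nil {
		return 0, err
	}
	t, err := time.Parse(time.RFC3339, s.Server.BootTime)
	if err != nil {
		return 0, err
	}
	return float64(t.Unix()), nil
}

func main() {
	doc := []byte(`<statistics><server><boot-time>2020-07-14T21:10:48.999Z</boot-time></server></statistics>`)
	secs, err := bootTimeSeconds(doc)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.0f\n", secs)
}
```

A parse along these lines is deterministic for a fixed input string, which is why the raw data matters for narrowing this down.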

rootwyrm (Author) commented

That's what I'm saying: somehow it's mangling the raw metric. I am positive of this; I checked the raw data. The value in the XML is correct and, more importantly, does not change. Yet the exporter is changing the value from the XML in strange ways, and this is reflected in the raw data from the exporter.

SuperQ (Contributor) commented Jul 15, 2020

The only thing we could do here is to build a version that logs the raw XML data to see what's returned. Without some concrete proof that the exporter is doing something wrong, there's nothing we can do.
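
One shape such debug logging could take is sketched below: record the raw boot-time string next to the value about to be exported, and flag any scrape where the exported value moved although the raw string did not. The watch type and its observe method are hypothetical and not part of the exporter. Pairing the two values from the original report with a single raw string is likewise an assumption made for illustration.

```go
package main

import "fmt"

// watch remembers the previous scrape so a drifting value can be reported.
// Hypothetical helper, not part of the exporter.
type watch struct {
	seen bool
	raw  string
	unix int64
}

// observe records one scrape and reports whether the exported value moved
// even though the raw boot-time string stayed the same.
func (w *watch) observe(raw string, unix int64) bool {
	drift := w.seen && raw == w.raw && unix != w.unix
	w.seen, w.raw, w.unix = true, raw, unix
	return drift
}

func main() {
	var w watch
	// Raw string and exported values taken from the report; pairing them
	// like this is an illustrative assumption.
	const raw = "2020-07-14T21:10:48.999Z"
	for _, v := range []int64{1594766106, 1594740334} {
		if w.observe(raw, v) {
			fmt.Printf("exported value moved to %d while raw boot-time stayed %q\n", v, raw)
		}
	}
}
```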

rootwyrm (Author) commented

I definitely agree; this is going to need some XML dumping. However, I don't see any way to do that, and I have zero experience working in Go, so frankly I suspect my attempt would mangle the output.

It is also probably safe to drop the xml.v2 channel completely, as that was fully discontinued with 9.10 (which went fully EOL in 2018). Dropping it might make debugging easier as well.

SuperQ (Contributor) commented Jul 16, 2020

That reminds me, we should add support for the new JSON format.

#82
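
As a rough idea of what reading the boot time from the JSON statistics might look like, here is a small sketch. The top-level key name "boot-time" and the jsonStats and bootTimeFromJSON names are assumptions for illustration; the real layout should be checked against actual Bind output.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// jsonStats picks out a single field. The key name is an assumption
// about the JSON statistics layout.
type jsonStats struct {
	BootTime time.Time `json:"boot-time"`
}

// bootTimeFromJSON decodes the document and returns the boot time
// as Unix seconds. time.Time decodes RFC 3339 strings natively.
func bootTimeFromJSON(doc []byte) (int64, error) {
	var s jsonStats
	if err := json.Unmarshal(doc, &s); err != nil {
		return 0, err
	}
	return s.BootTime.Unix(), nil
}

func main() {
	doc := []byte(`{"boot-time":"2020-07-14T21:10:48.999Z"}`)
	secs, err := bootTimeFromJSON(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(secs)
}
```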

dswarbrick (Member) commented

This is intriguing. Whilst working on a very trivial patch to eliminate ioutil.ReadAll when unmarshalling the XML, I hit a test failure which I haven't yet resolved. However, in the failed test output, there are bind_boot_time_seconds timestamps that are slightly shifted vs. what the tests expect. I cannot find anything in the test fixtures to explain this.
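
For context, the kind of change described (decoding straight from the reader instead of buffering the body first) can be sketched as follows, along with a check that both paths yield the same boot-time string. This is not the patch itself: the stats struct, the function names, and the sample document are assumptions, and io.ReadAll stands in for ioutil.ReadAll.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// stats keeps only the boot-time string; the element layout is an assumption.
type stats struct {
	Server struct {
		BootTime string `xml:"boot-time"`
	} `xml:"server"`
}

// viaReadAll buffers the whole body, then unmarshals it.
func viaReadAll(r io.Reader) (string, error) {
	body, err := io.ReadAll(r)
	if err != nil {
		return "", err
	}
	var s stats
	err = xml.Unmarshal(body, &s)
	return s.Server.BootTime, err
}

// viaDecoder decodes directly from the reader without an intermediate buffer.
func viaDecoder(r io.Reader) (string, error) {
	var s stats
	err := xml.NewDecoder(r).Decode(&s)
	return s.Server.BootTime, err
}

// compare runs both paths on the same document and returns both results.
func compare(doc string) (buffered, streamed string, err error) {
	if buffered, err = viaReadAll(strings.NewReader(doc)); err != nil {
		return
	}
	streamed, err = viaDecoder(strings.NewReader(doc))
	return
}

// sample is a made-up minimal document for illustration.
const sample = `<statistics><server><boot-time>2020-07-14T21:10:48.999Z</boot-time></server></statistics>`

func main() {
	a, b, err := compare(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("same result:", a == b)
}
```

If the two paths agree on the fixtures, the shift would have to be introduced after the XML is decoded, for instance when the string is converted to a timestamp.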
