Add DTrace scripts to Nexus zone #7244

Merged
merged 6 commits into from
Dec 14, 2024

Conversation

bnaecker
Collaborator

  • Adds a script for tracing all transactions run by Nexus, including their overall latency and the number of statements in each one.
  • Moves all existing scripts to a Nexus subdirectory and includes that whole directory in the Omicron zone for Nexus itself.
  • Closes #7224 (Would like D scripts for tracing transactions in Nexus).

@bnaecker bnaecker requested a review from smklein December 12, 2024 23:05
@bnaecker
Collaborator Author

For a quick confirmation that these scripts are actually in the Nexus zone tarball, I built everything using omicron-package package, and then:

bnaecker@shale : ~/omicron $ tar -tf out/omicron-nexus.tar.gz | grep dtrace
root/var/nexus/dtrace
root/var/nexus/dtrace/aggregate-query-latency.d
root/var/nexus/dtrace/slowest-queries.d
root/var/nexus/dtrace/trace-db-queries.d
root/var/nexus/dtrace/trace-transactions.d
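
The same check can be scripted. The following is a minimal Python sketch that uses only the standard-library `tarfile` module; the function name `list_dtrace_scripts` and its `marker` parameter are hypothetical and not part of Omicron or omicron-package.

```python
import tarfile


def list_dtrace_scripts(tarball_path, marker="dtrace"):
    """Return the sorted paths of .d scripts whose path contains `marker`.

    Directory entries such as "root/var/nexus/dtrace" are skipped because
    they do not end in ".d".
    """
    with tarfile.open(tarball_path, "r:gz") as archive:
        return sorted(
            name
            for name in archive.getnames()
            if marker in name and name.endswith(".d")
        )
```

Calling `list_dtrace_scripts("out/omicron-nexus.tar.gz")` on the tarball built above should list the four scripts shown, assuming the archive is gzip-compressed as its extension suggests.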

Most of these scripts have been around for a long time; I'm just including them in the zone image now. I've also added one new script, trace-transactions.d, and tested it on Dogfood by copying it there manually and running it. Here's the kind of output we get:

root@oxz_nexus_65a11c18:~# ./foo.d -p $(pgrep nexus)
Tracing all database transactions for nexus PID 11753, use Ctrl-C to exit
COMMIT 11 statements on connection 0eda0d4b-1858-4933-948b-73d7923c6379 (13806 us)
COMMIT 11 statements on connection a2c35315-8aef-4533-8b8f-d0a22b0c25f7 (16038 us)
COMMIT 13 statements on connection 0af48731-ce1a-4157-a090-d1e7b811ae67 (16258 us)
COMMIT 13 statements on connection 0eda0d4b-1858-4933-948b-73d7923c6379 (11246 us)
COMMIT 11 statements on connection 1268736c-018e-4cb4-946f-3272a04bce82 (14209 us)
COMMIT 11 statements on connection 0eda0d4b-1858-4933-948b-73d7923c6379 (9972 us)
COMMIT 3 statements on connection b7b94607-9099-40c5-bf6b-86a3c5d8b993 (4768 us)
COMMIT 2 statements on connection a2c35315-8aef-4533-8b8f-d0a22b0c25f7 (2809 us)
COMMIT 3 statements on connection 0af48731-ce1a-4157-a090-d1e7b811ae67 (3770 us)
COMMIT 2 statements on connection 0eda0d4b-1858-4933-948b-73d7923c6379 (2157 us)
COMMIT 3 statements on connection a2c35315-8aef-4533-8b8f-d0a22b0c25f7 (4730 us)
COMMIT 2 statements on connection a2c35315-8aef-4533-8b8f-d0a22b0c25f7 (2333 us)
COMMIT 3 statements on connection 0af48731-ce1a-4157-a090-d1e7b811ae67 (3406 us)
COMMIT 2 statements on connection 0eda0d4b-1858-4933-948b-73d7923c6379 (1940 us)
COMMIT 3 statements on connection a2c35315-8aef-4533-8b8f-d0a22b0c25f7 (3526 us)
COMMIT 2 statements on connection a2c35315-8aef-4533-8b8f-d0a22b0c25f7 (5750 us)
COMMIT 2 statements on connection c9167464-884f-41a8-bb1b-0a2dbc6ab9bc (5267 us)
COMMIT 2 statements on connection 45f54e80-b192-434a-ba9a-05dac9377975 (5191 us)
COMMIT 2 statements on connection b7b94607-9099-40c5-bf6b-86a3c5d8b993 (36201 us)
...

So this shows us the count of statements in each transaction when it's either committed or rolled back, along with the connection ID and total duration of the transaction itself. This would certainly have helped in the investigation of #7208, where many transactions were both long-running and included a huge number of statements. I'm sure we'll want more scripts in the long run, though this is a good start.
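
To spot the long or statement-heavy transactions described above, the script's output can be post-processed. This is a minimal Python sketch, assuming the line format shown in the sample output; the function names and both thresholds are hypothetical. The label printed for a rolled-back transaction is not shown above, so the pattern accepts any uppercase word in place of COMMIT.

```python
import re

# Matches one completed-transaction line from the sample output, e.g.
#   COMMIT 11 statements on connection 0eda0d4b-... (13806 us)
# Assumption: a rollback prints some other uppercase word instead of COMMIT.
TXN_LINE = re.compile(
    r"^(?P<outcome>[A-Z]+) (?P<statements>\d+) statements "
    r"on connection (?P<conn>[0-9a-f-]{36}) \((?P<micros>\d+) us\)$"
)


def parse_txn_line(line):
    """Return a dict for one transaction line, or None if it does not match."""
    match = TXN_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "outcome": match["outcome"],
        "statements": int(match["statements"]),
        "conn": match["conn"],
        "micros": int(match["micros"]),
    }


def find_outliers(lines, max_statements=100, max_micros=30_000):
    """Return parsed transactions that exceed either (hypothetical) threshold."""
    parsed = (parse_txn_line(line) for line in lines)
    return [
        txn
        for txn in parsed
        if txn is not None
        and (txn["statements"] > max_statements or txn["micros"] > max_micros)
    ]
```

Lines that are not transaction records, such as the "Tracing all database transactions" banner, are ignored rather than treated as errors.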

@leftwo
Contributor

leftwo commented Dec 12, 2024

We have a pile of Crucible-related DTrace scripts included in all global zones.
At the moment, those files are extracted at: /opt/oxide/crucible_dtrace

Should we define (and have all of us use) a common top level directory to put all these?

@bnaecker
Collaborator Author

Should we define (and have all of us use) a common top level directory to put all these?

That sounds like a good idea! I didn't realize the Crucible scripts were already in the GZ. What are the tradeoffs to putting them there as opposed to the Crucible zone? I added these to the Nexus zone itself, but am happy to move them if that makes sense. A path structure like /opt/oxide/dtrace/{component} or similar seems OK.

@leftwo
Contributor

leftwo commented Dec 12, 2024

What are the tradeoffs to putting them there as opposed to the Crucible zone? I added these to the Nexus zone itself,

There was a time when I could only run DTrace in the global zone, but I don't think that is still true.
Some of the scripts expect to be in the global zone and look at all the propolis instances.
Many of the scripts also take a PID, and that was generally how I ran things if I wanted to look at just a specific zone's process (which I can still do from the global zone).

@bnaecker
Collaborator Author

Thanks @leftwo that makes sense. We can definitely run DTrace from a non-global zone now. I don't really see a downside to putting them in the GZ, so I'll go ahead and do that instead. Does /opt/oxide/dtrace/{nexus,crucible,...} seem like a fine place to you?

@leftwo
Contributor

leftwo commented Dec 13, 2024

Does /opt/oxide/dtrace/{nexus,crucible,...} seem like a fine place to you?

Seems fine to me. We (I) can move the crucible ones in a different PR.

@davepacheco
Collaborator

One risk of putting these in the GZ is that they can be mismatched with the zone (if the probes and scripts change over time). That may be uncommon enough that we don't care but I figured I'd mention it.

@leftwo
Contributor

leftwo commented Dec 13, 2024

One risk of putting these in the GZ is that they can be mismatched with the zone (if the probes and scripts change over time). That may be uncommon enough that we don't care but I figured I'd mention it.

That's a good point, and yes, it depends on what the probes are probing.
For crucible, we make the DTrace package match the bits that come with propolis. Though, if we change the downstairs and don't change the upstairs/propolis side, there could be a mismatch.

@bnaecker
Collaborator Author

Good point @davepacheco, thanks for bringing that up. The probes haven't changed much in a while, but I'm also hesitant to introduce more coupling than we need to. There's always a chance that the scripts are out of date even in the zone, since nothing really checks that at packaging time. Still, given that you can always get into the Nexus zone from the GZ, but not vice-versa, it seems safer to deploy these as part of that zone.

@bnaecker
Collaborator Author

I'm going to wait for an update to diesel-dtrace to merge before updating these scripts. Sean is adding a probe to the place in Nexus where we tend to create transactions (though not all of them), which will help associate the actual diesel-dtrace probes with the Nexus code itself.

@bnaecker
Collaborator Author

Ok, the scripts are updated with the new version of diesel-dtrace, and are in the Nexus zone still for the reasons Dave mentioned.

@bnaecker
Collaborator Author

Oops, didn't mean to request review yet. I think we'll rework this to use the retry-wrapper probes Sean added in #7248.

smklein added a commit that referenced this pull request Dec 13, 2024
- Adds a "NonRetryHelper" struct to help instrument non-retryable
transactions
- Adds `transaction__start` and `transaction__done` probes which wrap
our retryable (and now, non-retryable) transactions

Intended to supplement #7244 and provide transaction names
bnaecker and others added 4 commits December 13, 2024 21:36
- Adds a script for tracing all transactions run by Nexus, including
  their overall latency and the number of statements in each one.
- Move all existing scripts to a Nexus subdirectory, and then include
  that whole directory in the Omicron zone for Nexus itself.
- Closes #7224
@bnaecker bnaecker force-pushed the dtrace-sql-transactions branch from 33317df to 9853e1b Compare December 13, 2024 22:13
@bnaecker
Collaborator Author

Ok, I've updated this to use the new probes @smklein added in #7248. Here's what we get, running Omicron locally on my Helios machine:

bnaecker@shale : ~/omicron $ pfexec ./tools/dtrace/nexus/trace-transactions.d -p $(pgrep -n nexus)
Tracing all database transactions for nexus PID 17772, use Ctrl-C to exit
Started transaction 'switch_port_settings_get' on conn b749c14b-6fbb-47aa-9988-3bedf208988a
COMMIT 11 statement(s) in transaction 'switch_port_settings_get' on connection b749c14b-6fbb-47aa-9988-3bedf208988a (7533 us)
Started transaction 'switch_port_settings_get' on conn 217540d7-0e7a-4ae7-88f9-10aa08ab6772
COMMIT 11 statement(s) in transaction 'switch_port_settings_get' on connection 217540d7-0e7a-4ae7-88f9-10aa08ab6772 (7722 us)
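
With transaction names in the output, results can be grouped per transaction. This is a minimal Python sketch, assuming the line format in the sample above; `summarize_by_name` is a hypothetical helper, and the pattern accepts any uppercase word in place of COMMIT because the label for other outcomes is not shown here.

```python
import re
from collections import defaultdict

# Matches one completed-transaction line from the sample output, e.g.
#   COMMIT 11 statement(s) in transaction 'switch_port_settings_get'
#   on connection b749c14b-... (7533 us)
DONE_LINE = re.compile(
    r"^(?P<outcome>[A-Z]+) (?P<statements>\d+) statement\(s\) in transaction "
    r"'(?P<name>[^']+)' on connection (?P<conn>\S+) \((?P<micros>\d+) us\)$"
)


def summarize_by_name(lines):
    """Aggregate transaction count, statement count, and latency per name."""
    totals = defaultdict(
        lambda: {"transactions": 0, "statements": 0, "micros": 0}
    )
    for line in lines:
        match = DONE_LINE.match(line.strip())
        if match is None:
            # Skips the banner and the "Started transaction ..." lines.
            continue
        entry = totals[match["name"]]
        entry["transactions"] += 1
        entry["statements"] += int(match["statements"])
        entry["micros"] += int(match["micros"])
    return {
        name: {**entry, "mean_micros": entry["micros"] / entry["transactions"]}
        for name, entry in totals.items()
    }
```

Grouping by name rather than by connection ID reflects that connections are reused across unrelated transactions, as the repeated connection IDs in the earlier output suggest.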

Collaborator

@smklein smklein left a comment

Ship it!

@bnaecker bnaecker enabled auto-merge (squash) December 13, 2024 22:25
@leftwo
Contributor

leftwo commented Dec 13, 2024

Thanks @leftwo that makes sense. We can definitely run DTrace from a non-global zone now. I don't really see a downside to putting them in the GZ, so I'll go ahead and do that instead. Does /opt/oxide/dtrace/{nexus,crucible,...} seem like a fine place to you?

Did this change not make it into the final PR?

@bnaecker
Collaborator Author

@leftwo Right now, I'd prefer to keep these scripts in the zone so we're less likely to end up with a mismatch between the scripts in the GZ and the actual probes in the zone itself. I think putting the scripts that are in the GZ in that path structure makes a lot of sense.

@leftwo
Contributor

leftwo commented Dec 13, 2024

@leftwo Right now, I'd prefer to keep these scripts in the zone so we're less likely to end up with a mismatch between the scripts in the GZ and the actual probes in the zone itself. I think putting the scripts that are in the GZ in that path structure makes a lot of sense.

Yeah, sorry, I was not specific. Having them zone only is fine.
Having a standard path that is the same whether in the global zone or inside a zone is more what I was talking about.

@bnaecker
Collaborator Author

As of 5e90bac, the scripts are now here:

bnaecker@shale : ~/omicron $ tar -tf out/omicron-nexus.tar.gz  | grep dtrace
root/opt/oxide/omicron-nexus/dtrace
root/opt/oxide/omicron-nexus/dtrace/aggregate-query-latency.d
root/opt/oxide/omicron-nexus/dtrace/slowest-queries.d
root/opt/oxide/omicron-nexus/dtrace/trace-db-queries.d
root/opt/oxide/omicron-nexus/dtrace/trace-transactions.d

@bnaecker
Collaborator Author

Yeah, sorry, I was not specific. Having them zone only is fine.
Having a standard path that is the same whether in the global zone or inside a zone is more what I was talking about.

Ah, I didn't understand that. In the latest commit, I put them in the same location as the binary they're intended to trace, /opt/oxide/omicron-nexus/dtrace, where Nexus is at /opt/oxide/omicron-nexus/bin/nexus. Would /opt/oxide/dtrace/nexus make more sense, and we use /opt/oxide/dtrace for the general base prefix?

@leftwo
Contributor

leftwo commented Dec 13, 2024

Yeah, sorry, I was not specific. Having them zone only is fine.
Having a standard path that is the same whether in the global zone or inside a zone is more what I was talking about.

Ah, I didn't understand that. In the latest commit, I put them in the same location as the binary they're intended to trace, /opt/oxide/omicron-nexus/dtrace, where Nexus is at /opt/oxide/omicron-nexus/bin/nexus. Would /opt/oxide/dtrace/nexus make more sense, and we use /opt/oxide/dtrace for the general base prefix?

Yeah, /opt/oxide/dtrace/ seems like a good general base, and we can all make sub-directories inside that.
I'm thinking of the support person and hoping to make their search for scripts easier :)
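
The proposed layout can be expressed as a small path helper. This is a minimal Python sketch of the /opt/oxide/dtrace/{component} convention discussed here, not of where the scripts in this PR actually landed; `DTRACE_BASE` and `script_dir` are hypothetical names.

```python
from pathlib import PurePosixPath

# Proposed common base directory from the discussion above.
DTRACE_BASE = PurePosixPath("/opt/oxide/dtrace")


def script_dir(component):
    """Return the per-component script directory under the common base.

    Rejects names that would escape the base directory.
    """
    if not component or "/" in component or component in (".", ".."):
        raise ValueError(f"invalid component name: {component!r}")
    return DTRACE_BASE / component
```

Because the same function yields the same path in the global zone and inside a zone, someone searching for scripts only needs to know the component name.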

@bnaecker bnaecker enabled auto-merge (squash) December 13, 2024 23:50
Contributor

@leftwo leftwo left a comment

Stamp of rubber has been applied!

@bnaecker bnaecker merged commit 89d6c76 into main Dec 14, 2024
17 checks passed
@bnaecker bnaecker deleted the dtrace-sql-transactions branch December 14, 2024 04:51
Successfully merging this pull request may close these issues.

Would like D scripts for tracing transactions in Nexus
5 participants