Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(tracing): rework capturing and use it for all requests #5049

Merged
merged 95 commits into from
Dec 16, 2024
Merged

Conversation

aqrln
Copy link
Member

@aqrln aqrln commented Nov 17, 2024

When exporting the engine traces to the client, we used to use the tracing_opentelemetry crate to convert tracing spans to OpenTelemetry spans and feed them to a custom OpenTelemetry SpanProcessor that, again, converted them from OpenTelemetry spans to our custom format to be serialized and sent to the client and then re-created as OpenTelemetry spans there. This approach had numerous problems:

  • OpenTelemetry on the engine side is an unnecessary step that only complicates things: there's no OpenTelemetry exporter or collector involved, and rather than converting the OTel spans to our custom format in an OTel SpanProcessor, we can just as well convert tracing spans to that same format in a tracing_subscriber Layer. The dependency on opentelemetry was also the reason we didn't have tracing support in WASM.
  • Exporting the spans to the client happened asynchronously via logs (except for Accelerate): this only worked correctly as long as there were no concurrent queries, and could mix up spans between requests otherwise.
  • We tried to transfer the OTel span IDs generated on the engine which was (1) incorrect, because the client may initialize the tracer provider with a different span ID generator rather than the default random one and (2) required us to use a private API (Span constructor) of a specific OTel SDK implementation (@opentelemetry/sdk-trace-base) in @prisma/instrumentation, which made us completely incompatible with certain vendor-specific SDKs not based on sdk-trace-base and caused dependency hell issues for the compatible ones because we and the application had to have matching versions. And we couldn't easily re-map those spans by re-creating them with tracing.startActiveSpan and discard the engine IDs without buffering and keeping additional data structures in memory which are tricky to maintain and clean up because we couldn't know if and when we received all spans for a request.
  • Although we already had the infrastructure in place to instrument each individual request with a tracing Subscriber corresponding to the correct engine instance (which was crucial for logs working correctly), OpenTelemetry instrumentation was global, so tracing broke with multiple PrismaClient instances in a single app.

Additionally, for Accelerate, we had what we called "capturing mechanism", that was a different SpanProcessor which was listenting for spans and events and associating them with requests by trace ID, to be included as part of the response to the client. Capturing the spans in this way for the normal, local engine is a nice solution to the above problems but the following problems needed to be solved:

  • Capturing mechanism used to use trace IDs to correlate spans and events with requests. This is incorrect: we can have many different Prisma Client operations in the same trace. Using the root engine span ID would've been correct from the application logic point of view but tracing span IDs are not globally unique: they can be reused after the span is closed, so while they can be relied on to distinguish the spans in the same trace, root span IDs from different points in time could repeat. This wouldn't be a problem for Accelerate in practice, but for library engine we'd want to make a separate endpoint for fetching the spans by request ID, otherwise it's not possible to have the prisma:engine:response_json_serialization span because the trace would have to be finished before preparing the response — this is a tradeoff we are willing to have for Accelerate but we want a more accurate timing in the library engine. This means we need the request IDs to be unique for a very long period of time.

This PR reworks the capturing mechanism to be a tracing_subscriber Layer, introduces RequestId in the API and makes it the only way to retrieve tracing data used in all types of engines. It also gets rid of OpenTelemetry dependencies, does some refactoring and makes the tracing logic leaner and more efficient.

Additionally, this PR enables tracing support in query-engine-wasm. This adds 10 KB gzipped which is okay now that Cloudflare allows 3 MB instead of 1 MB on free accounts.

Closes: https://github.com/prisma/team-orm/issues/1337
Closes: https://github.com/prisma/team-orm/issues/1336
Closes: https://github.com/prisma/team-orm/issues/1338
Close: https://github.com/prisma/team-orm/issues/1309
Fixes: prisma/prisma#16791
Fixes: prisma/prisma#16980
Fixes: prisma/prisma#17853
Fixes: prisma/prisma#19580
Fixes: prisma/prisma#20779

The corresponding client-side changes will also fix the following issues:

Copy link
Contributor

github-actions bot commented Nov 17, 2024

WASM Query Engine file Size

Engine This PR Base branch Diff
Postgres 2.078MiB 2.053MiB 25.654KiB
Postgres (gzip) 835.074KiB 823.375KiB 11.700KiB
Mysql 2.044MiB 2.017MiB 27.084KiB
Mysql (gzip) 821.530KiB 809.240KiB 12.291KiB
Sqlite 1.942MiB 1.915MiB 27.235KiB
Sqlite (gzip) 781.037KiB 769.209KiB 11.829KiB

Copy link

codspeed-hq bot commented Nov 19, 2024

CodSpeed Performance Report

Merging #5049 will not alter performance

Comparing rework-tracing (ef8afe7) with main (995e205)

Summary

✅ 11 untouched benchmarks

@aqrln aqrln force-pushed the rework-tracing branch 3 times, most recently from 3539cbd to e74fb4c Compare November 25, 2024 14:32
@aqrln aqrln added this to the 6.1.0 milestone Nov 28, 2024
@aqrln aqrln force-pushed the rework-tracing branch 2 times, most recently from 1006ad1 to 79b8d8d Compare December 2, 2024 12:55
@aqrln aqrln force-pushed the rework-tracing branch 2 times, most recently from db6f925 to 5917c3b Compare December 10, 2024 12:37
@aqrln aqrln marked this pull request as ready for review December 12, 2024 12:31
@aqrln aqrln requested a review from a team as a code owner December 12, 2024 12:32
@aqrln aqrln requested review from jkomyno and removed request for a team December 12, 2024 12:32
Brewfile Show resolved Hide resolved
@prisma prisma deleted a comment from aqrln Dec 16, 2024
Copy link
Contributor

@jkomyno jkomyno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Thanks for the additional context in call.

@jkomyno jkomyno merged commit d0b7335 into main Dec 16, 2024
1 check passed
@jkomyno jkomyno deleted the rework-tracing branch December 16, 2024 11:08
aqrln added a commit that referenced this pull request Dec 16, 2024
Fixup #5049 and rename one
missed span attribute. The tests on the client were updated to expect
the new name but the attribute was only updated in one of two places.
aqrln added a commit that referenced this pull request Dec 16, 2024
Fixup #5049 and rename one
missed span attribute. The tests on the client were updated to expect
the new name but the attribute was only updated in one of two places.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment