How to propagate Context in batch processing? #3826

jstaffans · 2021-11-03T21:08:59Z

I'm thinking about how to instrument a batch processing pipeline that accesses a set of files. The only thing each step in the pipeline is aware of is the filename, which is unique. I'd like to have an end-to-end trace covering all the steps that have accessed a particular file in some way.

If I understand things correctly, a Context needs to be "propagated", so can I somehow construct the same Context from the filename in each step of the pipeline? I keep reading that it's possible to define the Context explicitly, but I haven't yet seen concrete examples of how to do this. Or does this just mean that I use the same trace ID everywhere and derive it from the filename somehow?

Answered by jkwatson

jkwatson · 2021-11-04T02:31:05Z
Maintainer

If the only thing each step is aware of is the filename, it will probably be quite difficult to propagate the trace across step boundaries. Are the steps running on the same thread? Are they even running on the same machine?

What makes a trace coherent is the TraceId associated with each span. So, at the very minimum, you need a way to make sure the TraceId is accessible when you create each span.

3 replies

jstaffans Nov 4, 2021
Author

Right, I guess the first step in the pipeline could generate a TraceId and somehow attach it to the file, so that subsequent steps know which TraceId to use. Or, if the filename really is unique for each pipeline execution, maybe the TraceId could be derived from a hash of the filename.
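As a rough illustration of the hash idea, here is a minimal sketch that derives a 128-bit trace ID (32 lowercase hex characters, the shape used by the W3C Trace Context format) from a filename. The function name `trace_id_from_filename` and the choice of SHA-256 are assumptions made for this example, not something prescribed in the thread or by any SDK.

```python
import hashlib


def trace_id_from_filename(filename: str) -> str:
    """Derive a deterministic 32-hex-character trace ID from a filename.

    Hypothetical helper: every step that knows the filename can recompute
    the same ID without any shared state.
    """
    digest = hashlib.sha256(filename.encode("utf-8")).digest()
    # A trace ID is 16 bytes; keep the first 16 bytes of the 32-byte digest.
    trace_id = digest[:16].hex()
    # An all-zero trace ID is invalid in W3C Trace Context; reject it
    # rather than silently emit it (practically unreachable with SHA-256).
    if trace_id == "0" * 32:
        raise ValueError("derived trace ID is all zeros")
    return trace_id


if __name__ == "__main__":
    print(trace_id_from_filename("batch-2021-11-03/input-0001.csv"))
```

This only works if the filename is never reused across pipeline executions; otherwise unrelated runs would land in the same trace. Samplers that assume randomly generated trace IDs may also behave differently with derived IDs.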

What about the fact that, as processing is handed over from one step to the next in the pipeline, there's no single process/thread that could both open and close the root span? Is it necessary to have a root span, or is a trace "coherent" with just a collection of spans that each use the same TraceId?

I think this point is valid for any asynchronous setup, such as when there's a queue between two systems rather than an HTTP/gRPC call.

jkwatson Nov 4, 2021
Maintainer

Many queuing systems have a way to propagate metadata (headers in Kafka, etc.), but you're correct: it will be very difficult if there is no transport that allows metadata to propagate across step or message-processing boundaries.
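To make the header-based approach concrete, here is a minimal sketch that writes and reads a W3C `traceparent` value in a message-header map. The `inject` and `extract` functions and the plain `dict` of byte values are assumptions standing in for whatever header API the queue client actually exposes; a real setup would normally rely on the tracing SDK's own propagator instead.

```python
import re
from typing import Dict, Optional, Tuple

# traceparent layout: version "00", 32-hex trace ID, 16-hex span ID, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: Dict[str, bytes], trace_id: str, span_id: str,
           sampled: bool = True) -> None:
    """Producer side: store the current trace and span IDs in the headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}".encode("ascii")


def extract(headers: Dict[str, bytes]) -> Optional[Tuple[str, str]]:
    """Consumer side: return (trace_id, parent_span_id), or None if absent/invalid."""
    raw = headers.get("traceparent")
    if raw is None:
        return None
    match = TRACEPARENT_RE.match(raw.decode("ascii", errors="replace"))
    if match is None:
        return None
    trace_id, span_id, _flags = match.groups()
    # All-zero IDs are invalid in W3C Trace Context.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id


if __name__ == "__main__":
    message_headers: Dict[str, bytes] = {}
    inject(message_headers, "ab" * 16, "cd" * 8)
    print(message_headers["traceparent"])
    print(extract(message_headers))
```

The consumer would then start its span with the extracted trace ID and use the extracted span ID as the parent.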

The "root span" can be closed as soon as it's started and simply be treated as the parent of all the steps. That assumes you have a way to get the TraceId propagated; you can use the same technique to propagate the root span ID along with it.
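A minimal sketch of that pattern, modelled with plain data records rather than a real tracing SDK: the first step creates a root span that ends immediately and stores the trace ID and root span ID next to the data file, and each later step reads them back and records its own span as a child of the root. The `SpanRecord` type, the function names, and the `.trace.json` sidecar file are all assumptions made for illustration; exporting these records, or handing the IDs to an actual SDK, is not shown.

```python
import json
import secrets
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class SpanRecord:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    name: str
    start_ns: int
    end_ns: int


def sidecar_path(data_file: Path) -> Path:
    # Hypothetical convention: the IDs live in "<filename>.trace.json".
    return data_file.with_name(data_file.name + ".trace.json")


def start_pipeline(data_file: Path) -> SpanRecord:
    """First step: create a root span that is closed as soon as it is started."""
    now = time.time_ns()
    root = SpanRecord(secrets.token_hex(16), secrets.token_hex(8), None,
                      "pipeline", now, now)
    ids = {"trace_id": root.trace_id, "root_span_id": root.span_id}
    sidecar_path(data_file).write_text(json.dumps(ids))
    return root


def run_step(data_file: Path, step_name: str) -> SpanRecord:
    """Later step: reuse the stored IDs and parent this step's span to the root."""
    ids = json.loads(sidecar_path(data_file).read_text())
    start = time.time_ns()
    # ... the step's real work on data_file would happen here ...
    end = time.time_ns()
    return SpanRecord(ids["trace_id"], secrets.token_hex(8),
                      ids["root_span_id"], step_name, start, end)
```

Because every step span shares the trace ID and points at the same parent, a backend can assemble them into one trace even though no single process held the root span open.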

jstaffans Nov 4, 2021
Author

That makes sense, thanks!
