Please revisit decision to emit an event for every message #1621

stephentoub · 2024-11-26T23:12:33Z

The latest (1.28) spec requires emitting an event for every individual message. This is problematic in a variety of ways:

Messages that are part of the same request are not individual entities: they are only meaningful as part of the overall request being submitted to the LLM, which functionally turns the totality of those messages into a single prompt. The whole set of messages / the whole request should be logged, not each tiny piece of it individually.
A request might contain dozens or hundreds of messages. Logging each individually spams the listener, which then needs to do work to reconstitute the original request.
Any sampling being performed also then needs to do such work, as it otherwise might drop portions of a request.
The same message ends up getting logged on its own many times, each time a new request with the same history is resubmitted (augmented with additional messages). That's easy to understand when the messages are grouped, but when they're logged individually, it's much harder to reason about the duplication.

lmolkova · 2024-11-27T00:22:43Z

Logging all messages as one event comes with the following limitations:

hard to query. Different messages have different structure. As a result, messages would become heterogeneous arrays of objects.
individual timestamps are lost. user message happens at the beginning of the request, tool call (if performed by the service-side agent) at a later time, final completion happens at the end
harder to control verbosity. With current approach, If I want only choices and don't want to capture system messages or tool calls I can configure it (OTel will provide all the hooks and options and afaik, .NET is also considering per-event-name/id control). With all-messages-in-one-event, each instrumentation would need to provide some GenAI specific options.
different model from assistants. OpenAI assistants, AWS Bedrock agents don't require you to re-send chat history all the time, so we'd be forced to model things on per-message basis

Squashing all messages into one reduces visual noise when using non-GenAI aware tools, but results in degraded experience for tools that do need to understand content of these messages (tools that render chat history, run evaluations, query these events in custom ways). If users use the tool that's not showing chat history nicely, they probably don't care about the chat history and can disable those events or send them to low-cost destination.

The fact that chat history needs to be sent over and over again only shows how terribly inefficient existing GenAI solutions are and I don't think we should hide it.

[Update] I'd expect any distributed application that does anything complicated to generate 10-100+ info logs for each trace even if it has nothing to do with AI. Enabling debug logs would add another order of magnitude. Existing tools don't show it nicely, that's true.

stephentoub · 2024-11-27T00:45:35Z

Different messages have different structure. As a result, messages would become heterogeneous arrays of objects.

That's the nature of the system; they do have different structure, and they are heterogenous.

user message happens at the beginning of the request, tool call (if performed by the service-side agent) at a later time, final completion happens at the end

For non-streaming, that's not the case: the completion is logically all created at a single moment in time and shows up as a single response payload. For streaming, sure, but the spec already doesn't reflect the nature of streaming, e.g. you could get two tokens at very different times, but there's no notion of that in the spec, which necessitates that a streaming response be composed back into manufactured messages.

With all-messages-in-one-event, each instrumentation would need to provide some GenAI specific options.

I'm not following. The system needs to have a mapping of what it would emit at various verbosity levels. If it's emitting individual messages, it needs to emit them at that level. If it's emitting them altogether, it can simply query for the current verbosity level (which any respectible system does anyway before logging) and choose to compose only the pieces that are relevant at that verbosity level. Any options to configure this are relevant to both approaches.

OpenAI assistants, AWS Bedrock agents don't require you to re-send chat history all the time, so we'd be forced to model things on per-message basis

I disagree. You log what you send/receive. In the case of chat completion, what you send is a list of messages. In the case of assistants, what you send is a set of messages, it just needn't be the complete set because it may be augmented with a further set that already exists in storage. The telemetry would include what you send and receive, in both cases.

Squashing all messages into one reduces visual noise when using non-GenAI aware tools, but results in degraded experience for tools that do need to understand content of these messages (tools that render chat history, run evaluations, query these events in custom ways).

I do not see how it degrades the experience. In both cases, the tooling needs to understand the format and the data.

If users use the tool that's not showing chat history nicely, they probably don't care about the chat history

I'm a user. I care about chat history. My tool is not showing it nicely.

and I don't think we should hide it

How is it hiding it? In both cases, all the content is logged. It's just a question of whether it's logged in a way that reflects reality (they're physically and logically all part of the same prompt), or whether the telemetry is manufacturing boundaries that don't exist.

I'd expect any distributed application that does anything complicated to generate 10-100+ info logs for each trace even if it has nothing to do with AI.

100%. All the more reason to not introduce artificial boundaries where they don't exist. If I have a sentence I want to log: "This is a test of the emergency broadcast system", can you imagine if we recommended that be logged as 9 distinct entries?

logger.LogInformation("This ");
logger.LogInformation("is ");
logger.LogInformation("a ");
logger.LogInformation("test ");
logger.LogInformation("of ");
logger.LogInformation("the ");
logger.LogInformation("emergency ");
logger.LogInformation("broadcast ");
logger.LogInformation("system");

yet that's exactly what this design is suggested be done for the prompt to the LLM, splitting the prompt into pieces that are then logged individually, even though it's all just part of the same input and is not interpretable separately.

lmolkova · 2024-11-27T01:41:50Z

Let me tackle things one-by-one

That's the nature of the system; they do have different structure, and they are heterogenous.
I do not see how it degrades the experience. In both cases, the tooling needs to understand the format and the data.

Different structure is a good signal that messages should be in separate events. The more complicated the structure is, the harder it is for the end users to parse it. Tooling suffers from it too, but I'm less worried about tools.

Imagine I'm writing a query that would return all completions with finish_reason that's not "stop". I'd write something similar to (in KQL)

logs | where name == "gen_ai.choice" | extend content = parse_json(body) | where content["finish_reason"] != "stop"

I can spend 15-45 mins to write a query that'd do the same for events like one below. I've done it a few times in the past and it's not easy. The hint is to use mv-expand if I recall correctly

{
      "name": "gen_ai.content",
      "body": {
        "messages": [
          {
            "role": "system",
            "content": "You are a helpful assistant."
          },
          {
            "role": "user",
            "content": "Hello!"
          }
        ],
        "choices": [
          {
            "index": 0,
            "message": {
              "role": "assistant",
              "content": "\n\nHello there, how may I assist you today?"
            },
            "logprobs": null,
            "finish_reason": "stop"
          }
        ]
      }
    }

I encourage you to try it yourself.

You can say that it's hard to do in KQL and you might be right.

Let's compare it to JQ:

Now:

.logs[] 
| select(.name == "gen_ai.choice" and .body.content.finish_reason != "stop")

One event:

.logs[] 
| select(.name == "gen_ai.content")
| .body.choices[]
| select(.finish_reason != "stop")

both are relatively simple, yet I spent 1 min to write the first one, and a few more to write the second one

lmolkova · 2024-11-27T01:47:49Z

For non-streaming, that's not the case: the completion is logically all created at a single moment in time and shows up as a single response payload.

I disagree. You log what you send/receive. In the case of chat completion, what you send is a list of messages. In the case of assistants, what you send is a set of messages, it just needn't be the complete set because it may be augmented with a further set that already exists in storage. The telemetry would include what you send and receive, in both cases.

This is not the case for agents/assistants where you do add messages one-by-one and agent manages your chat history separately from the completions. Or when you report individual events (too calls) as they occur during thread run - PTAL at the microsoft#3

Also what you're saying that if we log HTTP request and response bodies we should log them together since they occur at the same time and describe the same thing creating visual noise in my tool.

lmolkova · 2024-11-27T01:49:55Z

I'm a user. I care about chat history. My tool is not showing it nicely.

We can fix the tool or change the telemetry (by making it harder to query and control verbosity), why should we change the telemetry and not the tool?

stephentoub · 2024-11-27T01:49:59Z

yet I spent 1 min to write the first one, and a few more to write the second one

But with the second, I get the full content. With the first, you then need to follow it up with yet another query you need to write to extract all the messages that are part of same request or completion.

lmolkova · 2024-11-27T01:50:44Z

But with the second, I get the full content. With the first, you then need to follow it up with yet another query you need to write to extract all the messages that are part of same request or completion.

If only tool can group and visualize them under spanId...

stephentoub · 2024-11-27T01:50:45Z

This is not the case for agents/assistants where you do add messages one-by-one and agent manages your chat history separately from the completions.

Sure it is. You log what you have in the largest unit you have them.

lmolkova · 2024-11-27T01:51:37Z

Sure it is. You log what you have in the largest unit you have them.

I don't think we should have different telemetry formats for agents and chat completions just because existing tooling can't show verbose logs nicely.

stephentoub · 2024-11-27T01:51:54Z

why should we change the telemetry and not the tool?

Why should the telemetry make it harder for the tool by exposing the data factored at a finer granularity than it actually is in reality?

stephentoub · 2024-11-27T01:53:13Z

Sure it is. You log what you have in the largest unit you have them.

I don't think we should have different telemetry formats for agents and chat completions just because existing tooling can't show verbose logs nicely.

You're going to end up with different telemetry, regardless, eg if each turn for completions logs the same message multiple times and only once or zero times for assistants.

stephentoub · 2024-11-27T01:56:32Z

Also what you're saying that if we log HTTP request and response bodies we should log them together since they occur at the same time and describe the same thing creating visual noise in my tool.

I'm not saying that. The request and response are separate from each other and can be logged separately, just as an LLM prompt can be logged separately from the completion. I'm objecting to artificially splitting the request or artificially spitting the completion, just as I'd object to doing so for HTTP. I don't want a separate log entry for every HTTP header, for example.

stephentoub · 2024-11-27T01:58:36Z

But with the second, I get the full content. With the first, you then need to follow it up with yet another query you need to write to extract all the messages that are part of same request or completion.

If only tool can group and visualize them under spanId...

Why is the tool capable of that but not basic search?

lmolkova · 2024-11-27T18:25:58Z

Why is the tool capable of that but not basic search?

What do you mean by basic search? I can write a relatively simple query that would group logs by the spanid and traceid in KQL or JQ.

You're going to end up with different telemetry, regardless, eg if each turn for completions logs the same message multiple times and only once or zero times for assistants.

I'll end up with different telemetry, but the structure of individual telemetry items will be the same.

I'm objecting to artificially splitting the request or artificially spitting the completion, just as I'd object to doing so for HTTP. I don't want a separate log entry for every HTTP header, for example.

Would you put tool definition inside prompt event? I would not because I'd like people who don't want to log tool definitions to be able to disable that event using generic log filtering/config. Also I want to have the same event defining tool call for assistants and for chat completions.

So splitting may look artificial, but it does not mean it's bad.

If we summarize:

Group prompt messages together.
- Pros: fewer logs (same volume/cost), look nicer in the tools that don't provide GenAI-specific experience
- Cons: harder to query, inconsistent individual event format with assistants, harder to control verbosity - elaborated in Please revisit decision to emit an event for every message #1621 (comment)
Use individual message as event
- Pros: easier to query, individual events are consistent with assistants, verbosity control by event name. Tools that visualize chat visualize individual messages anyway
- Cons: looks bad in tools that do generic log visualization

I'm happy to continue the discussion and try looking for options 3, 4, etc, but picking between 1 and 2, I don't see why 1 would be better.

samsp-msft · 2024-11-27T20:38:06Z

Squashing all messages into one reduces visual noise when using non-GenAI aware tools, but results in degraded experience for tools that do need to understand content of these messages (tools that render chat history, run evaluations, query these events in custom ways).

Having all the data in multiple log messages is going to make analysis tools considerably harder to write - they will need to store all the messages and then run queries to piece together the logs to build the bigger picture. If all the data is in one payload, its much easier to do batch operations or parallel processing as each log entry tells the complete story for that gen_ai call.

lmolkova · 2024-11-27T21:07:15Z

Having all the data in multiple log messages is going to make analysis tools considerably harder to write - they will need to store all the messages and then run queries to piece together the logs to build the bigger picture. If all the data is in one payload, its much easier to do batch operations or parallel processing as each log entry tells the complete story for that gen_ai call.

would it mean grouping prompts, tool definitions, attachments, tool calls, tool results, and completions into one uber event that captures everything?

The moment you separate prompts from completions into different records, you get a problem of bringing them together.

I'd like to challenge assumptions that

This is the absolutely most important scenario we need to optimize for, important enough to make other scenarios terrible
That it's hard to group logs by span-id and trace-id. Here's the ~query logs| summarize make_list(bag_pack("name", name, "content", body)) by trace_id, span_id

SteveSandersonMS · 2024-11-28T10:19:46Z

Is the goal to record the underlying physical reality of the protocol, or is it to record an interpretation of what happened in a form that eases certain queries? I can see arguments both ways.

Consumers who are comfortable with the wire format will likely bias towards seeing the "true" shape of the data, whereas consumers who have never seen the wire format but just want answers to high-level queries will likely prefer whatever makes their specific queries more intuitive.

I'm happy to continue the discussion and try looking for options 3, 4, etc, but picking between 1 and 2, I don't see why 1 would be better.

TBH I don't think the pros/cons list is that clear cut. For example, when it comes to ease of querying, you could say:

Group prompt messages together: easier querying if your goal is to do anything with the requests/responses themselves
Use individual message as event: easier querying if your goal is to do anything with individual messages

Ultimately it's hard to argue from first principles that there's a single right answer here. It depends on scenarios. Maybe in the long run it will become clear that tools and consumption scenarios all settle one way or the other. If that's already settled, it's news to me.

You could make the "first principles" argument that in the absence of definite evidence that people want to vary the structure from its original shape, it's a safer to preserve the original wire format since that carries some additional level of truth. I don't know how the decision was made to switch to event-per-message so maybe there were a large number of people agreeing that's best in their scenario, but that's not clear from this GitHub issue alone.

lmolkova · 2024-11-28T18:42:10Z

The principles we use to instrument libraries and write semantic conventions that have an impact on this decision are:

consistency across similar technologies. If OpenAI wire format changes tomorrow, we won't change semantic conventions to match it. We try to come up with a format that works for multiple backends. Also for similar scenarios we try to generate similar telemetry. It mostly does not matter how user generates completion (via assistant thread run or via chat completion that carries full history on each request), so queries/dashboard/other tools should not really care that much.
layered operations and separation of responsibilities. If you need protocol-level data, use HTTP/gRPC/etc telemetry - one can log HTTP request/response bodies for GenAI calls - GenAI instrumentation won't normalize/sanitize/extract important details. GenAI instrumentation creates logical operations that cover (potentially) multiple requests over arbitrary protocol (or even local ones if you run in-proc model). Toy can think about GenAI instrumentation as something wrapping the public API surface of the client - telemetry it captures intentionally has nothing to do with the protocol.
use-cases. We focus on what you do with telemetry (perf: throughput/latency, reliability: errors, debugging: tracing + events, evaluations: events + metrics). Tools that provide AI-specific experience, usually visualize messages separately. You might want to drop recording instructions since it's mostly static or provide them once when you create your agent. You probably want to run queries against choices.

There are many others. I encourage you to participate in semconv discussions. There are extensive discussions on the topic - check out #980 and linked issues + we have notes and recordings of calls where we discussed it - all available here https://docs.google.com/document/d/1EKIeDgBGXQPGehUigIRLwAUpRGa7-1kXB736EaYuJ2M/edit?tab=t.0#heading=h.ylazl6464n0c

SteveSandersonMS · 2024-11-28T19:11:55Z

Thanks @lmolkova, that's interesting and helps a bit with understanding the thought process.

To be clear when I referred to "record the underlying physical reality of the protocol" I meant representing the same conceptual information (e.g., having equivalent concepts and being at the same level of granularity), not the literal bytes.

can think about GenAI instrumentation as something wrapping the public API surface of the client

I guess that is meant quite loosely since you're also implying you don't want to reflect the differences between alternate API patterns (e.g., streaming vs nonstreaming have very different API shapes, likewise there tend to be different APIs for calling a stateless chat endpoint vs a stateful assistant).

lmolkova · 2024-12-03T00:37:49Z

@SteveSandersonMS we record multiple telemetry items for a single API call - logical span/duration metric/events + transport request spans, maybe connection establishment or dns lookup can be done in a scope of that call. I don't think we can have any strict rules on how to record details (and how many details to record) on a single telemetry item for an arbitrary call.

I.e. we analyze use-cases and try to manually optimize telemetry items for the most consistency across important use-cases.

The use-case as I see it here is "dear model, please generate some content based on this data".
You can implement it in multiple ways

pass content by reference (thread id or cached content) or by value (message list)
pass one set of related messages (chat) or a list of unrelated messages (batch)
stream response or not stream it.

Depending on the scenario, my telemetry will be slightly different. E.g. If I use agents I'd have gen_ai.thread.id attribute on my spans and events, the operation name will also be different. In streaming case, I'd have additional metric for time-per-token. But structurally telemetry will be similar.

If as a user I decide to use a different chat completion API while staying within the same use case (chat completion), I'd like my telemetry to change as little as possible (ideally nothing major breaks, I just get a slightly different set of attributes describing the details).

For content, the unit that remains almost the same is the message itself. I can create it, pass it by value or reference. If the model reports it, I'd like to know how many tokens were in that message (and if it doesn't, it probably will at some point) and I'd like to have a standard place to get it from in my queries.

You're right that it's applied loosely and how it's applied depends on the level of abstraction and what we consider to be the same use-case. It seems to be the core of this discussion.

mikeldking · 2024-12-06T15:29:40Z

commenting here mainly to follow along.

I'm one of the maintainers of a few LLM instrumentors (https://github.com/Arize-ai/openinference) as well as an OTEL backend (https://github.com/Arize-ai/phoenix) and have had success with messages being a part of the span attributes payload. It's been a pretty successful direction for us as it's allowed for some helpful debugging flows (evaluations, replay of the invocation, etc.) - I understand that this mainly product related an not related to semantic conventions per-se but just wanted to express that I do echo some of the concerns @stephentoub expressed above.

Let me know if I can be of any help as we've gone through the implementation of messages as attributes end-to-end and might be able to provide some perspective.

TaoChenOSU · 2024-12-18T21:54:11Z

With the current convention, I have observed that when emitting an event for each message, the precision of the timestamps is crucial for end users to accurately reconstruct the sequence of messages.

In my experience on Azure Application Insights, I frequently notice that chat messages appear in a seemingly random order, because all messages have the same timestamps, despite they already being six digits into the second:

lzchen · 2024-12-19T21:34:07Z

@TaoChenOSU

I've created an issue in the Python repo that might address the issue you are observing.

stephentoub changed the title ~~Please revisit choice to emit an event for every message~~ Please revisit decision to emit an event for every message Nov 26, 2024

lmolkova added this to GenAI Semantic Conventions and Instrumentation libraries Dec 5, 2024

lmolkova added the area:gen-ai label Dec 5, 2024

lzchen mentioned this issue Dec 19, 2024

Implement higher precision clock than time_ns open-telemetry/opentelemetry-python#4367

Open

This was referenced Dec 19, 2024

Timestamp granularity and log ordering #1701

Open

Azure AI inference events timestamp in AzMon workaround Azure/azure-sdk-for-python#38955

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Please revisit decision to emit an event for every message #1621

Please revisit decision to emit an event for every message #1621

stephentoub commented Nov 26, 2024 •

edited

Loading

lmolkova commented Nov 27, 2024 •

edited

Loading

stephentoub commented Nov 27, 2024

lmolkova commented Nov 27, 2024 •

edited

Loading

lmolkova commented Nov 27, 2024 •

edited

Loading

lmolkova commented Nov 27, 2024 •

edited

Loading

stephentoub commented Nov 27, 2024

lmolkova commented Nov 27, 2024

stephentoub commented Nov 27, 2024

lmolkova commented Nov 27, 2024

stephentoub commented Nov 27, 2024

stephentoub commented Nov 27, 2024 •

edited

Loading

stephentoub commented Nov 27, 2024 •

edited

Loading

stephentoub commented Nov 27, 2024

lmolkova commented Nov 27, 2024 •

edited

Loading

samsp-msft commented Nov 27, 2024

lmolkova commented Nov 27, 2024 •

edited

Loading

SteveSandersonMS commented Nov 28, 2024

lmolkova commented Nov 28, 2024

SteveSandersonMS commented Nov 28, 2024

lmolkova commented Dec 3, 2024 •

edited

Loading

mikeldking commented Dec 6, 2024

TaoChenOSU commented Dec 18, 2024

lzchen commented Dec 19, 2024

Please revisit decision to emit an event for every message #1621

Please revisit decision to emit an event for every message #1621

Comments

stephentoub commented Nov 26, 2024 • edited Loading

lmolkova commented Nov 27, 2024 • edited Loading

stephentoub commented Nov 27, 2024

lmolkova commented Nov 27, 2024 • edited Loading

lmolkova commented Nov 27, 2024 • edited Loading

lmolkova commented Nov 27, 2024 • edited Loading

stephentoub commented Nov 27, 2024

lmolkova commented Nov 27, 2024

stephentoub commented Nov 27, 2024

lmolkova commented Nov 27, 2024

stephentoub commented Nov 27, 2024

stephentoub commented Nov 27, 2024 • edited Loading

stephentoub commented Nov 27, 2024 • edited Loading

stephentoub commented Nov 27, 2024

lmolkova commented Nov 27, 2024 • edited Loading

samsp-msft commented Nov 27, 2024

lmolkova commented Nov 27, 2024 • edited Loading

SteveSandersonMS commented Nov 28, 2024

lmolkova commented Nov 28, 2024

SteveSandersonMS commented Nov 28, 2024

lmolkova commented Dec 3, 2024 • edited Loading

mikeldking commented Dec 6, 2024

TaoChenOSU commented Dec 18, 2024

lzchen commented Dec 19, 2024

stephentoub commented Nov 26, 2024 •

edited

Loading

lmolkova commented Nov 27, 2024 •

edited

Loading

lmolkova commented Nov 27, 2024 •

edited

Loading

lmolkova commented Nov 27, 2024 •

edited

Loading

lmolkova commented Nov 27, 2024 •

edited

Loading

stephentoub commented Nov 27, 2024 •

edited

Loading

stephentoub commented Nov 27, 2024 •

edited

Loading

lmolkova commented Nov 27, 2024 •

edited

Loading

lmolkova commented Nov 27, 2024 •

edited

Loading

lmolkova commented Dec 3, 2024 •

edited

Loading