Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Please revisit decision to emit an event for every message #1621

Open
stephentoub opened this issue Nov 26, 2024 · 23 comments
Open

Please revisit decision to emit an event for every message #1621

stephentoub opened this issue Nov 26, 2024 · 23 comments

Comments

@stephentoub
Copy link

stephentoub commented Nov 26, 2024

The latest (1.28) spec requires emitting an event for every individual message. This is problematic in a variety of ways:

  • Messages that are part of the same request are not individual entities: they are only meaningful as part of the overall request being submitted to the LLM, which functionally turns the totality of those messages into a single prompt. The whole set of messages / the whole request should be logged, not each tiny piece of it individually.
  • A request might contain dozens or hundreds of messages. Logging each individually spams the listener, which then needs to do work to reconstitute the original request.
  • Any sampling being performed also then needs to do such work, as it otherwise might drop portions of a request.
  • The same message ends up getting logged on its own many times, each time a new request with the same history is resubmitted (augmented with additional messages). That's easy to understand when the messages are grouped, but when they're logged individually, it's much harder to reason about the duplication.
@stephentoub stephentoub changed the title Please revisit choice to emit an event for every message Please revisit decision to emit an event for every message Nov 26, 2024
@lmolkova
Copy link
Contributor

lmolkova commented Nov 27, 2024

Logging all messages as one event comes with the following limitations:

  • hard to query. Different messages have different structure. As a result, messages would become heterogeneous arrays of objects.
  • individual timestamps are lost. user message happens at the beginning of the request, tool call (if performed by the service-side agent) at a later time, final completion happens at the end
  • harder to control verbosity. With current approach, If I want only choices and don't want to capture system messages or tool calls I can configure it (OTel will provide all the hooks and options and afaik, .NET is also considering per-event-name/id control). With all-messages-in-one-event, each instrumentation would need to provide some GenAI specific options.
  • different model from assistants. OpenAI assistants, AWS Bedrock agents don't require you to re-send chat history all the time, so we'd be forced to model things on per-message basis

Squashing all messages into one reduces visual noise when using non-GenAI aware tools, but results in degraded experience for tools that do need to understand content of these messages (tools that render chat history, run evaluations, query these events in custom ways). If users use the tool that's not showing chat history nicely, they probably don't care about the chat history and can disable those events or send them to low-cost destination.

The fact that chat history needs to be sent over and over again only shows how terribly inefficient existing GenAI solutions are and I don't think we should hide it.

[Update] I'd expect any distributed application that does anything complicated to generate 10-100+ info logs for each trace even if it has nothing to do with AI. Enabling debug logs would add another order of magnitude. Existing tools don't show it nicely, that's true.

@stephentoub
Copy link
Author

Different messages have different structure. As a result, messages would become heterogeneous arrays of objects.

That's the nature of the system; they do have different structure, and they are heterogenous.

user message happens at the beginning of the request, tool call (if performed by the service-side agent) at a later time, final completion happens at the end

For non-streaming, that's not the case: the completion is logically all created at a single moment in time and shows up as a single response payload. For streaming, sure, but the spec already doesn't reflect the nature of streaming, e.g. you could get two tokens at very different times, but there's no notion of that in the spec, which necessitates that a streaming response be composed back into manufactured messages.

With all-messages-in-one-event, each instrumentation would need to provide some GenAI specific options.

I'm not following. The system needs to have a mapping of what it would emit at various verbosity levels. If it's emitting individual messages, it needs to emit them at that level. If it's emitting them altogether, it can simply query for the current verbosity level (which any respectible system does anyway before logging) and choose to compose only the pieces that are relevant at that verbosity level. Any options to configure this are relevant to both approaches.

OpenAI assistants, AWS Bedrock agents don't require you to re-send chat history all the time, so we'd be forced to model things on per-message basis

I disagree. You log what you send/receive. In the case of chat completion, what you send is a list of messages. In the case of assistants, what you send is a set of messages, it just needn't be the complete set because it may be augmented with a further set that already exists in storage. The telemetry would include what you send and receive, in both cases.

Squashing all messages into one reduces visual noise when using non-GenAI aware tools, but results in degraded experience for tools that do need to understand content of these messages (tools that render chat history, run evaluations, query these events in custom ways).

I do not see how it degrades the experience. In both cases, the tooling needs to understand the format and the data.

If users use the tool that's not showing chat history nicely, they probably don't care about the chat history

I'm a user. I care about chat history. My tool is not showing it nicely.

and I don't think we should hide it

How is it hiding it? In both cases, all the content is logged. It's just a question of whether it's logged in a way that reflects reality (they're physically and logically all part of the same prompt), or whether the telemetry is manufacturing boundaries that don't exist.

I'd expect any distributed application that does anything complicated to generate 10-100+ info logs for each trace even if it has nothing to do with AI.

100%. All the more reason to not introduce artificial boundaries where they don't exist. If I have a sentence I want to log: "This is a test of the emergency broadcast system", can you imagine if we recommended that be logged as 9 distinct entries?

logger.LogInformation("This ");
logger.LogInformation("is ");
logger.LogInformation("a ");
logger.LogInformation("test ");
logger.LogInformation("of ");
logger.LogInformation("the ");
logger.LogInformation("emergency ");
logger.LogInformation("broadcast ");
logger.LogInformation("system");

yet that's exactly what this design is suggested be done for the prompt to the LLM, splitting the prompt into pieces that are then logged individually, even though it's all just part of the same input and is not interpretable separately.

@lmolkova
Copy link
Contributor

lmolkova commented Nov 27, 2024

Let me tackle things one-by-one

That's the nature of the system; they do have different structure, and they are heterogenous.
I do not see how it degrades the experience. In both cases, the tooling needs to understand the format and the data.

Different structure is a good signal that messages should be in separate events. The more complicated the structure is, the harder it is for the end users to parse it. Tooling suffers from it too, but I'm less worried about tools.

Imagine I'm writing a query that would return all completions with finish_reason that's not "stop". I'd write something similar to (in KQL)

logs | where name == "gen_ai.choice" | extend content = parse_json(body) | where content["finish_reason"] != "stop"

I can spend 15-45 mins to write a query that'd do the same for events like one below. I've done it a few times in the past and it's not easy. The hint is to use mv-expand if I recall correctly

{
      "name": "gen_ai.content",
      "body": {
        "messages": [
          {
            "role": "system",
            "content": "You are a helpful assistant."
          },
          {
            "role": "user",
            "content": "Hello!"
          }
        ],
        "choices": [
          {
            "index": 0,
            "message": {
              "role": "assistant",
              "content": "\n\nHello there, how may I assist you today?"
            },
            "logprobs": null,
            "finish_reason": "stop"
          }
        ]
      }
    }

I encourage you to try it yourself.

You can say that it's hard to do in KQL and you might be right.

Let's compare it to JQ:

Now:

.logs[] 
| select(.name == "gen_ai.choice" and .body.content.finish_reason != "stop")

One event:

.logs[] 
| select(.name == "gen_ai.content")
| .body.choices[]
| select(.finish_reason != "stop")

both are relatively simple, yet I spent 1 min to write the first one, and a few more to write the second one

@lmolkova
Copy link
Contributor

lmolkova commented Nov 27, 2024

For non-streaming, that's not the case: the completion is logically all created at a single moment in time and shows up as a single response payload.

I disagree. You log what you send/receive. In the case of chat completion, what you send is a list of messages. In the case of assistants, what you send is a set of messages, it just needn't be the complete set because it may be augmented with a further set that already exists in storage. The telemetry would include what you send and receive, in both cases.

This is not the case for agents/assistants where you do add messages one-by-one and agent manages your chat history separately from the completions. Or when you report individual events (too calls) as they occur during thread run - PTAL at the microsoft#3

Also what you're saying that if we log HTTP request and response bodies we should log them together since they occur at the same time and describe the same thing creating visual noise in my tool.

@lmolkova
Copy link
Contributor

lmolkova commented Nov 27, 2024

I'm a user. I care about chat history. My tool is not showing it nicely.

We can fix the tool or change the telemetry (by making it harder to query and control verbosity), why should we change the telemetry and not the tool?

@stephentoub
Copy link
Author

yet I spent 1 min to write the first one, and a few more to write the second one

But with the second, I get the full content. With the first, you then need to follow it up with yet another query you need to write to extract all the messages that are part of same request or completion.

@lmolkova
Copy link
Contributor

But with the second, I get the full content. With the first, you then need to follow it up with yet another query you need to write to extract all the messages that are part of same request or completion.

If only tool can group and visualize them under spanId...

@stephentoub
Copy link
Author

This is not the case for agents/assistants where you do add messages one-by-one and agent manages your chat history separately from the completions.

Sure it is. You log what you have in the largest unit you have them.

@lmolkova
Copy link
Contributor

Sure it is. You log what you have in the largest unit you have them.

I don't think we should have different telemetry formats for agents and chat completions just because existing tooling can't show verbose logs nicely.

@stephentoub
Copy link
Author

why should we change the telemetry and not the tool?

Why should the telemetry make it harder for the tool by exposing the data factored at a finer granularity than it actually is in reality?

@stephentoub
Copy link
Author

stephentoub commented Nov 27, 2024

Sure it is. You log what you have in the largest unit you have them.

I don't think we should have different telemetry formats for agents and chat completions just because existing tooling can't show verbose logs nicely.

You're going to end up with different telemetry, regardless, eg if each turn for completions logs the same message multiple times and only once or zero times for assistants.

@stephentoub
Copy link
Author

stephentoub commented Nov 27, 2024

Also what you're saying that if we log HTTP request and response bodies we should log them together since they occur at the same time and describe the same thing creating visual noise in my tool.

I'm not saying that. The request and response are separate from each other and can be logged separately, just as an LLM prompt can be logged separately from the completion. I'm objecting to artificially splitting the request or artificially spitting the completion, just as I'd object to doing so for HTTP. I don't want a separate log entry for every HTTP header, for example.

@stephentoub
Copy link
Author

But with the second, I get the full content. With the first, you then need to follow it up with yet another query you need to write to extract all the messages that are part of same request or completion.

If only tool can group and visualize them under spanId...

Why is the tool capable of that but not basic search?

@lmolkova
Copy link
Contributor

lmolkova commented Nov 27, 2024

Why is the tool capable of that but not basic search?

What do you mean by basic search? I can write a relatively simple query that would group logs by the spanid and traceid in KQL or JQ.

You're going to end up with different telemetry, regardless, eg if each turn for completions logs the same message multiple times and only once or zero times for assistants.

I'll end up with different telemetry, but the structure of individual telemetry items will be the same.

I'm objecting to artificially splitting the request or artificially spitting the completion, just as I'd object to doing so for HTTP. I don't want a separate log entry for every HTTP header, for example.

Would you put tool definition inside prompt event? I would not because I'd like people who don't want to log tool definitions to be able to disable that event using generic log filtering/config. Also I want to have the same event defining tool call for assistants and for chat completions.

So splitting may look artificial, but it does not mean it's bad.

If we summarize:

  1. Group prompt messages together.

  2. Use individual message as event

    • Pros: easier to query, individual events are consistent with assistants, verbosity control by event name. Tools that visualize chat visualize individual messages anyway
    • Cons: looks bad in tools that do generic log visualization

I'm happy to continue the discussion and try looking for options 3, 4, etc, but picking between 1 and 2, I don't see why 1 would be better.

@samsp-msft
Copy link

Squashing all messages into one reduces visual noise when using non-GenAI aware tools, but results in degraded experience for tools that do need to understand content of these messages (tools that render chat history, run evaluations, query these events in custom ways).

Having all the data in multiple log messages is going to make analysis tools considerably harder to write - they will need to store all the messages and then run queries to piece together the logs to build the bigger picture. If all the data is in one payload, its much easier to do batch operations or parallel processing as each log entry tells the complete story for that gen_ai call.

@lmolkova
Copy link
Contributor

lmolkova commented Nov 27, 2024

Having all the data in multiple log messages is going to make analysis tools considerably harder to write - they will need to store all the messages and then run queries to piece together the logs to build the bigger picture. If all the data is in one payload, its much easier to do batch operations or parallel processing as each log entry tells the complete story for that gen_ai call.

would it mean grouping prompts, tool definitions, attachments, tool calls, tool results, and completions into one uber event that captures everything?

The moment you separate prompts from completions into different records, you get a problem of bringing them together.

I'd like to challenge assumptions that

  1. This is the absolutely most important scenario we need to optimize for, important enough to make other scenarios terrible
  2. That it's hard to group logs by span-id and trace-id. Here's the ~query logs| summarize make_list(bag_pack("name", name, "content", body)) by trace_id, span_id

@SteveSandersonMS
Copy link

Is the goal to record the underlying physical reality of the protocol, or is it to record an interpretation of what happened in a form that eases certain queries? I can see arguments both ways.

Consumers who are comfortable with the wire format will likely bias towards seeing the "true" shape of the data, whereas consumers who have never seen the wire format but just want answers to high-level queries will likely prefer whatever makes their specific queries more intuitive.

I'm happy to continue the discussion and try looking for options 3, 4, etc, but picking between 1 and 2, I don't see why 1 would be better.

TBH I don't think the pros/cons list is that clear cut. For example, when it comes to ease of querying, you could say:

  • Group prompt messages together: easier querying if your goal is to do anything with the requests/responses themselves
  • Use individual message as event: easier querying if your goal is to do anything with individual messages

Ultimately it's hard to argue from first principles that there's a single right answer here. It depends on scenarios. Maybe in the long run it will become clear that tools and consumption scenarios all settle one way or the other. If that's already settled, it's news to me.

You could make the "first principles" argument that in the absence of definite evidence that people want to vary the structure from its original shape, it's a safer to preserve the original wire format since that carries some additional level of truth. I don't know how the decision was made to switch to event-per-message so maybe there were a large number of people agreeing that's best in their scenario, but that's not clear from this GitHub issue alone.

@lmolkova
Copy link
Contributor

The principles we use to instrument libraries and write semantic conventions that have an impact on this decision are:

  • consistency across similar technologies. If OpenAI wire format changes tomorrow, we won't change semantic conventions to match it. We try to come up with a format that works for multiple backends. Also for similar scenarios we try to generate similar telemetry. It mostly does not matter how user generates completion (via assistant thread run or via chat completion that carries full history on each request), so queries/dashboard/other tools should not really care that much.
  • layered operations and separation of responsibilities. If you need protocol-level data, use HTTP/gRPC/etc telemetry - one can log HTTP request/response bodies for GenAI calls - GenAI instrumentation won't normalize/sanitize/extract important details. GenAI instrumentation creates logical operations that cover (potentially) multiple requests over arbitrary protocol (or even local ones if you run in-proc model). Toy can think about GenAI instrumentation as something wrapping the public API surface of the client - telemetry it captures intentionally has nothing to do with the protocol.
  • use-cases. We focus on what you do with telemetry (perf: throughput/latency, reliability: errors, debugging: tracing + events, evaluations: events + metrics). Tools that provide AI-specific experience, usually visualize messages separately. You might want to drop recording instructions since it's mostly static or provide them once when you create your agent. You probably want to run queries against choices.

There are many others. I encourage you to participate in semconv discussions. There are extensive discussions on the topic - check out #980 and linked issues + we have notes and recordings of calls where we discussed it - all available here https://docs.google.com/document/d/1EKIeDgBGXQPGehUigIRLwAUpRGa7-1kXB736EaYuJ2M/edit?tab=t.0#heading=h.ylazl6464n0c

@SteveSandersonMS
Copy link

Thanks @lmolkova, that's interesting and helps a bit with understanding the thought process.

To be clear when I referred to "record the underlying physical reality of the protocol" I meant representing the same conceptual information (e.g., having equivalent concepts and being at the same level of granularity), not the literal bytes.

can think about GenAI instrumentation as something wrapping the public API surface of the client

I guess that is meant quite loosely since you're also implying you don't want to reflect the differences between alternate API patterns (e.g., streaming vs nonstreaming have very different API shapes, likewise there tend to be different APIs for calling a stateless chat endpoint vs a stateful assistant).

@lmolkova
Copy link
Contributor

lmolkova commented Dec 3, 2024

@SteveSandersonMS we record multiple telemetry items for a single API call - logical span/duration metric/events + transport request spans, maybe connection establishment or dns lookup can be done in a scope of that call. I don't think we can have any strict rules on how to record details (and how many details to record) on a single telemetry item for an arbitrary call.

I.e. we analyze use-cases and try to manually optimize telemetry items for the most consistency across important use-cases.

The use-case as I see it here is "dear model, please generate some content based on this data".
You can implement it in multiple ways

  • pass content by reference (thread id or cached content) or by value (message list)
  • pass one set of related messages (chat) or a list of unrelated messages (batch)
  • stream response or not stream it.

Depending on the scenario, my telemetry will be slightly different. E.g. If I use agents I'd have gen_ai.thread.id attribute on my spans and events, the operation name will also be different. In streaming case, I'd have additional metric for time-per-token. But structurally telemetry will be similar.

If as a user I decide to use a different chat completion API while staying within the same use case (chat completion), I'd like my telemetry to change as little as possible (ideally nothing major breaks, I just get a slightly different set of attributes describing the details).

For content, the unit that remains almost the same is the message itself. I can create it, pass it by value or reference. If the model reports it, I'd like to know how many tokens were in that message (and if it doesn't, it probably will at some point) and I'd like to have a standard place to get it from in my queries.

You're right that it's applied loosely and how it's applied depends on the level of abstraction and what we consider to be the same use-case. It seems to be the core of this discussion.

@mikeldking
Copy link

commenting here mainly to follow along.

I'm one of the maintainers of a few LLM instrumentors (https://github.com/Arize-ai/openinference) as well as an OTEL backend (https://github.com/Arize-ai/phoenix) and have had success with messages being a part of the span attributes payload. It's been a pretty successful direction for us as it's allowed for some helpful debugging flows (evaluations, replay of the invocation, etc.) - I understand that this mainly product related an not related to semantic conventions per-se but just wanted to express that I do echo some of the concerns @stephentoub expressed above.

Let me know if I can be of any help as we've gone through the implementation of messages as attributes end-to-end and might be able to provide some perspective.

@TaoChenOSU
Copy link
Contributor

With the current convention, I have observed that when emitting an event for each message, the precision of the timestamps is crucial for end users to accurately reconstruct the sequence of messages.

In my experience on Azure Application Insights, I frequently notice that chat messages appear in a seemingly random order, because all messages have the same timestamps, despite they already being six digits into the second:
image

@lzchen
Copy link

lzchen commented Dec 19, 2024

@TaoChenOSU

I've created an issue in the Python repo that might address the issue you are observing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

7 participants