You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The amount of context to pass in with a prompt is a balance between cost and complexity as well as an exercise in not overwhelming the LLM. See documentation.
Currently the prompt provides the git diff with +/- a few lines for each patch hunk (or multiple if the diff is too large). A best effort is made to dynamically expand the context window to include the function/class definition up to max_extra_lines_before_dynamic_context lines. The benefit of using the git diff as a whole is the LLM has context of presumably related changes in the same commit across multiple files. This allows it to better understand possible call trees. A disadvantage is the context may lack information about field and function definitions as well as longer functions or call trees that involve existing functions that don't show up in the diff. For example, I've seen the bot often get confused about if variables are null or if validation has been done because the definitions and validation code doesn't make it into the context window.
Another approach would be to provide by-file context. For each set of patches for a file, provide the entire file as context to the VM. This provides the opposite advantage/disadvantage as the by-diff approach above. The LLM will have more context about the file, but will lack context of cross-file call trees. This approach works better for understanding the meaning of variables or functions interwoven through the file. There is a risk the LLM gets overwhelmed on large files but this is no different than large diffs or creating a large file all at once.
I propose using both approaches and collecting together the results. The prompt for the by-diff could be instructed to focus on cross-file changes whereas the by-file prompt would focus on single file changes.
Given that this approach is unproven and would result in cost increase I suggest it be made configurable - context-by-diff, context-by-file, or both. Default to context-by-diff for the time being as that is the existing behavior.
The text was updated successfully, but these errors were encountered:
Thanks for the feedback @MarkRx
This is a complicated issue, and my tl;dr instinct is that it will degrade results (on average):
a PR change can be a line or two. Files can be very large. I estimate giving the entire file will increase the number of input tokens by x5-10 on average. That's a lot. In the end, the model will always have limited context. Even giving all the relevant PR files is not enough - maybe other files in the repo are also needed ?
leave the cost (and speed) issue a side, model's quality degrade significantly when you give them large input. Especially when most of this input is irrelevant. People call it "needle in the haystack", but it's more like finding many needles in many haystacks. It's very hard for the model to find code problems in that scenario. And will be even harder when we combine diff code and regular code together.
In general, doing a complicated change as 'optional' (turned off by default) is not the way to go. very few people, if any, will turn it on. I think also evaluating the impact for the common user will be very hard.:
LLM are not deterministic
Specific PRs where more context is needed might improve, but other PRs would suffer due to harder input. It will be difficult to compare properly.
regarding I've seen the bot often get confused about if variables are null or if validation has been done because the definitions and validation code doesn't make it into the context window: I have to say I have not encountered this often with the new 'focus on problems' mode, and claude sonnet with updated prompt. If you want top results, I recommend using both.
The amount of context to pass in with a prompt is a balance between cost and complexity as well as an exercise in not overwhelming the LLM. See documentation.
Currently the prompt provides the git diff with +/- a few lines for each patch hunk (or multiple if the diff is too large). A best effort is made to dynamically expand the context window to include the function/class definition up to
max_extra_lines_before_dynamic_context
lines. The benefit of using the git diff as a whole is the LLM has context of presumably related changes in the same commit across multiple files. This allows it to better understand possible call trees. A disadvantage is the context may lack information about field and function definitions as well as longer functions or call trees that involve existing functions that don't show up in the diff. For example, I've seen the bot often get confused about if variables are null or if validation has been done because the definitions and validation code doesn't make it into the context window.Another approach would be to provide by-file context. For each set of patches for a file, provide the entire file as context to the VM. This provides the opposite advantage/disadvantage as the by-diff approach above. The LLM will have more context about the file, but will lack context of cross-file call trees. This approach works better for understanding the meaning of variables or functions interwoven through the file. There is a risk the LLM gets overwhelmed on large files but this is no different than large diffs or creating a large file all at once.
I propose using both approaches and collecting together the results. The prompt for the by-diff could be instructed to focus on cross-file changes whereas the by-file prompt would focus on single file changes.
Given that this approach is unproven and would result in cost increase I suggest it be made configurable - context-by-diff, context-by-file, or both. Default to context-by-diff for the time being as that is the existing behavior.
The text was updated successfully, but these errors were encountered: