Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proposal] Dual by-diff and by-file context #1445

Open
MarkRx opened this issue Jan 7, 2025 · 1 comment
Open

[Proposal] Dual by-diff and by-file context #1445

MarkRx opened this issue Jan 7, 2025 · 1 comment

Comments

@MarkRx
Copy link
Contributor

MarkRx commented Jan 7, 2025

The amount of context to pass in with a prompt is a balance between cost and complexity as well as an exercise in not overwhelming the LLM. See documentation.

Currently the prompt provides the git diff with +/- a few lines for each patch hunk (or multiple if the diff is too large). A best effort is made to dynamically expand the context window to include the function/class definition up to max_extra_lines_before_dynamic_context lines. The benefit of using the git diff as a whole is the LLM has context of presumably related changes in the same commit across multiple files. This allows it to better understand possible call trees. A disadvantage is the context may lack information about field and function definitions as well as longer functions or call trees that involve existing functions that don't show up in the diff. For example, I've seen the bot often get confused about if variables are null or if validation has been done because the definitions and validation code doesn't make it into the context window.

Another approach would be to provide by-file context. For each set of patches for a file, provide the entire file as context to the VM. This provides the opposite advantage/disadvantage as the by-diff approach above. The LLM will have more context about the file, but will lack context of cross-file call trees. This approach works better for understanding the meaning of variables or functions interwoven through the file. There is a risk the LLM gets overwhelmed on large files but this is no different than large diffs or creating a large file all at once.

I propose using both approaches and collecting together the results. The prompt for the by-diff could be instructed to focus on cross-file changes whereas the by-file prompt would focus on single file changes.

Given that this approach is unproven and would result in cost increase I suggest it be made configurable - context-by-diff, context-by-file, or both. Default to context-by-diff for the time being as that is the existing behavior.

@mrT23
Copy link
Collaborator

mrT23 commented Jan 10, 2025

Thanks for the feedback @MarkRx
This is a complicated issue, and my tl;dr instinct is that it will degrade results (on average):

  • a PR change can be a line or two. Files can be very large. I estimate giving the entire file will increase the number of input tokens by x5-10 on average. That's a lot. In the end, the model will always have limited context. Even giving all the relevant PR files is not enough - maybe other files in the repo are also needed ?

  • leave the cost (and speed) issue a side, model's quality degrade significantly when you give them large input. Especially when most of this input is irrelevant. People call it "needle in the haystack", but it's more like finding many needles in many haystacks. It's very hard for the model to find code problems in that scenario. And will be even harder when we combine diff code and regular code together.

  • In general, doing a complicated change as 'optional' (turned off by default) is not the way to go. very few people, if any, will turn it on. I think also evaluating the impact for the common user will be very hard.:

    • LLM are not deterministic
    • Specific PRs where more context is needed might improve, but other PRs would suffer due to harder input. It will be difficult to compare properly.
  • regarding I've seen the bot often get confused about if variables are null or if validation has been done because the definitions and validation code doesn't make it into the context window: I have to say I have not encountered this often with the new 'focus on problems' mode, and claude sonnet with updated prompt. If you want top results, I recommend using both.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants