## Phase V
Incorporating claude-3-5-sonnet-20241022, which is a powerful new model.
For this run, the test patches are based on the existing data available from gpt-4o
solve runs, supplemented with new patches discovered by sonnet runs.
All code patches are generated by sonnet.
## Phase IV
The workflow official.yml is run to create test patches and preliminary solve results.
https://docs.google.com/spreadsheets/d/1yGXKz6jJEV30bqfCKg9b7tNFq5Fqkee5S4aesmCfifs/edit?gid=0#gid=0
284 test patches are available. The initial solve rate at 8,000 tokens was 32.4%.
The next step is to re-run the solver on all non-optimal instances, with an increased token limit.
These results will be merged with the initial results (workflow runs 17-22 of official.yml).
A straightforward way to do this is to follow the method of Phase III (a sketch follows the list):
1. Build an instance set of those instances that are not optimally solved in the first phase.
2. Collect the optimal test patches from the initial results into data/test_patches.
3. Re-solve the non-optimal examples.
4. Combine the results with the initial solutions.
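A minimal sketch of steps 1-4. The results file format, field names, and directory layout here are assumptions for illustration, not the actual harness API:
```python
# Hypothetical sketch of the retry/merge procedure; paths and JSON fields are assumed.
import json
import shutil
from pathlib import Path

initial = json.loads(Path("results/initial_solutions.json").read_text())

# 1. Instance set of everything not optimally solved in the first phase.
retry_ids = [r["instance_id"] for r in initial if r.get("code_status") != "optimal"]
Path("data/instance_sets").mkdir(parents=True, exist_ok=True)
Path("data/instance_sets/phase_iv_retry.txt").write_text("\n".join(retry_ids) + "\n")

# 2. Collect the optimal test patches from the initial results.
Path("data/test_patches").mkdir(parents=True, exist_ok=True)
for r in initial:
    if r.get("test_status") == "optimal" and r.get("test_patch_path"):
        shutil.copy(r["test_patch_path"], f"data/test_patches/{r['instance_id']}.patch")

# 3. Re-solve the retry set with the increased token limit (run separately), then:
retry = json.loads(Path("results/retry_solutions.json").read_text())

# 4. Combine, letting the retry results replace the earlier non-optimal ones.
merged = {r["instance_id"]: r for r in initial}
merged.update({r["instance_id"]: r for r in retry})
Path("results/merged_solutions.json").write_text(json.dumps(list(merged.values()), indent=2))
```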
## Phase III
Phase III data is collected here -
https://docs.google.com/spreadsheets/d/1BJXnulBWL8CA0FTNk82nyBWq_37unpuIk2iT6UlSkoY/edit?gid=437586290#gid=437586290
In this phase, the solve.yml workflow was used to generate test patches and optimal code solutions.
See workflow runs 223-225.
Subsequently, the instances with no optimal code patches were collected into no_code_patch_<date>_[12].txt
datasets. These datasets were re-run with the test patches available.
Results from the no_code_patch runs replace the instances from 223-225 that were non-optimal.
## Score and pass rate analysis
Re-try batch 3 with synthetic tests and new solver limits - test_files=0 code_files=6 code_status_retry=3
https://github.com/getappmap/navie-benchmark/actions/runs/10905207717
33_3 0,3,3 64k notest noobserve gpt-4o #226
This run will only produce code patches; it will not produce test patches.
The run will not produce code patches when the test patch cannot be identified.
It can be used to:
* Obtain and evaluate the selection of the correct code file
* Check to see if the solution rate is better with 64k context for examples that have
test and inverted patches in both #225 and #226
verified_33_3 3,3,3,3 -test -observe 16k gpt-4o #225
Disappointing, with a 25.9% resolved rate.
Code file match % 52%
Test file match % 18% *
Test patch gen % 61%
Inverted patch gen % 53%
Score = 0 20 12%
Score = 1 91 56%
Score = 2 30 19%
Score = 3 20 12%
Resolved (=0) 1 5%
Resolved (=1) 18 20%
Resolved (=2) 13 43%
Resolved (=3) 10 50%
Resolved % 25.9%
verified_33 3,3,3,3 -test -observe 16k gpt-4o #223
Code file match % 64% *
Test file match % 28% *
Test patch gen % 65%
Inverted patch gen % 56%
Score = 0 25 15%
Score = 1 81 50%
Score = 2 24 15%
Score = 3 33 20%
Resolved (=0) 6 24%
Resolved (=1) 19 23%
Resolved (=2) 8 33%
Resolved (=3) 18 55%
Resolved % 31.3%
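(For reference: in these tables the first number is the instance count in the bucket, and the second is the share of all instances for the Score rows or the resolved rate within the bucket for the Resolved rows. The overall Resolved % follows by summing the buckets, e.g. for #223: (6 + 19 + 8 + 18) / (25 + 81 + 24 + 33) = 51 / 163 ≈ 31.3%.)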
verified_30_pct 3,1,3,3 32k gpt-4o #217
Uses /noterms and no text search limit. Uses the test database.
Code file match % 63% *
Test file match % 22% *
Test patch gen % 78%
Inverted patch gen % 75%
Score = 0 18 11%
Score = 1 81 49%
Score = 2 27 16%
Score = 3 39 24%
Resolved (=0) 4 22%
Resolved (=1) 22 27%
Resolved (=2) 9 33%
Resolved (=3) 24 62%
Resolved % 35.8%
verified_30_pct 3,1,3,3 32k gpt-4o #201
Uses /terms and has the text search limit
Code file match % 25% *
Test file match % 21%
Test patch gen % 78%
Inverted patch gen % 74%
Score = 0 12 7%
Score = 1 106 65%
Score = 2 16 10%
Score = 3 27 16%
Resolved (=0) 2 17%
Resolved (=1) 18 17%
Resolved (=2) 5 31%
Resolved (=3) 17 63%
Resolved % 25.6%
* Plan is too strict about editing a single file
* Progress update
## verified_33_2 3,3,3,3 -test -observe 16k gpt-4o #224
## verified_33_1 3,3,3,3 -test -observe 16k gpt-4o #223
33% run using /noterms and no character limit on text search.
It does not use the existing synthetic tests, and the observe flag is off.
Overall score rates are lower than #217, below.
Pass rate at Score=3 was 55%.
## verified_30_pct 3,1,3,3 32k gpt-4o #217
This is a 33% run which uses the /noterms flag and removes the character limit on text search.
It uses the existing synthetic tests. Observe tests flag is on.
Pass rate at Score=3 was 62%.
* Use /noterms, remove 50k file limit
https://docs.google.com/spreadsheets/d/1lCT67I8WK64dQjzcIo5KQ3gf9Z_Q6vj5LS8Wga7tjk4/edit?gid=1462583265#gid=1462583265
* Sonnet rate limits are unhandled
https://github.com/getappmap/appmap-js/issues/1996
* AppMap data doesn't seem effective in identifying the code patch file
https://docs.google.com/spreadsheets/d/1lCT67I8WK64dQjzcIo5KQ3gf9Z_Q6vj5LS8Wga7tjk4/edit?gid=1462583265#gid=1462583265
* Parameter comparison tests
Running the 30% instance set
============================
* 10884295754 #198
1/1 in test, 3/3 in code, 8k context tokens
25% solve rate overall
Edit-test-file rate of only 74%. Needs more edit-test-file attempts to find a pass-to-pass test.
60% of issues end with score=1, and only 13% of those pass. Increase the code limits to try to find a solution.
Increase the token limit to improve the solver chances.
* 10885877926 #201
3/1 in test, 3/3 in code, 32k context tokens
Sonnet
======
8k token limit
https://github.com/getappmap/navie-benchmark/issues/33
https://github.com/getappmap/navie-benchmark/issues/32
Token limit #139 vs #136
============
16k token limit vs 8k
Solve rate of 25.3% vs 21.6%
NOTE: verified_10_pct 1,1,3,3 sonnet #199 was not actually sonnet; it was gpt-4o. Re-run as https://github.com/getappmap/navie-benchmark/actions/runs/10885877926
* Patch generation progress
Test patches
============
10863347641 - 32 patches on 218 examples
10863770934 - 22 patches on 186 examples
Code patches
============
10863315253 - 14 patches on 160 examples
TODO: Are we only looking for code solutions when we have a patch solution?
*** We should be adding newly solved examples to the code instances
* Summarize the root cause of test errors before feeding them back in
* Sonnet - truncated output
solve/django__django-14034/navie/generate-code/attempt-1/code-1/generate/generate.md
/Users/kgilpin/Downloads/solve-1 (3)/django__django-11211/navie/generate-code/attempt-1/code-1/generate/generate.md
<change>
<file change-number-for-this-file="1">django/db/models/fields/related_lookups.py</file>
<original line-count="14" no-ellipsis="true"><![CDATA[
class RelatedLookupMixin:
    def get_prep_lookup(self):
        if not isinstance(self.lhs, MultiColSource) and self.rhs_is_direct_value():
* If there is little or no AppMap data, supplement with Context
* astropy__astropy-14365 missing
https://github.com/getappmap/navie-benchmark/actions/runs/10819249064
- [ ] Limit the size of AppMap derived context
* There needs to be a limit on the amount of context selected for the edit test file.
Select the edit test file with the smallest AppMap data?
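A possible selection heuristic for the question above, sketched under the assumption that each candidate test file maps to a set of AppMap files on disk (the mapping and the size budget are assumptions):
```python
# Hypothetical heuristic: prefer the candidate edit-test file whose associated
# AppMap data is smallest, keeping derived context under a size budget.
from pathlib import Path

MAX_APPMAP_BYTES = 200_000  # assumed budget for AppMap-derived context

def appmap_size(appmap_files: list[Path]) -> int:
    return sum(f.stat().st_size for f in appmap_files)

def pick_edit_test_file(candidates: dict[Path, list[Path]]) -> Path:
    """candidates maps each test file to its AppMap files (assumed input shape)."""
    return min(candidates, key=lambda test_file: appmap_size(candidates[test_file]))
```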
* Custom context should be a "system" prompt rather than a user code selection, at least to match the behavior of the frontend.
* Verify that this test passes if the patch is reverted:
solve/django__django-13658/navie/generate-code/attempt-1/run-test/test-patch/code_0.patch
[workflow] (django__django-13658) Adapting test file: tests/admin_scripts/management/commands/base_command.py
[test/lint-repair] (django__django-13658) Making attempt 1 to generate code that lints cleanly
[generate-test] (django__django-13658) Generated test file: tests/admin_scripts/management/commands/management_utility_command_parser_fix_test.py
[test/lint-repair] (django__django-13658) Code lints cleanly
[generate-test] (django__django-13658) Test patch generated after 1 attempts.
HEAD is now at 48212eb Baseline commit
[workflow] (django__django-13658) Cleaned git state
[run-test] (django__django-13658) Running tests tests/admin_scripts/management/commands/management_utility_command_parser_fix_test.py
in /Users/kgilpin/source/appland/navie-benchmark/solve/django__django-13658/navie/generate-test/attempt-1_from-base_command.py/test-2/run-test/test-patch
[execute-container] (django__django-13658)
Saved output to log file: /Users/kgilpin/source/appland/navie-benchmark/solve/django__django-13658/navie/generate-test/attempt-1_from-base_command.py/test-2/run-test/test-patch/run_test.log
[run-test] (django__django-13658) Test tests/admin_scripts/management/commands/management_utility_command_parser_fix_test.py completed with status TestStatus.PASSED.
[workflow] (django__django-13658) Test passed. Accepting test.
[workflow] (django__django-13658) Inverting test
HEAD is now at 48212eb Baseline commit
[workflow] (django__django-13658) Cleaned git state
[invert-test/lint-repair] (django__django-13658) Making attempt 1 to generate code that lints cleanly
[generate-test] (django__django-13658) Generated test file: tests/admin_scripts/management/commands/management_utility_command_parser_fix_test_inverted.py
[invert-test/lint-repair] (django__django-13658) Code lints cleanly
HEAD is now at 48212eb Baseline commit
[workflow] (django__django-13658) Cleaned git state
[invert-test] (django__django-13658) Test patch inverted after 1 attempts.
[run-test] (django__django-13658) Running tests tests/admin_scripts/management/commands/management_utility_command_parser_fix_test_inverted.py in /Users/kgilpin/source/appland/navie-benchmark/solve/django__django-13658/navie/generate-test/attempt-1_from-base_command.py/test-2/invert
[execute-container] (django__django-13658) Saved output to log file: /Users/kgilpin/source/appland/navie-benchmark/solve/django__django-13658/navie/generate-test/attempt-1_from-base_command.py/test-2/invert/run_test.log
[run-test] (django__django-13658) Test tests/admin_scripts/management/commands/management_utility_command_parser_fix_test_inverted.py completed with status TestStatus.FAILED.
[workflow] (django__django-13658) Inverted test failed with the expected marker error. Accepting test.
[workflow] (django__django-13658) Optimal test patch generated for tests/admin_scripts/management/commands/base_command.py
[workflow] (django__django-13658) Patch file generated to /Users/kgilpin/source/appland/navie-benchmark/solve/django__django-13658/navie/test.patch
[workflow] (django__django-13658) Patch file generated to /Users/kgilpin/source/appland/navie-benchmark/solve/django__django-13658/navie/test-inverted.patch
* Repeatedly emitting the following:
+class ManagementUtilityCommandParserFixTest(SimpleTestCase):
+    def test_prog_name_from_argv(self):
+        original_argv = sys.argv
+        try:
+            # Simulate an environment where sys.argv[0] is None
+            sys.argv = [None, 'test_command']
+            with self.assertRaises(TypeError):
+                execute_from_command_line(['manage.py', 'test_command'])
* Feed test errors back into code generation
TODO: Utilize the same logic with test generation?
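A rough sketch of what feeding test errors back might look like; the prompt shape and function names are assumptions, not the solver's actual interface:
```python
# Hypothetical sketch: append the tail of the failing test log to the next
# code-generation prompt. Names and prompt wording are illustrative only.
from pathlib import Path

def tail(log_file: Path, max_lines: int = 50) -> str:
    lines = log_file.read_text(errors="replace").splitlines()
    return "\n".join(lines[-max_lines:])

def retry_prompt(issue_text: str, previous_patch: str, log_file: Path) -> str:
    return (
        f"{issue_text}\n\n"
        "The previous patch failed the generated test. Relevant test output:\n"
        f"{tail(log_file)}\n\n"
        f"Previous patch:\n{previous_patch}\n\n"
        "Revise the patch so that the test passes."
    )
```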
* Patch file matches, but no solution obtained
https://docs.google.com/spreadsheets/d/1lCT67I8WK64dQjzcIo5KQ3gf9Z_Q6vj5LS8Wga7tjk4/edit?gid=539801750#gid=539801750
* Ignore user-provided Python versions
```sh
python3.8 -m pip install --user --upgrade 'git+git://github.com/sphinx-doc/[email protected]#egg=sphinx'
```
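One possible way to ignore a hard-coded interpreter version like the one above (a sketch; the rewrite rule is an assumption about what "ignore" should mean here):
```python
# Hypothetical sketch: rewrite user-supplied commands that pin a specific
# "pythonX.Y" so they use the testbed's interpreter instead.
import re
import sys

def normalize_python(command: str) -> str:
    return re.sub(r"\bpython3\.\d+\b", lambda _: sys.executable, command)

# e.g. normalize_python("python3.8 -m pip install ...") -> "<testbed python> -m pip install ..."
```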
## Python environment
Do not use Python features that are not available in this Python version.
Python 3.11.5
* Do not take the user description overly literally:
+        self.locale_dir = os.path.join(self.repo_dir, "locale", "da", "LC_MESSAGES")
+        self.build_dir = os.path.join(self.repo_dir, "_build", "html")
+        self.index_html = os.path.join(self.build_dir, "index.html")
+
+        # Clone the repository
+        subprocess.run(["git", "clone", self.repo_url, self.repo_dir], check=True)
+        subprocess.run(["git", "checkout", self.commit_hash], cwd=self.repo_dir, check=True)
+
+        # Create a virtual environment and install Sphinx
+        subprocess.run(["python3", "-m", "venv", "env"], cwd=self.repo_dir, check=True)
+        subprocess.run([os.path.join(self.repo_dir, "env", "bin", "pip"), "install", "sphinx"], check=True)
* Require a code file to be present in the trace of the test case.
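A sketch of that check; how the trace is represented is an assumption (here, a list of file paths extracted from the test's AppMap):
```python
# Hypothetical check: only accept a generated test if at least one non-test
# source file appears in its execution trace.
def trace_touches_code(trace_files: list[str]) -> bool:
    def is_test_file(path: str) -> bool:
        return "/tests/" in path or path.split("/")[-1].startswith("test_")
    return any(path.endswith(".py") and not is_test_file(path) for path in trace_files)
```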
* Feed test errors back into the solver.
[run-test] (sympy__sympy-17318) Interpreting test output from log file: /Users/kgilpin/source/appland/navie-benchmark/solve/sympy__sympy-17318/navie/generate-test/attempt-1_from-test_sqrtdenest.py/test-2/run-test/test-patch/run_test.log
[run-test] (sympy__sympy-17318) Interpreting test output from log file: /Users/kgilpin/source/appland/navie-benchmark/solve/sympy__sympy-17318/navie/generate-test/attempt-3_from-test_radsimp.py/test-1/run-test/test-patch/run_test.log
Ignore failures due to deprecation?
sympy__sympy-17318 - SOLVED - IN https://docs.google.com/spreadsheets/d/1gwF8VrAeWqsq6yH91Y5zvCGPIhYKA9M5FQG3KmwYYDk/edit?gid=1702640221#gid=1702640221
Not in https://docs.google.com/spreadsheets/d/1lCT67I8WK64dQjzcIo5KQ3gf9Z_Q6vj5LS8Wga7tjk4/edit?gid=14622705#gid=14622705
* Premature acceptance of "best" code patch
[generate-and-validate-code] (sympy__sympy-15599) Code patch succeeded the pass-to-pass test, and there are no test patches to try. Accepting code patch.
[workflow] (sympy__sympy-15599) Optimal code patch generated (for available tests)
* Instances with empty patches: 5
https://github.com/getappmap/navie-benchmark/actions/runs/10761576039/job/29840927112#step:7:5439
* Extend the token limit by the size of the observed errors
* Generate multiple sets of pass to fail and fail to pass examples?
* Score the patch based on passing all the pass_to_pass tests?
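If scoring were extended to cover all pass-to-pass tests, it might look roughly like this; the existing scoring scheme and these names are assumptions:
```python
# Hypothetical scoring sketch: count how many of the available checks a code
# patch satisfies, requiring every pass-to-pass test rather than a single one.
def score_patch(passes_fail_to_pass: bool, inverted_test_fails: bool,
                pass_to_pass_results: list[bool]) -> int:
    score = 0
    if passes_fail_to_pass:
        score += 1
    if inverted_test_fails:
        score += 1
    if pass_to_pass_results and all(pass_to_pass_results):
        score += 1
    return score
```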
* Generates the same bad patch on both run_test attempts
/Users/kgilpin/Downloads/solve-0/scikit-learn__scikit-learn-13328/navie/generate-code/2/run_test/6db4f467feabd60a471aa9f5d3169f28326cd0bd40bbc7a8f8b20a84adbf91a2/run_test.log
Try a new test to patch
* Analyze performance with respect to these parameters:
Edited file limit
Lint retries
Test retries
Code retries
* Increase context size on retries? Or overall? Or make configurable?
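One simple way to make this configurable while escalating on retries; the base, growth factor, and cap values are illustrative assumptions:
```python
# Hypothetical sketch: escalate the context token limit on each retry, capped
# at a configurable maximum.
def context_limit(attempt: int, base: int = 8_000, factor: int = 2, cap: int = 64_000) -> int:
    return min(base * factor ** attempt, cap)

# attempt 0 -> 8k, attempt 1 -> 16k, attempt 2 -> 32k, attempt 3 -> 64k
```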
* Test is getting "SKIPPED"
https://github.com/getappmap/navie-benchmark/actions/runs/10753160219/job/29822284291#step:7:397
[generate-code] (django__django-14559) Code patch generated after 1 attempts.
[generate-and-validate-code] (django__django-14559) Running pass-to-pass test for attempt 2
[workflow] (django__django-14559) Running test
[run-test] (django__django-14559) Running tests tests/postgres_tests/test_bulk_update.py in /home/runner/work/navie-benchmark/navie-benchmark/solve/django__django-14559/navie/generate-code/attempt-2/run-test/pass-to-pass
[run-test] (django__django-14559) Creating run-test container for django__django-14559...
[run-test] (django__django-14559) Test run includes 1 code patches.
[run-test] (django__django-14559) Test tests/postgres_tests/test_bulk_update.py completed with status TestStatus.SKIPPED.
* "Code patch is not optimal" can be emitted when there is no successful edit test file
[generate-and-validate-code] (django__django-13658) Code patch is not optimal. Will look for a better patch.
* ModelChoiceField doesn't select the expected file (django/forms/models)
Context: ModelChoiceField Value ValidationError invalid choice Django
Instructions: Summarize the issue and design a solution involving at most one file modification.
---
Terms: +ModelChoiceField ValidationError invalid invalid_choice value template Django
940ms [vectorTerms] +ModelChoiceField ValidationError invalid invalid_choice value template Django
Explain received context request: search
[collectContext] keywords: model choice modelchoice field choicefield validation error validationerror invalid invalid choice invalidchoice value template django
* Just report presence of test frameworks (pytest, unittest) to the solver?
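A minimal detection sketch for reporting frameworks to the solver; the markers checked are assumptions, not a definitive detector:
```python
# Hypothetical sketch: report which test frameworks appear to be present,
# based on common marker files and imports.
from pathlib import Path

def detect_test_frameworks(repo: Path) -> list[str]:
    frameworks = []
    if (repo / "pytest.ini").exists() or list(repo.rglob("conftest.py")):
        frameworks.append("pytest")
    if any("import unittest" in f.read_text(errors="replace") for f in repo.rglob("test_*.py")):
        frameworks.append("unittest")
    return frameworks
```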
* Split workflow.run into solve_test and solve_code
* Malformed patch
<change>
<file change-number-for-this-file="3">/home/runner/work/navie-benchmark/navie-benchmark/solve/astropy__astropy-14365/source/astropy/io/ascii/qdp.py</file>
<original line-count="14" no-ellipsis="true"><![CDATA[
if datatype.startswith("data"):
    # The first time I find data, I define err_specs
    if err_specs == {} and command_lines != "":
        for cline in command_lines.strip().split("\n"):
            command = cline.strip().split()
            # This should never happen, but just in case.
            if len(command) < 3:
                continue
            err_specs[command[1].lower()] = [int(c) for c in command[2:]]
    if colnames is None:
        colnames = _interpret_err_lines(err_specs, ncol, names=input_colnames)
    if current_rows is None:
        current_rows = []
]]></original>
<modified line-count="14" no-ellipsis="true"><![CDATA[
if datatype.startswith("data"):
    # The first time I find data, I define err_specs
    if err_specs == {} and command_lines != "":
        for cline in command_lines.strip().split("\n"):
            command = cline.strip().split()
            # This should never happen, but just in case.
            if len(command) < 3:
                continue
            err_specs[command[1].lower()] = [int(c) for c in command[2:]]
    if colnames is None:
        colnames = _interpret_err_lines(err_specs, ncol, names=input_colnames)
    if current_rows is None:
        current_rows = []
]]></modified>
</change>
* Found 2 changes, but the limit is 1
https://github.com/getappmap/navie-benchmark/actions/runs/10655942503/job/29534052664
[workflow/generate-code] (pytest-dev__pytest-7490) Found 2 changes, but the limit is 1
[workflow/generate-code] (pytest-dev__pytest-7490) Applied code changes to src/_pytest/skipping.py, src/_pytest/nodes.py
[code/lint-repair] (pytest-dev__pytest-7490) Code has lint errors: src/_pytest/nodes.py:287:25: F821 undefined name 'xfailed_key'
* astropy example fails because the installed numpy version is too new
https://stackoverflow.com/questions/74946845/attributeerror-module-numpy-has-no-attribute-int
sweb.eval.x86_64.astropy__astropy-8707
astropy/table/_np_utils.pyx:15: in init astropy.table._np_utils
DTYPE = np.int
/opt/miniconda3/envs/testbed/lib/python3.9/site-packages/numpy/__init__.py:319: in __getattr__
raise AttributeError(__former_attrs__[attr])
E AttributeError: module 'numpy' has no attribute 'int'.
E `np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
E The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
E https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
[run-test] (astropy__astropy-8707) Interpreting test output from log file: /Users/kgilpin/source/appland/navie-benchmark/solve/astropy__astropy-8707/navie/generate-test/attempt-5_from-test_division.py/test-1/run-test/test-patch/run_test.log
Also note this other deprecation warning:
self = <astropy.io.fits.tests.test_header_fromstring_bytes.TestHeaderFromStringBytes object at 0x400b83be50>
request = <SubRequest '_xunit_setup_method_fixture_TestHeaderFromStringBytes' for <Function test_card_fromstring_str>>
    @fixtures.fixture(
        autouse=True,
        scope="function",
        # Use a unique name to speed up lookup.
        name=f"_xunit_setup_method_fixture_{self.obj.__qualname__}",
    )
    def xunit_setup_method_fixture(self, request) -> Generator[None, None, None]:
        method = request.function
        if setup_method is not None:
            func = getattr(self, setup_name)
            _call_with_optional_argument(func, method)
        if emit_nose_setup_warning:
>           warnings.warn(
                NOSE_SUPPORT_METHOD.format(
                    nodeid=request.node.nodeid, method="setup"
                ),
                stacklevel=2,
            )
E pytest.PytestRemovedIn8Warning: Support for nose tests is deprecated and will be removed in a future release.
E astropy/io/fits/tests/test_header_fromstring_bytes.py::TestHeaderFromStringBytes::test_card_fromstring_str is using nose-specific method: `setup(self)`
E To remove this warning, rename it to `setup_method(self)`
E See docs: https://docs.pytest.org/en/stable/deprecations.html#support-for-tests-written-for-nose
/opt/miniconda3/envs/testbed/lib/python3.9/site-packages/_pytest/python.py:898: PytestRemovedIn8Warning
=========================== short test summary info ============================
ERROR astropy/io/fits/tests/test_header_fromstring_bytes.py::TestHeaderFromStringBytes::test_header_fromstring_bytes
ERROR astropy/io/fits/tests/test_header_fromstring_bytes.py::TestHeaderFromStringBytes::test_header_fromstring_str
ERROR astropy/io/fits/tests/test_header_fromstring_bytes.py::TestHeaderFromStringBytes::test_card_fromstring_bytes
ERROR astropy/io/fits/tests/test_header_fromstring_bytes.py::TestHeaderFromStringBytes::test_card_fromstring_str
============================== 4 errors in 7.49s ===============================