## Phase V
Incorporating claude-3-5-sonnet-20241022, which is a powerful new model.
For this run, the test patches are based on the existing data available from gpt-4o
solve runs, supplemented with new patches discovered by sonnet runs.
All code patches are generated by sonnet.
## Phase IV
The workflow official.yml is run to create test patches and preliminary solve results.
https://docs.google.com/spreadsheets/d/1yGXKz6jJEV30bqfCKg9b7tNFq5Fqkee5S4aesmCfifs/edit?gid=0#gid=0
284 test patches are available. The initial solve rate at 8,000 tokens was 32.4%.
The next step is to re-run the solver on all non-optimal instances, with an increased token limit.
These results will be merged with the initial results (workflow runs 17-22 of official.yml).
A straightforward way to do this is to follow the method of Phase III (a sketch follows the list):
1. Build an instance set of those instances that are not optimally solved in the first phase.
2. Collect the optimal test patches from the initial results into data/test_patches.
3. Re-solve the non-optimal examples.
4. Combine the results with the initial solutions.
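A minimal sketch of steps 1-4. The results file format, field names, and directory layout here are assumptions for illustration, not the actual harness API:
```python
# Hypothetical sketch of the retry/merge procedure; paths and JSON fields are assumed.
import json
import shutil
from pathlib import Path

initial = json.loads(Path("results/initial_solutions.json").read_text())

# 1. Instance set of everything not optimally solved in the first phase.
retry_ids = [r["instance_id"] for r in initial if r.get("code_status") != "optimal"]
Path("data/instance_sets").mkdir(parents=True, exist_ok=True)
Path("data/instance_sets/phase_iv_retry.txt").write_text("\n".join(retry_ids) + "\n")

# 2. Collect the optimal test patches from the initial results.
Path("data/test_patches").mkdir(parents=True, exist_ok=True)
for r in initial:
    if r.get("test_status") == "optimal" and r.get("test_patch_path"):
        shutil.copy(r["test_patch_path"], f"data/test_patches/{r['instance_id']}.patch")

# 3. Re-solve the retry set with the increased token limit (run separately), then:
retry = json.loads(Path("results/retry_solutions.json").read_text())

# 4. Combine, letting the retry results replace the earlier non-optimal ones.
merged = {r["instance_id"]: r for r in initial}
merged.update({r["instance_id"]: r for r in retry})
Path("results/merged_solutions.json").write_text(json.dumps(list(merged.values()), indent=2))
```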
## Phase III
Phase III data is collected here -
https://docs.google.com/spreadsheets/d/1BJXnulBWL8CA0FTNk82nyBWq_37unpuIk2iT6UlSkoY/edit?gid=437586290#gid=437586290
In this phase, the solve.yml workflow was used to generate test patches and optimal code solutions.
See workflow runs 223-225.
Subsequently, the instances with no optimal code patches were collected into no_code_patch_<date>_[12].txt
datasets. These datasets were re-run with the test patches available.
Results from the no_code_patch runs replace the instances from 223-225 that were non-optimal.
## Score and pass rate analysis
Re-try batch 3 with synthetic tests and new solver limits - test_files=0 code_files=6 code_status_retry=3
https://github.com/getappmap/navie-benchmark/actions/runs/10905207717
33_3 0,3,3 64k notest noobserve gpt-4o #226
This run will only produce code patches; it will not produce test patches.
The run will not produce code patches when the test patch cannot be identified.
It can be used to:
* Obtain and evaluate the selection of the correct code file
* Check to see if the solution rate is better with 64k context for examples that have
test and inverted patches in both #225 and #226
verified_33_3 3,3,3,3 -test -observe 16k gpt-4o #225
Disappointing, with a 25.9% resolved rate.
Code file match % 52%
Test file match % 18% *
Test patch gen % 61%
Inverted patch gen % 53%
Score = 0 20 12%
Score = 1 91 56%
Score = 2 30 19%
Score = 3 20 12%
Resolved (=0) 1 5%
Resolved (=1) 18 20%
Resolved (=2) 13 43%
Resolved (=3) 10 50%
Resolved % 25.9%
verified_33 3,3,3,3 -test -observe 16k gpt-4o #223
Code file match % 64% *
Test file match % 28% *
Test patch gen % 65%
Inverted patch gen % 56%
Score = 0 25 15%
Score = 1 81 50%
Score = 2 24 15%
Score = 3 33 20%
Resolved (=0) 6 24%
Resolved (=1) 19 23%
Resolved (=2) 8 33%
Resolved (=3) 18 55%
Resolved % 31.3%
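(For reference: in these tables the first number is the instance count in the bucket, and the second is the share of all instances for the Score rows or the resolved rate within the bucket for the Resolved rows. The overall Resolved % follows by summing the buckets, e.g. for #223: (6 + 19 + 8 + 18) / (25 + 81 + 24 + 33) = 51 / 163 ≈ 31.3%.)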
verified_30_pct 3,1,3,3 32k gpt-4o #217
Uses /noterms and no text search limit. Uses the test database.
Code file match % 63% *
Test file match % 22% *
Test patch gen % 78%
Inverted patch gen % 75%
Score = 0 18 11%
Score = 1 81 49%
Score = 2 27 16%
Score = 3 39 24%
Resolved (=0) 4 22%
Resolved (=1) 22 27%
Resolved (=2) 9 33%
Resolved (=3) 24 62%
Resolved % 35.8%
verified_30_pct 3,1,3,3 32k gpt-4o #201
Uses /terms and has the text search limit
Code file match % 25% *
Test file match % 21%
Test patch gen % 78%
Inverted patch gen % 74%
Score = 0 12 7%
Score = 1 106 65%
Score = 2 16 10%
Score = 3 27 16%
Resolved (=0) 2 17%
Resolved (=1) 18 17%
Resolved (=2) 5 31%
Resolved (=3) 17 63%
Resolved % 25.6%
* Plan is too strict about editing a single file
* Progress update
## verified_33_2 3,3,3,3 -test -observe 16k gpt-4o #224
## verified_33_1 3,3,3,3 -test -observe 16k gpt-4o #223
33% run using /noterms and no character limit on text search.
It does not use the existing synthetic tests, and the observe flag is off.
Overall score rates are lower than #217, below.
Pass rate at Score=3 was 55%.
## verified_30_pct 3,1,3,3 32k gpt-4o #217
This is a 33% run which uses the /noterms flag and removes the character limit on text search.
It uses the existing synthetic tests. Observe tests flag is on.
Pass rate at Score=3 was 62%.
* Use /noterms, remove 50k file limit
https://docs.google.com/spreadsheets/d/1lCT67I8WK64dQjzcIo5KQ3gf9Z_Q6vj5LS8Wga7tjk4/edit?gid=1462583265#gid=1462583265
* Sonnet rate limits are unhandled
https://github.com/getappmap/appmap-js/issues/1996
* AppMap data doesn't seem effective in identifying the code patch file
https://docs.google.com/spreadsheets/d/1lCT67I8WK64dQjzcIo5KQ3gf9Z_Q6vj5LS8Wga7tjk4/edit?gid=1462583265#gid=1462583265
* Parameter comparison tests
Running the 30% instance set
============================
* 10884295754 #198
1/1 in test, 3/3 in code, 8k context tokens
25% solve rate overall
Edit-test-file rate of only 74%. Needs more edit-test-file attempts to find a pass-to-pass test.
60% of issues end with score=1, and only 13% of those pass. Increase the code limits to try to find a solution.
Increase the token limit to improve the solver chances.
* 10885877926 #201
3/1 in test, 3/3 in code, 32k context tokens
Sonnet
======
8k token limit
https://github.com/getappmap/navie-benchmark/issues/33
https://github.com/getappmap/navie-benchmark/issues/32
Token limit #139 vs #136
============
16k token limit vs 8k
Solve rate of 25.3% vs 21.6%
NOTE: verified_10_pct 1,1,3,3 sonnet #199 was not actually sonnet; it was gpt-4o. Re-run as https://github.com/getappmap/navie-benchmark/actions/runs/10885877926
* Patch generation progress
Test patches
============
10863347641 - 32 patches on 218 examples
10863770934 - 22 patches on 186 examples
Code patches
============
10863315253 - 14 patches on 160 examples
TODO: Are we only looking for code solutions when we have a patch solution?
*** We should be adding newly solved examples to the code instances
* Summarize the root cause of test errors before feeding them back in
* Sonnet - truncated output
solve/django__django-14034/navie/generate-code/attempt-1/code-1/generate/generate.md
/Users/kgilpin/Downloads/solve-1 (3)/django__django-11211/navie/generate-code/attempt-1/code-1/generate/generate.md
<change>
<file change-number-for-this-file="1">django/db/models/fields/related_lookups.py</file>
<original line-count="14" no-ellipsis="true"><![CDATA[
class RelatedLookupMixin:
    def get_prep_lookup(self):
        if not isinstance(self.lhs, MultiColSource) and self.rhs_is_direct_value():
* If there is little or no AppMap data, supplement with Context
* astropy__astropy-14365 missing
https://github.com/getappmap/navie-benchmark/actions/runs/10819249064
- [ ] Limit the size of AppMap derived context
* There needs to be a limit on the amount of context selected for the edit test file.
Select the edit test file with the smallest AppMap data?
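A possible selection heuristic for the question above, sketched under the assumption that each candidate test file maps to a set of AppMap files on disk (the mapping and the size budget are assumptions):
```python
# Hypothetical heuristic: prefer the candidate edit-test file whose associated
# AppMap data is smallest, keeping derived context under a size budget.
from pathlib import Path

MAX_APPMAP_BYTES = 200_000  # assumed budget for AppMap-derived context

def appmap_size(appmap_files: list[Path]) -> int:
    return sum(f.stat().st_size for f in appmap_files)

def pick_edit_test_file(candidates: dict[Path, list[Path]]) -> Path:
    """candidates maps each test file to its AppMap files (assumed input shape)."""
    return min(candidates, key=lambda test_file: appmap_size(candidates[test_file]))
```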
* Custom context should be a "system" prompt rather than a user code selection, at least to match the behavior of the frontend.
* Verify that this test passes if the patch is reverted:
solve/django__django-13658/navie/generate-code/attempt-1/run-test/test-patch/code_0.patch
[workflow] (django__django-13658) Adapting test file: tests/admin_scripts/management/commands/base_command.py
[test/lint-repair] (django__django-13658) Making attempt 1 to generate code that lints cleanly
[generate-test] (django__django-13658) Generated test file: tests/admin_scripts/management/commands/management_utility_command_parser_fix_test.py
[test/lint-repair] (django__django-13658) Code lints cleanly
[generate-test] (django__django-13658) Test patch generated after 1 attempts.
HEAD is now at 48212eb Baseline commit
[workflow] (django__django-13658) Cleaned git state
[run-test] (django__django-13658) Running tests tests/admin_scripts/management/commands/management_utility_command_parser_fix_test.py
in /Users/kgilpin/source/appland/navie-benchmark/solve/django__django-13658/navie/generate-test/attempt-1_from-base_command.py/test-2/run-test/test-patch
[execute-container] (django__django-13658)
Saved output to log file: /Users/kgilpin/source/appland/navie-benchmark/solve/django__django-13658/navie/generate-test/attempt-1_from-base_command.py/test-2/run-test/test-patch/run_test.log
[run-test] (django__django-13658) Test tests/admin_scripts/management/commands/management_utility_command_parser_fix_test.py completed with status TestStatus.PASSED.
[workflow] (django__django-13658) Test passed. Accepting test.
[workflow] (django__django-13658) Inverting test
HEAD is now at 48212eb Baseline commit
[workflow] (django__django-13658) Cleaned git state
[invert-test/lint-repair] (django__django-13658) Making attempt 1 to generate code that lints cleanly
[generate-test] (django__django-13658) Generated test file: tests/admin_scripts/management/commands/management_utility_command_parser_fix_test_inverted.py
[invert-test/lint-repair] (django__django-13658) Code lints cleanly
HEAD is now at 48212eb Baseline commit
[workflow] (django__django-13658) Cleaned git state
[invert-test] (django__django-13658) Test patch inverted after 1 attempts.
[run-test] (django__django-13658) Running tests tests/admin_scripts/management/commands/management_utility_command_parser_fix_test_inverted.py in /Users/kgilpin/source/appland/navie-benchmark/solve/django__django-13658/navie/generate-test/attempt-1_from-base_command.py/test-2/invert
[execute-container] (django__django-13658) Saved output to log file: /Users/kgilpin/source/appland/navie-benchmark/solve/django__django-13658/navie/generate-test/attempt-1_from-base_command.py/test-2/invert/run_test.log
[run-test] (django__django-13658) Test tests/admin_scripts/management/commands/management_utility_command_parser_fix_test_inverted.py completed with status TestStatus.FAILED.
[workflow] (django__django-13658) Inverted test failed with the expected marker error. Accepting test.
[workflow] (django__django-13658) Optimal test patch generated for tests/admin_scripts/management/commands/base_command.py
[workflow] (django__django-13658) Patch file generated to /Users/kgilpin/source/appland/navie-benchmark/solve/django__django-13658/navie/test.patch
[workflow] (django__django-13658) Patch file generated to /Users/kgilpin/source/appland/navie-benchmark/solve/django__django-13658/navie/test-inverted.patch
* Repeatedly emitting the following:
+class ManagementUtilityCommandParserFixTest(SimpleTestCase):
+    def test_prog_name_from_argv(self):
+        original_argv = sys.argv
+        try:
+            # Simulate an environment where sys.argv[0] is None
+            sys.argv = [None, 'test_command']
+            with self.assertRaises(TypeError):
+                execute_from_command_line(['manage.py', 'test_command'])
* Feed test errors back into code generation
TODO: Utilize the same logic with test generation?
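A rough sketch of what feeding test errors back might look like; the prompt shape and function names are assumptions, not the solver's actual interface:
```python
# Hypothetical sketch: append the tail of the failing test log to the next
# code-generation prompt. Names and prompt wording are illustrative only.
from pathlib import Path

def tail(log_file: Path, max_lines: int = 50) -> str:
    lines = log_file.read_text(errors="replace").splitlines()
    return "\n".join(lines[-max_lines:])

def retry_prompt(issue_text: str, previous_patch: str, log_file: Path) -> str:
    return (
        f"{issue_text}\n\n"
        "The previous patch failed the generated test. Relevant test output:\n"
        f"{tail(log_file)}\n\n"
        f"Previous patch:\n{previous_patch}\n\n"
        "Revise the patch so that the test passes."
    )
```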
* Patch file matches, but no solution obtained
https://docs.google.com/spreadsheets/d/1lCT67I8WK64dQjzcIo5KQ3gf9Z_Q6vj5LS8Wga7tjk4/edit?gid=539801750#gid=539801750
* Ignore user-provided Python versions
```sh
python3.8 -m pip install --user --upgrade 'git+git://github.com/sphinx-doc/[email protected]#egg=sphinx'
```
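One possible way to ignore a hard-coded interpreter version like the one above (a sketch; the rewrite rule is an assumption about what "ignore" should mean here):
```python
# Hypothetical sketch: rewrite user-supplied commands that pin a specific
# "pythonX.Y" so they use the testbed's interpreter instead.
import re
import sys

def normalize_python(command: str) -> str:
    return re.sub(r"\bpython3\.\d+\b", lambda _: sys.executable, command)

# e.g. normalize_python("python3.8 -m pip install ...") -> "<testbed python> -m pip install ..."
```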
## Python environment
Do not use Python features that are not available in this Python version.
Python 3.11.5
* Do not take the user description overly literally:
+        self.locale_dir = os.path.join(self.repo_dir, "locale", "da", "LC_MESSAGES")
+        self.build_dir = os.path.join(self.repo_dir, "_build", "html")
+        self.index_html = os.path.join(self.build_dir, "index.html")
+
+        # Clone the repository
+        subprocess.run(["git", "clone", self.repo_url, self.repo_dir], check=True)
+        subprocess.run(["git", "checkout", self.commit_hash], cwd=self.repo_dir, check=True)
+
+        # Create a virtual environment and install Sphinx
+        subprocess.run(["python3", "-m", "venv", "env"], cwd=self.repo_dir, check=True)
+        subprocess.run([os.path.join(self.repo_dir, "env", "bin", "pip"), "install", "sphinx"], check=True)
* Require a code file to be present in the trace of the test case.
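A sketch of that check; how the trace is represented is an assumption (here, a list of file paths extracted from the test's AppMap):
```python
# Hypothetical check: only accept a generated test if at least one non-test
# source file appears in its execution trace.
def trace_touches_code(trace_files: list[str]) -> bool:
    def is_test_file(path: str) -> bool:
        return "/tests/" in path or path.split("/")[-1].startswith("test_")
    return any(path.endswith(".py") and not is_test_file(path) for path in trace_files)
```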
* Feed test errors back into the solver.
[run-test] (sympy__sympy-17318) Interpreting test output from log file: /Users/kgilpin/source/appland/navie-benchmark/solve/sympy__sympy-17318/navie/generate-test/attempt-1_from-test_sqrtdenest.py/test-2/run-test/test-patch/run_test.log
[run-test] (sympy__sympy-17318) Interpreting test output from log file: /Users/kgilpin/source/appland/navie-benchmark/solve/sympy__sympy-17318/navie/generate-test/attempt-3_from-test_radsimp.py/test-1/run-test/test-patch/run_test.log
Ignore failures due to deprecation?
sympy__sympy-17318 - SOLVED - IN https://docs.google.com/spreadsheets/d/1gwF8VrAeWqsq6yH91Y5zvCGPIhYKA9M5FQG3KmwYYDk/edit?gid=1702640221#gid=1702640221
Not in https://docs.google.com/spreadsheets/d/1lCT67I8WK64dQjzcIo5KQ3gf9Z_Q6vj5LS8Wga7tjk4/edit?gid=14622705#gid=14622705
* Premature acceptance of "best" code patch
[generate-and-validate-code] (sympy__sympy-15599) Code patch succeeded the pass-to-pass test, and there are no test patches to try. Accepting code patch.
[workflow] (sympy__sympy-15599) Optimal code patch generated (for available tests)
* Instances with empty patches: 5
https://github.com/getappmap/navie-benchmark/actions/runs/10761576039/job/29840927112#step:7:5439
* Extend the token limit by the size of the observed errors
* Generate multiple sets of pass to fail and fail to pass examples?
* Score the patch based on passing all the pass_to_pass tests?
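If scoring were extended to cover all pass-to-pass tests, it might look roughly like this; the existing scoring scheme and these names are assumptions:
```python
# Hypothetical scoring sketch: count how many of the available checks a code
# patch satisfies, requiring every pass-to-pass test rather than a single one.
def score_patch(passes_fail_to_pass: bool, inverted_test_fails: bool,
                pass_to_pass_results: list[bool]) -> int:
    score = 0
    if passes_fail_to_pass:
        score += 1
    if inverted_test_fails:
        score += 1
    if pass_to_pass_results and all(pass_to_pass_results):
        score += 1
    return score
```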
* Generates the same bad patch on both run_test attempts
/Users/kgilpin/Downloads/solve-0/scikit-learn__scikit-learn-13328/navie/generate-code/2/run_test/6db4f467feabd60a471aa9f5d3169f28326cd0bd40bbc7a8f8b20a84adbf91a2/run_test.log
Try a new test to patch
* Analyze performance with respect to these parameters:
Edited file limit
Lint retries
Test retries
Code retries
* Increase context size on retries? Or overall? Or make configurable?
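One simple way to make this configurable while escalating on retries; the base, growth factor, and cap values are illustrative assumptions:
```python
# Hypothetical sketch: escalate the context token limit on each retry, capped
# at a configurable maximum.
def context_limit(attempt: int, base: int = 8_000, factor: int = 2, cap: int = 64_000) -> int:
    return min(base * factor ** attempt, cap)

# attempt 0 -> 8k, attempt 1 -> 16k, attempt 2 -> 32k, attempt 3 -> 64k
```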
* Test is getting "SKIPPED"
https://github.com/getappmap/navie-benchmark/actions/runs/10753160219/job/29822284291#step:7:397
[generate-code] (django__django-14559) Code patch generated after 1 attempts.
[generate-and-validate-code] (django__django-14559) Running pass-to-pass test for attempt 2
[workflow] (django__django-14559) Running test
[run-test] (django__django-14559) Running tests tests/postgres_tests/test_bulk_update.py in /home/runner/work/navie-benchmark/navie-benchmark/solve/django__django-14559/navie/generate-code/attempt-2/run-test/pass-to-pass
[run-test] (django__django-14559) Creating run-test container for django__django-14559...
[run-test] (django__django-14559) Test run includes 1 code patches.
[run-test] (django__django-14559) Test tests/postgres_tests/test_bulk_update.py completed with status TestStatus.SKIPPED.
* "Code patch is not optimal" can be emitted when there is no successful edit test file
[generate-and-validate-code] (django__django-13658) Code patch is not optimal. Will look for a better patch.
* ModelChoiceField doesn't select the expected file (django/forms/models)
Context: ModelChoiceField Value ValidationError invalid choice Django
Instructions: Summarize the issue and design a solution involving at most one file modification.
---
Terms: +ModelChoiceField ValidationError invalid invalid_choice value template Django
940ms [vectorTerms] +ModelChoiceField ValidationError invalid invalid_choice value template Django
Explain received context request: search
[collectContext] keywords: model choice modelchoice field choicefield validation error validationerror invalid invalid choice invalidchoice value template django
* Just report presence of test frameworks (pytest, unittest) to the solver?
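A minimal detection sketch for reporting frameworks to the solver; the markers checked are assumptions, not a definitive detector:
```python
# Hypothetical sketch: report which test frameworks appear to be present,
# based on common marker files and imports.
from pathlib import Path

def detect_test_frameworks(repo: Path) -> list[str]:
    frameworks = []
    if (repo / "pytest.ini").exists() or list(repo.rglob("conftest.py")):
        frameworks.append("pytest")
    if any("import unittest" in f.read_text(errors="replace") for f in repo.rglob("test_*.py")):
        frameworks.append("unittest")
    return frameworks
```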
* Split workflow.run into solve_test and solve_code
* Malformed patch
<change>
<file change-number-for-this-file="3">/home/runner/work/navie-benchmark/navie-benchmark/solve/astropy__astropy-14365/source/astropy/io/ascii/qdp.py</file>
<original line-count="14" no-ellipsis="true"><![CDATA[
if datatype.startswith("data"):
    # The first time I find data, I define err_specs
    if err_specs == {} and command_lines != "":
        for cline in command_lines.strip().split("\n"):
            command = cline.strip().split()
            # This should never happen, but just in case.
            if len(command) < 3:
                continue
            err_specs[command[1].lower()] = [int(c) for c in command[2:]]
    if colnames is None:
        colnames = _interpret_err_lines(err_specs, ncol, names=input_colnames)
    if current_rows is None:
        current_rows = []
]]></original>
<modified line-count="14" no-ellipsis="true"><![CDATA[
if datatype.startswith("data"):
    # The first time I find data, I define err_specs
    if err_specs == {} and command_lines != "":
        for cline in command_lines.strip().split("\n"):
            command = cline.strip().split()
            # This should never happen, but just in case.
            if len(command) < 3:
                continue
            err_specs[command[1].lower()] = [int(c) for c in command[2:]]
    if colnames is None:
        colnames = _interpret_err_lines(err_specs, ncol, names=input_colnames)
    if current_rows is None:
        current_rows = []
]]></modified>
</change>
* Found 2 changes, but the limit is 1
https://github.com/getappmap/navie-benchmark/actions/runs/10655942503/job/29534052664
[workflow/generate-code] (pytest-dev__pytest-7490) Found 2 changes, but the limit is 1
[workflow/generate-code] (pytest-dev__pytest-7490) Applied code changes to src/_pytest/skipping.py, src/_pytest/nodes.py
[code/lint-repair] (pytest-dev__pytest-7490) Code has lint errors: src/_pytest/nodes.py:287:25: F821 undefined name 'xfailed_key'
* astropy example fails because the installed numpy version is too new
https://stackoverflow.com/questions/74946845/attributeerror-module-numpy-has-no-attribute-int
sweb.eval.x86_64.astropy__astropy-8707
astropy/table/_np_utils.pyx:15: in init astropy.table._np_utils
DTYPE = np.int
/opt/miniconda3/envs/testbed/lib/python3.9/site-packages/numpy/__init__.py:319: in __getattr__
raise AttributeError(__former_attrs__[attr])
E AttributeError: module 'numpy' has no attribute 'int'.
E `np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
E The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
E https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
[run-test] (astropy__astropy-8707) Interpreting test output from log file: /Users/kgilpin/source/appland/navie-benchmark/solve/astropy__astropy-8707/navie/generate-test/attempt-5_from-test_division.py/test-1/run-test/test-patch/run_test.log
Also note this other deprecation warning:
self = <astropy.io.fits.tests.test_header_fromstring_bytes.TestHeaderFromStringBytes object at 0x400b83be50>
request = <SubRequest '_xunit_setup_method_fixture_TestHeaderFromStringBytes' for <Function test_card_fromstring_str>>
    @fixtures.fixture(
        autouse=True,
        scope="function",
        # Use a unique name to speed up lookup.
        name=f"_xunit_setup_method_fixture_{self.obj.__qualname__}",
    )
    def xunit_setup_method_fixture(self, request) -> Generator[None, None, None]:
        method = request.function
        if setup_method is not None:
            func = getattr(self, setup_name)
            _call_with_optional_argument(func, method)
        if emit_nose_setup_warning:
>           warnings.warn(
                NOSE_SUPPORT_METHOD.format(
                    nodeid=request.node.nodeid, method="setup"
                ),
                stacklevel=2,
            )
E pytest.PytestRemovedIn8Warning: Support for nose tests is deprecated and will be removed in a future release.
E astropy/io/fits/tests/test_header_fromstring_bytes.py::TestHeaderFromStringBytes::test_card_fromstring_str is using nose-specific method: `setup(self)`
E To remove this warning, rename it to `setup_method(self)`
E See docs: https://docs.pytest.org/en/stable/deprecations.html#support-for-tests-written-for-nose
/opt/miniconda3/envs/testbed/lib/python3.9/site-packages/_pytest/python.py:898: PytestRemovedIn8Warning
=========================== short test summary info ============================
ERROR astropy/io/fits/tests/test_header_fromstring_bytes.py::TestHeaderFromStringBytes::test_header_fromstring_bytes
ERROR astropy/io/fits/tests/test_header_fromstring_bytes.py::TestHeaderFromStringBytes::test_header_fromstring_str
ERROR astropy/io/fits/tests/test_header_fromstring_bytes.py::TestHeaderFromStringBytes::test_card_fromstring_bytes
ERROR astropy/io/fits/tests/test_header_fromstring_bytes.py::TestHeaderFromStringBytes::test_card_fromstring_str
============================== 4 errors in 7.49s ===============================