Fix config_db.json backup in `test_counterpoll_watermark.py` #16002

arista-nwolfe · 2024-12-11T02:38:21Z

Details of failure described in more detail in #15991

Created a new fixture backup_and_restore_config_db_frontend which uses enum_rand_one_per_hwsku_frontend_hostname so test_counterpoll_watermark.py can run it.

Updated _backup_and_restore_config_db to work on multi-asic LCs

Summary:
Fixes #15991

Type of change

Bug fix
Testbed and Framework(new/improvement)
Test case(new/improvement)

Back port request

Also fix '_backup_and_restore_config_db' to work on multi-asic lcs

mssonicbld · 2024-12-11T02:39:36Z

The pre-commit check detected issues in the files touched by this pull request.
The pre-commit check is a mandatory check, please fix detected issues.

Detailed pre-commit check results:
trim trailing whitespace.................................................Passed fix end of files.........................................................Passed check yaml...........................................(no files to check)Skipped check for added large files..............................................Passed check python ast.........................................................Passed flake8...................................................................Failed - hook id: flake8 - exit code: 1 tests/common/fixtures/duthost_utils.py:51:23: E711 comparison to None should be 'if cond is None:' tests/common/fixtures/duthost_utils.py:69:23: E711 comparison to None should be 'if cond is None:' flake8...............................................(no files to check)Skipped check conditional mark sort..........................(no files to check)Skipped

To run the pre-commit checks locally, you can follow below steps:

Ensure that default python is python3. In sonic-mgmt docker container, default python is python2. You can run
the check by activating the python3 virtual environment in sonic-mgmt docker container or outside of sonic-mgmt
docker container.
Ensure that the pre-commit package is installed:

sudo pip install pre-commit

Go to repository root folder
Install the pre-commit hooks:

pre-commit install

Use pre-commit to check staged file:

pre-commit

Alternatively, you can check committed files using:

pre-commit run --from-ref <commit_id> --to-ref <commit_id>

yxieca · 2024-12-11T16:26:32Z

@arista-nwolfe using config db backup/restore to keep original configuration is a very expensive way of keep the configuration consistent. Because when using this method, when there is diff, the system will go through config reload, which is a 3 minutes+ operation on single ASIC. I don't know the operation time on multi-ASIC/multi-DUT. could be much longer.

@yutongzhang-microsoft can you give some architectural guidance here? I feel that test should try best to revert configuration change to original and use the backup/restore infra to check and make sure that the configuration is restored?

arista-nwolfe · 2024-12-11T17:04:03Z

@arista-nwolfe using config db backup/restore to keep original configuration is a very expensive way of keep the configuration consistent. Because when using this method, when there is diff, the system will go through config reload, which is a 3 minutes+ operation on single ASIC. I don't know the operation time on multi-ASIC/multi-DUT. could be much longer.

@yutongzhang-microsoft can you give some architectural guidance here? I feel that test should try best to revert configuration change to original and use the backup/restore infra to check and make sure that the configuration is restored?

Note that this config reload due to config-mismatch is already happening today before this change.
It's the conftest.core_dump_and_config_check which does that.
The config reload will correct the current running configuration but it won't fix the config_db.json files, that's all this review is fixing.
The copying and restoring of the config_db.json files should be a very quick operation.

I agree we could improve this test to correct/revert it's config within the test but I don't think that needs to block this review as it shouldn't change the duration of the test. It'll just correct the saved config so future tests don't fail due to this.

yutongzhang-microsoft · 2024-12-12T03:11:55Z

@arista-nwolfe using config db backup/restore to keep original configuration is a very expensive way of keep the configuration consistent. Because when using this method, when there is diff, the system will go through config reload, which is a 3 minutes+ operation on single ASIC. I don't know the operation time on multi-ASIC/multi-DUT. could be much longer.

@yutongzhang-microsoft can you give some architectural guidance here? I feel that test should try best to revert configuration change to original and use the backup/restore infra to check and make sure that the configuration is restored?

Hi, @yxieca , @arista-nwolfe . We have the fixture https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2403 to check the running config before and after the test script running. To achieve this, we have already backup the config in /etc/sonic/running_golden_config.json before running. Can we directly use this golden config to cover the config_db.json instead of generating another one?

And I'm also wondering, our fixture core_dump_and_config_check will do config reload if the configurations are modified during testing. As in https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2364, we will copy the golden running config back to config_db.json. So theoretically, we don't need manually restore config_db.json...

arista-nwolfe · 2024-12-12T18:39:19Z

@arista-nwolfe using config db backup/restore to keep original configuration is a very expensive way of keep the configuration consistent. Because when using this method, when there is diff, the system will go through config reload, which is a 3 minutes+ operation on single ASIC. I don't know the operation time on multi-ASIC/multi-DUT. could be much longer.
@yutongzhang-microsoft can you give some architectural guidance here? I feel that test should try best to revert configuration change to original and use the backup/restore infra to check and make sure that the configuration is restored?

Hi, @yxieca , @arista-nwolfe . We have the fixture https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2403 to check the running config before and after the test script running. To achieve this, we have already backup the config in /etc/sonic/running_golden_config.json before running. Can we directly use this golden config to cover the config_db.json instead of generating another one?

And I'm also wondering, our fixture core_dump_and_config_check will do config reload if the configurations are modified during testing. As in https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2364, we will copy the golden running config back to config_db.json. So theoretically, we don't need manually restore config_db.json...

It looks like it core_dump_and_config_check only calls __dut_reload to restore the config_db.json on non-modular systems https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2673-L2677

This came from a fairly recent PR: #14368
Tagging @yejianquan in case this was an unintended change:

            is_modular_chassis = duthosts[0].get_facts().get("modular_chassis")
            if is_modular_chassis:
                results = recover_chassis(duthosts)
            else:
                results = parallel_run(__dut_reload, (), {"duts_data": duts_data}, duthosts, timeout=360)

recover_chassis doesn't save to config_db.json but __dut_reload does.

yejianquan · 2024-12-13T01:13:43Z

@arista-nwolfe using config db backup/restore to keep original configuration is a very expensive way of keep the configuration consistent. Because when using this method, when there is diff, the system will go through config reload, which is a 3 minutes+ operation on single ASIC. I don't know the operation time on multi-ASIC/multi-DUT. could be much longer.
@yutongzhang-microsoft can you give some architectural guidance here? I feel that test should try best to revert configuration change to original and use the backup/restore infra to check and make sure that the configuration is restored?

Hi, @yxieca , @arista-nwolfe . We have the fixture https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2403 to check the running config before and after the test script running. To achieve this, we have already backup the config in before running. Can we directly use this golden config to cover the config_db.json instead of generating another one?/etc/sonic/running_golden_config.json
And I'm also wondering, our fixture will do config reload if the configurations are modified during testing. As in https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2364, we will copy the golden running config back to . So theoretically, we don't need manually restore ...core_dump_and_config_check``config_db.json``config_db.json

It looks like it only calls to restore the on non-modular systems https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2673-L2677`core_dump_and_config_check``__dut_reload``config_db.json`

This came from a fairly recent PR: #14368 Tagging @yejianquan in case this was an unintended change:
            is_modular_chassis = duthosts[0].get_facts().get("modular_chassis")
            if is_modular_chassis:
                results = recover_chassis(duthosts)
            else:
                results = parallel_run(__dut_reload, (), {"duts_data": duts_data}, duthosts, timeout=360)
recover_chassis doesn't save to but does.config_db.json``__dut_reload

@arista-nwolfe It's expected behavior.
To have a thorough recovery, we config reload RP and all LC together and it have being working well.
And yes there is mismatch logic between __dut_reload and recover_chassis.
But I suppose it shouldn't make any difference?

The logic of core_dump_and_config_check is

At the start, use current config to
1.a: Generate golden config
1.b: Store in memory json.dumps(duts_data[node.hostname]["pre_running_config"]
After the test, if it's not chassis, it
2.a: Generate config files with memory data
2.b: Config reload with the new generated(2.a) config.
After the test, if it's not chassis, it:
3.a: Config reload with the new generated(1.a) golden config.

@yutongzhang-microsoft please confirm the logic.

@arista-nwolfe The only difference is non-chassis uses the golden config from 2.a, chassis uses the golen config from 1.a.
I suppose the test module won't write golden config, so there shouldn't have any difference, isn't it?

yejianquan · 2024-12-13T01:17:47Z

@arista-nwolfe using config db backup/restore to keep original configuration is a very expensive way of keep the configuration consistent. Because when using this method, when there is diff, the system will go through config reload, which is a 3 minutes+ operation on single ASIC. I don't know the operation time on multi-ASIC/multi-DUT. could be much longer.
@yutongzhang-microsoft can you give some architectural guidance here? I feel that test should try best to revert configuration change to original and use the backup/restore infra to check and make sure that the configuration is restored?

Hi, @yxieca , @arista-nwolfe . We have the fixture https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2403 to check the running config before and after the test script running. To achieve this, we have already backup the config in before running. Can we directly use this golden config to cover the config_db.json instead of generating another one?/etc/sonic/running_golden_config.json
And I'm also wondering, our fixture will do config reload if the configurations are modified during testing. As in https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2364, we will copy the golden running config back to . So theoretically, we don't need manually restore ... core_dump_and_config_checkconfig_db.jsonconfig_db.json

It looks like it only calls to restore the on non-modular systems https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2673-L2677core_dump_and_config_check``__dut_reload``config_db.json
This came from a fairly recent PR: #14368 Tagging @yejianquan in case this was an unintended change:
            is_modular_chassis = duthosts[0].get_facts().get("modular_chassis")
            if is_modular_chassis:
                results = recover_chassis(duthosts)
            else:
                results = parallel_run(__dut_reload, (), {"duts_data": duts_data}, duthosts, timeout=360)
recover_chassis doesn't save to but does. config_db.json__dut_reload ``
@arista-nwolfe It's expected behavior. To have a thorough recovery, we config reload RP and all LC together and it have being working well. And yes there is mismatch logic between __dut_reload and recover_chassis. But I suppose it shouldn't make any difference?

The logic of core_dump_and_config_check is

At the start, use current config to
1.a: Generate golden config
1.b: Store in memory json.dumps(duts_data[node.hostname]["pre_running_config"]

After the test, if it's not chassis, it
2.a: Generate golden config with memory data
2.b: Config reload with the new generated(2.a) golden config.

After the test, if it's not chassis, it:
3.a: Config reload with the new generated(1.a) golden config.

@yutongzhang-microsoft please confirm the logic.

@arista-nwolfe The only difference is non-chassis uses the golden config from 2.a, chassis uses the golen config from 1.a. I suppose the test module won't write golden config, so there shouldn't have any difference, isn't it?

But I think yes it's a gap that the chassis_recover should add the logic that save back the config_db*.json
@arista-nwolfe could you enhance the logic here to restore the both running config and config file instead only restore running config ?

sonic-mgmt/tests/conftest.py

Line 2674 in dd14c06

recover_chassis(duthosts)

So that we don't need this PR's fixture to store and restore config files?
Given the functionality is covered by core_dump_and_config_check alraeady

yutongzhang-microsoft · 2024-12-13T01:22:48Z

@arista-nwolfe using config db backup/restore to keep original configuration is a very expensive way of keep the configuration consistent. Because when using this method, when there is diff, the system will go through config reload, which is a 3 minutes+ operation on single ASIC. I don't know the operation time on multi-ASIC/multi-DUT. could be much longer.
@yutongzhang-microsoft can you give some architectural guidance here? I feel that test should try best to revert configuration change to original and use the backup/restore infra to check and make sure that the configuration is restored?

Hi, @yxieca , @arista-nwolfe . We have the fixture https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2403 to check the running config before and after the test script running. To achieve this, we have already backup the config in before running. Can we directly use this golden config to cover the config_db.json instead of generating another one?/etc/sonic/running_golden_config.json
And I'm also wondering, our fixture will do config reload if the configurations are modified during testing. As in https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2364, we will copy the golden running config back to . So theoretically, we don't need manually restore ... core_dump_and_config_checkconfig_db.jsonconfig_db.json

It looks like it only calls to restore the on non-modular systems https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2673-L2677core_dump_and_config_check``__dut_reload``config_db.json
This came from a fairly recent PR: #14368 Tagging @yejianquan in case this was an unintended change:
            is_modular_chassis = duthosts[0].get_facts().get("modular_chassis")
            if is_modular_chassis:
                results = recover_chassis(duthosts)
            else:
                results = parallel_run(__dut_reload, (), {"duts_data": duts_data}, duthosts, timeout=360)
recover_chassis doesn't save to but does. config_db.json__dut_reload ``
@arista-nwolfe It's expected behavior. To have a thorough recovery, we config reload RP and all LC together and it have being working well. And yes there is mismatch logic between __dut_reload and recover_chassis. But I suppose it shouldn't make any difference?

The logic of core_dump_and_config_check is

At the start, use current config to
1.a: Generate golden config
1.b: Store in memory json.dumps(duts_data[node.hostname]["pre_running_config"]

After the test, if it's not chassis, it
2.a: Generate golden config with memory data
2.b: Config reload with the new generated(2.a) golden config.

After the test, if it's not chassis, it:
3.a: Config reload with the new generated(1.a) golden config.

@yutongzhang-microsoft please confirm the logic.

@arista-nwolfe The only difference is non-chassis uses the golden config from 2.a, chassis uses the golen config from 1.a. I suppose the test module won't write golden config, so there shouldn't have any difference, isn't it?

Hi, @yejianquan For no chassis, we will copy it back to config_db.json https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2364. Maybe there is a typo of step 2 or 3?

yejianquan · 2024-12-13T01:26:44Z

@arista-nwolfe using config db backup/restore to keep original configuration is a very expensive way of keep the configuration consistent. Because when using this method, when there is diff, the system will go through config reload, which is a 3 minutes+ operation on single ASIC. I don't know the operation time on multi-ASIC/multi-DUT. could be much longer.
@yutongzhang-microsoft can you give some architectural guidance here? I feel that test should try best to revert configuration change to original and use the backup/restore infra to check and make sure that the configuration is restored?

Hi, @yxieca , @arista-nwolfe . We have the fixture https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2403 to check the running config before and after the test script running. To achieve this, we have already backup the config in before running. Can we directly use this golden config to cover the config_db.json instead of generating another one?/etc/sonic/running_golden_config.json
And I'm also wondering, our fixture will do config reload if the configurations are modified during testing. As in https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2364, we will copy the golden running config back to . So theoretically, we don't need manually restore ... core_dump_and_config_checkconfig_db.jsonconfig_db.json

It looks like it only calls to restore the on non-modular systems https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2673-L2677 core_dump_and_config_check__dut_reloadconfig_db.json
This came from a fairly recent PR: #14368 Tagging @yejianquan in case this was an unintended change:
            is_modular_chassis = duthosts[0].get_facts().get("modular_chassis")
            if is_modular_chassis:
                results = recover_chassis(duthosts)
            else:
                results = parallel_run(__dut_reload, (), {"duts_data": duts_data}, duthosts, timeout=360)
recover_chassis doesn't save to but does. config_db.json__dut_reload ``
@arista-nwolfe It's expected behavior. To have a thorough recovery, we config reload RP and all LC together and it have being working well. And yes there is mismatch logic between __dut_reload and recover_chassis. But I suppose it shouldn't make any difference?
The logic of core_dump_and_config_check is

At the start, use current config to
1.a: Generate golden config
1.b: Store in memory json.dumps(duts_data[node.hostname]["pre_running_config"]

After the test, if it's not chassis, it
2.a: Generate golden config with memory data
2.b: Config reload with the new generated(2.a) golden config.

After the test, if it's not chassis, it:
3.a: Config reload with the new generated(1.a) golden config.

@yutongzhang-microsoft please confirm the logic.
@arista-nwolfe The only difference is non-chassis uses the golden config from 2.a, chassis uses the golen config from 1.a. I suppose the test module won't write golden config, so there shouldn't have any difference, isn't it?
Hi, @yejianquan For no chassis, we will copy it back to config_db.json https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2364. Maybe there is a typo of step 2 or 3?

Thanks, updated

arista-nwolfe · 2024-12-13T16:14:04Z

@arista-nwolfe using config db backup/restore to keep original configuration is a very expensive way of keep the configuration consistent. Because when using this method, when there is diff, the system will go through config reload, which is a 3 minutes+ operation on single ASIC. I don't know the operation time on multi-ASIC/multi-DUT. could be much longer.
@yutongzhang-microsoft can you give some architectural guidance here? I feel that test should try best to revert configuration change to original and use the backup/restore infra to check and make sure that the configuration is restored?

Hi, @yxieca , @arista-nwolfe . We have the fixture https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2403 to check the running config before and after the test script running. To achieve this, we have already backup the config in before running. Can we directly use this golden config to cover the config_db.json instead of generating another one?/etc/sonic/running_golden_config.json
And I'm also wondering, our fixture will do config reload if the configurations are modified during testing. As in https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2364, we will copy the golden running config back to . So theoretically, we don't need manually restore ... core_dump_and_config_checkconfig_db.jsonconfig_db.json

It looks like it only calls to restore the on non-modular systems https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2673-L2677core_dump_and_config_check``__dut_reload``config_db.json
This came from a fairly recent PR: #14368 Tagging @yejianquan in case this was an unintended change:
            is_modular_chassis = duthosts[0].get_facts().get("modular_chassis")
            if is_modular_chassis:
                results = recover_chassis(duthosts)
            else:
                results = parallel_run(__dut_reload, (), {"duts_data": duts_data}, duthosts, timeout=360)
recover_chassis doesn't save to but does. config_db.json__dut_reload ``
@arista-nwolfe It's expected behavior. To have a thorough recovery, we config reload RP and all LC together and it have being working well. And yes there is mismatch logic between __dut_reload and recover_chassis. But I suppose it shouldn't make any difference?

The logic of core_dump_and_config_check is

At the start, use current config to
1.a: Generate golden config
1.b: Store in memory json.dumps(duts_data[node.hostname]["pre_running_config"]

After the test, if it's not chassis, it
2.a: Generate config files with memory data
2.b: Config reload with the new generated(2.a) config.

After the test, if it's not chassis, it:
3.a: Config reload with the new generated(1.a) golden config.

@yutongzhang-microsoft please confirm the logic.

@arista-nwolfe The only difference is non-chassis uses the golden config from 2.a, chassis uses the golen config from 1.a. I suppose the test module won't write golden config, so there shouldn't have any difference, isn't it?

@yejianquan the fact that the config_db#.json files are no longer being written to by the saved config from the start of the test is a significant difference and is the reason for issues like #15991. If the config_db#.json files aren't corrected the config that the test modified will be persistent into the next test. In the case of this issue, causing future tests which depend on flex counters to fail due to flex counters being disabled by test_counterpoll_watermark.py

Fix config_db backup in test_counterpoll_watermark.py

239af90

Also fix '_backup_and_restore_config_db' to work on multi-asic lcs

arista-nwolfe requested a review from prgeor as a code owner December 11, 2024 02:38

Fix conditional failing linter

fb72287

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix config_db.json backup in `test_counterpoll_watermark.py` #16002

Fix config_db.json backup in `test_counterpoll_watermark.py` #16002

arista-nwolfe commented Dec 11, 2024

mssonicbld commented Dec 11, 2024

yxieca commented Dec 11, 2024

arista-nwolfe commented Dec 11, 2024

yutongzhang-microsoft commented Dec 12, 2024 •

edited

Loading

arista-nwolfe commented Dec 12, 2024

yejianquan commented Dec 13, 2024 •

edited

Loading

yejianquan commented Dec 13, 2024 •

edited

Loading

yutongzhang-microsoft commented Dec 13, 2024 •

edited

Loading

yejianquan commented Dec 13, 2024

arista-nwolfe commented Dec 13, 2024 •

edited

Loading

Fix config_db.json backup in test_counterpoll_watermark.py #16002

Are you sure you want to change the base?

Fix config_db.json backup in test_counterpoll_watermark.py #16002

Conversation

arista-nwolfe commented Dec 11, 2024

Type of change

Back port request

mssonicbld commented Dec 11, 2024

yxieca commented Dec 11, 2024

arista-nwolfe commented Dec 11, 2024

yutongzhang-microsoft commented Dec 12, 2024 • edited Loading

arista-nwolfe commented Dec 12, 2024

yejianquan commented Dec 13, 2024 • edited Loading

yejianquan commented Dec 13, 2024 • edited Loading

yutongzhang-microsoft commented Dec 13, 2024 • edited Loading

yejianquan commented Dec 13, 2024

arista-nwolfe commented Dec 13, 2024 • edited Loading

Fix config_db.json backup in `test_counterpoll_watermark.py` #16002

Fix config_db.json backup in `test_counterpoll_watermark.py` #16002

yutongzhang-microsoft commented Dec 12, 2024 •

edited

Loading

yejianquan commented Dec 13, 2024 •

edited

Loading

yejianquan commented Dec 13, 2024 •

edited

Loading

yutongzhang-microsoft commented Dec 13, 2024 •

edited

Loading

arista-nwolfe commented Dec 13, 2024 •

edited

Loading