Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix config_db.json backup in test_counterpoll_watermark.py #16002

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

arista-nwolfe
Copy link
Contributor

Details of failure described in more detail in #15991

Created a new fixture backup_and_restore_config_db_frontend which uses enum_rand_one_per_hwsku_frontend_hostname so test_counterpoll_watermark.py can run it.

Updated _backup_and_restore_config_db to work on multi-asic LCs

Summary:
Fixes #15991

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405

Also fix '_backup_and_restore_config_db' to work on multi-asic lcs
@mssonicbld
Copy link
Collaborator

The pre-commit check detected issues in the files touched by this pull request.
The pre-commit check is a mandatory check, please fix detected issues.

Detailed pre-commit check results:
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...........................................(no files to check)Skipped
check for added large files..............................................Passed
check python ast.........................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1

tests/common/fixtures/duthost_utils.py:51:23: E711 comparison to None should be 'if cond is None:'
tests/common/fixtures/duthost_utils.py:69:23: E711 comparison to None should be 'if cond is None:'

flake8...............................................(no files to check)Skipped
check conditional mark sort..........................(no files to check)Skipped

To run the pre-commit checks locally, you can follow below steps:

  1. Ensure that default python is python3. In sonic-mgmt docker container, default python is python2. You can run
    the check by activating the python3 virtual environment in sonic-mgmt docker container or outside of sonic-mgmt
    docker container.
  2. Ensure that the pre-commit package is installed:
sudo pip install pre-commit
  1. Go to repository root folder
  2. Install the pre-commit hooks:
pre-commit install
  1. Use pre-commit to check staged file:
pre-commit
  1. Alternatively, you can check committed files using:
pre-commit run --from-ref <commit_id> --to-ref <commit_id>

@yxieca
Copy link
Collaborator

yxieca commented Dec 11, 2024

@arista-nwolfe using config db backup/restore to keep original configuration is a very expensive way of keep the configuration consistent. Because when using this method, when there is diff, the system will go through config reload, which is a 3 minutes+ operation on single ASIC. I don't know the operation time on multi-ASIC/multi-DUT. could be much longer.

@yutongzhang-microsoft can you give some architectural guidance here? I feel that test should try best to revert configuration change to original and use the backup/restore infra to check and make sure that the configuration is restored?

@arista-nwolfe
Copy link
Contributor Author

@arista-nwolfe using config db backup/restore to keep original configuration is a very expensive way of keep the configuration consistent. Because when using this method, when there is diff, the system will go through config reload, which is a 3 minutes+ operation on single ASIC. I don't know the operation time on multi-ASIC/multi-DUT. could be much longer.

@yutongzhang-microsoft can you give some architectural guidance here? I feel that test should try best to revert configuration change to original and use the backup/restore infra to check and make sure that the configuration is restored?

Note that this config reload due to config-mismatch is already happening today before this change.
It's the conftest.core_dump_and_config_check which does that.
The config reload will correct the current running configuration but it won't fix the config_db.json files, that's all this review is fixing.
The copying and restoring of the config_db.json files should be a very quick operation.

I agree we could improve this test to correct/revert it's config within the test but I don't think that needs to block this review as it shouldn't change the duration of the test. It'll just correct the saved config so future tests don't fail due to this.

@yutongzhang-microsoft
Copy link
Contributor

yutongzhang-microsoft commented Dec 12, 2024

@arista-nwolfe using config db backup/restore to keep original configuration is a very expensive way of keep the configuration consistent. Because when using this method, when there is diff, the system will go through config reload, which is a 3 minutes+ operation on single ASIC. I don't know the operation time on multi-ASIC/multi-DUT. could be much longer.

@yutongzhang-microsoft can you give some architectural guidance here? I feel that test should try best to revert configuration change to original and use the backup/restore infra to check and make sure that the configuration is restored?

Hi, @yxieca , @arista-nwolfe . We have the fixture https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2403 to check the running config before and after the test script running. To achieve this, we have already backup the config in /etc/sonic/running_golden_config.json before running. Can we directly use this golden config to cover the config_db.json instead of generating another one?

And I'm also wondering, our fixture core_dump_and_config_check will do config reload if the configurations are modified during testing. As in https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2364, we will copy the golden running config back to config_db.json. So theoretically, we don't need manually restore config_db.json...

@arista-nwolfe
Copy link
Contributor Author

@arista-nwolfe using config db backup/restore to keep original configuration is a very expensive way of keep the configuration consistent. Because when using this method, when there is diff, the system will go through config reload, which is a 3 minutes+ operation on single ASIC. I don't know the operation time on multi-ASIC/multi-DUT. could be much longer.
@yutongzhang-microsoft can you give some architectural guidance here? I feel that test should try best to revert configuration change to original and use the backup/restore infra to check and make sure that the configuration is restored?

Hi, @yxieca , @arista-nwolfe . We have the fixture https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2403 to check the running config before and after the test script running. To achieve this, we have already backup the config in /etc/sonic/running_golden_config.json before running. Can we directly use this golden config to cover the config_db.json instead of generating another one?

And I'm also wondering, our fixture core_dump_and_config_check will do config reload if the configurations are modified during testing. As in https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2364, we will copy the golden running config back to config_db.json. So theoretically, we don't need manually restore config_db.json...

It looks like it core_dump_and_config_check only calls __dut_reload to restore the config_db.json on non-modular systems https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2673-L2677

This came from a fairly recent PR: #14368
Tagging @yejianquan in case this was an unintended change:

            is_modular_chassis = duthosts[0].get_facts().get("modular_chassis")
            if is_modular_chassis:
                results = recover_chassis(duthosts)
            else:
                results = parallel_run(__dut_reload, (), {"duts_data": duts_data}, duthosts, timeout=360)

recover_chassis doesn't save to config_db.json but __dut_reload does.

@yejianquan
Copy link
Collaborator

yejianquan commented Dec 13, 2024

@arista-nwolfe using config db backup/restore to keep original configuration is a very expensive way of keep the configuration consistent. Because when using this method, when there is diff, the system will go through config reload, which is a 3 minutes+ operation on single ASIC. I don't know the operation time on multi-ASIC/multi-DUT. could be much longer.
@yutongzhang-microsoft can you give some architectural guidance here? I feel that test should try best to revert configuration change to original and use the backup/restore infra to check and make sure that the configuration is restored?

Hi, @yxieca , @arista-nwolfe . We have the fixture https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2403 to check the running config before and after the test script running. To achieve this, we have already backup the config in before running. Can we directly use this golden config to cover the config_db.json instead of generating another one?/etc/sonic/running_golden_config.json
And I'm also wondering, our fixture will do config reload if the configurations are modified during testing. As in https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2364, we will copy the golden running config back to . So theoretically, we don't need manually restore ...core_dump_and_config_check``config_db.json``config_db.json

It looks like it only calls to restore the on non-modular systems https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2673-L2677`core_dump_and_config_check``__dut_reload``config_db.json`

This came from a fairly recent PR: #14368 Tagging @yejianquan in case this was an unintended change:

            is_modular_chassis = duthosts[0].get_facts().get("modular_chassis")
            if is_modular_chassis:
                results = recover_chassis(duthosts)
            else:
                results = parallel_run(__dut_reload, (), {"duts_data": duts_data}, duthosts, timeout=360)

recover_chassis doesn't save to but does.config_db.json``__dut_reload

@arista-nwolfe It's expected behavior.
To have a thorough recovery, we config reload RP and all LC together and it have being working well.
And yes there is mismatch logic between __dut_reload and recover_chassis.
But I suppose it shouldn't make any difference?

The logic of core_dump_and_config_check is

  1. At the start, use current config to
    1.a: Generate golden config
    1.b: Store in memory json.dumps(duts_data[node.hostname]["pre_running_config"]
  2. After the test, if it's not chassis, it
    2.a: Generate config files with memory data
    2.b: Config reload with the new generated(2.a) config.
  3. After the test, if it's not chassis, it:
    3.a: Config reload with the new generated(1.a) golden config.

@yutongzhang-microsoft please confirm the logic.

@arista-nwolfe The only difference is non-chassis uses the golden config from 2.a, chassis uses the golen config from 1.a.
I suppose the test module won't write golden config, so there shouldn't have any difference, isn't it?

@yejianquan
Copy link
Collaborator

yejianquan commented Dec 13, 2024

@arista-nwolfe using config db backup/restore to keep original configuration is a very expensive way of keep the configuration consistent. Because when using this method, when there is diff, the system will go through config reload, which is a 3 minutes+ operation on single ASIC. I don't know the operation time on multi-ASIC/multi-DUT. could be much longer.
@yutongzhang-microsoft can you give some architectural guidance here? I feel that test should try best to revert configuration change to original and use the backup/restore infra to check and make sure that the configuration is restored?

Hi, @yxieca , @arista-nwolfe . We have the fixture https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2403 to check the running config before and after the test script running. To achieve this, we have already backup the config in before running. Can we directly use this golden config to cover the config_db.json instead of generating another one?/etc/sonic/running_golden_config.json
And I'm also wondering, our fixture will do config reload if the configurations are modified during testing. As in https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2364, we will copy the golden running config back to . So theoretically, we don't need manually restore ... core_dump_and_config_checkconfig_db.jsonconfig_db.json

It looks like it only calls to restore the on non-modular systems https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2673-L2677core_dump_and_config_check``__dut_reload``config_db.json
This came from a fairly recent PR: #14368 Tagging @yejianquan in case this was an unintended change:

            is_modular_chassis = duthosts[0].get_facts().get("modular_chassis")
            if is_modular_chassis:
                results = recover_chassis(duthosts)
            else:
                results = parallel_run(__dut_reload, (), {"duts_data": duts_data}, duthosts, timeout=360)

recover_chassis doesn't save to but does. config_db.json__dut_reload ``

@arista-nwolfe It's expected behavior. To have a thorough recovery, we config reload RP and all LC together and it have being working well. And yes there is mismatch logic between __dut_reload and recover_chassis. But I suppose it shouldn't make any difference?

The logic of core_dump_and_config_check is

  1. At the start, use current config to
    1.a: Generate golden config
    1.b: Store in memory json.dumps(duts_data[node.hostname]["pre_running_config"]
  2. After the test, if it's not chassis, it
    2.a: Generate golden config with memory data
    2.b: Config reload with the new generated(2.a) golden config.
  3. After the test, if it's not chassis, it:
    3.a: Config reload with the new generated(1.a) golden config.

@yutongzhang-microsoft please confirm the logic.

@arista-nwolfe The only difference is non-chassis uses the golden config from 2.a, chassis uses the golen config from 1.a. I suppose the test module won't write golden config, so there shouldn't have any difference, isn't it?

But I think yes it's a gap that the chassis_recover should add the logic that save back the config_db*.json
@arista-nwolfe could you enhance the logic here to restore the both running config and config file instead only restore running config ?

recover_chassis(duthosts)

So that we don't need this PR's fixture to store and restore config files?
Given the functionality is covered by core_dump_and_config_check alraeady

@yutongzhang-microsoft
Copy link
Contributor

yutongzhang-microsoft commented Dec 13, 2024

@arista-nwolfe using config db backup/restore to keep original configuration is a very expensive way of keep the configuration consistent. Because when using this method, when there is diff, the system will go through config reload, which is a 3 minutes+ operation on single ASIC. I don't know the operation time on multi-ASIC/multi-DUT. could be much longer.
@yutongzhang-microsoft can you give some architectural guidance here? I feel that test should try best to revert configuration change to original and use the backup/restore infra to check and make sure that the configuration is restored?

Hi, @yxieca , @arista-nwolfe . We have the fixture https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2403 to check the running config before and after the test script running. To achieve this, we have already backup the config in before running. Can we directly use this golden config to cover the config_db.json instead of generating another one?/etc/sonic/running_golden_config.json
And I'm also wondering, our fixture will do config reload if the configurations are modified during testing. As in https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2364, we will copy the golden running config back to . So theoretically, we don't need manually restore ... core_dump_and_config_checkconfig_db.jsonconfig_db.json

It looks like it only calls to restore the on non-modular systems https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2673-L2677core_dump_and_config_check``__dut_reload``config_db.json
This came from a fairly recent PR: #14368 Tagging @yejianquan in case this was an unintended change:

            is_modular_chassis = duthosts[0].get_facts().get("modular_chassis")
            if is_modular_chassis:
                results = recover_chassis(duthosts)
            else:
                results = parallel_run(__dut_reload, (), {"duts_data": duts_data}, duthosts, timeout=360)

recover_chassis doesn't save to but does. config_db.json__dut_reload ``

@arista-nwolfe It's expected behavior. To have a thorough recovery, we config reload RP and all LC together and it have being working well. And yes there is mismatch logic between __dut_reload and recover_chassis. But I suppose it shouldn't make any difference?

The logic of core_dump_and_config_check is

  1. At the start, use current config to
    1.a: Generate golden config
    1.b: Store in memory json.dumps(duts_data[node.hostname]["pre_running_config"]
  2. After the test, if it's not chassis, it
    2.a: Generate golden config with memory data
    2.b: Config reload with the new generated(2.a) golden config.
  3. After the test, if it's not chassis, it:
    3.a: Config reload with the new generated(1.a) golden config.

@yutongzhang-microsoft please confirm the logic.

@arista-nwolfe The only difference is non-chassis uses the golden config from 2.a, chassis uses the golen config from 1.a. I suppose the test module won't write golden config, so there shouldn't have any difference, isn't it?

Hi, @yejianquan For no chassis, we will copy it back to config_db.json https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2364. Maybe there is a typo of step 2 or 3?

@yejianquan
Copy link
Collaborator

@arista-nwolfe using config db backup/restore to keep original configuration is a very expensive way of keep the configuration consistent. Because when using this method, when there is diff, the system will go through config reload, which is a 3 minutes+ operation on single ASIC. I don't know the operation time on multi-ASIC/multi-DUT. could be much longer.
@yutongzhang-microsoft can you give some architectural guidance here? I feel that test should try best to revert configuration change to original and use the backup/restore infra to check and make sure that the configuration is restored?

Hi, @yxieca , @arista-nwolfe . We have the fixture https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2403 to check the running config before and after the test script running. To achieve this, we have already backup the config in before running. Can we directly use this golden config to cover the config_db.json instead of generating another one?/etc/sonic/running_golden_config.json
And I'm also wondering, our fixture will do config reload if the configurations are modified during testing. As in https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2364, we will copy the golden running config back to . So theoretically, we don't need manually restore ... core_dump_and_config_checkconfig_db.jsonconfig_db.json

It looks like it only calls to restore the on non-modular systems https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2673-L2677 core_dump_and_config_check__dut_reloadconfig_db.json
This came from a fairly recent PR: #14368 Tagging @yejianquan in case this was an unintended change:

            is_modular_chassis = duthosts[0].get_facts().get("modular_chassis")
            if is_modular_chassis:
                results = recover_chassis(duthosts)
            else:
                results = parallel_run(__dut_reload, (), {"duts_data": duts_data}, duthosts, timeout=360)

recover_chassis doesn't save to but does. config_db.json__dut_reload ``

@arista-nwolfe It's expected behavior. To have a thorough recovery, we config reload RP and all LC together and it have being working well. And yes there is mismatch logic between __dut_reload and recover_chassis. But I suppose it shouldn't make any difference?
The logic of core_dump_and_config_check is

  1. At the start, use current config to
    1.a: Generate golden config
    1.b: Store in memory json.dumps(duts_data[node.hostname]["pre_running_config"]
  2. After the test, if it's not chassis, it
    2.a: Generate golden config with memory data
    2.b: Config reload with the new generated(2.a) golden config.
  3. After the test, if it's not chassis, it:
    3.a: Config reload with the new generated(1.a) golden config.

@yutongzhang-microsoft please confirm the logic.
@arista-nwolfe The only difference is non-chassis uses the golden config from 2.a, chassis uses the golen config from 1.a. I suppose the test module won't write golden config, so there shouldn't have any difference, isn't it?

Hi, @yejianquan For no chassis, we will copy it back to config_db.json https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2364. Maybe there is a typo of step 2 or 3?

Thanks, updated

@arista-nwolfe
Copy link
Contributor Author

arista-nwolfe commented Dec 13, 2024

@arista-nwolfe using config db backup/restore to keep original configuration is a very expensive way of keep the configuration consistent. Because when using this method, when there is diff, the system will go through config reload, which is a 3 minutes+ operation on single ASIC. I don't know the operation time on multi-ASIC/multi-DUT. could be much longer.
@yutongzhang-microsoft can you give some architectural guidance here? I feel that test should try best to revert configuration change to original and use the backup/restore infra to check and make sure that the configuration is restored?

Hi, @yxieca , @arista-nwolfe . We have the fixture https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2403 to check the running config before and after the test script running. To achieve this, we have already backup the config in before running. Can we directly use this golden config to cover the config_db.json instead of generating another one?/etc/sonic/running_golden_config.json
And I'm also wondering, our fixture will do config reload if the configurations are modified during testing. As in https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2364, we will copy the golden running config back to . So theoretically, we don't need manually restore ... core_dump_and_config_checkconfig_db.jsonconfig_db.json

It looks like it only calls to restore the on non-modular systems https://github.com/sonic-net/sonic-mgmt/blob/master/tests/conftest.py#L2673-L2677core_dump_and_config_check``__dut_reload``config_db.json
This came from a fairly recent PR: #14368 Tagging @yejianquan in case this was an unintended change:

            is_modular_chassis = duthosts[0].get_facts().get("modular_chassis")
            if is_modular_chassis:
                results = recover_chassis(duthosts)
            else:
                results = parallel_run(__dut_reload, (), {"duts_data": duts_data}, duthosts, timeout=360)

recover_chassis doesn't save to but does. config_db.json__dut_reload ``

@arista-nwolfe It's expected behavior. To have a thorough recovery, we config reload RP and all LC together and it have being working well. And yes there is mismatch logic between __dut_reload and recover_chassis. But I suppose it shouldn't make any difference?

The logic of core_dump_and_config_check is

  1. At the start, use current config to
    1.a: Generate golden config
    1.b: Store in memory json.dumps(duts_data[node.hostname]["pre_running_config"]
  2. After the test, if it's not chassis, it
    2.a: Generate config files with memory data
    2.b: Config reload with the new generated(2.a) config.
  3. After the test, if it's not chassis, it:
    3.a: Config reload with the new generated(1.a) golden config.

@yutongzhang-microsoft please confirm the logic.

@arista-nwolfe The only difference is non-chassis uses the golden config from 2.a, chassis uses the golen config from 1.a. I suppose the test module won't write golden config, so there shouldn't have any difference, isn't it?

@yejianquan the fact that the config_db#.json files are no longer being written to by the saved config from the start of the test is a significant difference and is the reason for issues like #15991. If the config_db#.json files aren't corrected the config that the test modified will be persistent into the next test. In the case of this issue, causing future tests which depend on flex counters to fail due to flex counters being disabled by test_counterpoll_watermark.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
5 participants