env_process: Refactor huge pages setup/cleanup steps #4054

Open
wants to merge 3 commits into base: master

@bgartzi bgartzi commented Jan 17, 2025

This PR creates a new Setuper subclass, HugePagesSetup, for setting up and cleaning up huge pages. The original code is removed from virttest.env_process and replaced by registering the new HugePagesSetup class in the setup_manager.

_pre_hugepages_surp and _post_hugepages_surp were left in env_process. Their goal is to provide a mechanism in env_process to raise a TestFail in case pages were leaked during a test. If that mechanism were moved into the setuper, the TestFail would be masked by a plain Error because of the way setup_manager handles postprocess exceptions. Changing how SetupManager handles that requires a bigger discussion on how the test infrastructure should report test status, which is a much broader topic than this patch aims to cover.

This is one patch from a larger series refactoring the env_process preprocess and postprocess functions. In each of these patches, a pre/post process step is identified and replaced with a Setuper subclass so that the following goals can finally be met:
- Run cleanup steps only for setup steps that succeeded, to avoid possible environment corruption or hard-to-read errors.
- Run setup/cleanup steps symmetrically during env pre/post process.
- Reduce the length of the explicit pre/post process functions.
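
The goals listed above can be illustrated with a small standalone sketch. The names `SketchSetuper`, `SketchManager`, `StepA` and `StepB` are hypothetical and do not reproduce the avocado-vt Setuper/SetupManager API; the sketch only assumes that cleanup should run in reverse order and only for steps whose setup finished.

```python
class SketchSetuper:
    """Hypothetical base class: one setup step paired with one cleanup step."""

    def __init__(self, log):
        self.log = log

    def setup(self):
        raise NotImplementedError

    def cleanup(self):
        raise NotImplementedError


class StepA(SketchSetuper):
    def setup(self):
        self.log.append("setup A")

    def cleanup(self):
        self.log.append("cleanup A")


class StepB(SketchSetuper):
    def setup(self):
        raise RuntimeError("setup B failed")

    def cleanup(self):
        self.log.append("cleanup B")


class SketchManager:
    """Hypothetical manager, not the real SetupManager."""

    def __init__(self, log):
        self.log = log
        self.registered = []
        self.done = []  # steps whose setup() finished without raising

    def register(self, setuper_cls):
        self.registered.append(setuper_cls)

    def do_setup(self):
        for cls in self.registered:
            step = cls(self.log)
            step.setup()  # an exception here stops the remaining setup steps
            self.done.append(step)

    def do_cleanup(self):
        # Undo in reverse order, and only what was actually set up.
        while self.done:
            self.done.pop().cleanup()


log = []
manager = SketchManager(log)
manager.register(StepA)
manager.register(StepB)
try:
    manager.do_setup()
except RuntimeError:
    pass
manager.do_cleanup()
print(log)
```

Because StepB fails during setup, its cleanup never runs, and StepA is still cleaned up.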

Signed-off-by: Beñat Gartzia Arruabarrena <[email protected]>
@YongxueHong
Contributor

Hi @PaulYuuu
Could you help to review it? Thanks.

self.params["setup_hugepages"] = "yes"
if self.params.get("setup_hugepages") == "yes":
    h = test_setup.HugePageConfig(self.params)
    env_process._pre_hugepages_surp = h.ext_hugepages_surp
Contributor

_pre_hugepages_surp and _post_hugepages_surp exist for the huge page leak check. With this Setuper, I would suggest dropping them by returning a variable after do_cleanup, so that leak_num = _post_hugepages_surp - _pre_hugepages_surp can be shortened to leak_num = <new_var>.

Contributor Author

If I understood correctly, you are proposing to make the cleanup method of this Setuper return the leak_num value?
That would also involve updating the SetupManager behavior to meet the demands of HugePagesSetup. In that case, we would be allowing every Setuper to return a value, which goes against the current implementation, and we would have to update the SetupManager do_cleanup logic to handle it.

In my opinion, if we were to do that, we would have to think of a protocol of some sort that every Setuper can implement, instead of adding Setuper-specific logic to the rather general SetupManager. Could the answer be something like adding a post_cleanup_check function to the core Setuper and calling it from SetupManager.do_cleanup after the cleanup method has run?
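
A rough sketch of what such a hook could look like follows. Everything in it is an assumption made for illustration: `post_cleanup_check` is only the name proposed above, and `SketchSetuper`, `LeakySetuper`, `LeakDetected` and the free function `do_cleanup` are hypothetical stand-ins, not existing avocado-vt code.

```python
class LeakDetected(Exception):
    """Hypothetical error type used only in this sketch."""


class SketchSetuper:
    """Hypothetical base class; not the avocado-vt Setuper."""

    def setup(self):
        pass

    def cleanup(self):
        pass

    def post_cleanup_check(self):
        """Optional hook; the default implementation checks nothing."""


class LeakySetuper(SketchSetuper):
    def __init__(self):
        self.before = 0
        self.after = 0

    def setup(self):
        self.before = 2  # pretend 2 surplus pages existed before the test

    def cleanup(self):
        self.after = 5  # pretend 5 surplus pages exist afterwards

    def post_cleanup_check(self):
        if self.after > self.before:
            raise LeakDetected(
                "%d huge pages leaked!" % (self.after - self.before)
            )


def do_cleanup(setupers):
    """Run cleanup and then the generic hook, collecting error messages."""
    errors = []
    for setuper in reversed(setupers):
        try:
            setuper.cleanup()
            setuper.post_cleanup_check()
        except Exception as exc:
            errors.append("%s: %s" % (type(setuper).__name__, exc))
    return errors


step = LeakySetuper()
step.setup()
errors = do_cleanup([step])
print(errors)
```

The manager stays generic here: it calls the same hook on every setuper and knows nothing about huge pages.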

I also thought about other approaches, such as implementing a core Singleton abstraction so that we could reach Setuper instances, instead of classes, from within env_process and call extra functions on demand after the cleanup has finished. However, this approach sounds too complex for a workaround, and it could introduce further issues, such as Setuper instances "surviving" from one test case run to the next.

Contributor

"If I understood correctly, you are proposing to make the cleanup method of this Setuper return the leak_num value?"
No, the workaround is that the Setuper calculates _post_hugepages_surp - _pre_hugepages_surp and sets env_process._hugepage_leaks (take this name as an example).

The complex solution you mentioned is letting a Setuper return something. I agree we can do this, but not now; for the moment the implementation can keep env_process and the Setuper closely coupled.

Contributor

Hi @bgartzi @PaulYuuu
First, I am confused about the huge page leak check: won't it always report a leak during postprocess if the user chooses not to deallocate the huge pages by setting hugepages_deallocate = no?
The cleanup only deallocates the huge page memory when self.deallocate is True:

def cleanup(self):
    if self.deallocate:
        error_context.context("trying to deallocate hugepage memory")

So I think there is a potential logic issue in the leak check when self.deallocate is False. To sum up, the leak check needs to go together with the deallocation of the huge pages; otherwise it will raise the "huge pages leaked!" error all the time.
Hi @PaulYuuu, please correct me if I misunderstood the implementation of the huge page configuration and the related post_process. Thanks.

Second, is it possible to integrate the leak check into cleanup() instead of calling it independently in postprocess? From my understanding of the postprocess code, every test case that needs huge pages checks whether they were leaked. Hi @PaulYuuu, could you share the reason for doing the leak check here instead of in cleanup()?

My reason is that we could use a unified way to raise all errors, as follows:

if err:
    raise RuntimeError("Failures occurred while postprocess:\n%s" % err)

err contains all the runtime errors raised during postprocess, and I think the huge page check should be covered by err too, rather than by this separate branch:
elif _post_hugepages_surp > _pre_hugepages_surp:
    leak_num = _post_hugepages_surp - _pre_hugepages_surp
    raise exceptions.TestFail("%d huge pages leaked!" % leak_num)

As a result, we could let the user know that the error occurred in postprocess, rather than reporting it as an exceptions.TestFail.
BTW, we would expect env_process to raise its own errors during preprocess and postprocess, while exceptions.TestFail is raised by the test case itself.

So I suggest refactoring the previous implementation by integrating the leak check into cleanup. Then, @bgartzi, you could refactor HugePagesSetup in the normal way.
Please let me know your opinions. @PaulYuuu @bgartzi
Thanks.

Contributor

No. IIRC, if hugepages_deallocate is not set we will never hit the error: HugePages_Surp only changes when pages are leaked or when THP is in use, which is unrelated to setup/teardown. And yes, we should move self.ext_hugepages_surp = utils_memory.get_num_huge_pages_surp() out of the if self.deallocate block in cleanup.
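
For reference, HugePages_Surp is one of the counters Linux reports in /proc/meminfo. The sketch below shows one way to read it from text in that format; `parse_hugepages_surp` and `SAMPLE_MEMINFO` are hypothetical names with made-up sample values, and this is not the implementation of utils_memory.get_num_huge_pages_surp().

```python
import re

# Made-up excerpt in the /proc/meminfo format, for illustration only.
SAMPLE_MEMINFO = """\
MemTotal:       16305164 kB
HugePages_Total:      64
HugePages_Free:       60
HugePages_Rsvd:        0
HugePages_Surp:        3
Hugepagesize:       2048 kB
"""


def parse_hugepages_surp(meminfo_text):
    """Return the HugePages_Surp value found in /proc/meminfo-style text."""
    match = re.search(r"^HugePages_Surp:\s+(\d+)$", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("HugePages_Surp not found")
    return int(match.group(1))


print(parse_hugepages_surp(SAMPLE_MEMINFO))
```

On a Linux host the same function could be fed the contents of /proc/meminfo instead of the sample string.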

Without the Setuper, we cannot handle the huge page leak check inside cleanup in the current pre/post process context: preprocess and postprocess each instantiate a separate HugePageConfig object, so we cannot directly compare ext_hugepages_surp across the two HugePageConfig instances.
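
To make the point about instances concrete, here is a minimal sketch in which a single object takes both measurements, so the comparison needs no module-level globals. `SurplusTracker` and its `read_surplus` callable are hypothetical and only stand in for a setuper that lives across setup and cleanup.

```python
class SurplusTracker:
    """Hypothetical setuper-like object that owns both measurements."""

    def __init__(self, read_surplus):
        self.read_surplus = read_surplus  # callable returning the current count
        self.baseline = None
        self.leaked = None

    def setup(self):
        # Remember the surplus page count before the test runs.
        self.baseline = self.read_surplus()

    def cleanup(self):
        # Compare against the value recorded by this same instance.
        self.leaked = max(self.read_surplus() - self.baseline, 0)


readings = iter([1, 4])  # fake counter: 1 before the test, 4 after
tracker = SurplusTracker(lambda: next(readings))
tracker.setup()
tracker.cleanup()
print(tracker.leaked)
```

With two independent objects, one created in preprocess and one in postprocess, the baseline would have to be passed around by some other means, which is what the module-level variables in env_process were doing.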

Now that each step has its own setuper, doing the leak check there is possible, and we can do it either in this PR or in a separate one.

Contributor

Thanks for the explanation. So this patch is acceptable to me :)

Contributor

@YongxueHong YongxueHong Feb 24, 2025

Hi @bgartzi
Yeah, you are right. I realized my mistake and deleted the comment immediately ^_^.
Sorry for the inconvenience.

Contributor Author

Missed the deletion 😅. Also removing mine. BTW, do you think I should find a better name for the setuper class to avoid this kind of confusion in the future?

Contributor

Hi @bgartzi
Yeah, indeed, the similar class names may cause some confusion, but I cannot come up with a better name for it right now. If you think of a good one, I would really appreciate it.
BTW, the current name looks good to me too.
Thanks.

Contributor

IMO this patch looks good to me as well :)
PS: I will trigger some hugepages tests later

Implements a specific exception to be raised when huge pages are leaked.

Signed-off-by: Beñat Gartzia Arruabarrena <[email protected]>
bgartzi added a commit to bgartzi/avocado-vt that referenced this pull request Feb 20, 2025
Postprocess checked whether hugepages were leaked after a related test
or not and raised a TestFail right after that.

During the refactoring of the logic in charge of setting up and cleaning
up huge pages for the test environment, a discussion was raised on
whether this should be tolerated or not (see [0]). The discussion
concluded that this should be avoided, even if the behavior of related
test cases in the event of huge pages leaks would change.

This commit removes the check from postprocess and adds it to the right
Setuper intended to run the huge page setup/cleanup steps. As the
TestFail would be masked by the RuntimeError that the SetupManager
raises in the end, this commit also raises a HugePagesError instead of
the former TestFail in favor of clarity.

[0] avocado-framework#4054

Signed-off-by: Beñat Gartzia Arruabarrena <[email protected]>
Postprocess checked whether hugepages were leaked after a related test
or not and raised a TestFail right after that.

During the refactoring of the logic in charge of setting up and cleaning
up huge pages for the test environment, a discussion was raised on
whether this should be tolerated or not (see [0]). The discussion
concluded that this should be avoided, even if the behavior of related
test cases in the event of huge pages leaks would change.

This commit removes the check from postprocess and adds it to the right
Setuper intended to run the huge page setup/cleanup steps. As the
TestFail would be masked by the RuntimeError that the SetupManager
raises in the end, this commit also raises a HugePagesError instead of
the former TestFail in favor of clarity, although it will be masked by
SetupManager's RuntimeError anyway.

[0] avocado-framework#4054

Signed-off-by: Beñat Gartzia Arruabarrena <[email protected]>
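
The masking described in the commit message can be sketched as follows. The `HugePagesError` class below is only a stand-in for the exception the commit adds, and `leaky_cleanup` and `run_cleanups` are hypothetical helpers with made-up page counts; the RuntimeError message mirrors the one quoted earlier in this conversation.

```python
class HugePagesError(Exception):
    """Stand-in for the exception named in the commit message."""


def leaky_cleanup():
    pre, post = 0, 2  # made-up surplus page counts
    if post > pre:
        raise HugePagesError("%d huge pages leaked!" % (post - pre))


def run_cleanups(cleanups):
    """Collect every cleanup failure and report them as one RuntimeError."""
    err = ""
    for cleanup in cleanups:
        try:
            cleanup()
        except Exception as details:
            err += "%s: %s\n" % (type(details).__name__, details)
    if err:
        raise RuntimeError("Failures occurred while postprocess:\n%s" % err)


try:
    run_cleanups([leaky_cleanup])
except RuntimeError as exc:
    message = str(exc)
    print(message)
```

The caller only ever sees a RuntimeError, but the specific exception name still appears in its message, which is where a dedicated exception type helps readability.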
@mcasquer
Contributor

Memory huge_page test runs passed with RHEL 10 and Win2025 guests

 (01/18) Host_RHEL.m10.u0.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.10.0.x86_64.io-github-autotest-qemu.unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads.q35: STARTED
 (01/18) Host_RHEL.m10.u0.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.10.0.x86_64.io-github-autotest-qemu.unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads.q35: PASS (955.79 s)
 (02/18) Host_RHEL.m10.u0.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.10.0.x86_64.io-github-autotest-qemu.system_reset_bootable.q35: STARTED
 (02/18) Host_RHEL.m10.u0.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.10.0.x86_64.io-github-autotest-qemu.system_reset_bootable.q35: PASS (385.09 s)
...
 (18/18) Host_RHEL.m10.u0.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.10.0.x86_64.io-github-autotest-qemu.hugepage_mem_stress.non_existent_mem_path.q35: STARTED
 (18/18) Host_RHEL.m10.u0.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.RHEL.10.0.x86_64.io-github-autotest-qemu.hugepage_mem_stress.non_existent_mem_path.q35: PASS (135.92 s)
RESULTS    : PASS 15 | ERROR 0 | FAIL 0 | SKIP 3 | WARN 0 | INTERRUPT 0 | CANCEL 0
 (01/18) Host_RHEL.m9.u6.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win2025.x86_64.io-github-autotest-qemu.unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads.q35: STARTED
 (01/18) Host_RHEL.m9.u6.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win2025.x86_64.io-github-autotest-qemu.unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads.q35: PASS (5306.26 s)
 (02/18) Host_RHEL.m9.u6.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win2025.x86_64.io-github-autotest-qemu.disable_win_update.q35: STARTED
 (02/18) Host_RHEL.m9.u6.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win2025.x86_64.io-github-autotest-qemu.disable_win_update.q35: PASS (499.09 s)
...
 (18/18) Host_RHEL.m9.u6.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win2025.x86_64.io-github-autotest-qemu.migrate.tcp.with_filter_off.with_post_copy.q35: STARTED
 (18/18) Host_RHEL.m9.u6.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win2025.x86_64.io-github-autotest-qemu.migrate.tcp.with_filter_off.with_post_copy.q35: PASS (274.00 s)
RESULTS    : PASS 15 | ERROR 0 | FAIL 0 | SKIP 3 | WARN 0 | INTERRUPT 0 | CANCEL 0

Contributor

@mcasquer mcasquer left a comment

LGTM

6 participants