DAOS-16501 build: Add libsanitize #15105

Open
wants to merge 33 commits into
base: master

knard38
Contributor

@knard38 knard38 commented Sep 9, 2024

Description

Add a scons build option, SANITIZERS, that allows building with libasan. The new option takes a list of sanitizer tools such as AddressSanitizer (i.e. -fsanitize=address), ThreadSanitizer (i.e. -fsanitize=thread), LeakSanitizer (i.e. -fsanitize=leak), etc.

A list of the available sanitizer tools and their compatibility can be found in the gcc man page.
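
As a rough illustration of what such an option has to do, the sketch below turns a comma-separated SANITIZERS value into compiler flags and rejects combinations that gcc documents as incompatible. It is not the code from this PR: the function name `sanitizer_flags`, the `KNOWN_SANITIZERS` list, and the choice to add `-fno-omit-frame-pointer` are assumptions made for the example.

```python
# Hypothetical helper, not taken from this PR: names and the accepted
# sanitizer list are assumptions made for illustration.
KNOWN_SANITIZERS = ("address", "thread", "leak", "undefined")

# gcc documents that -fsanitize=thread cannot be combined with
# -fsanitize=address or -fsanitize=leak.
INCOMPATIBLE = (frozenset(("thread", "address")), frozenset(("thread", "leak")))


def sanitizer_flags(value):
    """Turn a comma-separated SANITIZERS value into a list of compiler flags."""
    names = [name.strip() for name in value.split(",") if name.strip()]
    for name in names:
        if name not in KNOWN_SANITIZERS:
            raise ValueError(f"unknown sanitizer: {name}")
    for pair in INCOMPATIBLE:
        if pair <= set(names):
            raise ValueError("incompatible sanitizers: " + ", ".join(sorted(pair)))
    if not names:
        return []
    # Keeping frame pointers makes sanitizer stack traces easier to read.
    return ["-fsanitize=" + ",".join(names), "-fno-omit-frame-pointer"]
```

With a helper like this, a build invoked as `scons SANITIZERS=address` would end up compiling and linking with `-fsanitize=address`, while a request such as `address,thread` would fail early instead of producing a confusing compiler error.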

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

TODO

Required-githooks: true

Signed-off-by: Cedric Koch-Hofer <[email protected]>

github-actions bot commented Sep 9, 2024

Ticket title is 'LRZ: m02r01s10dao coredump - invalid free'
Status is 'Resolved'
Labels: 'lrz'
https://daosio.atlassian.net/browse/DAOS-16501

@daosbuild1

This comment was marked as outdated.

utils/rpms/daos.spec (outdated review thread, resolved)
Miscellaneous fixes.

Required-githooks: true

Signed-off-by: Cedric Koch-Hofer <[email protected]>

Fixing invalid compilation errors reported by gcc on el9.

Signed-off-by: Cedric Koch-Hofer <[email protected]>

@knard38
Contributor Author

knard38 commented Jan 7, 2025

TBH, I have not run any performance tests, as my initial aim was not to use it in production but to help identify threads causing memory corruption (such as buffer overflows).

yes for sure this is just for dev, but when i tried enabling sanitizer on the server side a while ago, the server just hangs. maybe i did something wrong and it works for you. does this work also for the dependencies? like does it also report things for mercury for example?

I was not planning to handle the direct DAOS dependencies such as mercury.
I plan to look at them in a second step... if I am able to obtain an acceptable result with this PR 🤞

@liw
Contributor

liw commented Jan 8, 2025

@mchaarawi, ASan requires an LD_LIBRARY_PATH for engine modules and a larger default Argobots stack size in your engine YAML file(s). Without them, engines either fail to start because they can't find librdb.so, or overrun stacks during pool creation. The performance has been amazing compared to Valgrind---I barely notice a difference when running daos_test on Wolf.

Mercury already has the support; @soumagne must have already fixed the issues that are easily found. Same for Argobots.

I've used ASan with DAOS plus daos_test in a rush before 2.6.1 and managed to fix the following product issues.

And the following test issues.

In that process, @knard38 and I discovered each other's work on ASan. We planned to land the support (this PR), present an introduction, and see if this is robust enough for certain regular CI runs. Beyond that, our future plan is to look into the memory leak reports (ignored so far), TSan, etc.
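
The two requirements mentioned at the top of this comment (a library path for the engine modules and a larger default Argobots stack) can be sketched as a small helper that computes the environment values. This is only an illustration under assumptions: the function name `asan_engine_env` is made up, the module directory and stack size in the usage note are placeholders, and `ABT_THREAD_STACKSIZE` is assumed to be the Argobots variable that controls the default stack size. In practice the values would go into the engine YAML file(s) as described above.

```python
import os


def asan_engine_env(module_dir, stack_size_bytes, base_env=None):
    """Return a copy of the environment adjusted for an ASan-built engine.

    Hypothetical helper for illustration only, not part of DAOS.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Prepend the directory that holds the engine modules (e.g. librdb.so)
    # so the dynamic loader can find them.
    current = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = module_dir + (os.pathsep + current if current else "")
    # Assumption: ABT_THREAD_STACKSIZE sets the default Argobots stack size.
    env["ABT_THREAD_STACKSIZE"] = str(stack_size_bytes)
    return env
```

For example, `asan_engine_env("/path/to/engine/modules", 65536)` returns a dictionary whose `LD_LIBRARY_PATH` starts with the given directory; both the path and the size here are placeholders, not recommended values.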

@mchaarawi
Contributor

thanks for the detailed reply!
you obviously spent more time getting it to work than i did :-)

@liw
Contributor

liw commented Jan 8, 2025

thanks for the detailed reply! you obviously spent more time getting it to work than i did :-)

You're very welcome to work with Cedric and me, if you like. We still don't know whether ASan will turn out to be robust enough or not, because I believe Argobots should cause some false positives, but I've seen zero, which puzzles me.

Fixing invalid compilation errors reported by gcc on el9.

Signed-off-by: Cedric Koch-Hofer <[email protected]>
Fix libasan version for ubuntu image used with GHA
Fix invalid copyright

Signed-off-by: Cedric Koch-Hofer <[email protected]>
Update modification de

Signed-off-by: Cedric Koch-Hofer <[email protected]>

Add documentation on ASan lib usage and limitations.

Signed-off-by: Cedric Koch-Hofer <[email protected]>
@knard38
Contributor Author

knard38 commented Jan 20, 2025

I tried with SANITIZERS=address locally and got this
Your application is linked against incompatible ASan runtimes.

Could you please tell me on which Linux distribution you get this error? Would it also be possible to get the build log and the version of ASan used? Thanks in advance.

I had the same kind of issue with the CI stages building Debian RPMs.
It should be fixed now.

@knard38
Contributor Author

knard38 commented Jan 20, 2025

Trying with clang instead, I got this

src/object/srv_obj.c:4457:1: error: stack frame size (9112) exceeds limit (8192) in 'ds_cpd_handle_one' [-Werror,-Wframe-larger-than]
ds_cpd_handle_one(crt_rpc_t *rpc, struct daos_cpd_sub_head *dcsh, struct daos_cpd_disp_ent *dcde,

I was able to reproduce this issue and have fixed it in the latest commits.

@daosbuild1
Collaborator

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15105/58/execution/node/632/log

@knard38 knard38 marked this pull request as ready for review January 21, 2025 19:15
@knard38 knard38 requested a review from a team as a code owner January 21, 2025 19:15
@knard38
Contributor Author

knard38 commented Jan 21, 2025

The last failing Jenkins build used the new Jenkins configuration parameter CI_SCONS_ARGS to build DAOS with the ASan lib.
As expected, the unit tests do not pass because some memory leaks are detected.
As indicated in the documentation of this PR, fixing the non-regression tests will be done in follow-up PRs, as this will require significant work.

Without setting the Jenkins configuration parameter CI_SCONS_ARGS, the build passed fully: https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15105/57/

@knard38
Contributor Author

knard38 commented Jan 23, 2025

@jolivier23, @mchaarawi, @liw and @grom72, could you tell me whether the PR is OK with the current known limitations, or whether I need to work more on it to make it acceptable for landing?
More details on the current limitations can be found in docs/dev/development.md

@grom72
Contributor

grom72 commented Jan 23, 2025

@jolivier23 , @mchaarawi , @liw and @grom72 could you tell me if the PR is OK with the current known limitations or if I need to work more on it to make it acceptable for landing. More details on the current limitations could be found at docs/dev/development.md

I see one issue that we should resolve before we land this great improvement.
It needs to be incorporated into regular CI execution, otherwise it will be a dead piece of code.

8 participants