-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-16916 container: check inflight open #15682
Conversation
Check inflight container open, which might be stucked in IV fetch, then the following cont open will just increase the open count, then if the previous container open failed, it will get the assertion failure. So let's retry if there are inflight container open. Signed-off-by: Di Wang <[email protected]>
Ticket title is 'ds_cont_local_open() Assertion 'hdl->sch_cont->sc_open == 1' failed: Unexpected open count for cont c6822a9a: 307' |
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15682/1/execution/node/1468/log |
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15682/2/execution/node/468/log |
fix wording Signed-off-by: Di Wang <[email protected]>
further update Signed-off-by: Di Wang <[email protected]>
fix spelling Signed-off-by: Di Wang <[email protected]>
sc_dtx_delay_reset : 1, sc_dtx_registered : 1, sc_props_fetched : 1, sc_stopping : 1, | ||
sc_destroying : 1, sc_vos_agg_active : 1, sc_ec_agg_active : 1, | ||
/* flag of CONT_CAPA_READ_DATA/_WRITE_DATA disabled */ | ||
sc_rw_disabled : 1, sc_scrubbing : 1, sc_rebuilding : 1, sc_open_initializing : 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the recommended coding style?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is actually only suppose to change 1 line, not sure why it messed up like this, which might due to git-hook sth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be messed up by clang-format hooks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, clang-format does this. I believe if you use ; instead of comma, you still use same space and don't have the formatting issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
verified (though we do this in other places in DAOS)
#include <stdint.h>
#include <stdio.h>
struct foo {
uint64_t foo;
uint32_t bar:1;
uint32_t fub:1;
uint32_t other;
};
int main(int argc, char ** argv)
{
printf("%zu\n", sizeof(struct foo));
}
This prints 16 which means bar and fub are placed together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in fact, you can make this change without any formatting and clang-format will fix it and make it look nice
if (rc != 0) | ||
if (rc != 0) { | ||
if (rc == -DER_AGAIN) | ||
goto retry; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Busy retry? It may be better to wait for former inflight opening to return and then check again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a collective operation. which will yield, but I assume wait a few ms might be ok.
wait 50ms before retry. Signed-off-by: Di Wang <[email protected]>
src/include/daos_srv/container.h
Outdated
@@ -1,5 +1,5 @@ | |||
/* | |||
* (C) Copyright 2015-2024 Intel Corporation. | |||
* (C) Copyright 2015-2025 Intel Corporation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be Google and no update to Intel
src/container/srv_target.c
Outdated
@@ -1,5 +1,5 @@ | |||
/** | |||
* (C) Copyright 2016-2024 Intel Corporation. | |||
* (C) Copyright 2016-2025 Intel Corporation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Google but no update to Intel
sc_dtx_delay_reset : 1, sc_dtx_registered : 1, sc_props_fetched : 1, sc_stopping : 1, | ||
sc_destroying : 1, sc_vos_agg_active : 1, sc_ec_agg_active : 1, | ||
/* flag of CONT_CAPA_READ_DATA/_WRITE_DATA disabled */ | ||
sc_rw_disabled : 1, sc_scrubbing : 1, sc_rebuilding : 1, sc_open_initializing : 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
verified (though we do this in other places in DAOS)
#include <stdint.h>
#include <stdio.h>
struct foo {
uint64_t foo;
uint32_t bar:1;
uint32_t fub:1;
uint32_t other;
};
int main(int argc, char ** argv)
{
printf("%zu\n", sizeof(struct foo));
}
This prints 16 which means bar and fub are placed together.
sc_dtx_delay_reset : 1, sc_dtx_registered : 1, sc_props_fetched : 1, sc_stopping : 1, | ||
sc_destroying : 1, sc_vos_agg_active : 1, sc_ec_agg_active : 1, | ||
/* flag of CONT_CAPA_READ_DATA/_WRITE_DATA disabled */ | ||
sc_rw_disabled : 1, sc_scrubbing : 1, sc_rebuilding : 1, sc_open_initializing : 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in fact, you can make this change without any formatting and clang-format will fix it and make it look nice
Signed-off-by: Jeff Olivier <[email protected]>
Signed-off-by: Jeff Olivier <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately this PR has been landed, but looks there is a defect in the PR.
@@ -1663,6 +1678,7 @@ ds_cont_local_open(uuid_t pool_uuid, uuid_t cont_hdl_uuid, uuid_t cont_uuid, | |||
dtx_cont_close(hdl->sch_cont, true); | |||
|
|||
err_cont: | |||
hdl->sch_cont->sc_open_initializing = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This "sc_open_initializing" flag will be cleared when second open return -DER_AGAIN? Why don't we simply use a ABT_Mutex to serialize above open procedures?
Check inflight container open, which might be stucked in IV fetch, then the following cont open will just increase the open count, then if the previous container open failed, it will get the assertion failure. So let's retry if there are inflight container open. Signed-off-by: Di Wang <[email protected]>
Check inflight container open, which might be stucked in IV fetch, then the following cont open will just increase the open count, then if the previous container open failed, it will get the assertion failure. So let's retry if there are inflight container open. Signed-off-by: Di Wang <[email protected]>
Check inflight container open, which might be stucked in IV fetch, then the following cont open will just increase the open count, then if the previous container open failed, it will get the assertion failure.
So let's retry if there are inflight container open.
Before requesting gatekeeper:
Features:
(orTest-tag*
) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.Gatekeeper: