Fix mithril-aggregator genesis bootstrap flakiness in e2e tests #2303
Test Results 4 files ±0 56 suites ±0 10m 44s ⏱️ - 10m 14s Results for commit 08ade9d. ± Comparison against base commit 31caf35. This pull request removes 2 and adds 2 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results. |
85f82e7 to 8c6a042
It used the current epoch, but when we use this data we correlate it with the current signers, which are retrieved using the `signer_retrieval_epoch` (-1). To keep the data consistent we now use the `signer_retrieval_epoch` to retrieve the total SPO/stake.
…e instead of failing when there's no stake distribution available. This addresses the specific case of the first aggregator start, when there's no stake distribution available for the first two epochs.
…for the first time
Epoch settings for the first "working" epoch were not available at start, since they use the 'signer retrieval offset' of -1 and we handled discrepancies only from the current epoch onward. Also, `handle_discrepancies_at_startup` is modified to register data over four epochs instead of three, to take into account the case where an epoch change happens between its call and the epoch service 'inform epoch'.
8c6a042 to 9ba9d85
LGTM
9ba9d85 to 2321d95
LGTM 🔥
* mithril-aggregator from `0.7.2` to `0.7.3`
* [js] mithril-explorer from `0.7.28` to `0.7.29`
Content
This PR fixes flakiness in the mithril-end-to-end tests that manifests as a failure when bootstrapping the genesis certificate.
Analysis
Summary
In a successful run:
* The epoch settings read by `EpochService::inform_epoch` are at offsets -1, 0 and +1, while the ones we register in `handle_discrepancies_at_startup` are at offsets 0, 1 and 2.
* … `handle_discrepancies_at_startup` too.
* … `update_epoch_settings`, allowing the genesis bootstrap to run, since that was what was missing.
In a failing run:
* The epoch changes between `handle_discrepancies_at_startup` and the first state machine cycle.
* `EpochService::inform_epoch` will fail, so the call to `update_epoch_settings` is never made.
* After the next epoch change, the epoch settings needed by `EpochService::inform_epoch` are not available since `update_epoch_settings` was not called, making every subsequent cycle fail.
Details
* … `handle_discrepancies_at_startup`: this fills epoch settings for the current epoch plus the two following.
* … `update_epoch_settings` when transitioning from the `idle` to the `ready` state, after the call to `EpochService::inform_epoch`.
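
The ordering matters because a failure in the first call prevents the second one from running at all. Below is a minimal sketch of that behaviour, not the aggregator's actual code: `Runner`, its `inform_epoch_ok` flag and `try_transition_to_ready` are hypothetical names introduced for illustration, and only the call order (`inform_epoch` before `update_epoch_settings`) is taken from the description above.

```rust
/// Hypothetical stand-in for the component driven by the state machine.
struct Runner {
    /// Whether the simulated `inform_epoch` call succeeds.
    inform_epoch_ok: bool,
    /// Names of the calls that were actually executed, in order.
    calls: Vec<&'static str>,
}

impl Runner {
    fn inform_epoch(&mut self) -> Result<(), String> {
        self.calls.push("inform_epoch");
        if self.inform_epoch_ok {
            Ok(())
        } else {
            Err("inform_epoch failed".to_string())
        }
    }

    fn update_epoch_settings(&mut self) -> Result<(), String> {
        self.calls.push("update_epoch_settings");
        Ok(())
    }
}

/// Sketch of the idle -> ready transition: `?` returns on the first error,
/// so `update_epoch_settings` is skipped whenever `inform_epoch` fails.
fn try_transition_to_ready(runner: &mut Runner) -> Result<(), String> {
    runner.inform_epoch()?;
    runner.update_epoch_settings()?;
    Ok(())
}

fn main() {
    let mut failing = Runner { inform_epoch_ok: false, calls: Vec::new() };
    let result = try_transition_to_ready(&mut failing);
    println!("result: {:?}, calls: {:?}", result, failing.calls);
}
```

In the failing case only `inform_epoch` is recorded, which is why no new epoch settings get stored for later epochs.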
We can see epoch settings inserted by `handle_discrepancies_at_startup` in the database at epochs 5, 6 and 7, plus 10, 11 and 12 (when the bootstrap command was run). This means that `handle_discrepancies_at_startup` was run at epochs 5 and 10.
In the logs we can see that the epoch changed between the run of `handle_discrepancies_at_startup` at epoch 5 and the first state machine cycle.
When reading the aggregator logs we see that, at every cycle for epoch 6, the state machine got an error from `EpochService::inform_epoch`:
This error is raised at the end of `EpochService::inform_epoch`, when it tries to obtain the total Cardano SPO and stake, which are based on the stake distributions stored previously.
In turn the state machine never gets to `update_epoch_settings`, since this call is done after `inform_epoch`.
As a consequence of being unable to run `update_epoch_settings`, the error raised at each cycle changes after the epoch change (to epoch 7). It now fails earlier in the function, when trying to obtain the epoch settings for the signer registration epoch (+1):
Afterward the state machine cycle will always fail for this same reason and won't be able to be ready for genesis bootstrap.
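
The epoch settings part of this scenario can be replayed with a small model. This is only a sketch under stated assumptions: `registered_at_startup` and `missing_settings` are hypothetical helpers, epochs are plain signed integers, and the four-epoch variant assumes the registered window covers offsets -1 to +2 (the commit messages say "four epochs instead of 3" and mention the -1 offset, but do not spell out the exact window). It does not model the stake distribution error seen at epoch 6.

```rust
use std::collections::BTreeSet;

/// Simplified epoch type (signed so that `epoch - 1` needs no special casing).
type Epoch = i64;

/// Offsets of the epoch settings read by the epoch service, per the analysis.
const REQUIRED_OFFSETS: [i64; 3] = [-1, 0, 1];

/// Epochs stored by a startup routine that registers `count` consecutive
/// epochs starting at `startup_epoch + first_offset`.
fn registered_at_startup(startup_epoch: Epoch, first_offset: i64, count: i64) -> BTreeSet<Epoch> {
    (0..count).map(|i| startup_epoch + first_offset + i).collect()
}

/// Epochs whose settings are needed at `current_epoch` but are not stored.
fn missing_settings(current_epoch: Epoch, stored: &BTreeSet<Epoch>) -> Vec<Epoch> {
    REQUIRED_OFFSETS
        .iter()
        .map(|offset| current_epoch + offset)
        .filter(|epoch| !stored.contains(epoch))
        .collect()
}

fn main() {
    // Before the change: offsets 0, 1, 2 registered at epoch 5 -> {5, 6, 7}.
    let before = registered_at_startup(5, 0, 3);
    for epoch in 5..=7 {
        println!("before, epoch {}: missing {:?}", epoch, missing_settings(epoch, &before));
    }

    // Assumed window after the change: offsets -1, 0, 1, 2 -> {4, 5, 6, 7}.
    let after = registered_at_startup(5, -1, 4);
    for epoch in 5..=6 {
        println!("after, epoch {}: missing {:?}", epoch, missing_settings(epoch, &after));
    }
}
```

With the three-epoch window, epoch 5 lacks the settings for epoch 4, and epoch 7 lacks those for epoch 8 if `update_epoch_settings` never ran in between. With the assumed four-epoch window, nothing is missing at epoch 5 or at epoch 6.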
Issues
Unavailability of the stake distribution
Important
There's a missing offset of -1 (signer retrieval epoch) when fetching the stake distribution for the total SPO and stake in the epoch service.
This is because the stake distribution that should be used is the one associated with the current signers (those who can sign), which use that offset.
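
A minimal sketch of such a lookup, assuming an in-memory map in place of the real stake store: `StakeStore`, `Totals` and `totals_for_current_signers` are hypothetical names, not the aggregator's API. The totals are read at `current - 1` and come back as an `Option`, so a missing distribution is a normal outcome and not an error.

```rust
use std::collections::HashMap;

type Epoch = u64;
type Stake = u64;

/// Hypothetical in-memory stand-in for the stored stake distributions:
/// epoch -> (pool id -> stake).
type StakeStore = HashMap<Epoch, HashMap<String, Stake>>;

/// Totals derived from a single stake distribution.
#[derive(Debug, PartialEq)]
struct Totals {
    spo: u64,
    stake: Stake,
}

/// Totals for the distribution tied to the current signers, i.e. the one
/// stored for the signer retrieval epoch (`current - 1`).
/// Returns `None` when no distribution is stored for that epoch.
fn totals_for_current_signers(store: &StakeStore, current: Epoch) -> Option<Totals> {
    let signer_retrieval_epoch = current.checked_sub(1)?;
    let distribution = store.get(&signer_retrieval_epoch)?;
    Some(Totals {
        spo: distribution.len() as u64,
        stake: distribution.values().sum(),
    })
}

fn main() {
    let mut store = StakeStore::new();
    store.insert(
        5,
        HashMap::from([("pool1".to_string(), 100), ("pool2".to_string(), 250)]),
    );

    // At epoch 6 the distribution of epoch 5 is used.
    println!("{:?}", totals_for_current_signers(&store, 6));
    // At epoch 5 nothing is stored for epoch 4 yet.
    println!("{:?}", totals_for_current_signers(&store, 5));

    // A consumer such as a status endpoint can fall back to 0.
    let total_stake = totals_for_current_signers(&store, 5).map(|t| t.stake).unwrap_or(0);
    println!("total stake reported: {}", total_stake);
}
```

Returning `Option` keeps the "not available yet" case out of the error path, and callers that need a number can choose their own default.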
This is normal: when we first start an aggregator, the stake distribution that we register won't be usable before two epochs.
This means that we can't reliably provide the total SPO/stake in the epoch service; those have to be `Option` to take this gap into account when there's no usable stake distribution.
Making this change solves the problem since `inform_epoch` won't fail anymore, but the aggregator status route has to be adapted.
Unavailability of the epoch settings
This is a minor issue, as it doesn't stop the aggregator from bootstrapping a genesis certificate as long as we wait for one epoch after start.
But it makes the signers' calls to `/epoch-settings` fail, delaying when they can register and filling their logs with errors.
By also filling epoch settings with the retrieval offset in `handle_discrepancies_at_startup` we can avoid this problem. We should also fill epoch settings for 4 epochs to avoid the edge case where there's an epoch change between `handle_discrepancies_at_startup` and the first state machine cycle.
Changes
For `mithril-aggregator`:
* `EpochService`:
  * total Cardano SPO and stake are now `Option`, since they are not available before two epochs when starting an aggregator for the first time
* `/status` route: use 0 for total Cardano stake and SPO when the values returned by the epoch service are `None`
* `handle_discrepancies_at_startup`:
  * also register epoch settings at offset `-1`, so data are available for `EpochService::inform_epoch`
  * register epoch settings over `4` epochs instead of `3`, so `EpochService::inform_epoch` won't fail if there's an epoch change between the call to `handle_discrepancies_at_startup` in the aggregator builder and the first state machine cycle
For `mithril-explorer`:
Pre-submit checklist
Issue(s)
Relates to #2222