Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix mithril-aggregator genesis bootstrap flakiness in e2e tests #2303

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

Alenar
Copy link
Collaborator

@Alenar Alenar commented Feb 12, 2025

Content

This PR fix a flakiness in mithril-end-to-end tests that manifest with a failure when bootstrapping the genesis certificate.

Analysis

Summary

In a successful run:

  1. the stake distribution that we try to fetch (for the actual epoch, without offset) isn't immediately available since when we register them we do so with an offset of +1.
  2. we don't see stake distribution unavailability errors because ... the epoch settings also isn't available because the ones we are fetching in EpochService::inform_epoch are with offset -1, 0, +1 and the one we register in handle_discrepancies_at_startup are with offset 0, 1, 2.
  3. the errors solve themself after epoch change, then the stake distribution registered in the previous epoch is now available and the epoch settings registered by handle_discrepancies_at_startup too.
  4. The state machine will still have it cycles fails but after the call to update_epoch_settings, allowing the genesis bootstrap to run since it was what was missing.

In failing run:

  1. There's an epoch change between handle_discrepancies_at_startup and the first state machine cycle
  2. This time there's the three epoch settings available because of the epoch change
  3. But there's still no stake distribution available for the actual epoch so EpochService::inform_epoch will fail, making it missing calls to update_epoch_settings.
  4. After an epoch change the epoch settings for epoch +1, needed by EpochService::inform_epoch, is not available since update_epoch_settings was not called, making each subsequent cycles fails.

Details

Details

📓 Note
Logs extracted from the following failed run: https://github.com/input-output-hk/mithril/actions/runs/12178789405/job/33970064206#step:6:533

  1. The genesis bootstrap fails with the following error:
Error: genesis-tools: initialization error

Caused by:
    Missing protocol parameters for epoch 10
  1. The protocol parameters are part of the epoch settings that are inserted:
  • at startup when building the epoch settings store by running handle_discrepancies_at_startup: this fills epochs settings for the actual epoch plus the two following.
  • by the state machine update_epoch_settings when transitioning from idle to ready state after calling to EpochService::inform_epoch
  1. We can see epoch settings inserted by handle_discrepancies_at_startup in the database at epoch 5, 6 and 7 plus 10, 11, and 12 (when the bootstrap command was run). This means that handle_discrepancies_at_startup was run at epoch 5 and 10.

  2. In the logs we can see that between the run of handle_discrepancies_at_startup at epoch 5 and the first state machine cycle, the epoch changed.

  3. When reading the aggregator logs we see that the state machine got every cycle for the epoch 6 an error from EpochService::inform_epoch:

Epoch service failed to obtain the stake distribution for epoch: 6

Caused by:
    The stake distribution for epoch Epoch(6) is not available.
  1. This error is raised at the end of EpochService::inform_epoch when it tries to obtains the total cardano spo and stakes which is based on stakes distribution stored previously.
    In turn the state machine never get to update_epoch_settings since this call is done after inform_epoch.

  2. As a consequence of being unable to run update_epoch_settings the error raised at each cycle change after epoch change (to epoch 7). It fails earlier in the function when trying to obtains the epoch settings for signer registration epoch (+1):

Epoch service could not obtain signer registration epoch settings for epoch 8

Stack backtrace:
    0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
    1: mithril_aggregator::services::epoch_service::MithrilEpochService::get_epoch_settings::{{closure}}

Afterward the state machine cycle will always fails for this same reason and won't be able to be ready for genesis bootstrap.

Issues

Unavailability of the stake distribution

Important

There's a missing offset of -1 (signer retrieval epoch) when fetching the stake distribution for total spo and stake in epoch service.
This is because the stake distribution that should be used is the one associated with the current signers (those who can sign), which use that offset.

This is normal, when we first start an aggregator the stake distribution that we register won't be usable before two epochs.
This means that we can't reliably provide total spo/stake in the epoch service, those have to be Option to take in account this gap when there's no stake distribution usable.

Making this change solve the problems since the inform_epoch won't fail anymore, but the aggregator status route have to be adapted.

Unavailability of the epoch settings

This is a minor issue as this don't stop the aggregator from bootstrapping a genesis certificate as long as we wait for one epoch after start.
But this makes the signers calls to /epoch-settings fails, delaying when they can register and filling their logs with errors.

By also filling epoch settings with the retrieval offset in handle_discrepancies_at_startup we can avoid this problem. Also we should fills epoch settings for 4 epochs to avoid the edge case when there's a epoch change between handle_discrepancies_at_startup and the first state machine cycle.

Changes

For mithril-aggregator:

  • EpochService:
    • fix incorrect epoch for fetching the stake distribution used to compute total cardano stake and spo
    • total cardano stake and spo are now Option since they are not available before two epochs when starting an aggregator for the first time
  • /status route: use 0 for total cardano stake and spo when the values returned by the Epoch Service are none
  • handle_discrepancies_at_startup:
    • called with an offset of -1 so data are available for EpochService::inform_epoch
    • register epoch settings for 4 epochs instead of 3 so EpochService::inform_epoch won't fails if there's an epoch change between calls to handle_discrepancies_at_startup in the aggregator builder and the first state machine cycle.

For mithril-explorer:

  • Properly handle a total of 0 when computing percentages.

Pre-submit checklist

  • Branch
    • Tests are provided (if possible)
    • Crates versions are updated (if relevant)
    • CHANGELOG file is updated (if relevant)
    • Commit sequence broadly makes sense
    • Key commits have useful messages
  • PR
    • No clippy warnings in the CI
    • Self-reviewed the diff
    • Useful pull request description
    • Reviewer requested

Issue(s)

Relates to #2222

Copy link

github-actions bot commented Feb 12, 2025

Test Results

    4 files  ±0     56 suites  ±0   10m 44s ⏱️ - 10m 14s
1 597 tests ±0  1 597 ✅ ±0  0 💤 ±0  0 ❌ ±0 
1 901 runs  ±0  1 901 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 08ade9d. ± Comparison against base commit 31caf35.

This pull request removes 2 and adds 2 tests. Note that renamed tests count towards both.
mithril-aggregator ‑ store::epoch_settings_storer::tests::test_handle_discrepancies_at_startup_should_complete_at_least_two_epochs
mithril-stm ‑ eligibility_check::tests::early_break_taylor
mithril-aggregator ‑ http_server::routes::status::tests::total_cardano_stakes_and_spo_are_0_if_none_in_epoch_service
mithril-aggregator ‑ store::epoch_settings_storer::tests::test_handle_discrepancies_at_startup_should_complete_at_least_four_epochs

♻️ This comment has been updated with latest results.

@Alenar Alenar force-pushed the ensemble/2222/aggregator/unaivalable_stake_distribution_in_epoch_service branch 2 times, most recently from 85f82e7 to 8c6a042 Compare February 12, 2025 14:59
It used the current epoch but when we use this data we corellate them to
the current signers which are retrieved using the `signer_retrieval_epoch`
(-1).
In order to have consistent data we now use the `signer_retrieval_epoch`
to retrieve total spo/stake.
…e instead of failing

When there's no stake distribution available.

This address the specific case of the first aggregator start when
there's no stake distribution available for the first two epochs.
…for the first time

Epoch settings for the first "working" epoch were not available at start
since they use the 'signer retrieval offset' of -1 and we handled
discrepencies only starting the current epoch.

Also `handle_discrepancies_at_startup` is modified to register data over
four epochs instead of 3 to take in account the case when between its
call and epoch service 'inform epoch' there's an epoch change.
@Alenar Alenar force-pushed the ensemble/2222/aggregator/unaivalable_stake_distribution_in_epoch_service branch from 8c6a042 to 9ba9d85 Compare February 12, 2025 16:48
Copy link
Collaborator

@sfauvel sfauvel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Alenar Alenar force-pushed the ensemble/2222/aggregator/unaivalable_stake_distribution_in_epoch_service branch from 9ba9d85 to 2321d95 Compare February 12, 2025 17:25
@Alenar Alenar marked this pull request as ready for review February 12, 2025 17:25
Copy link
Member

@jpraynaud jpraynaud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🔥

* mithril-aggregator from `0.7.2` to `0.7.3`
* [js] mithril-explorer from `0.7.28` to `0.7.29`
@Alenar Alenar temporarily deployed to testing-preview February 12, 2025 17:46 — with GitHub Actions Inactive
@Alenar Alenar temporarily deployed to testing-sanchonet February 12, 2025 17:46 — with GitHub Actions Inactive
@Alenar Alenar temporarily deployed to testing-preview February 12, 2025 18:01 — with GitHub Actions Inactive
@Alenar Alenar temporarily deployed to testing-sanchonet February 12, 2025 18:01 — with GitHub Actions Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants