-
Notifications
You must be signed in to change notification settings - Fork 973
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new time sliced overlay survey #4275
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checkpointing here: I'm still reviewing data collection/relaying portion. Thanks for the changes! Appreciate how clean and meticulous the code is :) Most of my suggestions are minor; I did leave a few questions around backward compatibility when connecting to old versions of core as well as the future of old-style survey.
src/main/Application.h
Outdated
@@ -221,6 +222,7 @@ class Application | |||
virtual WorkScheduler& getWorkScheduler() = 0; | |||
virtual BanManager& getBanManager() = 0; | |||
virtual StatusManager& getStatusManager() = 0; | |||
virtual SurveyDataManager& getSurveyDataManager() = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to bump OVERLAY_PROTOCOL_VERSION to indicate a version of core that supports new-style survey.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in a108620
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: if this change lands in v21.1.0, we will need to bump overlay version again from 33 to 34, as we expect to bump it to 33 in v21.0.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a reminder to bump the version to 34.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bumped in 0859d27
I just squashed and rebased this on top of |
src/main/Application.h
Outdated
@@ -221,6 +222,7 @@ class Application | |||
virtual WorkScheduler& getWorkScheduler() = 0; | |||
virtual BanManager& getBanManager() = 0; | |||
virtual StatusManager& getStatusManager() = 0; | |||
virtual SurveyDataManager& getSurveyDataManager() = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: if this change lands in v21.1.0, we will need to bump overlay version again from 33 to 34, as we expect to bump it to 33 in v21.0.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the updates! I spent a bit more time thinking through non-happy-path scenarios, and left a few suggestions there. I'd say we should merge this PR as-is, then just do a follow-up with the last few tweaks (to make PR management easier). Looks like there are some merge conflicts, so let's just address those and squash commits. Thanks again!
src/main/Application.h
Outdated
@@ -221,6 +222,7 @@ class Application | |||
virtual WorkScheduler& getWorkScheduler() = 0; | |||
virtual BanManager& getBanManager() = 0; | |||
virtual StatusManager& getStatusManager() = 0; | |||
virtual SurveyDataManager& getSurveyDataManager() = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a reminder to bump the version to 34.
SurveyPhase phase(); | ||
|
||
// Reset survey data. Intended to be called when survey data expires. | ||
void reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is expiration only detected when we call phase
? If so, I think that means we don't know when survey data is going to be reset exactly. I think this is fine now since we call it every time we call modifyNodeData
, but that seems like a footgun in case implementation of modifyNodeData
changes. To be safe, can we check expiration and reset survey on a periodic basis? Simplest i think is to add the check to OverlayManager::tick, which is called every few seconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that's a good point about this being easy to screw up if/when we modify this in the future. I got rid of the phase()
function and replaced it with a member variable mPhase
that the various start
/stop
functions update. I also added a function updateSurveyPhase
that tick
calls that serves to reset the survey on expiration.
src/overlay/SurveyDataManager.cpp
Outdated
bool | ||
SurveyDataManager::nonceIsReporting(uint32_t nonce) | ||
{ | ||
return phase() == SurveyPhase::REPORTING && mNonce == nonce; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these two conditions be swapped? This way we won't reset the current survey if another surveyor erroneously sends a survey request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(actually, this probably suggests that phase
shouldn't be modifying state, and is related to my question around periodically checking state)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the removal of the phase()
function I don't think order matters here anymore
src/overlay/SurveyDataManager.cpp
Outdated
{ | ||
ZoneScoped; | ||
|
||
if (phase() == SurveyPhase::REPORTING && mNonce == request.nonce && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly here, should we check nonce first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also no longer matters with the removal of the phase()
function
@@ -16,6 +18,74 @@ namespace stellar | |||
|
|||
uint32_t const SurveyManager::SURVEY_THROTTLE_TIMEOUT_MULT(3); | |||
|
|||
uint32_t constexpr TIME_SLICED_SURVEY_MIN_OVERLAY_PROTOCOL_VERSION = 33; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reminder: this should be 34.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 0859d27
src/overlay/SurveyDataManager.cpp
Outdated
// Collecting phase is limited to 2 hours. If 2 hours pass without receiving | ||
// a StopSurveyCollecting message the `SurveyDataManager` will reset all data | ||
// and transition to the `INACTIVE` phase. | ||
constexpr std::chrono::hours COLLECTING_PHASE_MAX_DURATION{2}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there value in using this timeout to mark the end of collecting phase, and still allow surveyors to query results? Wondering if this should set to something we'd expect to use in reality, like 30 minutes? This may help maximize chances of more nodes responding to survey, as I imagine there'll be situations where messages are dropped due to our filtering policies (we'd still use the stop message to end survey earlier when needed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. I shortened the max collecting phase duration to 30 minutes and changed the code to automatically transition to the reporting phase if the collecting phase expires in 160d1f8
I've rebased and squashed this down to a single commit in preparation for merging! |
r+ 075df81 |
Description
Resolves #4169, #4164. Partially resolves #2592 (script update remains).
This pull request adds a new survey mode that aims to be more deterministic and easier to compare across runs. It also adds some additional fields to the survey schema. It is very faithful to the design outlined in the Unified Network Survey Work Proposal.
While this PR includes all of the new functionality,
it is still missing a few things I'd like to do before it's completely done:metrics.md
SurveyManager
next
) stellar-xdr#186 and update XDR submodules to point to those instead of my own branch.Checklist
clang-format
v8.0.0 (viamake format
or the Visual Studio extension)