Add explicit Stopping state (2/3) #1570

Open
wants to merge 5 commits into
base: mkeeter/simplify-faults

mkeeter commented Nov 21, 2024

Staged on top of #1568

There's an unfortunate ambiguity in certain states (e.g. DsState::Faulted), which represent two different things:

  • We've stopped the IO task due to a fault, and are waiting for it to restart
  • The IO task has restarted, and we're doing negotiation from a faulted state (i.e. will do live-repair)

This PR adds a new DsState::Stopping(ClientStopReason) state, which represents the former. The previous states (DsState::Faulted) now only mean that we're doing negotiation.

The new state subsumes DsState::Deactivated, DsState::Replacing, DsState::Disabled, which were specialized states that waited for the IO task to exit. Each of those states is now DsState::Stopping(..) with an appropriate ClientStopReason.
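
As a rough illustration of the change described above, here is a minimal sketch of how the specialized states could fold into a single `Stopping` variant. The enums are simplified stand-ins, not the real definitions: `OldDsState` and `to_new_state` are hypothetical names, and the `Replacing` and `Disabled` stop reasons are assumed (only `Deactivated` and `NegotiationFailed` are named in this PR's diff and discussion).

```rust
/// Why a client's IO task was asked to stop.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ClientStopReason {
    Deactivated,       // named in this PR's diff
    NegotiationFailed, // named in the review discussion
    Replacing,         // assumed name for this sketch
    Disabled,          // assumed name for this sketch
}

/// Simplified stand-in for the pre-PR states (only a subset is shown).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OldDsState {
    Active,
    Faulted,
    Deactivated,
    Replacing,
    Disabled,
}

/// Simplified stand-in for the post-PR states.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DsState {
    Active,
    /// After this PR: negotiating from a faulted state only.
    Faulted,
    /// Waiting for the IO task to exit, for the given reason.
    Stopping(ClientStopReason),
}

/// Hypothetical helper: each specialized "waiting for the IO task to exit"
/// state becomes `Stopping(..)` with a matching reason.
#[allow(dead_code)]
fn to_new_state(old: OldDsState) -> DsState {
    match old {
        OldDsState::Active => DsState::Active,
        OldDsState::Faulted => DsState::Faulted,
        OldDsState::Deactivated => DsState::Stopping(ClientStopReason::Deactivated),
        OldDsState::Replacing => DsState::Stopping(ClientStopReason::Replacing),
        OldDsState::Disabled => DsState::Stopping(ClientStopReason::Disabled),
    }
}
```

The point of the sketch is that the "stopping" meaning now lives in one variant, so a state such as `Faulted` no longer has to carry two meanings.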

The vast majority of this PR is automatic OpenAPI changes. I don't think anyone is relying on the specific shape of DsState (which is only used in UpstairsInfo / the info endpoint), but please let me know if I'm wrong!

mkeeter force-pushed the mkeeter/explicit-stop-state branch from e31a3ad to 0282d48 on November 22, 2024 14:57
mkeeter force-pushed the mkeeter/explicit-stop-state branch 2 times, most recently from 3366d4b to 9973b28, on November 25, 2024 20:32
mkeeter force-pushed the mkeeter/simplify-faults branch from 8f67994 to 3d11ac2 on November 25, 2024 20:32
mkeeter force-pushed the mkeeter/simplify-faults branch from 3d11ac2 to de39d6e on December 2, 2024 14:44
mkeeter force-pushed the mkeeter/explicit-stop-state branch from 9973b28 to 16a680c on December 2, 2024 14:44
mkeeter changed the title from "Add explicit Stopping state" to "Add explicit Stopping state (2/3)" on Dec 2, 2024
mkeeter force-pushed the mkeeter/simplify-faults branch from de39d6e to 26823a9 on December 3, 2024 15:55
mkeeter force-pushed the mkeeter/explicit-stop-state branch 2 times, most recently from 18d86f5 to 26061a7, on December 9, 2024 14:51
mkeeter force-pushed the mkeeter/explicit-stop-state branch from 26061a7 to 05e8ba6 on December 9, 2024 15:09

leftwo commented Dec 9, 2024

> The vast majority of this PR is automatic OpenAPI changes. I don't think anyone is relying on the specific shape of DsState (which is only used in UpstairsInfo / the info endpoint), but please let me know if I'm wrong!

The tools/test_fail_live_repair.sh test uses /info (poorly).

Here is a diff to fix it. I can also just push this to your branch directly (I think).

diff --git a/tools/test_fail_live_repair.sh b/tools/test_fail_live_repair.sh
index 333f95a..0a71a87 100755
--- a/tools/test_fail_live_repair.sh
+++ b/tools/test_fail_live_repair.sh
@@ -53,6 +53,12 @@ for bin in $cds $crucible_test $dsc; do
     fi
 done
 
+# The jq program is required for processing the /info endpoint
+if ! jq --version > /dev/null; then
+    echo "Can't find jq program, required for this test"
+    exit 1
+fi
+
 # Verify there is not a downstairs already running.
 if pgrep -fl -U "$(id -u)" "$cds"; then
     echo "Downstairs already running" >&2
@@ -160,13 +166,7 @@ while [[ $count -le $loops ]]; do
     choice_state="undefined"
     while [[ "$choice_state" != "faulted" ]]; do
         sleep 3
-        if [[ $choice -eq 0 ]]; then
-            choice_state=$(curl -s http://127.0.0.1:7890/info | awk -F\" '{print $8}')
-        elif [[ $choice -eq 1 ]]; then
-            choice_state=$(curl -s http://127.0.0.1:7890/info | awk -F\" '{print $10}')
-        else
-            choice_state=$(curl -s http://127.0.0.1:7890/info | awk -F\" '{print $12}')
-        fi
+        choice_state=$(curl -s http://127.0.0.1:7890/info | jq -r .ds_state[$choice].type)
     done
 
     if [[ $pstop -eq 0 ]]; then
@@ -180,13 +180,7 @@ while [[ $count -le $loops ]]; do
     choice_state="undefined"
     while [[ "$choice_state" != "live_repair" ]]; do
         sleep 2
-        if [[ $choice -eq 0 ]]; then
-            choice_state=$(curl -s http://127.0.0.1:7890/info | awk -F\" '{print $8}')
-        elif [[ $choice -eq 1 ]]; then
-            choice_state=$(curl -s http://127.0.0.1:7890/info | awk -F\" '{print $10}')
-        else
-            choice_state=$(curl -s http://127.0.0.1:7890/info | awk -F\" '{print $12}')
-        fi
+        choice_state=$(curl -s http://127.0.0.1:7890/info | jq -r .ds_state[$choice].type)
     done
 
     # Give the live repair between 5 and 10 seconds to start repairing.
@@ -204,13 +198,7 @@ while [[ $count -le $loops ]]; do
     choice_state="undefined"
     while [[ "$choice_state" != "faulted" ]]; do
         sleep 3
-        if [[ $choice -eq 0 ]]; then
-            choice_state=$(curl -s http://127.0.0.1:7890/info | awk -F\" '{print $8}')
-        elif [[ $choice -eq 1 ]]; then
-            choice_state=$(curl -s http://127.0.0.1:7890/info | awk -F\" '{print $10}')
-        else
-            choice_state=$(curl -s http://127.0.0.1:7890/info | awk -F\" '{print $12}')
-        fi
+        choice_state=$(curl -s http://127.0.0.1:7890/info | jq -r .ds_state[$choice].type)
     done
 
     sleep 2
@@ -223,10 +211,11 @@ while [[ $count -le $loops ]]; do
 
     # Now wait for all downstairs to be active
     echo Now wait for all downstairs to be active | tee -a "$test_log"
-    all_state=$(curl -s http://127.0.0.1:7890/info | awk -F\" '{print $8","$10","$12}')
-    while [[ "${all_state}" != "active,active,active" ]]; do
+    all_state=$(curl -s http://127.0.0.1:7890/info | jq -r .ds_state[].type | tr '\n' ',')
+    # The trailing comma here is required
+    while [[ "${all_state}" != "active,active,active," ]]; do
         sleep 5
-        all_state=$(curl -s http://127.0.0.1:7890/info | awk -F\" '{print $8","$10","$12}')
+        all_state=$(curl -s http://127.0.0.1:7890/info | jq -r .ds_state[].type | tr '\n' ',')
     done
 
     echo All downstairs active, now stop IO test and wait for it to finish | tee -a "$test_log"
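
A note on the "trailing comma is required" comment in the diff: `jq -r` prints one state per line, and `tr '\n' ','` turns every newline into a comma, including the final one. The sketch below mimics that in Rust purely for illustration; `newlines_to_commas` and `join_states` are hypothetical helpers and not part of the test script.

```rust
/// Mimics `tr '\n' ','`: every newline becomes a comma, including the
/// newline that terminates the last line of output.
#[allow(dead_code)]
fn newlines_to_commas(s: &str) -> String {
    s.replace('\n', ",")
}

/// Alternative for comparison: joining the individual states puts a
/// separator only between items, so no trailing comma appears.
#[allow(dead_code)]
fn join_states(states: &[&str]) -> String {
    states.join(",")
}
```

With three active downstairs, the first helper yields `active,active,active,` (which is what the loop in the diff compares against), while the second yields `active,active,active`.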

Comment on lines 927 to 929
// XXX there are some `Stopping` cases which should never happen,
// should we panic on them (like we do for negotiation states
// below)?
Contributor

I'd say yes but I think that should be addressed in the next PR in the series (?)

Contributor Author

I tried both ways, but because it's just affecting DsState::Stopping(..), it seems reasonable to do here.

DsState::Faulted |
// It's also possible for a Downstairs to be in the process of
// stopping, due a fault or disconnection
DsState::Stopping(..) // XXX should we be more specific here?
Contributor

maybe?

also, slight nit: I think we should do away with this assert, move the comments to the match on client state below, and panic on the catch-all case. We're missing both Faulted and Stopping(..) enum variants in the below match, and I'd rather be explicit there that we don't have to do anything in those cases.
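
A minimal sketch of the shape suggested in the comment above, under heavy assumptions: the enums are simplified stand-ins, and `Action` and `handle_state` are hypothetical names (the real function under review is not shown in this thread). Instead of asserting up front, each state gets an explicit arm, `Faulted` and `Stopping(..)` are spelled out as no-ops, and the catch-all panics.

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ClientStopReason {
    Deactivated,
    NegotiationFailed,
}

/// Simplified stand-in for the real state enum.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DsState {
    Active,
    LiveRepair,
    LiveRepairReady,
    Faulted,
    Stopping(ClientStopReason),
}

/// Hypothetical result type for this sketch.
#[allow(dead_code)]
#[derive(Debug, PartialEq, Eq)]
enum Action {
    DoWork,
    Nothing,
}

#[allow(dead_code)]
fn handle_state(state: DsState) -> Action {
    match state {
        // Assumed: these states require some work in the real code.
        DsState::Active | DsState::LiveRepair => Action::DoWork,

        // The Downstairs is negotiating from a faulted state, or is in the
        // process of stopping due to a fault or disconnection. Nothing to
        // do, and saying so explicitly documents that this is intentional.
        DsState::Faulted | DsState::Stopping(..) => Action::Nothing,

        // Anything else is an invariant violation.
        other => panic!("unexpected state {other:?}"),
    }
}
```

The explicit no-op arm replaces the separate assert, so the list of tolerated states and the list of handled states cannot drift apart.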

mkeeter commented Dec 10, 2024

@leftwo I applied your patch, thanks!

leftwo left a comment

Some simple questions for you

-    | DsState::LiveRepairReady => EnqueueResult::Skip,
+    | DsState::LiveRepairReady
+    | DsState::Stopping(..) => EnqueueResult::Skip,
+    // XXX there are some `Stopping` cases which should never happen,
Contributor

I vote yes for panic. If we really think they should never happen, and we silently allow them through, what would that mean? Could we allow an IO to go out that should not?

Contributor Author

The panics here are last-chance invariant maintenance. I've split up DsState::Stopping into multiple different substates in 3eb259f; opinions are welcome as to which should panic versus skip the IOs (the only obvious one is Deactivated, which should not send further IO).
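
To make the panic-versus-skip question concrete, here is a minimal sketch of an enqueue decision split by stop reason. It is not the code from 3eb259f: the enums are simplified stand-ins, `enqueue_decision` is a hypothetical name, `EnqueueResult::Send` and the `Fault` and `Disconnected` reasons are assumed, and treating `Deactivated` as a panic is only one reading of the remark above.

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ClientStopReason {
    Deactivated,
    Fault,        // assumed name for this sketch
    Disconnected, // assumed name for this sketch
}

/// Simplified stand-in for the real state enum.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DsState {
    Active,
    LiveRepairReady,
    Stopping(ClientStopReason),
}

#[allow(dead_code)]
#[derive(Debug, PartialEq, Eq)]
enum EnqueueResult {
    Send, // assumed variant
    Skip,
}

#[allow(dead_code)]
fn enqueue_decision(state: DsState) -> EnqueueResult {
    match state {
        DsState::Active => EnqueueResult::Send,
        DsState::LiveRepairReady => EnqueueResult::Skip,

        // No further IO should be sent once deactivated, so reaching this
        // point is treated as an invariant violation in this sketch.
        DsState::Stopping(ClientStopReason::Deactivated) => {
            panic!("enqueue called while stopping due to deactivation")
        }

        // Stopping for other reasons: the job is skipped on this client.
        DsState::Stopping(_) => EnqueueResult::Skip,
    }
}
```

Matching on the stop reason keeps the "should never happen" cases loud while still letting ordinary fault and disconnect paths skip IO quietly.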

-    DsState::Deactivated => "DAV".to_string(),
-    DsState::Disabled => "DIS".to_string(),
-    DsState::Replacing => "RPC".to_string(),
+    DsState::Stopping(ClientStopReason::Deactivated) => "DAV".to_string(),
Contributor

I just noticed that this DsState change affects how the output of the DTrace scripts (in tools/dtrace) looks when we print the generic state. When the final PR in this series lands, we will have to update the DTrace scripts so that the columns align.

DsState::Faulted |
// It's also possible for a Downstairs to be in the process of
// stopping, due a fault or disconnection
DsState::Stopping(..) // XXX should we be more specific here?
Contributor

Would ClientStopReason::NegotiationFailed be valid here? That would be the only one I might not expect to see.

leftwo commented Dec 10, 2024

Looking at 3/3 in this series, some of the questions/comments here might be obsolete with what happens in 3/3.
If so, just note that and ignore the question/comment.

leftwo left a comment

I think I just have one question left.
