RA: Filesystem: bind mount point unmounting postcondition check failure (for being member of 2+ mounts in misordered chaining scenario) has the power to unintendedly(!) kill all processes #1304

Open
jnpkrn opened this issue Mar 14, 2019 · 12 comments

jnpkrn commented Mar 14, 2019

Follow this excerpt from the Filesystem agent (fs_stop()):

fs_stop() {
        local SUB=$1 timeout=$2 sig cnt
        for sig in TERM KILL; do
                cnt=$((timeout/2)) # try half time with TERM
                while [ $cnt -gt 0 ]; do
                        try_umount $SUB &&
                                return $OCF_SUCCESS
                        ocf_exit_reason "Couldn't unmount $SUB; trying cleanup with $sig"
                        signal_processes $SUB $sig
                        cnt=$((cnt-1))
                        sleep 1
                done
        done
        return $OCF_ERR_GENERIC
}

In the case of a bind mount, and in the rare circumstance that
try_umount $SUB fails (e.g. because the source of the bind mount
was itself a mount point that hung for some reason), the resource
agent will proceed to kill every process touching the same
filesystem as the target directory of this bind mount (as detected
by fuser -m). At that point, however, that boils down to any
process touching the root mount point /, which is practically the
whole world (if for nothing else, every non-chrooted process under
Linux touches that very filesystem because it holds a reference to
it as its "root").

That's very bad: it is a misbehaviour, and only the cluster
resource manager should decide what to do about the failed stop
action (if that is indeed the expected outcome in this very case).

Case in point (thanks @computer045 for the report):
ClusterLabs/pcs#197

jnpkrn commented Mar 14, 2019

To add, I cannot rule out that this also affects non-bind mounts;
bind mounts are just the context it was observed with.


Simple yet possibly non-portable or otherwise problematic solutions:

  1. pass -M to fuser as well

  2. use stat output to compare the "device identifiers" of the
    given alleged mount point and of / -- and do not proceed if
    they match (does anybody know of a counterexample where
    proceeding would actually be desirable? I was thinking about
    namespaces ~ chrooting, but then, the majority of the agents
    is likely flawed in some aspect anyway, so I won't follow
    this rabbit hole); see the sketch right below
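
A minimal sketch of option 2 -- purely illustrative, not the
agent's actual code; it assumes the target directory is available
in $MOUNTPOINT and the usual OCF shell helpers are sourced:

# refuse the killing spree when the alleged mount point in fact
# resolves to the same filesystem as / (compare st_dev)
if [ "$(stat -c '%d' "$MOUNTPOINT" 2>/dev/null)" = "$(stat -c '%d' /)" ]; then
        ocf_log err "$MOUNTPOINT resolves to the root filesystem; refusing to signal its users"
        return $OCF_ERR_GENERIC
fi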


Also, for better auditability, the

sending signal TERM to: root      3035     1  0 Mar07

log lines (unwieldy enough as the above looks, at least without
squeezing the whitespace) should also carry the reason, e.g.,

sending signal TERM for holding /net/home filesystem handle to ...

It's really a dangerous thing for a privileged process to decide
to go on a killing spree against other processes based on a single
but arbitrarily wide criterion; extreme care plus good auditing
are desirable.

@oalbrigt

Have you tried with force_unmount=safe? You could also set it to false.

jnpkrn commented Mar 15, 2019

This may help, however the degenerate case of burning everything
down should rather be prevented in any case, since causality
precludes it would ever be the right thing to do. The only case
that pops out is the Filesystem resource running prior to the
cluster stack starting, with the stack (undesirably) started from
such a mount point, and even that case could be detected thanks to
the parent-process chain going from the agent's process up to the
cluster resource manager (hitting pacemaker-execd, then PID 1 in
case of pacemaker 2+, and none of these should carry handles to
these mounts directly in sane configurations).
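
A hedged sketch of that parent-chain inspection; the process name
and the exact placement are assumptions, nothing like this exists
in the agent today:

# walk the parent chain of the agent's process up to PID 1; per the
# above it should pass through the cluster stack (pacemaker-execd)
pid=$$ chain=""
while [ -n "$pid" ] && [ "$pid" -ge 1 ]; do
        chain="$chain $(ps -o comm= -p "$pid")($pid)"
        [ "$pid" -eq 1 ] && break
        pid=$(ps -o ppid= -p "$pid" | tr -d ' ')
done
ocf_log debug "parent chain:$chain"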

jnpkrn commented Mar 16, 2019

EDIT: the root cause is detailed in the next comment:
#1304 (comment)
but this is the next real threat there, one which would likely
have been triggered at some point too, were it not for the root
issue.

Actually, the problem here is that the Filesystem agent doesn't
satisfy the unpronounced re-entrancy requirement discussed recently:

https://lists.clusterlabs.org/pipermail/users/2019-February/025454.html

It should be pointed out that this was a rather naive original
proposition (OTOH, the first grounding claim about a matter of
this sort at all!):

So in practice, I think that's the guideline: an agent should be
re-entrant for distinct sets of parameters, but not necessarily any
one set of parameters.

Knowing the problem at hand (see below), let's refine this constraint
further:

An agent shall take all precautions to be re-entrant, i.e.,
parallel execution of the agent, even with non-identical sets
of user-configured parameters[*] and regardless of the
respective actions being performed, will not cause any adverse
effects. It's the agent's responsibility to figure out
semantics-based interference amongst such executions in that
case, and to apply suitable measures to prevent it.

For instance, when the resource agent at hand is meant to
establish a mount, pinning device/path A to the existing
filesystem path B, its instance shall, upon its start or stop
action (operations hazarding an interference), resolve A over
the possibly transitive chain of bind mounts down to the very
source backing device, use that to derive a lock name, and try
taking the lock, otherwise resorting to
retest-the-situation-and-possibly-proceed-again waiting. This
will prevent possible uncoordinated and destructive interference
whereby the race between checking that the mount is active and
the actual unmounting in one instance of the agent can be
interleaved with the same sequence in a parallel instance,
making the unmount fail for the former, possibly leading to
undesired failover consequences. In the most conservative case,
there can be just a single lock used by whatever instance of
the agent, but this will, undesirably, amplify timeout-based
issues and reduce parallelism needlessly. Finally, the solution
could be offloaded onto configuration tools, but that shatters
the point-of-authority centralized abstraction around the
particular thing to control, which the agents happen to be in
this arrangement. (A sketch of such device-derived locking
follows the footnote below.)

[*] a resource manager complying with OCF X.Y is mandated to
prevent parallel execution of agent instances with identical
sets of user-configured parameters (or ditto applied exclusively
to the unique-group ones)
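
A minimal sketch of the device-derived locking described above;
$DEVICE, LOCK_TIMEOUT and the lock directory are assumptions made
for illustration, not something the agent currently provides:

lock_on_backing_device() {
        # for a bind mount, $DEVICE is a directory; resolve it down to
        # the mount that actually backs it and derive a per-device lock
        local backing lockfile
        backing=$(findmnt -n -o SOURCE --target "$DEVICE") || return $OCF_ERR_GENERIC
        lockfile="${HA_RSCTMP:-/run/resource-agents}/Filesystem-$(echo "$backing" | tr '/' '_').lock"
        # serialize start/stop across all instances touching the same
        # device; on timeout the caller is expected to re-probe the
        # mount state instead of proceeding blindly
        exec 9>"$lockfile" || return $OCF_ERR_GENERIC
        flock -w "${LOCK_TIMEOUT:-20}" 9 || return $OCF_ERR_GENERIC
        return $OCF_SUCCESS    # fd 9 stays held by the agent process
}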


Full circle back...

ClusterLabs/pcs#197 (comment)
is effectively a subtle configuration error (two practically
identical mount managers established, stomping on each other's
feet when executed in parallel), one we cannot reasonably require
the user to spot/realize (reliability also means resilience to
human error), so the agent (here Filesystem) needs to gain this
broken-config-proofing on behalf of the users, with possible
solutions sketched in the example above accompanying the proposal
for extending the OCF standard. The partial solutions suggested
in the 2nd comment are rather a last resort that, in an ideal
case, would never be exercised and hence could be omitted.

Other sorts of agents are eligible for similarly critical review
in this regard.

jnpkrn commented Mar 16, 2019

Actually, the referenced issue is not necessarily of that
"parallel execution" sort, but that remains a generally valid
concern nonetheless, e.g. for a "self-inflicted failure to
start/stop" when two parallel actions go in opposite directions,
etc.

The very root cause of the referenced case is a situation
like this with a time progression:

mount -o bind A B:
device               path A <---bind-mount---- path B

---

mount device A

expected situation:
device <---mount---- path A <---bind-mount---- path B

per /etc/mtab:
device <---mount---- path A <---bind-mount---- path B
+
                 +-------------------mount-------+
                 |                               |
device <---------+   path A <---bind-mount---- path B

---

umount B

expected situation:
device <---mount---- path A                    path B

per /etc/mtab:
device               path A <---bind-mount---- path B

--> i.e. B still mounted ~ failure to stop
    (isolated within the agent, which attempts to recover by
    killing all the processes pertaining to the filesystem of
    (likely) A -- which is not mounted, hence likely the root
    of the tree)

Since, when mount -o bind A B runs ahead of mount device A, the
net result is that umount B (against the expectations of the
resource manager) has the same meaning as umount A (stopping A
would cause a stopped B under the hood, but that's not an
immediate concern, since the killing spree has already begun and
the resource manager will not be notified about it -- it gets
killed as well, unless processes touching the root filesystem
are assuredly never killed, as proposed), and the said
parallelism may only help to circumvent the logic further.
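
For concreteness, a reproduction sketch of that ordering; the
paths and the device are hypothetical, and the resulting mount
table and umount behaviour depend on the kernel and util-linux
in use:

mkdir -p /srv/A /srv/B
mount --bind /srv/A /srv/B       # bind mount established first
mount /dev/sdb1 /srv/A           # the real mount arrives only afterwards
findmnt /srv/A; findmnt /srv/B   # inspect what the chain really looks like
umount /srv/B                    # the agent's "stop" of B
findmnt /srv/A; findmnt /srv/B   # check what is (still) mounted afterwards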

The other part of thinkable configuration sanity enforcement
(which would get useful if either force_unmount=false is set or
the proposed sanity preservation is in place) is that bind mounts
shall only be allowed in resource-manager-managed mount chains
(as in: with enforced ordering, moreover with clear matches
between the paths in the chain, with cardinality enforcements and
such), not in isolation (where their usage would generally be
unrestricted).

Rgmanager as a cluster resource manager was superior in this
regard, allowing one to model such chains concisely and (IIRC)
with exactly such sanity enforcements (implicit validation of
resource stackability).

Pacemaker is more universal but lacks all this implicit guidance
whatsoever, and resource agents themselves are the wrong level of
abstraction to make decisions based on (resource-manager-specific)
introspection. Now, perhaps some arrangement like that of
rgmanager could be partially retrofitted, but without putting the
physical configuration format in line with the way of thinking
about the resources (as done with rgmanager), this won't get
decent adoption, I am afraid.

(Sorry if it's all rather chaotic; I don't have weeks to examine
this to the tiniest detail. It's enough that something went wrong
with an allegedly legitimate configuration and we failed to
prevent the arising disaster with enough sanity guarding.)

jnpkrn commented Mar 16, 2019

The immediate solution here is likely along the lines of
#1304 (comment)
since a destroyed world, including its master, is the worst
outcome possible in any scenario.

Regarding the wider surface, a locking (journaling) like approach
could possibly help here, if it enforced that umounts are silently
skipped when there are other bind-mount-inflicted possessions of
the same underlying device. Perhaps there are also flags to
mount(8) to deal with this, but that would then be a portability
issue.
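
A possible shape of that skip, sketched under the assumption that
$MOUNTPOINT holds the directory being unmounted (illustrative
only, not a drop-in for the agent):

# skip the umount when some other mount point is backed by the same
# filesystem (same st_dev), i.e. a bind mount still possesses it
dev=$(stat -c '%d' "$MOUNTPOINT")
others=$(findmnt -rn -o TARGET | while read -r tgt; do
        [ "$tgt" = "$MOUNTPOINT" ] && continue
        [ "$(stat -c '%d' "$tgt" 2>/dev/null)" = "$dev" ] && printf '%s ' "$tgt"
done)
if [ -n "$others" ]; then
        ocf_log info "filesystem of $MOUNTPOINT still mounted at: $others; skipping umount"
        return $OCF_SUCCESS
fi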

And finally, I think a systemic solution is needed; will OCF help
to enforce law & order on the inter-agent(-instance) scale?
ClusterLabs/OCF-spec#22

@jnpkrn jnpkrn changed the title RA: Filesystem: failing to umount bind mount point will possibly kill everything, effectively committing suicide unintendedly RA: Filesystem: bind mount point unmounting postcondition check failure (for being member of 2+ mounts in misordered chaining scenario) has the power to unintendedly(!) kill all processes Mar 16, 2019

jnpkrn commented Mar 17, 2019

Perhaps it'd be enough if mounting over an existing bind mount
point failed as unsupported, but that could cause a regression in
some extreme corner cases. Discretion preventing the agent from
becoming the world destroyer is likely in order regardless, as
there's hardly a counterexample where that is the intention -- and
if there is, it is solvable with a non-default
force_unmount=strict.

krig commented Mar 18, 2019

Wow, that's a nasty bug.

@oalbrigt

I guess we could do a findmnt -n --target <mount> | grep -q "\[" to check if it's mounted on a bind mount point.

@jnpkrn Did you mean safe when you wrote force_unmount=strict in your last comment? Anything but true or safe is treated as false for force_unmount.
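
For reference, the bracketed form findmnt prints for a
bind-mounted subtree looks roughly like this (output illustrative;
note that a bind mount of a filesystem's root shows no brackets,
so the grep would not catch that case):

$ findmnt -n --target /net/home
/net/home  /dev/sdb1[/export/home]  ext4  rw,relatime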

jnpkrn commented Mar 18, 2019 via email

@oalbrigt

Sounds like this patch might solve this issue: #1362

jnpkrn commented Aug 26, 2019

Vaguely; it would only solve the /-rooted scenario, which is just
a subset of the general case, I think (I also have some comments
on the "how" for said #1362; more discussion there).
