-
Notifications
You must be signed in to change notification settings - Fork 302
registry: remove units from etcd registry upon DestroyUnit() #1510
base: master
Are you sure you want to change the base?
Conversation
@@ -316,9 +365,81 @@ func (r *EtcdRegistry) DestroyUnit(name string) error { | |||
} | |||
|
|||
// TODO(jonboulle): add unit reference counting and actually destroying Units | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonboulle could you please remember what was the use case that made you add this comment ? "unit reference counting" is there some optimization involved or something else ? thanks :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that the delete operation needs to be atomic (essentially a
compare and delete, "delete this unit file IFF no other units reference
it"), but that information is not stored in one place in etcd at the
moment. That's where a reference count would come in.
I haven't looked at the PR code yet but - how does it deal with this
scenario of a race between destroying a unit and creating a new unit with
an identical unit file hash?
On Fri, Mar 18, 2016 at 2:00 PM Djalal Harouni [email protected]
wrote:
In registry/job.go
#1510 (comment):@@ -316,9 +365,81 @@ func (r *EtcdRegistry) DestroyUnit(name string) error {
}// TODO(jonboulle): add unit reference counting and actually destroying Units
@jonboulle https://github.com/jonboulle could you please remember what
was the use case that made you add this comment ? "unit reference counting"
is there some optimization involved or something else ? thanks :-)—
You are receiving this because you were mentioned.Reply to this email directly or view it on GitHub
https://github.com/coreos/fleet/pull/1510/files/0be1141333baa76d27b0560f96614a371350a663#r56651729
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that the delete operation needs to be atomic (essentially a compare and delete, "delete this unit file IFF no other units reference it"), but that information is not stored in one place in etcd at the moment. That's where a reference count would come in.
I haven't looked at the PR code yet but - how does it deal with this scenario of a race between destroying a unit and creating a new unit with an identical unit file hash?
Yeah, I see your point. This PR is basically not about changing the current design of fleet. I don't think I'm capable of doing it. Currently I don't have a better idea about how to do refcounting to prevent such a race between creating and destroying a unit.
But on the other hand, I think that the delete operation needs to be somehow implemented, from the perspective of usability. Thus I had to make a compromise, writing such a lengthy verification function before removing a unit. Of course it's not perfect, but that's probably one of doable things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question here is whether it's acceptable to introduce a race condition potentially leading to hard failure, only to handle something that is merely an annoyance... Doesn't sound like a good tradeoff to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question here is whether it's acceptable to introduce a race condition potentially leading to hard failure, only to handle something that is merely an annoyance... Doesn't sound like a good tradeoff to me.
If anyone could provide a better patch for this issue, I would be happy to close this PR in favor of the new one. Until then, I'd still think this PR is the only one I can come up with. OTOH given the fact that the 'TODO' comment was written 2 years ago, I don't expect anything to suddenly happen in the near future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongsu, fair enough! thanks for the patches.
@dongsupark is a functional test possible for this case ? |
@tixxdz First of all, I don't think it's possible to test such an extreme case that 2 arbitrary units end up having the same sha1 hash. Theoretically it would be possible, but how would it be practically look like...? |
@dongsupark yes of course! yeh I was referencing "verify if the unit was actually removed" parts, thanks! |
Creating units with the same hash is easy. echo blabla > foo.service; cp On Fri, Mar 18, 2016 at 3:14 PM Dongsu Park [email protected]
|
@jonboulle thx! |
110eb5f
to
0ef648c
Compare
I added a new functional test |
628ee52
to
c254991
Compare
So finally I managed to get the new functional test Though the functional test still does not succeed completely. Other tests like |
functional/unit_action_test.go
Outdated
t.Fatalf("Unable to submit fleet unit: %v", err) | ||
} | ||
var stdout string | ||
stdout, _, err = cluster.Fleetctl(m, "list-units", "--no-legend") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't define the variable separately: use stdout, _, err := ...
instead.
5a0c768
to
8948883
Compare
@dongsupark function was restored. that was temporarily behavior. sorry for confusing. |
8948883
to
2193e1d
Compare
3b60e93
to
875d938
Compare
bdc94e8
to
fa5aa3a
Compare
So far each command "fleetctl destroy unit" has removed job entries from the etcd registry, under /_coreos.com/fleet/job. But it has not removed its unit file, under /_coreos.com/fleet/unit. As a result, fleet left lots of garbages in the etcd registry, so users had to manually clean them up. So this patch gets unit contents deleted actually from etcd registry when DestroyUnit() gets called. To avoid potential hash collisions, it first fetches a list of units from registry, to check there's any duplicated entry. Only if no duplicated unit is found, fleetd actually deletes the unit from registry. Fixes: coreos#1456 Fixes: coreos#1290 Reference: coreos#1291
Introduce new helpers for running etcdctl for functional tests. Also export a method Cluster.Keyspace() to get it called by functional tests.
A new test TestUnitDestroyFromRegistry() checks for a submitted unit being actually deleted from the etcd registry. To compare the old unit body with the one registered in the etcd registry, we need to go through several steps for queries to etcd.
9621433
to
60c9c80
Compare
20a3e96
to
3aaa1ab
Compare
eb6872f
to
365565e
Compare
39a99ba
to
44591b0
Compare
0132632
to
6974811
Compare
So far each command
"fleetctl destroy unit"
has removed job entries fromthe etcd registry, under
/_coreos.com/fleet/job
. But it has not removedits unit file, under
/_coreos.com/fleet/unit
. As a result, fleet leftlots of garbages in the etcd registry, so users had to manually clean
them up.
So this patch gets unit contents deleted actually from etcd registry
when
DestroyUnit()
gets called. To avoid potential hash collisions,it first fetches a list of units from registry, to check there's any
duplicated entry. Only if no duplicated unit is found, fleetd actually
deletes the unit from registry.
Fixes: #1456
Fixes: #1290
Reference: #1291
/cc @kayrus @wuqixuan