Bugfix/zenko 4912 #2160
Hello benzekrimaha,
My role is to assist you with the merge of this
Status report is not available.
Waiting for approval
The following approvals are needed before I can proceed with the merge:
5fd6b9f to c3c33ac
74b980d to 5bbb61a
6b3899b to 9592b40
f60bcbf to 9592b40
@@ -1,4 +1,4 @@
-VERSION="2.10.1"
+VERSION="2.10.2"

Reminder to rebase: this should be the last commit of the PR (where we will make the release).

dcac175 to 79f84aa
aa04696 to 700ccd0
monitoring/mongodb/alerts.yaml (outdated)
@@ -183,10 +183,10 @@ groups:

       - alert: MongoDbRSNotSynced
         expr: |
-          sum by (rs_nm) (mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", member_state="SECONDARY"}) != (${replicas} - 1)
+          group by(rs_nm)(count by (rs_nm, pod)(mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", member_state="SECONDARY"}) != (${replicas} - 1) )

Nit: indent to improve readability...
-          group by(rs_nm)(count by (rs_nm, pod)(mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", member_state="SECONDARY"}) != (${replicas} - 1) )
+          group by(rs_nm) (
+            count by(rs_nm, pod) (mongodb_rs_members_state{namespace="${namespace}", pod=~"${service}.*", member_state="SECONDARY"})
+            != (${replicas} - 1)
+          )

monitoring/mongodb/alerts.test.yaml (outdated)
      - series: mongodb_rs_members_state{namespace="zenko", pod="data-db-mongodb-sharded-shard0-data-2", member_state="SECONDARY", rs_nm="data-db-mongodb-sharded-shard-0", member_idx="data-db-mongodb-sharded-shard0-data-1.data-db-mongodb-sharded-headless.zenko.svc.cluster.local:27017"}
        values: 2x10
      - series: mongodb_rs_members_state{namespace="zenko", pod="data-db-mongodb-sharded-shard0-data-2", member_state="(not reachable/healthy)", rs_nm="data-db-mongodb-sharded-shard-0", member_idx="data-db-mongodb-sharded-shard0-data-2.data-db-mongodb-sharded-headless.zenko.svc.cluster.local:27017"}
        values: 2 _ _ _ _ _ _ _ _ _

When this series disappears, the other series will actually change as well, since the remaining pods will no longer report on its status... and the reported state looks strange :/
- On "startup", we should have series where everything is running fine (3 pods, each reporting the status of 3 members)
- Then one pod "crashes": the 3 associated series stop
- Then the other pods update their "vision" of the cluster: they either report a different member state (not reachable?) or drop the member completely (not sure exactly what happens)

From what I saw, the state is (not reachable/healthy); the test was updated accordingly.

Every metric from pod="data-db-mongodb-sharded-shard0-data-2" should be removed when we simulate the pod crashing...
So I guess the time series should look something like this:
# First pod
- series: member_state{pod="data-0", state="PRIMARY", member_idx="data-0"}
  values: 1x10 1x10
- series: member_state{pod="data-0", state="SECONDARY", member_idx="data-1"}
  values: 2x10 2x10
- series: member_state{pod="data-0", state="SECONDARY", member_idx="data-2"}
  values: 2x10 stale # `stale` (i.e. not present) after the 10s sample
- series: member_state{pod="data-0", state="(not reachable/healthy)", member_idx="data-2"}
  values: _x10 8x10 # appears from the 11s sample
# Second pod
- series: member_state{pod="data-1", state="PRIMARY", member_idx="data-0"}
  values: 1x10 1x10
- series: member_state{pod="data-1", state="SECONDARY", member_idx="data-1"}
  values: 2x10 2x10
- series: member_state{pod="data-1", state="SECONDARY", member_idx="data-2"}
  values: 2x10 stale # `stale` (i.e. not present) after the 10s sample
- series: member_state{pod="data-1", state="(not reachable/healthy)", member_idx="data-2"}
  values: _x10 8x10 # appears from the 11s sample
# Third pod: stops responding after the 10th sample
- series: member_state{pod="data-2", state="PRIMARY", member_idx="data-0"}
  values: 1x10 stale
- series: member_state{pod="data-2", state="SECONDARY", member_idx="data-1"}
  values: 2x10 stale
- series: member_state{pod="data-2", state="SECONDARY", member_idx="data-2"}
  values: 2x10 stale

Providing this:
series: member_state{pod="data-0", state="SECONDARY", member_idx="data-2"}
  values: 2x10 stale # `stale` (i.e. not present) after the 10s sample
will cause the expression to return 2, i.e. the expected number of secondaries, as it does not take the state="(not reachable/healthy)" metric into account.

AFAIU, `stale` removes the metric series, which should be what really happens:
- the series with state label equal to `secondary` (and value `2`) stops
- another series starts with state label equal to `(not reachable/healthy)` and value `8`
(since one label value changes, this is really another series, even if [as humans] we think it is the same one...)
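
The point about label changes can be illustrated with a small Python sketch. It assumes that a series is identified by its complete label set, and it reuses the shortened label names from the example series above; the helper names (`series_id`, `count_secondaries`) and the snapshot dictionaries are hypothetical and are not part of the PR.

```python
def series_id(**labels):
    """Model a series identity as its complete set of label pairs."""
    return frozenset(labels.items())


before = series_id(pod="data-0", state="SECONDARY", member_idx="data-2")
after = series_id(pod="data-0", state="(not reachable/healthy)", member_idx="data-2")

# What pod data-0 reports before data-2 crashes (series identity -> value).
t_before = {
    series_id(pod="data-0", state="PRIMARY", member_idx="data-0"): 1,
    series_id(pod="data-0", state="SECONDARY", member_idx="data-1"): 2,
    before: 2,
}

# After the crash: the SECONDARY series for data-2 has gone stale and a
# different series (different state label) has appeared with value 8.
t_after = {
    series_id(pod="data-0", state="PRIMARY", member_idx="data-0"): 1,
    series_id(pod="data-0", state="SECONDARY", member_idx="data-1"): 2,
    after: 8,
}


def count_secondaries(snapshot):
    """Count the series whose state label is SECONDARY."""
    return sum(1 for labels in snapshot if ("state", "SECONDARY") in labels)


print(before != after)              # True: one label differs, so two series
print(count_secondaries(t_before))  # 2
print(count_secondaries(t_after))   # 1
```

With three replicas the expected count of secondaries is 2, so in this sketch the count seen by `data-0` matches before the crash and differs after it, which is the situation the alert test is meant to cover.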
6098f81 to 36a0a30

LGTM, but the alert tests fail...

66edb90 to d35ba07

LGTM, minor test improvements: please rebase/cleanup before approving.

572ad76 to deff4f0
MongoDbRSNotSynced firing when it shouldn't

Today we sum up the members' state values, so when calculating the number of secondaries we end up with a higher value than the expected one. To get the right value, an additional filter based on the instance has been introduced as well.

Issue: ZENKO-4912
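
To make the arithmetic behind this commit message concrete, here is a minimal Python sketch that mimics the old and new alert expressions on a healthy three-member replica set. It is not the project's code: it assumes, as in the review thread, that every pod exports one series per member and that a SECONDARY series carries the value 2. The pod names, the `shard-0` replica-set name, and the helpers `old_expr` and `new_expr` are hypothetical.

```python
from collections import defaultdict

REPLICAS = 3
PODS = ["data-0", "data-1", "data-2"]  # hypothetical pod names

# Healthy replica set: every pod reports data-0 as PRIMARY (value 1)
# and data-1, data-2 as SECONDARY (value 2).
samples = []
for pod in PODS:
    for member in PODS:
        state, value = ("PRIMARY", 1) if member == "data-0" else ("SECONDARY", 2)
        samples.append({"rs_nm": "shard-0", "pod": pod, "member_idx": member,
                        "member_state": state, "value": value})

secondaries = [s for s in samples if s["member_state"] == "SECONDARY"]


def old_expr(series):
    """Mimics: sum by (rs_nm) (...) != (replicas - 1)."""
    totals = defaultdict(int)
    for s in series:
        totals[s["rs_nm"]] += s["value"]
    return {rs: v for rs, v in totals.items() if v != REPLICAS - 1}


def new_expr(series):
    """Mimics: group by (rs_nm) (count by (rs_nm, pod) (...) != (replicas - 1))."""
    counts = defaultdict(int)
    for s in series:
        counts[(s["rs_nm"], s["pod"])] += 1
    return {rs for (rs, _pod), n in counts.items() if n != REPLICAS - 1}


print(old_expr(secondaries))  # {'shard-0': 12}
print(new_expr(secondaries))  # set()
```

Under these assumptions the old expression adds up six sample values of 2 and compares 12 against 2, so it returns a result even though the replica set is healthy, while the per-pod count returns nothing.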
deff4f0 to 80ff4c7
/approve
In the queue
The changeset has received all authorizations and has been added to the
The changeset will be merged in:
The following branches will NOT be impacted:
There is no action required on your side. You will be notified here once
IMPORTANT
Please do not attempt to modify this pull request.
If you need this pull request to be removed from the queue, please contact a
The following options are set: approve
I have successfully merged the changeset of this pull request
The following branches have NOT changed:
Please check the status of the associated issue ZENKO-4912.
Goodbye benzekrimaha.
Issue: ZENKO-4912