Fix wrong end_time #946

hmpf · 2024-11-15T11:24:31Z

Hopefully closes #935

Also, attempt to clean up/document event types, and event type validation in the API.

src/argus/incident/models.py

elfjes · 2024-11-18T06:33:00Z

src/argus/incident/models.py

+        if self.stateless_event:
+            # Weird, stateless event without stateless end_time, fix
+            self.end_time = None
+            self.save()
+            LOG.warn("Mismatch between self %s end_time and event type: set stateless", self.pk)
+            return True


Seems like a normalization issue? Two ways of indicating that an event is stateless?

Yep, and end_time = None is the more important/older one.

oh well 🤷

There is a question of what should happen if there is a stateless event but end_time is wrong. Here, I trust the stateless event and unset end_time, but we could also delete the event. Though, then we would have to check that there is an incident-start event etc. etc. etc.

elfjes · 2024-11-18T06:34:32Z

src/argus/incident/models.py

+    def event_already_exists(self, event_type):
+        return self.events.filter(type=event_type).exists()
+
+    def repair_end_time(self) -> Optional[bool]:


Make sure that you have prefetched all events here, because otherwise this function may cause a bunch of round trips to the db.

Not sure what you mean. This is on the model instance. Prefetch before a loop over instances?

self.events.prefetch_related(..) would work but the only relevant thing to prefetch would be acknowledgments, which is irrelevant for exists(), and might be handled behind the scenes anyway thanks to the OneToOneField.

Did you mean something in repair_end_time?

Yes, I meant in repair_end_time, should've made that clearer. It has a bunch of checks based on properties that each filter on self.events, so make sure you're not firing a query for every check. Maybe it's prudent to just collect all events and manually process the list. As long as it's not too large, remembering incidents with 1000s of events
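
One way to read the suggestion above is to load the events once and run every check against the in-memory list. The following is only a sketch with plain Python objects standing in for the Django models; the `Event` dataclass, the event type strings, `EventSummary`, and `summarize_events` are all hypothetical and not taken from the PR:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Event:
    # Hypothetical stand-in for the real Event model.
    type: str
    timestamp: datetime


@dataclass(frozen=True)
class EventSummary:
    has_start: bool
    has_stateless: bool
    last_close: Optional[datetime]
    last_reopen: Optional[datetime]


def summarize_events(events: Iterable[Event]) -> EventSummary:
    """Walk the events exactly once and collect what the checks need."""
    has_start = False
    has_stateless = False
    last_close: Optional[datetime] = None
    last_reopen: Optional[datetime] = None
    for event in events:
        if event.type == "START":  # assumed type name
            has_start = True
        elif event.type == "STATELESS":  # assumed type name
            has_stateless = True
        elif event.type in ("END", "CLOSE"):  # assumed closing types
            if last_close is None or event.timestamp > last_close:
                last_close = event.timestamp
        elif event.type == "REOPEN":  # assumed type name
            if last_reopen is None or event.timestamp > last_reopen:
                last_reopen = event.timestamp
    return EventSummary(has_start, has_stateless, last_close, last_reopen)
```

In the model, the list would presumably come from a single evaluation of `self.events` (or from a prefetched relation), so each check after that costs no extra query. For incidents with thousands of events the list is still held in memory, which is the trade-off mentioned above.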

src/argus/incident/models.py

elfjes · 2024-11-18T06:52:44Z

src/argus/incident/views.py

+                    incident.repair_end_time()
+                    abort_due_to_too_many_events(incident, event_type)


This is where we would end up in our case. There already is an END event, but the incident is reported open, so we fire another END event. The incident is repaired, so the end_time is correctly set (closing the incident), but we still get a 400 error back like we did something wrong. We're being gaslighted here: "I didn't make a mistake and just fixed it, you made a mistake!" 😂

Perhaps look at the return value of the repair?

The difficulty is: we must prevent storing the new event since it is redundant after the repair. How do we report that fact back to the API client in the best way?

incident.repair_end_time does not create any events itself.

Do we need to report that back? The client sees the incident is open and after their request the incident is closed. The way I see it is: as far as the client is concerned, the request was successful.

We might have to fake something in the API endpoint then, I'll dig into the drf-code...

I'll redirect to the GET incidents-endpoint if the client is provably outdated.
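
A rough sketch of that idea, kept free of Django and DRF so that it runs on its own. `Response`, `handle_closing_event`, and the URL pattern are hypothetical, and 303 See Other is only one candidate status for pointing the client at the up-to-date incident, not necessarily what the PR ends up using:

```python
from dataclasses import dataclass, field
from http import HTTPStatus
from typing import Dict


@dataclass
class Response:
    # Hypothetical minimal response object, not a DRF class.
    status: HTTPStatus
    headers: Dict[str, str] = field(default_factory=dict)
    body: Dict[str, str] = field(default_factory=dict)


def handle_closing_event(incident_pk: int, repaired: bool, already_closed: bool) -> Response:
    """Decide what to send back when a client posts a closing event."""
    if repaired:
        # The server fixed its own state, so the new event is redundant.
        # Point the client at the fresh incident instead of blaming it.
        location = f"/incidents/{incident_pk}/"  # assumed URL layout
        return Response(HTTPStatus.SEE_OTHER, headers={"Location": location})
    if already_closed:
        # Nothing to repair: the client simply has an outdated view.
        return Response(
            HTTPStatus.BAD_REQUEST,
            body={"type": "The incident is already closed."},
        )
    # Normal case: the event is stored and returned.
    return Response(HTTPStatus.CREATED)
```

The repair branch is checked first on purpose: after a repair the incident is closed, so testing `already_closed` first would hide the fact that the server, not the client, was wrong.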

src/argus/incident/views.py

elfjes · 2024-12-10T09:40:17Z

src/argus/incident/views.py

+                    incident.repair_end_time()
+                    abort_due_to_too_many_events(incident, event_type)


Do we need to report that back? The client sees the incident is open and after their request the incident is closed. The way I see it is: as far as the client is concerned, the request was successful.

src/argus/incident/views.py

elfjes · 2024-12-10T09:52:26Z

src/argus/incident/models.py

+        if not self.stateful:
+            # the vital part for statelessness is set correctly
+            LOG.info("Incident %s: No mismatch, correctly stateless", self.pk)
+            return


This implies that you only want to call repair_end_time on stateful events? is that worth documenting?

I don't see how that is implied.

repair_end_time has been designed to be called from anywhere and to do the right thing even if nothing is wrong. It does not assume pre-validation.

If I removed this block, or the comment in it, then I would almost be willing to bet the big bux (I don't bet though, not even LOTTO. Too much knowledge of the right kind of mathematics.) that some time in the future some helpful soul complains that the method is incomplete because it lacks this block or its comment.
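
To illustrate the "safe to call from anywhere" design, here is a minimal sketch that covers only the stateless branch discussed in this thread. The `Incident` dataclass and its `has_stateless_event` field are hypothetical stand-ins; the real method works on the Django model and handles more cases:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Incident:
    # Hypothetical stand-in for the real model.
    end_time: Optional[datetime]
    has_stateless_event: bool = False

    def repair_end_time(self) -> Optional[bool]:
        """Return True if something was fixed, None if nothing needed fixing."""
        if self.end_time is None:
            # The vital part for statelessness is already set correctly.
            return None
        if self.has_stateless_event:
            # Stateless event but an end_time is set: trust the event.
            self.end_time = None
            return True
        # Stateful incident; other checks would go here in the real method.
        return None
```

Calling it twice in a row shows the point: the first call repairs and returns `True`, the second finds nothing wrong and returns `None`, so callers do not need to pre-validate.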

elfjes · 2024-12-10T10:18:35Z

src/argus/incident/views.py

+                    repaired = incident.repair_end_time()
+                    if repaired:
+                        raise AttributeError("end_time mismatch repaired, see logs")
+                    # should never happen
+                    LOG.error("Something weird happened, see other logs")
+                    raise AttributeError("end_time mismatch was in error, see logs")


AttributeError seems like a strange exception to raise here. Perhaps a custom Exception class instead?

See commit add new exceptions
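
For readers without the commit at hand, a sketch of what dedicated exception classes could look like. The two class names appear in the later diff in this PR; the shared base class `EndTimeRepairError`, the helper `report_repair`, and the second message are assumptions made for illustration:

```python
from typing import Optional


class EndTimeRepairError(Exception):
    """Hypothetical common base so callers can catch both cases at once."""


class SuccessfulRepairException(EndTimeRepairError):
    """The server found an end_time mismatch and repaired it."""


class InconceivableException(EndTimeRepairError):
    """A mismatch was detected, but the repair found nothing to fix."""


def report_repair(repaired: Optional[bool]) -> None:
    """Raise the exception that matches the outcome of a repair attempt."""
    if repaired:
        raise SuccessfulRepairException("end_time mismatch repaired, see logs")
    # Should never happen: the caller only gets here after detecting a mismatch.
    raise InconceivableException("end_time mismatch detected but nothing was repaired, see logs")
```

Compared with `AttributeError`, distinct classes let the view map each outcome to its own response without inspecting the message string.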

src/argus/incident/models.py

elfjes · 2024-12-10T13:04:28Z

src/argus/incident/views.py

+                        raise SuccessfulRepairException("end_time mismatch repaired, see logs")
+                    # should never happen, insufficent preceeding logic construct?
+                    LOG.error("Something weird happened, see other logs")
+                    raise InconceivableException("Found end_time mismatch was in error, see logs")


Exception message seems to have a grammatical error, not sure what you wanted to say here

See new commit

Uh, which exception message?

"Found end_time mismatch was in error, see logs"

(The last line is hidden by a scroll bar.)

oh wow 🤦

src/argus/util/exceptions.py

hmpf · 2024-12-10T13:14:16Z

For some reason I'm not allowed to reply to @elfjes' comments on line 507 to 510 in incident/views:

            if event_type in Event.CLOSING_TYPES and not incident.open:
                self._abort_due_to_type_validation_error("The incident is already closed.")
            if event_type == Event.Type.REOPEN and incident.open:
                self._abort_due_to_type_validation_error("The incident is already open.")

I presume these lines are also here for completeness' sake, just like the stateless checks in repair_end_time.

A source system is not allowed to reopen but that is handled earlier in the process, in validate_event_type_for_user.

If a client tries to reopen something that is correctly reopened already, or close something that is correctly closed already, then the client has an outdated view of the world. The server cannot fix that, or repair anything in itself to fix that, it can only report.

All we can do is either:

* fail quietly and either ship back
  * nothing
  * the original event that the client doesn't know about
* what we are currently doing: fail loudly and ship back a ValidationError.
* fail loudly with something other than ValidationError, and I haven't found any good status codes

sonarcloud · 2024-12-10T13:24:01Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

elfjes · 2024-12-10T13:24:40Z

For some reason I'm not allowed to reply to @elfjes ' comments on line 507 to 510 in incident/views. I presume these lines are also here for completeness sake just like the stateful checks in repair_end_time..

A source system is not allowed to reopen but that is handled earlier in the process, in validate_event_type_for_user.

If a client tries to reopen something that is correctly reopened already, or close something that is correctly closed already, then the client has an outdated view of the world. The server cannot fix that, or repair anything in itself to fix that, it can only report.

All we can do is either:
* fail quietly and either ship back
  
  * nothing
  * the original event that the client doesn't know about

* what we are currently doing: fail loudly and ship back a ValidationError.

Yes, in case there was nothing to repair and it's all the client's fault, a ValidationError would be ok, even though I think failing quietly is a nicer experience. Not sure what the value of sending back the original event would be; imho the current/actual state of the incident (i.e. the incident itself) would be more useful.

My comment was on 510 only (keep forgetting that GH adds the other lines and that I thus need to be more explicit in my comments): the text "Found end_time mismatch was in error, see logs" is not grammatically correct

hmpf · 2024-12-10T13:49:28Z

All we can do is either:
* fail quietly and either ship back
  
  * nothing
  * the original event that the client doesn't know about

* what we are currently doing: fail loudly and ship back a ValidationError.

Yes, in case there was nothing to repair and it's all the client's fault, a ValidationError would be ok, even though I think failing quietly is a nicer experience. Not sure what the value of sending back the original event would be; imho the current/actual state of the incident (i.e. the incident itself) would be more useful.

The endpoint returns a serialized copy of the newly created event if everything is alright, having it return a serialized incident after a repair makes no sense. After the repair it needs to re-fetch the incident, yes, so it needs to know that the repair happened.

We could change the type of the ValidationError: Now it sends {"type": message}. We could make it send {"repair": message}.
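
As a small sketch of that proposal: the error payload keeps its shape and only the key changes, so a client can tell "you sent a bad event type" apart from "the server repaired itself, re-fetch the incident". The function names `validation_payload` and `needs_refetch` are hypothetical, and nothing here is tied to DRF's actual `ValidationError` class:

```python
from typing import Dict


def validation_payload(message: str, repaired: bool) -> Dict[str, str]:
    """Build the error body: {"repair": ...} after a repair, {"type": ...} otherwise."""
    key = "repair" if repaired else "type"
    return {key: message}


def needs_refetch(payload: Dict[str, str]) -> bool:
    """Client-side check: a "repair" key means the local incident is stale."""
    return "repair" in payload
```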

My comment was on 510 only (keep forgetting that GH adds the other lines and that I thus need to be more explicit in my comments): the text "Found end_time mismatch was in error, see logs" is not grammatically correct

hmpf requested review from johannaengland, stveit, lunkwill42 and elfjes November 15, 2024 11:24

hmpf force-pushed the fix-wrong-end-time branch from d6468ba to 95a3ba7 Compare November 15, 2024 12:20

elfjes suggested changes Nov 18, 2024

hmpf self-assigned this Nov 28, 2024

hmpf force-pushed the fix-wrong-end-time branch 2 times, most recently from bd6b87d to e23c3b9 Compare December 5, 2024 08:30

hmpf added 5 commits December 9, 2024 14:19

Fix wrong end_time

a75a351

/.. also, attempt to clean up/document event types, and event type validation in the API.

fix typo

f837230

fix typo

234cc22

fix test problem

240a44f

add changelog fragment

aa56697

hmpf force-pushed the fix-wrong-end-time branch from e23c3b9 to aa56697 Compare December 9, 2024 13:19

hmpf added 3 commits December 9, 2024 14:25

lint

c0975fb

fixup: use constants

ed86365

not finished: improved/safer repair logic and reporting

4849416

hmpf requested a review from elfjes December 10, 2024 09:42

hmpf added the ddn Design decision needed label Dec 10, 2024

elfjes reviewed Dec 10, 2024

add new exceptions

1fdac45

elfjes reviewed Dec 10, 2024

hmpf added 2 commits December 10, 2024 14:19

improve exception comments

5cd1754

fix comment where Inconceivable! is used

9f5c925
