Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing event "Empty Notification" in "hono.event.${tenant_id}" Kafka topic when a larger number of devices are disconnected from MQTT adapter at the same time #3655

Closed
petr-cada opened this issue Aug 21, 2024 · 8 comments
Labels
C&C Command and Control MQTT Adapter

Comments

@petr-cada
Copy link

We are using Eclipse Hono in our product for connecting devices via MQTT (hono-adapter-mqtt). We are installing Eclipse Hono to our Kubernetes cluster via Helm chart:

dependencies:

We are also using Eclipse Ditto in our product. Hono and Ditto are communicating with each other via Kafka.

We found out that sometimes devices have wrong data in ConnectionStatus feature in Ditto. Value in readyUntil not corresponding the real state of MQTT connection of device (device is disconnected but it looks like connected). We are using ConnectionStatus mapper https://eclipse.dev/ditto/connectivity-mapping.html#connectionstatus-mapper to automatically set ConnectionStatus (based on Eclipse Hono device notifications).

After some investigation we found out following.
When we disconnect larger number of devices (e.g. 100) from Hono MQTT adapter at the same time the "Empty Notification" https://eclipse.dev/hono/docs/api/event/#empty-notification with "ttd":"0" is not present in "hono.event.${tenant_id}" Kafka topic for some devices (it happens only sometimes).
Just for curriosity we enabled "Connection Event" https://eclipse.dev/hono/docs/api/event/#connection-event for Hono MQTT adapter. The interesting is that "Connection Event" with "cause": "disconnected" is always present in "hono.event.${tenant_id}" Kafka topic.
As you can see in the picture bellow the "Empty Notification" with "ttd":"0" is missing after "Connection Event" with "cause": "disconnected".

The consequence of this is that "ConnectionStatus" feature in Ditto does not contain proper values. To put it simply the device in Ditto seems as connected even though it is disconnected.

We did not found any error in Hono MQTT adapter and Hono command router services which will lead us to cause of this problem.

not-working-connected-disconnected-empty-events

@calohmn calohmn added MQTT Adapter C&C Command and Control labels Aug 21, 2024
@calohmn
Copy link
Contributor

calohmn commented Aug 21, 2024

The "ttd":"0" notifications get sent by the Command Router (see #2760). (In contrast to that, the "Connection Event" messages get sent by the MQTT adapter directly.)
Missing ttd notifications could mean that there was an issue with the connection between MQTT adapter and CommandRouter ("no-credit" errors maybe), or it could be that the device command subscription in the command router had already been removed before. Or it could happen if before the MQTT device disconnect, there was already another connect/subscribe from the same device id (log message in that case: "command consumer not removed - handler already replaced or removed").

@petr-cada It would be helpful, if you could provide DEBUG log output of the MQTT adapter and Command Router pods of such a case. (In the Hono Helm chart, DEBUG logs are enabled if you haven't set the quarkusLoggingProfile value to "prod".)

@petr-cada
Copy link
Author

You can find the logs from MQTT adapter and Command Router pods at the end of this post. The logs are from today's reproduction of the issue.

Just summary how I reproduced the issue today.

  • I connected 200 devices to our product.
  • Then I disconnected all devices at the same time.
  • After that I repeated the connection and disconnection of all devices a few times.
  • Everything worked fine.
  • At 13:49 CEST (11:49 UTC - in logs) I connected all devices. Everything looked fine in our product.
  • Then at 13:51 CEST (11:51 UTC - in logs) I disconnected all devices.
  • After the last disconnection, 44 devices still appeared connected in our system.
  • I investigated Kafka topic "hono.event.${tenant_id}".
  • I compared entries for disconnected and "connected" (in reality also disconnected) devices and I found that
    • disconnected devices had a last entry in Kafka "Empty Notification" with "ttd":"0"
    • "connected" (in reality also disconnected) devices did not have this entry in Kafka (last entry was "Connection Event" with "cause": "disconnected")

Look at the images

The provided log files contains logs from MQTT adapter and Command Router pods from that time.

Just don't be confused by the timestamps.
Logs contain UTC timestamps and screenshots from Kafka contain CEST (UTC+2) timestamps.

Log files
command-router-44-appeared-not-disconnected.txt
mqtt-adapter-44-appeared-not-disconnected.txt

List of all devices which appeared as connected (in reality they were disconnected)
list-of-devices-appeared-not-disconnected

@calohmn
Copy link
Contributor

calohmn commented Aug 22, 2024

  • Then at 13:51 CEST (11:51 UTC - in logs) I disconnected all devices.

In the MQTT adapter logs I only see corresponding messages for 6 devices:

2024-08-22 11:51:09,891 [...] connection to device closed [[...], device-id: yunex:load-test-device-194, client ID: yunex:load-test-device-194]
2024-08-22 11:51:09,894 [...] connection to device closed [[...], device-id: yunex:load-test-device-178, client ID: yunex:load-test-device-178]
2024-08-22 11:51:09,895 [...] connection to device closed [[...], device-id: yunex:load-test-device-181, client ID: yunex:load-test-device-181]
2024-08-22 11:51:09,897 [...] connection to device closed [[...], device-id: yunex:load-test-device-131, client ID: yunex:load-test-device-131]
2024-08-22 11:51:09,899 [...] connection to device closed [[...], device-id: yunex:load-test-device-149, client ID: yunex:load-test-device-149]
2024-08-22 11:51:09,900 [...] connection to device closed [[...], device-id: yunex:load-test-device-168, client ID: yunex:load-test-device-168]
------- last log line at 2024-08-22 11:51:10,399 -------

How are you performing the disconnect? Are you sending an MQTT DISCONNECT Packet or just closing the socket? In the latter case, it may take a while until the MQTT adapter notices that the client isn't connected any more (we had such scenarios in #2244 for example). This also depends on the MQTT keep-alive value reported by your MQTT client.

Another thing:
In the Command Router logs, I see no log messages at all concerning the device disconnects from 11:51:09 UTC, although there should have been unregister command consumer log messages there (corresponding to the logged CommandRouter responses in the MQTT adapter) and although there are other log messages there until the last one at 11:51:13,276.
Is it that you have another Command Router pod running?

@petr-cada
Copy link
Author

Regarding section "How are you performing the disconnect? ..."

In this test scenario MQTT DISCONNECT Packet should not be sent from a device. But it should not matter.
Just a context for better understanding of situation. We need to test all possible scenarios which include e.g. power outage on device. We need to guarantee that connection status of device (connected/disconnected) is correct under all circumstance. Users quite heavily depend on this functionality.

I think the wrong detection of device disconnection is not problem of mqtt adapter because as you wrote in your first response "In contrast to that, the "Connection Event" messages get sent by the MQTT adapter directly.".
The "Connection Event" with "cause": "disconnected" is always present in "hono.event.${tenant_id}" Kafka topic (as I wrote before) so disconnection of device was detected correctly by mqtt adapter.
The problem is that "Empty Notification" with "ttd":"0" missing in "hono.event.${tenant_id}" Kafka topic (ttd is used by Ditto ConnectionStatus mapper https://eclipse.dev/ditto/connectivity-mapping.html#connectionstatus-mapper).

Regarding section "In the Command Router logs, I see no log messages ..."

There is only one instance of Command Router service (one pod) in testing environment where the test was executed.

@calohmn
Copy link
Contributor

calohmn commented Aug 26, 2024

Regarding section "In the Command Router logs, I see no log messages ..."

There is only one instance of Command Router service (one pod) in testing environment where the test was executed.There is only one instance of Command Router service (one pod) in testing environment where the test was executed.

Then it seems to me that the interesting parts of the log output possibly came after the time period covered in the attached log files.

A possible cause for the issue here could be that there are errors in communication between the MQTT adapter and the command router because of the high volume of requests at a time.

You could try adapting some configuration values:
In the command router deployment YAML, you could add an env variable with

name: HONO_COMMANDROUTER_AMQP_RECEIVERLINKCREDIT
value: 300

The corresponding property of the Helm chart values.yaml would be commandRouterService.hono.commandRouter.amqp.receiverLinkCredit.

In the MQTT adapter deployment YAML, you could add an env variables with

name: HONO_COMMANDROUTER_REQUESTTIMEOUT
value: 500

The corresponding property of the Helm chart values.yaml would be
adapters.commandRouterSpec.requestTimeout.

As a general fix in Hono, it should be considered to add some sort of retry mechanism in case of "no credit" errors when unsubscribing devices and/or adding batch requests for unsubscribing (see #3445).

@petr-cada
Copy link
Author

After a testing it seems that increasing value of HONO_COMMANDROUTER_AMQP_RECEIVERLINKCREDIT from 100 to 300 fixed the issue (sudden disconnection of 200 devices is now working correctly).

I guess that increasing of HONO_COMMANDROUTER_AMQP_RECEIVERLINKCREDIT can also lead to increased consumption of memory. So I increased hono.commandRouterService.resources.requests.memory (in values.yaml) from default value 256Mi to 512Mi.

Do you have any idea how far can I go with HONO_COMMANDROUTER_AMQP_RECEIVERLINKCREDIT value (whether e.g. 500 is safe) based on this newly "guaranteed" value (512Mi) of memory.

Thank you

@calohmn
Copy link
Contributor

calohmn commented Sep 5, 2024

I guess that increasing of HONO_COMMANDROUTER_AMQP_RECEIVERLINKCREDIT can also lead to increased consumption of memory. So I increased hono.commandRouterService.resources.requests.memory (in values.yaml) from default value 256Mi to 512Mi.

Do you have any idea how far can I go with HONO_COMMANDROUTER_AMQP_RECEIVERLINKCREDIT value (whether e.g. 500 is safe) based on this newly "guaranteed" value (512Mi) of memory.

On the AMQP link between a protocol adapter and the command router, only messages concerning the management of the command subscriptions get sent (not command(response) messages itself). These management messages are usually very small. Therefore, I wouldn't expect much of a memory usage increase with the 500 link credit value. You could check the memory metrics of the pods during a test run to get actual values.
With the increased link credit and the possible higher number of in-flight messages because of that, the time that it takes for individual requests from the protocol adapter to be processed by the command router could go up. Therefore there could be the need to increase the request-timeout (see previous message above), if you see timeout-errors in the protocol adapter. Such errors would come up as request timed out after [x]ms log messages, e.g.:

DEBUG [org.ecl.hon.cli.amq.RequestResponseClient] (vert.x-eventloop-thread-1) canceling request [target: cmd_router/DEFAULT_TENANT, correlation ID: cmd_router-client-9a4bf87f-a0f0-4465-96a9-9d579f3a6797]: request timed out after 200ms
INFO  [org.ecl.hon.cli.com.ProtocolAdapterCommandConsumerFactoryImpl] (vert.x-eventloop-thread-1) error unregistering consumer with the command router service [tenant: DEFAULT_TENANT, device: testdeviceA50]: org.eclipse.hono.client.ServerErrorException: request timed out after 200ms

@petr-cada
Copy link
Author

Thank you for the information. We had already set HONO_COMMANDROUTER_REQUESTTIMEOUT to "10000" before we reported the issue, which I believe is more than enough (we didn't see any timeout error with this setting). In our case, increasing HONO_COMMANDROUTER_AMQP_RECEIVERLINKCREDIT resolved the issue. I will close this issue as resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C&C Command and Control MQTT Adapter
Projects
None yet
Development

No branches or pull requests

2 participants