[bug]: 0.11.1 : More CPU usage for the database server #1066
Comments
The new version has improvements managing distributed queues. In other deployments version 0.11 was actually faster. |
Here is the feedback from my database provider: A few days ago, the database was handling less than 2,000 queries per second, including around 100 insertions per second. Currently, they are observing about 20,000 queries per second, including more than 5,000 insertions per second. |
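To put the provider's figures in proportion, here is a rough calculation. It treats the quoted numbers as exact, although the provider gave them only as approximate bounds ("less than", "around", "more than"), so the results are order-of-magnitude estimates.

```python
# Rough growth estimate from the provider's figures.
# Assumption: the approximate values quoted above are treated as exact.

def growth_factor(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before

queries_before, queries_after = 2_000, 20_000   # queries per second
inserts_before, inserts_after = 100, 5_000      # insertions per second

total_growth = growth_factor(queries_before, queries_after)
insert_growth = growth_factor(inserts_before, inserts_after)

# Share of all queries that are insertions, before and after the upgrade.
insert_share_before = inserts_before / queries_before
insert_share_after = inserts_after / queries_after

print(f"total queries grew about {total_growth:.0f}x")
print(f"insertions grew about {insert_growth:.0f}x")
print(f"insert share: {insert_share_before:.0%} -> {insert_share_after:.0%}")
```

Under these assumptions, insertions grew faster than overall traffic, which suggests a write-heavy pattern rather than a uniform increase.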
Try disabling the metrics store to see if it helps |
For information: the change in behavior seems to have started around 10:30 AM (upgrade to 0.11.1), with a gradual increase in the number of queries. The queries mainly involve your DATABASE_NAME database. These queries follow a similar pattern: |
For your information, at 12 PM, I switched from a MySQL instance with 2 CPUs to an instance with 4 CPUs. The overall CPU usage remains stuck at 90% regardless of the number of CPUs in the instance. I suspect an issue related to the queue, as mentioned in my previous messages. In any case, there seems to be a problem since version 0.11.1, and the looping logs are related to the queue. Sorry to insist, but this is causing a massive slowdown for all my users, whereas in version 0.10.7 everything was perfect. I am currently stuck and therefore relying on you to resolve this issue. |
See #1069 for solution. |
Try to obtain the top queries that are causing most of the load. That will tell us what is going on. |
There are multiple issues open. I'm travelling at the moment and it's hard to follow them from my phone, so let's stick to this one thread please. Here is a reply to each one of the problems you reported:
|
No problem, I’ll stay here. How do you explain that when I disable all the options in the spam filter, I no longer get any warning messages about the concurrent connection limit? I also see a significant drop in the number of MySQL connections, and the entire infrastructure becomes smooth again. I also no longer get any queue locked messages. Disabling the spam filter features resolves all these messages, whether they are informational or errors. Their excessive presence was not normal. I’m bringing all this up and insisting a bit to help Stalwart and assist you in identifying an issue, not to bother you. I dealt with the problem all day and I’m glad I resolved it by disabling the spam filter features. However, if I can help you figure out what is causing all the issues encountered (hence the multiple bugs reported, as initially I didn’t think they were related), it would be my pleasure. |
The spam filter does not trigger any outbound messages and does not interact with the queue in any way, so it is strange that the concurrency errors disappear when disabling it. The filter is executed at the SMTP DATA stage before a message is accepted in the system. Perhaps in 0.10 you had the spam filter disabled and you were not used to seeing all those extra queries? |
I’m completely lost. The issue has reappeared after an hour without problems. I have no outgoing emails, nothing is being sent. How can the limit be reached if nothing is going out? |
Check how many records you have in tables 'e' and 'q'; those hold the message queue. |
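One way to get those counts is sketched below. Only the table names 'e' and 'q' come from this thread. The helper name `count_queue_rows`, the column layout, and the in-memory SQLite database standing in for MySQL are all assumptions for illustration. With a real deployment you would pass a DB-API connection from your MySQL driver instead.

```python
import sqlite3

# Table names taken from the comment above; everything else is illustrative.
QUEUE_TABLES = ("e", "q")

def count_queue_rows(conn, tables=QUEUE_TABLES):
    """Return {table_name: row_count} for the given DB-API connection."""
    counts = {}
    cur = conn.cursor()
    for table in tables:
        # Table names come from the fixed tuple above, never from user input.
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        counts[table] = cur.fetchone()[0]
    cur.close()
    return counts

# Demo: in-memory SQLite as a stand-in for the MySQL store.
# The (k, v) columns are hypothetical, not the actual schema.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE e (k BLOB PRIMARY KEY, v BLOB)")
demo.execute("CREATE TABLE q (k BLOB PRIMARY KEY, v BLOB)")
demo.executemany(
    "INSERT INTO q VALUES (?, ?)",
    [(bytes([i]), b"") for i in range(3)],
)
demo_counts = count_queue_rows(demo)
print(demo_counts)
```

Running the count a few times, some seconds apart, would show whether the queue is draining or staying constant.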
Okay, so all the emails in the queue are looping. Please take a look at the following video. Enregistrement.de.l.ecran.2025-01-08.a.19.45.40.mov |
And do you see these messages from the webadmin? |
The 52 messages are visible in the web admin. I performed the following test: So, it seems that all emails in the queue are continuously looping rather than being sent once every 30 minutes, for example. I experience a heavy load as soon as there is an email in the queue because it loops and overloads everything |
Please export the queue by running stalwart with --console and I'll look into it soon. |
What is the command to export the queue, please?
In the meantime, I used the CLI and sent it to you via Discord |
I'll look into this tomorrow evening. |
Alright. I sent you a video on Discord to illustrate the retry action that’s not working. CLI: Queue message detail: |
Do you have another method to force a retry? |
Try downgrading to 0.10 or use a single server for outgoing mail. This can be achieved by pausing the queue on all servers except one. |
If I revert to version 0.10.7, I encounter corrupted data errors.
I redirected port 25 to a single machine instead of the 4 machines, but how can I pause the queue on the other machines? Currently, the inbound message queues are looping at a rate of 3 to 4 times per second for each queue ID. You can see this in the videos, etc. |
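As a back-of-the-envelope illustration of why this looping is expensive, the sketch below combines two figures from this thread: the 52 queued messages seen in the web admin and the observed 3 to 4 loops per second per queue ID. It assumes all 52 messages loop at that rate, which was not measured directly, and uses the "once every 30 minutes" retry interval mentioned earlier only as an example baseline.

```python
# Estimate of queue processing attempts per second.
# Assumptions: all 52 queued messages loop at the observed rate, and the
# 30-minute retry interval is an example baseline, not a configured value.

queued_messages = 52
loops_per_second_low, loops_per_second_high = 3, 4

attempts_low = queued_messages * loops_per_second_low
attempts_high = queued_messages * loops_per_second_high

# Baseline: each message retried once every 30 minutes.
baseline_interval_s = 30 * 60
baseline_attempts = queued_messages / baseline_interval_s

print(f"looping:  {attempts_low} to {attempts_high} attempts per second")
print(f"baseline: {baseline_attempts:.3f} attempts per second")
print(f"ratio:    roughly {attempts_low / baseline_attempts:.0f}x or more")
```

Each attempt presumably involves several database queries, so even a small queue could account for a large share of the extra load.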
Ok, I clicked on ‘Pause’ on 3 out of 4 servers, and my CPU load is almost back to normal. |
You won't lose any messages; one server can handle the delivery of the entire cluster. The issue seems to be related to the distributed locks, so letting just one server handle the queue should alleviate that. You might want to clear the in-memory counter table, which contains the locks. Also, you seem to be the only distributed setup affected by this. This release is currently being used by an Enterprise client running an AlloyDB cluster handling hundreds of thousands of deliveries a day. The only difference is that they are using Redis as the memory store and, of course, AlloyDB, but your issue seems to be unrelated to the backend type. |
Thank you for your response. To summarize: The 3 nodes with stopped queues no longer loop through the queued messages. To await resolution of the issue, I have:
This is therefore a temporary solution, but it generates an error for every received email. In version 0.10.7, I did not encounter this type of issue. The queue did not loop at all, and messages were delivered on time. MySQL is handling the workload well at the moment, and I am not sure if switching to a different database mid-operation is feasible without data loss, so I cannot replace MySQL. The videos I sent you clearly show the infinite looping on queued messages and the relationship between this infinite loop and the significant CPU spike on the database server. I hope the error in version 0.10.7 does not lead to another disaster, but for now, I have no other choice while waiting for your analysis and response. |
What happened?
Since the update to version 0.11.1 of the cluster machines (4 machines), I have noticed a 20-30% increase in CPU usage on the database server.
Is this an expected increase following this update?
How can we reproduce the problem?
Not simple to reproduce. The red bar represents the update.
Version
v0.11.x
What database are you using?
MySQL
What blob storage are you using?
S3-compatible
Where is your directory located?
Internal
What operating system are you using?
Linux
Relevant log output
No response