
[bug]: 0.11.1 : More CPU usage for the database server #1066

Open
1 task done
NathanHbt opened this issue Jan 8, 2025 · 32 comments
Labels
bug Something isn't working

Comments

@NathanHbt

What happened?

Since updating the cluster machines (4 machines) to version 0.11.1, I have noticed a 20-30% increase in CPU usage on the database server.
Is this increase expected after this update?

How can we reproduce the problem?

Not easy to reproduce. In the screenshots below, the red bar marks the time of the update.

[Screenshot 2025-01-08 11:29:09]
[Screenshot 2025-01-08 11:29:14]

Version

v0.11.x

What database are you using?

MySQL

What blob storage are you using?

S3-compatible

Where is your directory located?

Internal

What operating system are you using?

Linux

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@NathanHbt NathanHbt added the bug Something isn't working label Jan 8, 2025
@mdecimus
Member

mdecimus commented Jan 8, 2025

The new version includes improvements to distributed queue management; in other deployments, version 0.11 was actually faster.
Can you check which queries are generating the load?
It could also be that a purge task is running at the moment.
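A generic way to take that snapshot on MySQL (this is a plain MySQL sketch, not a Stalwart command) is to look at what each connection is doing right now:

```sql
-- Group active connections by state to spot pile-ups such as
-- 'waiting for handler commit'. Requires the PROCESS privilege.
SELECT state, COUNT(*) AS connections
FROM information_schema.processlist
WHERE command <> 'Sleep'
GROUP BY state
ORDER BY connections DESC;

-- Or, to see the full statements:
SHOW FULL PROCESSLIST;
```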

@NathanHbt
Author

For your information, since the update, CPU usage has been at 95%, and the number of connections has increased from 50 to a constant 250.
I see a lot of connections in the “waiting for handler commit” state and heavy activity on the “m” table.
[Screenshot 2025-01-08 13:41:01]

@NathanHbt
Author

NathanHbt commented Jan 8, 2025

[Screenshot 2025-01-08 13:42:26]

@NathanHbt
Author

[Screenshot 2025-01-08 13:54:58]

@NathanHbt
Author

Here is the feedback from my database provider:

A few days ago, the database was handling less than 2,000 queries per second, including around 100 insertions per second.

Currently, they are observing about 20,000 queries per second, including more than 5,000 insertions per second.

@mdecimus
Member

mdecimus commented Jan 8, 2025

Try disabling the metrics store to see if it helps

@NathanHbt
Author

[Screenshot 2025-01-08 15:24:19]

Are you referring to this?

@NathanHbt
Author

For your information:

The change in behavior seems to have started around 10:30 AM (upgrade to 0.11.1), with a gradual increase in the number of queries.

The queries mainly involve your DATABASE_NAME database.

These queries follow a similar pattern:
INSERT INTO m (k, v) VALUES (…) ON DUPLICATE KEY UPDATE v = VALUES(v).

@NathanHbt
Author

[Screenshot 2025-01-08 17:07:49]

For your information, at 12 PM, I switched from a MySQL instance with 2 CPUs to an instance with 4 CPUs. The overall CPU usage remains stuck at 90% regardless of the number of CPUs in the instance.

I suspect an issue related to the queue, as mentioned in my previous messages. In any case, there seems to be a problem since version 0.11.1, and the looping logs are related to the queue.

Sorry to press the point, but this is causing a massive slowdown for all my users, whereas on version 0.10.7 everything was perfect. I am currently stuck and relying on you to resolve this issue.

@NathanHbt
Author

See #1069 for the solution.

@mdecimus
Member

mdecimus commented Jan 8, 2025

Try to obtain the top queries that are causing most of the load. That will tell us what is going on.
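On MySQL, one way to pull the top statement patterns (assuming performance_schema is enabled, which is the default on 5.7+ and 8.0) is the statement digest summary:

```sql
-- Top 10 normalized statements by total time spent, with execution counts.
-- SUM_TIMER_WAIT is reported in picoseconds, hence the division.
SELECT DIGEST_TEXT,
       COUNT_STAR                      AS executions,
       ROUND(SUM_TIMER_WAIT / 1e12, 1) AS total_seconds
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
```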

@mdecimus
Member

mdecimus commented Jan 8, 2025

There are multiple issues open; I'm travelling at the moment and it's hard to follow them from my phone, so let's stick to this one thread, please. Here is a reply to each of the problems you reported:

  • Concurrent limit exceeded: This is not an error message; the server is just telling you that a message is due for immediate delivery but cannot be processed at the moment because the node has exceeded the maximum number of concurrent outbound connections. This is a new feature that protects the server from crashing by trying to deliver more messages than it can handle. However, the default log level should be changed to debug, as it can generate a massive number of events, as you have seen.
  • Queue event locked: Also not an error; it means that another server is processing that particular message.
  • Spam filter causing database load: It's probably the Bayes filter storing the weights and reputation data. I recommend using Redis or similar as the in-memory store. Even if you don't use the spam filter, you should be using an in-memory store rather than SQL. The updated documentation explains what the in-memory store is used for (see the query sketch below for quantifying this load).
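If performance_schema is enabled, a rough way to measure how much of the write load comes from these key-value upserts (assuming the Bayes/reputation data lands in the `m` table, as the INSERT pattern earlier in this thread suggests) is:

```sql
-- Frequency and total cost of the upsert pattern against the `m` table.
-- Digest formatting varies slightly between MySQL versions; adjust the
-- LIKE pattern if it does not match on your server.
SELECT DIGEST_TEXT,
       COUNT_STAR                      AS executions,
       ROUND(SUM_TIMER_WAIT / 1e12, 1) AS total_seconds
FROM performance_schema.events_statements_summary_by_digest
WHERE DIGEST_TEXT LIKE 'INSERT INTO `m` %'
ORDER BY COUNT_STAR DESC;
```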

@NathanHbt
Author

NathanHbt commented Jan 8, 2025

No problem, I’ll stay here.

How do you explain that when I disable all the options in the spam filter, I no longer get any warning messages about the concurrent connection limit? I also see a significant drop in the number of MySQL connections, and the entire infrastructure becomes smooth again.

I also no longer get any queue locked messages.

Disabling the spam filter features resolves all these messages, whether they are informational or errors. Their excessive presence was not normal.

I’m bringing all this up and insisting a bit to help Stalwart and assist you in identifying an issue, not to bother you. I dealt with the problem all day and I’m glad I resolved it by disabling the spam filter features. However, if I can help you figure out what is causing all the issues encountered (hence the multiple bugs reported, as initially I didn’t think they were related), it would be my pleasure.

[Screenshot 2025-01-08 18:08:18]

@mdecimus
Member

mdecimus commented Jan 8, 2025

The spam filter does not trigger any outbound messages and does not interact with the queue in any way, so it is strange that the concurrency errors disappear when you disable it. The filter is executed at the SMTP DATA stage, before a message is accepted into the system.
What the spam filter does generate is a lot of queries to run the Bayes classifier and the reputation modules, which is why I was suggesting Redis (which you should use in any case).

Perhaps in 0.10 you had the spam filter disabled and were not used to seeing all those extra queries?

@NathanHbt
Author

I’m completely lost. The issue has reappeared after an hour without problems.
Where could this message come from:
“Concurrency limit exceeded (queue.concurrency-limit-exceeded) details = Outbound concurrency limit exceeded, limit = 25”?

I have no outgoing emails, nothing is being sent. How can the limit be reached if nothing is going out?

@mdecimus
Member

mdecimus commented Jan 8, 2025

Check how many records you have in tables 'e' and 'q'; that is the message queue.
You should also be able to see them from the queue management page in the webadmin.
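For reference, a direct check from a MySQL client could look like this (assuming the single-letter table names 'e' and 'q' used by the SQL backend, as mentioned above):

```sql
-- Row counts for the queue event ('e') and queued message ('q') tables.
SELECT 'e' AS tbl, COUNT(*) AS row_count FROM e
UNION ALL
SELECT 'q', COUNT(*) FROM q;
```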

@NathanHbt
Author

Only 52 messages in these tables.
[Screenshot 2025-01-08 19:42:18]
[Screenshot 2025-01-08 19:42:23]

@NathanHbt
Author

Okay, so all the emails in the queue are looping. Please take a look at the following video.

[Screen recording: Enregistrement.de.l.ecran.2025-01-08.a.19.45.40.mov]

@mdecimus
Member

mdecimus commented Jan 8, 2025

And do you see these messages in the webadmin?
Also, are you experiencing high load at this moment?

@NathanHbt
Author

NathanHbt commented Jan 8, 2025

The 52 messages are visible in the webadmin.
Watch the video: you can see that a message in the queue is continually being re-pushed.

I performed the following test:
  • Delete the email from the queue via the interface.
  • Restart the server.
After that, the log no longer loops on this queue ID.

So it seems that all emails in the queue are looping continuously rather than being retried once every 30 minutes, for example.

I experience a heavy load as soon as there is an email in the queue, because it loops and overloads everything.

@mdecimus
Member

mdecimus commented Jan 8, 2025

Please export the queue by running stalwart with --console and I'll look into it soon.

@NathanHbt
Author

NathanHbt commented Jan 8, 2025

What is the command to export the queue, please?
I only have:
Enter commands (type 'help' for available commands).

help
Available commands:
scan <from_key> <to_key>
delete <from_key> [<to_key>]
get
put []
help
exit/quit

In the meantime, I used the CLI and sent it to you via Discord

@NathanHbt
Author

NathanHbt commented Jan 8, 2025

I just sent you a private video on Discord that clearly shows the issue.

After clearing the ~50 messages from the queue and restarting the cluster nodes, the database CPU load returned to normal (80% => 30%). This suggests there is an infinite loop in the process handling queued messages, which could explain the thousands of log entries and the limits being reached.

I am not sharing the video here for confidentiality reasons.

[Screenshot 2025-01-08 20:26:43]

@mdecimus
Member

mdecimus commented Jan 9, 2025

I'll look into this tomorrow evening.

@NathanHbt
Author

NathanHbt commented Jan 9, 2025

Alright.
For your information, in the meantime I am forced to clear the message queue manually.
I tried creating a script to force a “retry” on the messages in the queue, but neither the interface nor the CLI retry has any effect: the email remains in the queue, and no server error message is displayed. The only option is to “cancel” a message, which means losing it.

I sent you a video on Discord to illustrate the retry action that’s not working.

CLI:
queue retry 307D7024CA03859
Successfully rescheduled 1 message(s).

Queue message detail:
+----------+-----------+
| Status   | Scheduled |
| Details  |           |
| Retry #  | 0         |
+----------+-----------+

@NathanHbt
Author

Do you have another method to force a retry?
This heavily impacts my infrastructure, and managing 3 days of issues will be challenging.

@mdecimus
Member

mdecimus commented Jan 9, 2025

Try downgrading to 0.10, or use a single server for outgoing mail. The latter can be achieved by pausing the queue on all servers except one.

@NathanHbt
Author

NathanHbt commented Jan 9, 2025

If I revert to version 0.10.7, I encounter corrupted data errors.

ERROR Data corruption detected (store.data-corruption) causedBy = crates/store/src/write/mod.rs:365, causedBy = crates/store/src/dispatch/store.rs:125, causedBy = crates/jmap/src/services/index.rs:122, details = Failed to iterate over index emails

I redirected port 25 to a single machine instead of the 4 machines, but how can I pause the queue on the other machines?

Currently, the inbound message queues are looping at a rate of 3 to 4 times per second for each queue ID. You can see this in the videos, etc.

@NathanHbt
Author

OK, I clicked 'Pause' on 3 of the 4 servers, and my CPU load is almost back to normal.
Only server #1 continues looping through the queue IDs. That is acceptable while the issue is being fixed, but I'm afraid I might lose quite a few emails.

@NathanHbt
Author

I also notice that many emails remain in the queue without any specific reason displayed.
Take this email, for example; the ‘Server Response’ field remains blank, yet the email is not delivered.

[Screenshot 2025-01-09 12:09:44]

@mdecimus
Member

mdecimus commented Jan 9, 2025

> OK, I clicked 'Pause' on 3 of the 4 servers, and my CPU load is almost back to normal.
>
> Only server #1 continues looping through the queue IDs. That is acceptable while the issue is being fixed, but I'm afraid I might lose quite a few emails.

You won't lose any messages; one server can handle delivery for the entire cluster. The issue seems to be related to the distributed locks, so letting just one server handle the queue should alleviate it. You might also want to clear the in-memory counter table, which contains the locks.

Also, you seem to be the only distributed setup affected by this. This release is currently being used by an Enterprise client running an AlloyDB cluster that handles hundreds of thousands of deliveries a day. The only differences are that they use Redis as the in-memory store and, of course, AlloyDB, but your issue seems to be unrelated to the backend type.
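For readers in the same situation, a cautious way to look at that table before clearing anything (a sketch that assumes the in-memory key-value data lives in the `m` table of the SQL backend, as the INSERT pattern earlier in this thread suggests; the actual counter/lock table may differ in your schema, so verify first and take a backup):

```sql
-- Inspect how many key-value rows (counters, locks, Bayes weights, etc.)
-- have accumulated in the assumed in-memory store table.
SELECT COUNT(*) AS kv_rows FROM m;

-- Clearing it is destructive and should only be done with all nodes
-- stopped and a backup in hand, e.g.:
-- DELETE FROM m;
```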

@NathanHbt
Author

NathanHbt commented Jan 9, 2025

Thank you for your response.

To summarize:
• 3 nodes with the queue stopped
• 1 node with the queue active receiving emails via port 25

The 3 nodes with stopped queues no longer loop through the queued messages.
The node with the active queue continuously loops. Each queue ID appears in the logs multiple times per second, causing delivery delays and a significant increase in CPU usage on the MySQL database.

While waiting for the issue to be resolved, I have:
• Set up a server running version 0.10.7
• Redirected incoming SMTP traffic to this server
• Encountered the following error for each message, but the mail is successfully received, and indexing is properly performed on the other nodes.

ERROR Data corruption detected (store.data-corruption) causedBy = crates/store/src/write/mod.rs:365, causedBy = crates/store/src/dispatch/store.rs:125, causedBy = crates/jmap/src/services/index.rs:122, details = Failed to iterate over index emails

This is therefore a temporary solution, but it generates an error for every received email.

In version 0.10.7, I did not encounter this type of issue. The queue did not loop at all, and messages were delivered on time. MySQL is handling the workload well at the moment, and I am not sure if switching to a different database mid-operation is feasible without data loss, so I cannot replace MySQL.

The videos I sent you clearly show the infinite looping on queued messages and the relationship between this infinite loop and the significant CPU spike on the database server.

I hope the error in version 0.10.7 does not lead to another disaster, but for now, I have no other choice while waiting for your analysis and response.
