
[bug]: 0.11.1 : More CPU usage for the database server #1066

Open
1 task done
NathanHbt opened this issue Jan 8, 2025 · 32 comments
Labels
bug Something isn't working

Comments

@NathanHbt

What happened?

Since updating the cluster machines (4 machines) to version 0.11.1, I have noticed a 20-30% increase in CPU usage on the database server.
Is this increase expected after this update?

How can we reproduce the problem?

Not easy to reproduce. In the screenshots below, the red bar marks the time of the update.

[Screenshot 2025-01-08 11:29:09]
[Screenshot 2025-01-08 11:29:14]

Version

v0.11.x

What database are you using?

MySQL

What blob storage are you using?

S3-compatible

Where is your directory located?

Internal

What operating system are you using?

Linux

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@NathanHbt NathanHbt added the bug Something isn't working label Jan 8, 2025
@mdecimus
Member

mdecimus commented Jan 8, 2025

The new version includes improvements to distributed queue management; in other deployments, version 0.11 was actually faster.
Can you check which queries are generating the load?
It could also be that a purge task is running at the moment.
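A generic way to take that snapshot on MySQL (this is a plain MySQL sketch, not a Stalwart command) is to look at what each connection is doing right now:

```sql
-- Group active connections by state to spot pile-ups such as
-- 'waiting for handler commit'. Requires the PROCESS privilege.
SELECT state, COUNT(*) AS connections
FROM information_schema.processlist
WHERE command <> 'Sleep'
GROUP BY state
ORDER BY connections DESC;

-- Or, to see the full statements:
SHOW FULL PROCESSLIST;
```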

@NathanHbt
Author

For your information, since the update, CPU usage has been at 95%, and the number of connections has increased from 50 to a constant 250.
I see a lot of connections in the “waiting for handler commit” state and heavy activity on the “m” table.
[Screenshot 2025-01-08 13:41:01]

@NathanHbt
Author

NathanHbt commented Jan 8, 2025

[Screenshot 2025-01-08 13:42:26]

@NathanHbt
Author

[Screenshot 2025-01-08 13:54:58]

@NathanHbt
Author

Here is the feedback from my database provider:

A few days ago, the database was handling less than 2,000 queries per second, including around 100 insertions per second.

Currently, they are observing about 20,000 queries per second, including more than 5,000 insertions per second.

@mdecimus
Member

mdecimus commented Jan 8, 2025

Try disabling the metrics store to see if it helps

@NathanHbt
Author

[Screenshot 2025-01-08 15:24:19]

Are you referring to this?

@NathanHbt
Author

For your information:

The change in behavior seems to have started around 10:30 AM (upgrade to 0.11.1), with a gradual increase in the number of queries.

The queries mainly involve your DATABASE_NAME database.

These queries follow a similar pattern:
INSERT INTO m (k, v) VALUES (…) ON DUPLICATE KEY UPDATE v = VALUES(v).

@NathanHbt
Author

[Screenshot 2025-01-08 17:07:49]

For your information, at 12 PM, I switched from a MySQL instance with 2 CPUs to an instance with 4 CPUs. The overall CPU usage remains stuck at 90% regardless of the number of CPUs in the instance.

I suspect an issue related to the queue, as mentioned in my previous messages. In any case, there seems to be a problem since version 0.11.1, and the looping logs are related to the queue.

Sorry to press the point, but this is causing a massive slowdown for all my users, whereas on version 0.10.7 everything was perfect. I am currently stuck and relying on you to resolve this issue.

@NathanHbt
Author

See #1069 for the solution.

@mdecimus
Member

mdecimus commented Jan 8, 2025

Try to obtain the top queries that are causing most of the load. That will tell us what is going on.
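On MySQL, one way to pull the top statement patterns (assuming performance_schema is enabled, which is the default on 5.7+ and 8.0) is the statement digest summary:

```sql
-- Top 10 normalized statements by total time spent, with execution counts.
-- SUM_TIMER_WAIT is reported in picoseconds, hence the division.
SELECT DIGEST_TEXT,
       COUNT_STAR                      AS executions,
       ROUND(SUM_TIMER_WAIT / 1e12, 1) AS total_seconds
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
```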

@mdecimus
Member

mdecimus commented Jan 8, 2025

There are multiple issues open; I'm travelling at the moment and it's hard to follow them from my phone, so let's stick to this one thread, please. Here is a reply to each of the problems you reported:

  • Concurrent limit exceeded: This is not an error message; the server is just telling you that a message is due for immediate delivery but cannot be processed at the moment because the node has exceeded the maximum number of concurrent outbound connections. This is a new feature that protects the server from crashing by trying to deliver more messages than it can handle. However, the default log level should be changed to debug, as it can generate a massive number of events, as you have seen.
  • Queue event locked: Also not an error; it means that another server is processing that particular message.
  • Spam filter causing database load: It's probably the Bayes filter storing the weights and reputation data. I recommend using Redis or similar as the in-memory store. Even if you don't use the spam filter, you should be using an in-memory store rather than SQL. The updated documentation explains what the in-memory store is used for (see the query sketch below for quantifying this load).
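If performance_schema is enabled, a rough way to measure how much of the write load comes from these key-value upserts (assuming the Bayes/reputation data lands in the `m` table, as the INSERT pattern earlier in this thread suggests) is:

```sql
-- Frequency and total cost of the upsert pattern against the `m` table.
-- Digest formatting varies slightly between MySQL versions; adjust the
-- LIKE pattern if it does not match on your server.
SELECT DIGEST_TEXT,
       COUNT_STAR                      AS executions,
       ROUND(SUM_TIMER_WAIT / 1e12, 1) AS total_seconds
FROM performance_schema.events_statements_summary_by_digest
WHERE DIGEST_TEXT LIKE 'INSERT INTO `m` %'
ORDER BY COUNT_STAR DESC;
```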

@NathanHbt
Author

NathanHbt commented Jan 8, 2025

No problem, I’ll stay here.

How do you explain that when I disable all the options in the spam filter, I no longer get any warning messages about the concurrent connection limit? I also see a significant drop in the number of MySQL connections, and the entire infrastructure becomes smooth again.

I also no longer get any queue locked messages.

Disabling the spam filter features resolves all these messages, whether they are informational or errors. Their excessive presence was not normal.

I’m bringing all this up and insisting a bit to help Stalwart and assist you in identifying an issue, not to bother you. I dealt with the problem all day and I’m glad I resolved it by disabling the spam filter features. However, if I can help you figure out what is causing all the issues encountered (hence the multiple bugs reported, as initially I didn’t think they were related), it would be my pleasure.

[Screenshot 2025-01-08 18:08:18]

@mdecimus
Member

mdecimus commented Jan 8, 2025

The spam filter does not trigger any outbound messages and does not interact with the queue in any way, so it is strange that the concurrency errors disappear when you disable it. The filter is executed at the SMTP DATA stage, before a message is accepted into the system.
What the spam filter does generate is a lot of queries to run the Bayes classifier and the reputation modules, which is why I was suggesting Redis (which you should use in any case).

Perhaps in 0.10 you had the spam filter disabled and were not used to seeing all those extra queries?

@NathanHbt
Author

I’m completely lost. The issue has reappeared after an hour without problems.
Where could this message come from:
“Concurrency limit exceeded (queue.concurrency-limit-exceeded) details = Outbound concurrency limit exceeded, limit = 25”?

I have no outgoing emails, nothing is being sent. How can the limit be reached if nothing is going out?

@mdecimus
Member

mdecimus commented Jan 8, 2025

Check how many records you have in tables 'e' and 'q'; that is the message queue.
You should also be able to see them from the queue management page in the webadmin.
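For reference, a direct check from a MySQL client could look like this (assuming the single-letter table names 'e' and 'q' used by the SQL backend, as mentioned above):

```sql
-- Row counts for the queue event ('e') and queued message ('q') tables.
SELECT 'e' AS tbl, COUNT(*) AS row_count FROM e
UNION ALL
SELECT 'q', COUNT(*) FROM q;
```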

@NathanHbt
Author

Only 52 messages in these tables.
[Screenshot 2025-01-08 19:42:18]
[Screenshot 2025-01-08 19:42:23]

@NathanHbt
Author

Okay, so all the emails in the queue are looping. Please take a look at the following video.

[Screen recording: Enregistrement.de.l.ecran.2025-01-08.a.19.45.40.mov]

@mdecimus
Member

mdecimus commented Jan 8, 2025

And do you see these messages in the webadmin?
Also, are you experiencing high load at this moment?

@NathanHbt
Author

NathanHbt commented Jan 8, 2025

The 52 messages are visible in the webadmin.
Watch the video: you can see that a message in the queue is continually being re-pushed.

I performed the following test:
  • Delete the email from the queue via the interface.
  • Restart the server.
After that, the log no longer loops on this queue ID.

So it seems that all emails in the queue are looping continuously rather than being retried once every 30 minutes, for example.

I experience a heavy load as soon as there is an email in the queue, because it loops and overloads everything.

@mdecimus
Member

mdecimus commented Jan 8, 2025

Please export the queue by running stalwart with --console and I'll look into it soon.

@NathanHbt
Author

NathanHbt commented Jan 8, 2025

What is the command to export the queue, please?
I only have:
Enter commands (type 'help' for available commands).

help
Available commands:
scan <from_key> <to_key>
delete <from_key> [<to_key>]
get
put []
help
exit/quit

In the meantime, I used the CLI and sent it to you via Discord

@NathanHbt
Author

NathanHbt commented Jan 8, 2025

I just sent you a private video on Discord that clearly shows the issue.

After clearing the ~50 messages from the queue and restarting the cluster nodes, the database CPU load returned to normal (80% => 30%). This suggests there is an infinite loop in the process handling queued messages, which could explain the thousands of log entries and the limits being reached.

I am not sharing the video here for confidentiality reasons.

[Screenshot 2025-01-08 20:26:43]

@mdecimus
Member

mdecimus commented Jan 9, 2025

I'll look into this tomorrow evening.

@NathanHbt
Author

NathanHbt commented Jan 9, 2025

Alright.
For your information, in the meantime I am forced to clear the message queue manually.
I tried creating a script to force a “retry” on the messages in the queue, but neither the interface nor the CLI retry has any effect: the email remains in the queue, and no server error message is displayed. The only option is to “cancel” a message, which means losing it.

I sent you a video on Discord to illustrate the retry action that’s not working.

CLI:
queue retry 307D7024CA03859
Successfully rescheduled 1 message(s).

Queue message detail:
+----------+-----------+
| Status   | Scheduled |
| Details  |           |
| Retry #  | 0         |
+----------+-----------+

@NathanHbt
Author

Do you have another method to force a retry?
This heavily impacts my infrastructure, and managing 3 days of issues will be challenging.

@mdecimus
Member

mdecimus commented Jan 9, 2025

Try downgrading to 0.10, or use a single server for outgoing mail. The latter can be achieved by pausing the queue on all servers except one.

@NathanHbt
Author

NathanHbt commented Jan 9, 2025

If I revert to version 0.10.7, I encounter corrupted data errors.

ERROR Data corruption detected (store.data-corruption) causedBy = crates/store/src/write/mod.rs:365, causedBy = crates/store/src/dispatch/store.rs:125, causedBy = crates/jmap/src/services/index.rs:122, details = Failed to iterate over index emails

I redirected port 25 to a single machine instead of the 4 machines, but how can I pause the queue on the other machines?

Currently, the inbound message queues are looping at a rate of 3 to 4 times per second for each queue ID. You can see this in the videos, etc.

@NathanHbt
Author

OK, I clicked 'Pause' on 3 of the 4 servers, and my CPU load is almost back to normal.
Only server #1 continues looping through the queue IDs. That is acceptable while the issue is being fixed, but I'm afraid I might lose quite a few emails.

@NathanHbt
Author

I also notice that many emails remain in the queue without any specific reason displayed.
Take this email, for example; the ‘Server Response’ field remains blank, yet the email is not delivered.

[Screenshot 2025-01-09 12:09:44]

@mdecimus
Member

mdecimus commented Jan 9, 2025

> OK, I clicked 'Pause' on 3 of the 4 servers, and my CPU load is almost back to normal.
>
> Only server #1 continues looping through the queue IDs. That is acceptable while the issue is being fixed, but I'm afraid I might lose quite a few emails.

You won't lose any messages; one server can handle delivery for the entire cluster. The issue seems to be related to the distributed locks, so letting just one server handle the queue should alleviate it. You might also want to clear the in-memory counter table, which contains the locks.

Also, you seem to be the only distributed setup affected by this. This release is currently being used by an Enterprise client running an AlloyDB cluster that handles hundreds of thousands of deliveries a day. The only differences are that they use Redis as the in-memory store and, of course, AlloyDB, but your issue seems to be unrelated to the backend type.
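For readers in the same situation, a cautious way to look at that table before clearing anything (a sketch that assumes the in-memory key-value data lives in the `m` table of the SQL backend, as the INSERT pattern earlier in this thread suggests; the actual counter/lock table may differ in your schema, so verify first and take a backup):

```sql
-- Inspect how many key-value rows (counters, locks, Bayes weights, etc.)
-- have accumulated in the assumed in-memory store table.
SELECT COUNT(*) AS kv_rows FROM m;

-- Clearing it is destructive and should only be done with all nodes
-- stopped and a backup in hand, e.g.:
-- DELETE FROM m;
```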

@NathanHbt
Author

NathanHbt commented Jan 9, 2025

Thank you for your response.

To summarize:
• 3 nodes with the queue stopped
• 1 node with the queue active receiving emails via port 25

The 3 nodes with stopped queues no longer loop through the queued messages.
The node with the active queue continuously loops. Each queue ID appears in the logs multiple times per second, causing delivery delays and a significant increase in CPU usage on the MySQL database.

While waiting for the issue to be resolved, I have:
• Set up a server running version 0.10.7
• Redirected incoming SMTP traffic to this server
• Encountered the following error for each message, but the mail is successfully received, and indexing is properly performed on the other nodes.

ERROR Data corruption detected (store.data-corruption) causedBy = crates/store/src/write/mod.rs:365, causedBy = crates/store/src/dispatch/store.rs:125, causedBy = crates/jmap/src/services/index.rs:122, details = Failed to iterate over index emails

This is therefore a temporary solution, but it generates an error for every received email.

In version 0.10.7, I did not encounter this type of issue. The queue did not loop at all, and messages were delivered on time. MySQL is handling the workload well at the moment, and I am not sure if switching to a different database mid-operation is feasible without data loss, so I cannot replace MySQL.

The videos I sent you clearly show the infinite looping on queued messages and the relationship between this infinite loop and the significant CPU spike on the database server.

I hope the error in version 0.10.7 does not lead to another disaster, but for now, I have no other choice while waiting for your analysis and response.
